Screaming Frog spidering URLs that are disallowed...

Started by gm66, October 20, 2015, 10:26:50 AM


gm66

Hi,

I have a client running Magento, so in his robots.txt I've blocked things like /wishlist/, /product_compare/, etc.

When I run a spider (Screaming Frog, http://www.screamingfrog.co.uk/seo-spider/), it still spiders the disallowed URLs.

I know it's not the spider config, and I also know the robots.txt is seen by the spider.

What else could be causing it to spider disallowed URLs?

Here is the content of robots.txt:

# all spiders

User-agent: *

# sitemap

Sitemap: http://myclient.co.uk/sitemap.xml

# no dev files

Disallow: /CVS
Disallow: /*. Svn $
Disallow: /*. Idea $
Disallow: /*. Sql $
Disallow: /*. Tgz $

# no admin

Disallow: /admin/

# no tech directory

Disallow: /app/
Disallow: /downloader/
Disallow: /errors/
Disallow: /includes/
Disallow: /lib/
Disallow: /pkginfo/
Disallow: /shell/
Disallow: /var/

# no shared files

Disallow: /api.php
Disallow: /cron.php
Disallow: /cron.sh
Disallow: /error_log
Disallow: /get.php
Disallow: /install.php
Disallow: /LICENSE.html
Disallow: /LICENSE.txt
Disallow: /LICENSE_AFL.txt
Disallow: /README.txt
Disallow: /RELEASE_NOTES.txt

# no filtered pages

Disallow: /*? Dir *
Disallow: /*? Dir = desc
Disallow: /*? Dir = asc
Disallow: /*? Limit = all
Disallow: /*? Mode *

# seo urls?
# Disallow: /index.php/

# no session ids

Disallow: /*? SID =

# no checkout or account

Disallow: /checkout/
Disallow: /onestepcheckout/
Disallow: /customer/
Disallow: /customer/account/
Disallow: /customer/account/login/

# no catalog crap

Disallow: /catalogsearch/
Disallow: /catalog/product_compare/
Disallow: /catalog/category/view/
Disallow: /catalog/product/view/

# no wishlist or sendfriend

Disallow: /wishlist/
Disallow: /sendfriend/

# no tech

Disallow: /cgi-bin/
Disallow: /cleanup.php
Disallow: /apc.php
Disallow: /memcache.php
Disallow: /phpinfo.php
Civilisation is a race between disaster and education ...

Rupert

... Make sure you live before you die.

ergophobe

I don't know why it would be crawling within those directories, but there are a couple of things:

Quote: Disallow: /*. Svn $

You have spaces there (possibly a cut-and-paste artifact?).

Also, URLs are case-sensitive on *nix anyway. Do your .svn files really have an uppercase S followed by lowercase vn?
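To see concretely what those spaces and capitals do, here is a small sketch that converts a robots.txt pattern ('*' wildcard, '$' end-anchor, per the commonly documented syntax) into a regex. The file paths are invented for illustration:

```python
import re

def pattern_to_regex(pattern):
    """Rough translation of a robots.txt path pattern to a regex:
    '*' matches any run of characters, a trailing '$' anchors at the
    end of the URL; everything else is literal (including spaces)."""
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.compile("^" + regex)

broken = pattern_to_regex("/*. Svn $")  # the line as pasted
fixed = pattern_to_regex("/*.svn$")     # the likely intent

# The pasted pattern requires a literal ". Svn " at the end of the URL,
# so it never matches a real .svn path; the cleaned-up one does.
print(bool(broken.match("/lib/.svn/entries")))  # False
print(bool(fixed.match("/lib/.svn")))           # True
```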

Quote: Disallow: /admin/

Trailing slash. If the UA drops the trailing slash and tries to crawl http://example.com/admin it would be successful and show up as crawling it. It should, of course, be blocked from crawling files in that directory.
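That trailing-slash behaviour is easy to check against Python's stdlib robots.txt parser (a quick sketch; example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# "Disallow: /admin/" blocks everything under the directory, but a
# prefix match against "/admin/" does not cover the slash-less /admin.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
])

print(rp.can_fetch("*", "http://example.com/admin/login"))  # False
print(rp.can_fetch("*", "http://example.com/admin"))        # True
```

So a crawler that strips the trailing slash can legitimately request /admin itself while still being barred from anything inside it.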

But other than that, the actual URLs in /wishlist/someURL should be blocked.

Is it just Screaming Frog or other "good" crawlers too?

gm66

Thanks for replies.

Yes, it's a cut/paste job; I don't know much about Magento.

I was wondering about the spaces in the URLs ...

When I run the file through analysers and check the /wishlist/ and /sendfriend/ directories, they tell me they are disallowed for all robots, yet SF scans them ...
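For what it's worth, Python's stdlib parser agrees with those analysers. A quick sketch feeding it just the relevant lines (the test URLs are invented; myclient.co.uk stands in for the real site):

```python
from urllib.robotparser import RobotFileParser

# Confirm that the global Disallow rules block these two directories.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /wishlist/",
    "Disallow: /sendfriend/",
])

for url in ("http://myclient.co.uk/wishlist/index/add/product/1",
            "http://myclient.co.uk/sendfriend/product/send/"):
    # both print "... -> disallowed"
    print(url, "->", "allowed" if rp.can_fetch("*", url) else "disallowed")
```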

Time to contact SF support, I think.


Thanks again,

Gary.

ergophobe

So just to be clear, it is crawling files and subdirectories within the excluded directories? Because I would expect it to request the directory itself if the trailing slash is getting dropped.

And then there's this

Quote: You can choose to ignore the robots.txt (it won't even download it) in the premium version of the software by selecting the option Configuration -> Spider -> Ignore robots.txt.

A couple of things to remember here:

- We only follow one set of user-agent directives, as per the robots.txt protocol. Priority goes to a Screaming Frog SEO Spider UA section if you have one; if not, we follow the commands for the Googlebot UA, or lastly the 'ALL' or global directives.
- To reiterate the above: if you specify directives for the Screaming Frog SEO Spider or Googlebot, then the ALL (or 'global') commands will be ignored. If you want the global directives to be obeyed, you will have to include those lines under the specific UA section for the spider or Googlebot.
- If you have conflicting directives (i.e. an Allow and a Disallow for the same filepath), then a matching Allow directive beats a matching Disallow if it contains an equal or greater number of characters in the command.
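That last tie-break rule can be sketched in a few lines. This is a deliberately simplified model: plain prefix rules only, no wildcards, and the paths are invented:

```python
def decide(path, allows, disallows):
    """Longest-match tie-break: when an Allow and a Disallow both
    match a path, the longer (more specific) pattern wins, and Allow
    wins on an equal-length tie. No matching rule means allowed."""
    best_allow = max((len(p) for p in allows if path.startswith(p)), default=-1)
    best_disallow = max((len(p) for p in disallows if path.startswith(p)), default=-1)
    return best_allow >= best_disallow

# /wishlist/ is disallowed, but a longer Allow re-opens one page.
print(decide("/wishlist/shared", ["/wishlist/shared"], ["/wishlist/"]))  # True
print(decide("/wishlist/mine", [], ["/wishlist/"]))                      # False
```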

gm66

Thanks ergo, I'm hoping that 'global' and 'all' equate to '*'; will check it out.


All the best,

Gary.