Hi,
I have a client running Magento, so in his robots.txt I've blocked things like /wishlist/, /product_compare/, etc.
When I run a spider (Screaming Frog, http://www.screamingfrog.co.uk/seo-spider/) it still crawls the disallowed URLs.
I know it's not the spider config, and I also know that the spider sees the robots.txt.
What else could be causing it to spider disallowed URLs?
Here is the content of robots.txt:
# all spiders
User-agent: *
# sitemap
Sitemap: http://myclient.co.uk/sitemap.xml
# no dev files
Disallow: /CVS
Disallow: /*. Svn $
Disallow: /*. Idea $
Disallow: /*. Sql $
Disallow: /*. Tgz $
# no admin
Disallow: /admin/
# no tech directory
Disallow: /app/
Disallow: /downloader/
Disallow: /errors/
Disallow: /includes/
Disallow: /lib/
Disallow: /pkginfo/
Disallow: /shell/
Disallow: /var/
# no shared files
Disallow: /api.php
Disallow: /cron.php
Disallow: /cron.sh
Disallow: /error_log
Disallow: /get.php
Disallow: /install.php
Disallow: /LICENSE.html
Disallow: /LICENSE.txt
Disallow: /LICENSE_AFL.txt
Disallow: /README.txt
Disallow: /RELEASE_NOTES.txt
# no filtered pages
Disallow: /*? Dir *
Disallow: /*? Dir = desc
Disallow: /*? Dir = asc
Disallow: /*? Limit = all
Disallow: /*? Mode *
# seo urls?
# Disallow: /index.php/
# no session ids
Disallow: /*? SID =
# no checkout or account
Disallow: /checkout/
Disallow: /onestepcheckout/
Disallow: /customer/
Disallow: /customer/account/
Disallow: /customer/account/login/
# no catalog crap
Disallow: /catalogsearch/
Disallow: /catalog/product_compare/
Disallow: /catalog/category/view/
Disallow: /catalog/product/view/
# no wishlist or sendfriend
Disallow: /wishlist/
Disallow: /sendfriend/
# no tech
Disallow: /cgi-bin/
Disallow: /cleanup.php
Disallow: /apc.php
Disallow: /memcache.php
Disallow: /phpinfo.php
I rely on Google for checking them, or Aaron's:
http://tools.seobook.com/robots-txt/analyzer/
I don't know why it would be crawling within those directories, but there are a couple of things:
Quote: Disallow: /*. Svn $
You have spaces there (possibly a cut and paste artifact?).
Also, URLs are case-sensitive on *nix anyway. Do your Svn files really have an upper-case S and lower-case vn?
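To see why the spaces and the capital letter matter, here is a minimal sketch of Google-style wildcard matching, assuming `*` means "any run of characters" and a trailing `$` means "end of URL". The helper names are made up for illustration, and this is not how Screaming Frog or any particular crawler is actually implemented.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a wildcard robots.txt path pattern into a compiled regex.

    Assumption: '*' matches any run of characters and a trailing '$'
    anchors the end of the URL. Everything else, including spaces and
    letter case, is treated literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(pattern, path):
    # Rules are matched from the start of the URL path.
    return robots_pattern_to_regex(pattern).match(path) is not None


print(is_blocked("/*.svn$", "/project/.svn"))         # True
print(is_blocked("/*. Svn $", "/project/.svn"))       # False: spaces and capital S
print(is_blocked("/*?dir=asc", "/shoes?dir=asc"))     # True
print(is_blocked("/*? Dir = asc", "/shoes?dir=asc"))  # False
```

Under those assumptions, the rules as pasted would never match a real URL, while the versions without spaces and in lower case would.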
Quote: Disallow: /admin/
Trailing slash. If the UA drops the trailing slash and requests http://example.com/admin, the rule doesn't match, so the request succeeds and shows up as crawled. It should, of course, still be blocked from crawling files inside that directory.
But other than that, the actual URLs in /wishlist/someURL should be blocked.
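The trailing-slash point is easy to check with Python's standard-library robots.txt parser. This is only a sketch: `urllib.robotparser` does plain prefix matching (no wildcards), so it says nothing about how Screaming Frog behaves, and the two rules and the example.com URLs are just stand-ins.

```python
from urllib.robotparser import RobotFileParser

# Two of the rules from the file above, under the global group.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /wishlist/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def allowed(url):
    # "ExampleBot" is a hypothetical crawler name; it falls under "*".
    return parser.can_fetch("ExampleBot", url)


print(allowed("http://example.com/admin"))             # True: no trailing slash, rule does not match
print(allowed("http://example.com/admin/index.php"))   # False
print(allowed("http://example.com/wishlist/someURL"))  # False
```

So a request for the bare /admin can legitimately show up in a crawl, but anything under /wishlist/ should not.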
Is it just Screaming Frog or other "good" crawlers too?
Thanks for replies.
Yes, it's a cut/paste job; I don't know much about Magento.
I was wondering about the spaces in the URL.
When I run the file through analysers and check the /wishlist/ and /sendfriend/ directories, they tell me they are disallowed for all robots, yet SF scans them.
Time to contact SF support, I think.
Thanks again,
Gary.
So just to be clear, it is crawling files and subdirectories within the excluded directories? Because I would expect it to request the directory itself if the trailing slash is getting dropped.
And then there's this:
Quote: You can choose to ignore the robots.txt (it won't even download it) in the premium version of the software by selecting the option. Configuration -> Spider -> Ignore robots.txt.
A couple of things to remember here –
We only follow one set of user agent directives as per robots.txt protocol. Hence, priority is the Screaming Frog SEO Spider UA if you have any. If not, we will follow commands for the Googlebot UA, or lastly the 'ALL' or global directives.
To reiterate the above, if you specify directives for the Screaming Frog SEO Spider, or Googlebot then the ALL (or 'global') bot commands will be ignored. If you want the global directives to be obeyed, then you will have to include those lines under the specific UA section for the spider or Googlebot.
If you have conflicting directives (ie an allow and disallow to the same filepath) then a matching allow directive beats a matching disallow if it contains equal or more characters in the command.
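The three points above can be sketched roughly like this. It is only an illustration of the quoted description, not Screaming Frog's code: the function names, the lower-cased agent names, and the sample Googlebot group are all made up, and matching is plain prefix matching with no wildcards.

```python
def pick_group(groups, preferred):
    """Return the rules of the first user-agent group found, in priority order.

    Only one group is followed; the others are ignored entirely.
    """
    for agent in preferred:
        if agent in groups:
            return groups[agent]
    return []


def is_allowed(rules, path):
    """A matching allow beats a matching disallow if it is at least as long."""
    allow = max((len(p) for kind, p in rules
                 if kind == "allow" and path.startswith(p)), default=-1)
    disallow = max((len(p) for kind, p in rules
                    if kind == "disallow" and path.startswith(p)), default=-1)
    return disallow == -1 or allow >= disallow


# Priority order as described in the quote: own UA, then Googlebot, then global.
preferred = ["screaming frog seo spider", "googlebot", "*"]

global_only = {"*": [("disallow", "/wishlist/")]}
with_googlebot = {
    "*": [("disallow", "/wishlist/")],
    "googlebot": [("disallow", "/checkout/")],  # hypothetical extra group
}

print(is_allowed(pick_group(global_only, preferred), "/wishlist/x"))     # False
print(is_allowed(pick_group(with_googlebot, preferred), "/wishlist/x"))  # True: global group ignored

conflict = [("disallow", "/catalog/"), ("allow", "/catalog/")]
print(is_allowed(conflict, "/catalog/item"))                             # True: equal length, allow wins
```

The robots.txt posted at the top of the thread only has a `User-agent: *` group, so under this reading the fallback would end up at the global rules.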
Thanks ergo, I'm hoping that 'global' and 'all' equate to '*'; will check it out.
All the best,
Gary.