ExpressionEngine CMS
Open, Free, Amazing

Thread

This is an archived forum and the content is probably no longer relevant, but is provided here for posterity.

Best Practices

January 08, 2008 8:03pm

  • #1 / Jan 08, 2008 8:03pm

    ebohling

    53 posts

    Since this is not directly EE related I thought the Lounge was the appropriate location for this question…

    I know there is no perfect answer here, but I am looking for guidance (from smarter folks than me) on what a “good” robots.txt file contains. Do you refuse by default? Which bots do you allow? Just looking for opinions here. Using a robots.txt generator, I came up with this:

    —-
    # robots.txt generated at http://www.mcanerin.com
    User-agent: Googlebot
    Disallow:
    User-agent: MSNBot
    Disallow:
    User-agent: Slurp
    Disallow:
    User-agent: Teoma
    Disallow:
    User-agent: ia_archiver
    Disallow:
    User-agent: *
    Disallow: /
    Crawl-delay: 20
    Disallow: /cgi-bin/
    Sitemap: http://ebohling.com/bryce/sitemap
    —-

    good? bad? indifferent?

    thanks,
    —bb—

  • #2 / Jan 18, 2008 3:05pm

    ebohling

    53 posts

    No one willing to share their thoughts?

  • #3 / Jan 18, 2008 8:43pm

    Kellankade

    54 posts

    OK, I am not an expert, but other than stopping the 404 errors that bots cause when they check for the file, there really is no reason for that robots.txt file; it doesn’t do anything.

  • #4 / Jan 18, 2008 11:35pm

    Bruce2005

    536 posts

    Google - Webmaster Help Center

    You need a robots.txt file only if your site includes content that you don’t want search engines to index. If you want search engines to index everything in your site, you don’t need a robots.txt file (not even an empty one).

  • #5 / Jan 18, 2008 11:50pm

    mayest

    293 posts

    It looks to me like it does a couple of things:

    1) Blocks all good bots, other than those listed, from the site. Bad bots don’t obey robots.txt anyway.
    2) Slows down bots crawling the site.
    3) Keeps some bots out of the cgi-bin directory. I’m not sure if it keeps all bots out of this directory, or just those not listed. I think it’s just any bot not listed, but they should probably all be blocked from crawling that directory.

    Also, you are blocking the AdSense bot (Mediapartners-Google), so if you ever put AdSense on the site, you will have to remember to unblock it.
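
One way to check the three points above is to feed the file from post #1 to a robots.txt parser and ask it directly. The sketch below uses Python's standard `urllib.robotparser` (`crawl_delay` needs Python 3.6 or later). It only shows how this one parser reads the file; real crawlers may interpret edge cases differently. `UnlistedBot` is a made-up name standing in for any crawler the file does not mention.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from post #1, copied verbatim.
ROBOTS_TXT = """\
# robots.txt generated at http://www.mcanerin.com
User-agent: Googlebot
Disallow:
User-agent: MSNBot
Disallow:
User-agent: Slurp
Disallow:
User-agent: Teoma
Disallow:
User-agent: ia_archiver
Disallow:
User-agent: *
Disallow: /
Crawl-delay: 20
Disallow: /cgi-bin/
Sitemap: http://ebohling.com/bryce/sitemap
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def report(agent, path):
    """Return a one-line summary of this parser's decision for agent and path."""
    verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
    delay = parser.crawl_delay(agent)
    return f"{agent:22} {path:16} {verdict:8} crawl-delay={delay}"


# "UnlistedBot" is hypothetical: any crawler not named in the file.
for agent in ("Googlebot", "Mediapartners-Google", "UnlistedBot"):
    for path in ("/", "/cgi-bin/test"):
        print(report(agent, path))
```

With this parser, the `Crawl-delay` and `/cgi-bin/` lines belong only to the `User-agent: *` group, so the named bots are allowed everywhere (including `/cgi-bin/`) with no delay, while unlisted bots are already blocked from the whole site by `Disallow: /`. Mediapartners-Google is not named, so it falls into the blocked group.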

  • #6 / Feb 09, 2008 5:11am

    Ronin_23

    58 posts

    There are many more bad robots. My robots.txt is so long that Webmaster Central can’t analyze the file because it’s more than 5000 characters, but Google downloads the file successfully. You can find such lists on search engine optimization sites. But I feel certain that the really bad bots ignore robots.txt.

    To protect privacy, you should keep bots out of the folders that do not need to be published.

    User-agent: *
    Disallow: /images/avatars
    Disallow: /images/captchas
    Disallow: /images/forum_attachments
    Disallow: /images/gallery_batchfolder
    Disallow: /images/member_photos
    Disallow: /images/pm_attachments
    Disallow: /images/signature_attachments
    Disallow: /images/smileys
    Disallow: /images/world_flags
    Disallow: /system  # choose your own path
    Disallow: /themes

    And don’t forget to protect the privacy of your forum members:
    Disallow: /index.php/forum/member/  # choose your own path
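
Before publishing rules like the Disallow lines above, it can help to test them against a few sample paths. The sketch below does that with Python's standard `urllib.robotparser`, using a subset of the rules. The sample paths and the `AnyBot` name are invented for illustration. Note that this parser matches by plain prefix, so a rule without a trailing slash also catches sibling paths that merely start with the same text.

```python
from urllib.robotparser import RobotFileParser

# A subset of the rules from the post above.
RULES = """\
User-agent: *
Disallow: /images/avatars
Disallow: /themes
Disallow: /index.php/forum/member/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Hypothetical paths, chosen only to exercise the rules.
SAMPLE_PATHS = [
    "/images/avatars/me.png",         # inside a disallowed folder
    "/images/avatars_old/me.png",     # sibling folder sharing the prefix
    "/images/logo.png",               # not covered by any rule
    "/themes/site.css",               # inside a disallowed folder
    "/index.php/forum/member/42",     # member profile page
    "/index.php/forum/viewthread/1",  # ordinary forum page
]


def is_blocked(path):
    """True if a well-behaved bot (here the made-up 'AnyBot') must skip path."""
    return not parser.can_fetch("AnyBot", path)


for path in SAMPLE_PATHS:
    print(f"{path:32} {'blocked' if is_blocked(path) else 'allowed'}")
```

Here `/images/avatars_old/me.png` comes out blocked as well, because `/images/avatars` is a prefix of it; adding a trailing slash to the rule would narrow it to the folder itself.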

  • #7 / Feb 09, 2008 4:22pm

    mayest

    293 posts

    Ronin_23,

    You are correct, but the bad bots aren’t going to pay attention to your robots.txt file. Much better to block them with .htaccess, which will keep them out of your site altogether.
