This is an archived forum and the content is probably no longer relevant, but is provided here for posterity.

The active forums are here.

Does WordPress scrape sites?

December 01, 2007 9:25pm

  • #1 / Dec 01, 2007 9:25pm

    handyman

    509 posts

    I noticed in my httpd access logs a number of lines like this:
    212.24.48.15 - - [01/Dec/2007:20:19:07 -0500] "POST /econtent/index.php/trackback/91/ HTTP/1.1" 200 132 "-" "WordPress 1.9"
    212.24.48.34 - - [01/Dec/2007:20:19:05 -0500] "POST /econtent/index.php/trackback/91/ HTTP/1.1" 200 132 "-" "WordPress/2.0"

    Is this WordPress scraping my site for content?

    Have there been previous discussions here about trying to keep scraping bots off a site?

  • #2 / Dec 02, 2007 12:26pm

    Paul Burdick

    480 posts

    That looks more like a Wordpress blog trying to send a trackback to your site.

  • #3 / Dec 02, 2007 12:28pm

    handyman

    509 posts

    Update: as far as I can tell these are scrapers, and they use up a lot of bandwidth besides stealing your content. I’ll bet they are programmed to look for weblogs and certain CMSes.

    I put some code into the .htaccess file of the main EE directory, and noticed my “guest” count dropping by 30% or more. Here is some basic “stop the bad bots and this WordPress” code. You can add more bad guys to it if your logs show any.

    # Flag requests whose User-Agent starts with a known bad-bot name
    SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
    SetEnvIfNoCase User-Agent "^EmailWolf" bad_bot
    SetEnvIfNoCase User-Agent "^ExtractorPro" bad_bot
    SetEnvIfNoCase User-Agent "^CherryPicker" bad_bot
    SetEnvIfNoCase User-Agent "^NICErsPRO" bad_bot
    SetEnvIfNoCase User-Agent "^Teleport" bad_bot
    SetEnvIfNoCase User-Agent "^EmailCollector" bad_bot
    SetEnvIfNoCase User-Agent "^LinkWalker" bad_bot
    SetEnvIfNoCase User-Agent "^Zeus" bad_bot
    # The two agents from the logs; the dot is escaped so it matches literally,
    # and "WordPress 1.9" keeps the space that appears in the log line
    SetEnvIfNoCase User-Agent "^WordPress/2\.0" bad_bot
    SetEnvIfNoCase User-Agent "^WordPress 1\.9" bad_bot

    # Allow everyone except requests flagged as bad_bot
    <Limit GET POST>
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
    </Limit>
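
    To find candidates for that list, one option is to tally which user agents are POSTing to trackback URLs. The following is only a rough sketch in Python, assuming the access log is in Apache's combined format like the two lines in post #1; the function name and the regular expression are illustrative, not something EE or Apache provides.

```python
import re
from collections import Counter

# Assumed layout (Apache combined log format):
# ip ident user [time] "METHOD path protocol" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def count_trackback_agents(lines):
    """Count user agents that sent a POST to any /trackback/ URL."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # not in the assumed format, skip it
        if match["method"] == "POST" and "/trackback/" in match["path"]:
            counts[match["agent"]] += 1
    return counts


# The two log lines from post #1, used here as sample input.
sample = [
    '212.24.48.15 - - [01/Dec/2007:20:19:07 -0500] '
    '"POST /econtent/index.php/trackback/91/ HTTP/1.1" 200 132 "-" "WordPress 1.9"',
    '212.24.48.34 - - [01/Dec/2007:20:19:05 -0500] '
    '"POST /econtent/index.php/trackback/91/ HTTP/1.1" 200 132 "-" "WordPress/2.0"',
]

for agent, hits in count_trackback_agents(sample).most_common():
    print(hits, agent)
```

    Agents with a high count that are not legitimate blogs could then get their own SetEnvIfNoCase line.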
  • #4 / Dec 02, 2007 12:34pm

    handyman

    509 posts

    That looks more like a Wordpress blog trying to send a trackback to your site.

    Might be, but the requests came from a number of different IPs, none of which resolved to web sites. They were coming in every 2 seconds or so! And most of the URLs they requested did not exist... maybe they simply follow the math?

    And scraping is becoming big business: WordPress has plugins which can admittedly be used (and ARE being used) to scrape.

    BTW, they did not obey robots.txt.

  • #5 / Dec 02, 2007 12:38pm

    Paul Burdick

    480 posts

    I am not sure how those Trackback URLs that you have in your first post could involve scraping. If no data is received for a trackback, a simple XML error message is shown, not your site’s content. Go here to see that. And when data is received, one simply gets a confirmation message in XML.

    Still, if they are hitting your site so hard, it probably is not a bad idea to block them with .htaccess, which you can also do with the Blacklist module.
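
    To illustrate why a trackback POST returns nothing worth scraping, here is a small Python sketch of how a client might read the reply. It assumes the endpoint follows the standard TrackBack response format (an error element of 0 for success, or 1 plus a message element for failure); the sample bodies and the message text are made up for the example, not copied from EE.

```python
import xml.etree.ElementTree as ET

# Hypothetical responses in the standard TrackBack format (not captured from EE).
OK_BODY = b'<?xml version="1.0" encoding="utf-8"?><response><error>0</error></response>'
ERROR_BODY = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<response><error>1</error><message>No data received</message></response>'
)


def parse_trackback_response(body):
    """Return (accepted, message) for a TrackBack-style XML response."""
    root = ET.fromstring(body)
    # A missing <error> element is treated as a failure.
    error = root.findtext("error", default="1").strip()
    message = (root.findtext("message") or "").strip()
    return error == "0", message


print(parse_trackback_response(OK_BODY))
print(parse_trackback_response(ERROR_BODY))
```

    Either way the reply is a few lines of XML, so the response body carries none of the site's content.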
