Does WordPress scrape sites?
Posted: 01 December 2007 08:25 PM

I noticed in my httpd access logs a number of lines like this:
212.24.48.15 - - [01/Dec/2007:20:19:07 -0500] "POST /econtent/index.php/trackback/91/ HTTP/1.1" 200 132 "-" "WordPress 1.9"
212.24.48.34 - - [01/Dec/2007:20:19:05 -0500] "POST /econtent/index.php/trackback/91/ HTTP/1.1" 200 132 "-" "WordPress/2.0"

Is this WordPress scraping my site for content?

Have there been previous discussions here about trying to keep scraping bots off a site?
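
For anyone who wants to see which user agents are behind requests like these, here is a minimal Python sketch that tallies POST requests per user agent. It assumes the standard Apache combined log format with plain straight quotes. The names `LOG_PATTERN`, `count_post_agents`, and `SAMPLE_LINES` are made up for this illustration, and the sample lines are the two entries quoted above.

```python
import re
from collections import Counter

# Apache "combined" log format: ip, ident, user, [time], "request",
# status, size, "referer", "user agent".
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def count_post_agents(lines):
    """Count POST requests per user agent; lines that do not parse are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match and match.group("request").startswith("POST "):
            counts[match.group("agent")] += 1
    return counts


# The two log entries from the post above.
SAMPLE_LINES = [
    '212.24.48.15 - - [01/Dec/2007:20:19:07 -0500] '
    '"POST /econtent/index.php/trackback/91/ HTTP/1.1" 200 132 "-" "WordPress 1.9"',
    '212.24.48.34 - - [01/Dec/2007:20:19:05 -0500] '
    '"POST /econtent/index.php/trackback/91/ HTTP/1.1" 200 132 "-" "WordPress/2.0"',
]

if __name__ == "__main__":
    for agent, hits in count_post_agents(SAMPLE_LINES).most_common():
        print(hits, agent)
```

To run it against a real log, pass an open file object instead of `SAMPLE_LINES`; the log path depends on the server setup.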

 Signature 

Craig Issod, Publisher
Hearth.com - Answers to all your Burning Questions
http://www.hearth.com

Posted: 02 December 2007 11:26 AM   [ # 1 ]

That looks more like a WordPress blog trying to send a trackback to your site.

 Signature 


♖  ♘  ♗  ♔  ♕  ♗  ♘  ♖
♙  ♙  ♙  ♙  ♙  ♙  ♙  ♙
☐  ☐  ☐  ☐  ☐  ☐  ☐  ☐
☐  ☐  ☐  ☐  ☐  ☐  ☐  ☐
☐  ☐  ☐  ☐  ☐  ☐  ☐  ☐
☐  ☐  ☐  ☐  ☐  ☐  ☐  ☐
♟  ♟  ♟  ♟  ♟  ♟  ♟  ♟
♜  ♞  ♝  ♚  ♛  ♝  ♞  ♜

Posted: 02 December 2007 11:28 AM   [ # 2 ]

Update: as far as I can tell these are scrapers, and they use up a lot of bandwidth besides stealing your content. I'll bet they are programmed to look for weblogs and certain CMSes.

I put some code into the .htaccess file of the main EE directory, and noticed my "guest" count dropping by about 30% or more. Here is some basic "stop the bad bots and this WordPress" code. You can add more bad guys to it if your logs show any.

# Flag known harvesters and the two WordPress user agents seen in the access log
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "^EmailWolf" bad_bot
SetEnvIfNoCase User-Agent "^ExtractorPro" bad_bot
SetEnvIfNoCase User-Agent "^CherryPicker" bad_bot
SetEnvIfNoCase User-Agent "^NICErsPRO" bad_bot
SetEnvIfNoCase User-Agent "^Teleport" bad_bot
SetEnvIfNoCase User-Agent "^EmailCollector" bad_bot
SetEnvIfNoCase User-Agent "^LinkWalker" bad_bot
SetEnvIfNoCase User-Agent "^Zeus" bad_bot
SetEnvIfNoCase User-Agent "^WordPress/2\.0" bad_bot
# The log shows "WordPress 1.9" with a space, so the pattern needs one too
SetEnvIfNoCase User-Agent "^WordPress 1\.9" bad_bot

# Refuse GET and POST requests from anything flagged above
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>
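
Before relying on a block list like this, it can help to check which user-agent strings the patterns would actually catch. Below is a rough Python sketch of the same idea: case-insensitive patterns anchored at the start of the User-Agent header. Python's `re` module only approximates Apache's regex handling, the pattern list is a shortened subset, and `BAD_BOT_PATTERNS` and `is_bad_bot` are names invented for this example. The two WordPress patterns are written to match the user agents exactly as they appear in the access log.

```python
import re

# A subset of the block list, plus the two WordPress agents from the log.
# Dots are escaped so they match a literal "." only.
BAD_BOT_PATTERNS = [
    r"^EmailSiphon",
    r"^Teleport",
    r"^WordPress/2\.0",
    r"^WordPress 1\.9",
]

# re.IGNORECASE mirrors the "NoCase" part of SetEnvIfNoCase.
COMPILED_PATTERNS = [re.compile(p, re.IGNORECASE) for p in BAD_BOT_PATTERNS]


def is_bad_bot(user_agent):
    """Return True if any pattern matches the start of the user agent."""
    return any(p.search(user_agent) for p in COMPILED_PATTERNS)


if __name__ == "__main__":
    for agent in ["WordPress 1.9", "WordPress/2.0", "WordPress1.9", "Mozilla/5.0"]:
        print(agent, "->", "blocked" if is_bad_bot(agent) else "allowed")
```

Note that a pattern without the space, such as `^WordPress1.9`, would not match the `WordPress 1.9` agent shown in the log.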

Posted: 02 December 2007 11:34 AM   [ # 3 ]
Paul Burdick - 02 December 2007 11:26 AM

That looks more like a WordPress blog trying to send a trackback to your site.

Might be, but the requests came from a number of different IPs, none of which came up as web sites. They were coming in every 2 seconds or so! And most of the URLs they were requesting did not exist... maybe they simply follow the math?

And scraping is becoming big business. WordPress has plugins which can admittedly be used (and ARE being used) to scrape.

BTW, they did not respect robots.txt.

Posted: 02 December 2007 11:38 AM   [ # 4 ]

I am not sure how those Trackback URLs that you have in your first post could involve scraping. If no data is received for a trackback, a simple XML error message is shown, not your site's content. Go here to see that. And when data is received, one simply gets a confirmation message in XML.

Still, if they are hitting your site that hard, it is probably not a bad idea to block them in .htaccess, which you can also do with the Blacklist module.
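
To make the trackback exchange concrete, here is a small Python sketch of the two halves as described in the TrackBack specification: a form-encoded POST body sent to the trackback URL, and a short XML reply with an `<error>` code of 0 for success or 1 for failure. The function names and the sample values are invented for illustration, nothing here touches the network, and the exact wording of the error message ExpressionEngine returns is an assumption.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_ping_body(url, title="", excerpt="", blog_name=""):
    """Build the form-encoded body of a trackback ping; only url is required."""
    fields = {"url": url, "title": title, "excerpt": excerpt, "blog_name": blog_name}
    # Leave out empty optional fields.
    return urlencode({key: value for key, value in fields.items() if value})


def parse_ping_response(xml_text):
    """Return (ok, message) from a trackback XML reply."""
    root = ET.fromstring(xml_text)
    # A missing <error> element is treated as a failure.
    error = (root.findtext("error") or "1").strip()
    message = (root.findtext("message") or "").strip()
    return error == "0", message


# Hypothetical replies shaped like the ones in the TrackBack specification.
SUCCESS_REPLY = "<response><error>0</error></response>"
FAILURE_REPLY = (
    "<response><error>1</error>"
    "<message>No trackback data received</message></response>"
)

if __name__ == "__main__":
    print(build_ping_body("http://example.com/post", title="Hi there"))
    print(parse_ping_response(SUCCESS_REPLY))
    print(parse_ping_response(FAILURE_REPLY))
```

Either way the reply is only a few lines of XML, so the pinging side never receives the page content itself.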
