web crawler

#1 / Feb 08, 2010 2:12pm

jcavard
130 posts

Hi!

I developed a web crawler with CI. It does the job, but my host keeps blocking my IP for ‘Large connection amount’ (max allowed: 25). I guess this is a direct effect of the crawler, but I would like your input on this…

Is PHP a poor choice for writing a crawler?
What causes concurrent connections?
Has anyone here ever coded a crawler (or anything similar)?

The main goal is to parse over 60,000 HTML pages to retrieve specific product information. Has anyone here ever had the same ‘problem’?

Thanks a lot!
#2 / Feb 08, 2010 3:14pm

Sbioko
382 posts

Yeah, I did it some years ago 😊 But today I think PHP is not the best language for developing a crawler, because it is a scripting language that is itself written in C. Have you heard about HipHop for PHP from Facebook? Try it 😊 It turns your PHP into C++ code and increases performance by up to 50%. About the connections, I can’t say anything concrete because I haven’t seen the code.
#3 / Feb 08, 2010 4:17pm

jcavard
130 posts

Yeah, I read about HipHop PHP, but I’m on shared hosting.
I’m looking at rewriting the whole thing in Java so it can be multithreaded.
#4 / Feb 08, 2010 4:18pm

danmontgomery
1802 posts

Your host blocking your IP because of outgoing connections isn’t going to be affected by whether or not the process is threaded… It sounds like you need to limit the number of concurrent connections or talk to your host.

[edit]

Concurrent connections happen when you don’t wait for one connection to close before opening another. Judging from the error message, you have hit the limit of 25 outgoing connections open at once.
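
To illustrate the point about waiting for one connection to finish before opening the next, here is a minimal sketch that fetches URLs strictly one at a time over a single reused cURL handle. The function name fetch_sequentially() and the delay parameter are made up for this example and are not part of any library mentioned in the thread; the timeout values are arbitrary assumptions.

```php
<?php
// Sketch only: fetch_sequentially() and $delayMicroseconds are hypothetical names.
// Returns an array of url => body (or null when the request failed).
function fetch_sequentially(array $urls, int $delayMicroseconds = 250000): array
{
    $results = [];
    if ($urls === []) {
        return $results; // nothing to do, so no handle is created
    }

    // One handle for the whole run, so only one transfer is active at a time.
    $handle = curl_init();
    curl_setopt_array($handle, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 10,   // assumed value
        CURLOPT_TIMEOUT        => 30,   // same default as the execute() method quoted later
    ]);

    foreach ($urls as $url) {
        curl_setopt($handle, CURLOPT_URL, $url);
        $body = curl_exec($handle);     // blocks until this transfer has finished
        $results[$url] = ($body === false) ? null : $body;
        usleep($delayMicroseconds);     // small pause between requests
    }

    // curl_close() has no effect as of PHP 8.0; the handle is freed when it goes out of scope.
    if (PHP_VERSION_ID < 80000) {
        curl_close($handle);
    }
    return $results;
}
```

With this shape the script never has more than one transfer in progress, at the cost of a longer total run time.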
#5 / Feb 08, 2010 4:29pm

jcavard
130 posts

Thanks for pointing out that multithreading won’t help with concurrent connections. It will save me lots of work, because concurrent connections are my main (and sole) concern at this time.

How can I limit the number of connections then? Do you have any idea? I use cURL… am I missing some configuration on the cURL object?

#6 / Feb 08, 2010 4:38pm

Sbioko
382 posts

Why not rewrite it in C++? And an important question: why do you need this (if it’s not a secret)?
#7 / Feb 08, 2010 5:14pm

jcavard
130 posts

Well, I am evaluating all the possibilities right now, and if C++ is the way to go, so be it! I’m gonna rewrite it all!

There is nothing secret about the main purpose of this crawler. I need it to retrieve information from different auction sites into one database. It helps us speed up the search: instead of searching on all the different sites, I query one database that contains all the auctions available.

This script runs nightly, and it is an internal tool on an intranet.

*Sexy edit: this thread has 69 views!*
#8 / Feb 08, 2010 5:23pm

danmontgomery
1802 posts

I don’t know if or what you’re missing without seeing the code… Are you calling curl_close()?

#9 / Feb 08, 2010 5:37pm

jcavard
130 posts

Well, I use Phil Sturgeon’s cURL lib for CI. In the execute() function there is a curl_close():

// End a session and return the results
public function execute()
{
    // Set two default options, and merge any extra ones in
    if(!isset($this->options[CURLOPT_TIMEOUT])) $this->options[CURLOPT_TIMEOUT] = 30;
    if(!isset($this->options[CURLOPT_RETURNTRANSFER])) $this->options[CURLOPT_RETURNTRANSFER] = TRUE;
    if(!isset($this->options[CURLOPT_FOLLOWLOCATION])) $this->options[CURLOPT_FOLLOWLOCATION] = TRUE;
    if(!isset($this->options[CURLOPT_FAILONERROR])) $this->options[CURLOPT_FAILONERROR] = TRUE;

    if(!empty($this->headers))
    {
        $this->option(CURLOPT_HTTPHEADER, $this->headers);
    }

    $this->options();

    // Execute the request and hide all output
    $this->response = curl_exec($this->session);

    // Request failed
    if($this->response === FALSE)
    {
        $this->error_code = curl_errno($this->session);
        $this->error_string = curl_error($this->session);
        
        curl_close($this->session);
        $this->session = NULL;
        return FALSE;
    }
    
    // Request successful
    else
    {
        $this->info = curl_getinfo($this->session);
        
        curl_close($this->session);
        $this->session = NULL;
        return $this->response;
    }
}

Maybe I should try plain PHP (without CI), but the thing is, this script has been running fine for the past month, only to cause connection problems today. I can try new code, but it might work fine a few times before it causes the same problem…

#10 / Feb 08, 2010 6:05pm

Kamarg
241 posts

If you want to continue with your PHP version, look into pooling. This link is for thread pooling, but the same idea applies if you substitute sockets/connections for threads.
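
As a rough illustration of the pooling idea applied to connections, the sketch below keeps at most a fixed number of transfers in flight using PHP’s curl_multi functions, and starts a new one only when an earlier one has finished. The function name fetch_with_pool(), the default of 5, and the timeout are assumptions chosen for the example; nothing here comes from the cURL lib discussed above.

```php
<?php
// Sketch only: fetch_with_pool() and its parameters are hypothetical names.
// Returns an array of url => body (or null when the transfer failed).
function fetch_with_pool(array $urls, int $maxConcurrent = 5): array
{
    if ($maxConcurrent < 1) {
        throw new InvalidArgumentException('maxConcurrent must be at least 1');
    }
    $results = [];
    $queue = array_values($urls);
    if ($queue === []) {
        return $results;
    }

    $multi = curl_multi_init();
    $inFlight = 0; // number of transfers currently attached to the multi handle

    do {
        // Top the pool up, never exceeding the cap.
        while ($inFlight < $maxConcurrent && $queue !== []) {
            $url = array_shift($queue);
            $handle = curl_init($url);
            curl_setopt_array($handle, [
                CURLOPT_RETURNTRANSFER => true,
                CURLOPT_FOLLOWLOCATION => true,
                CURLOPT_TIMEOUT        => 30,   // assumed value
                CURLOPT_FORBID_REUSE   => true, // close the connection after the transfer
                CURLOPT_PRIVATE        => $url, // remember which URL this handle belongs to
            ]);
            curl_multi_add_handle($multi, $handle);
            $inFlight++;
        }

        curl_multi_exec($multi, $stillRunning);

        // Collect finished transfers and free their slots.
        while (($done = curl_multi_info_read($multi)) !== false) {
            $handle = $done['handle'];
            $url = curl_getinfo($handle, CURLINFO_PRIVATE);
            $results[$url] = ($done['result'] === CURLE_OK)
                ? curl_multi_getcontent($handle)
                : null;
            curl_multi_remove_handle($multi, $handle);
            $inFlight--;
        }

        // Wait for network activity; back off briefly if select reports failure.
        if ($inFlight > 0 && curl_multi_select($multi, 1.0) === -1) {
            usleep(100000);
        }
    } while ($inFlight > 0 || $queue !== []);

    curl_multi_close($multi);
    return $results;
}
```

Choosing a cap well below the host’s limit of 25 leaves headroom for anything else on the account that opens connections.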
#11 / Feb 09, 2010 3:37pm

jcavard
130 posts

Well, at this point I’m considering all avenues, so if PHP is not the best choice, I might just rewrite the whole thing.

The only requirements are:
- must run on shared hosting
- possibility to limit concurrent connections

Maybe a C++ extension would be a good idea, but my main concern is the darn concurrent connections…

Does anyone know how I can monitor/limit concurrent connections?
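
One thing worth ruling out, and this is only a guess since nothing in the thread confirms it, is two nightly runs overlapping so that their connections add up. A minimal sketch of a guard using PHP’s flock() follows; the function names and the lock file path are hypothetical.

```php
<?php
// Sketch only: acquire_crawler_lock() / release_crawler_lock() are hypothetical helpers.
// Returns an open file handle when the lock was obtained, or null when another run holds it.
function acquire_crawler_lock(string $lockPath)
{
    $handle = fopen($lockPath, 'c'); // create the file if needed, never truncate it
    if ($handle === false) {
        return null;
    }
    // LOCK_NB makes flock() return immediately instead of waiting for the other run.
    if (!flock($handle, LOCK_EX | LOCK_NB)) {
        fclose($handle);
        return null;
    }
    return $handle;
}

function release_crawler_lock($handle): void
{
    flock($handle, LOCK_UN);
    fclose($handle);
}
```

A nightly script could call acquire_crawler_lock() with a path such as sys_get_temp_dir() . '/crawler.lock' at startup, exit right away when it gets null, and call release_crawler_lock() when the crawl is done.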
