ExpressionEngine CMS
Open, Free, Amazing

This is an archived forum and the content is probably no longer relevant, but is provided here for posterity.

web crawler

February 08, 2010 2:12pm

  • #1 / Feb 08, 2010 2:12pm

    jcavard

    130 posts

    Hi!

    I developed a web crawler with CI. It does the job, but my host keeps blocking my IP over ‘Large connection amount’ (max allowed, 25). I guess this is a direct effect of the crawler, but I would like your input on this…

    Is PHP not the best choice for coding a crawler?
    What causes concurrent connections?
    Have any of you ever coded a crawler (or anything similar)?

    The main goal is to parse over 60,000 HTML pages to retrieve specific product information. Have any of you ever had the same ‘problem’?

    Thanks a lot!

  • #2 / Feb 08, 2010 3:14pm

    Sbioko

    382 posts

    Yeah, I did it some years ago 😊 But today I think PHP is not the best programming language for developing a crawler, because it is a scripting language and it is written in C. But have you heard about HipHop for PHP from Facebook? Try it 😊 It will turn your PHP into C++ code and increase performance by up to 50%. About the connections, I can’t say anything concrete because I haven’t seen the code.

  • #3 / Feb 08, 2010 4:17pm

    jcavard

    130 posts

    Yeah, I read about HipHop for PHP, but I’m on shared hosting.
    I’m looking at rewriting the whole thing in Java so it can be multithreaded.

  • #4 / Feb 08, 2010 4:18pm

    danmontgomery

    1802 posts

    Your host blocking your IP because of outgoing connections isn’t going to be affected by whether or not the process is threaded… It sounds like you need to limit the number of concurrent connections or talk to your host.

    [edit]

    Concurrent connections happen when you don’t wait for one connection to close before opening another. Judging from the error message, you have 25 outgoing connections open at once.
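
    A minimal sketch of what "wait for one connection to close before opening another" can look like with plain cURL, assuming the crawler can afford to fetch pages one at a time. The names fetch_one and crawl_sequentially and the 250 ms pause are made up for illustration; they do not come from the crawler in this thread or from any library.

```php
<?php
/**
 * Fetch a single URL and close the handle before returning.
 * Returns the response body, or false on failure.
 */
function fetch_one(string $url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $body = curl_exec($ch);
    // Release the handle before the next request starts
    // (on PHP 8+ this call has no effect; the handle is freed with $ch).
    curl_close($ch);
    return $body;
}

/**
 * Fetch URLs strictly one after another, so only one outgoing
 * connection is opened by this script at any moment.
 * $fetch is a hypothetical hook that lets a fake fetcher be swapped in.
 */
function crawl_sequentially(array $urls, ?callable $fetch = null, int $pauseMs = 250): array
{
    $fetch = $fetch ?? 'fetch_one';
    $pages = [];
    $first = true;
    foreach ($urls as $url) {
        if (!$first && $pauseMs > 0) {
            usleep($pauseMs * 1000); // short pause between requests
        }
        $first = false;
        $pages[$url] = $fetch($url);
    }
    return $pages;
}
```

    Called as crawl_sequentially($urls), this never has more than one transfer running, at the cost of a slower overall crawl.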

  • #5 / Feb 08, 2010 4:29pm

    jcavard

    130 posts

    Thanks for pointing out that multithreading won’t help with concurrent connections. It will save me lots of work, because concurrent connections are my main (and sole) concern at this time.

    How can I limit the number of connections then? Do you have any idea? I use cURL… am I missing any configuration on the cURL object?

    Your host blocking your IP because of outgoing connections isn’t going to be affected by whether or not the process is threaded… It sounds like you need to limit the number of concurrent connections or talk to your host.

    [edit]

    Concurrent connections happen when you don’t wait for one connection to close before opening another. Judging from the error message, you have 25 outgoing connections open at once.

  • #6 / Feb 08, 2010 4:38pm

    Sbioko

    382 posts

    Why not rewrite it in C++? And an important question: why do you need this? (If it’s not a secret.)

  • #7 / Feb 08, 2010 5:14pm

    jcavard

    130 posts

    Well, I am evaluating all the possibilities right now, and if C++ is the way to go, so be it! I’m gonna rewrite it all!

    There is nothing secret about the main purpose of this crawler. I need it to retrieve information from different auction sites into one database. It helps us speed up searching: instead of searching each site separately, I query one database that contains all the available auctions.

    This script runs nightly, and it is an internal tool on an intranet.

    Why not rewrite it in C++? And an important question: why do you need this? (If it’s not a secret.)


    *Sexy edit: this thread has 69 views!*

  • #8 / Feb 08, 2010 5:23pm

    danmontgomery

    1802 posts

    I don’t know if or what you’re missing without seeing the code… Are you calling curl_close()?

  • #9 / Feb 08, 2010 5:37pm

    jcavard

    130 posts

    I don’t know if or what you’re missing without seeing the code… Are you calling curl_close()?

    Well, I use Phil Sturgeon’s cURL lib for CI. In the execute() function there is a curl_close() call:

    // End a session and return the results
    public function execute()
    {
        // Set default options, and merge any extra ones in
        if(!isset($this->options[CURLOPT_TIMEOUT])) $this->options[CURLOPT_TIMEOUT] = 30;
        if(!isset($this->options[CURLOPT_RETURNTRANSFER])) $this->options[CURLOPT_RETURNTRANSFER] = TRUE;
        if(!isset($this->options[CURLOPT_FOLLOWLOCATION])) $this->options[CURLOPT_FOLLOWLOCATION] = TRUE;
        if(!isset($this->options[CURLOPT_FAILONERROR])) $this->options[CURLOPT_FAILONERROR] = TRUE;
    
        if(!empty($this->headers))
        {
            $this->option(CURLOPT_HTTPHEADER, $this->headers);
        }
    
        $this->options();
    
        // Execute the request and hide all output
        $this->response = curl_exec($this->session);
    
        // Request failed
        if($this->response === FALSE)
        {
            $this->error_code = curl_errno($this->session);
            $this->error_string = curl_error($this->session);
            
            curl_close($this->session);
            $this->session = NULL;
            return FALSE;
        }
        
        // Request successful
        else
        {
            $this->info = curl_getinfo($this->session);
            
            curl_close($this->session);
            $this->session = NULL;
            return $this->response;
        }
    }


    Maybe I should try plain PHP (without CI), but the thing is, this script has been running fine for the past month and only started causing connection problems today. I can try new code, but it might work fine a few times before it causes the same problem…

  • #10 / Feb 08, 2010 6:05pm

    Kamarg

    241 posts

    If you want to continue with your PHP version, look into pooling. This link is for thread pooling, but the same idea applies if you substitute sockets/connections for threads.
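
    As a rough sketch of the pooling idea applied to connections, the function below uses PHP's curl_multi_* functions to keep at most a fixed number of transfers in flight and starts the next URL only when one finishes. The function name fetch_with_pool and the default cap of 5 are assumptions for illustration (the cap is chosen only to stay well under the host's limit of 25 mentioned earlier in the thread); this is not code from the crawler being discussed.

```php
<?php
/**
 * Fetch URLs with at most $maxOpen transfers in flight at once.
 * Returns an array mapping each URL to its body, or to false on failure.
 */
function fetch_with_pool(array $urls, int $maxOpen = 5): array
{
    $maxOpen  = max(1, $maxOpen);
    $queue    = array_values(array_unique($urls));
    $results  = [];
    $inFlight = 0;
    $multi    = curl_multi_init();

    // Start the next queued URL, if there is one.
    $startNext = function () use (&$queue, &$inFlight, $multi) {
        if (!$queue) {
            return;
        }
        $url = array_shift($queue);
        $ch  = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_setopt($ch, CURLOPT_PRIVATE, $url); // remember which URL this handle is for
        curl_multi_add_handle($multi, $ch);
        $inFlight++;
    };

    // Fill the pool up to the cap.
    for ($i = 0; $i < $maxOpen; $i++) {
        $startNext();
    }

    while ($inFlight > 0) {
        curl_multi_exec($multi, $stillRunning);

        // Collect finished transfers and hand their slots to queued URLs.
        while ($done = curl_multi_info_read($multi)) {
            $ch  = $done['handle'];
            $url = curl_getinfo($ch, CURLINFO_PRIVATE);
            $results[$url] = ($done['result'] === CURLE_OK)
                ? curl_multi_getcontent($ch)
                : false;
            curl_multi_remove_handle($multi, $ch);
            curl_close($ch);
            $inFlight--;
            $startNext();
        }

        // Wait for network activity instead of spinning.
        if ($inFlight > 0 && curl_multi_select($multi, 0.5) === -1) {
            usleep(10000);
        }
    }

    curl_multi_close($multi);
    return $results;
}
```

    With $maxOpen set to 1 this behaves like a plain sequential crawl; raising it trades speed against how close the script gets to the host's connection limit.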

  • #11 / Feb 09, 2010 3:37pm

    jcavard

    130 posts

    If you want to continue with your PHP version, look into pooling. This link is for thread pooling, but the same idea applies if you substitute sockets/connections for threads.

    Well, at this point I’m considering all avenues, so if PHP is not the best choice, I might just rewrite the whole thing.

    The only requirements are:
    - must run on shared hosting
    - must be possible to limit concurrent connections

    Maybe a C++ extension might be a good idea, but my main concern is the darn concurrent connections…


    Does anyone know how I can monitor/limit concurrent connections?
