ExpressionEngine CMS
Open, Free, Amazing

Thread

This is an archived forum and the content is probably no longer relevant, but is provided here for posterity.

The active forums are here.

Suggestion: Search class?

April 13, 2011 6:49pm

Subscribe [8]
  • #1 / Apr 13, 2011 6:49pm

    Ninjabear

    54 posts

    I have a real problem getting a good search engine into my sites. There seem to be no good scripts for implementing one in php and sql and only a few alternatives. For example you can use the LIKE statement in sql to use wildcard matches on fields, or you can use fulltext matching in myisam table type in mysql. Or you can use a crawler (don’t like this idea as it doesn’t see my db or what to search), or if you have a dedicated server you can use something good like Sphinx search, but how many people are on dedicated? I’m open to suggestions on better approaches for shared servers.

    So this brings me to my main point: Wouldn’t a search class in CI be a really great idea? I’m ok at coding but I like the idea of handing it over to a specialist who’s a mathematical genius and can just code up something great we can all use.

  • #2 / Apr 13, 2011 9:58pm

    CI Coder

    44 posts

    It would be great, but it would take some doing. A LOT of doing.

  • #3 / Apr 13, 2011 10:48pm

    wh1tel1te

    68 posts

    Fulltext searching is the best option, it is extremely fast and has extra features (e.g. score based on keyword relevancy). I would think though that creating a generic library would be quite difficult due to different database table structures. Here’s a mini-tutorial on how to implement fulltext searching:

    Example table:

    CREATE TABLE IF NOT EXISTS `blog_posts` (
      `post_id` int(9) NOT NULL auto_increment,
      `title` varchar(128) default NULL,
      `contents` text NOT NULL,
      `status` enum('Draft','Publish') default NULL,
      `creation` datetime NOT NULL,
      PRIMARY KEY  (`post_id`)
    ) ENGINE=MyISAM  DEFAULT CHARSET=utf8 AUTO_INCREMENT=1;

    First, determine which fields you are going to search. Note that this only works on varchar and text field types (someone please correct me if I’m wrong).
    For this table, we will use title and contents.

    ALTER TABLE `blog_posts` ADD FULLTEXT `search_index` (`title`, `contents`);

    Your table is now all set up for fulltext searching! Now you just need to run your SQL query like so:

    $results = $this->db
        ->select('blog_posts.*')
        ->select("(MATCH(`title`, `contents`) AGAINST ('" . $keywords . "')) AS score")
        ->where("(MATCH(`title`, `contents`) AGAINST ('" . $keywords . "'))")
        ->order_by('score', 'desc')
        ->get('blog_posts')
        ->result_array();

    As an added bonus, these results are ordered by relevancy, the most relevant result will be at the top.

  • #4 / Apr 14, 2011 9:07am

    Ninjabear

    54 posts

    Thank you for taking the time to write this approach out.

    The problem I’ve found with fulltext is that whilst it’s a lot better than using LIKE() function, it won’t match certain searches. So if you have a search like “form” it will only find “form” and not “forms” or “form’s” for example. It has a lot of drawbacks. Also will not match searches less than 3 chars and has blocked words.

    I find it really strange that no one has gone to the trouble of explaining stemming properly on the web to fix this. I followed tutorials from various sites but it’s like once guy wrote some incorrect code and everyone else just copied his code for their tutorials.

  • #5 / Apr 14, 2011 2:18pm

    CI Coder

    44 posts

    If it’s just a single word search then you can use the inflector helper in CI to get what you want like this:

    $this->load->helper('inflector');
    //say you get the search term from a form
    $term = $this->input->post('search_term');
    
    //you may want to inspect the term for special characters and take them out here
    
    //get the singular form
    $singular = singular($term);
    //get the plural form
    $plural = plural($singular);
    $saxon_genitive = $singular . "\\'s " . $plural . "\\'";
    
    $keywords = $singular . ' ' . $plural . ' ' . $saxon_genitive;
    //then you can run the query from above in boolean mode
    $results = $this->db
        ->select('blog_posts.*')
        ->select("(MATCH(`title`, `contents`) AGAINST ('" . $keywords . "' IN BOOLEAN MODE)) AS score")
        ->where("(MATCH(`title`, `contents`) AGAINST ('" . $keywords . "' IN BOOLEAN MODE))")
        ->order_by('score', 'desc')
        ->get('blog_posts')
        ->result_array();

    Of course, this is a start. You can enhance it and tweak it to fit your desire. You can even allow the user to enter many words and then you would do the above for all of them. Of course you would have to have a list of stop words etc. etc. This kind of functionality however cannot be programmed generally like you suggested because of so many other things that would take me hours to list here.

  • #6 / Apr 14, 2011 5:35pm

    wh1tel1te

    68 posts

    So if you have a search like “form” it will only find “form” and not “forms” or “form’s” for example. It has a lot of drawbacks.

    Boolean fulltext searches. Basically allows you to append “*” to your words. e.g.

    $this->db
          ->where("MATCH(field1, field2) AGAINST ('word1* word2*' IN BOOLEAN MODE)")
          ->get('table')
          ->result_array();

    Also will not match searches less than 3 chars and has blocked words.

    Solution 1: Change your MySQL conf to allow words with 3 characters.
    Solution 2: If that is not an option, use a LIKE statement fallback if the word is 3 characters or less. i.e. First do a fulltext search, and if no results, do a LIKE search. Or optionally, do a fulltext search if the keywords is greater than 3 chars, otherwise do LIKE.

  • #7 / Apr 14, 2011 6:32pm

    wh1tel1te

    68 posts

    This has actually piqued my interest in building a generic search class, however difficult it may be. I love challenges 😛

  • #8 / Apr 15, 2011 8:52am

    Ninjabear

    54 posts

    Thanks for all suggestions here. I’ll definitely come back and look through this stuff when I decide to improve the search feature I have at the moment.

  • #9 / Apr 15, 2011 2:19pm

    CI Coder

    44 posts

    So if you have a search like “form” it will only find “form” and not “forms” or “form’s” for example. It has a lot of drawbacks.

    Boolean fulltext searches. Basically allows you to append “*” to your words. e.g.

    That is a good idea but you will get results that may be semantically irrelevant. With that approach, if you searched for “form” you’d get back documents that contain “formulas” or “formula” or “formulate” and basically every word that starts with “form,” which will be irrelevant to the search.

  • #10 / Apr 15, 2011 6:17pm

    Ninjabear

    54 posts

    ...
    That is a good idea but you will get results that may be semantically irrelevant. With that approach, if you searched for “form” you’d get back documents that contain “formulas” or “formula” or “formulate” and basically every word that starts with “form,” which will be irrelevant to the search.

    So stemming is a good cure for this? If so it would be great if someone could do quick tutorial on this which explained how to do it.

  • #11 / Apr 15, 2011 8:15pm

    wh1tel1te

    68 posts

    That is a good idea but you will get results that may be semantically irrelevant. With that approach, if you searched for “form” you’d get back documents that contain “formulas” or “formula” or “formulate” and basically every word that starts with “form,” which will be irrelevant to the search.

    Personally I would say that is acceptable if you still wanted to return results such as “forms”, because that is the only way this will work. If you order by the “score” that MySQL returns (relevancy) descending, results such as “formulas” will display at the bottom because it isn’t as relevant as results with an exact match of “form”.

    I guess it’s your choice really: non-boolean searching which will return exact matches only, or boolean searching which will return partial matches with a chance that it will return slightly irrelevant results.

  • #12 / Apr 15, 2011 11:51pm

    CI Coder

    44 posts

    Personally I would say that is acceptable if you still wanted to return results such as “forms”, because that is the only way this will work. If you order by the “score” that MySQL returns (relevancy) descending, results such as “formulas” will display at the bottom because it isn’t as relevant as results with an exact match of “form”.

    I guess it’s your choice really: non-boolean searching which will return exact matches only, or boolean searching which will return partial matches with a chance that it will return slightly irrelevant results.

    Here is a little experiment:

    Create the following table with the following data:

    DROP TABLE IF EXISTS `test_searches`;
    CREATE TABLE `test_searches` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `body` varchar(250) DEFAULT NULL,
      PRIMARY KEY (`id`),
      FULLTEXT KEY `fulltext_search_index` (`body`)
    ) ENGINE=MyISAM AUTO_INCREMENT=8 DEFAULT CHARSET=latin1;
    
    -- ----------------------------
    -- Records of test_searches
    -- ----------------------------
    INSERT INTO `test_searches` VALUES ('1', 'This is about your tax form');
    INSERT INTO `test_searches` VALUES ('2', 'I am formulating a new theory about searching and about different forms it can take');
    INSERT INTO `test_searches` VALUES ('3', 'what is the formula for the area of a circle?');
    INSERT INTO `test_searches` VALUES ('4', 'New forms of life');
    INSERT INTO `test_searches` VALUES ('5', 'some text that does not contain the word');
    INSERT INTO `test_searches` VALUES ('6', 'irrelevant text');
    INSERT INTO `test_searches` VALUES ('7', 'etc');
    INSERT INTO `test_searches` VALUES ('8', 'form, form form form form');

    Now I run the following query:

    SELECT *, MATCH (body) AGAINST( 'form*' IN BOOLEAN MODE ) AS score FROM test_searches;

    Here are the results:

    +----+-------------------------------------------------------------------------------------+-------+
    | id | body                                                                                | score |
    +----+-------------------------------------------------------------------------------------+-------+
    |  1 | This is about your tax form                                                         |     1 |
    |  2 | I am formulating a new theory about searching and about different forms it can take |     1 |
    |  3 | what is the formula for the area of a circle?                                       |     1 |
    |  4 | New forms of life                                                                   |     1 |
    |  5 | some text that does not contain the word                                            |     0 |
    |  6 | irrelevant text                                                                     |     0 |
    |  7 | etc                                                                                 |     0 |
    |  8 | form, form form form form                                                           |     1 |
    +----+-------------------------------------------------------------------------------------+-------+

    How do you differentiate those records?

    I think that in this particular case, if you only need a one-term search, my initial example with the inflector helper is actually best, because it gives you only those docs that contain the term or its plural or its Saxon genitive. Here is an example:

    SELECT *, MATCH (body) AGAINST( 'form forms form\'s forms\'' IN BOOLEAN MODE ) AS score FROM test_searches;
    +----+-------------------------------------------------------------------------------------+-------+
    | id | body                                                                                | score |
    +----+-------------------------------------------------------------------------------------+-------+
    |  1 | This is about your tax form                                                         |     2 |
    |  2 | I am formulating a new theory about searching and about different forms it can take |     2 |
    |  3 | what is the formula for the area of a circle?                                       |     0 |
    |  4 | New forms of life                                                                   |     2 |
    |  5 | some text that does not contain the word                                            |     0 |
    |  6 | irrelevant text                                                                     |     0 |
    |  7 | etc                                                                                 |     0 |
    |  8 | form, form form form form                                                           |     2 |
    +----+-------------------------------------------------------------------------------------+-------+

    In this case “what is the formula for the area of a circle?” has a score of 0, which is to be expected because someone who’s looking for “form” is certainly not looking for “formula,” or they would type in “formula.”

    However, if you were to program a generic search library you would have to use all of these techniques, and many more, and combine them into something that would give relevant results in most situations. But then what if the tables are not MyISAM tables? For instance I use only InnoDB tables in my projects to ensure data integrity and guard against orphan records and other things like that which MyISAM doesn’t support. What if the database is not MySQL but Oracle, Postage or even MongoDB (like I used in my last two projects)? Just thinking about the task only “what ifs” pop up in my mind.

    I do however enjoy the discussion. That’s why I pulled my glass of Pinot Noir close and I spent about an hour and a half on a Friday evening doing this experiment. 😊 Another reason is that this idea of “relevancy,” if solved in an unbiased way, has potential. So far Google’s Page Rank is the best solution to relevancy, but only inasmuch as democracy is the best solution to human society.

  • #13 / Apr 16, 2011 12:00am

    wh1tel1te

    68 posts

    You have to do your WHERE statement using boolean, not your SELECT. Non-boolean select will return a decimal number with which you can work with.

    You do raise an interesting point about the “formula” result returning a 0 score - this may be useful for determining relevancy 😊. I wish I could do some experiments as well, but no time at the moment! Thanks for your experiments though, very interesting results! 😊

    As for tables that are not MyISAM formatted - you could probably create a dedicated search table that is MyISAM formatted, and populate a keywords field with the other table’s contents.

  • #14 / Apr 16, 2011 1:30am

    CI Coder

    44 posts

    You have to do your WHERE statement using boolean, not your SELECT. Non-boolean select will return a decimal number with which you can work with.

    You do raise an interesting point about the “formula” result returning a 0 score - this may be useful for determining relevancy 😊. I wish I could do some experiments as well, but no time at the moment! Thanks for your experiments though, very interesting results! 😊

    As for tables that are not MyISAM formatted - you could probably create a dedicated search table that is MyISAM formatted, and populate a keywords field with the other table’s contents.

    😊 OK, but please next time post some code so I don’t feel alone. Add 2 more rows to the table so I can ask you a question after that.

    INSERT INTO `test_searches` VALUES ('9', 'My tax forms are late');
    INSERT INTO `test_searches` VALUES ('10', 'My tax form is late');

    Run the query:

    SELECT *, MATCH (body) AGAINST( 'form*' ) AS score FROM test_searches
    WHERE MATCH (body) AGAINST ('form*' IN BOOLEAN MODE )
    ORDER BY score DESC;
    +----+-------------------------------------------------------------------------------------+-------------------+
    | id | body                                                                                | score             |
    +----+-------------------------------------------------------------------------------------+-------------------+
    |  1 | This is about your tax form                                                         | 0.837664723396301 |
    |  8 | form, form form form form                                                           | 0.837664723396301 |
    | 10 | My tax form is late                                                                 | 0.828248143196106 |
    |  2 | I am formulating a new theory about searching and about different forms it can take |                 0 |
    |  3 | what is the formula for the area of a circle?                                       |                 0 |
    |  4 | New forms of life                                                                   |                 0 |
    |  9 | My tax forms are late                                                               |                 0 |
    +----+-------------------------------------------------------------------------------------+-------------------+
    7 rows in set

    Question: why is “My tax form is late” relevant ( score > 0 ) and “My tax forms are late” is not ( score = 0 )? Shouldn’t they be at least close? Or, put another way, why is “My tax forms are late” just as relevant as “what is the formula for the area of a circle?” when you are searching for “form?”

    And now my way:

    SELECT *, MATCH (body) AGAINST( 'form forms form\'s forms\'' ) AS score FROM test_searches
    WHERE MATCH (body) AGAINST ('form forms form\'s forms\'' IN BOOLEAN MODE )
    ORDER BY score DESC;
    +----+-------------------------------------------------------------------------------------+------------------+
    | id | body                                                                                | score            |
    +----+-------------------------------------------------------------------------------------+------------------+
    |  1 | This is about your tax form                                                         |  1.6753294467926 |
    |  8 | form, form form form form                                                           |  1.6753294467926 |
    |  4 | New forms of life                                                                   | 1.65649628639221 |
    |  9 | My tax forms are late                                                               | 1.65649628639221 |
    | 10 | My tax form is late                                                                 | 1.65649628639221 |
    |  2 | I am formulating a new theory about searching and about different forms it can take | 1.62007236480713 |
    +----+-------------------------------------------------------------------------------------+------------------+
    6 rows in set

    You see that if you do it my way you get possible relevant results being signaled as such by their score, and “My tax form is late” has the same relevance as “My tax forms are late.” And the “formula” doesn’t even appear in the recordset. Anyway, let’s keep going with this. I got the weekend free anyway 😊

  • #15 / Apr 16, 2011 1:00pm

    Ninjabear

    54 posts

    ...
    Question: why is “My tax form is late” relevant ( score > 0 ) and “My tax forms are late” is not ( score = 0 )? Shouldn’t they be at least close?

    Or, put another way, why is “My tax forms are late” just as relevant as “what is the formula for the area of a circle?” when you are searching for “form?”

    I read the whole boolean fulltext search article and could see no clue as to why “My tax form is late” is not close to “My tax forms are late” given that they both contain the word “form” which is main requirement of the * operator.

.(JavaScript must be enabled to view this email address)

ExpressionEngine News!

#eecms, #events, #releases