Enabling fast, high-powered search in ExpressionEngine with Elasticsearch

by: Derek Jones on: 5/27/2015

This is a guest post from Matt Weinberg, Co-Founder, and Ben Smith, Technical Lead, of Vector Media Group, a New York City-based interactive digital agency. The Vector team has extensive experience with ExpressionEngine, including on high-profile, high-traffic sites. Find out below how they implemented Elasticsearch[1] for fast, powerful site search in ExpressionEngine for a recent client.

We love ExpressionEngine and are always looking for new and exciting ways to integrate it with other technologies and services. Elasticsearch is an amazingly fast and accurate search technology that we frequently use in our client work, including ExpressionEngine projects. Regardless of the search platform a website currently uses, Elasticsearch almost always provides some easy wins for business owners and end users. Some of the world’s largest brands, such as Netflix, LinkedIn, eBay, and Wikipedia, use Elasticsearch. We recently integrated Elasticsearch with ExpressionEngine for one of our clients, a large multinational company that we cannot name due to a non-disclosure agreement (but who gave us permission to share this anonymized information).

Why we chose Elasticsearch

Like many who switch to Elasticsearch, our client outgrew their original search implementation, and wanted to boost the performance and relevance of the search results on their website. The site gets around 30 million pageviews per month (with additional traffic via mobile apps), and search plays a big part in the user experience.

Many MySQL-based solutions include some form of FULLTEXT indexing, which we used in production for over a year before replacing it. Others just use LIKE queries with wildcards. Our initial implementation used a combination of the two, with some additional custom modifications to optimize query performance. We offloaded search queries to a dedicated read-only MySQL slave instance, which took search load off the main DB. The slave database had a tweaked schema and additional alterations that were specifically optimized for search, including InnoDB tables and FULLTEXT indices. Even after hours of tweaking to get the most out of what we had, it was clear that MySQL is not designed to be a search engine, so we began evaluating alternatives.

Ultimately, we chose Elasticsearch over Sphinx and raw Solr/Lucene implementations due to its ease of setup, ability to scale, and nice fit within our existing stack. We also determined that ExpressionEngine and Elasticsearch could integrate in a way that gave us the control we needed, without spending unnecessary time recreating basic functionality.

Managing Elasticsearch in the ExpressionEngine CMS

One of EE’s strengths is its excellent admin interface, which lets authorized users add new channels and fields on the fly in the CMS. Although EE adds these fields to the DB backend, Elasticsearch data indexing is another story. When new fields are added to EE, the Elasticsearch schema needs to be updated as well, and then of course we need to reindex the data. Doing this in production adds complications (such as uptime and availability concerns).

To solve these issues, we wanted a unified interface to manage Elasticsearch indices and data. We built a custom module with a control panel that lets us perform basic Elasticsearch functions and eases the integration process. There is an interface for creating new indices with custom schema mappings, and the ability to switch between various indices when serving data to the EE frontend. Using index aliases, we are able to have our application point to a single index alias, while managing which index is served from that alias, all from the EE control panel. This allows multiple mappings to coexist, and gives us the ability to quickly stage new mappings and see how our application responds (in a staging environment of course!). We are able to create new indices, add data, and promote indices to production with zero downtime.
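As a rough illustration of the alias swap described above, the sketch below builds the JSON body for Elasticsearch's `_aliases` endpoint, which applies its remove and add actions atomically. This is not the code from our module; the alias name `site_search` and the index names `entries_v1` and `entries_v2` are hypothetical, and the body would still need to be sent with `POST /_aliases` by whatever HTTP client you use.

```python
import json


def build_alias_swap(alias, old_index, new_index):
    """Return the body for POST /_aliases that repoints `alias`.

    Both actions are applied in one atomic step, so the alias never
    points at zero indices while the swap happens.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names, for illustration only.
swap_body = build_alias_swap("site_search", "entries_v1", "entries_v2")
print(json.dumps(swap_body, indent=2))
```

Because the application only ever queries the alias, promoting a freshly built index is a single request, and rolling back is the same request with the index names reversed.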

Elasticsearch index management

Indexing Entries

EE’s MySQL database is still our primary data source, and all Elasticsearch data is considered ephemeral and can be recreated from what EE stores in MySQL at any time. We wrote a custom EE extension that fires when entries are updated in relevant channels, using post-save hooks to index (or delete) that data in Elasticsearch, so Elasticsearch is always up to date.
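The index-or-delete decision made in such a hook can be sketched as follows. This is a simplified illustration in Python rather than our actual extension, and it rests on assumptions: that an entry is a dictionary with `entry_id` and `status` keys, that only entries with status `open` should be searchable, and that documents live at `/{index}/_doc/{id}` (the exact path shape depends on the Elasticsearch version). The function name `sync_request` is hypothetical.

```python
def sync_request(index, entry):
    """Map a saved entry to the Elasticsearch request that mirrors it.

    Returns a plain description of the request (method, path, body)
    rather than sending it, so the decision logic is easy to inspect.
    """
    path = f"/{index}/_doc/{entry['entry_id']}"
    if entry.get("status") == "open":
        # Published entries are (re)indexed; PUT replaces any existing document.
        return {"method": "PUT", "path": path, "body": entry}
    # Any other status (assumed to mean "not public") is removed from search.
    return {"method": "DELETE", "path": path, "body": None}


# Hypothetical entries, for illustration only.
published = sync_request("entries_v2", {"entry_id": 42, "status": "open", "title": "Hello"})
unpublished = sync_request("entries_v2", {"entry_id": 43, "status": "closed", "title": "Bye"})
print(published["method"], published["path"])
print(unpublished["method"], unpublished["path"])
```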

Although we first created an ExpressionEngine-based queue management system for indexing our data, we ultimately replaced it with a Beanstalkd instance on a separate server. Our original EE indexing queue was adequate in theory but it did not scale well in our load testing, and we never put it into production. Instead, we chose to rely on each piece of software in our stack to do only what it does best.

When adding new fields or channels (initiating a full reindexing), those jobs are sent to Beanstalkd in batches. We have close to 70,000 entries that we index in Elasticsearch, and Beanstalkd queries that data via a custom REST API wrapper EE module we built. That API serves several consumers (such as the iOS and Android apps and some third parties we syndicate data to), but it also has special endpoints that are meant solely to index data in Elasticsearch. Entries are submitted to Elasticsearch in bulk to gain efficiencies. The entire reindexing process takes about one hour and does not impact database or site performance for our end users because it runs in the background.
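To make the batching and bulk submission more concrete, here is a minimal sketch of splitting entries into batches and formatting each batch for Elasticsearch's `_bulk` endpoint, which expects newline-delimited JSON: an action line followed by the document itself, with a trailing newline at the end. The helper names `batches` and `bulk_body`, the batch size, and the index name are all illustrative assumptions, not details of our production setup.

```python
import json
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def bulk_body(index, entries):
    """Format entries as the newline-delimited JSON body for POST /_bulk."""
    lines = []
    for entry in entries:
        # Action line: index this document under the entry's ID.
        lines.append(json.dumps({"index": {"_index": index, "_id": entry["entry_id"]}}))
        # Source line: the document itself.
        lines.append(json.dumps(entry))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical data: five entries, sent two at a time.
entries = [{"entry_id": i, "title": f"Entry {i}"} for i in range(1, 6)]
bodies = [bulk_body("entries_v2", chunk) for chunk in batches(entries, 2)]
print(len(bodies))
print(bodies[0])
```

Each body would be one queued job; a worker picks it up and posts it to `_bulk`, so a full reindex becomes many small, independent requests instead of one long-running one.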

Tooling and Monitoring

Basic Elasticsearch health information is available in the custom ExpressionEngine CMS module we created. We can see a list of indices with stats information, as well as which indices and shards are problematic. As always, we can hot swap to “green” indices if anything goes awry.
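A health check like this can be approximated by filtering the rows that Elasticsearch returns from `GET /_cat/indices?format=json`, where each row includes an `index` name and a `health` value of `green`, `yellow`, or `red`. The sketch below works on already-parsed rows; the sample data and the function name `problematic_indices` are made up for illustration and do not reflect our module's code.

```python
def problematic_indices(rows):
    """Return the names of indices whose health is anything other than green."""
    return [row["index"] for row in rows if row.get("health") != "green"]


# Made-up rows in the shape returned by GET /_cat/indices?format=json.
sample_rows = [
    {"index": "entries_v1", "health": "green", "docs.count": "70000"},
    {"index": "entries_v2", "health": "yellow", "docs.count": "70000"},
    {"index": "entries_v3", "health": "red", "docs.count": "0"},
]
unhealthy = problematic_indices(sample_rows)
print(unhealthy)
```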

We use Logstash and Kibana to centralize and visualize our log data from HAProxy, which sits in front of half a dozen Apache web servers running ExpressionEngine. Streaming our logs allows us to quickly trace and fix any errors as they arise. The ExpressionEngine admin server is also logged in the same way, so we have real-time information to cross-reference with bug reports from the dozens of authenticated users who are actively creating content in the CMS.

Benefits

We have seen significant benefits since integrating ExpressionEngine with Elasticsearch. Load on our MySQL servers has dropped significantly, and Elasticsearch uses far fewer server resources than MySQL needed to support the slave DB and the search queries. (Our old MySQL slave DB server had eight cores and 52 GB of RAM; our Elasticsearch setup supports millions of queries per day with only four cores and 15 GB of RAM. You can read more about our technical setup here.)

The benefits for users have been immense: the entire search experience is much faster and we’re able to deploy advanced search functions like auto-complete, synonyms, spell correction, faceting, and geo boundaries much more quickly. We know that no matter what data is in ExpressionEngine, it’ll be searchable and filterable in a quick, scalable way.

Don’t reinvent the wheel

We chose to use Elasticsearch because we determined it to be the best fit for our users and technology stack. We chose ExpressionEngine for the same reason. Integrating both has been a huge win for our client and their users, and we continue to experiment with ways to enhance the setup.


  1. Elasticsearch is a trademark of Elasticsearch BV, registered in the U.S. and in other countries.
