Pablo Jairala

Working with Elastic Search in Ruby: Waistband

@ 14 Mar 2014

elasticsearch ruby database


The need

As time goes by, we at TaskRabbit have been using Elastic Search more and more.

At first, it was a place to store our bus messages in an easy-to-search format, making them available for Product, debugging, data analysis, etc.

After that, we used it to replace our search system for increased speed and better results.

It now powers TaskRabbit recommendations on our London offering. As you can imagine, it has served us well.

Along the way, we’ve come up with a couple of tools that have made our jobs easier, some of which we’ve abstracted into an open-source gem called Waistband. This blog post discusses how it works and some of the decisions made when coming up with it.

What it does

Waistband loads up config files and provides some quality-of-life methods and classes that should make working with ES much easier, or at the very least, less repetitive.

At the very minimum, you will have two config files: waistband.yml and waistband_my_index.yml, where “my_index” is the name of your index.

The general connection settings (host, port, protocol, number of retries, etc.) get loaded up from the general waistband.yml, but the index-specific settings come from waistband_my_index.yml, including mappings, index-level settings, etc.

In normal Rails fashion, the settings are divided by environment.

Under the hood, Waistband uses the Elasticsearch gem for both the API and the transport layer.

Config files

waistband.yml normally looks like:

development:
    retries: 5
    timeout: 2
    reload_on_failure: true
    servers:
        server1:
            protocol: http
            host: localhost
            port: 9200

These settings get passed to the Waistband::Configuration class which instantiates a client connection to the Elastic Search server. Note that you may (and probably should) have several servers/nodes in your config file for your production/staging environments. The names (server1 in the example) are not used for anything yet, but eventually we would like to bubble them up in some way when a connection gets blacklisted, etc.
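To make the multi-server layout concrete, here is a minimal sketch that reads a waistband.yml-style document, picks one environment, and turns each server entry into a URL. The server_urls helper and the second server entry are hypothetical, written only for illustration; this is not how Waistband::Configuration is implemented.

```ruby
require 'yaml'

# Sample config in the same shape as waistband.yml. The second server is an
# invented example of an extra node.
SAMPLE_CONFIG = <<~YAML
  development:
      retries: 5
      timeout: 2
      reload_on_failure: true
      servers:
          server1:
              protocol: http
              host: localhost
              port: 9200
          server2:
              protocol: http
              host: 127.0.0.1
              port: 9201
YAML

# Hypothetical helper (not part of Waistband): select the environment's
# section and build one URL per configured server.
def server_urls(yaml_text, env)
  config = YAML.safe_load(yaml_text).fetch(env)
  config.fetch('servers').values.map do |server|
    "#{server['protocol']}://#{server['host']}:#{server['port']}"
  end
end

urls = server_urls(SAMPLE_CONFIG, 'development')
puts urls
```

Using fetch for the environment key means a missing section fails loudly instead of silently connecting with empty settings.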

An index settings file will probably look like:

development:
    stringify: false
    settings:
        index:
            number_of_shards: 4
            number_of_replicas: 1
            analysis:
                analyzer:
                    default:
                        type: snowball
    mappings:
        mytype:
            _source:
                includes: ["*"]
            properties:
                title:
                    type: string
                    index: not_analyzed
                description:
                    type: string

The stringify setting determines whether Waistband internally runs a recursive stringification process on Array and Hash types. We’ve found this useful in some projects, although for most indexes, where you know the exact mapping you’ll be using, you’ll probably want to leave this as false.
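As a rough illustration of what recursive stringification could mean, the sketch below walks nested Hash and Array values and converts every leaf to a String. This is only one possible reading: the stringify_all helper here is hypothetical, and the gem’s internal routine may produce different output.

```ruby
# Hypothetical illustration only; Waistband's own stringification may differ.
def stringify_all(value)
  case value
  when Hash
    # Stringify keys and recurse into each value.
    value.each_with_object({}) { |(k, v), out| out[k.to_s] = stringify_all(v) }
  when Array
    value.map { |item| stringify_all(item) }
  else
    value.to_s
  end
end

event = { 'user_id' => 42, 'tags' => [1, :vip], 'meta' => { 'paid' => true } }
stringified = stringify_all(event)
p stringified
```

Something like this can help when documents with loosely typed payloads share an index, since every leaf then has the same type.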

The rest of the settings are built-in Elastic Search index options.

If you’re not familiar with the analysis settings, you can read up a bit on them in the Elastic Search analyzers documentation.

Creating and deleting indexes

Index creation and deletion are pretty simple. Let’s say you have an index called search. You’d define the index’s settings in a file called waistband_search.yml, and then you can work with it directly in Ruby:

index = Waistband::Index.new('search')
index.create
 => true
index.create!
 => Waistband::Errors::IndexExists: Index already exists

The Index#create method will create the index and return true, or just return true if the index already exists. The Index#create! method will raise a Waistband::Errors::IndexExists exception if the index already exists. Most methods in the gem have equivalent ! methods that raise exceptions when the expected behaviour doesn’t happen.

Delete an index:

index.delete
 => true
index.delete!
 => Waistband::Errors::IndexNotFound: Index not found
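The pairing of a forgiving method with a strict ! sibling is a common Ruby convention. Here is a minimal sketch of the pattern using a hypothetical in-memory FakeIndex that never talks to Elastic Search. The class and error names are invented for illustration, and returning true from delete when the index is missing is an assumption that mirrors the behaviour described for create.

```ruby
# Hypothetical stand-ins for illustration; not the gem's classes.
class IndexExists < StandardError; end
class IndexNotFound < StandardError; end

class FakeIndex
  def initialize
    @exists = false
  end

  # Strict variant: raises when the index is already there.
  def create!
    raise IndexExists, 'Index already exists' if @exists
    @exists = true
  end

  # Forgiving variant: delegates to the strict one and swallows the error.
  def create
    create!
  rescue IndexExists
    true
  end

  def delete!
    raise IndexNotFound, 'Index not found' unless @exists
    @exists = false
    true
  end

  # Assumption: like create, this returns true even if nothing was deleted.
  def delete
    delete!
  rescue IndexNotFound
    true
  end
end

index = FakeIndex.new
index.create            # creates the fake index
index.create            # no-op, still returns true
begin
  index.create!
rescue IndexExists => e
  puts e.message
end
```

Writing the forgiving method in terms of the strict one keeps the real logic in a single place.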

Writing and reading data

Writing data is similarly simple. Let’s assume we have a User model we want to denormalize to Elastic Search for searching purposes:

user = User.last
index.save(user.id, user.attributes.slice('email', 'name'))
 => true

Then we can read the record:

index.find user.id
 => {'email' => 'test@gmail.com', 'name' => 'Testo McTesterson'}

You can also retrieve the raw Elastic Search record with all its metadata using the #read method:

index.read user.id
 => {"_index" => "search", "_type" => "search", "_id" => "user_1", "_version" => 1, "found" => true, "_source" => {'email' => 'test@gmail.com', 'name' => 'Testo McTesterson'}}

If you’re storing a single type that’s easily deduced from your index name, then this setup is all you need. However, if you’re storing multiple types in a single index, this won’t work with the default settings. You should pass the _type option to the #save method:

index.save(user.id, {:name => user.name, :email => user.email, :_type => 'user'})
 => true
index.read user.id
 => {"_index" => "search", "_type" => "user", "_id" => "user_1", "_version" => 1, "found" => true, "_source" => {'email' => 'test@gmail.com', 'name' => 'Testo McTesterson'}}

Note that for this type of usage you should define the various type mappings in your index-specific YAML file.

Search

The Index class exposes the #search method:

search = index.search(:sort => {:email => 'desc'})
 => #<Waistband::SearchResults:0x007f880f954248 ...>

It returns an instance of the SearchResults class, which provides a couple of short-hand methods for dealing with search results.

search.hits
 => [{"_index" => "search", "_type" => "search", "_id" => "1", "_score" => nil, "_source" => {'email' => 'test@gmail.com', 'name' => 'Testo McTesterson'}}]
search.total_results
 => 1

The #hits method gives you pretty much the raw results from the search result hash, down at the ['hits']['hits'] path. There’s also a very simple method_missing interface that lets you access any part of the search result hash directly on the SearchResults object. For example, you can access ['aggregations'] by invoking search.aggregations, etc.

Another method you might find useful is #results, which goes through your hits and wraps each one in the Waistband::Result class, which is just another method_missing interface for accessing each hit. So you could do stuff like:

result = search.results.first
result.name
 => "Testo McTesterson"
result.email
 => "test@gmail.com"
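For readers unfamiliar with the technique, this is roughly what a method_missing wrapper around a hit can look like. HitWrapper is a hypothetical class written for this post’s example data; it is a sketch of the general pattern, not the source of Waistband::Result.

```ruby
# Hypothetical wrapper for illustration; not Waistband::Result itself.
class HitWrapper
  def initialize(hit)
    @hit = hit
    @source = hit.fetch('_source', {})
  end

  # Look the method name up in _source first, then in the hit metadata.
  def method_missing(name, *args, &block)
    key = name.to_s
    return @source[key] if @source.key?(key)
    return @hit[key] if @hit.key?(key)
    super
  end

  # Keep respond_to? consistent with method_missing.
  def respond_to_missing?(name, include_private = false)
    key = name.to_s
    @source.key?(key) || @hit.key?(key) || super
  end
end

hit = {
  '_id' => '1',
  '_source' => { 'email' => 'test@gmail.com', 'name' => 'Testo McTesterson' }
}
wrapped = HitWrapper.new(hit)
puts wrapped.name
puts wrapped._id
```

Every call goes through a dynamic lookup, which is part of why the raw hash is the faster option for large result sets.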

We’ve found this method and its paginated sibling (more on that below) useful when dealing with gems like jbuilder, and in general for making code more readable. However, if you’re dealing with large enough page sizes, or if you want to speed up every piece of your code, we recommend using the #hits or #paginated_hits methods as opposed to the results variety.

If you’re using the Kaminari gem, you can use the #paginated_hits and #paginated_results methods to paginate hits and results easily.

Pagination is simple as well:

search = index.search(:sort => {:email => 'desc'}, :page => 2, :page_size => 40)

This translates the page and page_size options into their from and size equivalents. You can of course pass in from and size directly if you prefer.
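The arithmetic behind that translation is small enough to sketch. The page_to_from_size helper below is hypothetical and assumes pages are numbered starting at 1; the gem’s own conversion may handle defaults and edge cases differently.

```ruby
# Hypothetical helper; assumes 1-based page numbers.
def page_to_from_size(options)
  return options unless options[:page] && options[:page_size]

  # Drop the paging keys and replace them with Elastic Search's from/size.
  rest = options.reject { |key, _| %i[page page_size].include?(key) }
  rest.merge(from: (options[:page] - 1) * options[:page_size],
             size: options[:page_size])
end

query = page_to_from_size(sort: { email: 'desc' }, page: 2, page_size: 40)
p query
```

With 40 results per page, page 2 starts at offset 40, so the query skips the first 40 hits and returns the next 40.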

Problems with big indexes

As I mentioned previously, one of our main uses of Elastic Search is storing all our bus messages for logging, debugging, analysis, etc. We were initially storing this data in a single index. Eventually, the need arose to delete older events after moving them to more permanent storage. This deletion becomes slow, because you have to search for the objects you want to delete, and then loop through them and delete them either one by one or in bulk via the ES API. Neither option is ideal when you have a large enough index.

The approach we’re following nowadays is creating an index per month, following the naming convention of “index_name_year_month”, so our indexes look something like:

  bus_events_2014_01, bus_events_2014_02, ...

These indexes are then exposed as a single entity for search purposes through an alias called ‘bus_events’.

To aid in this pattern, we’re providing some syntactic sugar for alias creation and manipulation. The idea is that you create a single YAML index config file called waistband_bus_events.yml, and all sub-indexes use this same file:

# create the index
index = Waistband::Index.new('bus_events', subs: %w(2014 01))
index.create
 => true

# create the alias
index.alias('bus_events')
 => true

index.save('some_event_01', {'data' => true})
 => true

The object would get saved onto the index ‘bus_events_2014_01’, but you can query the ‘bus_events’ alias directly to get results from this subindex or any of its siblings:

index = Waistband::Index.new('bus_events')
index.search(:sort => {:data => 'desc'})
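The naming convention itself is easy to sketch in plain Ruby. The helpers below are hypothetical and only illustrate how a base name and its subs can be joined, and how zero-padded year and month parts make it simple to pick out the monthly indexes older than a cutoff; they do not call Elastic Search or Waistband.

```ruby
require 'date'

# Hypothetical helpers for illustration; names and behaviour are assumptions.
def sub_index_name(base, subs)
  [base, *subs].join('_')
end

def monthly_index_name(base, date)
  sub_index_name(base, [date.strftime('%Y'), date.strftime('%m')])
end

# Zero-padded names sort chronologically, so a string comparison is enough
# to find the monthly indexes that fall before the cutoff month.
def expired_indexes(names, base, cutoff)
  cutoff_name = monthly_index_name(base, cutoff)
  names.select { |name| name.start_with?("#{base}_") && name < cutoff_name }
end

existing = %w[bus_events_2013_12 bus_events_2014_01 bus_events_2014_02]
puts sub_index_name('bus_events', %w[2014 01])
puts expired_indexes(existing, 'bus_events', Date.new(2014, 2, 1))
```

With one index per month, retiring old data can mean dropping whole indexes rather than searching for and deleting individual documents.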

Next steps

As we keep looking into new ways to use Elastic Search, we’ll keep expanding the gem to provide goodies to make usage simple. Hope you found this to be a fun read.
