TaskRabbit is Hiring!

We’re a tight-knit team that’s passionate about building a solution that helps people by maximizing their time, talent and skills. We are actively hiring for our Engineering and Design teams. Click To Learn more

Pablo Jairala

Elasticsearch geo_shapes: geohash vs quadtree

@ 09 Jun 2015

elasticsearch ruby


At work we use Elasticsearch to determine where a Job is occurring in the world. The basic idea is we get a freeform address from the Client, we get the best lat/lng coordinates for this freeform address, and from that lat/lng we query our Elasticsearch index of geographies to get the most accurate geography we can find. Geographies are our representations of metropolitan areas, cities, neighborhoods, countries.

Up to last week, we have been storing this areas as geo_shapes in our index as geohash trees. However, after attending the Geospatial Applications with Elasticsearch webinar offered by the Elastic team, we heard that newer versions of Elasticsearch had improved quadtrees’ performance quite a bit, so we figured we’d do some testing to see how much faster they were for our use case.

To do this, I wrote a little script that created the same geographies index with a different name using quadtree trees:

mappings:
  geography:
    properties:
      geometry:
        type: geo_shape
        tree: quadtree

As well as a little Rake task to populate the index from our old geographies index. Note: we use Waistband a little gem we wrote to interact with Elasticsearch.

require 'waistband'

old_index = Waistband::Index.new('geographies')
new_index = Waistband::Index.new('geographies_quadtree')
page = 1

loop do
  puts "Doing page: #{page}"
  hits = old_index.search(page: page).hits

  break if hits.empty?

  hits.each do |hit|
    begin
      new_index.save(hit['_id'], hit['_source'].merge('_type' => hit['_type']))
    rescue Exception => ex
      puts "unable to save #{hit['_id']} -- #{ex.inspect}"
    end
  end

  page += 1
end

The first discovery here was that it was much faster to populate the index. So that’s a very nice addition. However, we were not too concerned with populating speed, since as you can imagine, this index seldom changes. So I wrote a little script to benchmark performance of the two indexes by taking 3 sets of lat/lng points (one in the SF Bay Area, one in NY, and one in London) and pinging the indexes 1000 times with each set of points (so 3000 times total).

require 'benchmark'

def es_query_hash(lat, lng)
  {
      _source: {exclude: %w(geometry)},
      query: {
        bool: {
          must: [
            {
              geo_shape: {
                geometry: {
                  relation: "intersects",
                  shape: {
                    type: "Point",
                    coordinates: [lng.to_f, lat.to_f]
                  }
                }
              }
            }
          ]
        }
      },
      page: 1, page_size: 10
    }
end

benchmark_times = 1000
old_index = Waistband::Index.new('geographies')
quadtree_index = Waistband::Index.new('geographies_quadtree')
points = [
  [37.7667902, -122.4204521],
  [40.715977, -73.962342],
  [51.5547043, -0.1631155]
]

Benchmark.bmbm(7) do |bm|
  bm.report('quadtree_index') do
    benchmark_times.times do
      points.each do |point|
        lat, lng = point
        results = quadtree_index.search(es_query_hash(lat, lng)).results
      end
    end
  end

  bm.report('old_index') do
    benchmark_times.times do
      points.each do |point|
        lat, lng = point
        results = old_index.search(es_query_hash(lat, lng)).results
      end
    end
  end
end
Results
Rehearsal --------------------------------------------------
quadtree_index   6.590000   1.270000   7.860000 ( 31.690741)
old_index        6.260000   1.210000   7.470000 ( 77.444194)
---------------------------------------- total: 15.330000sec
                    user     system      total        real
quadtree_index   6.550000   1.250000   7.800000 ( 32.144781)
old_index        6.430000   1.250000   7.680000 ( 79.067848)

Real time includes user + system + IO + network, which is the one we care about since most of this stuff is in the network part (ES).

As you can see, the results in my local laptop were very promising, showing a 2.5 decrease in real time when using the quadtree index. I then proceeded to staging to check the benchmark, and the results were very similar:

Rehearsal --------------------------------------------------
quadtree_index   5.320000   0.700000   6.020000 ( 22.864896)
old_index        5.630000   0.650000   6.280000 ( 51.214021)
---------------------------------------- total: 12.300000sec

                     user     system      total        real
quadtree_index   5.440000   0.610000   6.050000 ( 23.260431)
old_index        5.560000   0.630000   6.190000 ( 52.932093)

2.3 times faster! So all in all, a very nice and welcome performance improvement by the Elastic team.

Comments

Coments Loading...