
Saba Zuberi

Maps for Data Exploration

@ 04 Dec 2014

python pandas maps elasticsearch


Here at TaskRabbit, the Data Science team often uses python with the pandas library in the ipython notebook environment for data exploration and analysis. It provides a great environment where we can easily slurp in data from MySQL or Elasticsearch, slice and dice to our hearts' content, and quickly generate visualizations.
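As a rough illustration of that first step, pulling search counts by zip code into a dataframe from MySQL might look something like the sketch below. The connection details, table, and column names here are placeholders rather than our actual schema, and a SQLAlchemy engine would work just as well as a raw DB-API connection.

import pandas as pd
import MySQLdb  # any DB-API driver (or a SQLAlchemy engine) works with pandas

# Hypothetical connection and schema, purely for illustration
conn = MySQLdb.connect(host='localhost', user='analyst', db='marketplace')

query = """
    SELECT zip_code AS ZCTA5CE10, COUNT(*) AS number_of_searches
    FROM searches
    WHERE metro = 'sf-bay-area'
    GROUP BY zip_code
"""

# Read the result set straight into a pandas dataframe
d_SF_search = pd.read_sql(query, conn)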

What we were missing was a way to easily explore geographic trends within our metros, for example, visualizing variations in the number of Client searches by zip code. We were looking for a tool that would integrate with pandas and allow us to bind our data to a map.

Vincent

The first thing we tried was Vincent, which is a tool that sits on top of Vega, a visualization grammar, which in turn sits on top of D3.

Vincent requires topo json files in order to render a map. To obtain topo json files of the zip codes in our metros, we grabbed the shapefiles with zip code boundaries provided by census.gov. We can convert the shapefile for California zip codes, for example, to geo json and then to topo json using the following

$ ogr2ogr -f "GeoJSON" CA_zip.json tl_2010_06_zcta510.shp tl_2010_06_zcta510
$ topojson -p -o CA_zip.topo.json CA_zip.json

and then trim these down to zip codes in our metros.
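One rough way to do that trimming (just a sketch; metro_zips and the output filename below are illustrative, not our actual setup) is to filter the geo json feature collection before converting to topo json:

import json

# Hypothetical set of zip codes in one of our metros, for illustration only
metro_zips = {'94109', '94110', '94114'}

with open('CA_zip.json') as f:
    geo = json.load(f)

# Keep only the features whose ZCTA5CE10 property falls in the metro
geo['features'] = [feat for feat in geo['features']
                   if feat['properties']['ZCTA5CE10'] in metro_zips]

with open('SF_metro_zip.json', 'w') as f:
    json.dump(geo, f)

The trimmed geo json can then be run through topojson exactly as above.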

To produce the map, we run the following python code, which produces a vega.json file that is then parsed by vega_template.html, contained in the Vincent repo.

import vincent
vincent.core.initialize_notebook()

# Code goes here to get our data on searches by zip code into a pandas dataframe, d_SF_search

# Set up the geographic data to be used
zip_topo = r'zips_geojson/CA_tl_2010_06_zcta510/CA_zip.topo.json'
geo_data = [{'name': 'zips',
             'url': zip_topo,
             'feature': 'CA_zip'}]

# Bind our data to the map
vis = vincent.Map(data=d_SF_search, geo_data=geo_data, scale=25000, projection='albersUsa',
                  data_bind='number_of_searches', data_key='ZCTA5CE10',
                  map_key={'zips': 'properties.ZCTA5CE10'}, brew='RdPu')
vis.marks[0].properties.enter.stroke_opacity = vincent.ValueRef(value=0.5)

vis.legend(title='Number of Searches')
vis.to_json('vega.json')

In this example of Client searches by zip code, our data is contained in the pandas dataframe d_SF_search and is of the form

     ZCTA5CE10    number_of_searches
0    94109        1234.0
1    94110        5678.0

The name ‘ZCTA5CE10’ is inherited from the census.gov shapefile naming of CA zip codes and matches the properties key in the topo json file.
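If you are not sure what to pass as the feature name or which properties key to match on, a quick sanity check (just a sketch, assuming the file fits comfortably in memory) is to load the topo json and inspect it:

import json

with open('zips_geojson/CA_tl_2010_06_zcta510/CA_zip.topo.json') as f:
    topo = json.load(f)

# The keys under 'objects' are the names Vincent takes as 'feature'
print(list(topo['objects'].keys()))

# Each geometry carries the properties we key on, e.g. 'ZCTA5CE10'
print(topo['objects']['CA_zip']['geometries'][0]['properties'])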

While other Vincent plot types render in the ipython notebook environment without a problem, we did not find it straightforward to achieve the same with maps. Instead, to view the map, we run a simple server with $ python -m SimpleHTTPServer 8000 and point the web browser at vega_template.html to see the visualization.

Folium

The maps generated by Vincent are static images, and we quickly found we wanted a more dynamic view of our metros so we could zoom in on our most active zip codes. Leaflet is a tool we already use at TaskRabbit for interactive map visualizations, so we next tried out Folium, which allows you to bind data from pandas dataframes to both geo and topo json areas and renders them on Leaflet maps.

import folium

# As before, code goes here to get our data into a pandas dataframe, d_SF_search

# We use the geo json file for our metro we generated above
geo_path = r'zips_geojson/CA_tl_2010_06_zcta510/CA_zip.json'

# Set threshold values for the color scale on the map
min_val = d_SF_search.number_of_searches.min()
q1 = d_SF_search.number_of_searches.quantile(0.25)
q2 = d_SF_search.number_of_searches.quantile(0.50)
q3 = d_SF_search.number_of_searches.quantile(0.75)

# Create map object and bind our data to it
map = folium.Map(location=[37.769959, -122.448679], zoom_start=9)
map.geo_json(geo_path=geo_path, data=d_SF_search, data_out='d_SF_search.json',
             columns=['ZCTA5CE10', 'number_of_searches'],
             threshold_scale=[min_val, q1, q2, q3],
             key_on='feature.properties.ZCTA5CE10',
             fill_color='BuPu', fill_opacity=0.9, line_opacity=0.9,
             legend_name='Number of Searches')

map.create_map(path='sfmetro_number_of_searches.html')

The data that is bound to the map and read by Leaflet is stored in d_SF_search.json, in the form

[{"94109": 1234.0, "94110": 5678.0, ...}]

The map can be viewed as before by pointing the browser to sfmetro_number_of_searches.html.
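If you would rather not leave the notebook, one option (a small sketch; the width and height here are arbitrary) is to embed the generated HTML in an IFrame, since the notebook server will serve files sitting next to the notebook:

from IPython.display import IFrame

# Embed the Folium output inline; assumes the HTML file lives in the
# notebook's working directory
IFrame('sfmetro_number_of_searches.html', width=960, height=500)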

Conclusion

Folium provides a way for us to bring together the data-manipulating powers of pandas and python with quick, iterative map visualizations. We found exploring the supply and demand in our marketplace this way so useful that we decided to share it in a recent communication with our Taskers in the SF Bay Area to help them better capture the holiday demand.

[Image: Invitation map]
