
Aaron Binns

More Maps for Data Exploration

@ 08 Jun 2015

maps


Here at TaskRabbit, propinquity matters. Most of our marketplace is focused on work performed at a physical location, such as hiring a Tasker to clean your home, or assemble some Ikea furniture.

As such, understanding our marketplace in terms of physical geography is important. A while back, my colleague Saba discussed some Python tools the Data Science team uses to analyze and visualize information with geo-maps. Since then, we have added a few tools to our geo-mapping toolkit.

The first is CartoDB, a sweet website for layering data on maps. Evan came across them a couple months ago and since then I’ve been using their service to build supply & demand maps for internal TaskRabbit marketplace analysis. It’s a pretty slick service, so I encourage you to check it out.

The second is the aggregation of zipcodes at the city level. TaskRabbit uses geographic data for tracking marketplace activity: task location, Tasker service map, etc. For analysis purposes, we usually aggregate data at the zipcode level. For example, I can see the assignment rate in my neighborhood (94109) for the past week (91.1%). I can also aggregate all the San Francisco zipcodes into a city-level report (89%).
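As a rough illustration of that roll-up, here is a minimal sketch in shell. It assumes a hypothetical per-zipcode stats file with tab-separated columns zip, tasks posted and tasks assigned, plus a city_id[TAB]zip mapping file. The numbers are made up, and treating the assignment rate as assigned divided by posted is an assumption for the example, not necessarily the definition used internally.

```bash
#!/usr/bin/env bash
# Sketch: roll per-zipcode counts up to a city-level assignment rate.
# Assumptions (not from the original post): the stats file layout
# zip<TAB>posted<TAB>assigned, and rate = assigned / posted.

# $1: mapping file (city_id<TAB>zip)
# $2: stats file   (zip<TAB>posted<TAB>assigned)
city_rates() {
    awk -F '\t' '
        NR == FNR { city[$2] = $1; next }     # first file: zip -> city_id
        ($1 in city) {                        # second file: sum counts per city
            posted[city[$1]]   += $2
            assigned[city[$1]] += $3
        }
        END {
            for (c in posted)
                if (posted[c] > 0)
                    printf "%s\t%.1f\n", c, 100 * assigned[c] / posted[c]
        }
    ' "$1" "$2" | sort
}

# Made-up sample data, only to show the shape of the input.
demo_dir="$(mktemp -d)"
printf 'CA_San_Francisco\t94109\nCA_San_Francisco\t94110\nOH_Cleveland\t44106\n' > "$demo_dir/map.tsv"
printf '94109\t10\t9\n94110\t30\t24\n44106\t4\t3\n' > "$demo_dir/stats.tsv"
city_rates "$demo_dir/map.tsv" "$demo_dir/stats.tsv"
rm -rf "$demo_dir"
```

Summing the counts before dividing weights each zipcode by its volume, which is usually what you want; averaging the per-zipcode rates would let a quiet zipcode count as much as a busy one.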

But, simply generating a report with city names and numbers isn’t nearly as cool as having, say, a city-level heat map of marketplace data. The rest of this blog post will detail how I built city-level outlines for use with CartoDB.

DISCLAIMER: I’m not a GIS expert. This is just something I hacked together one afternoon. I couldn’t find city-level shapes (GeoJSON or otherwise) readily available, so I figured I could hack it together myself.

Zipcodes, Cities and Laptop Limitations

Even though the USPS does not publish zipcode shapes/boundaries (because zipcodes are not shapes), the US Census produces something that is close enough for many common purposes. In the step-by-step instructions below, I’ll show you where you can download the US Census data with the zipcode shapes.

The zipcode->city mapping is a little more complicated. I’ll provide details below, but in the end I cobbled together a list from a couple of different sources. If you want to follow along at home, you’ll have to provide your own zipcode->city mapping file with the same format as the one I used (or change the instructions accordingly).

In theory, aggregating zipcode GeoJSON shapes at the city-level is rather simple:

  1. Load US-wide zipcode GeoJSON file into memory
  2. Load zipcode->city mapping file into memory
  3. Group zipcode shape objects by city, creating new GeoJSON structure
  4. Write out new city-level GeoJSON structure to disk

In practice, life is more complicated. The US-wide GeoJSON file extracted from the US Census data is too big to fit into memory on my developer laptop (8GB). I suppose I could have spun up an EC2 instance with 100+GB of memory, but this was an afternoon hack project and I wanted to do it locally and find a solution that anyone could use. Also, Unix command-line tools are fun!

In the detailed sections below, there are some super long Bash command-lines. For readability, I’ve added some line-wraps, which might break the script if you were to cut/paste into your terminal window. You can grab the entire thing as a single script from this gist: build.sh

Required tools/packages

I have a Mac laptop. If you do too, then I’ll assume you already have Homebrew installed. If you’re using something else, then you might have to hunt down the following packages/tools on your own.

Geospatial Data Abstraction Library

The ogr2ogr command-line tool is used to convert between various geo-formats: GeoJSON, shapefile, etc.

$ brew install gdal

Mapshaper

This NodeJS tool is what actually combines the zipcode-level GeoJSON shapes into a single city-level GeoJSON shape. It also “simplifies” the resulting city shapes so that they lose a bit of detail/precision, but are much, much smaller in memory and on disk.

$ npm install -g mapshaper

Obtain Zipcode->City Mapping

As mentioned above, this one requires a bit of work by the reader. Here are some links to get you started

In the end, make your list of the form:

city_id[TAB]zip

where the city_id field is of the form:

{2-letter State}_{City name}

with any spaces replaced with _ and punctuation removed.

For example:

OH_Cleveland    44106
DC_Washington_Navy_Yard 20374
CA_San_Francisco        94109

and name the file us_city_zips.tsv
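If your raw sources give you separate state, city and zipcode values, a small helper can produce lines in this format. The sketch below is only one way to do it: make_city_id and mapping_line are hypothetical names, not part of the build.sh gist, and it assumes "punctuation removed" means every character in the [:punct:] class (so a hyphenated name loses its hyphen).

```bash
#!/usr/bin/env bash
# Sketch: build a city_id of the form {State}_{City_name}.
# make_city_id and mapping_line are hypothetical helpers for illustration.

make_city_id() {
    local state="$1" city="$2"
    # Remove punctuation, then turn each run of whitespace into one underscore.
    city="$(printf '%s' "$city" | tr -d '[:punct:]' | tr -s '[:space:]' '_')"
    printf '%s_%s\n' "$state" "$city"
}

# Emit one mapping line: city_id<TAB>zip
mapping_line() {
    printf '%s\t%s\n' "$(make_city_id "$1" "$2")" "$3"
}

mapping_line CA "San Francisco" 94109
mapping_line DC "Washington Navy Yard" 20374
```

Redirecting a series of such calls into us_city_zips.tsv gives a file in the expected layout.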

Download US Census Zipcode Shapes

This one is easy: you can grab the latest and greatest from the US Census website:

https://catalog.data.gov/dataset/tiger-line-shapefile-2014-2010-nation-u-s-2010-census-5-digit-zip-code-tabulation-area-zcta5-na

The file is named tl_2014_us_zcta510.zip

Unzip that file so that we can generate a GeoJSON file based on the shapefile contained therein.

$ mkdir tl_2014_us_zcta510
$ unzip tl_2014_us_zcta510.zip -d tl_2014_us_zcta510

Convert to GeoJSON

Based on this advice:

$ ogr2ogr -f GeoJSON -t_srs crs:84 \
          tl_2014_us_zcta510.geojson \
          tl_2014_us_zcta510/tl_2014_us_zcta510.shp

It takes about 6-7 minutes on my laptop and the resulting GeoJSON file is ~1.3GB in size.

Fortunately for us, ogr2ogr writes each zipcode-level GeoJSON feature on a separate line in the output file. This makes it easy to use sed and awk and some Bash wizardry to manipulate the GeoJSON data.
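To make the one-feature-per-line point concrete, here is a small sketch that pulls the zipcode out of each feature line with sed alone. The function name zcta_codes is made up for this example, and the sample input is hand-written to imitate the layout (with the geometry left null); it is not real Census data.

```bash
#!/usr/bin/env bash
# Sketch: print the ZCTA5CE10 value of every feature line on stdin.
# Lines without that property (such as the FeatureCollection header)
# produce no output.
zcta_codes() {
    sed -n 's/.*"ZCTA5CE10": *"\([0-9][0-9]*\)".*/\1/p'
}

# Hand-written sample imitating the one-feature-per-line layout.
zcta_codes <<'EOF'
{
"type": "FeatureCollection",
"features": [
{ "type": "Feature", "properties": { "ZCTA5CE10": "94109" }, "geometry": null },
{ "type": "Feature", "properties": { "ZCTA5CE10": "44106" }, "geometry": null }
]
}
EOF
```

Against the real file this would be run as zcta_codes < tl_2014_us_zcta510.geojson, which streams the file line by line instead of loading it all into memory.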

Add City IDs and Group

Now, we’ll create a “partial” GeoJSON file for each city, with that city’s zipcode GeoJSON features in the file. These files are not fully-formed GeoJSON files as they are just the features (which is why I call them “partials”). But don’t worry, we’ll make them into fully-formed GeoJSON in the next step.

Again, note that I added some line-wrapping to make the command readable. If you want to cut/paste it into your own terminal, I suggest you cut/paste from the gist.

$ mkdir per_city_partials
$ join -1 2 -2 1 \
       <(sort -k 2,2 us_city_zips.tsv) \
       <(cat tl_2014_us_zcta510.geojson \
          | gawk '{ match( $0 , /ZCTA5CE10\": \"([0-9]+)\"/, arr );
                    if ( arr[1] != "" ) print arr[1], $0 }' \
          | sort \
        ) \
    | sed 's/,$//g' \
    | gawk '{ city="per_city_partials/" $2; $1=$2=""; print $0 >> city; close(city) }'

This took about 5-6 minutes on my laptop.

In the per_city_partials/ subdir, we’ll have files looking like:

$ ls per_city_partials/ | head
AK_Adak
AK_Akiachak
AK_Akiak
AK_Akutan
AK_Alakanuk
AK_Aleknagik
AK_Allakaket
AK_Ambler
AK_Anaktuvuk_Pass
AK_Anchor_Point

$ wc -l per_city_partials/CA_San_Francisco
28
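Since the join silently drops any zipcode that has no matching shape, it can be worth checking which entries of the mapping file fell through. The following is a minimal sketch, not part of the original script: unmatched_zips is a hypothetical helper, and it assumes you have a plain file with one ZCTA code per line (for example, extracted from the GeoJSON with sed or awk).

```bash
#!/usr/bin/env bash
# Sketch: list mapped zipcodes that have no ZCTA shape.
# $1: file with one ZCTA code per line
# $2: mapping file (city_id<TAB>zip)
unmatched_zips() {
    awk -F '\t' '
        NR == FNR { have[$1] = 1; next }   # first file: known ZCTA codes
        !($2 in have) { print $2 }         # second file: zips with no shape
    ' "$1" "$2"
}

# Made-up sample: 00000 is in the mapping but has no shape.
demo_dir="$(mktemp -d)"
printf '44106\n94109\n' > "$demo_dir/zctas.txt"
printf 'OH_Cleveland\t44106\nCA_San_Francisco\t94109\nXX_Nowhere\t00000\n' > "$demo_dir/map.tsv"
unmatched_zips "$demo_dir/zctas.txt" "$demo_dir/map.tsv"
rm -rf "$demo_dir"
```

Zipcodes can legitimately show up here, since the Census tabulation areas only approximate USPS zipcodes, but a long list usually points at a formatting problem in the mapping file.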

Merge Shapes and Add Properties

Here comes the fun part – merging the zipcode shapes into city-level shapes!

This step actually takes quite a while, since we’re calling mapshaper on each of the per-city “partials” to do both the shape merging and shape simplification. You might try playing with the simplification level. I chose 10% somewhat arbitrarily and it seems to work pretty well.

For each file in per_city_partials/ there will be a corresponding fully-formed GeoJSON file in the per_city_geojson/ subdir:

$ mkdir per_city_geojson
$ for i in per_city_partials/*
  do
     id="${i#*/}"
     state="${id%%_*}"
     city="${id#*_}"
     city="${city//_/ }"
     cat <( echo '{ "type":"FeatureCollection", "features":[' ;
            gawk '{ if ( NR > 1 ) print "," ; print $0 }' ${i} ;
            echo "]}" ) \
       | mapshaper - \
                   -dissolve2 \
                   -simplify 10% \
                   -each "\$.properties = { id: \"${id}\",
                                            state: \"${state}\",
                                            city: \"${city}\" }" \
                   -o per_city_geojson/${id}.json
  done

Again, note the line-wrapping.
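The two pieces of shell plumbing in that loop can be tried on their own, without mapshaper. The sketch below isolates them under made-up names (wrap_partial, city_state and city_name are not in the original script): one wraps a partial into a FeatureCollection, the others derive the state and display name from a city_id. It uses plain awk and tr rather than gawk and Bash-only substitution, purely to keep the example portable.

```bash
#!/usr/bin/env bash
# Sketch: the non-mapshaper parts of the loop, as standalone helpers.

# Wrap a "partial" (one GeoJSON feature per line) into a
# FeatureCollection, putting a comma line between features.
wrap_partial() {
    echo '{ "type":"FeatureCollection", "features":['
    awk '{ if ( NR > 1 ) print "," ; print $0 }' "$1"
    echo ']}'
}

# CA_San_Francisco -> CA  (everything before the first underscore)
city_state() { printf '%s\n' "${1%%_*}"; }

# CA_San_Francisco -> San Francisco  (rest, underscores back to spaces)
city_name() { printf '%s\n' "${1#*_}" | tr '_' ' '; }

# Hand-written sample partial with null geometries.
demo="$(mktemp)"
printf '%s\n' \
    '{ "type": "Feature", "properties": { "ZCTA5CE10": "94109" }, "geometry": null }' \
    '{ "type": "Feature", "properties": { "ZCTA5CE10": "94110" }, "geometry": null }' \
    > "$demo"
wrap_partial "$demo"
city_state CA_San_Francisco
city_name CA_San_Francisco
rm -f "$demo"
```

Piping the output of wrap_partial through a JSON validator is a cheap way to catch a malformed partial before spending time on the mapshaper run.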

Once this is finished, you’ll have a nice GeoJSON file for each city. If you like, you can just grab the cities you want from the per_city_geojson/ subdir and go on your merry way. But, for a full US-wide city-level map, there are a few steps to go.

Combine City-Level GeoJSON into US-Level

This is pretty straightforward: just catenate the files with a GeoJSON header/footer and commas in the right places.

echo '{"type":"FeatureCollection","features":[' > us_cities.geojson
for i in per_city_geojson/*
do
    cat ${i}
    echo
done | sed 's/^[{]"type"[:]"FeatureCollection"[,]"features"[:][[]//' \
     | sed 's/\]\}$//' \
     | awk '{ if ( NR > 1 ) printf( "%s", ", ") ; print $0 }' >> us_cities.geojson
echo ']}' >> us_cities.geojson

And now you have a single GeoJSON file with all the cities in the US:

us_cities.geojson
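A quick sanity check at this point is to count the features in the combined file and compare that with the number of per-city files. This is a rough sketch with a made-up helper name, count_features; it assumes each per-city file contributes exactly one dissolved feature and relies on grep -o, which GNU and BSD grep both provide.

```bash
#!/usr/bin/env bash
# Sketch: count GeoJSON features by matching "type":"Feature".
# The closing quote keeps "FeatureCollection" from being counted.
count_features() {
    grep -o '"type": *"Feature"' "$1" | wc -l | tr -d ' '
}

# Hand-written two-feature sample with null geometries.
demo="$(mktemp)"
printf '%s\n' '{"type":"FeatureCollection","features":[{"type":"Feature","properties":{"id":"OH_Cleveland"},"geometry":null}, {"type":"Feature","properties":{"id":"CA_San_Francisco"},"geometry":null}]}' > "$demo"
count_features "$demo"
rm -f "$demo"
```

On the real data, the result of count_features us_cities.geojson can be compared with ls per_city_geojson | wc -l; a mismatch suggests that some city did not dissolve into a single shape or that a file was skipped.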

Convert to Shapefile

The us_cities.geojson is perfectly usable, but the last step is to convert it to ESRI Shapefile format, which is a bit more compact.

$ mkdir shapefile
$ cd shapefile
$ ogr2ogr -f "ESRI Shapefile" us_cities.shp ../us_cities.geojson OGRGeoJSON
$ zip -qq ../us_cities.zip us_cities.dbf us_cities.prj us_cities.shp us_cities.shx
$ cd ..
$ rm -rf shapefile

producing:

us_cities.zip

which you can load into your GIS tool of choice.

Such as…

Use in CartoDB

I uploaded the us_cities.zip file into my account on CartoDB and started visualizing some of our data at a city-level.

CartoDB has this neat feature where you can mark a dataset as public, so anyone can slurp it into their account and use it.

Here you go: US cities dataset

Enjoy!
