Forklift is a ruby gem that can help you collect, augment, and save copies of your mySQL databases. This is often called an “ETL” tool as the steps in this process mirror the actions of “Extracting the data,” “Transforming the data,” and finally “Loading the data” into its final place.
With Forklift, you create a Plan which describes how to manipulate your data. The process for this involves (at least) three databases:
The “Live Set” is first loaded into the “Working Set” to create a copy of your production data we can manipulate without fear of breaking replication. Then, any transformations/manipulations are run on the data in the working set. This might include normalizing or cleaning up data which was great for production but hard for analysts to use. Finally, when all of your transformations are complete, that data is loaded into the final database.
Forklift is appropriate to use by itself or integrated within a larger project. Forklift aims to be as fast as can be by using native mySQL copy commands and eschewing all ORMs and other RAM hogs.
Can extract data from both local and remote databases
Can perform integrity checks on your source data to determine if this run of Forklift should be executed
Can run each Extract either each run or at a frequency
Can run each Transform either each run or at a frequency
Data kept in the woking database after each run to be used on subsequent transformations
Only ETL’d tables will be copied into the final database, leaving other tables untouched
Emails sent on errors
What does TaskRabbit use this for?
At TaskRabbit, the website you see at www.taskrabbit.com is actually made up of many smaller rails applications. When analyzing our site, we need to collect all of this data into one place so we can easily join across it.
We replicate all of our databases into one server in our office, and then use Forklift to extract the data we want into a common place. This gives us the option to both look at live data and to have a more accessible transformed set which we create on a rolling basis. Our “Forklift Loop” also git-pulls to check for any new transformations before each run.
Example Annotated Plan
In Forklift, you build a plan. You can add any action to the plan in any order before you run it. You can have 0 or many actions of each type.
Forklift allows you to create both Ruby transformations and SQL transformations
SQL Transformations are kept in a file ending in .rb
Ruby Transformations should define a class which matches the name of the file (IE: class MyTransformation would be in a file called my_transformation.rb
logger.log(message) is the best way to log but logger.debug is also available
database is a string containing the name of the working database
connection is an instance of Forklift::Connection and connection.connection is a raw mysql2 connection
Classes need to define a transform(connection, database, logger) IE:
SQL Transformations are kept in a file ending in .sql
You can have many SQL statements per file
SQL will be executed linearly as it is written in the file
You can launch forklift in “debug mode” with --debug (we check ARGV["--debug"] and ARGV["-debug"]). In debug mode the following will happen:
- verbose = true
- no SQL will be run (extract, load)
- no transforms will be run
- no email will be sent
- no mySQL dumps will be created