Quick Left

Handling the Headaches of Big Data with Rails


We recently worked on a mapping application that lets users monitor their driving habits using real-time data from their cars. The data came in the form of latitude/longitude points, from which we would build a trip and calculate statistics for that trip (speeding, distance, etc.).

Because the car sends lat/long data every second, a Trip object can contain thousands of points. Each of these points in turn stores the standard Rails timestamps, the latitude and longitude themselves, and some relational keys.

We wanted to present the user with a map of his trips for a given day. A faint line would trace each trip's path, with a darker line indicating the trip the user currently has selected.

For this, we decided to use Google's excellent Maps API, which you can feed location points and get back a polyline mapping the trip. The plan was to use HTML5 data attributes to store the point objects on each list item representing a trip. Then, we'd simply grab the data with JavaScript, run it through the API, and render out the trip trace.

But it didn't turn out to be that easy. Since a map displays several trips, and each trip contains thousands of points, and each point contains a half dozen pieces of data, we realized that bootstrapping in the data on page load would cause unbearably slow load times. Between the giant database query, writing the data to the page itself with ERB, and then downloading the markup once it got to the client, we saw page load times as slow as a full minute.

We sped this process up by attacking three core headaches of Big Data: volume, load speed, and caching.

Volume: one of the simplest ways to make working with big data easier is to shrink the amount of data you have to work with. Rendering an ActiveRecord object as JSON includes every one of the object's database fields by default. This almost always means id, created_at, and updated_at come along for the ride. All Google's Maps API needs from our points, however, is a latitude and longitude. By using a custom presenter to include only these essential pieces of data, instead of relying on ActiveRecord's default as_json method, we were able to reduce the volume of the JSON payload by over half.
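To make the size difference concrete, here is a minimal sketch in plain Ruby. It is not our production presenter: the Point struct stands in for the ActiveRecord model, and the field names, the PointPresenter class, and the sample coordinates are all assumptions made up for illustration.

```ruby
require "json"
require "time"

# Stand-in for the ActiveRecord Point model; the field names are assumptions.
Point = Struct.new(:id, :trip_id, :latitude, :longitude, :created_at, :updated_at) do
  # Approximates a default JSON rendering: every column is included.
  def full_hash
    to_h.transform_values { |v| v.is_a?(Time) ? v.iso8601 : v }
  end
end

# Hypothetical presenter that exposes only what the map needs.
class PointPresenter
  def initialize(point)
    @point = point
  end

  def as_json(*)
    { lat: @point.latitude, lng: @point.longitude }
  end
end

# Made-up sample data: one point per second for 1,000 seconds.
start = Time.utc(2013, 1, 1, 12, 0, 0)
points = Array.new(1_000) do |i|
  Point.new(i + 1, 42, 40.0 + i * 0.0001, -105.0 - i * 0.0001, start + i, start + i)
end

full_json = JSON.generate(points.map(&:full_hash))
slim_json = JSON.generate(points.map { |p| PointPresenter.new(p).as_json })

puts "full: #{full_json.bytesize} bytes, slim: #{slim_json.bytesize} bytes"
```

The exact savings depend on how many columns your model carries, but dropping the timestamps and keys is usually the cheapest win available.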

Load speed: most users consider a page "loaded" when the styles have rendered and they can begin reading the page. This is why it's best practice to put JavaScript at the bottom of your page: the scripts still download and run, but they don't hold up the rendering of the HTML and CSS above them. We can approach loading big data sets in the same way. The slow method is to fetch the data in your controller and write it into the view through an instance variable. While this is happening, the user is left looking at a blank screen. The other way, letting the page load without the data and then fetching it asynchronously with JavaScript once the page has rendered, will appear faster to the user. Both approaches might take about the same time from start to finish, but with the second the user perceives the page as loading much faster because it's visible and usable after just a few seconds.
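The server side of that split might look something like the sketch below. It uses a bare Rack-style lambda rather than a Rails controller so it runs without any gems; the routes, the data-points-url attribute, and the TRIP_POINTS hash are hypothetical names, not the app's real ones. The client-side JavaScript that reads the URL and requests the points is left out.

```ruby
require "json"

# Hypothetical in-memory data standing in for the database query.
TRIP_POINTS = {
  1 => [{ lat: 40.0150, lng: -105.2705 }, { lat: 40.0160, lng: -105.2690 }]
}.freeze

# A Rack-compatible app is any object that responds to call(env) and
# returns [status, headers, body], so a lambda is enough for a sketch.
APP = lambda do |env|
  case env["PATH_INFO"]
  when "/trips"
    # The page shell renders right away and carries only a URL per trip.
    items = TRIP_POINTS.keys.map do |id|
      %(<li class="trip" data-points-url="/trips/#{id}/points.json">Trip #{id}</li>)
    end
    [200, { "content-type" => "text/html" }, ["<ul>#{items.join}</ul>"]]
  when %r{\A/trips/(\d+)/points\.json\z}
    # Requested asynchronously by JavaScript after the page has rendered.
    points = TRIP_POINTS[Regexp.last_match(1).to_i]
    if points
      [200, { "content-type" => "application/json" }, [JSON.generate(points)]]
    else
      [404, { "content-type" => "application/json" }, ["[]"]]
    end
  else
    [404, { "content-type" => "text/plain" }, ["Not found"]]
  end
end

status, _headers, body = APP.call("PATH_INFO" => "/trips")
puts status
puts body.first
```

The point of the design is that the first response contains no coordinates at all, so its size stays flat no matter how many points a trip holds.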

Caching: for most slow pages, the culprit is usually a big, slow database query. For us, the bottleneck was actually the next step: after we grabbed the points from the database, we had to do calculations on them to arrive at the data for the whole trip. For example, a trip's distance is calculated by adding up the distances between consecutive points. Speeding is calculated by determining the velocity between points, and so on. It would be simple to cache the results of those calculations the first time they're made, but our app had an extra feature that allowed users to set what speed they considered "speeding," and what rate they considered a "hard brake." When a user changes either of those options, the number of speeding or hard brake events in a trip changes, so any cached result has to be thrown away. Our solution was to cache a trip's entire "show" endpoint by using caches_action at the top of the controller. We provided our own cache key based on the trip's ID, which allowed us to easily fetch and expire the cache of individual trips with an observer when the user changed his settings.
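Here is a rough sketch of the idea in plain Ruby, without Rails. It swaps in the haversine formula for the distance between points and a simple Hash for the cache store; both are assumptions for illustration, as are the names trip_stats and TripCache. Counting every fast one-second leg as a speeding event is also a simplification.

```ruby
EARTH_RADIUS_M = 6_371_000.0

# Great-circle distance in meters between two [lat, lng] pairs (haversine).
def haversine_m(a, b)
  lat1, lng1, lat2, lng2 = [*a, *b].map { |deg| deg * Math::PI / 180 }
  h = Math.sin((lat2 - lat1) / 2)**2 +
      Math.cos(lat1) * Math.cos(lat2) * Math.sin((lng2 - lng1) / 2)**2
  2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(h))
end

# Points arrive once per second, so meters between neighbors equals m/s.
def trip_stats(points, speeding_mps:)
  legs = points.each_cons(2).map { |a, b| haversine_m(a, b) }
  { distance_m: legs.sum, speeding_events: legs.count { |m| m > speeding_mps } }
end

# Hypothetical cache keyed on the trip's ID, so one trip can be expired alone.
class TripCache
  def initialize
    @store = {}
  end

  # Returns the cached value, or runs the block once and stores its result.
  def fetch(trip_id)
    @store.fetch("trips/#{trip_id}") { |key| @store[key] = yield }
  end

  def expire(trip_id)
    @store.delete("trips/#{trip_id}")
  end
end

# Made-up coordinates for a four-point trip.
points = [[40.0, -105.0], [40.0002, -105.0], [40.0005, -105.0], [40.0006, -105.0]]
cache = TripCache.new
calls = 0
compute = ->(limit) { calls += 1; trip_stats(points, speeding_mps: limit) }

first  = cache.fetch(7) { compute.call(30.0) }
second = cache.fetch(7) { compute.call(30.0) } # served from the cache
cache.expire(7)                                # the user changed the threshold
third  = cache.fetch(7) { compute.call(20.0) } # recomputed with the new limit

puts first, second, third
```

Keying on the trip ID alone keeps expiry simple: a settings change only has to delete the entries for that user's trips, and the next request rebuilds them.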

After reducing the volume of data, lazy-loading the data once the page rendered, and caching the controller action, the page refreshed in only a couple of seconds.
