Our Senior Software Developer, Tatsuya Ono, shares his experience on implementing a data-driven development process.

At Geckoboard we make more than 10 million external API calls every day to show real-time data on users' dashboards. We support more than 60 service providers, each with a different API: they could be using HTTPS, LDAP, REST or SOAP, along with OAuth or Basic Auth, and each service has its own response formats and data structures. We carefully design and develop our widgets to ensure that accurate and meaningful data is shown across all dashboards.

Tackling API limits

One of the challenges we face is that some APIs enforce rate limits (Twitter, for example). Many service providers limit API usage to stop abusive clients from making huge numbers of requests. Fortunately for us, most providers understand that we're not malicious, and that Geckoboard's access to their API is valuable to their users, so we're often granted a higher limit. Of course, we still have a responsibility not to flood their APIs, and to use them as efficiently as possible.
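Staying under a limit means throttling ourselves on the client side. Here's a minimal sketch of the idea using a token bucket; the class and the numbers are illustrative, not our production code:

```python
import threading
import time


class TokenBucket:
    """Simple token-bucket throttle: allows short bursts while keeping
    the long-run request rate under `rate` requests per second."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        """Block until a token is available, then consume it."""
        while True:
            with self.lock:
                now = time.monotonic()
                # Refill tokens based on elapsed time, up to capacity.
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait = (1 - self.tokens) / self.rate
            time.sleep(wait)


# e.g. cap calls to one provider at 5 requests/second, bursts of up to 10
acme_throttle = TokenBucket(rate=5, capacity=10)
```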

Understanding the problem

One day, our support team received emails from a number of customers saying that some widgets were not updating properly at a specific time of day. On further investigation we found that these widgets were all making API calls to one specific service provider (let's call them Acme Corp for the purposes of this post). Our suspicion was that the errors were caused by an API rate limit being reached.

We monitor our API calls with Librato via StatsD. The data from Librato (see image below) showed us that our suspicion was correct.
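The instrumentation itself is only a couple of lines wherever a call is made. A sketch using the Python statsd client (the metric names and status-code checks are our illustration here, not the actual code):

```python
import statsd

stats = statsd.StatsClient('localhost', 8125, prefix='geckoboard.api')

def record_api_call(provider, response):
    # Count every call per provider, and errors separately, so the
    # Librato graphs can show error spikes by time of day.
    stats.incr('%s.calls' % provider)
    if response.status_code == 429:
        stats.incr('%s.rate_limited' % provider)  # provider's rate limit hit
    elif response.status_code >= 400:
        stats.incr('%s.errors' % provider)
```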

[Image: Librato graph of API errors spiking at the same time each day]

Acme Corp, it turned out, is based in the San Francisco Bay Area. Like many of our partners, they reset their API rate limit at the same time each day. We were hitting the limit late in their day, between 10pm and 11pm PST, which (PST being 8 hours behind UTC) meant errors for our customers in Europe between 6am and 7am UTC.

Discussing a solution

The next step was for the team to discuss the problem and find a solution. Here are the actions we came up with:

  1. Reduce the number of API calls by increasing the time between widget reloads

  2. Talk to Acme Corp and ask them to increase our rate limit

  3. Improve our widget code to make fewer API calls

The first action was only a stopgap until we could work on the other two, since increasing the refresh time would be detrimental to the customer experience.

Our next step was to negotiate with Acme Corp over the rate limit. They kindly offered to increase it by 50 percent, but they also asked us to cut out unnecessary API calls (which we intended to do as the third action anyway).

For these widgets, the user selects the time period of the data to be displayed. If a user wants today's data we have to update the widget frequently; if a user wants yesterday's data, we only need to update it once a day. Unfortunately we were making the same number of API calls either way: regardless of the user's configuration, our system updated every widget on the same cycle. Our solution was to cache data for widgets whose configuration didn't require the data to be fetched continuously, as sketched below.
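The refresh interval, in other words, should follow from the widget's configuration. A minimal sketch of that mapping (the range names and intervals are illustrative):

```python
from datetime import timedelta

# How long a cached API response stays fresh, by the date range the
# widget displays. Today's data changes constantly; yesterday's is
# final once the day has rolled over.
CACHE_TTL = {
    'today':        timedelta(minutes=5),
    'yesterday':    timedelta(hours=24),
    'last_7_days':  timedelta(hours=1),
    'last_30_days': timedelta(hours=6),
}

def cache_ttl(widget_config):
    # Fall back to the shortest interval if we don't recognise the range.
    return CACHE_TTL.get(widget_config['date_range'], timedelta(minutes=5))
```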

Getting the metrics

Prior to joining Geckoboard I would have started working on the code straight away. But Geckoboard has a slightly different culture: we're data-driven, and we want the data to convince us that we're taking the right approach. In this case, we wanted to know how many API calls could be cached.

I updated our library to count API calls grouped by the date range of the data (i.e. the time period each widget covered).
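Concretely, that meant bucketing the existing counters by the widget's configured range, along these lines (again a sketch with the Python statsd client; the names are made up):

```python
import statsd

stats = statsd.StatsClient('localhost', 8125, prefix='geckoboard.api')

def record_api_call_by_range(provider, widget_config):
    # Bucket each call by the period the widget displays, so we can
    # see what share of calls fetch data that is no longer changing.
    date_range = widget_config.get('date_range', 'unknown')
    stats.incr('%s.calls.%s' % (provider, date_range))
```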

[Image: Librato graph of API calls grouped by date range]

From this graph we learnt that roughly 50 to 55 percent of API calls could be avoided simply by caching results for longer.

Focusing on the solution

We found that we needed to change several different codebases. First, we needed to extend a library so that widgets could cache API responses easily. Then we had to update the code for each widget, which took some time and involved a few different developers.
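The library change boiled down to a cache-or-fetch wrapper around each widget's API call. A simplified sketch with hypothetical helper names (a real deployment would use a shared store such as memcached rather than an in-process dict):

```python
import time

_cache = {}  # illustration only; in production this would be a shared store

def fetch_with_cache(cache_key, ttl_seconds, fetch):
    """Return a cached API response while it is still fresh,
    otherwise call fetch() and cache the result."""
    entry = _cache.get(cache_key)
    if entry is not None:
        value, stored_at = entry
        if time.time() - stored_at < ttl_seconds:
            return value
    value = fetch()
    _cache[cache_key] = (value, time.time())
    return value
```

Each widget then passes a TTL derived from its configuration, as in the earlier cache_ttl sketch, so only widgets showing live data keep hitting the API.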

Although the Librato graphs give us great information, they don't make it easy to judge whether we've actually achieved our goal. We wanted to share a goal within the team that was simple, clear and intuitive, and that's where Geckoboard itself helped: I wrote a little script that exported the data from Librato, aggregated it, and pushed it to a dashboard.
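A condensed version of that script, assuming Librato's metrics REST API and Geckoboard's push API; the endpoints, metric names, keys and payload shape below are assumptions for illustration:

```python
import requests

LIBRATO_AUTH = ('user@example.com', 'librato-api-token')  # assumed credentials
GECKO_PUSH_URL = 'https://push.geckoboard.com/v1/send/your-widget-key'

def daily_total(metric):
    # Fetch a day of measurements for one metric from Librato and sum them.
    # The response shape follows Librato's docs at the time; treat it as
    # an assumption.
    resp = requests.get(
        'https://metrics-api.librato.com/v1/metrics/%s' % metric,
        params={'count': 96, 'resolution': 900},
        auth=LIBRATO_AUTH)
    resp.raise_for_status()
    points = resp.json()['measurements']['unassigned']
    return sum(p['value'] for p in points)

total = daily_total('geckoboard.api.acme.calls')
cachable = sum(daily_total('geckoboard.api.acme.calls.%s' % r)
               for r in ('yesterday', 'last_7_days', 'last_30_days'))

# Push two simple numbers to a Geckoboard widget: the calls we still
# have to make, and the total we used to make.
requests.post(GECKO_PUSH_URL, json={
    'api_key': 'your-geckoboard-api-key',
    'data': {'item': [
        {'value': total - cachable, 'text': 'API calls we must make'},
        {'value': total, 'text': 'calls before caching'},
    ]},
})
```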

 

[Image: the script that pushes aggregated Librato data to a Geckoboard dashboard]

 

The dashboard shows a few simple numbers that immediately tell us whether we've achieved our goal. Normally we tend to lose sight of the purpose of what we're working towards and settle for delivering some functionality or a hotfix; this dashboard helps prevent us from falling into that trap.

Achieving our goal

After I set up the dashboard, our integration team started working on the fix; however, things were more complicated than we'd thought. The data processed by Acme Corp, it turned out, suffers from unpredictable time lags, and their API doesn't reveal the status of their data processing, which confused matters.

The caching strategy we'd planned didn't work out as we expected. Nevertheless, we continued to tweak our approach while opening a dialogue with Acme Corp about how to improve things on both sides, a conversation helped by sharing the dashboard showing the efficiencies we'd gained.

[Image: dashboard of the API call efficiencies shared with Acme Corp]

Implementing a data-driven culture

In software development, the need for data, analytics and insight is like a driver's need for an accurate map. No matter how fast you drive, you'll never reach your destination if you're heading in the wrong direction. But we as developers sometimes forget that and go down a blind alley, wasting time and money.

When we started this process, I was uncomfortable spending time not coding. It took discipline to set up the necessary tools, and I initially found myself choosing the wrong metrics. I felt uneasy at the thought that I might be drawing an inaccurate map for myself (which could be worse than having no map at all). But a couple of months later I noticed that things had improved: we now have a good set of tools, services and libraries to communicate our data with, and we've gradually learnt how to choose the right metrics and share them with the team. Now I can get the metrics I want quickly, and I rarely pick the wrong ones.

So don't worry if you're just getting started with updating your processes; take it from us, that's perfectly normal. Try applying the data-driven mindset for three or four sprints: analyse the problem you're tackling and share a clear goal. You and your team will soon enjoy the benefits of a data-driven culture.

 

Got any questions around data-driven development? Don't hesitate to contact us at support@geckoboard.com or leave a comment below.