May 5, 2010

Managing big data sets with MySQL

I’m currently working on a database which stores, literally, a million new data records every day. They come from several hundred weather stations spread around the world, and need to be processed in real time. That’s a huge amount to be managed with a simple – but powerful – MySQL server, so I’ve had to design and implement a few tricks in order to succeed.

In this post I’d like to talk about the three main techniques we’ve used to build our high-performance database, and the reasons for choosing these solutions.

First of all, MySQL InnoDB tables start degrading at a few million records (7-10M on my current server, though I haven’t tested further), so the first thought I had was: we need to divide if we want to conquer. After some thinking, I decided to split the big table into small chunks, using one table per station. That made the whole problem more affordable, and the tables small enough not to become scary: the new problem was how to store about 5M records per year per table in our database.
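As a rough sketch of what I mean (table and column names here are invented for illustration, not our real schema), each station gets its own InnoDB table keyed by timestamp and sensor:

    -- hypothetical per-station table; the real columns differ
    CREATE TABLE station_0042 (
        measured_at DATETIME NOT NULL,
        sensor_id   SMALLINT UNSIGNED NOT NULL,
        value       FLOAT NOT NULL,
        PRIMARY KEY (measured_at, sensor_id)
    ) ENGINE=InnoDB;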

Of course, some logic in both the application controller and the data input server was needed to place data records correctly. Additionally, tables often get fragmented, so some cron jobs were needed to keep the database running smoothly. But there was still the problem of data growth. Since records were separated by station, the scalability problem was solved: more tables can be added, and even more servers, each holding a range of stations, with a hash function to work out which server a given station is hosted on. Once again we were talking about several million records per year per table, on a standard MySQL 5.1 server (the main problem with a start-up company tends to be funding, so we didn’t have anything better).
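For the curious, the maintenance side of this is nothing fancy – something along these lines (the table name and the modulo scheme are just examples, not necessarily what we run):

    -- nightly cron job: defragment a station table
    OPTIMIZE TABLE station_0042;

    -- routing is done in the application layer; with N servers,
    -- the simplest hash is something like: server_index = station_id MOD N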

Then a second solution came up: since 80% of queries are performed over the last few years of data, let’s move the oldest records to depot tables! This way, the data used most often is kept in relatively small (affordable, at least) tables, and the rest is just stored. So a big problem had turned into a matter of designing some queries over a bunch of tables, in a way that can be easily automated. Great! And it worked for a while, until we realized that even with small tables, we were handling huge sets of data to represent the evolution of physical variables over time. I’m talking about high-resolution data, say a measurement every minute for some types of sensor, which makes it possible to study minor variations throughout the day.
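The archiving step itself is just a couple of statements per table; a minimal sketch, assuming the depot table has the same structure and a two-year cut-off (both are assumptions for the example):

    -- move records older than the cut-off into the depot table
    INSERT INTO depot_station_0042
        SELECT * FROM station_0042
        WHERE measured_at < DATE_SUB(NOW(), INTERVAL 2 YEAR);

    DELETE FROM station_0042
        WHERE measured_at < DATE_SUB(NOW(), INTERVAL 2 YEAR);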

This is when we started thinking about cubes – summary tables. The real headache with big data sets starts when comparing a whole month’s range of values, which produces several thousand records to draw on an 800px-wide plot… that’s basically a waste of server processor and memory, since we are providing a resolution which is not suited to the circumstances.

So now we have a third solution, based on averages. MySQL – like other database engines – can compute averages on the fly, which is far better than delegating these operations to the application layer. The underlying problem is that a big data set must be evaluated on demand, and for every user on the website (we can barely cache MySQL responses because values change every 10 minutes). To fix this, we’ve decided to run these operations just once every hour, and store the results in cubes – tables with summaries – which are queried instead of the high-density original ones. The algorithm to determine which table to look in is simple: if the requested range is larger than several days, just use the hourly resolution instead. To automate this, we’ve added a naming convention: all data tables are named the same way (based on the station ID), and only the database name (based on the data resolution) changes.
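An hourly refresh of one cube could look roughly like this (the minutely/hourly database names follow the convention above but are invented for this example; REPLACE keeps the job idempotent if it runs twice):

    -- recompute the last hour's averages into the hourly-resolution database
    REPLACE INTO hourly.station_0042 (measured_at, sensor_id, value)
        SELECT
            DATE_FORMAT(measured_at, '%Y-%m-%d %H:00:00') AS hour_start,
            sensor_id,
            AVG(value)
        FROM minutely.station_0042
        WHERE measured_at >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
        GROUP BY hour_start, sensor_id;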

To finish, I’d like to give a preview of how this cube generation process is put together (a rough sketch follows the list):

  1. A stored procedure calculates averages for a given table over a given period
  2. Another stored procedure launches the first one for all the tables; it is designed to be called from whatever time event handler we use
  3. The event handler is defined either in the MySQL engine itself or as a cron job
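Roughly, and with invented names (this is a sketch of the idea, not our production code), the pieces fit together like this:

    -- 1. averages for one table over a given period
    DELIMITER //
    CREATE PROCEDURE build_cube_for(IN tbl VARCHAR(64), IN since DATETIME)
    BEGIN
        SET @sql = CONCAT(
            'REPLACE INTO hourly.', tbl, ' (measured_at, sensor_id, value) ',
            'SELECT DATE_FORMAT(measured_at, ''%Y-%m-%d %H:00:00''), sensor_id, AVG(value) ',
            'FROM minutely.', tbl, ' WHERE measured_at >= ? GROUP BY 1, 2');
        PREPARE stmt FROM @sql;
        SET @since = since;
        EXECUTE stmt USING @since;
        DEALLOCATE PREPARE stmt;
    END//

    -- 2. launch the first one for every station table
    CREATE PROCEDURE build_all_cubes()
    BEGIN
        DECLARE done INT DEFAULT 0;
        DECLARE tbl VARCHAR(64);
        DECLARE cur CURSOR FOR
            SELECT table_name FROM information_schema.tables
            WHERE table_schema = 'minutely';
        DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
        OPEN cur;
        station_loop: LOOP
            FETCH cur INTO tbl;
            IF done THEN LEAVE station_loop; END IF;
            CALL build_cube_for(tbl, DATE_SUB(NOW(), INTERVAL 1 HOUR));
        END LOOP;
        CLOSE cur;
    END//
    DELIMITER ;

    -- 3. the time event handler, here as a MySQL event
    --    (needs event_scheduler = ON; a cron job calling mysql works just as well)
    CREATE EVENT refresh_cubes
        ON SCHEDULE EVERY 1 HOUR
        DO CALL build_all_cubes();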

I’ll talk about the implementation details in further posts…