I’ve already talked about how I solved the problem of managing huge amounts of data in my last post. Now, I’m going to explain how to implement one of the solutions found in order to comfortably face this continuously increasing avalanche of information.
Let’s imagine that, as I explained before, I have separated data input into tables coded by a station ID. All of them are stored in the same database for maintenance simplicity sake. So, we have the ‘database’ in which there are hundreds of tables called ‘Station_XYZ’. Every table has the same structure: to simplify, SensorID, Time (UNIX), Value. All right then, time to make cubes!
First of all, I define a MySQL stored procedure to extract maximum, minimum and average values from these tables and to save them in a second ‘database_cubes_1h’, named so because this process will run every hour. Also, there is a table Stations in ‘database_main’ (a third database to keep other application’s configurations), where all the stations ever installed are registered. We will use this table to know if a station exists and therefore its related table in database. A first draft would look like this:
-- -----------------------------------------------------
-- Procedure `fill_cubes_1h`
-- -----------------------------------------------------
DROP PROCEDURE IF EXISTS fill_cubes_1h;
DELIMITER //
CREATE PROCEDURE fill_cubes_1h($StationID INT)
BEGIN
IF Station_ID IN (SELECT StationID FROM `database`.`Stations`)
THEN
SET @strSQL = CONCAT('
INSERT INTO `database_cubes_1h`.`Station_',$StationID,'`
SELECT
SensorID,
UNIX_TIMESTAMP(DATE_FORMAT(FROM_UNIXTIME(Time),"%Y-%m-%d %H:00:00")) AS Hour,
AVG(Value) AS Value,
MAX(Value) AS ValueMax,
MIN(Value) AS ValueMin
FROM `database`.`Station_',$StationID,'`
WHERE
Time >= UNIX_TIMESTAMP(DATE_FORMAT(NOW() - INTERVAL 1 HOUR,"%Y-%m-%d %H:00:00"))
GROUP BY
HOUR, SensorID
ORDER BY
HOUR ASC
');
PREPARE statement FROM @strSQL;
EXECUTE statement ;
DEALLOCATE PREPARE statement ;
ELSE
SET @foo = "";
END IF;
END;//
DELIMITER ;
Basically, it composes a query to extract averages from one table and insert into another at once. This query is run as a prepared statement so that we can reuse the function for all the stations in our database, as long as tables are always named as Station_XYZ. But, what does exactly it do?
- Use a statement like ‘INSERT INTO table SELECT * FROM another_table. It will copy data automatically if the number and format of columns of the SELECT output and the INSERT INTO input are the same.
- As Time is stored in UNIX EPOCH time, it is converted to ISO time, then only the hour is extracted, then it is converted back to UNIX. The result of this process is be able to group the result set by hour, getting rid off minutes and seconds. “UNIX_TIMESTAMP(DATE_FORMAT(FROM_UNIXTIME(Time),”%Y-%m-%d %H:00:00″)) AS Hour”
- A WHERE condition filters the results for the last hour: ‘Time >= UNIX_TIMESTAMP(DATE_FORMAT(NOW() – INTERVAL 1 HOUR,”%Y-%m-%d %H:00:00″))’.
- Finally, the whole result set is grouped by Time, to be able to calculate averages, maximums and minimums. This statement does the grouping by respecting the different sensors that might have sent data on the same hourly interval “GROUP BY HOUR, SensorID” and the following ones perform calculations “AVG(Value) AS Value, MAX(Value) AS ValueMax, MIN(Value) AS ValueMin”.
Up to now we seem to have resolved the performance problem. Hourly cubes can be constructed and we only would need to add a sort of cron job. But, it is not so easy… I haven’t mentioned yet that data is not received synchronously. That means, within a time frame of three or four hours we could receive the data from a lazy or out of range station. That may be problematic, so…
I’ve changed my stored procedure to do as follows:
- Calculate the whole last day in hourly tranches, while keep running every hour. “WHERE Time >= UNIX_TIMESTAMP(DATE_FORMAT(NOW() – INTERVAL 1 DAY,”%Y-%m-%d %H:00:00″)) AND Time <= UNIX_TIMESTAMP(DATE_FORMAT(NOW() – INTERVAL 1 HOUR,”%Y-%m-%d %H:59:59″))”
- Modify the query to allow update in case that a hourly cube was already calculated. “ON DUPLICATE KEY UPDATE `Value` = VALUES(`Value`)”
Please note that in order to get it correctly running, there should be an effective way to detect a duplicate key. In my case, I’ve used all fields but the value as primary keys, instead of defining a new artificial key field. Thus, SensorID and Time are primary keys, so there should never be more than one value for each combination of both. Doing so, MySQL detects duplicate and updates the third column, value, by throwing no errors. The whole stored procedure would look like this:
-- -----------------------------------------------------
-- Procedure `fill_cubes_1h`
-- -----------------------------------------------------
DROP PROCEDURE IF EXISTS fill_cubes_1h;
DELIMITER //
CREATE PROCEDURE fill_cubes_1h($StationID INT)
BEGIN
IF Station_ID IN (SELECT StationID FROM `database`.`Stations`)
THEN
SET @strSQL = CONCAT('
INSERT INTO `database_cubes_1h`.`Station_',$StationID,'`
SELECT
SensorID,
UNIX_TIMESTAMP(DATE_FORMAT(FROM_UNIXTIME(Time),"%Y-%m-%d %H:00:00")) AS Hour,
AVG(Value) AS Value,
MAX(Value) AS ValueMax,
MIN(Value) AS ValueMin
FROM `database`.`Station_',$StationID,'`
WHERE
Time >= UNIX_TIMESTAMP(DATE_FORMAT(NOW() - INTERVAL 1 DAY,"%Y-%m-%d %H:00:00")) AND
Time <= UNIX_TIMESTAMP(DATE_FORMAT(NOW() - INTERVAL 1 HOUR,"%Y-%m-%d %H:59:59"))
GROUP BY
HOUR, SensorID
ORDER BY
HOUR ASC
ON DUPLICATE KEY UPDATE `Value` = VALUES(`Value`)
');
PREPARE statement FROM @strSQL;
EXECUTE statement ;
DEALLOCATE PREPARE statement ;
ELSE
SET @foo = "";
END IF;
END ;//
DELIMITER ;
The next step is to get this procedure running every hour for all the stations in the database. Now we are going to use our Stations table. The steps to follow are:
- Request the highest station ID in Stations table
- Since station ID’s are not sequential, get a way to determine if related table should exist
- Tell the procedure to keep going if any exception raises
- Run the loop until reach final station ID, and call the previous procedure passing current index as a parameter
The final result looks like:
-- -----------------------------------------------------
-- Procedure `climaps_main`.`fill_all_cubes_1h`
-- -----------------------------------------------------
DROP PROCEDURE IF EXISTS fill_all_cubes_1h;
DELIMITER //
CREATE PROCEDURE fill_all_cubes_1h()
BEGIN
DECLARE v INTEGER;
DECLARE m INTEGER;
DECLARE CONTINUE HANDLER FOR SQLEXCEPTION BEGIN END;
SET v = 0;
SET m = (SELECT StaID FROM climaps_main.Stations ORDER BY StaID DESC LIMIT 1);
WHILE v <= m DO
CALL fill_cubes_1h(v);
SET v = v + 1;
END WHILE;
END;//
DELIMITER ;
We’ve nearly finished! We have a way to build hourly cubes for a variable number of stations, and solving the problem of asynchronous data. We just need to create an event to run this last procedure, as we have stated above, every hour. This should work on MySQL 5.1
-- -----------------------------------------------------
-- Event `database_main`.`fill_cubes_1h`
-- -----------------------------------------------------
DROP EVENT IF EXISTS fill_cubes_1h;
DELIMITER //
CREATE EVENT fill_cubes_1h
ON SCHEDULE EVERY 1 HOUR
DO BEGIN
CALL fill_all_cubes_1h();
END //
DELIMITER ;
If for any reason we are not allowed to create events (all this stuff should be run as a user with whole access to the tables involved), or we are running a MySQL version lower than 5.1, a cron job should be run instead. Just get EXECUTE access on the database where the procedures have been created (on my case, database_main), and add this line to crontab:
0 * * * * mysql -e 'CALL database_main.fill_all_cubes_1h()'
Please note that to get this statement running without asking for a password, you need to create a .my.cnf file in your home directory (UNIX systems)
And that’s all!
So, what do you think? What’s your method? Comments and feedback are much appreciated 🙂