> Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.
>
> — Donald Knuth
I don’t usually agree with this statement. It’s a product of a past era, when developers were severely constrained by the CPU and memory available to them. We’ve since swung to the other end of the spectrum: nowadays some frameworks and patterns are far too abstract and disconnected from the underlying technological constraints.
I’ve seen quite a few projects where state-of-the-art software engineering practices like object-oriented encapsulation and inheritance fail miserably in production, because they yield code that is simply too slow.
The Java VM has long been optimized to run with hundreds of threads and sometimes millions of object instances per web app. That’s just insane!
ORMs are a typical example of an abstraction that causes nothing but headaches in production unless you have no load or no data, because it tries to bridge two very different domains (relational and object-oriented) with a one-size-fits-all approach.
## The error
Despite all of the above, I fell into the premature optimization trap myself at least once.
Tasked with building an energy management solution, I designed a system that would aggregate raw 5-minute data into multiple intervals, defined as buckets of seconds after the epoch.
These intervals would look something like this:
| intervals | examples | mysql tables |
| --- | --- | --- |
| 5m | 04:00:00, 04:05:00, 04:10:00… | energy_5 |
| 15m | 04:00:00, 04:15:00, 04:30:00… | energy_15 |
| 30m | 04:00:00, 04:30:00, 05:00:00… | energy_30 |
| 1hr | 04:00:00, 05:00:00, 06:00:00… | energy_hour |
| 1day | 06/11/2015 00:00:00, 06/12/2015 00:00:00… | energy_day |
| 1week | … | energy_week |
| 1month | … | energy_month |
| 1year | … | energy_year |
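As a rough sketch of how such bucketing works (the helper name and interval map are mine, not from the actual codebase; week/month/year buckets aren’t fixed-length in seconds, so they’re omitted):

```python
# Sketch: map a Unix timestamp to the start of its aggregation bucket.
# Bucket boundaries are multiples of the interval length (in seconds)
# since the epoch, matching the fixed-length intervals in the table above.

INTERVALS = {
    "energy_5": 5 * 60,
    "energy_15": 15 * 60,
    "energy_30": 30 * 60,
    "energy_hour": 60 * 60,
    "energy_day": 24 * 60 * 60,
}

def bucket_start(ts: int, interval_seconds: int) -> int:
    """Round a Unix timestamp down to its bucket boundary."""
    return ts - (ts % interval_seconds)

# 04:07:30 falls in the 04:05 five-minute bucket
# and the 04:00 fifteen-minute bucket.
ts = 4 * 3600 + 7 * 60 + 30  # seconds into 1970-01-01
print(bucket_start(ts, INTERVALS["energy_5"]))   # 14700 == 04:05:00
print(bucket_start(ts, INTERVALS["energy_15"]))  # 14400 == 04:00:00
```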
The main idea behind my architecture was to pre-aggregate the data around these intervals so the APIs could query it quickly. The aggregation relied on Redis to continuously update a bucket until its data became stale and could be flushed to MySQL.
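The accumulate-then-flush flow might look roughly like this (a minimal sketch: a plain dict stands in for the Redis hash, e.g. HINCRBYFLOAT on a key per circuit and bucket, and all field names are illustrative):

```python
class BucketAggregator:
    """Accumulate readings into a live bucket, flush once it goes stale.

    The `live` dict plays the role of Redis; `flush_stale` plays the role
    of the job that writes closed buckets out to MySQL.
    """

    def __init__(self, interval_seconds: int):
        self.interval = interval_seconds
        self.live = {}      # (circuit_id, bucket_start) -> accumulated energy
        self.flushed = []   # rows that would be INSERTed into MySQL

    def add_reading(self, circuit_id: int, ts: int, energy: float):
        bucket = ts - (ts % self.interval)
        key = (circuit_id, bucket)
        self.live[key] = self.live.get(key, 0.0) + energy

    def flush_stale(self, now: int):
        """Move buckets whose interval has fully elapsed to 'MySQL'."""
        for key in list(self.live):
            circuit_id, bucket = key
            if bucket + self.interval <= now:
                self.flushed.append((circuit_id, bucket, self.live.pop(key)))

agg = BucketAggregator(30 * 60)
agg.add_reading(55, 1000, 2.5)
agg.add_reading(55, 1200, 1.5)
agg.flush_stale(now=4000)   # the [0, 1800) bucket is now stale
print(agg.flushed)          # [(55, 0, 4.0)]
```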
And indeed it worked quite well, though not without issues. The drawbacks of pre-aggregating are easy to identify:
- The buckets are timezone dependent, so changing the timezone for a circuit is not supported (it would screw up the bucket start times)
- Accumulated error caused by software bugs is not easy to rectify, and is especially annoying when it affects the yearly buckets, as fixing those requires extracting a full year’s worth of data from the DB to recalculate the correct bucket values
- Numeric error creeps in while adding very small or very large values over and over again (there are 105k+ 5-minute intervals in a year)
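That last point is easy to reproduce. With a large running total, small increments can be silently absorbed by floating-point rounding (the numbers below are illustrative, not real meter data):

```python
import math

# When the accumulator is large, each small increment falls below the
# rounding step of a double and is lost entirely.
big = 1e16                 # e.g. a long-lived cumulative counter
increments = [1.0] * 1000

naive = big
for x in increments:
    naive += x             # running total, as an incremental bucket keeps

exact = math.fsum([big] + increments)  # correctly rounded sum

print(naive == big)        # True: all 1000 increments were absorbed
print(exact - big)         # 1000.0
```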
Recently I was tasked with working out solutions to these issues, and I suddenly realized my approach had been overly zealous for no good reason.
The idea had always been to avoid MySQL aggregations at query time, since they are discouraged in pretty much every forum and Stack Overflow answer.
It turns out things have improved significantly in MySQL 5.6 (although I cannot find a definitive reference for it in the changelogs), and aggregations over up to 100k rows, or less than 1 MB of data, tend to be very fast.
Hence a query like this (the energy_5 table has fixed-size 36-byte rows):

```sql
SELECT SUM(longEnergy), SUM(longReactiveEnergy),
       MIN(voltageMin), MAX(voltageMax),
       MIN(currentMin), MAX(currentMax),
       SUM(cost), SUM(co2)
FROM energy_5
WHERE circuitID = 55;
```

would run in just under 0.1 seconds while aggregating over 50k selected rows, and in just 0.3 seconds over 130k rows, on an Amazon m3.medium instance against a 32-million-row table.
This meant I could reduce the number of pre-aggregation intervals down to basically two (5-minute and 30-minute) while retaining the query speed I had before. As a result:
- the aggregations are now handled partly in the DB and partly by the energy data collector
- it is now possible to move a circuit from one timezone to another (though only full-hour and half-hour timezones are supported)
- the effects of aggregation bugs are generally limited to a few 30-minute buckets and require very little effort to recalculate
- the API contract to the end user hasn’t changed
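The half-hour timezone restriction follows directly from keeping 30-minute base buckets: local midnight only lands on a bucket boundary when the UTC offset is a whole multiple of 30 minutes. A quick sketch (function name is mine, for illustration):

```python
DAY = 24 * 3600
HALF_HOUR = 30 * 60

def local_day_start_utc(ts: int, utc_offset: int) -> int:
    """UTC timestamp of local midnight for the local day containing ts."""
    local = ts + utc_offset
    return (local - local % DAY) - utc_offset

# For a +05:30 circuit, local midnight lands exactly on a 30-minute
# bucket boundary, so a daily total is just the sum of 48 buckets.
start = local_day_start_utc(1_000_000, 5 * 3600 + 1800)
print(start % HALF_HOUR == 0)   # True

# A +05:45 offset does not align, which is why only full-hour and
# half-hour timezones are supported.
start = local_day_start_utc(1_000_000, 5 * 3600 + 2700)
print(start % HALF_HOUR == 0)   # False
```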
More often than not, a hybrid approach like this turns out to be the winner.