Premature optimization

> Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

(Donald Knuth)


I don’t usually agree with this statement. It’s a product of a past era, when developers were severely limited by the CPU and memory resources available.

We’ve now gone to the other end of the spectrum: nowadays some frameworks and patterns are far too abstract and disconnected from the underlying technological boundaries.

I’ve seen quite a few projects where state-of-the-art software engineering practices such as object-oriented encapsulation and inheritance fail miserably in production because they yield code that is too slow.

The Java VM has long been optimized to run with hundreds of threads and sometimes millions of object instances per web app. That’s just insane!

ORMs are a typical example of an abstraction that causes only headaches in production unless you are dealing with little load or little data, as they try to bridge two very different domains (relational and object-oriented) with a one-size-fits-all approach.

The error

Despite all of the above, I fell into the premature optimization trap myself at least once.

Tasked with building an energy management solution, I designed a system that would aggregate raw 5-minute data over multiple intervals, defined as buckets of seconds after the epoch.

These intervals would look something like this:

| intervals | examples | mysql tables |
| --- | --- | --- |
| 5m | 04:00:00, 04:05:00, 04:10:00… | energy_5 |
| 15m | 04:00:00, 04:15:00, 04:30:00… | energy_15 |
| 30m | 04:00:00, 04:30:00, 05:00:00… | energy_30 |
| 1hr | 04:00:00, 05:00:00, 06:00:00… | energy_hour |
| 1day | 06/11/2015, 00:00:00, 06/12/2015, 00:00:00… | energy_day |
| 1week | …. | energy_week |
| 1month | … | energy_month |
| 1year | …. | energy_year |
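For the fixed-length intervals, "buckets of seconds after epoch" simply means snapping each timestamp down to the nearest multiple of the interval length. A minimal sketch (the helper name is mine, not from the original system; calendar-based intervals like months need real calendar arithmetic instead):

```python
from datetime import datetime, timezone

def bucket_start(ts: datetime, interval_seconds: int) -> datetime:
    """Snap a timestamp to the start of its interval bucket,
    where buckets are multiples of interval_seconds since the epoch."""
    epoch = int(ts.timestamp())
    start = epoch - (epoch % interval_seconds)
    return datetime.fromtimestamp(start, tz=timezone.utc)

t = datetime(2015, 6, 11, 4, 7, 31, tzinfo=timezone.utc)
print(bucket_start(t, 300))   # 5-minute bucket  -> 2015-06-11 04:05:00+00:00
print(bucket_start(t, 1800))  # 30-minute bucket -> 2015-06-11 04:00:00+00:00
```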

The main idea behind my architecture was to pre-aggregate the data around these intervals for fast querying from the APIs.

The aggregation relied on Redis to constantly update a bucket until its data became stale and could be flushed to MySQL.

And indeed it worked quite well, though not without issues.

The drawbacks of pre-aggregating are quite easy to identify, though.

Recently I was tasked with working out solutions to these issues, and I suddenly realized my approach had been overly zealous.

The idea had always been to avoid MySQL aggregations at query time, as they are discouraged on virtually every forum and Stack Overflow answer.

It turns out things have improved significantly in MySQL 5.6 (although I cannot find a definitive reference to it in the change logs): aggregations over up to 100k rows, or less than 1 MB of data, tend to be very quick.

Hence a query like this (the energy_5 table has fixed-size 36-byte rows):

```sql
SELECT SUM(longEnergy), SUM(longReactiveEnergy),
       MIN(voltageMin), MAX(voltageMax),
       MIN(currentMin), MAX(currentMax),
       SUM(cost), SUM(co2)
FROM energy_5
WHERE circuitID = 55;
```

would run in just under 0.1 seconds when aggregating over 50k selected rows, and in just 0.3 seconds over 130k rows, on an Amazon m3.medium instance with a 32-million-row table.

As a result I was able to reduce the number of aggregations needed down to just two (5 min and 30 min) while retaining the query speed I had before.
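The coarser intervals then fall out of the fine-grained tables at query time. A minimal sketch of the idea, using Python's sqlite3 as a stand-in for MySQL (table and column names follow the article; the data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE energy_5 (circuitID INTEGER, bucket INTEGER, longEnergy INTEGER)"
)
# Two hours of 5-minute buckets, 10 energy units each.
conn.executemany(
    "INSERT INTO energy_5 VALUES (?, ?, ?)",
    [(55, b, 10) for b in range(0, 7200, 300)],
)

# Derive hourly totals on the fly instead of maintaining an energy_hour table.
hourly = conn.execute(
    """SELECT bucket - bucket % 3600 AS hour_start, SUM(longEnergy)
       FROM energy_5
       WHERE circuitID = 55
       GROUP BY hour_start
       ORDER BY hour_start"""
).fetchall()
print(hourly)  # [(0, 120), (3600, 120)]
```

The same modulo trick used to build the buckets on the write path groups them back into any coarser fixed-length interval on the read path.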


More often than not, a hybrid approach like this (pre-aggregating only the finest-grained intervals and computing the rest at query time) turns out to be the winner.

