Evolving CRUD (part 1)
Serving content at scale while limiting complexity
Scalability is a simple concept that proves difficult to achieve without introducing complexity.
Create Read Update Delete #
CRUD stands for Create, Read, Update, Delete and identifies all the possible types of interaction with a specific resource hosted in any kind of datastore.
It’s the simplest paradigm for database access and it translates easily into the world of web applications: most RESTful APIs are in fact built on this paradigm.
Rightly so, as it’s intuitive and simple.
Usually a web application exposing a CRUD interface externally translates each interaction into a similarly straightforward database interaction.
For example, with a RESTful API with a SQL backend we could have:
| CRUD operation | RESTful method | SQL statement |
|----------------|----------------|---------------|
| Create         | POST           | INSERT        |
| Read           | GET            | SELECT        |
| Update         | PUT            | UPDATE        |
| Delete         | DELETE         | DELETE        |
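A minimal sketch of this one-to-one mapping, using an in-memory SQLite database and an illustrative `articles` table (the names are assumptions, not part of any real API):

```python
import sqlite3

# Each CRUD handler maps directly onto a single SQL statement.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")

def create(title):                 # POST   -> INSERT
    cur = db.execute("INSERT INTO articles (title) VALUES (?)", (title,))
    db.commit()
    return cur.lastrowid

def read(article_id):              # GET    -> SELECT
    row = db.execute("SELECT title FROM articles WHERE id = ?",
                     (article_id,)).fetchone()
    return row[0] if row else None

def update(article_id, title):     # PUT    -> UPDATE
    db.execute("UPDATE articles SET title = ? WHERE id = ?",
               (title, article_id))
    db.commit()

def delete(article_id):            # DELETE -> DELETE
    db.execute("DELETE FROM articles WHERE id = ?", (article_id,))
    db.commit()
```

A real RESTful layer would route HTTP verbs to these functions, but the shape of the translation stays this direct.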
CRUD + cron #
Sooner or later, more often than not, the need for some kind of background processing arises.
That’s usually when scheduled tasks get added. They can live on their own dedicated servers, but some technologies, like Java and Node.js, allow for an implementation within the web app itself.
Scheduled tasks are usually time-based executions of business logic and are therefore inherently fragile in the face of system or network slowdowns.
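To make the fragility concrete, here is a sketch of time-based scheduling using Python’s standard `sched` module (the intervals are illustrative): jobs fire at wall-clock deadlines, so anything that slows the system down simply delays the executions or piles them up.

```python
import sched
import time

# Time-based scheduling: jobs run at fixed deadlines, not when the
# system is actually ready for them.
scheduler = sched.scheduler(time.monotonic, time.sleep)
runs = []

def job():
    # Stands in for some business logic; records when it actually ran.
    runs.append(time.monotonic())

# Schedule three runs, 0.1s apart.
for i in range(3):
    scheduler.enter(0.1 * (i + 1), 1, job)

scheduler.run()  # blocks until every scheduled job has fired
```

If `job` (or the machine) is slow, the deadlines slip silently, which is exactly the fragility described above.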
CRUD + caching #
Usually a plain-vanilla CRUD system cannot withstand a significant amount of traffic, and the bottleneck is usually the database: the web application layer can scale out as traffic grows, but the database layer can only scale up, and only so far.
Pretty much all relational DBMSs offer some kind of ACID compliance. In practice this translates into hard-to-beat constraints on read and write capacity: reads can hardly scale beyond 200–300 req/sec, and the limit for writes can only be lower, as each INSERT/UPDATE inevitably hits the disk before returning.
As traffic grows, it’s easy to realise that read requests generate an order of magnitude more traffic than write requests.
Most websites, such as news sites or blogs, in fact exhibit a low-write, high-read behaviour. This largely depends on the ratio of user-generated to curated content the web application serves.
Since caching is typically added as an afterthought, and caching is synonymous with in-memory storage, Memcache is usually the technology of choice.
Heavy SELECT queries can be seamlessly cached in Memcache for a variable amount of time, and a memory-based caching layer can easily serve 100k req/sec on average hardware.
Given its speed, it’s not unusual to aim at serving 99% of page requests straight from the cache.
The issue with this approach is that we are effectively throwing away all the relational modelling done while designing the web application in order to build a key-value NoSQL model, at a late stage and without a proper design.
CRUD + caching + cdn #
As traffic becomes more significant and geographically distributed, in-app caching starts showing its limitations: the web application layer needs to scale out more and more, to the point of becoming impractical or uneconomical.
The easiest way to cheaply scale out read requests is to have multiple distributed copies of our data served by a dedicated third-party system, i.e. a CDN.
As read capacity easily scales out using a CDN, we face the typical tradeoff that comes with caching: freshness vs throughput.
This is similar to what we achieve by using Memcache locally, only made worse by the distributed nature of a CDN and by its implications for change propagation across the network.
To start with, static files are a quick win and can be served by a CDN with limited side effects, especially if they are immutable, i.e. version-controlled.
They consume only bandwidth though, not CPU, and bandwidth is cheaper than CPU on most cloud providers.
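One common way to make static assets immutable is to fingerprint filenames with a content hash, so the URL changes whenever the content does and the CDN can cache each version forever. A minimal sketch (the function name and digest length are illustrative):

```python
import hashlib

def fingerprint(filename, content: bytes):
    # Derive an immutable, content-addressed filename: same content,
    # same name; any change in content yields a new URL for the CDN.
    digest = hashlib.sha256(content).hexdigest()[:8]
    stem, dot, ext = filename.rpartition(".")
    return f"{stem}.{digest}.{ext}" if dot else f"{filename}.{digest}"
```

Build tools typically apply this at deploy time and rewrite references in the HTML accordingly.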
To save on CPU it’s necessary to avoid serving dynamically generated content straight from origin whenever possible.
This can be achieved by setting cache headers on each dynamic response so that very few requests hit origin and most of them are served by the CDN.
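In HTTP terms this means emitting a `Cache-Control` header on dynamic responses: `s-maxage` governs shared caches like the CDN, `max-age` the browser. A small helper as a sketch (the TTL values are illustrative, not a recommendation):

```python
def cache_headers(ttl_cdn=60, ttl_browser=10):
    # Let the CDN serve this response for ttl_cdn seconds without
    # revalidating at origin; browsers keep it for ttl_browser seconds.
    return {
        "Cache-Control": f"public, max-age={ttl_browser}, s-maxage={ttl_cdn}"
    }
```

Even a short CDN TTL (say, 60 seconds) collapses thousands of identical requests into a single origin hit per TTL window.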
CDNs effectively offer a good (D)DoS protection mechanism this way.
Issues with our approach #
Some observations on what we’ve achieved in our journey:
- we designed a system with conceptual simplicity in mind, without letting performance considerations affect its design. In other words, we followed the “don’t optimize prematurely” mantra, adding ad hoc optimizations only when needed
- we started from a relatively simple conceptual design and evolved it rather quickly into a more complex system
- caching now happens at two different layers (CDN and cache server)
- our caching strategy is effectively pull: if a view is not in the cache, let’s compute it and store it there for subsequent requests.
- only GET requests for static or slow-changing content can be effectively intercepted by the CDN. Data-manipulation requests (POST/PUT/DELETE) must always hit origin, and specifically the DBMS, so user-generated content is still our Achilles’ heel
- we’ve introduced delays in content-update propagation along the way and sacrificed freshness
- we are still relying on cron jobs or scheduled tasks for background processing; they are time-based and hence inherently fragile
Let’s see if we can do any better in part 2.