Evolving CRUD (part 1)

Serving content at scale while limiting complexity

Scalability is a simple concept that proves difficult to achieve without introducing complexity.

 Create Read Update Delete

CRUD stands for Create, Read, Update, Delete and identifies all the possible types of interaction with a specific resource hosted in any kind of datastore.

It’s the simplest paradigm for database access and translates easily into the world of web applications: most RESTful APIs are in fact built around it.

Rightly so, as it’s intuitive and simple in itself.

[Figure: Create Read Update Delete]

Usually a web application exposing a CRUD interface externally translates each interaction into a similarly straightforward database interaction.

For example, a RESTful API with a SQL backend could have:

CRUD operation | RESTful method                | SQL statement
---------------|-------------------------------|--------------
Create         | POST /resources               | INSERT
Read           | GET /resources/:resourceId    | SELECT
Update         | PUT /resources/:resourceId    | UPDATE
Delete         | DELETE /resources/:resourceId | DELETE
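
A rough sketch of that mapping in code, assuming an Express app backed by node-postgres (the resources table, its name column and the port are made up for illustration):

```typescript
import express from "express";
import { Pool } from "pg";

const app = express();
app.use(express.json());

// Connection settings come from the usual PG* environment variables.
const pool = new Pool();

// Create -> INSERT
app.post("/resources", async (req, res) => {
  const { rows } = await pool.query(
    "INSERT INTO resources (name) VALUES ($1) RETURNING *",
    [req.body.name]
  );
  res.status(201).json(rows[0]);
});

// Read -> SELECT
app.get("/resources/:resourceId", async (req, res) => {
  const { rows } = await pool.query("SELECT * FROM resources WHERE id = $1", [
    req.params.resourceId,
  ]);
  if (rows.length) {
    res.json(rows[0]);
  } else {
    res.sendStatus(404);
  }
});

// Update -> UPDATE
app.put("/resources/:resourceId", async (req, res) => {
  await pool.query("UPDATE resources SET name = $1 WHERE id = $2", [
    req.body.name,
    req.params.resourceId,
  ]);
  res.sendStatus(204);
});

// Delete -> DELETE
app.delete("/resources/:resourceId", async (req, res) => {
  await pool.query("DELETE FROM resources WHERE id = $1", [
    req.params.resourceId,
  ]);
  res.sendStatus(204);
});

app.listen(3000);
```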

 CRUD + cron

Sooner or later the need for some kind of background processing arises.

That’s usually when scheduled tasks need to be added. They can live on their own dedicated servers, but some runtimes like Java and Node.js allow them to be implemented within the web app itself.
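
For instance, a Node.js app can keep a scheduled task in the same process as the web server. A minimal sketch using the node-cron package (the job itself and its schedule are hypothetical):

```typescript
import cron from "node-cron";

// Hypothetical background job: purge soft-deleted rows.
async function purgeDeletedResources(): Promise<void> {
  // e.g. DELETE FROM resources WHERE deleted_at < now() - interval '30 days'
}

// Runs inside the same process as the web app, every five minutes.
cron.schedule("*/5 * * * *", async () => {
  try {
    await purgeDeletedResources();
  } catch (err) {
    console.error("scheduled purge failed", err);
  }
});
```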

[Figure: Create Read Update Delete with cron job]

Scheduled tasks are usually time-based executions of business logic and are therefore inherently fragile in case of system or network slowdowns.

 CRUD + caching

Usually a plain vanilla CRUD system cannot withstand a significant amount of traffic, and the bottleneck is usually the database. The web application layer can scale out as the traffic grows, but the database layer can only scale up, and only so far.

Pretty much all relational DBMSs offer some kind of ACID compliance. In practice this translates into hard-to-beat constraints on read and write capacity: reads can hardly scale beyond 200/300 req/sec, and for writes the limit can only be lower, as each INSERT/UPDATE will inevitably hit the disk before returning.

[Figure: Create Read Update Delete with cron job and caching]

As the traffic grows, it’s easy to realise that read requests generate an order of magnitude more traffic than write requests.

Most websites, like news sites or blogs, in fact exhibit a low-write, high-read behaviour. This largely depends on the amount of user generated vs curated content the web application serves.

Since caching is added as an afterthought, and caching is practically a synonym for in-memory storage, Memcache is usually the technology of choice.

Heavy SELECT queries can be seamlessly cached in Memcache for a variable amount of time. A memory-based caching layer can easily offer 100k req/sec on average hardware.
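
A minimal cache-aside sketch of this, assuming the memjs Memcache client and a node-postgres pool (the key name, query and 60-second TTL are illustrative, and the client signatures are my assumption):

```typescript
import { Pool } from "pg";
import memjs from "memjs";

const pool = new Pool();
// Defaults to localhost:11211 unless configured otherwise.
const cache = memjs.Client.create();

// Cache-aside read: try Memcache first, fall back to the heavy SELECT,
// then store the result for 60 seconds.
async function getTopStories(): Promise<unknown[]> {
  const key = "top-stories";

  const cached = await cache.get(key);
  if (cached.value) {
    return JSON.parse(cached.value.toString());
  }

  const { rows } = await pool.query(
    "SELECT * FROM stories ORDER BY score DESC LIMIT 50"
  );
  await cache.set(key, JSON.stringify(rows), { expires: 60 });
  return rows;
}
```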

Given its speed it’s not abnormal to try and serve 99% of the page requests straight from cache.

The issue with this approach is that we are effectively throwing away all the relational modelling done while designing the web application in order to build a key-value NoSQL model, at a late stage and without a proper design.

 CRUD + caching + CDN

As the traffic becomes more significant and geographically distributed, in-app caching starts showing its limitations: the web application layer needs to scale out more and more, to the point of becoming impractical or uneconomic.

The easiest way to cheaply scale out read requests is to have multiple distributed copies of our data served by a dedicated third-party system, i.e. a CDN.

[Figure: Create Read Update Delete with cron job, caching and CDN]

As read capacity easily scales out using a CDN, we face the typical tradeoff that comes with caching: freshness vs throughput.

This is similar to what we achieve by using Memcache locally, only made worse by the distributed nature of a CDN and by its implications in terms of change propagation across the network.

At first, static files are a quick win and can be served by a CDN with limited side effects, especially if they are immutable - i.e. version controlled.

They are a source of bandwidth usage only though, not CPU, and bandwidth is cheaper than CPU on most cloud providers.

To save on CPU it’s necessary to avoid serving dynamically generated content straight from origin whenever possible.
This can be achieved by setting cache headers on each dynamic response so that very few requests hit origin and most of them are served by the CDN.
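
A hedged sketch with Express, covering both versioned static assets and a dynamically generated response (paths, TTLs and the payload are illustrative; s-maxage targets shared caches such as a CDN):

```typescript
import express from "express";

const app = express();

// Versioned/immutable static assets: cache aggressively at the CDN and browser.
app.use("/assets", express.static("public", { maxAge: "1y", immutable: true }));

// Dynamically generated content: let the CDN serve it for a short while
// so only a trickle of requests reaches origin.
app.get("/resources/:resourceId", (req, res) => {
  // max-age applies to the browser, s-maxage to shared caches (the CDN).
  res.set("Cache-Control", "public, max-age=30, s-maxage=300");
  res.json({ id: req.params.resourceId }); // placeholder payload
});

app.listen(3000);
```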

CDNs effectively offer a good (D)DoS protection mechanism this way.

 Issues with our approach

Some observations on what we’ve achieved in our journey:

- the relational model we carefully designed ends up hidden behind an ad hoc key-value caching layer, added as an afterthought
- scheduled tasks are time based and therefore fragile under system or network slowdowns
- every layer of caching, Memcache and CDN alike, trades freshness for throughput
- writes still funnel into a single database layer that can only scale up

Let’s see if we can do any better in part 2.

 