Network partitions and NoSQL
Many distributed NoSQL datastores claim to support consistency or availability.
From the CAP theorem we know that a distributed system under partition (P) can be either consistent (C) or available (A), but not both at the same time.
Partitions: Why care about them? #
Because they can happen at any time, with any distributed deployment.
The Internet is not just a big LAN, and entire regions can at times get disconnected at random.
Network partitions can also happen within a single datacentre. Hardware can always fail within the same LAN (or Availability Zone), and ignoring network topology changes is simply bad practice. Full stop.
Ideally you would want a datastore that never loses writes (CP) under any circumstance. No business is keen on losing data, after all.
It’s puzzling to see how hard this is to achieve once our datastore uses some kind of distributed architecture.
Most NoSQL systems tend to privilege availability over consistency where possible.
Once a partition has occurred things tend to get messy. A datastore can:

1. refuse writes by switching into read-only mode in the minority partition;
2. accept writes on both sides and then choose which side to keep and which writes to throw away (Last-Writer-Wins);
3. accept writes on both sides and try to reconcile them all afterwards.
Many distributed datastores claim never to lose data, yet quietly implement the LWW approach (#2), or a hybrid of #1 and #2 that tries to reduce the loss window.
This is because High Availability is usually a design goal and a selling point for distributed datastores. Compromising on HA to preserve consistency usually has serious latency implications, which are deemed unacceptable.
Still: losing writes may or may not be acceptable for a given application.
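To see why LWW (#2) can silently drop data, here is a minimal Python sketch (the key, values and timestamps are made up): merging two conflicting writes keeps only the one with the higher wall-clock timestamp, so clock skew between nodes decides which write survives.

```python
from dataclasses import dataclass

@dataclass
class Write:
    key: str
    value: str
    timestamp: float  # wall-clock time at the node that accepted the write

def lww_merge(a: Write, b: Write) -> Write:
    # Keep the write with the highest timestamp; the loser is silently discarded.
    return a if a.timestamp >= b.timestamp else b

# Both sides of a partition accepted a write for the same key.
# The right-hand write happened later in real time, but its node's clock lags:
left = Write("user:42:email", "old@example.com", timestamp=1000.0)
right = Write("user:42:email", "new@example.com", timestamp=999.9)

survivor = lww_merge(left, right)
print(survivor.value)  # "old@example.com" -- the logically newer write is lost
```

Note that no error is raised anywhere: the discarded write just vanishes, which is exactly why this behaviour is easy to miss until a partition happens.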
Approach #3 is very appealing, but it only works for a specific class of applications such as real-time collaboration tools (e.g. Google Docs): if all updates are commutative they can be merged without conflicts. Roughly speaking, this means the data structure is a CRDT (Conflict-free Replicated Data Type).
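The classic introductory CRDT is the grow-only counter (G-Counter): each replica increments only its own slot, and merging takes the per-slot maxima. Because increments commute, replicas converge no matter in which order the merges happen. A minimal sketch:

```python
class GCounter:
    """Grow-only counter CRDT: per-replica slots, merge by per-slot max."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        # A replica only ever touches its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        # Taking the max per slot is commutative, associative and idempotent.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self) -> int:
        return sum(self.counts.values())

# Two replicas diverge during a partition...
a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
# ...and reconcile afterwards without losing either side's updates:
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5
```

No write was thrown away on either side, which is precisely what approach #3 buys you when your updates fit the model.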
So where do we stand? #
Aphyr has done an outstanding job of verifying the claims of many well-regarded distributed datastores with regard to their consistency and availability behaviour under partitions. Have a look at his blog. It’s time well spent.
Systems that claim to be CP but are only AP (at best) #
This is a discomforting list :(
- MongoDB : see this
- ElasticSearch: see this
- Riak: only if CRDTs are switched off
- Cassandra: when it uses timestamp-based clocks and Last-Writer-Wins cell updates
AP systems that don’t claim to be CP #
- Redis (Sentinel or Cluster): designed to be AP, despite some effort to reduce the data-loss window.
Storing financial transactions in Redis is probably a bad idea.
Verified CP systems #
- Apache ZooKeeper: proof from Twitter or from Aphyr
- Etcd & Consul: see this. As they are Raft-based, they use logical clocks as opposed to timestamp-based clocks
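The distinction matters: logical clocks order events by causality instead of trusting wall time, so clock skew cannot reorder writes. Raft uses terms and log indexes for this; the simplest illustration of the underlying idea is a Lamport clock (a sketch of the concept, not Raft’s actual mechanism):

```python
class LamportClock:
    """Minimal Lamport clock: orders events without trusting wall time."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        # Local event: just advance.
        self.time += 1
        return self.time

    def send(self) -> int:
        # Attach the ticked value to the outgoing message.
        return self.tick()

    def receive(self, msg_time: int) -> int:
        # On receipt, jump past both our own time and the sender's.
        self.time = max(self.time, msg_time) + 1
        return self.time

sender, receiver = LamportClock(), LamportClock()
t = sender.send()        # sender stamps a message with 1
receiver.receive(t)      # receiver's clock becomes 2, preserving causal order
```

However wrong the machines’ wall clocks are, an event that causally follows another always gets a larger logical timestamp.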
What can we do? #
Most blog posts about partitioning finish up with something like:

> partitions can happen so be prepared

Yes, but how?
Reading through Twilio’s experience, one lesson I can bring home is that if you need a CP system and all you have is an AP system (Redis), mirror the writes you send to the AP system to a backup CP system. This won’t give you an automatic fix if the AP system gets partitioned, but you can at least manually rebuild the data when the storm ends.
Not pretty but workable.
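The scheme can be sketched in a few lines. The stores below are dict-backed stand-ins, not real Redis or ZooKeeper clients; the point is only the shape of the write path and the recovery step.

```python
class FakeStore(dict):
    """Dict-backed stand-in for a key-value store client (hypothetical API)."""

    def set(self, key: str, value: str) -> None:
        self[key] = value

def write_through(key: str, value: str, ap_store: FakeStore, cp_store: FakeStore) -> None:
    ap_store.set(key, value)   # fast, available path the application reads from
    cp_store.set(key, value)   # durable record kept for manual recovery

def rebuild(ap_store: FakeStore, cp_store: FakeStore) -> None:
    # After the partition heals, replay the CP backup into the AP store.
    for key, value in cp_store.items():
        ap_store.set(key, value)

ap, cp = FakeStore(), FakeStore()
write_through("balance:42", "100", ap, cp)
ap.clear()                     # simulate the AP store dropping data in a partition
rebuild(ap, cp)
assert ap["balance:42"] == "100"
```

In a real deployment the mirroring step would need to be asynchronous and failure-tolerant itself, or it drags the AP store’s latency down; this sketch ignores that entirely.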
What about reads? #
If stale reads are not a problem, there is no problem: different datastores behave differently with regard to read consistency and you should follow their documentation.
It is worth noting that strong read consistency is needed more often than not: distributed counters or distributed locks simply don’t work with stale reads.
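To see why, here is the classic lost-update race simulated locally (a plain dict stands in for a replicated counter; in production the stale value would come from a lagging replica):

```python
# Two clients each perform a read-modify-write increment, but the second
# client's read is stale by the time it writes back.

counter = {"hits": 10}           # value on the primary

stale_a = counter["hits"]        # client A reads 10
stale_b = counter["hits"]        # client B also reads 10

counter["hits"] = stale_a + 1    # A writes 11
counter["hits"] = stale_b + 1    # B overwrites with 11 -- one increment is lost

assert counter["hits"] == 11     # two increments happened, but only one survived
```

A lock built the same way fails even harder: two clients can both read “unlocked” from a stale replica and both believe they hold the lock.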
There is a wonderful answer on Quora on this topic.