The P in CAP is for Performance

07 Mar 2024 - by 'Maurits van der Schee'

I was reading the (well written) article "The CAP Theorem. The Bad, the Bad, & the Ugly" by Dominik Tornow (recommended read). On "partition tolerance" (the P in CAP) he writes: "network partitions are inevitable in a realistic system model" and that tolerating partitions is a "non-negotiable requirement". His opinion is a common one, shared by Eric Brewer (author of the CAP Theorem), who also wrote in "CAP Twelve Years Later" that his framework was misleading and that "designers cannot forfeit P". I disagree as the same could be said about Consistency or Availability: people won't lightly give those up on those either.

On the CAP theorem

Eric Brewer (in my words) said in his original keynote that you can have "2 out of 3":

Consistency
Availability
Partition tolerance

My interpretation is that a bank may lean towards CA, while a social network may lean towards AP and an email system may lean towards CP.

By stating that P must always be present one may conclude that this "divides the world into CP and AP systems" as Tornow writes. He was rephrasing Brewer who wrote that "The general belief is that for wide-area systems, designers cannot forfeit P" and that we "therefore have a difficult choice between C and A". But I feel I disagree with their views. One may choose to ignore P (to some extend, read on).

CAP properties are non-binary

If you choose to NOT have P (Partition tolerance), you may achieve CA. This requires you to accept that networks are reliable. In other words, as long as the Internet works (we can reach every node within specified time) we can achieve CA. Now that doesn't sound so crazy or does it? Maybe the P should stand for "no partitioning within a specified time-frame", redefining the P as Performance.

The "choose 2" statement makes one belief that CAP properties (and especially partition tolerance) are an on/off thing, which they aren't. Just as C and A are a gradual thing with lots of shades of grey, the P also has levels. The P levels being the performance level that needs to be guaranteed. Think about slow systems, such as email for instance.

Brewer seems to agree that these are not binary properties when he writes: "Finally, all three properties are more continuous than binary." This seems in contradiction with his statement about the "difficult choice between C and A", but maybe it isn't. Brewer also writes that "partitions are rare", but it really depends on whether one qualifies (a single) packet loss as a network partition, albeit a very short one.

Retries harm performance

Internet works with packets. Any packet that is lost is re-transmitted, that is the nature of TCP. Any packet loss can therefor lead to a partition. The only factor is how long you'll retry. And if you retry longer, then the guaranteed performance of your system will go down, while also decreasing the chance of partitioning. Determining whether or not there is a partition (another host is reachable) is impossible without saying how long you'll retry. You may accept TCP's retry limits, but they are an arbitrary limit (of Reno) that you may choose to respect (or not, by implementing retries in the application layer).

Therefor I feel one should talk about CAP like this:

Consistency, in different levels (C)
Availability, with different guarantees (A)
Partition tolerance, for a specific time (P)

Every system needs all three to some extend. What those levels should be depends on your use case and thus requirements. Performance in a distributed system is determined by the node-to-node distance and their network speeds. Perceived performance can be increase by trading in consistency for better latency (as some global systems do, see PACELC Theorem). On the other hand: If the nodes can't reach each other due to partitioning, then the performance suffers (due to retries).

Examples of low guaranteed performance

Now let's look at two examples of low guaranteed performance due to partition tolerance.

1) Email servers can easily regularly miss 24 hours of Internet connectivity, does that disqualifies email as an online world-wide distributed system? The system is available when you are able to send and can check for mail received. Sending and receiving an email is often instant, but may also take days (email may end up in queues, doing retries). This lack of guaranteed performance is a way of working around the P problem.

2) In the early days of the Internet (in the 90's) we had FidoNet running on BBSes that uses a store-and-forward system to exchange email. It was the ultimate partitioned system, only connecting to a list of nodes (that it knew of) once per night. Email exchanges were done using phone lines and timed to run late at night, normally 4 AM. Every morning you could check your email and email could take days to arrive.

I think these systems have/had quite high consistency (C) and quite high availability (A) while also clearly having good partition tolerance (P), but at the cost of accepting low guaranteed performance.

Conclusion

I was reading a great article by Dominik Tornow, but I realized I didn't agree with it. Then I realized that I didn't agree with Eric Brewer's original point-of-view on the CAP properties either. I feel the 3 CAP properties are continuous and not binary on/off. I suspect that Brewer also knows this but likes to speak in hyper-boles as the nuances may not be as "quotable". I guess "pick 2" sounds good, but in reality it is more "choose your balance", which may sound boring. So on the CAP properties we can conclude that you may want (but can't have) them all.

Links

PS: Liked this article? Please share it on Facebook, Twitter or LinkedIn.

TQ
dev.com