
Data in the cloud and how the big cloud vendors handle consistency


The way to go in the cloud

Massive scalability is a key component of elasticity, which in turn is the key advantage of cloud computing. Handling massive amounts of data is far from easy, whether you use cloud computing or not. To get the real benefits of the cloud there are a couple of limiting factors that need to be considered; at least that is how the official dogma goes.

We cannot "have it all" with big data

Many seasoned developers/architects are used to working with, or even designing, databases that offer perfect consistency and very good availability. Sadly, this is a more challenging task with big data.

Spreading out the data on multiple machines is a good way to improve availability. Doing so enables your solution to serve more requests per unit of time and also gives you the opportunity to implement automatic failover.

However, improving availability this way impacts consistency. If we save data to machine A and that machine fails immediately afterwards, machine B takes over so that our availability remains high. The consequence is that the data we just saved to machine A is not reflected on machine B, i.e. our consistency is less than perfect.
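
To make the scenario concrete, here is a minimal sketch of it in Python. The classes and names are purely illustrative, not any vendor's implementation: a write lands on machine A, replication to machine B is asynchronous, and A crashes before the write is copied over.

```python
class Replica:
    """A toy replica: just a named in-memory key-value store."""
    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)

primary = Replica("A")
backup = Replica("B")

primary.write("balance", 100)   # committed on A only
# ... asynchronous replication to B would normally happen here,
# but A crashes before the write is copied over ...
primary_alive = False

# Failover: reads are now served by B, which never saw the write.
serving = primary if primary_alive else backup
print(serving.read("balance"))  # -> None: availability preserved, consistency lost
```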

According to the CAP theorem (which I will not explain here), we have to prioritize between availability and consistency. This is a tough choice, since we generally want both.

What kind of choices are available?

The choices that we do have are to forfeit consistency or to forfeit availability. This is not as dramatic as it sounds, since we would still have good consistency and good availability! However, sometimes consistency is of the utmost importance: all data must be completely consistent at all times on all machines. In such a scenario we have to accept reduced availability. On the other hand, if availability is too important to reduce, we can choose to reduce data consistency. Here are some terms that are important to understand before reading the rest of this blog post:

Eventual consistency is a term that was popularized by Werner Vogels. Data is saved and committed to some, but not all, machines (often called nodes) before your write request returns. The advantages are that only some of the machines must be available when data is saved, and that the write may return faster since fewer machines are contacted. However, this leaves a window of inconsistency: during a period of time, requests might be served using old data. A recommended strategy is to work on shrinking the inconsistency window so that it closes before the next request is expected to arrive.
Read your writes means that if a process writes data it will always have access to that updated data and will never be handed older values. (This is only interesting to discuss when dealing with eventual consistency.)
Monotonic reads means that if a process once accesses a data entry it will never be presented with an older version of that data on a later occasion. (This is only interesting to discuss when dealing with eventual consistency. Simple probes for these two guarantees are sketched right below.)
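
Here is a minimal sketch of probes for these two guarantees. The `store` object, with `write(key, value)` and `read(key)` methods, is a hypothetical stand-in for whatever data store you want to test.

```python
import time
import uuid

def check_read_your_writes(store, attempts=100):
    """Write a fresh value and immediately read it back from the same
    process; count how often the read returns something older."""
    violations = 0
    for _ in range(attempts):
        value = str(uuid.uuid4())
        store.write("ryw-probe", value)
        if store.read("ryw-probe") != value:
            violations += 1
    return violations

def check_monotonic_reads(store, key, attempts=100):
    """Read the same key repeatedly and flag any read that observes an
    older version than an earlier read did. Assumes the values written
    to `key` (by some other process) are increasing integers."""
    violations, last_seen = 0, None
    for _ in range(attempts):
        value = store.read(key)
        if value is not None:
            if last_seen is not None and value < last_seen:
                violations += 1
            last_seen = value if last_seen is None else max(last_seen, value)
        time.sleep(0.01)  # small gap between reads
    return violations
```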

The big cloud vendors Amazon, Microsoft and Google all offer data stores suitable for building cloud solutions with huge amounts of data. Amazon and Google offer both eventually consistent and strongly consistent reads in their products, but Microsoft does not offer any such option: in Microsoft Azure, strongly consistent reads are the only choice.
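
With SimpleDB, for example, the consistency level is chosen per read request. Here is a minimal sketch using boto3 (which postdates this post, but the underlying GetAttributes API takes the same ConsistentRead flag); the region, domain and item names are placeholders, and valid AWS credentials are assumed.

```python
import boto3

sdb = boto3.client("sdb", region_name="us-east-1")

# Write (or overwrite) an attribute on an item.
sdb.put_attributes(
    DomainName="mydomain",
    ItemName="item-1",
    Attributes=[{"Name": "status", "Value": "shipped", "Replace": True}],
)

# Eventually consistent read (the default): may return stale data.
eventual = sdb.get_attributes(DomainName="mydomain", ItemName="item-1")

# Strongly consistent read: reflects all writes acknowledged before it.
consistent = sdb.get_attributes(
    DomainName="mydomain", ItemName="item-1", ConsistentRead=True
)

print(eventual.get("Attributes"), consistent.get("Attributes"))
```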

Do these choices make any difference?

A very interesting study on data consistency in the cloud, comparing these cloud vendors, was published earlier this year (2011) by a group of researchers based in Australia. Their findings are summarized in the table below.

| Vendor | Product | Option | Consistent after | Read your writes | Monotonic reads |
| --- | --- | --- | --- | --- | --- |
| Amazon | SimpleDB | Eventually consistent read | 500 ms | No | No |
| Amazon | SimpleDB | Consistent read | 0 ms | Yes | Yes |
| Amazon | S3 | Reduced redundancy | 0 ms | Yes | Yes |
| Amazon | S3 | Standard redundancy | 0 ms | Yes | Yes |
| Microsoft | Azure Table | (no option available) | 0 ms | Yes | Yes |
| Microsoft | Azure Blob | (no option available) | 0 ms | Yes | Yes |
| Google | App Engine Data Store | Strong consistent read | 0 ms | Yes | Yes |
| Google | App Engine Data Store | Eventual consistent read | 0 ms* | Yes* | ?* |

Interestingly, the results show that during these tests only one option (Amazon's SimpleDB with eventually consistent reads) gave rise to situations where the reader of the data saw any effects of eventual consistency. Another interesting finding is that SimpleDB with the consistent read option was slightly faster, contrary to some of the hoped-for benefits of choosing eventual consistency.
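
A "Consistent after" figure like the ones in the table can be approximated with a simple probe: write a unique value, then poll with eventually consistent reads until the new value becomes visible. Here is a minimal sketch, again against a hypothetical `store` interface; a real test like the study's runs many such operations to get statistically meaningful numbers.

```python
import time
import uuid

def measure_inconsistency_window(store, key="window-probe", timeout=5.0):
    """Return the observed inconsistency window in milliseconds, or
    None if the write never became visible within the timeout."""
    value = str(uuid.uuid4())
    start = time.time()
    store.write(key, value)
    while time.time() - start < timeout:
        if store.read(key) == value:  # eventually consistent read
            return (time.time() - start) * 1000.0
        time.sleep(0.005)             # back off briefly before polling again
    return None
```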

As for the Google App Engine Data Store with the eventual consistent read option, the results presented in the table above are marked with an asterisk (*), and here is why: 11 out of 3,311,081 read operations returned stale data, and only when reader and writer were not running in the same application. This consistency level is very high for an eventual consistency option. The explanation might be that data is fetched from a secondary replica only when the primary one is unavailable. Since stale values were only returned when reader and writer were running in different applications, Read your writes consistency seems to be offered.

Some advice

Based on the findings in this research, here is what I recommend when you are working with massive amounts of data in the cloud:

  1. Express requirements regarding availability and consistency in business terms
  2. Carefully consider your availability and consistency options with your business needs and implementation costs in mind
  3. Perform tests with realistic machine configurations, realistic amounts of traffic and realistic amounts & structure of data (a repeatable harness is sketched after this list)
  4. Evaluate and choose your implementation strategy
  5. Keep track of how your vendor changes their implementation and what that means for your solution
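
For points 3 and 5, here is a minimal sketch of a repeatable harness: it wraps whatever probe you use (e.g. the window measurement sketched earlier) and aggregates the results, so the same numbers can be re-collected after every vendor-side change. The probe callable and run count are placeholders.

```python
import statistics

def consistency_report(probe, runs=1000):
    """Run `probe()` repeatedly; it should return an observed
    inconsistency window in ms, or None for 'never became visible'."""
    windows = [probe() for _ in range(runs)]
    observed = sorted(w for w in windows if w is not None)
    if not observed:
        return {"runs": runs, "stale_forever": runs}
    return {
        "runs": runs,
        "stale_forever": runs - len(observed),
        "median_ms": statistics.median(observed),
        "p99_ms": observed[int(0.99 * (len(observed) - 1))],
        "max_ms": observed[-1],
    }
```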

Expressing your consistency and availability needs in business terms is essential if you want to arrive at a decent solution. Evaluating your options without being able to weigh the business value of increased availability against the development costs that eventual consistency may add can lead you astray. Fixating on perfect consistency can also lead you astray: although perfect consistency might be good for your business, its value has to be weighed against the business impact of reduced availability. Your own tests (and others', e.g. benchmark tests) can also help you reason about how much performance and availability are affected by the different options.

The last point (no 5) is very important to remember when you are using a cloud based solution, since most vendors change their hardware configuration, software configuration and software implementation from time to time. When cloud vendors make these changes you may be presented with new options, but some changes might just be carried out without giving you any new options; you simply get a "better service", meaning that you may have to go back to testing again. At the end of the day, re-running your tests on a regular basis might be your best option. That way you do not have to worry about missing out on important vendor updates.
