Summary:
Dynamo is a highly available key-value storage engine that Amazon's core services use to provide an "always-on" experience.
It is achieved by extensive use of object versioning and application-assisted conflict resolution
ACID properties:
Dynamo targets applications that operate with weaker consistency if this results in high availability. Dynamo does not provide any isolation guarantees and permits only single key updates.
First time knowing about service level agreement (SLA). It is a contract where client (internal user of Dynamo) and a service provider agree on serveral system-related characteristics. An example of aa simple SLA is a service guarateeing that it will provide a response within 300ms for 99.9% of its requests for a peak client load of 500 requests per second.
One of the main design considerations for Dynamo is to give services control over their system properties, such as durability and consistency, and to let services make their own tradeoffs between functionality, performance and cost-effectiveness.
Dynamo is designed to be an eventually consistent data store. That is all updates reach all replicas eventually.
An important design consideration is to decide when to perform the process of resolving update conflicts. Dynamo targets the design space of an "always writeable" data store (i.e., a data store that is highly available for writes)
Since Amazon needs to make sure when user checkout, the writes is always availble (not rejected on customer side). This design consideration makes lots of sense.
Who performs the process of conflict?
This can be done by the data store or the application. On data store side, the choice are limited. On application side, they can decide on the conflict resolution method that is best suited for its client's experience. For instance, the application that maintains customer shopping carts can choose to "merge" the conflicting versions and return a single unified shopping cart.
Compare to P2P storage systems, Dynamo does not focus on the problem of data integrity and security because it is built for a trusted environment. And it is only storing key-value pairs compare to Big-Table where multi-dimensional sorted map is stored.
So this allows Dynamo focus on high availability where updates are not rejected even in the wake of network partitions or server failures.
Dynamo using consistent hashing to allow scale incrementally. In consistent hashing, the output of a hash function is treated as a fixed circular space or "ring" (i.e. the largest hash value wraps around to the smallest hash value). Each node in the system is assigned a random value within this apce which represents its "position" on the ring. Thus, each node becomes responsible for the region in the ring between it and its predecessor node on the ring. The principle advantage of consistent hashing is that departure or arrival of a node only affects its immediate neighbors and other nodes remain unaffected (follow incremental scalbility design)
But it's not perfect, due to random position, it could cause non-uniform data and load distribution. To address these issues, Dynamo uses a variant of consistent hashing:
instead of mapping a node to a single point in the circle, each node gets assigned to multiple points in the ring. To this end, Dynamo uses the concept of “virtual nodes”. A virtual node looks like a single node in the system, but each node can be responsible for more than one virtual node. Effectively, when a new node is added to the system, it is assigned multiple positions (henceforth, “tokens”) in the ring.
Version control
Dynamo uses vector clocks [12] in order to capture causality between different versions of the same object. A vector clock is effectively a list of (node, counter) pairs.
Vector clock provides eventual consistensy so update rates could very (if it is high Dynamo is still able to write)
Dynamo uses a consistency protocol similar to those used in quorum systems.
This protocol has two key configurable values: R and W. R is the minimum number of nodes that must participate in a successful read operation. W is the minimum number of nodes that must participate in a successful write operation.
Setting R and W such that R + W > N yields a quorum-like system.
N is highest-ranked reachable nodes.
Implementation
Dynamo uses Berkeley Databse (BDB) Transactional Data Store, BDB Java Edition, MySQL. The main reason for designing a pluggable persistence component is to choose the storage engine best suited for an application's access patterns. For exmaple, BDB can handle objects in order of tens of kilobytes where MySQL can handle larger sizes.
The majority of Dynamo's production instances use BDB Transactional Data Store.
沒想到BDB也有開源版本……看來大部分軟件都是有源頭的...
All communications are implemented using Java NIO channels.
Fine tuning
The main advantage of Dynamo is that its client applications can tune the values of N, R and W to achieve their desired levels of performance, availability and durability. For instance, the value of N determines the durability of each object. A typical value of N used by Dynamo’s users is 3.