Clock-Bound Wait



    Consider a key-value store where values are stored with a timestamp
    to mark each version. Any cluster node that handles the client request
    will be able to read the latest version using the current timestamp
    at the request processing node.

    In the following example, the value 'Before Dawn' is updated
    to the value 'After Dawn' at time 2, as per Green's clock.
    Both Alice and Bob attempt to read the latest value for 'name'.
    While Alice's request is processed by cluster node Amber, Bob's request is
    processed by cluster node Blue.
    Amber has its clock lagging at 1, which means that
    when Alice reads the latest value, it returns the value 'Before Dawn'.
    Blue has its clock at 2; when Bob reads the latest value,
    it returns the value 'After Dawn'.

    This violates a consistency guarantee known as external consistency.
    If Alice and Bob now make a phone call, Alice will be confused; Bob will
    say that the latest value is 'After Dawn', while her cluster node is
    showing 'Before Dawn'.

    The same is true if Green's clock is fast and the writes happen in the 'future'
    compared to Amber's clock.

    This is a problem if the system timestamp is used as a version for storing values,
    because wall clocks are not monotonic.
    Clock values from two different servers cannot and should not be compared.
    When Hybrid Clock is used as a version in
    Versioned Value, it allows values to be ordered
    on a single server as well as on different servers which
    are causally related.
    However, Hybrid Clocks (or any Lamport Clock based clocks)
    can only give partial order.
    This means that any values which are not causally related and stored by
    two different clients across different nodes cannot be ordered.
    This creates a problem when using a timestamp to read the
    values across cluster nodes.
    If the read request originates on cluster nodes with lagging clocks,
    it probably won't be able to read the most recent versions of
    given values.


    Cluster nodes wait until the clock values
    on every node in the cluster are guaranteed to be above the timestamp
    assigned to the value while reading or writing.

    If the difference between clocks is very small,
    write requests can wait without adding a great deal of overhead.
    For example, assume the maximum clock offset across cluster nodes is 10ms.
    (This means that, at any given point in time,
    the slowest clock in the cluster is lagging behind at t - 10ms.)
    To guarantee that every other cluster node has its clock set past t,
    the cluster node that handles any write operation
    must wait until t + 10ms before storing the value.
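    To make the arithmetic concrete, here is a minimal sketch of that wait rule. The class name, the helper methods, and the 10ms constant are illustrative assumptions, not part of any real store:

```java
// Minimal sketch of the commit-wait rule: a write stamped at t may only
// become visible once every clock in the cluster has passed t.
public class CommitWait {
    static final long MAX_OFFSET_MS = 10; // assumed maximum clock offset

    // The earliest moment the write may be made visible.
    static long earliestVisibleAt(long writeTimestampMs) {
        return writeTimestampMs + MAX_OFFSET_MS;
    }

    // How much longer the writing node must wait, given its local clock.
    static long remainingWaitMs(long writeTimestampMs, long nowMs) {
        return Math.max(0, earliestVisibleAt(writeTimestampMs) - nowMs);
    }

    public static void main(String[] args) {
        // A write stamped at t=100 becomes visible at t=110; if the local
        // clock already reads 104, the node still waits 6ms.
        System.out.println(remainingWaitMs(100, 104)); // prints 6
    }
}
```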

    Consider a key-value store with Versioned Value where
    every update is added as a new value, with a timestamp used as a version.
    In the Alice and Bob example mentioned above, the write operation
    storing name@2 will wait until all the clocks in the cluster are at 2.
    This makes sure that Alice will always see the latest value of the name
    even if the clock on Alice's cluster node is lagging behind.

    Consider a slightly different scenario.
    Philip is updating the name to 'After Dawn'. Green's clock has its
    time at 2. But Green knows that there might be a server with a clock
    lagging behind by up to 1 unit. It will therefore have to
    wait in the write operation for a duration of 1 unit.

    While Philip is updating the name, Bob's read request is handled
    by server Blue. Blue's clock is at 2, so it tries to read the name at
    timestamp 2. At this point Green has not yet made the value available.
    This means Bob gets the value at the highest timestamp lower than 2,
    which is 'Before Dawn'.

    Alice's read request is handled
    by server Amber. Amber's clock is at 1, so it tries to read the name at
    timestamp 1. Alice gets the value 'Before Dawn'.

    Once Philip's write request completes, after the wait of max_diff is over,
    if Bob now sends a new read request, server Blue will try to read the latest
    value according to its clock (which has advanced to 3); this will return
    the value 'After Dawn'.

    If Alice initiates a new read request, server Amber will try to read the
    latest value as per its clock, which is now at 2. It will therefore
    also return the value 'After Dawn'.

    The main difficulty when attempting to implement this solution is that
    getting the exact time difference across cluster nodes
    is not possible with the date/time hardware and operating system APIs
    that are currently available.
    Such is the nature of the challenge that Google has its own specialized
    date-time API called True Time.
    Similarly, Amazon has
    AWS Time Sync Service and a library called ClockBound.
    However, these APIs are very specific to Google and Amazon,
    so they can't really be scaled beyond the confines of those organizations.

    Typically key-value stores use Hybrid Clock to
    implement Versioned Value.
    While it is not possible to get the exact difference between clocks,
    a sensible default value can be chosen based
    on historical observations.
    The observed values for maximum clock drift on servers across
    datacenters are generally 200 to 500ms.

    The key-value store waits for the configured max-offset before storing the value.

    class KVStore…

      int maxOffset = 200;
      NavigableMap<HybridClockKey, String> kv = new ConcurrentSkipListMap<>();

      public void put(String key, String value) {
          HybridTimestamp writeTimestamp = clock.now();
          waitTillSlowestClockCatchesUp(writeTimestamp);
          kv.put(new HybridClockKey(key, writeTimestamp), value);
      }

      private void waitTillSlowestClockCatchesUp(HybridTimestamp writeTimestamp) {
          var waitUntilTimestamp = writeTimestamp.add(maxOffset, 0);
          sleepUntil(waitUntilTimestamp);
      }

      private void sleepUntil(HybridTimestamp waitUntil) {
          HybridTimestamp now = clock.now();
          while (now.before(waitUntil)) {
              var waitTime = waitUntil.getWallClockTime() - now.getWallClockTime();
              Uninterruptibles.sleepUninterruptibly(waitTime, TimeUnit.MILLISECONDS);
              now = clock.now();
          }
      }

      public String get(String key, HybridTimestamp readTimestamp) {
          return kv.get(new HybridClockKey(key, readTimestamp));
      }

    Read Restart

    200ms is too high an interval to wait for every write request.
    That is why databases like CockroachDB or YugabyteDB
    implement a check in the read requests instead.

    While serving a read request, cluster nodes check if there is a version
    available in the interval of readTimestamp and readTimestamp + maximum clock drift.
    If the version is available, assuming the reader's clock might be lagging,
    the reader is then asked to restart the read request with that version.

    class KVStore…

      public void put(String key, String value) {
          HybridTimestamp writeTimestamp = clock.now();
          kv.put(new HybridClockKey(key, writeTimestamp), value);
      }

      public String get(String key, HybridTimestamp readTimestamp) {
          checksIfVersionInUncertaintyInterval(key, readTimestamp);
          return kv.floorEntry(new HybridClockKey(key, readTimestamp)).getValue();
      }

      private void checksIfVersionInUncertaintyInterval(String key, HybridTimestamp readTimestamp) {
          HybridTimestamp uncertaintyLimit = readTimestamp.add(maxOffset, 0);
          HybridClockKey versionedKey = kv.floorKey(new HybridClockKey(key, uncertaintyLimit));
          if (versionedKey == null) {
              return;
          }
          HybridTimestamp maxVersionBelowUncertainty = versionedKey.getVersion();
          if (maxVersionBelowUncertainty.after(readTimestamp)) {
              throw new ReadRestartException(readTimestamp, maxOffset, maxVersionBelowUncertainty);
          }
      }

    class Client…

      String read(String key) {
          int attemptNo = 1;
          int maxAttempts = 5;
          while (attemptNo < maxAttempts) {
              try {
                  HybridTimestamp now = clock.now();
                  return kvStore.get(key, now);
              } catch (ReadRestartException e) {
                  logger.info("Got read restart error " + e + " Attempt No. " + attemptNo);
                  Uninterruptibles.sleepUninterruptibly(e.getMaxOffset(), TimeUnit.MILLISECONDS);
                  attemptNo++;
              }
          }
          throw new ReadTimeoutException("Unable to read after " + attemptNo + " attempts.");
      }

    In the Alice and Bob example above, if there is a version for 'name'
    available at timestamp 2, and Alice sends a read request with read timestamp 1,
    a ReadRestartException will be thrown asking Alice to restart the read request
    at readTimestamp 2.

    Read restarts only happen if there is a version written in the
    uncertainty interval. Write requests do not need to wait.

    It's important to remember that the configured value for maximum clock drift
    is an assumption; it is not guaranteed. In some cases,
    a bad server can have a clock drift higher than the assumed value. In such cases,
    the problem will persist.

    Using Clock Bound APIs

    Cloud providers like Google and Amazon implement clock machinery with
    atomic clocks and GPS to make sure that the clock drift across cluster nodes
    is kept below a few milliseconds. As we have just discussed, Google has
    True Time. AWS has
    AWS Time Sync Service and ClockBound.

    There are two key requirements for cluster nodes to make sure these waits
    are implemented correctly.

    • The clock drift across cluster nodes is kept to a minimum.
      Google's True-Time keeps it below 1ms in most cases (7ms in the worst cases)
    • The possible clock drift is always
      available in the date-time API; this ensures programmers don't need
      to guess the value.

    The clock machinery on cluster nodes computes error bounds for
    date-time values. Considering there is a possible error in timestamps
    returned by the local system clock, the API makes the error explicit.
    It will give the lower as well as the upper bound on clock values.
    The real time value is guaranteed to be within this interval.

    public class ClockBound {
        public final long earliest;
        public final long latest;

        public ClockBound(long earliest, long latest) {
            this.earliest = earliest;
            this.latest = latest;
        }

        public boolean before(long timestamp) {
            return timestamp < earliest;
        }

        public boolean after(long timestamp) {
            return timestamp > latest;
        }
    }

    As explained in this AWS blog, the error is
    calculated at each cluster node as ClockErrorBound.
    The real time values will always be somewhere between
    local clock time +- ClockErrorBound.

    The error bounds are returned whenever date-time
    values are asked for.

    public ClockBound now() {
        return now;
    }

    There are two properties guaranteed by the clock-bound API:

    • Clock bounds should overlap across cluster nodes
    • For two time values t1 and t2, if t1 is less than t2,
      then clock_bound(t1).earliest is less than clock_bound(t2).latest
      across all cluster nodes
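    A small, made-up illustration of the first property, using the same earliest/latest shape as the ClockBound class above (the numbers and the `overlaps` helper are assumptions for this sketch):

```java
// Illustrates the overlap property: since every node's bound contains
// the real time, any two bounds taken at the same instant must overlap.
public class ClockBoundProperties {
    // Two intervals [e1, l1] and [e2, l2] overlap if each starts
    // before the other ends.
    static boolean overlaps(long earliest1, long latest1,
                            long earliest2, long latest2) {
        return earliest1 <= latest2 && earliest2 <= latest1;
    }

    public static void main(String[] args) {
        // At real time 4, nodes with error bounds +-1, +-2 and +-3 report:
        long greenEarliest = 3, greenLatest = 5;
        long blueEarliest = 2, blueLatest = 6;
        long orangeEarliest = 1, orangeLatest = 7;
        // All three bounds contain the real time, so they pairwise overlap.
        System.out.println(overlaps(greenEarliest, greenLatest, blueEarliest, blueLatest));   // prints true
        System.out.println(overlaps(greenEarliest, greenLatest, orangeEarliest, orangeLatest)); // prints true
    }
}
```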

    Imagine we have three cluster nodes: Green, Blue and Orange.
    Each node might have a different error bound.
    Let's say the error on Green is 1, Blue is 2 and Orange is 3. At time=4,
    the clock bound across cluster nodes will look like this:

    In this scenario, two rules need to be followed to implement the commit-wait.

    • For any write operation, the clock bound's latest value
      should be picked as the timestamp.
      This will make sure that it is always higher than any timestamp assigned
      to previous write operations (considering the second rule below).
    • The system must wait until the write timestamp is less than
      the clock bound's earliest value, before storing the value.

      This is because the earliest value is guaranteed to be lower than
      the clock bound's latest values across all cluster nodes.
      This write operation will be accessible
      to anyone reading with the clock bound's latest value in the future. Also,
      this value is guaranteed to be ordered before any other write operations
      that happen in the future.

    class KVStore…

      public void put(String key, String value) {
          ClockBound now = clock.now();
          long writeTimestamp = now.latest;
          addPending(writeTimestamp);
          waitUntilTimeInPast(writeTimestamp);
          kv.put(new VersionedKey(key, writeTimestamp), value);
          removePending(writeTimestamp);
      }

      private void waitUntilTimeInPast(long writeTimestamp) {
          ClockBound now = clock.now();
          while (now.earliest < writeTimestamp) {
              Uninterruptibles.sleepUninterruptibly(writeTimestamp - now.earliest, TimeUnit.MILLISECONDS);
              now = clock.now();
          }
      }

      private void removePending(long writeTimestamp) {
          lock.lock();
          try {
              pendingWriteTimestamps.remove(writeTimestamp);
              cond.signalAll();
          } finally {
              lock.unlock();
          }
      }

      private void addPending(long writeTimestamp) {
          lock.lock();
          try {
              pendingWriteTimestamps.add(writeTimestamp);
          } finally {
              lock.unlock();
          }
      }

    If we go back to the Alice and Bob example above, when the value for
    'name', 'After Dawn', is written by Philip on server Green,
    the put operation on Green waits until the chosen write timestamp is
    below the earliest value of the clock bound.
    This ensures that every other cluster node
    is guaranteed to have a higher timestamp for the latest value of the
    clock bound.
    For instance, consider this scenario. Green has an error bound of
    +-1. So, with a put operation which starts at time 4,
    when it stores the value, Green will pick up the latest value of the clock
    bound, which is 5. It then waits until the earliest value of the clock
    bound is higher than 5. Essentially, Green waits for the uncertainty
    interval before actually storing the value in the key-value store.
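    A small simulation of that wait, using the numbers from the example (the class and method names are made up for illustration, and the clock is simulated by stepping one time unit per iteration):

```java
// Simulated trace of Green's commit-wait with an error bound of +-1.
public class GreenCommitWait {
    static final long ERROR_BOUND = 1; // Green's clock error is +-1

    static long earliest(long realTime) { return realTime - ERROR_BOUND; }
    static long latest(long realTime)   { return realTime + ERROR_BOUND; }

    // Duration from picking latest(t) as the write timestamp until
    // the clock bound's earliest value passes that timestamp.
    static long commitWaitDuration(long startRealTime) {
        long writeTimestamp = latest(startRealTime);
        long realTime = startRealTime;
        while (earliest(realTime) <= writeTimestamp) {
            realTime++; // simulate the passage of one time unit
        }
        return realTime - startRealTime;
    }

    public static void main(String[] args) {
        // Put begins at real time 4: write timestamp = latest(4) = 5,
        // and Green waits until earliest(realTime) > 5, i.e. realTime = 7.
        System.out.println(commitWaitDuration(4)); // prints 3
    }
}
```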

    When the value is made available in the key-value store,
    the clock bound's latest value is guaranteed to be higher than 5
    on every cluster node.
    This means Bob's request handled by Blue, as well as Alice's request
    handled by Amber, are guaranteed to get the latest value of the name.

    We will get the same result if Green has 'wider' time bounds.
    The greater the error bound, the longer the wait. If Green's error bound
    is the maximum, it will continue to wait before making the values available in
    the key-value store. Neither Amber nor Blue will be able to get
    the value until their latest time value is past 7. When Alice gets the
    newest value of the name at latest time 7,
    every other cluster node will be guaranteed to get it at its latest time value.


    When reading the value, the client will always pick the maximum value
    from the clock bound from its cluster node.

    The cluster node receiving the request needs to make sure that once
    a response is returned at the specific request timestamp, there are
    no values written at that timestamp or a lower timestamp.

    If the timestamp in the request is higher than the
    timestamp at the server, the cluster node will wait until
    the clock catches up,
    before returning the response.

    It will then check if there are any pending write requests at a lower timestamp
    which are not yet stored. If there are, then the
    read requests will pause until the requests are complete.

    The server will then read the values at the request timestamp and return the value.
    This ensures that once a response is returned at a particular timestamp,
    no values will ever be written at a lower timestamp.
    This guarantee is called Snapshot Isolation.

    class KVStore…

      final Lock lock = new ReentrantLock();
      Queue<Long> pendingWriteTimestamps = new ArrayDeque<>();
      final Condition cond = lock.newCondition();

      public Optional<String> read(long readTimestamp) {
          waitUntilTimeInPast(readTimestamp);
          waitForPendingWrites(readTimestamp);
          Optional<VersionedKey> max = kv.keySet().stream().max(Comparator.naturalOrder());
          if (max.isPresent()) {
              return Optional.of(kv.get(max.get()));
          }
          return Optional.empty();
      }

      private void waitForPendingWrites(long readTimestamp) {
          lock.lock();
          try {
              while (pendingWriteTimestamps.stream().anyMatch(ts -> ts <= readTimestamp)) {
                  cond.awaitUninterruptibly();
              }
          } finally {
              lock.unlock();
          }
      }

    Consider this final scenario: Alice's read request is handled by
    server Amber with an error bound of 3. It picks up the latest time as 7 to
    read the name. Meanwhile, Philip's write request is handled by Green
    (with an error bound of +-1); it picks up 5 to store the value.
    Alice's read request waits until the earliest time at Green is past 7
    and the pending write request is complete. It then returns the latest value with
    a timestamp below 7.

