Time is something most of us gloss over without much thought. In modern distributed systems like Cassandra and ZooKeeper, however, time is incredibly important: we need the clocks on all our nodes synced to within milliseconds of each other.
So what happens when the clocks drift apart? Hell could freeze over! When time is out of sync between your nodes, you WILL end up with havoc at some point. Our biggest fear was newer writes to Cassandra being overwritten by older ones because the node clocks disagreed, but there are hundreds of other edge cases that can screw you over.
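Cassandra resolves conflicting writes to the same column with last-write-wins on the write timestamp, and that timestamp comes from the clock of whichever client or coordinator issued the write. The minimal sketch below models only that comparison. The node names, values and the 50 ms skew are made-up illustration values, and Cassandra's own tie-breaking rule is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    value: str
    timestamp_us: int  # write timestamp in microseconds, from the writing node's clock


def stamp(real_time_us: int, clock_skew_us: int) -> int:
    """Timestamp a node assigns when its clock is off from true time by clock_skew_us."""
    return real_time_us + clock_skew_us


def merge(a: Cell, b: Cell) -> Cell:
    """Last-write-wins: keep the cell with the larger timestamp (the first one on a tie)."""
    return a if a.timestamp_us >= b.timestamp_us else b


# Node B's clock runs 50 ms fast; node A's clock is correct.
older_write = Cell("old-email@example.com", stamp(1_000_000, clock_skew_us=50_000))
# 20 ms later in real time, node A writes the newer value.
newer_write = Cell("new-email@example.com", stamp(1_020_000, clock_skew_us=0))

winner = merge(older_write, newer_write)
print(winner.value)  # old-email@example.com: the newer write is silently lost
```

No error is raised anywhere. The newer value simply loses the timestamp comparison, which is why this kind of clock skew is so hard to spot.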
Well, you might just say "use NTP", and that is on the right track, but it is more complicated than just installing the NTP package.
Syncing every node directly with external NTP pools is unreliable because there is too much jitter. If a node picks up a bad upstream server, that one node's time is off and you have lost your consistency. The goal of this entire project was to have a single true time for our entire network, with all our nodes in sync with that time.
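To see why long, jittery paths to public servers hurt, it helps to look at how NTP estimates a clock's offset. It uses four timestamps and the on-wire calculation from the NTPv4 specification (RFC 5905), and that calculation assumes the outbound and return network paths take equally long. The sketch below applies the formula to two made-up exchanges with the same round-trip time. One has a symmetric path and one does not.

```python
from typing import Tuple


def ntp_offset_and_delay(t1: float, t2: float, t3: float, t4: float) -> Tuple[float, float]:
    """On-wire NTP offset/delay calculation.

    t1: client transmit time (client clock)
    t2: server receive time (server clock)
    t3: server transmit time (server clock)
    t4: client receive time (client clock)
    Returns (offset, delay): estimated server-minus-client offset and round-trip delay.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


# The server clock is really 5 ms ahead of the client; the server holds the request 1 ms.
# Symmetric path, 20 ms each way: the estimate is exact.
print(ntp_offset_and_delay(0, 25, 26, 41))  # (5.0, 40)
# Asymmetric path, 10 ms out and 30 ms back: same round trip, offset wrong by 10 ms.
print(ntp_offset_and_delay(0, 15, 16, 41))  # (-5.0, 40)
```

The error is half the difference between the outbound and return delays, so it can be as large as half the round-trip time. Paths that are short and stable, like those inside a single network, keep both the delay and that error small.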
What we did…
We built our own private NTP server cluster on 3 existing nodes. One node is the master, synced with the Amazon NTP pool. The other 2 provide HA and sync directly from the master.
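Here is one way to express the three roles as ntp.conf contents. This is a minimal sketch and not the template our cookbook actually renders. The hostnames are hypothetical, and the directives used (`driftfile`, `server`, `iburst`, `prefer`) are standard ntpd ones. A production config would usually also add `restrict` access-control lines, which are left out here.

```python
# Hypothetical internal hostnames for the three NTP servers.
MASTER = "ntp-master.internal"
STANDBYS = ["ntp-standby-1.internal", "ntp-standby-2.internal"]
UPSTREAM_POOL = [f"{i}.amazon.pool.ntp.org" for i in range(4)]


def render_ntp_conf(role: str) -> str:
    """Build ntp.conf text for a 'master', 'standby' or 'client' node."""
    lines = ["driftfile /var/lib/ntp/ntp.drift"]
    if role == "master":
        # Only the master talks to the public pool.
        lines += [f"server {host} iburst" for host in UPSTREAM_POOL]
    elif role == "standby":
        # Standbys sync directly from the master.
        lines.append(f"server {MASTER} iburst prefer")
    elif role == "client":
        # Clients prefer the master and also know about both standbys.
        lines.append(f"server {MASTER} iburst prefer")
        lines += [f"server {host} iburst" for host in STANDBYS]
    else:
        raise ValueError(f"unknown role: {role!r}")
    return "\n".join(lines) + "\n"


print(render_ntp_conf("client"))
```

This keeps the public pool at a single point in the topology. Only the master depends on it, and everything else inherits the master's time.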
Even better, these nodes can be swapped out for Raspberry Pis with GPS modules. The GPS receiver acts as a Stratum 0 reference clock, which makes the Pi a Stratum 1 server. That's great if you want to cut out public NTP pools completely.
Every other node in the network (app, db, etc.) gets its time directly from the master, or from the standby slaves if the master fails. This means we have a single true time, set by the master, that is in sync with wall-clock time and consistent within a few microseconds across the network.
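ntpd does not literally fail over in list order. It polls every configured server and runs its own selection and clustering algorithms, and `prefer` biases the result toward the master. From a client's point of view, the intended behavior can still be modelled as a priority list. The sketch below is that simplified model, with hypothetical hostnames and a caller-supplied reachability check. It is not ntpd's real algorithm.

```python
from typing import Callable, Optional, Sequence

# Hypothetical hostnames, in priority order: master first, then the standbys.
PRIORITY = ["ntp-master.internal", "ntp-standby-1.internal", "ntp-standby-2.internal"]


def pick_time_source(sources: Sequence[str], is_reachable: Callable[[str], bool]) -> Optional[str]:
    """Return the first reachable source in priority order, or None if none answer."""
    for host in sources:
        if is_reachable(host):
            return host
    return None


down = {"ntp-master.internal"}  # simulate a failed master
print(pick_time_source(PRIORITY, lambda host: host not in down))  # ntp-standby-1.internal
```

The standbys take their own time from the master, so a client that falls back to one of them keeps tracking the same timeline.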
The Ubuntu NTP pool sucks. Change your ntp.conf to use a more stable pool like 0.amazon.pool.ntp.org.
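If you want to make that change in a script rather than by hand, the sketch below is one way to do it. It assumes the stock file lists the Ubuntu pool as `server` or `pool` lines whose host ends in `ubuntu.pool.ntp.org`. It swaps those lines for the numbered Amazon pool servers and leaves every other line untouched, including any fallback such as `ntp.ubuntu.com`.

```python
import re

# Matches "server 0.ubuntu.pool.ntp.org ..." or "pool 0.ubuntu.pool.ntp.org ..." lines.
UBUNTU_POOL = re.compile(r"^\s*(server|pool)\s+\S*ubuntu\.pool\.ntp\.org\b.*$")
AMAZON_POOL = [f"{i}.amazon.pool.ntp.org" for i in range(4)]


def swap_to_amazon_pool(conf_text: str) -> str:
    """Replace Ubuntu pool entries in ntp.conf text with Amazon pool servers."""
    out = []
    inserted = False
    for line in conf_text.splitlines():
        if UBUNTU_POOL.match(line):
            if not inserted:
                # Put the Amazon servers where the first Ubuntu entry used to be.
                out += [f"server {host} iburst" for host in AMAZON_POOL]
                inserted = True
            continue  # drop the Ubuntu pool line itself
        out.append(line)
    return "\n".join(out) + "\n"


sample = (
    "driftfile /var/lib/ntp/ntp.drift\n"
    "server 0.ubuntu.pool.ntp.org\n"
    "server 1.ubuntu.pool.ntp.org\n"
    "server ntp.ubuntu.com\n"
)
print(swap_to_amazon_pool(sample))
```

ntpd only reads ntp.conf at startup, so the ntp service needs a restart after the file is rewritten.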
So we built all of this into a Chef cookbook that automatically manages the master/slave election, client and server configuration, and NTP configuration. It's open source on GitHub: https://github.com/evertrue/ntp_cluster
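To illustrate what a master/slave election can look like, here is a sketch of one simple approach: a deterministic election. This is a hypothetical example and not a description of how the ntp_cluster cookbook decides. It assumes every node can see the same list of candidate servers and their health, for example from a Chef search. Each node then independently sorts the healthy candidates and picks the first one as master.

```python
from typing import Iterable, List, Tuple


def elect(candidates: Iterable[str], healthy: Iterable[str]) -> Tuple[str, List[str]]:
    """Deterministically pick (master, standbys) from the healthy candidates.

    Every node that runs this with the same inputs reaches the same answer,
    so no extra coordination round is needed.
    """
    alive = sorted(set(candidates) & set(healthy))
    if not alive:
        raise RuntimeError("no healthy NTP server candidates")
    return alive[0], alive[1:]


servers = ["ntp-b", "ntp-a", "ntp-c"]  # hypothetical node names
print(elect(servers, healthy=servers))             # ('ntp-a', ['ntp-b', 'ntp-c'])
print(elect(servers, healthy=["ntp-b", "ntp-c"]))  # ('ntp-b', ['ntp-c'])
```

The weak point of this approach is the shared view of health. If two nodes disagree about which servers are alive, they can elect different masters, so the inputs need to come from one consistent source.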
If you are interested in using it, please let me know. I will be happy to help get you going.