No Single Points of Failure

written by Isaac Wong
16 Jun 2014

The no single point of failure design principle asserts simply that no single part of a system can stop the entire system from working. For example, in our Electronic Data Capture product, Rave, the database server is a single point of failure. If it crashes, we cannot continue to serve clients in any fashion. But if we kept all of Rave in cache, we could continue to operate in read-only mode. Has the system stopped working? Some functionality has definitely failed, but some business processes could still progress.

It is crucial to understand our patients’ and sponsors’ critical operations and first secure those from single points of failure. For example, a failure mode for our Single Sign-On when the database server crashes could be reading from cache. While this does stop sponsors from creating studies, it does allow them to navigate to our Randomization and Trial Supplies Management application, Balance, and unblind. Unblinding is far more critical than study enrollment.
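
To make that cache-backed failure mode concrete, here is a minimal sketch in Python. Every name in it (PrimaryStore, CachedService, DatabaseDown, ReadOnlyMode) is a hypothetical stand-in invented for illustration; it is not how Rave or our Single Sign-On is actually built. It also assumes that slightly stale reads are acceptable during an outage.

```python
class DatabaseDown(Exception):
    """Raised by the (hypothetical) primary store when it is unreachable."""


class ReadOnlyMode(Exception):
    """Raised when a write is attempted while the primary store is down."""


class PrimaryStore:
    """Stand-in for the database server; `available` simulates a crash."""

    def __init__(self):
        self.rows = {}
        self.available = True

    def get(self, key):
        if not self.available:
            raise DatabaseDown(key)
        return self.rows[key]

    def put(self, key, value):
        if not self.available:
            raise DatabaseDown(key)
        self.rows[key] = value


class CachedService:
    """Serves reads from cache when the primary store fails; refuses writes."""

    def __init__(self, store):
        self.store = store
        self.cache = {}

    def read(self, key):
        try:
            value = self.store.get(key)
        except DatabaseDown:
            # Degraded mode: possibly stale, but critical reads keep working.
            if key in self.cache:
                return self.cache[key]
            raise
        self.cache[key] = value  # remember the last good answer
        return value

    def write(self, key, value):
        try:
            self.store.put(key, value)
        except DatabaseDown as exc:
            raise ReadOnlyMode("writes are disabled while the database is down") from exc
        self.cache[key] = value
```

With this shape, a sponsor whose permissions were read or written before the crash can still be looked up, while an attempt to create a study fails loudly with ReadOnlyMode instead of hanging. A key that was never cached still fails, which is why deciding what to keep warm in cache is part of deciding what is critical.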

We first secure life-threatening issues, and then iteratively work to secure the next most crucial business process, and so on, until no single point of failure remains in the Medidata Clinical Cloud.

The Cost of Failure

Before delving into the details, I want to speak about why we care about this principle. Most businesses do not want single points of failure for obvious reasons. Amazon has calculated the revenue loss for each millisecond of downtime or latency, and it is not small. For Medidata the issue is more serious. Functionality like randomization is used in operating rooms, and unblinding can be a matter of life or death for a patient. We are not selling ads or providing profundity in 140 characters. Our uptime is directly related to a patient’s health. We take great responsibility in making our applications correct, timely, and available at all times. When you are in a hospital about to undergo a procedure, the Medidata logo on the doctor’s computer should be a symbol of calm and safety.

Single points of failure for a system include infrastructure, but I like to enlarge the definition to include the people and processes that, when they fail, cause the entire system to fail. Our definition will include everything needed to deliver and maintain business functionality. For Medidata, this includes not only the physical machines and infrastructure, but validation, testing, deployment, and the people and the processes that make all that happen. If we can’t validate, we can’t release and correct a bug. If we can’t get a code review approved, we can’t release and fix a critical issue. If no one knows how something works, then we can’t fix it. So let’s look at some single points of failure.

Types of Failure

Machine failures, physical or virtual, are the most obvious examples of single points of failure. Here is a quick list of potential issues:

Physical or virtual database servers.
Physical or virtual web or application nodes.
Physical or virtual web or application nodes that store data on the node’s file system or memory.
The physical machines that host the virtualized environments.

Other physical or virtual infrastructure that can be a single point of failure:

Routers.
Data centers.
Proxies.
Power.
Credit card used to sign up for a SaaS solution used in the platform.

Turning our attention to people and processes, we have the following examples of single points of failure:

Only one person has the knowledge or credentials to deploy to production.
Only one person knows how something in a product works.
Only one person has credentials to any supporting system for a product, for example the aggregated logs of an application.
Tests or validation can only be run by one person on one single physical machine.

What do you guarantee?

Now that we have looked at some examples of single points of failure, let’s look at how we can prevent them. Designing for fault tolerance starts at the very beginning. When designing your products, ask yourself:

What happens when the network partitions?
What happens if this machine fails?
What happens if any service dependency I have fails?
What stress levels can I tolerate in code and in people?
What happens if this line of code is executing and the machine fails? This module? Can I recover? What do I lose?
What happens if I fail in the middle of an upload?
Can I lose data if “this” fails?

From the beginning, identify what your system “guarantees”. For a file upload, your guarantee could be that once the file is uploaded to S3 we can survive failure in the subsequent processing of the file. So the UX would return once the file is in S3, and the processing would continue asynchronously. What do you optimize for? Amazon optimizes for taking your order at all times and in all chaos. Medidata optimizes for taking your Audit at all times and in all chaos, since in our regulated environment tracking the what, who, when and why of data change is an absolute requirement.
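
The upload guarantee can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our implementation: UploadService and its methods are hypothetical names, a plain dict stands in for S3, and an in-memory deque stands in for what would need to be a durable queue in practice.

```python
from collections import deque


class UploadService:
    """Guarantee: once accept() returns, the file survives processing failures."""

    def __init__(self):
        self.durable_store = {}  # stand-in for S3 (assumption)
        self.pending = deque()   # stand-in for a durable work queue (assumption)
        self.results = {}

    def accept(self, name, data):
        """Return as soon as the bytes are stored; processing happens later."""
        self.durable_store[name] = data
        self.pending.append(name)
        return "accepted"

    def process_next(self, process):
        """Run one queued job. On failure the job is re-queued, never dropped."""
        name = self.pending.popleft()
        try:
            self.results[name] = process(self.durable_store[name])
        except Exception:
            # The file is still in the store, so the work can be retried.
            self.pending.append(name)
            return False
        return True
```

The point of the split is that the user-facing call depends only on the store being available. A worker that crashes halfway through loses time, not data, because the job goes back on the queue and the original bytes were never touched.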

Improve your odds

Now that you have identified some single points of failure, here are some basic strategies for remediating them:

Use stateless web and application servers.
Load balancers are your friend.
If you are in AWS, use AWS tools like RDS Multi-AZ deployments, deployment across multiple Availability Zones, point-in-time restores, and read replicas.
For super-sensitive processes, make sure your data is multi-region.
Cache results of your dependencies so you can read from cache if the dependency goes down.
Use technologies that don’t have leaders and can dynamically add nodes (e.g. Cassandra).
Use technologies that have leaders and reliable election/consensus algorithms like Paxos.
Automate everything.
TEST! Have your own Chaos Monkey.
Train people.
Document everything.
Use 5 Whys to find the underlying process and root causes of a failure.
Document your disaster recovery (DR) plans. Test them. Run fire drills to make sure the human process works.
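
Several of these strategies (stateless servers, load balancers, and your own Chaos Monkey) can be exercised together in a small test. The sketch below is only a toy in the spirit of Netflix's Chaos Monkey, not that tool: Node, LoadBalancer, and kill_random_node are hypothetical names, and the "nodes" are in-process objects rather than real machines.

```python
import random


class Node:
    """A stateless server; any node can answer any request."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(self.name)
        return f"{self.name} served {request}"


class LoadBalancer:
    """Round-robin over the nodes, skipping any that fail."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.next_index = 0

    def handle(self, request):
        for _ in range(len(self.nodes)):
            node = self.nodes[self.next_index]
            self.next_index = (self.next_index + 1) % len(self.nodes)
            try:
                return node.handle(request)
            except ConnectionError:
                continue  # try the next node
        raise RuntimeError("no healthy nodes left")


def kill_random_node(nodes, rng):
    """Chaos step: take down one randomly chosen live node."""
    victim = rng.choice([n for n in nodes if n.alive])
    victim.alive = False
    return victim
```

A test built on this kills nodes one at a time and asserts that requests keep succeeding until the last node is gone. Because the nodes hold no state, losing one costs capacity but no data; a node that stored sessions in memory would fail the same test.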

Always think about what happens in the extremes, and confirm that what you think is an extreme really is extreme. What happens when you have more data than you thought you would? How will this work when I have twice as many log entries as now? Can I handle it? Be aware of the stress points of your system and how your system deforms when stress increases.

Another key piece in the no single point of failure quest is monitoring and instrumentation. We need to measure and instrument so we know when something fails, that the DR plan went into effect, how long it took for the DR plan to execute, and that the DR plan worked. We then need to know why something failed so we can fix it.
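
As a rough sketch of what that instrumentation might record, the Python below wraps a call to a primary dependency, falls back to a standby on failure, and logs the cause, whether recovery worked, and how long the failover took. FailoverMonitor and its event fields are hypothetical names chosen for illustration; a real system would ship these events to its metrics and alerting pipeline instead of keeping them in a list.

```python
import time


class FailoverMonitor:
    """Records why a call failed, whether failover worked, and how long it took."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock  # injectable so tests can control time
        self.events = []

    def call(self, primary, standby):
        try:
            return primary()
        except Exception as exc:
            started = self.clock()
            event = {"cause": repr(exc), "recovered": False}
            self.events.append(event)
            try:
                result = standby()
                event["recovered"] = True
                return result
            finally:
                # Recorded whether or not the standby succeeded.
                event["failover_seconds"] = self.clock() - started
```

Each event answers the questions in the paragraph above: something failed (an event exists), why it failed (cause), whether the plan worked (recovered), and how long it took (failover_seconds). If the standby also fails, its exception propagates to the caller and the event stays marked as not recovered.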

Failure is the Norm

In summary, failure is the normal mode of operation in the cloud. We need to plan for it. But having no single points of failure is also a requirement for elasticity. Single things don’t expand or shrink. We want lots of ephemeral resources that we can toss at problems such as scaling and reliability, and drain away when not needed. Single points of failure don’t meet that definition. This thinking also bleeds into our messaging plans. We are implementing the “smart edges, dumb routers” paradigm. We want our services to manage workflows that are often delegated to the brittle, non-agile, heavy chunks of middleware that our competitors like to sell. Smart edges and dumb routers lead to more loosely coupled, available, and evolvable systems, since we don’t care if we lose routers. We care about our edges (our services), and we go to great lengths to protect them. The internet was built on the idea of fate-sharing. So I close with this quote from one of our grey-beard fathers:

“The fate-sharing model suggests that it is acceptable to lose the state information associated with an entity if, at the same time, the entity itself is lost. Specifically, information about transport level synchronization is stored in the host which is attached to the net and using its communication service.” – David D. Clark

Or, as Mark Twain said:

“Put all your eggs in one basket and then watch that basket”

Note: That one basket is a logical basket. Your basket should be horizontally scalable for high availability with no single point of failure.