Chapter 1


In this chapter we looked at what it means to have a well-designed data system.
First off, a data system is any piece of software or process that deals with data, covering its storage, processing, and retrieval.
What is a well-designed data system?
Generally, it's a system that is:
Reliable
Scalable
Maintainable
Reliability
A reliable system is one that performs correctly and is tolerant of certain faults. A fault is when a component of the system deviates from its spec, and we can design systems to be fault-tolerant so that they cope with the kinds of faults we anticipate.
Faults can be due to hardware, software, or human error. A failure, by contrast, is when the system as a whole stops working and can no longer serve its purpose.
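One common way of coping with transient faults (say, a brief network glitch) is simply to retry the operation. Below is a minimal Python sketch of that idea; the function and parameter names are illustrative, not anything from the chapter.

```python
import random
import time

def with_retries(operation, max_attempts=3, base_delay=0.1):
    """Run a flaky operation, retrying with exponential backoff.

    A minimal sketch of one fault-tolerance technique: masking
    transient faults by retrying. `operation`, `max_attempts` and
    `base_delay` are illustrative names, not from the book."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except IOError:
            if attempt == max_attempts:
                raise  # the fault could not be masked, so it surfaces as a failure
            # wait a little longer after each attempt, with some jitter
            time.sleep(base_delay * (2 ** attempt) * random.random())

# usage: with_retries(lambda: flaky_network_call())
```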
Scalability
Scalability is a system's ability to cope with increased load. We describe load with a set of numbers known as load parameters.
As an example of load parameters, consider Twitter: a user posts a tweet, and that tweet should then appear in the home timeline of every one of their followers. So a single request to Twitter's POST tweet endpoint results in every follower seeing the tweet the next time they check their timeline.
The way they see it can be one of two:
The follower, when opening their timeline, queries the recent tweets of everyone they follow and merges them. More work is done at read time.
Or, when the tweet is posted, the cached home timeline of every follower is updated with the new tweet. More work is done at write time.
We can see that doing more work at write time creates fan-out: from one simple tweet, potentially millions of additional writes must go out to update the home timeline caches of the author's followers. Now imagine someone with millions of followers; that fan-out can noticeably slow down posting a tweet.
So in Twitter's case, the key load parameter is the number of followers a user has (more precisely, the distribution of followers per user).
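To make the two approaches concrete, here is a purely illustrative Python sketch of both; the in-memory dictionaries stand in for Twitter's real tables and caches, and all the names are my own.

```python
from collections import defaultdict

# Toy in-memory stores; stand-ins for real tables and caches.
follows = defaultdict(set)         # user -> users they follow
followers = defaultdict(set)       # user -> users who follow them
tweets = defaultdict(list)         # user -> tweets they have posted
home_timeline = defaultdict(list)  # user -> cached home timeline

def post_tweet_fan_out_on_write(author, text):
    """Do the work at write time: one tweet triggers a write into
    every follower's cached timeline. Reads are cheap, but a user
    with millions of followers causes millions of cache updates."""
    tweets[author].append(text)
    for follower in followers[author]:
        home_timeline[follower].append((author, text))

def read_timeline_fan_out_on_read(user):
    """Do the work at read time: gather the tweets of everyone the
    user follows on demand. Writes are cheap, but every timeline
    load has to query many users."""
    return [(author, text) for author in follows[user] for text in tweets[author]]
```

The trade-off is visible directly here: the first function moves the cost onto the comparatively rare write, the second onto the far more frequent read.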
When we find these parameters, the following should be considered:
What happens when we increase the load but the system's resources remain unchanged?
How much do we need to increase the system's resources to keep performance unchanged when the load increases?
To answer these questions we need to measure performance. For a web application it's best to look at the distribution of response times, for example using percentiles. To calculate a percentile we order all the response times from smallest to largest. The midpoint is the 50th percentile, or p50 (the median), and it tells us that half of the requests were served in less than or equal to that response time.
Percentiles tell us more about our data than an average does: with the mean, a single very slow request can skew the number, whereas percentiles show how many users actually experienced a given response time.
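As a concrete illustration, here is a small Python sketch of the nearest-rank percentile calculation described above; the sample response times are made up.

```python
def percentile(response_times_ms, p):
    """Return the p-th percentile of a list of response times (ms).

    A minimal nearest-rank sketch; real monitoring systems usually
    use streaming approximations rather than sorting every sample."""
    ordered = sorted(response_times_ms)
    rank = max(1, round(p / 100 * len(ordered)))  # nearest-rank index
    return ordered[rank - 1]

samples = [12, 15, 14, 200, 16, 13, 18, 17, 950, 14]
print(percentile(samples, 50))  # p50 (median): a typical request, 15 ms
print(percentile(samples, 99))  # p99: the tail, 950 ms
```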
Higher percentiles, like the 99th or 99.95th, are known as tail latencies. When serving one user request requires calling several backend services, the chance that at least one of those calls is slow grows with the number of calls, so a higher proportion of end-user requests end up slow; this is known as tail-latency amplification. Sometimes a few slow requests hold up the requests queued behind them (those requests are 'latent', awaiting service), which is called head-of-line blocking.
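To see why the tail gets amplified, a quick back-of-the-envelope calculation helps; the 1% figure below is just an assumed example.

```python
# Assume each individual backend call is slow 1% of the time.
p_slow = 0.01

for n_calls in (1, 10, 100):
    # A user request is slow if ANY of its backend calls is slow.
    p_request_slow = 1 - (1 - p_slow) ** n_calls
    print(f"{n_calls:>3} backend calls -> {p_request_slow:.0%} of user requests hit the tail")
# 1 call -> 1%, 10 calls -> ~10%, 100 calls -> ~63%
```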
Maintainability
Maintainable systems are straightforward to work on and to keep running. In the case of data systems, the system should be easy to monitor and easy to refactor as requirements change.