An Architectural Analysis of Netflix

Marco

Overview

Netflix is a streaming service that offers on-demand shows and movies across an array of internet-connected devices. Its high-level functional requirements centre on running and supporting streaming and playback (Hahn 2016), alongside handling highly dynamic user load. The latter is achieved through elastic scaling, with AWS providing IaaS (Watson 2016). Netflix's stated goals include high availability, innovation and velocity (Rangarajan 2018).

System Choice

With Netflix accounting for 37% of all US internet traffic and making 4,000 production changes per day, it provides a real-world example of scalability and resiliency achieved through a microservices architecture, chaos engineering and DevOps integration (Hahn 2016).
Using a microservices architecture, Netflix can focus on technical innovation through loosely coupled, independent work: teams concentrate on finding the best solution rather than on adherence to an internal company language and set of frameworks (Bengston 2020). Running these microservices on elastic IaaS allows the company to provide scalable, robust content streaming with minimal downtime and high-velocity upgrades.

Netflix is unusual in that it incorporates chaos engineering through Chaos Monkey and Chaos Kong in order to build a resilient system (Hahn 2016). This entails purposefully breaking system components, or taking out entire AWS regions, in production in order to be prepared for an actual outage. Netflix is also unusual in outsourcing its infrastructure while providing its own network architecture and CDN, Open Connect (Netflix 2019).

Architectural Characteristics

Two key architectural characteristics important for the system are the operational characteristics of scalability and availability (Richards and Ford 2020, 58). Scalability refers to the system's ability to keep providing services as users and requests increase or decrease. It plays a key role at Netflix, which experiences highly dynamic load, with evening spikes as users watch shows after work. Availability refers to how reliably the system maintains uptime. The core ethos at Netflix is to ensure virtually no downtime and smooth playback and streaming, whilst maintaining a high, steady bitrate on all its videos.

Architectural System Description and Relation to Characteristics

The front end comprises supported clients using the Netflix app or a web browser, with the Netflix Device Ready Platform SDK embedded in any device environment that runs as a front-end client. The back end uses scalable storage and computing instances coupled with business-logic microservices such as content discovery and playback security (Nguyen 2020), backed by scalable distributed databases such as Cassandra (Evans 2017). Additional architectural parts of the system include big-data processing with Kafka and Hadoop, as well as in-house tools for video processing and encoding. The third architectural part is Netflix's content delivery network. The use of microservices entails decomposing business functionality into independently deployable services with isolated persistence: each service has its own database connection, which allows incremental upgrades of the system by applying red-black deployment patterns (Evans 2017) to each service with close to no downtime, thus achieving availability.

An event-driven style is further incorporated using the Keystone platform (Netflix 2018) for handling microservices events in real-time, allowing for an increase in availability and scalability.
Netflix runs their services on the cloud through AWS, and leverages AWS specific components such as AWS elastic load-balancer and S3 storage buckets.

The core scope of this report will be upon the back-end microservices style.

Architectural Choices of Styles and Patterns in Relation to Characteristics

Figure 1: Netflix architecture depiction (Nguyen 2020)

Figure 1 above depicts a simplified overview of the Netflix microservices architecture style. In order to maintain scalability, multiple instances of specific microservices are deployed dynamically and scaled horizontally to meet user load. In figure 1, an elastic load balancer routes requests to various instances of the Zuul API Gateway, a valid pattern of placing a gateway before every service (Alagarasan 2015), known as an external API gateway pattern (Richardson 2019, 259). The gateway allows user requests to be dynamically routed to the required services and provides edge failure resiliency (Nguyen 2020). Zuul also helps improve availability and scalability via its load-testing feature, allowing the Netflix team to incorporate chaos engineering and thus build resilience to failures.
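The gateway's routing behaviour can be sketched as a prefix-based route table. This is only an illustrative stand-in (Zuul itself is a Java, filter-based proxy, and the service names below are invented):

```python
import random

class EdgeGateway:
    """Toy API gateway: map path prefixes to pools of service instances."""
    def __init__(self):
        # path prefix -> healthy instances of that service (made-up names)
        self.routes = {
            "/play": ["play-api-1:8080", "play-api-2:8080"],
            "/discovery": ["discovery-1:8080"],
        }

    def route(self, path):
        """Return a backend instance for the longest matching path prefix."""
        matches = [p for p in self.routes if path.startswith(p)]
        if not matches:
            raise LookupError(f"no route for {path}")
        prefix = max(matches, key=len)
        return random.choice(self.routes[prefix])

gw = EdgeGateway()
assert gw.route("/play/title/123").startswith("play-api-")
```

In the real system the route table is populated dynamically from service discovery rather than hard-coded, which is what makes the edge resilient to instances coming and going.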

The edge load balancer alone is not enough to maintain scalability. Netflix also introduced load balancing on all services in order to distribute traffic across middle-tier services; this is done through Netflix's in-house client-side load balancer, Ribbon (Wang and Tonse 2013). Inter-service communication between middle-tier services uses "lightweight REST" (Wang and Tonse 2013) based API calls, so Netflix is able to maintain high availability through performant inter-service communication over lightweight protocols. Netflix also employs a circuit breaker pattern via its Hystrix library: if a service fails, a static fallback response is returned, improving the availability of the system as opposed to having the whole system halt. Finally, the decoupling of business logic and datastore found in traditional microservice architecture allows nearly all Netflix services to be considered stateless and thus achieve an extremely high level of scalability (Nguyen 2020).
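The circuit breaker idea is simple enough to sketch in a few lines. This is Hystrix-style behaviour, not the Hystrix API: after a threshold of consecutive failures the circuit "opens" and callers get the static fallback immediately, sparing the failing service and keeping the caller responsive:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch (Hystrix-style behaviour, not its API).

    After `max_failures` consecutive errors the circuit opens and the static
    fallback is returned until `reset_after` seconds elapse."""
    def __init__(self, call, fallback, max_failures=3, reset_after=30.0):
        self.call, self.fallback = call, fallback
        self.max_failures, self.reset_after = max_failures, reset_after
        self.failures, self.opened_at = 0, None

    def __call__(self, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return self.fallback(*args)          # open: fail fast
            self.opened_at, self.failures = None, 0  # half-open: retry
        try:
            result = self.call(*args)
            self.failures = 0                        # success resets the count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()    # trip the circuit
            return self.fallback(*args)

# Usage: a downstream service that is currently down.
calls = {"n": 0}
def flaky_service():
    calls["n"] += 1
    raise RuntimeError("service down")

guarded = CircuitBreaker(flaky_service, lambda: "static fallback",
                         max_failures=2, reset_after=60.0)
assert guarded() == "static fallback"   # failure 1
assert guarded() == "static fallback"   # failure 2 -> circuit opens
assert guarded() == "static fallback"   # served without touching the service
assert calls["n"] == 2                  # the open circuit stopped the calls
```

The key availability property is the last assertion: once open, the breaker stops hammering the failing dependency while still answering callers.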

Figure 2: Eureka service discovery with Ribbon Client Side Load-Balancing using the Round Robin algorithm (Wang and Tonse 2013)

As this distributed architecture relies on dynamically scaled services that can be spun up or torn down, Netflix faces the problems of dynamically allocated IP addresses and the discovery of available services. The Eureka service registry follows a service discovery pattern known as self-registration (Richardson 2019, 82). Under this pattern, registered services periodically invoke a heartbeat API so that the service registry knows of their existence. This improves scalability by "building elasticity into the microservices" (Richards and Ford 2020, 252), as it is the client's job to let Eureka know it exists: multiple clients can be deployed according to scale, with the Eureka service passively monitoring current services. This keeps strain off the service registry and thus reduces its probability of failing. One situation in which Eureka service discovery works against availability is that it can act as a single point of failure if no redundant service registries are available (Wang and Tonse 2013).
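The self-registration pattern can be sketched as a registry that merely records heartbeats and passively expires silent instances (the TTL, service names and addresses below are invented for illustration; Eureka's actual lease mechanics are richer):

```python
import time

class ServiceRegistry:
    """Eureka-style self-registration sketch: instances send heartbeats;
    the registry passively drops entries it has not heard from within TTL."""
    def __init__(self, ttl=3.0):
        self.ttl = ttl
        self.instances = {}   # (service, address) -> last heartbeat time

    def heartbeat(self, service, address):
        """Called periodically by each service instance (self-registration)."""
        self.instances[(service, address)] = time.monotonic()

    def lookup(self, service):
        """Return addresses of instances with a recent-enough heartbeat."""
        now = time.monotonic()
        self.instances = {k: t for k, t in self.instances.items()
                          if now - t <= self.ttl}
        return [addr for (svc, addr) in self.instances if svc == service]

registry = ServiceRegistry(ttl=3.0)
registry.heartbeat("playback", "10.0.0.5:8080")
registry.heartbeat("playback", "10.0.0.6:8080")
assert sorted(registry.lookup("playback")) == ["10.0.0.5:8080", "10.0.0.6:8080"]
```

Because the registry only receives heartbeats and never polls, adding more instances costs it almost nothing, which is the elasticity point made above.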

Figure 3: Hybrid Microservice at Netflix (Evans 2017)

Figure 3 illustrates a simplified version of the internal architecture of a microservice, showing how scalability is achieved per microservice: Netflix horizontally scales each service as user load increases. Netflix deliberately incorporates a microservices anti-pattern, coupling, via shared service client libraries that handle reading and writing data to the persistence layer for all services (Evans 2017). Although figure 3 depicts the data-isolation practice of a dedicated database per service, the use of EVCache turns the traditional microservice into a hybrid microservice. EVCache, which is sharded, provides redundancy for stateful applications by writing multiple copies out to multiple nodes. Its extensive use improves performance and thus serves both high availability and scalability. Cache fronting mitigates the performance issues incurred when scaling persistence-level calls by having the client consult the cache first; if the data is not there, it is queried from the database. Not only does this improve scalability by limiting database calls, but EVCache's linearly scalable performance also improves availability within the service should either the database or the cache client fail.
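The cache-fronting read path described above is the classic cache-aside pattern. A minimal sketch (an in-memory dict standing in for an EVCache shard, a dict for the Cassandra-backed store; not the EVCache client API):

```python
class CacheFrontedStore:
    """Cache-aside read path sketch: try the cache first, fall back to the
    database on a miss, and populate the cache for subsequent reads."""
    def __init__(self, db):
        self.db = db          # authoritative store, e.g. a Cassandra table
        self.cache = {}       # in-memory stand-in for a cache shard
        self.db_reads = 0     # counts how often we had to hit the database

    def get(self, key):
        if key in self.cache:
            return self.cache[key]          # fast path: cache hit
        self.db_reads += 1
        value = self.db[key]                # slow path: database read
        self.cache[key] = value             # warm the cache for next time
        return value

store = CacheFrontedStore(db={"title:42": "Stranger Things"})
assert store.get("title:42") == "Stranger Things"   # miss -> database
assert store.get("title:42") == "Stranger Things"   # hit -> cache
assert store.db_reads == 1                          # database touched once
```

Repeated reads of hot keys never reach the database, which is exactly how cache fronting limits persistence-level load as traffic scales.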

Availability is further supported by each service running its own process in a container, which means better fault isolation: a service failure is isolated from the others. This adds the operational overhead of a container-per-service pattern, but it works well when deploying over IaaS, as a growing service can be scaled in a resource-dependent way rather than per physical machine, where a resource such as CPU may be constrained (Richardson 2019, 9). Additional complexity includes the need for tooling such as container orchestration, for which Netflix built its own platform, Titus (Netflix 2018). This illustrates a lifecycle issue of microservices: additional tooling must be developed in order to deploy the architecture, pivoting effort away from rapidly shipping production-ready business logic and towards microservice management as a distracting overhead.

Additional patterns: Play API Choices

Figure 4.1. Netflix event handler and I/O (Rangarajan 2018)

With the Play API acting as an orchestration layer over the network of microservices in figure 1, its core business logic has to remain highly scalable and available under heavy request load. Parallel processing was therefore a priority, and Netflix combined synchronous execution with asynchronous I/O. Each incoming request from the gateway is handled by a dedicated thread, while outgoing calls run as non-blocking I/O on a per-client event loop. Once the call to a microservice completes or times out, the dedicated thread constructs the corresponding response. This yields high availability by keeping synchronous tasks simple whilst ensuring that I/O, often the slowest part of the system, does not become a bottleneck (Nguyen 2020).
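The shape of this request handling, fanning out non-blocking calls and composing one response once they complete or time out, can be sketched with Python's asyncio (the downstream service names and latencies are invented; the real Play API is JVM-based):

```python
import asyncio

async def call_service(name, delay):
    """Stand-in for one non-blocking downstream call."""
    await asyncio.sleep(delay)   # simulated network latency
    return f"{name}:ok"

async def handle_play_request():
    """Fan out to several downstream services concurrently, then compose
    a single response once all I/O completes (or the deadline passes)."""
    results = await asyncio.wait_for(
        asyncio.gather(
            call_service("license", 0.01),
            call_service("bookmarks", 0.02),
            call_service("recommendations", 0.01),
        ),
        timeout=1.0,   # overall deadline for the whole fan-out
    )
    return dict(r.split(":") for r in results)

response = asyncio.run(handle_play_request())
assert response == {"license": "ok", "bookmarks": "ok", "recommendations": "ok"}
```

The total latency here approaches that of the slowest call rather than the sum of all calls, which is why non-blocking I/O keeps the slowest parts of the system from becoming a bottleneck.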

Figure 4.2: API Service per function (Rangarajan 2018)

In figure 4.2, applying the single-responsibility principle and bounded contexts provides decoupling and granularity, preventing inter-code dependencies.

Additional patterns: Stream-Processing Pipeline

Figure 5: Stream-Processing Pipeline (Nguyen 2020)

The stream-processing platform is used internally for Netflix business analytics (Nguyen 2020). As this requires real-time collection and aggregation of microservice events in production for onward data processing, a producer-consumer pattern is used alongside a middle router. This creates flexibility and scalability in processing: the router can send the produced events to a Hadoop batch-processing system or Elasticsearch, or the events can be consumed directly, via a Kafka consumer, by custom stream-processing applications such as real-time recommendation systems (Nguyen 2020).
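The producer-router-consumer split can be sketched with plain queues (the event shapes and routing rule are invented; the real pipeline fronts everything with Kafka):

```python
from queue import Queue

def route(event):
    """Decide which sink an event goes to (a stand-in for the Keystone
    router; the rule here is purely illustrative)."""
    if event["type"] == "click":
        return "realtime"    # e.g. real-time recommendation consumers
    return "batch"           # e.g. Hadoop / Elasticsearch sinks

sinks = {"realtime": Queue(), "batch": Queue()}

# Producer side: microservices emit events into the pipeline.
events = [
    {"type": "click", "title": 42},
    {"type": "playback_log", "msg": "session start"},
]
for event in events:
    sinks[route(event)].put(event)

# Consumer side: each sink drains its own queue independently.
assert sinks["realtime"].get()["title"] == 42
assert sinks["batch"].get()["msg"] == "session start"
```

Because producers only talk to the router, new consumer types can be attached without touching any producing microservice, which is where the flexibility claimed above comes from.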

Critical Analysis of Architectural Limitations Identified

As microservices fall under the distributed architecture style, they inherit a number of limitations of distributed computing, described as the fallacies of distributed computing by L. Peter Deutsch (Richards and Ford 2020, 124).

One fallacy is that the network is reliable. As the Netflix microservice architecture becomes extremely reliant on the network, the architecture itself becomes less reliable, which is contradictory given that microservices are meant to be a more resilient solution than monoliths. Although network reliability issues can be mitigated through circuit breaker patterns such as Hystrix, running on AWS infrastructure means Netflix can potentially be affected by misbehaving tenants.

Another fallacy is that bandwidth is infinite; in a microservices architecture deployed at Netflix's scale this issue arises quickly. An example fix can be seen in Netflix's Play API: one way Netflix minimised this overhead was to use gRPC instead of REST with JSON, allowing RPC methods to be defined by protocol buffers, which are far more lightweight on the wire than JSON payloads.

A third fallacy is that transport cost is nil: as microservices require more network infrastructure and communication, they cost more to run than monoliths. This is countered by the architectural choice of IaaS, which reduces the cost overhead. The elastic computing features of AWS, coupled with dynamic scaling of specific fine-grained services, enable a completely variable-cost approach, rather than the fixed-cost approach seen when a company runs its own infrastructure and must horizontally scale every aspect of a monolith even though only certain services are under heavy load.
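The bandwidth saving from binary framing over JSON is easy to see with the standard library alone. This is not protocol buffers, just fixed-width binary packing of a made-up record, but it shows the same order-of-magnitude gap:

```python
import json
import struct

# A playback-heartbeat style record (fields invented for illustration).
record = {"user_id": 123456, "title_id": 98765, "position_s": 1432}

json_bytes = json.dumps(record).encode()          # self-describing text
packed = struct.pack("<III", *record.values())    # 3 x uint32, schema implied

assert len(packed) == 12                # fixed 12 bytes on the wire
assert len(json_bytes) > len(packed)    # JSON carries field names as text
```

Binary encodings such as protocol buffers trade JSON's self-describing payloads for a shared schema, which is precisely what makes them so much lighter at RPC volumes like Netflix's.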
A distributed system, especially one running on IaaS, also incurs additional security verification overhead at every API endpoint, along with security and performance overheads such as TLS wrapping of API calls. One performance enhancement here is the client-side use of the NTBA protocol for playback requests to OCA server locations, removing TLS latency (Nguyen 2020).

As Netflix deploys its microservices architecture on AWS, it must partition data across AWS availability zones, improving availability by duplicating data across regions but introducing a persistence limitation: as with any microservices architecture subject to network partitioning, the cost of the CAP theorem always applies (Evans 2017).

Netflix therefore has to make a difficult choice between consistency and availability. When writing to various nodes across a network partition, the choice is whether to wait for a downed node to come back online so that all nodes stay consistent, or to read and write only to the available nodes and maintain availability instead. Netflix chose availability with eventual consistency. As depicted in figure 6, clients read and write to one or more Cassandra nodes; a written-to node then persists the data over time to copies on other nodes, achieving high availability and eventual consistency (Evans 2017).
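The availability-over-consistency choice can be sketched as a store that accepts writes while a replica is down and later repairs the stale replica from a write log (a toy stand-in for Cassandra's replication, with a simplistic log-replay repair in place of real anti-entropy):

```python
class Replica:
    def __init__(self):
        self.data = {}
        self.up = True

class EventuallyConsistentStore:
    """Availability-over-consistency sketch: writes succeed as long as any
    replica is up; downed replicas converge after they recover."""
    def __init__(self, n=3):
        self.replicas = [Replica() for _ in range(n)]
        self.log = []                       # write log used for repair

    def write(self, key, value):
        live = [r for r in self.replicas if r.up]
        if not live:
            raise RuntimeError("no replica available")
        self.log.append((key, value))
        for r in live:
            r.data[key] = value             # only live replicas see the write

    def repair(self):
        """Replay the log so recovered replicas converge (anti-entropy)."""
        for r in self.replicas:
            if r.up:
                for key, value in self.log:
                    r.data[key] = value

store = EventuallyConsistentStore()
store.replicas[2].up = False                # one node partitioned away
store.write("bookmark:alice", 1432)         # the write still succeeds
assert "bookmark:alice" not in store.replicas[2].data   # temporarily stale
store.replicas[2].up = True
store.repair()                              # replicas converge eventually
assert store.replicas[2].data["bookmark:alice"] == 1432
```

Between the write and the repair, a client reading the stale replica would see old data, which is exactly the window that "eventual consistency" names.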

Figure 6. Eventual Consistency Illustration. (Evans 2017)

More time is needed upfront to decompose business logic into sets of microservices, and additional container and orchestration tooling must be sourced. Overall, however, the use of decomposition patterns when building microservices, coupled with a container-per-service deployment pattern, allows not only fault isolation and high availability but also the use of multiple languages and new frameworks, since each service communicates with the others through a defined API contract. This increases the longevity of the system by allowing each fine-grained service to be updated and rebuilt in its own way, whilst the absence of tight coupling between services prevents cascading failures across the overall system. Thus the trade-off of increasing the inherent complexity of a system through microservices brings the benefits of increased modularity, framework flexibility and system lifespan.

As shown in figure 3, Netflix has chosen to reduce duplication of persistence-level code at the cost of increased coupling between services, via a shared service client library. This form of coupling can, however, be seen as operational reuse through the sidecar pattern, as with circuit breakers (Richards and Ford 2020, 250). Although microservices should prefer duplication over coupling, the balance between the two ultimately depends on decisions specific to the company, and the law of software architecture that "everything is a trade-off" becomes prominent (Richards and Ford 2020, 23).

Use of IaaS does mean dependence on a third-party provider, but it also provides the ability, through elastic computing, to absorb resource spikes from misbehaving code or a user influx when things go wrong (Evans 2017).
Furthermore, styles constrain architectures, and operational drift may occur during development, as humans often have a hard time changing the status quo (Evans 2017). Architectural styles such as microservices have thus become embedded in the company's DNA, as seen in the popularity of Netflix's OSS microservice libraries; whether that influence on the wider architecture community proves positive or negative depends on the company's continued success.

Words (including figures and inline references): 2530

Bibliography

Alagarasan, Vijay. 2015. "Seven Microservices Anti-patterns." InfoQ. https://www.infoq.com/articles/seven-uservices-antipatterns/.

Bengston, Will. 2020. "Case Study: Netflix | SANS Cloud INsecurity Summit 2018." YouTube. https://www.youtube.com/watch?v=li7-ZYXl7w4.

Evans, Josh. 2017. "Mastering Chaos - A Netflix Guide to Microservices." YouTube. https://www.youtube.com/watch?v=CZ3wIuvmHeM.

Hahn, Dave. 2016. "AWS re:Invent 2016: Another Day in the Life of a Netflix Engineer (DEV209)." YouTube. https://www.youtube.com/watch?v=aWgtRKfrtMY.

Netflix. 2018. "Titus, the Netflix container management platform, is now open source." Netflix Technology Blog. https://netflixtechblog.com/titus-the-netflix-container-management-platform-is-now-open-source-f868c9fb5436.

Netflix. 2018. "Keystone Real-time Stream Processing Platform." Netflix Technology Blog. https://netflixtechblog.com/keystone-real-time-stream-processing-platform-a3ee651812a.

Netflix. 2019. "Open Connect Overview." Netflix. https://openconnect.netflix.com/Open-Connect-Overview.pdf.

Nguyen, Cao D. 2020. "A Design Analysis of Cloud-based Microservices Architecture at Netflix." Medium. https://medium.com/swlh/a-design-analysis-of-cloud-based-microservices-architecture-at-netflix-98836b2da45f.

Rangarajan, Suudhan. 2018. "Netflix Play API - An Evolutionary Architecture." InfoQ. https://www.infoq.com/presentations/netflix-play-api/.

Richards, Mark, and Neal Ford. 2020. Fundamentals of Software Architecture: An Engineering Approach. Sebastopol: O'Reilly.

Richardson, Chris. 2019. Microservices Patterns. Shelter Island: Manning.

Wang, Alexander, and Sudhir Tonse. 2013. "Announcing Ribbon: Tying the Netflix Mid-Tier Services Together." Netflix Technology Blog. https://netflixtechblog.com/announcing-ribbon-tying-the-netflix-mid-tier-services-together-a89346910a62.

Watson, Coburn. 2016. "AWS re:Invent 2016: From Resilience to Ubiquity." YouTube. https://www.youtube.com/watch?v=leqUbSY55hY.
