Unveiling Azure Storage: A Deep Dive into the Stream Layer

Windows Azure Storage (WAS) is a highly available, strongly consistent cloud storage system that Microsoft uses for many of its internal and external workloads. WAS provides cloud storage in the form of Blobs (user files), Tables (structured storage), and Queues (message delivery). The paper I read details its inner workings: three layers, namely the Stream Layer, the Partition Layer, and the Front End Layer, make up the design of this storage system. In this blog I will mainly be talking about the Stream Layer, which acts as a distributed file system used to store data within each storage stamp (a cluster of N racks of storage nodes). Let's get to it, then.
The Stream Layer
This layer is responsible for storing the bits on disk. It does that using files called "streams", which are ordered lists of large storage chunks called "extents". Data stored in this layer is accessed by the partition layer. It is also responsible for what is called intra-stamp replication.
Intra Stamp Replication
This is a synchronous process, meaning it is performed on the critical path of the client's write request. It ensures that enough replicas of the data are kept within the same storage stamp for durability. It is handled entirely by the stream layer, which we will get into later in this blog.
Now, before we start, let's go over three important terms that we will keep using throughout this blog. The first one is:
Blocks
This is the minimum unit of data for reading and writing. Data is appended to an extent as one or more concatenated blocks. For a read, the client gives an offset into a stream and a read length, and all the blocks needed to cover that length are read. Furthermore, block reads use a per-block checksum validation to check data integrity.
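To make the per-block checksum idea concrete, here is a minimal sketch of a block read that validates integrity before returning data. The block layout and the use of CRC32 are my own assumptions for illustration, not the actual WAS format.

```python
import zlib


def make_block(payload: bytes) -> dict:
    """Package a payload as a block together with its checksum (CRC32 assumed here)."""
    return {"data": payload, "checksum": zlib.crc32(payload)}


def read_block(block: dict) -> bytes:
    """Validate the checksum before returning block data, as the stream layer
    does for every block touched by a read."""
    if zlib.crc32(block["data"]) != block["checksum"]:
        raise IOError("block checksum mismatch: data corruption detected")
    return block["data"]


# A valid block reads back fine; a corrupted one would raise IOError.
blk = make_block(b"hello extent")
assert read_block(blk) == b"hello extent"
```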
Extent
These are the units of replication: for each extent, three replicas are stored within a storage stamp. An extent consists of a sequence of blocks. Small objects (table rows, messages, or small blobs) are stored within the same extent, while larger objects are spread across multiple extents whose offsets are tracked by the partition layer for future reads.
Stream
A stream is an ordered list of pointers to extents; to the partition layer it looks like one big file. Only the last extent in the stream can be appended to; the rest of the extents are immutable.
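Putting these three terms together, a rough mental model (just a sketch in Python; the real data structures are of course far more involved) might look like this:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    data: bytes  # minimum unit of reading/writing, checksummed on reads


@dataclass
class Extent:
    blocks: List[Block] = field(default_factory=list)
    sealed: bool = False  # once sealed, an extent is immutable


@dataclass
class Stream:
    name: str
    extents: List[Extent] = field(default_factory=list)  # ordered list of extent pointers

    def append(self, block: Block) -> None:
        # Only the last, unsealed extent can be appended to.
        if not self.extents or self.extents[-1].sealed:
            self.extents.append(Extent())
        self.extents[-1].blocks.append(block)
```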
Components of the Stream layer
There are two main components of the stream layer: the Stream Manager (SM) and the Extent Nodes (ENs).
Stream Manager
The SM keeps track of the stream namespace, the extents in each stream, and the allocation of extents across Extent Nodes (ENs). The SM is responsible for tasks such as maintaining the stream namespace, creating extents and assigning them to ENs, garbage collecting extents that are no longer pointed to by any stream, and coordinating replication of extents. The SM periodically polls the ENs for their state and the extents they store; if it finds that an extent has fewer replicas than desired, it performs lazy re-replication to get back to the desired number. For replication, it chooses ENs across different fault domains (sets of hardware sharing a single point of failure) so that if one replica is lost to a failure of any sort, the rest are unaffected and can still serve reads from clients. The SM knows nothing about blocks, only extents and streams, so the amount of state it keeps stays small enough to fit in the SM's memory. The SM and the partition layer are co-designed so that they will not use more than 50 million extents and no more than 100,000 streams in a given storage stamp, and this parameterization fits comfortably into 32 GB of memory on the SM.
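To picture why this state stays small, here is a toy sketch of the kind of bookkeeping described above: per-stream extent lists, per-extent replica locations, and the lazy re-replication check. All names and structures are illustrative assumptions, not the real SM implementation.

```python
REPLICATION_FACTOR = 3  # each extent is kept on three extent nodes

# The SM only tracks streams and extents, never blocks, so its state stays small.
stream_namespace = {"accounttable-stream": ["extent-0001", "extent-0002"]}
extent_locations = {"extent-0001": {"EN-3", "EN-7", "EN-12"},
                    "extent-0002": {"EN-1", "EN-7"}}  # one replica has been lost


def lazy_replication_targets(extent_locations, live_ens):
    """Find extents with fewer healthy replicas than desired, as the SM does
    when it periodically polls the extent nodes."""
    under_replicated = {}
    for extent, replicas in extent_locations.items():
        healthy = replicas & live_ens
        if len(healthy) < REPLICATION_FACTOR:
            under_replicated[extent] = REPLICATION_FACTOR - len(healthy)
    return under_replicated


print(lazy_replication_targets(extent_locations,
                               live_ens={"EN-1", "EN-3", "EN-7", "EN-12"}))
# -> {'extent-0002': 1}: the SM would have a healthy replica copied to a new EN
#    in a different fault domain.
```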
Extent Node
An extent node maintains the storage for a set of extent replicas assigned to it by the SM. An EN has N disks attached, which it completely controls for storing extent replicas and their blocks. An EN knows nothing about streams and only deals with extents and blocks. Internally on an EN server, every extent on disk is a file, which holds data blocks and their checksums, along with an index that maps extent offsets to blocks and their file locations. Each extent node maintains a view of the extents it owns and where the peer replicas are for each of those extents. ENs only talk to other ENs to replicate block writes sent by a client, or to create additional copies of an existing replica when told to by the SM. When an extent is no longer referenced by any stream, the SM garbage collects the extent and notifies the ENs to reclaim the space.
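As a rough illustration of the index an EN keeps for each extent file, here is a sketch that maps an extent offset to a block's position in the file. The exact on-disk format is not spelled out in the paper, so this layout is purely an assumption.

```python
import bisect

# Hypothetical index for one extent file on an EN: each entry maps the extent
# offset at which a block starts to that block's position in the file.
block_offsets = [0, 4096, 12288, 20480]     # extent offsets where blocks start
file_positions = [128, 4300, 12600, 20900]  # where each block lives in the file


def locate_block(extent_offset: int) -> int:
    """Translate an extent offset into a file position via the block index."""
    i = bisect.bisect_right(block_offsets, extent_offset) - 1
    if i < 0:
        raise ValueError("offset before first block")
    return file_positions[i]


print(locate_block(5000))  # falls inside the second block -> file position 4300
```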
Working of the Stream Layer
Append Operation and Sealing
Streams can only be appended to; existing data is immutable. Append operations are atomic: either the entire data block is appended, or nothing is. There is also an atomic multi-block append, which lets a client append a large amount of sequential data as a single atomic operation that can later be read back in smaller pieces. The stream layer guarantees that a multi-block append happens atomically, and if the client never hears back it should retry the request. The implication is that the client should expect the same data to be appended more than once in the face of timeouts and retries, and must be able to deal with duplicates. An extent has a target size, specified by the client (the partition layer), and when it fills up to that size the extent is sealed at a block boundary; a new extent is then added to the stream and appends continue into that new extent. Once an extent is sealed it can no longer be appended to.
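The retry-and-duplicates behaviour is easiest to see in code. Below is a small sketch (purely illustrative; the "seq" field and the flaky sender are my own stand-ins) showing how a timed-out append that actually landed leads to the same record being appended twice.

```python
def append_with_retry(send_append, record, max_attempts=3):
    """At-least-once append: after a timeout the client cannot tell whether the
    block was committed, so it retries, and the same data may be appended twice."""
    for _ in range(max_attempts):
        try:
            return send_append(record)
        except TimeoutError:
            continue  # the append may or may not have landed on the extent
    raise RuntimeError("append failed after retries")


extent_log = []  # what actually gets appended to the extent


def flaky_send(record):
    extent_log.append(record)      # the append reaches the extent...
    if len(extent_log) == 1:
        raise TimeoutError         # ...but the first acknowledgement is lost
    return len(extent_log)


# The record ends up appended twice, so readers must tolerate duplicates, e.g.
# via a record sequence number (an illustrative assumption; the partition layer
# has its own ways of dealing with duplicate records).
append_with_retry(flaky_send, {"seq": 42, "row": "..."})
print(extent_log)  # the same record appears twice
```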
Intra-Stamp Replication
When a stream is first created, the SM assigns three replicas for the first extent (one primary and two secondary) to three extent nodes, chosen by the SM so as to randomly spread the replicas across different fault domains while taking extent node usage into account (for load balancing). In addition, the SM decides which replica will be the primary for the extent. Writes to an extent are always performed from the client to the primary EN, and the primary EN is in charge of coordinating the write to the two secondary ENs. The primary EN and the locations of the three replicas never change for an extent while it is being appended to (while the extent is unsealed). Therefore, no leases are needed to represent the primary EN for an extent, since the primary is fixed while the extent is unsealed.
For an extent, every append is replicated three times across the extent's replicas. A client sends all write requests to the primary EN, but it can read from any replica. The client sends the append to the primary EN for the extent, and the primary is then in charge of (a) determining the offset of the append in the extent, (b) ordering (choosing the offsets of) all the appends if there are concurrent append requests to the same extent, (c) sending the append with its chosen offset to the two secondary extent nodes, and (d) only returning success for the append to the client after the append has successfully been written to disk on all three extent nodes. Only when the write has succeeded on all three replicas will the primary EN respond to the client that the append was a success. If there are multiple outstanding appends to the same extent, the primary EN will respond "success" in the order of their offsets to the clients. As appends commit in order for a replica, the last append position is considered to be the current commit length of the replica.
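Here is a compact sketch of steps (a) through (d), with a trivially simplified in-memory replica standing in for an extent node; it is meant to show the control flow, not the real EN code.

```python
class Replica:
    """A trivially simplified extent replica: an append-only byte buffer."""
    def __init__(self):
        self.data = bytearray()

    @property
    def commit_length(self):
        return len(self.data)

    def write_at(self, offset, block):
        assert offset == len(self.data)  # appends commit strictly in order
        self.data += block
        return True


def primary_append(block, primary, secondaries):
    """The primary EN coordinates an append (steps a-d above): it chooses the
    offset, forwards the append to both secondaries, and acknowledges the
    client only after all three replicas have written the block."""
    offset = primary.commit_length                  # (a), (b): primary fixes the offset
    primary.write_at(offset, block)
    for en in secondaries:                          # (c): same offset to both secondaries
        if not en.write_at(offset, block):
            return {"status": "failed"}             # client will have the SM seal the extent
    return {"status": "success", "offset": offset}  # (d): ack only after all three succeed


p, s1, s2 = Replica(), Replica(), Replica()
print(primary_append(b"block-1", p, [s1, s2]))  # {'status': 'success', 'offset': 0}
```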
When a stream is opened, the metadata for its extents is cached at the client, so the client can go directly to the ENs for reading and writing without talking to the SM until the next extent needs to be allocated for the stream. If, during writing, one of the replicas' ENs is not reachable, a write failure is returned to the client. The client then contacts the SM, and the extent that was being appended to is sealed by the SM at its current commit length. At this point, the sealed extent can no longer be appended to. The SM will then allocate a new extent with replicas on different (available) ENs, which now becomes the last extent of the stream. The information for this new extent is returned to the client, and the client continues appending to the stream with its new extent.
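Sketching the same failure-handling flow in code (with stub classes standing in for extents and the SM) makes the seal-and-continue behaviour clearer:

```python
class ExtentStub:
    def __init__(self, healthy=True):
        self.healthy = healthy
        self.sealed = False
        self.blocks = []

    def append(self, block):
        if self.sealed or not self.healthy:
            return "failure"  # e.g. one of the three replica ENs is unreachable
        self.blocks.append(block)
        return "success"


class StreamManagerStub:
    def seal(self, extent):
        extent.sealed = True          # sealed at its current commit length

    def allocate_extent(self, extents):
        extents.append(ExtentStub())  # new replicas placed on available ENs
        return extents[-1]


def client_append(extents, block, sm):
    """Client path from the paragraph above: try the stream's last extent, and
    on a write failure have the SM seal it, then continue on a fresh extent."""
    extent = extents[-1]
    if extent.append(block) == "failure":
        sm.seal(extent)
        extent = sm.allocate_extent(extents)
        extent.append(block)
    return extents


extents = [ExtentStub(healthy=False)]       # simulate an unreachable replica
client_append(extents, b"data", StreamManagerStub())
print(len(extents), extents[0].sealed)       # 2 True: old extent sealed, new one in use
```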
Sealing
From a high level, the SM coordinates the sealing operation among the ENs; it determines the commit length for the sealed extent based on the commit lengths of the extent replicas. Once the sealing is done, the commit length will never change again. To seal an extent, the SM asks all three ENs for their current length. During sealing, either all replicas have the same length, which is the simple case, or some replicas are longer or shorter than others. This latter case can only occur during an append failure where some but not all of the ENs for the extent were available, i.e., some of the replicas got the append block but not all of them. The SM chooses the smallest commit length among the ENs it can talk to. This does not cause data loss, because the primary EN only returns success to the client once the append has been written to disk on all three ENs, which means the smallest commit length is sure to contain all the writes that have been acknowledged to the client. During the sealing, all the extent replicas that were reachable by the SM are sealed to the commit length chosen by the SM. Moreover, if an EN was not reachable by the SM during the sealing process but later becomes reachable, the SM will force it to synchronize the given extent to the chosen commit length. After sealing, the commit length of an extent is never changed.
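The seal decision itself boils down to a minimum over the reachable replicas' commit lengths, roughly like this (a sketch; the EN names and lengths are made up):

```python
def seal_extent(replica_lengths, reachable):
    """Sketch of the seal decision: the SM asks the reachable ENs for their
    current commit lengths and seals the extent at the smallest one, which is
    guaranteed to contain every append that was acknowledged to a client."""
    lengths = [length for en, length in replica_lengths.items() if en in reachable]
    return min(lengths)


# EN-2 got an append that was never acknowledged; EN-3 is unreachable.
replica_lengths = {"EN-1": 4096, "EN-2": 8192, "EN-3": 4096}
print(seal_extent(replica_lengths, reachable={"EN-1", "EN-2"}))  # 4096
```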
Read Load-Balancing
When reads are issued for an extent that has three replicas, they are submitted with a “deadline” value which specifies that the read should not be attempted if it cannot be fulfilled within the deadline. If the EN determines the read cannot be fulfilled within the time constraint, it will immediately reply to the client that the deadline cannot be met. This mechanism allows the client to select a different EN to read that data from, likely allowing the read to complete faster.
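A minimal sketch of this deadline-based read from the client's point of view, with a stub EN whose latency estimate is an assumption for illustration:

```python
class ENStub:
    def __init__(self, latency_ms, payload=b"data"):
        self.latency_ms = latency_ms
        self.payload = payload

    def estimated_latency_ms(self, extent):
        return self.latency_ms  # e.g. derived from IO already queued on this EN

    def read(self, extent, offset, length):
        return self.payload[offset:offset + length]


def deadline_read(replicas, extent, offset, length, deadline_ms):
    """Sketch of the read path above: an EN serves the read only if it can do
    so within the deadline; otherwise the client immediately tries another replica."""
    for en in replicas:
        if en.estimated_latency_ms(extent) > deadline_ms:
            continue  # EN replies at once that the deadline cannot be met
        return en.read(extent, offset, length)
    raise TimeoutError("no replica could meet the deadline")


print(deadline_read([ENStub(500), ENStub(20)], "extent-7", 0, 4, deadline_ms=100))  # b'data'
```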
Anti-Starvation
Many hard disk drives are optimized to achieve the highest possible throughput and sacrifice fairness to do so. They tend to prefer reads or writes that are sequential. Since streams can be very large, it was observed that some disks would lock into servicing large pipelined reads or writes while starving other operations. On some disks this could lock out non-sequential IO for as long as 2300 milliseconds. To avoid this problem, the stream layer does not schedule new IO to a spindle when there is already over 100 ms of expected pending IO scheduled, or when there is any pending IO request that has been scheduled but not serviced for over 200 ms.
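The scheduling rule can be expressed as a simple admission check; the function below is a sketch of that rule, not the actual scheduler:

```python
def should_schedule_new_io(expected_pending_ms, pending_wait_times_ms):
    """The anti-starvation rule described above: hold off on scheduling new IO
    when more than 100 ms of IO is already expected to be pending, or when any
    already-scheduled request has been waiting for more than 200 ms."""
    if expected_pending_ms > 100:
        return False
    if any(wait > 200 for wait in pending_wait_times_ms):
        return False
    return True


print(should_schedule_new_io(expected_pending_ms=80, pending_wait_times_ms=[50, 120]))  # True
print(should_schedule_new_io(expected_pending_ms=80, pending_wait_times_ms=[50, 250]))  # False
```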
Journaling
As we have seen, durability in the stream layer is achieved by keeping three replicas for each extent. One optimization that reduces latency while still maintaining durability is that each extent node reserves a whole disk drive or SSD as a journal drive for all writes into that extent node. When the partition layer does a stream append, the data is written by the primary EN and, in parallel, sent to the two secondaries to be written. While each EN performs its append, it (a) writes all of the data for the append to the journal drive and (b) queues up the append to go to the data disk where the extent file lives on that EN. Once either succeeds, success can be returned. If the journal succeeds first, the data is also buffered in memory while it is being written to the data disk, and any reads for that data are served from memory until the data is on the data disk; after that, the data is served from the data disk. This also enables combining contiguous writes into larger writes to the data disk, and better scheduling of concurrent writes and reads to get the best throughput. In addition, appends do not have to contend with reads going to the data disk, and journaling allows appends from the partition layer to have more consistent and lower latencies.
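A toy sketch of this journaling path (in-memory stand-ins only, no real disks) shows why an append can be acknowledged before the extent file on the data disk is updated:

```python
class JournalingEN:
    """Sketch of the journaling path above: every append is written to the
    journal drive and queued for the data disk; success can be returned as
    soon as either write lands, and reads are served from memory until the
    extent file on the data disk has caught up."""

    def __init__(self):
        self.journal = []        # dedicated journal drive (fast, sequential writes)
        self.data_disk = {}      # extent files, keyed by offset in this sketch
        self.memory_buffer = {}  # appended data not yet on the data disk

    def append(self, offset, data):
        self.journal.append((offset, data))  # journal write
        self.memory_buffer[offset] = data    # buffered until the data-disk write completes
        return "success"                     # acknowledged without waiting on the data disk

    def flush_to_data_disk(self):
        # Contiguous buffered appends can be combined into larger data-disk writes.
        for offset, data in sorted(self.memory_buffer.items()):
            self.data_disk[offset] = data
        self.memory_buffer.clear()

    def read(self, offset):
        # Serve from memory if the data-disk write has not completed yet.
        return self.memory_buffer.get(offset, self.data_disk.get(offset))


en = JournalingEN()
en.append(0, b"block-1")
print(en.read(0))          # b'block-1', served from memory before the flush
en.flush_to_data_disk()
print(en.read(0))          # b'block-1', now served from the data disk
```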
Conclusion
So, there we have it: the stream layer of WAS. I hope you found it interesting. As for the remaining Partition Layer, I will try to cover it in a future post. I know this is incomplete without it, so for those who are curious, I have linked the paper for you to explore: https://www.cs.purdue.edu/homes/csjgwang/CloudNativeDB/AzureStorageSOSP11.pdf. Bye!