Amazon Simple Storage Service (S3) and Amazon Glacier

Introduction

In this chapter, we'll discuss two important object storage services offered by AWS: Amazon Simple Storage Service (Amazon S3) and Amazon Glacier. Amazon S3 is highly secure, durable, and easily scalable cloud storage that lets developers and IT teams store and retrieve any amount of data from anywhere on the web. It offers a simple web service interface and charges only for the storage you actually use, eliminating the capacity planning and capacity constraints associated with traditional storage.

Amazon S3 is a foundational web service: nearly every application running in AWS uses it either directly or indirectly. It can be used alone or together with other AWS services, and it integrates tightly with many of them. Amazon S3 serves as target storage for services such as Amazon Kinesis, Amazon EMR, and Amazon EBS, and as a data staging or loading area for services such as Amazon Redshift and Amazon DynamoDB.

Amazon S3 offers a range of storage classes designed for different use cases, such as general purpose, infrequent access, and archive, along with configurable lifecycle policies that automatically migrate data to the most appropriate storage class. It also provides a rich set of permissions, access controls, and encryption options to keep data secure and to control who has access to it.

Amazon S3 is commonly used for backup and archive, media storage and distribution, big data analytics, static website hosting, cloud-native mobile and Internet application hosting, and disaster recovery. Because it is so flexible, highly integrated, and widely used, understanding Amazon S3 in detail is crucial.

Amazon Glacier is a cloud storage service closely related to Amazon S3. It is optimized for long-term backup and archiving of "cold data": data that is rarely accessed and for which a retrieval time of three to five hours is acceptable. Amazon Glacier can be used either as a storage class of Amazon S3 or as an independent archival storage service.

Object Storage in Amazon S3

Each Amazon S3 object contains both data and metadata. Objects reside in containers called buckets, and each object is identified by a unique user-specified key (its filename). Buckets are simple flat folders with no file system hierarchy: you can have multiple buckets, but you cannot nest a bucket within another bucket. Each bucket can hold an unlimited number of objects.

Amazon S3 is not a traditional file system and operates differently: instead of incrementally updating portions of a file, you GET or PUT an object as a whole. It is highly durable and scalable object storage optimized for reads, with a deliberately minimalist feature set. It provides a simple and robust abstraction for file storage that frees you from many of the underlying details you would normally handle in traditional storage.

For example, with Amazon S3 you don't have to worry about device or file system storage limits and capacity planning: a single bucket can store an unlimited number of files. You also don't need to worry about data durability or replication across Availability Zones: Amazon S3 objects are automatically replicated on multiple devices in multiple facilities within a region. The same goes for scalability: if your request rate grows steadily, Amazon S3 automatically partitions buckets to support very high request rates and simultaneous access by many clients.

Amazon Simple Storage Service (Amazon S3) Basics

  1. Buckets:

    Amazon Simple Storage Service (Amazon S3) stores files, or objects, in containers called buckets. Each object is contained within a bucket, and bucket names are global: they must be unique across all AWS accounts. Bucket names can contain up to 63 lowercase letters, numbers, hyphens, and periods. Best practice is to use bucket names that contain your domain name and conform to DNS naming rules.

  2. AWS Regions:

    Amazon S3 lets you choose the region where your data will be stored, giving you control over data placement. You can create buckets located close to a particular set of end users or customers to minimize latency, located in a specific region to satisfy data locality and sovereignty concerns, or located far from your primary facilities to satisfy disaster recovery and compliance needs. Data is stored in the chosen region unless you explicitly copy it to another bucket in a different region.
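
    As an illustrative sketch, creating a bucket in a chosen region with the boto3 SDK might look like the following. The bucket name and region here are hypothetical placeholders, not values from this chapter:

    ```python
    import boto3

    # Pin the client to the region where the bucket should live.
    s3 = boto3.client("s3", region_name="eu-west-1")

    # Bucket names are global, so this (hypothetical) name must be unique
    # across all AWS accounts and conform to DNS naming rules.
    s3.create_bucket(
        Bucket="example-bucket-2024",
        # Outside us-east-1, the target region must be stated explicitly.
        CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
    )
    ```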

  3. Objects:

    Amazon S3 buckets store entities called objects, which can hold virtually any kind of data in any format. Objects can range in size from 0 bytes up to 5 TB, and there is no limit to the number of objects that can be stored in a single bucket, so Amazon S3 can store an almost unlimited amount of data.

    Each object comprises data (the file) and metadata (data about the file). Amazon S3 treats an object's data as a stream of bytes; the service does not differentiate between text and binary data. The metadata associated with an object is a set of name/value pairs that describe the object. There are two types: system metadata, which is generated and used by Amazon S3 and includes values such as the object's size, date last modified, MD5 digest, and HTTP Content-Type; and user metadata, which is optional and can be specified at the time of object creation. You can use custom user metadata to tag your data with attributes that are meaningful to you.
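
    As a minimal sketch of how data and metadata travel together (the bucket name, key, and metadata values are invented for illustration):

    ```python
    import boto3

    s3 = boto3.client("s3")

    # PUT the object as a whole; user metadata rides along as name/value pairs.
    s3.put_object(
        Bucket="example-bucket-2024",           # hypothetical bucket
        Key="reports/q1.csv",                   # hypothetical key
        Body=b"date,amount\n2024-01-01,100\n",  # S3 treats this as opaque bytes
        ContentType="text/csv",                 # system metadata (HTTP Content-Type)
        Metadata={"department": "finance"},     # optional user metadata
    )

    # HEAD returns both kinds of metadata without downloading the data itself.
    head = s3.head_object(Bucket="example-bucket-2024", Key="reports/q1.csv")
    print(head["ContentLength"], head["ContentType"], head["Metadata"])
    ```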

  4. Keys:

    Every object stored in an Amazon S3 bucket has a unique identifier called a key, which serves as the object's filename. A key can contain up to 1024 bytes of Unicode UTF-8 characters, including slashes, backslashes, dots, and dashes. Keys must be unique within a single bucket, but different buckets can contain objects with the same key. The combination of bucket, key, and optional version ID uniquely identifies an object.

  5. Object URL:

    Each Amazon S3 object can be addressed by a unique URL formed from the web services endpoint, the bucket name, and the object key.

    For example, an object with the key "jack.doc" in the "mybucket" bucket can be accessed at the URL "http://mybucket.s3.amazonaws.com/jack.doc".

    If an object is instead stored under what looks like a nested directory structure, such as "http://mybucket.s3.amazonaws.com/fee/fi/fo/fum/jack.doc", the key (filename) is simply the longer string "fee/fi/fo/fum/jack.doc".

    Keys can contain delimiter characters such as slashes or backslashes to help organize Amazon S3 objects logically, but to Amazon S3 it is just a long key name in a flat namespace.
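
    A short sketch of how a delimiter imposes a folder-like view on that flat namespace, reusing the "mybucket" example above:

    ```python
    import boto3

    s3 = boto3.client("s3")

    # Group keys under "fee/fi/" as if it were a directory. S3 stores only
    # flat keys; the hierarchy exists purely in how the key strings are sliced.
    resp = s3.list_objects_v2(Bucket="mybucket", Prefix="fee/fi/", Delimiter="/")

    for obj in resp.get("Contents", []):
        print("key:", obj["Key"])                   # objects directly under fee/fi/
    for cp in resp.get("CommonPrefixes", []):
        print("folder-like prefix:", cp["Prefix"])  # e.g. fee/fi/fo/
    ```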

  6. Amazon S3 Operations:

    The Amazon S3 API is intentionally simple, with only a handful of common operations (a short SDK sketch follows the list). They include:

    a. Create/delete a bucket

    b. Write an object

    c. Read an object

    d. Delete an object

    e. List keys in a bucket
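
    A minimal round trip through these operations with boto3 might look like this sketch (the bucket and key names are hypothetical, and the bucket is assumed to already exist):

    ```python
    import boto3

    s3 = boto3.client("s3")
    bucket, key = "example-bucket-2024", "notes/hello.txt"  # placeholders

    # Write an object.
    s3.put_object(Bucket=bucket, Key=key, Body=b"hello, S3")

    # Read the object back as a whole.
    data = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    # List keys in the bucket.
    keys = [o["Key"] for o in s3.list_objects_v2(Bucket=bucket).get("Contents", [])]

    # Delete the object.
    s3.delete_object(Bucket=bucket, Key=key)
    ```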

  7. Durability and Availability:

    Data durability and availability are important concepts for any storage system, and Amazon S3 provides very high levels of both. Standard storage in Amazon S3 is designed for 99.999999999% durability and 99.99% availability of objects over a given year. Amazon S3 achieves this durability by storing data redundantly on multiple devices in multiple facilities within a region, providing a highly durable storage infrastructure designed for mission-critical and primary data storage.

    If you need to store non-critical or easily reproducible derived data, you can choose Reduced Redundancy Storage (RRS) instead. RRS offers 99.99% durability at a lower storage cost than standard Amazon S3 storage.

  8. Data Consistency:

    Amazon S3 is an eventually consistent system: changes to your data may take some time to propagate to all locations. For PUTs of new objects, Amazon S3 provides read-after-write consistency. For PUTs to existing objects and for object DELETEs, however, Amazon S3 provides eventual consistency. This means that if you PUT new data to an existing key, a subsequent GET might return the old data, and if you DELETE an object, a subsequent GET for that object might still return the deleted data. In all cases, updates to a single key are atomic: for eventually consistent reads, you will get either the new data or the old data, but never an inconsistent mix of the two.

  9. Versioning:

    Versioning helps safeguard your data against accidental or malicious deletion by maintaining multiple versions of each object in the bucket, each with a unique version ID. It allows you to preserve, retrieve, and restore every version of every object stored in your Amazon S3 bucket. If someone unintentionally alters or deletes an object in your bucket, you can restore the object to its original state by referencing the version ID along with the bucket and object key. Versioning is enabled at the bucket level and, once enabled, cannot be removed, only suspended.
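
    A sketch of enabling versioning and reading back older versions (the bucket and key names are hypothetical):

    ```python
    import boto3

    s3 = boto3.client("s3")
    bucket = "example-bucket-2024"  # hypothetical bucket

    # Versioning is enabled at the bucket level; it can later only be suspended.
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

    # Every subsequent PUT of the same key creates a new version.
    resp = s3.list_object_versions(Bucket=bucket, Prefix="notes/hello.txt")
    for v in resp.get("Versions", []):
        print(v["Key"], v["VersionId"], v["IsLatest"])

    # Retrieve a specific older version by bucket + key + version ID.
    if resp.get("Versions"):
        oldest = resp["Versions"][-1]["VersionId"]
        old = s3.get_object(Bucket=bucket, Key="notes/hello.txt", VersionId=oldest)
    ```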

  10. Multi-Factor Authentication (MFA) Delete:

    MFA Delete adds a further layer of protection on top of bucket versioning. It requires additional authentication to permanently delete an object version or to change the versioning state of a bucket. MFA Delete requires an authentication code (a temporary one-time password) generated by a hardware or virtual Multi-Factor Authentication (MFA) device. It is important to note that only the root account can enable MFA Delete.
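
    MFA Delete is toggled through the same versioning configuration. A hedged sketch follows; the call must be made with root-account credentials, and the MFA device serial number and one-time code shown are placeholders:

    ```python
    import boto3

    s3 = boto3.client("s3")

    # The MFA argument is the device serial number and the current one-time
    # code, separated by a space (both values below are placeholders).
    s3.put_bucket_versioning(
        Bucket="example-bucket-2024",
        MFA="arn:aws:iam::123456789012:mfa/root-account-mfa-device 123456",
        VersioningConfiguration={"Status": "Enabled", "MFADelete": "Enabled"},
    )
    ```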

  11. Object Lifecycle Management:

    Amazon S3 Object Lifecycle Management is roughly equivalent to automated storage tiering in traditional IT storage infrastructures. In many cases, data has a natural lifecycle: it starts out as "hot" (frequently accessed) data, moves to "warm" (less frequently accessed) data as it ages, and ends its life as "cold" (long-term backup or archive) data before eventual deletion.

    For example, many business documents are frequently accessed when they are created, then become much less frequently accessed over time. In many cases, however, compliance rules require business documents to be archived and kept accessible for years. Similarly, studies show that file, operating system, and database backups are most frequently accessed in the first few days after they are created, usually to restore after an inadvertent error. After a week or two, these backups remain a critical asset, but they are much less likely to be accessed for a restore. In many cases, compliance rules require that a certain number of backups be kept for several years.

    Using Amazon S3 lifecycle configuration rules, you can significantly reduce your storage costs by automatically transitioning data from one storage class to another, or even automatically deleting data, after a period of time.

    For example, the lifecycle rules for backup data might be as follows (a sketch of the matching configuration follows the list):

    1. Store backup data initially in Amazon S3 Standard.

    2. After 30 days, transition to Amazon S3 Standard-IA.

    3. After 90 days, transition to Amazon Glacier.
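
    A sketch of those three rules as a boto3 lifecycle configuration (the bucket name and the "backups/" prefix are hypothetical; new objects land in S3 Standard by default, so only the two transitions need to be declared):

    ```python
    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="example-bucket-2024",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "tier-down-backups",
                    "Filter": {"Prefix": "backups/"},  # apply only to backup data
                    "Status": "Enabled",
                    "Transitions": [
                        {"Days": 30, "StorageClass": "STANDARD_IA"},
                        {"Days": 90, "StorageClass": "GLACIER"},
                    ],
                }
            ]
        },
    )
    ```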

Amazon Glacier

Amazon Glacier is an extremely low-cost storage service that provides durable, secure, and flexible storage for data archiving and online backup. To keep costs low, Amazon Glacier is designed for infrequently accessed data where a retrieval time of three to five hours is acceptable.

Amazon Glacier can store an unlimited amount of virtually any kind of data, in any format. Common use cases include replacing traditional tape solutions for long-term backup and archive, and storing data required for compliance purposes. In most cases, the data stored in Amazon Glacier consists of large TAR (Tape Archive) or ZIP files.

Like Amazon S3, Amazon Glacier is extremely durable, storing data on multiple devices across multiple facilities in a region. Amazon Glacier is designed for 99.999999999% durability of objects over a given year.

Amazon Glacier Basics

  1. Archives:

    In Amazon Glacier, data is stored in archives. An archive can contain up to 40 TB of data, and you can have an unlimited number of archives. Each archive is assigned a unique, system-generated archive ID at the time of creation; unlike an Amazon S3 object key, you cannot specify a user-friendly archive name. All archives are automatically encrypted, and archives are immutable: once created, an archive cannot be modified.
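
    A sketch of uploading an archive with boto3, assuming a vault named "example-vault" already exists (vault creation is sketched under the next item); the file name and description are placeholders:

    ```python
    import boto3

    glacier = boto3.client("glacier")

    # Upload a (hypothetical) ZIP file as an archive. Glacier returns a
    # system-generated archive ID, the only handle for later retrieval.
    with open("backup-2024.zip", "rb") as f:
        resp = glacier.upload_archive(
            vaultName="example-vault",
            archiveDescription="monthly backup",
            body=f,
        )

    print(resp["archiveId"])  # store this; archives have no friendly names
    ```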

  2. Vaults:

    Vaults are containers for archives. Each AWS account can have up to 1,000 vaults. You can control access to your vaults and the actions allowed using IAM policies or vault access policies.
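
    Creating and listing vaults is straightforward in boto3; "example-vault" is a hypothetical name that only needs to be unique within the account and region:

    ```python
    import boto3

    glacier = boto3.client("glacier")

    # Create the vault that the archive sketch above uploads into.
    glacier.create_vault(vaultName="example-vault")

    # Enumerate this account's vaults (up to 1,000 per account).
    for vault in glacier.list_vaults().get("VaultList", []):
        print(vault["VaultName"], vault["NumberOfArchives"])
    ```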

  3. Vault Locks:

    You can deploy and enforce compliance controls for individual Amazon Glacier vaults with a vault lock policy. You can specify controls such as Write Once Read Many (WORM) in a vault lock policy and then lock the policy against future edits; once locked, the policy can no longer be changed.
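
    Locking is a two-step handshake: initiating the lock returns a lock ID, and the lock must be completed with that ID within 24 hours or the attempt expires. A sketch with a hypothetical WORM-style policy that denies deletion of archives younger than one year (the account ID, region, and vault name are placeholders):

    ```python
    import json

    import boto3

    glacier = boto3.client("glacier")

    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "deny-early-delete",
            "Principal": "*",
            "Effect": "Deny",
            "Action": "glacier:DeleteArchive",
            "Resource": "arn:aws:glacier:eu-west-1:123456789012:vaults/example-vault",
            "Condition": {"NumericLessThan": {"glacier:ArchiveAgeInDays": "365"}},
        }],
    }

    # Step 1: attach the policy in a revocable "in progress" state.
    lock = glacier.initiate_vault_lock(
        vaultName="example-vault",
        policy={"Policy": json.dumps(policy)},
    )

    # Step 2: complete the lock; after this the policy can never be changed.
    glacier.complete_vault_lock(vaultName="example-vault", lockId=lock["lockId"])
    ```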

  4. Data Retrieval:

    You can retrieve up to 5% of the data you store in Amazon Glacier for free each month, calculated on a daily prorated basis. If you retrieve more than 5%, you incur retrieval fees based on your maximum retrieval rate. To eliminate or minimize those fees, you can set a data retrieval policy that limits retrievals to the free tier or to a specified data rate.
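
    A minimal sketch of capping retrievals at the free tier with boto3 (data retrieval policies apply per account, per region):

    ```python
    import boto3

    glacier = boto3.client("glacier")

    # Reject any retrieval request that would exceed the monthly free tier.
    glacier.set_data_retrieval_policy(
        Policy={"Rules": [{"Strategy": "FreeTier"}]},
    )
    ```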

In conclusion, Amazon S3 and Amazon Glacier are powerful cloud storage solutions offered by Amazon Web Services. S3 is optimized for highly durable and highly available storage of frequently accessed data, while Glacier is designed for long-term archival storage of infrequently accessed data. Both services offer easy-to-use APIs, simple pricing models, and built-in security features. By understanding the basics of S3 and Glacier, including buckets, objects, keys, and versioning, you can make informed decisions about which service best meets your storage needs. Whether you’re storing images, videos, documents, or backups, Amazon S3 and Glacier provide scalable, reliable, and cost-effective solutions for your data storage requirements.
