Effortless On-Prem Backup to AWS Cloud with DataSync


Suppose you have been given the task of automating backups of on-premises data to the AWS Cloud. The data consists of approximately 1 million .csv files hosted on a VM, and it needs to be backed up daily.
As a solutions architect, you need to provide a solution that uses custom filters to back up only the subset of the data located in designated source directories. You can also leverage the AWS Direct Connect connection that the organization has already set up.
You can follow the steps below to achieve the desired result.
Install the DataSync Agent:
Deploy the AWS DataSync agent as a virtual machine (VM) in the on-premises environment, with network access to the storage where the .csv files are hosted. The agent reads the files (for example, over NFS or SMB) and transfers them from the on-premises environment to AWS storage services.
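Once the agent VM is running, it has to be registered with your AWS account using an activation key. The following is a minimal sketch of that step using boto3, not a definitive implementation. The agent name, region, and all identifiers are hypothetical placeholders, and the optional VPC endpoint arguments assume you want the agent to talk to DataSync over a private endpoint (which is the usual choice when traffic flows over Direct Connect).

```python
from typing import Dict, List, Optional


def build_agent_request(
    activation_key: str,
    agent_name: str,
    vpc_endpoint_id: Optional[str] = None,
    subnet_arns: Optional[List[str]] = None,
    security_group_arns: Optional[List[str]] = None,
) -> Dict[str, object]:
    """Build keyword arguments for the DataSync create_agent call."""
    params: Dict[str, object] = {
        "ActivationKey": activation_key,
        "AgentName": agent_name,
    }
    private = (vpc_endpoint_id, subnet_arns, security_group_arns)
    if any(private):
        # A private (VPC endpoint) agent needs all three values together.
        if not all(private):
            raise ValueError(
                "vpc_endpoint_id, subnet_arns and security_group_arns "
                "must be provided together"
            )
        params["VpcEndpointId"] = vpc_endpoint_id
        params["SubnetArns"] = list(subnet_arns)
        params["SecurityGroupArns"] = list(security_group_arns)
    return params


def register_agent(params: Dict[str, object], region: str = "us-east-1") -> str:
    """Register the agent and return its ARN (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder above has no dependencies

    client = boto3.client("datasync", region_name=region)
    return client.create_agent(**params)["AgentArn"]
```

Keeping the request builder separate from the API call makes it easy to review the parameters before anything is sent to AWS.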
Configure the Source and Destination:
Source: Specify the source directories on the VM where the .csv files are located.
Destination: Configure the destination as an Amazon S3 bucket. For example: s3://my-backup-bucket/csv-backups/
Apply custom filters during task configuration to back up only specific files or directories based on requirements (e.g., file name patterns or directory paths).
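To make the source, destination, and filter settings more concrete, here is a hedged sketch of how the request parameters could be built for boto3. It assumes the source directories are exposed as an NFS export (SMB would use a different location call), and the host name, export path, role ARN, and directory names are hypothetical. DataSync include filters are a single string of patterns separated by "|", relative to the source location root.

```python
from typing import Dict, Iterable, List


def build_include_filter(paths: Iterable[str]) -> List[Dict[str, str]]:
    """Join directory or file patterns into one SIMPLE_PATTERN include filter."""
    cleaned = []
    for path in paths:
        path = path.strip()
        if not path:
            continue
        # Patterns are relative to the source location and start with "/".
        cleaned.append(path if path.startswith("/") else "/" + path)
    if not cleaned:
        return []
    return [{"FilterType": "SIMPLE_PATTERN", "Value": "|".join(cleaned)}]


def build_nfs_location(server: str, export_path: str, agent_arn: str) -> Dict[str, object]:
    """Keyword arguments for datasync.create_location_nfs (assumed NFS source)."""
    return {
        "ServerHostname": server,
        "Subdirectory": export_path,
        "OnPremConfig": {"AgentArns": [agent_arn]},
    }


def build_s3_location(bucket: str, prefix: str, role_arn: str) -> Dict[str, object]:
    """Keyword arguments for datasync.create_location_s3."""
    return {
        "S3BucketArn": f"arn:aws:s3:::{bucket}",
        "Subdirectory": "/" + prefix.strip("/"),
        "S3Config": {"BucketAccessRoleArn": role_arn},
    }


if __name__ == "__main__":
    # Hypothetical designated directories to back up.
    print(build_include_filter(["/sales", "finance/2024"]))
    print(build_s3_location(
        "my-backup-bucket",
        "csv-backups/",
        "arn:aws:iam::111122223333:role/hypothetical-datasync-role",
    ))
```

The returned dictionaries can be passed to the matching boto3 DataSync client calls with `**params`, and the include filter is supplied later when the task is created.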
Set Up a Scheduled Task:
Create a DataSync task and schedule it to run daily. You can configure the schedule using cron expressions to automate daily backups.
By default, DataSync transfers only incremental changes (new or modified files), which reduces the amount of data sent on each run.
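As an illustration of the scheduling step, the sketch below builds the parameters for a DataSync task that runs once a day and copies only changed files. The task name, the 02:00 UTC run time, and the location ARNs are assumptions chosen for the example. The schedule uses the AWS six-field cron format.

```python
from typing import Dict, List, Optional


def build_daily_task(
    source_location_arn: str,
    destination_location_arn: str,
    hour_utc: int = 2,
    includes: Optional[List[Dict[str, str]]] = None,
    name: str = "daily-csv-backup",  # hypothetical task name
) -> Dict[str, object]:
    """Build keyword arguments for datasync.create_task with a daily schedule."""
    if not 0 <= hour_utc <= 23:
        raise ValueError("hour_utc must be between 0 and 23")
    params: Dict[str, object] = {
        "SourceLocationArn": source_location_arn,
        "DestinationLocationArn": destination_location_arn,
        "Name": name,
        # Run every day at the given hour (UTC), minute 0.
        "Schedule": {"ScheduleExpression": f"cron(0 {hour_utc} * * ? *)"},
        # CHANGED copies only data that differs between source and destination.
        "Options": {"TransferMode": "CHANGED"},
    }
    if includes:
        params["Includes"] = includes
    return params


def create_task(params: Dict[str, object], region: str = "us-east-1") -> str:
    """Create the task and return its ARN (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder above has no dependencies

    client = boto3.client("datasync", region_name=region)
    return client.create_task(**params)["TaskArn"]
```

Passing an include filter such as `[{"FilterType": "SIMPLE_PATTERN", "Value": "/sales|/finance"}]` restricts each scheduled run to the designated directories.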
Leverage AWS Direct Connect:
- Since the company already has an AWS Direct Connect connection, DataSync can use the private connection for faster, more reliable data transfer with reduced network costs compared to the public internet.
Data Lifecycle Management:
- Use S3 Lifecycle Policies to manage data retention, such as archiving older backups to S3 Glacier or S3 Glacier Deep Archive to reduce costs.
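A lifecycle policy of this kind could look like the following sketch. The rule ID, the prefix, and the 30-day and 180-day thresholds are hypothetical values for illustration, so pick retention periods that match your own requirements.

```python
from typing import Dict, Optional


def build_archive_rule(
    prefix: str,
    glacier_after_days: int = 30,
    deep_archive_after_days: int = 180,
    expire_after_days: Optional[int] = None,
) -> Dict[str, object]:
    """Build one S3 lifecycle rule that archives, and optionally expires, backups."""
    if deep_archive_after_days <= glacier_after_days:
        raise ValueError("deep_archive_after_days must be greater than glacier_after_days")
    rule: Dict[str, object] = {
        "ID": "archive-csv-backups",  # hypothetical rule name
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
            {"Days": deep_archive_after_days, "StorageClass": "DEEP_ARCHIVE"},
        ],
    }
    if expire_after_days is not None:
        if expire_after_days <= deep_archive_after_days:
            raise ValueError("expire_after_days must be greater than deep_archive_after_days")
        rule["Expiration"] = {"Days": expire_after_days}
    return rule


def apply_lifecycle(bucket: str, rule: Dict[str, object]) -> None:
    """Apply the rule to the bucket (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder above has no dependencies

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": [rule]},
    )
```

Note that `put_bucket_lifecycle_configuration` replaces the bucket's existing lifecycle configuration, so include any rules you want to keep in the same call.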
Why AWS DataSync?
Automated Backups: DataSync simplifies and automates the transfer process, reducing operational overhead.
Incremental Transfers: Only modified or new files are transferred, saving time and bandwidth.
Custom Filtering: You can specify filters to include or exclude specific files or directories during the transfer.
Integration with Direct Connect: DataSync works seamlessly with AWS Direct Connect for efficient and secure data transfer.
Ease of Management: DataSync is a fully managed service, eliminating the need to build and maintain custom scripts or processes.
Alternative Solutions (With Higher Overhead)
AWS Storage Gateway:
You could use the File Gateway configuration to expose an NFS or SMB interface on-premises and back up data to S3.
However, this would require additional configuration and might not support fine-grained filtering out of the box.
Custom Scripts with AWS CLI/SDK:
Write custom scripts to copy data from the VM to S3 using tools like the AWS CLI or SDKs.
This approach requires ongoing maintenance and development effort, increasing operational overhead.
Amazon S3 Transfer Acceleration:
- You could use S3 Transfer Acceleration for faster uploads over the public internet. However, this would not leverage the existing Direct Connect connection, and it provides no filtering capabilities.
Summary
The best solution for automating daily backups of on-premises .csv files with custom filters, minimal operational overhead, and efficient use of AWS Direct Connect is AWS DataSync.
Written by