Upset with Fivetran Scheduler Offsets

A few months ago, I finally had the opportunity to use Fivetran in a real-world project. For some background, this was a greenfield project involving a financial system, and the tooling and architecture had been selected before my arrival. The tech stack included Databricks for the Data Lake (or Delta Lake, to be precise), with Fivetran chosen as the data acquisition tool. The company was using Control-M as its scheduler, but no one really wanted to use it. For this particular task, the decision was to use the Fivetran Scheduler to trigger the data acquisition processes.
My Mission (or so I thought)
I was brought into this project to work on the Databricks components. Even so, I was eager to see Fivetran in action. Architect colleagues I had collaborated with on DMS work often touted how much easier Fivetran would make these acquisition and migration tasks.
I diligently explored how to orchestrate the data acquisition with Databricks workflows. As expected, Databricks didn't have direct connectivity to Fivetran. This is a common security practice: it keeps data engineers from being tempted to write code that hits the Fivetran APIs directly, which can lead to messy implementations.
I raised the concern that orchestration should go through the organization's preferred orchestration tool, even if it was considered outdated technology. Unfortunately, my concerns were ignored. With my well-defined JIRA card languishing without progress, I decided to put my head down and proceed with the work regardless.
With no API connectivity, I began examining the Fivetran logs, which Fivetran was already syncing to Databricks. By the way, Fivetran doesn't provide logs on its web console; the way a customer gets at log data is by syncing it to a destination using a Fivetran connector. (That topic might be a future blog post.)
The Fivetran log connector had a 5-15 minute delay, which was acceptable for the time being; SLAs would be dealt with much later on.
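For the curious, this is roughly how we inspected sync start times from Databricks. It's a sketch rather than our exact notebook: the table and column names (fivetran_log.log, message_event, time_stamp) follow the typical Fivetran log connector schema, so verify them against your own destination before trusting the query.

```python
# Sketch: list recent sync starts from the Fivetran log connector data.
# Assumes a Databricks notebook where `spark` is already defined, and that
# the log connector landed its data in a schema named `fivetran_log`.
from pyspark.sql import functions as F

sync_starts = (
    spark.table("fivetran_log.log")
    .where(F.col("message_event") == "sync_start")   # sync start events
    .select("connector_id", "time_stamp")
    .orderBy(F.col("time_stamp").desc())
)

sync_starts.show(20, truncate=False)
```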
Initial Signs
We had a job scheduled at 4 p.m. From the logs, I noticed syncs were only starting around 4:11 p.m. This seemed odd, but I didn't think much of it at the time, as it was still early in the project and nothing was stable yet.
After a few more weeks, we were preparing to promote the code to the next higher environment. We deployed the same code (Fivetran was deployed using Terraform, by the way) with a 4 p.m. sync. We waited, but by 4:15 the job hadn't started. By 4:30, still nothing. 4:45, nothing. The job finally started at 4:54 p.m.
RTFM
Digging through the Fivetran documentation, I stumbled upon this warning:
https://fivetran.com/docs/core-concepts/syncoverview
When you add a new destination, Fivetran assigns it a fixed time offset. The offset can be any random value in minutes ranging from 0 to 60. It is derived from the destination ID hash. This offset is shared by every connector in the destination. The offset value remains the same regardless of the set sync frequency.
Let that sink in: a random 0-60 minute offset. And it can’t be modified, even through a support request (trust me, I tried).
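To make the impact concrete, here's a toy Python illustration of what a hash-derived offset does to a schedule. This is not Fivetran's actual algorithm (as far as I know, that isn't published); it just shows how an offset that is fixed per destination, but effectively random from your point of view, shifts a 4 p.m. schedule.

```python
import hashlib

def toy_offset_minutes(destination_id: str) -> int:
    # Illustrative only: derive a stable 0-59 minute offset from an ID.
    # Fivetran's real derivation from the destination ID hash is not public.
    digest = hashlib.sha256(destination_id.encode()).hexdigest()
    return int(digest, 16) % 60

# A connector "scheduled" for 16:00 actually starts at 16:00 plus the offset.
offset = toy_offset_minutes("my_destination_id")  # hypothetical destination ID
print(f"Scheduled 16:00, effective start 16:{offset:02d}")
```

Swap in a different destination ID and you get a different, equally unchangeable start time.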
Denouement
This discovery threw a wrench in our plans to avoid using Control-M. After sinking a moderate amount of effort into the log-based solution, we pivoted to a simple script triggered by Control-M. In terms of code, it was the simpler solution. In terms of infrastructure, though, it was a minor nightmare: we had to sort out an agent for Control-M, open up network connectivity, create a Fivetran service user to make the API calls, and manage secrets, among other tasks.
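For reference, the script was along these lines. This is a minimal sketch, assuming the standard Fivetran REST API connector sync endpoint with basic auth (API key and secret); the connector ID and the environment-variable credential handling are placeholders for whatever your own setup uses.

```python
# Sketch of a sync-trigger script for Control-M to call.
# Assumes FIVETRAN_API_KEY / FIVETRAN_API_SECRET are injected by your
# secret management, and that the connector ID is known ahead of time.
import os
import requests

FIVETRAN_API = "https://api.fivetran.com/v1"

def trigger_sync(connector_id: str, force: bool = False) -> None:
    # POST /v1/connectors/{id}/sync kicks off a sync for one connector.
    resp = requests.post(
        f"{FIVETRAN_API}/connectors/{connector_id}/sync",
        auth=(os.environ["FIVETRAN_API_KEY"], os.environ["FIVETRAN_API_SECRET"]),
        json={"force": force},
        timeout=30,
    )
    resp.raise_for_status()
    print(f"Sync requested for {connector_id}: {resp.json()}")

if __name__ == "__main__":
    trigger_sync("my_connector_id")  # placeholder connector ID
```

Wrapping the call in a script like this also gives the scheduler a clean exit code to act on, which is exactly what Control-M wants.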
Things I learned
Avoid the Fivetran Scheduler like the plague; the API is the way to go for triggering Fivetran syncs. You won't be hit by an unreasonably large offset, and you'll have more options for handling reschedules or sync errors. I'll rant about the Fivetran Scheduler's retry logic another time.
Embrace the orchestration tool as your ally. I still don't like Control-M, but I recognize the robust infrastructure this organization has built around it: change management processes, 24/7 support teams, alerting, and more. It does make me wonder what hurdles would need to be cleared to get something like Airflow productionized.
Written by Kurdapyo Data Engineer
I’m the kuya at Kurdapyo Labs — a recovering Oracle developer who saw the light and helped migrate legacy systems out of Oracle (and saved a lot of money doing it). I used to write PL/SQL, Perl, ksh, Bash, and all kinds of hand-crafted ETL. These days, I wrestle with PySpark, Airflow, Terraform, and YAML that refuses to cooperate. I’ve been around long enough to know when things were harder… and when they were actually better. This blog is where I write (and occasionally rant) about modern data tools — especially the ones marketed as “no-code” that promise simplicity, but still break in production anyway. Disclaimer: These are my thoughts—100% my own, not my employer’s, my client’s, or that one loud guy on tech Twitter. I’m just sharing what I’ve learned (and unlearned) along the way. No promises, no warranties—just real talk, some opinions, and the occasional coffee/beer-fueled rant. If something here helps you out, awesome! If you think I’ve missed something or want to share your own take, I’d love to hear from you. Let’s learn from each other.