Combine Dataform with Typescript for Improved Workflows

“But how do you ensure every BigQuery table you create has the same metadata in Dataform?”

Yes, I thought to myself, how do I ensure that? This was a question posed to me recently, and I realized I didn't quite have an answer. And I really didn't want to use Terraform as the solution. We could use Javascript with Dataform and pass objects around. That would certainly help.

But Typescript is just oh so much nicer. Is there a way to use it?

On first thought, it seemed quite simple. Typescript compiles to Javascript. Dataform has a package.json, so clearly it supports node modules. I can probably just install anything I want, and then simply use the compiled Javascript files for my workflow. Time to wrap this one up, no? Not so fast.

Turns out there are a host of problems, and I will go through some of the things I bumped into in this little experiment. But if you want to skip all of that and just get the code, here's a preview of what the final thing looks like.

A peek into the final result

As you can see, there are Typescript files under the src folder, Javascript under the build folder, and finally the Dataform files under definitions. This gives us clean typed definitions, chained functions with autocomplete, type hints, the whole deal (well, almost, as you will soon see).

The repository with code is here: https://github.com/raghuveer-s/combine-dataform-with-typescript/tree/main

Now let’s breakdown this journey.

Problems and solutions

Missing type definitions

The very first thing you might look for when working with a greenfield Typescript project is type definitions. But unfortunately, there is no convenient @types/dataform provided by Google on npm. If you read the Dataform docs on using Javascript or look at the Dataform core reference (https://cloud.google.com/dataform/docs/reference/dataform-core-reference), it's clear we can use types in the Typescript code. The type information also exists to some degree in the source code of the core package (https://github.com/dataform-co/dataform/tree/main/core) and the protobuf files.

First problem solved: we can use a mix of the documentation and the source to extract the type definitions we need: https://github.com/raghuveer-s/combine-dataform-with-typescript/blob/main/src/global.d.ts
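As a sketch, a hand-rolled global.d.ts might declare the Dataform globals you actually use. The shapes below are an approximation pieced together from the core reference, not an official typing; adjust them to the version of @dataform/core you run against.

```typescript
// src/global.d.ts — hand-written declarations for the Dataform globals.
// Names and shapes are an approximation based on the dataform-core
// reference; refine them against the source as needed.

interface ITableConfig {
  type?: "table" | "view" | "incremental";
  schema?: string;
  description?: string;
  tags?: string[];
  bigquery?: {
    partitionBy?: string;
    clusterBy?: string[];
  };
}

interface ITableContext {
  ref(name: string): string;
  self(): string;
  incremental(): boolean;
}

interface ITable {
  query(q: string | ((ctx: ITableContext) => string)): ITable;
  preOps(ops: (ctx: ITableContext) => string | string[]): ITable;
}

// Globals injected by the Dataform runtime.
declare function publish(name: string, config?: ITableConfig): ITable;
declare function operate(name: string, queries: string | string[]): void;
```

With this in place, the Typescript compiler can type-check and autocomplete calls to publish() in your source files even though the implementation only exists inside Dataform's runtime.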

Node modules are supported.. but not really?

The Dataform documentation mentions that you can install packages through package.json (https://cloud.google.com/dataform/docs/install-package#install-package). This sounds great. Node modules embedded alongside ETL transformations? Sign me up, I say!

But unfortunately, this was yet another road bump. YMMV greatly here.

When I experimented with installing even the simplest of modules, I ran into compilation issues. Here's the use case I tried (building off the brewery example as always, see here: https://raghuveer.me/a-practical-introduction-to-google-cloud-dataform):

  • The daily location sales has a column which is the total sales in INR (Indian Rupees).

  • So, how about I enrich this by getting the latest exchange rate and have the total sales in USD (US Dollar)?

A little contrived, but I think it makes sense in the context of sales reports.

To do this, I installed a really small module called currencies-exchange-rates (https://www.npmjs.com/package/currencies-exchange-rates) and attempted to use it, but even before I could check whether the module's currency conversion still works, I kept running into compilation issues. A couple of things I discovered while debugging:

  • Dataform only uses CommonJS.

  • Overcoming this hurdle with bundlers is still a no-go, as only a strict subset of modules seems to be allowed in the end.

Going down the rabbit hole

What struck me as odd is that I tried the module mentioned in the documentation (postoffice, for which the source doesn't seem to exist on Github), but even that resulted in a compilation error. IIRC, the error message said it failed to find node:url. This was interesting: if this is the built-in url package and Dataform complains with compilation errors, then I have to ask whether dependencies are severely restricted in some way. I half-confirmed this hypothesis by installing the is-buffer library, which has no dependencies, and no compilation errors surfaced.

Needing more information, I decided to check things locally using dataform compile first. I tried bundling everything into a single bundle.js file, wondering if that might help, but I kept running into a VMError.

Browsing the source, what I can gather is that Dataform uses vm2 (https://github.com/patriksimek/vm2) to execute untrusted Javascript code. This means code and modules are executed in an isolated context, and support even for built-in modules may be restricted, let alone support for arbitrary modules from npm. This constraint also explains the need for CommonJS, because vm2 is built on top of Node's vm module (https://nodejs.org/api/vm.html).

So it would seem the only way to avoid the problem is... by not using most modules from npm in the core code. Which is where I stand right now. All the modules I've used in the Typescript repo so far are for testing, linting, building, and so on.

Honestly, this one’s a bummer. I do hope I’m in the wrong here and I just misunderstood the documentation somehow. Or if someone can suggest a workaround, that would be awesome. Technically I guess you could this by stitching together a pipeline with Workflows, Cloud Functions and so on.

I really hope Dataform can do this in the future. In my opinion, there is a strong case for working with node modules and enriching workflows directly in Dataform. But for now, we soldier on.

Global scope

The includes/ folder can contain Javascript files with constants and utility functions (https://cloud.google.com/dataform/docs/reuse-code-includes), but exporting them with module.exports as described in the docs makes them available in the global scope.

In other words: using them from other files does not strictly need a require(). Referencing the file name lets you use the exported constant or function. You can see an example of this here: https://github.com/GoogleCloudPlatform/bigquery-utils/blob/09082672f8a35037f79a54d166cf17fca7792c6d/dataform/examples/dataform_assertion_unit_test/definitions/tests/test_date_assertions.js

I find this a bit confusing. I would rather use require() to reference modules that contain Javascript code. Turns out this works just like a regular require() where we specify the path of the module.
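To make that concrete, here is a minimal sketch of an includes module and the two ways a compiled definitions file can consume it. The file names and values are illustrative, not from the repo.

```typescript
// src/includes/constants.ts — compiles to includes/constants.js after the
// build step, which is where Dataform expects utility code to live.
export const DEFAULT_SCHEMA = "reporting";
export const SALES_TABLE_PREFIX = "daily_sales_";

// Build a per-location sales table name from the shared prefix.
export function tableName(location: string): string {
  return `${SALES_TABLE_PREFIX}${location}`;
}

// In a definitions/*.js file, both styles resolve to the same module:
//   constants.tableName("bangalore")                      // global-style
//   require("includes/constants").tableName("bangalore")  // explicit path
```

I prefer the explicit require() style because it keeps the dependency visible in the file that uses it.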

This is both interesting and important to keep in mind.

Folder structure

Dataform has a strict folder structure requirement. It needs definitions/ for the workflow files, and includes/ for any utility functions.

And now this one’s one me, but I don’t like my typescript code’s folder structure being coerced like this, I want to have my tables in a tables folder, my ETL in a transformations folder, my tests in a tests folder and so on. Which means I need to map my folder structure to what Dataform wants and modify import paths if necessary to have it respect Dataform’s requirements in the final .js files.

Any good build tool and some scripts should be able to take care of this one.
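A small post-build script is enough. The layout below (tables/, transformations/, utils/ under src) is just the structure I describe above; treat the specific paths and the mapping rules as assumptions of this sketch.

```typescript
// scripts/restructure.ts — sketch of a post-build step that maps compiled
// output into the layout Dataform expects. Folder names are illustrative.
import * as path from "path";

// Map a file in the build output to its destination in the Dataform tree:
// build/utils/*.js        -> includes/*.js
// build/<anything else>/* -> definitions/<anything else>/*
export function mapBuildPath(buildFile: string): string {
  const rel = path.relative("build", buildFile);
  const [top, ...rest] = rel.split(path.sep);
  if (top === "utils") {
    return path.join("includes", ...rest);
  }
  // tables/, transformations/, tests/ all land under definitions/,
  // which Dataform allows to contain subdirectories.
  return path.join("definitions", top, ...rest);
}
```

Walking the build/ directory and running fs.copyFileSync over the mapped paths finishes the job.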

Path referencing and manipulation

Now we get to the glue that makes all of this work.

To smoothly interoperate between Typescript land and compiled Javascript, we can use tsconfig.json. Depending on what you are trying to do, there are a few options available:

  • paths config in tsconfig.json to remap imports to the correct location. I used this to remap all paths with a prefix of @includes/ to resolve to modules in the includes/ directory.

  • Use declare to declare the variable or function to the Typescript compiler. This is useful, for example, to make functions accessible globally without any imports. It can be used when path manipulation is tricky.

  • And lastly, you have typeRoots which is useful if you want to add more organization to your code or you have more than one type definition file. For example, you could create a @types folder for your type definitions and add typeRoots in tsconfig.json with the folders that are interesting to you.

I used paths configuration to help with the module resolution. You could use a mix of the above to varying effect.
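Putting those together, the relevant part of a tsconfig.json might look roughly like this. The @includes/ alias and the src/@types folder are conventions of this setup, not anything Dataform requires.

```json
{
  "compilerOptions": {
    "module": "commonjs",
    "outDir": "build",
    "baseUrl": "src",
    "paths": {
      "@includes/*": ["includes/*"]
    },
    "typeRoots": ["./node_modules/@types", "./src/@types"]
  }
}
```

Note the "module": "commonjs" setting, which matches the CommonJS constraint discussed earlier.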

The build tool I used is rollup (https://rollupjs.org/guide/en/) with the replace() plugin to transform the @includes paths into ./includes, plus a tiny script to move the files I want from the build output into a structure Dataform finds acceptable. Depending on how you structure your project, this can technically mean the files have incorrect paths in the build directory. For the final code in this repository, this is in fact what happens. It could be corrected, but my reasoning was that I'm never going to use the artifacts in the build/ directory except as a temporary location (and I got lazy) to copy from into Dataform's folder structure, so I cheated a little and let it be.
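As a sketch, the path-rewriting part of the rollup config looks roughly like this. It assumes @rollup/plugin-replace and @rollup/plugin-typescript are installed; check the plugin docs for the exact options in your version.

```javascript
// rollup.config.js — sketch; entry point and options are illustrative.
import replace from "@rollup/plugin-replace";
import typescript from "@rollup/plugin-typescript";

export default {
  input: "src/tables/daily_sales.ts",
  output: { dir: "build", format: "cjs" }, // Dataform wants CommonJS
  plugins: [
    typescript(),
    // Rewrite the @includes/ alias into a relative path that becomes
    // valid once the files are copied into Dataform's folder structure.
    replace({
      preventAssignment: true,
      delimiters: ["", ""],
      values: { "@includes/": "./includes/" },
    }),
  ],
};
```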

This is the most important section of this experiment. Depending on how you organize your project, tsconfig.json and rollup.config.js will be the files that need tweaking.

Building the code

With most of the heavy lifting done by tsconfig.json and rollup.config.js, the code can be built like a regular Typescript project.

You could also chain dataform compile to this and check locally whether the compilation is valid! This way a lot of the work is shifted left; we spend less time going online and checking whether things work in the Dataform Workspace code editor.
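For instance, the whole chain can live in package.json scripts. The script names and the restructure step are from my setup; adapt them to yours.

```json
{
  "scripts": {
    "build": "rollup -c && node scripts/restructure.js",
    "check": "npm run build && dataform compile"
  }
}
```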

Now, coming back to the original question of how you ensure that the BigQuery tables you create have the same configuration. As a first step, we can create a "BigQuery incremental table config", ensure it has the properties we want (partitioning, clustering spec, and so on), and use it in the publish() method.
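Sketching the idea in Typescript: a single factory produces the incremental table config, and every sales table's publish() call goes through it. The config shape follows the Dataform core reference; the factory itself and its defaults are my invention for illustration.

```typescript
// src/includes/table_config.ts — one source of truth for how incremental
// sales tables are configured. Schema, tags, and defaults are illustrative.

export interface IncrementalSalesConfig {
  type: "incremental";
  schema: string;
  tags: string[];
  bigquery: {
    partitionBy: string;
    clusterBy: string[];
  };
}

// Every incremental sales table gets the same partitioning and clustering
// treatment; only the partition column and cluster columns vary.
export function incrementalSalesConfig(
  partitionColumn: string,
  clusterColumns: string[] = ["location"]
): IncrementalSalesConfig {
  return {
    type: "incremental",
    schema: "sales",
    tags: ["sales", "daily"],
    bigquery: {
      partitionBy: `DATE(${partitionColumn})`,
      clusterBy: clusterColumns,
    },
  };
}

// In a definitions file:
//   publish("daily_location_sales", incrementalSalesConfig("sold_at"));
```

Because the factory is plain typed code, a unit test can assert that every table config it produces carries the required metadata, which is exactly the guarantee the original question asked for.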

Next, we could even think of enforcing rules so that publish() for "daily sales data" can only be called with this incremental table config. I certainly think this can be tweaked further to harden the transformation logic and make it resilient to regressions or errors.

The key takeaway is that we now have a programmatic way of checking what we are doing with our data transforms.

Putting it all together

The source code as mentioned is here: https://github.com/raghuveer-s/combine-dataform-with-typescript/tree/main.

The end result is that we have our Dataform workflows in fully typed code with unit tests, which in my opinion makes them much more maintainable. It's a proof of concept and a bit rough around the edges, but I intend to use it as a base and experiment with it in future projects. Let me know if it was useful!

Written by

Raghuveer Sriraman

Working mostly on Data Engineering, Machine Learning and Bayesian analytics.