Combine Dataform with Typescript for Improved Workflows


“But how do you ensure every BigQuery table you create has the same metadata in Dataform?”
Yes, I thought to myself, how do I ensure that? This was a question posed recently to me and I realized I don’t quite have an answer. And I really didn’t want to use Terraform as the solution. We could use Javascript with Dataform and use objects that are passed around. That would certainly help.
But Typescript is just oh so much nicer. Is there a way to use it?
On first thought, it seemed quite simple. Typescript compiles to Javascript. Dataform has a package.json
so clearly it supports node modules. I can probably just install anything I want, and then simply use the compiled Javascript files for my workflow. Time to wrap this one up no? Not so fast.
Turns out there are a host of problems and I will go through some the things I bumped into in this little experiment. But if you want to skip all of that and just get the code, here’s a preview into what the final thing looks like.
A peek into the final result
As you can see, there are typescript files under src
folder, javascript under the build
folder and finally the dataform files under definitions
. This allows us to clean typed definitions, chain functions with autocomplete, get type hints, the whole deal (well, almost as you will soon see).
The repository with code is here: https://github.com/raghuveer-s/combine-dataform-with-typescript/tree/main
Now let’s breakdown this journey.
Problems and solutions
Missing type definitions
The very first thing you might look for is type definitions when working with a greenfield Typescript project. But unfortunately, there is no convenient @types/dataform
provided by Google in npm. If you read the Dataform docs on using Javascript or look at the dataform core reference (https://cloud.google.com/dataform/docs/reference/dataform-core-reference), its clear we can use types in the Typescript code. It also does exist to some degree in the source code in the core package (https://github.com/dataform-co/dataform/tree/main/core) and the protobuf files.
First problem solved, we can use a mix of the documentation and the source to extract the type definitions we need: https://github.com/raghuveer-s/combine-dataform-with-typescript/blob/main/src/global.d.ts
Node modules are supported.. but not really?
Dataform documentation mentions that you can install packages through package.json
(https://cloud.google.com/dataform/docs/install-package#install-package). This sounds great. Node modules embedded along with ETL transformations? Sign up me up so I says!
But unfortunately, this was yet another road bump. YMMV greatly here.
When I was experimented with installing even the simplest of modules, I ran into compilation issues. Here’s the usecase I used (building off the brewery example as always, refer here: https://raghuveer.me/a-practical-introduction-to-google-cloud-dataform):
The daily location sales has a column which is the total sales in INR (Indian Rupees).
So, how about I enrich this by getting the latest exchange rate and have the total sales in USD (US Dollar)?
A little contrived, but I think it makes sense in the context of sales reports.
To do this, I downloaded a really small module called currency-exchange-rates
(https://www.npmjs.com/package/currencies-exchange-rates) and attempted using it, but even before I could check if the module’s currency conversion still works, I kept running into compilation issues. A couple of things I discovered while debugging:
Dataform only uses CommonJS.
Overcoming this hurdle through bundlers is still a no go, as there seems to be a strict subset of modules that are allowed in the end.
Going down the rabbit hole
What struck me as odd is I tried it with the module mentioned in the documentation (postoffice
, for which the source doesn’t seem to exist on Github) but even then it resulted in compilation error. Iirc, the error was a message which said it failed to find node:url. This was interesting, if this is the inbuilt url
package and Dataform complains with compilation errors, then I have to ask if dependencies are severely restricted in some way. I half-confirmed this hypothesis by installing is-buffer
library which has no dependencies and no compilation errors surfaced.
Needing more information, I decided to check things locally using dataform compile
first. I tried bundling everything into bundle.js file wondering if that might help, but I kept running into VMError
.
Browsing the source, what I can gather is that Dataform uses vm2 (https://github.com/patriksimek/vm2) as a way to execute untrusted Javascript code. This means code and modules are executed in an isolated context, and the support for even for built in modules maybe restricted, let alone support for any module from npm. This constraint also explains the need to use CommonJS because vm2 is built on top of node vm (https://nodejs.org/api/vm.html).
So it would seem the only way to avoid the problem is.. by not using most modules from npm in the core code. Which is where I stand right now. All the modules I’ve used in the Typescript repo so far are for testing, lint, building etc.
Honestly, this one’s a bummer. I do hope I’m in the wrong here and I just misunderstood the documentation somehow. Or if someone can suggest a workaround, that would be awesome. Technically I guess you could this by stitching together a pipeline with Workflows, Cloud Functions and so on.
I really hope Dataform can do this in the future. In my opinion, there is a strong case for working with node modules and enriching workflows directly in Dataform. But for now, we soldier on.
Global scope
includes/
folder can have Javascript files with constants and utility functions (https://cloud.google.com/dataform/docs/reuse-code-includes), but using them with module.exports
as mentioned in the docs makes them available in the global scope.
In other words: Using them from other files does not strictly need a require()
. Referencing the file name allows you to use the exported constant or function. You can see an example of this here: https://github.com/GoogleCloudPlatform/bigquery-utils/blob/09082672f8a35037f79a54d166cf17fca7792c6d/dataform/examples/dataform_assertion_unit_test/definitions/tests/test_date_assertions.js
I find this a bit confusing. I would rather use the require()
instead to reference modules that have Javascript code. Turns out this is works just as a regular require()
where we specify the path of the module.
This is both interesting and important to keep in mind.
Folder structure
Dataform has a strict folder structure requirement. It needs definitions/
, and after that includes/
for any utility functions.
And now this one’s one me, but I don’t like my typescript code’s folder structure being coerced like this, I want to have my tables in a tables
folder, my ETL in a transformations
folder, my tests in a tests
folder and so on. Which means I need to map my folder structure to what Dataform wants and modify import paths if necessary to have it respect Dataform’s requirements in the final .js
files.
Any good build tool and some scripts should be able take care of this one.
Path referencing and manipulation
Now we get to the glue that makes all of this work.
To smoothly interoperate between Typescript land and compiled Javascript, we can use tsconfig.json
. Depending on what you are trying to do, there are a few options available:
paths
config intsconfig.json
to remap imports to the correct location. I used this to remap all paths with a prefix of@includes/
to resolve to modules in theincludes/
directory.Use
declare
to declare the variable or function to the Typescript compiler. This is useful, for example, to make functions accessible globally with any imports. It can be used when path manipulation is tricky.And lastly, you have
typeRoots
which is useful if you want to add more organization to your code or you have more than one type definition file. For example, you could create a@types
folder for your type definitions and addtypeRoots
intsconfig.json
with the folders that are interesting to you.
I used paths
configuration to help with the module resolution. You could use a mix of the above to varying effect.
The build tool I used is rollup (https://rollupjs.org/guide/en/) with the replace()
plugin to transform the @includes
paths into ./includes
and wrote a tiny script to move the files I want from the build output into a structure Dataform finds acceptable. Depending on how you structure your project, this can technically mean, that the files have (or may have) incorrect paths in the build directory. For the final code in this repository, this is infact what happens. It can be corrected, but my reasoning was that I’m never going to use the artifacts in the build/
directory apart from being a temporary location (and I got lazy) to copy over from into Dataform’s folder structure so I cheated a little and let it be.
This is the most important section in this experiment. Depending on how you organize your project, your tsconfig.json
and rollup.config.json
will be the files that need tweaking.
Building the code
With most of the heavy lifting being done with tsconfig.json
, rollup.config.json
, the code can be built as a regular Typescript project.
You could also chain dataform compile
to this and check locally if the compilation is valid! This way a lot of the work is shift left, we spend less time going online and checking if things work in the Dataform Workspace code editor.
Now coming to the original question of how do you ensure BigQuery tables created have the same configuration. As a first step, we can create a “BigQuery Incremental Table Config” and ensure that it has the properties that you want (partition, clustering spec etc etc) and use this in the publish()
method.
Next, we could even think of enforcing rules for publish()
methods, for “daily sales data”, to get called only with this incremental table config. I certainly think this can tweaked further to harden the transformation logic and make it resilient to regressions or errors.
The key take away is that we now have programmatic way of checking what we are doing with our data transforms.
Putting it all together
The source code as mentioned is here: https://github.com/raghuveer-s/combine-dataform-with-typescript/tree/main.
The end result is we have our Dataform workflows in full typed code with unit tests, which in my opinion does make it much more maintainable. It’s a proof of concept and is a bit rough around the edges, but I intend to use it as a base and experiment with it in future projects. Let me know if it was useful!
Subscribe to my newsletter
Read articles from Raghuveer Sriraman directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by

Raghuveer Sriraman
Raghuveer Sriraman
Working mostly on Data Engineering, Machine Learning and Bayesian analytics.