Taming the Multi-Headed Beast: Maintaining SDKs in Production for Years

Metis

Managing an SDK in production over several years is like taming a multi-headed beast. Each aspect of the development process demands its own attention and precision. Balancing the evolving needs of a diverse user base, adapting to the ever-changing technology landscape, and ensuring consistent, high-quality performance across various environments presents a complex array of challenges. This article distills the key lessons and strategies we've honed over years of maintaining and evolving our SDKs in a dynamic and demanding technological ecosystem.

What Our SDKs Do

Over the years, we maintained SDKs to ease the integration of the user’s application with the Metis platform. The prevention part of Metis lets users identify slow queries and other database issues right when they are developing their applications. To do that, the user’s application needs to notify Metis about the APIs that are called, the SQL statements that are executed, and the performance of the database. We wanted to achieve the following:

  • Ease of use - users should be able to integrate with Metis in a well-known way. Ideally, with a command that they can copy from the documentation and just run.

  • One for all and all for one - we want the onboarding with Metis to be a one-time action. Once one of the team members integrates the application with Metis, all the other team members should be onboarded automatically. Specifically, we wanted to minimize (or remove entirely) the need to configure each user’s environment separately.

  • No code modifications - we wanted to minimize the amount of code changes needed in the user’s application. Ideally, the integration should be done completely outside of the user’s codebase.

  • No external dependencies - we wanted to minimize the number of things that “need to be installed”. We wanted to avoid the need to install heavy agents, extensions, or manual configuration of the application or the database.

We made the following assumptions about where our SDKs are going to be used:

  • Web applications - we initially wanted to focus on web applications that expose some REST API. However, we shouldn’t limit ourselves to this context only. Users should be able to integrate their CLI tools, desktop applications, or even IoT devices with Metis if needed.

  • Databases - we needed a way to extract the execution plans from the database. We understood that we might have to generate these plans ourselves when the database doesn’t provide them.

  • Modern standards - we assume users’ applications use modern technologies that are common in the industry. However, we shouldn’t prevent older technologies from integrating with Metis.

Conceptually, our SDKs need to achieve the following:

  • They need to let Metis know that such and such interaction happened. For instance, API X was called.

  • They need to provide all the SQL queries that were executed by the database. Various data access technologies (drivers, ORMs, query builders) achieve this differently. They generate queries using different techniques (parameters, CTEs, window functions, prepared statements, stored procedures) and may even run a different number of queries for the same operation (like many statements for joined tables).

  • They need to provide execution plans explaining what happened in the database. Those plans can be generated automatically or on demand.

In short, our SDK had to capture the API interaction “somehow”, get the SQL query “somehow”, correlate these two, and provide the execution plan explaining what happened in the database. Over the years we explored three different approaches. Let’s see them one by one.

First Approach - SDK Per Language/ORM/Web Using OpenTelemetry

Our first approach was to implement a separate SDK for each independent technology stack. We decided to use OpenTelemetry which is widely adopted and covers most of our needs.

The idea was to write an SDK that instruments the web server, SQL driver, and ORM. By instrumenting the web server, we could obtain the details of the REST call. By instrumenting the SQL driver and the ORM, we could get the query that was executed and run it again with the EXPLAIN keyword to get the execution plan.
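As a rough illustration of the "capture the statement, then run it again with EXPLAIN" idea, here is a minimal sketch. It is not our SDK code: `ExplainingCursor` is a hypothetical wrapper, and it uses Python's built-in `sqlite3` module (whose plan keyword is `EXPLAIN QUERY PLAN`) only so that the example runs without a database server. Against PostgreSQL or MySQL the prefix would be `EXPLAIN`.

```python
import sqlite3


class ExplainingCursor:
    """Hypothetical wrapper that records each statement together with its plan."""

    def __init__(self, connection, explain_prefix="EXPLAIN QUERY PLAN "):
        self._connection = connection
        self._explain_prefix = explain_prefix
        self.captured = []  # list of (sql, params, plan_rows)

    def execute(self, sql, params=()):
        # Ask the database for the plan of the very same statement first...
        plan = self._connection.execute(self._explain_prefix + sql, params).fetchall()
        self.captured.append((sql, tuple(params), plan))
        # ...then run the statement itself and hand the result back to the caller.
        return self._connection.execute(sql, params)


# sqlite3 is a stand-in so the sketch is runnable anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice')")

cursor = ExplainingCursor(conn)
rows = cursor.execute("SELECT name FROM users WHERE id = ?", (1,)).fetchall()
sql, params, plan = cursor.captured[0]
```

Even this toy version hints at the maintenance cost: the wrapper has to match the exact cursor API of every driver and ORM it sits in front of.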


The user would then need to install the SDK (with their regular package manager) and call one method from the SDK to enable the instrumentation. They didn’t need to configure their database or install any additional agents. It was very neat from the user's perspective.

However, this approach posed multiple problems:

  • Troubles with reusing the implementation - we had to write custom instrumentation for nearly all web servers, SQL drivers, and ORMs. Languages often rely on only a couple of widely used libraries (like web servers or SQL drivers), but we still had to instrument each library separately.

  • Differences between versions of dependencies - every time the library we instrument (like the web server or SQL driver) changed versions, we had to make sure our code still worked. This was getting even harder when the dependencies broke their compatibility and we effectively had to maintain SDKs for different versions of a given technology stack.

  • Lack of parameter values - SQL drivers and ORMs often integrate with OpenTelemetry. However, they don’t provide the same details. For instance, some libraries do not provide parameter values (they give SELECT * FROM table WHERE column = $1 without specifying what $1 is). Extracting the parameter values required us to dive deep into the ORM implementation and understand how to use the mechanisms the ORM provides.

  • Inability to correlate REST and SQL - sometimes it was impossible to correlate the SQL query with the REST call because they were executed between technologies with no link between them.

  • Problems with testing frameworks - web servers are often mocked during automated unit tests. We still wanted to capture the “calls” somehow and we had to support many testing libraries for that.

In short, the maintenance burden was enormous. While we could maintain a couple of SDKs this way, it wasn’t scalable and we had to look for something else.

Second Approach - Extracting Queries And Plans From The Database

We identified that the biggest challenge of the first approach was how to get the proper query and the execution plan. Instrumenting web servers wasn’t that hard (and wasn’t that crucial). We decided to switch gears from extracting details from the SQL driver and the ORM to getting these details from the database. To do that, we had to somehow correlate the queries and the REST calls. We wanted to do that by query stamping.

There are libraries like sqlcommenter that can automatically put comments on the SQL statement. We wanted to use it to turn SELECT * FROM table into SELECT * FROM table /* traceid-spanid */ which would clearly indicate what REST call caused this statement. We then needed to modify the database configuration to log all the queries and the plans.
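Query stamping can be sketched in a few lines. The helpers below are hypothetical and use a simplified comment layout (a single `traceparent` key holding the trace ID and span ID); they are not sqlcommenter's exact output format, only an illustration of how a trailing comment lets an agent map a logged query back to its trace.

```python
import re

# Simplified stamp layout assumed for this sketch: /* traceparent='<trace>-<span>' */
_STAMP = re.compile(r"/\*\s*traceparent='([0-9a-f]{32})-([0-9a-f]{16})'\s*\*/")


def stamp_query(sql, trace_id, span_id):
    """Append a comment carrying the trace and span IDs to the statement."""
    body = sql.rstrip().rstrip(";").rstrip()
    return f"{body} /* traceparent='{trace_id}-{span_id}' */"


def read_stamp(sql):
    """Return (trace_id, span_id) found in a logged statement, or None."""
    match = _STAMP.search(sql)
    return (match.group(1), match.group(2)) if match else None


stamped = stamp_query(
    "SELECT * FROM users;",
    "0af7651916cd43dd8448eb211c80319c",
    "b7ad6b7169203331",
)
```

The comment does not change how the database executes the statement, which is why the stamp survives into the query log.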

Having that, we could run an agent that would connect to the database, extract logs, identify queries, and send them to Metis. We still had to instrument the web server inside the application to send details of the REST calls. Ultimately, the user had to install our SDK and change the database configuration. It was less ideal than we wanted but still acceptable.

However, this approach also posed many challenges:

  • Database configuration - we had to modify the database configuration. This is very difficult with some database providers (like RDS) and may not even be possible. It is also not uniform between DB engines (like PostgreSQL and MySQL).

  • Ephemeral databases - it’s hard to modify the database configuration for databases that are short-lived and can’t be restarted easily. This is especially common in automated tests.

  • Difficult query stamping - not all ORMs and SQL drivers integrate with libraries like sqlcommenter. For those cases, we still had to instrument these libraries manually. Sometimes it was impossible.

  • Increased costs - changing the database to log all queries increases the hosting cost. We didn’t want to incur additional costs for our users.

While this solution generalized quite well to many technologies, we still had to maintain many SDKs. We looked for another solution.


Third Approach - Moving The Ownership

We realized that we couldn’t provide a solution for each language separately. Instead, we decided to move the ownership of the integration onto the frameworks and use whatever they support.

We decided to use automatic instrumentation from OpenTelemetry. We don’t need to modify the application code at all in many cases. It also covers most of the popular libraries. We instruct the user to enable the instrumentation and deliver traces to our OpenTelemetry collector.

The user needs to run our OpenTelemetry collector locally. The collector is language agnostic and just receives the traces. It then extracts the SQL statement and queries the database to get the plan. Because it uses the EXPLAIN keyword, there is no need to reconfigure the database.
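The collector's job can be pictured roughly as follows. This is a sketch, not the collector's actual code: spans are modelled as plain dictionaries, `correlate` and `explain_statement` are hypothetical names, and we assume the instrumentation reports the route and the SQL text under the OpenTelemetry semantic-convention attributes `http.route` and `db.statement` (or `db.query.text` in newer versions of the conventions). The `EXPLAIN (FORMAT JSON)` form is PostgreSQL syntax.

```python
from collections import defaultdict


def correlate(spans):
    """Group the SQL statements of each trace under the HTTP route that caused them."""
    by_trace = defaultdict(lambda: {"route": None, "statements": []})
    for span in spans:
        entry = by_trace[span["trace_id"]]
        attrs = span.get("attributes", {})
        if "http.route" in attrs:
            entry["route"] = attrs["http.route"]
        statement = attrs.get("db.statement") or attrs.get("db.query.text")
        if statement:
            entry["statements"].append(statement)
    return dict(by_trace)


def explain_statement(statement):
    """Build the query the collector would send to the database to get the plan."""
    return "EXPLAIN (FORMAT JSON) " + statement.rstrip().rstrip(";")


spans = [
    {"trace_id": "t1", "attributes": {"http.route": "/users/{id}"}},
    {"trace_id": "t1", "attributes": {"db.statement": "SELECT * FROM users WHERE id = 1;"}},
]
traces = correlate(spans)
plan_query = explain_statement(traces["t1"]["statements"][0])
```

Because both spans share a trace ID, the correlation between the REST call and the query comes for free whenever the instrumentation propagates the trace context.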

In the best scenario, the user doesn’t need to change their application at all. They just need to run a Docker container alongside their application (which is easy to do manually or with Testcontainers) and enable the instrumentation with a couple of environment variables.

However, there are again some issues:

  • Enabling instrumentation in code - some languages do not support enabling the auto-instrumentation from outside of the process. In that case, the user needs to enable it manually in the code. However, the code changes are now maintained by OpenTelemetry. We don’t need to maintain it for various languages or libraries.

  • Instrumenting libraries - not all libraries support auto-instrumentation. The user can either raise a feature request for the library, or they can instrument the library manually. The latter is often possible and straightforward. Again, the code for the instrumentation is owned by the library owner.

  • Extracting parameter values - some libraries do not provide parameter values when integrating with OpenTelemetry. In these cases, the user needs to extract those parameters manually and reconstruct the query. Many ORMs or SQL drivers expose callbacks for logging the details and this can be used to reconstruct the original query.

  • Correlating queries with REST calls - sometimes the query can’t be easily correlated. The user can either stamp the query or disregard this feature.
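Reconstructing a query from a parameterized statement, as mentioned in the list above, might look like the sketch below. The helper names are hypothetical, and the approach is deliberately naive: it assumes PostgreSQL-style `$1`, `$2` placeholders and would also rewrite a `$1` that happens to appear inside a string literal. The output is meant for diagnostics such as requesting a plan, not as a general way to build SQL from untrusted input.

```python
import re


def render_literal(value):
    """Render a Python value as a SQL literal (diagnostic use only)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    # Double any single quotes so the literal stays well-formed.
    return "'" + str(value).replace("'", "''") + "'"


def inline_parameters(sql, params):
    """Replace $1, $2, ... with the matching parameter values."""

    def replace(match):
        index = int(match.group(1)) - 1
        return render_literal(params[index])

    # Matching the whole number keeps $10 from being treated as $1 followed by 0.
    return re.sub(r"\$(\d+)", replace, sql)


query = inline_parameters(
    "SELECT * FROM users WHERE age > $1 AND name = $2",
    [42, "O'Brien"],
)
```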

In short, the happy path is covered with a couple of environment variables. The worst path requires adding around 30 lines of code to the application to enable the integration with the collector running locally. This works for automated tests as well.

The biggest benefit of this approach is the integration with modern standards and effectively no SDK at all. The user can even call our collector with curl if they want to. We are not limited by the libraries, languages, technologies, or hosting environments.

We went a long way to get here. Let’s see what we learned.

Key Challenges in Long-Term SDK Maintenance

There are many challenges when it comes to maintaining the SDK for a longer time.

The first one is the challenge of uniform functionality. Achieving feature parity across languages like Python, Java, and JavaScript is a formidable task. This challenge involves ensuring that dynamic typing in Python aligns with statically typed languages like Java. It requires a deep dive into polymorphism, encapsulation, and concurrency models of each language to maintain a uniform API behavior and functionality, regardless of the underlying language differences. The challenge is to find the right balance between elegant features provided by some languages and technical limitations coming from others.

The second challenge is the challenge of evolution and managing versioning across ecosystems. Navigating the SDK through the ever-changing landscape of language updates is intricate. This includes adapting to Python's rapid release cycle and new features, while simultaneously keeping up with Java's long-term support (LTS) releases and JavaScript's evolving ECMAScript standards. The key lies in mastering semantic versioning, understanding the nuances of backward compatibility, and designing a robust strategy for deprecation policies. Just like with the uniform functionality, the more features of the platform we use, the harder it is to support the SDK reliably.

Last but not least is the challenge of polyglot proficiency and navigating language diversity. Balancing the strengths and limitations of each language in a polyglot environment is complex. This entails not only understanding Python's idiomatic constructs but also grasping Java's JVM intricacies and JavaScript's single-threaded event loop. The challenge is to develop a cross-language architectural vision, where concepts like Python’s GIL (Global Interpreter Lock), JavaScript’s async-await, or Go’s goroutines are well-understood and correctly applied in a multi-language ecosystem.

Let’s now dive deep into these challenges.

The Challenge of Uniform Functionality

In SDK development, maintaining consistent functionality across various programming languages can be a complex task. Issues manifest themselves in many areas, one of which is the choice of communication protocol. Over the years, there have been many approaches to communication. Among the better-known ones are Remote Procedure Calls (RPCs), with widely adopted solutions like Network File System, CORBA, or RMI. Another approach is HTTP, which focuses on stateless communication and is widely used in REST. Some modern communication methods are often just reiterations of the same concepts. In our case, we migrated between technologies. Let’s see what we did and how.

Shifting from REST API to gRPC for Enhanced Integration

The problem is that our initial setup with multiple language SDKs communicating via a REST API server led to inconsistencies. The diverse ways languages handle HTTP requests and JSON parsing resulted in non-uniform behavior and data interpretation issues. While JSON seems to be very well standardized, there are many issues around escaping special characters, parsing the payload into data structures (especially in strongly typed languages), or dealing with circular structures.
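A few of these JSON pitfalls are easy to reproduce with Python's standard `json` module. The snippet is only an illustration of the kind of edge cases involved, not a record of the specific bugs we hit: the same text can be escaped in two valid ways, circular structures cannot be serialized at all, and integers above 2^53 cannot be represented exactly by parsers that read every number as a double.

```python
import json

# 1. Escaping: both outputs are valid JSON for the same string,
#    so clients must not compare payloads byte by byte.
escaped = json.dumps("zażółć")
raw = json.dumps("zażółć", ensure_ascii=False)

# 2. Circular structures: the serializer refuses them instead of producing JSON.
node = {"name": "root"}
node["self"] = node
try:
    json.dumps(node)
    circular_error = None
except ValueError as error:
    circular_error = str(error)

# 3. Numbers: Python keeps big integers exact, but a parser that stores
#    every number as a 64-bit float loses the last digit.
big = 2**53 + 1
as_double = int(float(json.dumps(big)))
```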

We solved the issue by transitioning to gRPC, which uses Protocol Buffers (protobuf) for defining service methods and message types. This change allowed us to generate consistent client and server code for all our language SDKs, ensuring uniform communication. gRPC lets us define the API once and use it uniformly between all the languages. It’s conceptually very similar to Interface Description Language in CORBA and we can use this approach to generate clients automatically. Most importantly, the maintenance of the API is now offloaded to external tools and we don’t need to deal with inconsistencies.

Could we achieve the same with HTTP and JSON? To some extent yes. There are solutions for generating clients based on the JSON API definition. However, these solutions are independent and are not standardized. There is no single solution that we could use the same way in all the languages.

Implementation Highlights:

Protobuf Utilization: We defined our data structures and service interfaces using protobuf. This unified data modeling across languages, such as Python and Java, ensures consistent serialization and deserialization.

gRPC Advantages: gRPC provided a more efficient, strongly-typed, and streamlined method for server communication compared to REST. This was crucial for achieving functionally equivalent and efficient communications across our SDKs.

Impact and Adaptation: The switch to gRPC revolutionized our approach to cross-language SDK communication. It brought in greater consistency and efficiency, addressing the core challenge of uniform functionality.

Consistency Across SDKs: The protobuf-generated code ensured uniform data handling and functional calls, regardless of the language.

Efficiency and Strong Typing: Protobuf offered a more compact and faster data handling method than JSON, and its strong typing system reduced errors and inconsistencies.

Conclusion

The move to gRPC and Protocol Buffers marked a significant step in overcoming the challenge of uniform functionality in our multi-language SDK environment. This strategic decision not only streamlined our communication protocols but also laid a strong foundation for future development.

Using a Well-Known Protocol

Defining a custom protocol gives us the freedom to include anything we need. However, it also requires the users to learn the protocol, or requires us to build libraries that deal with it. Using a well-known protocol makes adoption much easier and broader, although some features may not be available or may be hard to use.

The default approach should be to go with the industry standard. The benefits outweigh the drawbacks. Also, the standard can be adopted by other users in the industry, so even if there is no library for supporting the standard today, odds are there will be one soon. We won’t need to maintain it or develop it.

We decided to use OpenTelemetry, but there is no technical requirement for our users to use it. They can send the data using any tool as long as they adhere to the payload structure. Since it’s just JSON, the users can even send it with curl if they want. Most of the time they don’t need to do it manually because there are libraries for doing that already.
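To make "it's just JSON" concrete, the sketch below builds a single-span payload by hand and prepares the HTTP request for it, without sending anything. It assumes the collector accepts the standard OTLP/HTTP JSON encoding on the default local endpoint `http://localhost:4318/v1/traces`; the helper names, the `manual-example` scope name, and the span contents are made up for the illustration.

```python
import json
import secrets
import time
import urllib.request


def build_trace_payload(service_name, span_name, sql):
    """Build a minimal OTLP/JSON trace payload containing one client span."""
    now = time.time_ns()
    span = {
        "traceId": secrets.token_hex(16),  # 32 hex characters
        "spanId": secrets.token_hex(8),  # 16 hex characters
        "name": span_name,
        "kind": 3,  # SPAN_KIND_CLIENT
        "startTimeUnixNano": str(now),
        "endTimeUnixNano": str(now + 1_000_000),
        "attributes": [{"key": "db.statement", "value": {"stringValue": sql}}],
    }
    resource = {
        "attributes": [{"key": "service.name", "value": {"stringValue": service_name}}]
    }
    scope_spans = [{"scope": {"name": "manual-example"}, "spans": [span]}]
    return {"resourceSpans": [{"resource": resource, "scopeSpans": scope_spans}]}


def build_request(payload, endpoint="http://localhost:4318/v1/traces"):
    """Prepare (but do not send) the POST request carrying the payload."""
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = build_trace_payload("orders-api", "SELECT users", "SELECT * FROM users")
request = build_request(payload)
```

Passing `request` to `urllib.request.urlopen` would deliver the trace, which is the same thing a curl call with a JSON body does.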

The Challenge of Evolution: Managing Versioning Across Ecosystems

Effective version management in multi-language SDK development is a crucial aspect of maintaining long-term viability and compatibility. This is especially important when we introduce new features that are not supported by all our SDKs yet, or when we use features from external dependencies (like OpenTelemetry) that are not available in all the languages. SDKs need to evolve independently, and yet they need to stay consistent so that they don't surprise the users.

We incorporated many strategies to deal with the versioning.

Rigorous Testing in Isolated Environments

Ensuring compatibility across different language versions presents a major challenge, especially when languages update independently and at their own pace. To address this, we implemented isolated testing using Docker and Testcontainers. This approach allowed us to simulate diverse environments and language versions, ensuring our SDK's compatibility with each update.

It’s crucial to have reproducibility and immutability when testing the solutions. SDKs can’t rely on things that we can’t control like the local environment or a particular region of the globe. While Docker doesn’t provide full immutability (as the containers are stateful and we can’t guarantee that dependencies do not change), it is good enough to share the solution between platforms and environments. If it works locally, it’s very likely that it will work somewhere else. We could have used NixOS or Nix Package Manager to achieve even higher immutability.

Another great aspect of containers is that they can be easily used with Testcontainers. The library enables us to run our tests in lightweight, throwaway instances of common databases and services, ensuring a comprehensive testing environment. This makes issues much easier to reproduce: we create the data sources once and share them across the team, and the testing framework updates and maintains them automatically.

Organized Continuous Integration (CI)

Testing SDKs is tedious and time-consuming. We can’t do it manually. Instead, we need to have the testing solution automated. Testing across all supported language versions was essential to prevent integration issues.

We established a robust CI system that automatically tested our SDK against all supported language versions upon each code commit. This ensured early detection and resolution of compatibility issues.

However, just running the CI pipeline is not enough. All the pipelines must test features uniformly and report issues consistently. If we identified a bug in one scenario in one particular language, odds are the same scenario would fail in other languages as well. We solved that by maintaining a uniform list of test cases that we run in each technology. If there was a bug in one of them, we just added the test case to all the technologies to cover it everywhere.
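One way to picture a uniform list of test cases is a language-neutral catalogue that every SDK's test runner consumes. The sketch below is hypothetical: the case IDs, the fields, and the `capture` callback are made up for the illustration, and in practice such a catalogue would more likely live in a shared JSON or YAML file than in Python code.

```python
# Hypothetical shared catalogue: each SDK's test suite runs every entry.
SHARED_CASES = [
    {"id": "select-with-parameter", "sql": "SELECT * FROM users WHERE id = $1", "params": [1]},
    {"id": "join-two-tables", "sql": "SELECT * FROM a JOIN b ON a.id = b.a_id", "params": []},
    # A bug found in one language is added here once and then covered everywhere.
    {"id": "quote-in-parameter", "sql": "SELECT * FROM users WHERE name = $1", "params": ["O'Brien"]},
]


def run_shared_cases(cases, capture):
    """Run every case through a language-specific capture function.

    `capture(sql, params)` is expected to return a dict with the captured
    "sql" and a non-empty "plan"; the IDs of failing cases are returned.
    """
    failures = []
    for case in cases:
        result = capture(case["sql"], case["params"])
        if result.get("sql") != case["sql"] or not result.get("plan"):
            failures.append(case["id"])
    return failures


def fake_capture(sql, params):
    """Stand-in for a real SDK under test; it drops the plan for joins."""
    plan = [] if "JOIN" in sql else ["Seq Scan"]
    return {"sql": sql, "plan": plan}


failed = run_shared_cases(SHARED_CASES, fake_capture)
```

Reporting failures by case ID keeps the results comparable between pipelines, so the same ID failing in several languages points at one shared problem.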

Our CI pipeline included tests against multiple versions of each language, ensuring broad compatibility. Every update underwent rigorous automated testing, reinforcing the reliability of our SDK across versions.

In-App Versioning Management

Managing dependencies and versions within the SDK itself was complex, given the diverse language ecosystems involved. To solve that, we utilized tools like Lerna for JavaScript and Poetry for Python to manage our version dependencies effectively across different parts of our SDKs. Lerna helped us manage multiple JavaScript packages within a single repository, streamlining version control and dependency management. Poetry handled dependency resolution and packaging, ensuring consistent versioning across our Python-based SDK components.

Using industry-wide tools and best practices helped us simplify the versioning.

Conclusion

Navigating the challenge of evolving SDKs across diverse programming ecosystems demanded a multi-faceted approach. By employing rigorous testing in isolated environments, maintaining an organized CI process, and implementing sophisticated in-app versioning management tools, we effectively managed versioning complexities, ensuring our SDK's adaptability and robustness in a dynamic technological landscape.

The Challenge of Polyglot Proficiency: Navigating Language Diversity

In multi-language SDK development, mastering language diversity involves not only technical expertise but also a strategic approach to team organization and knowledge sharing. Our approach was to create a network of language expertise, where each team member specialized in one language while also being competent in another. This ensured both leadership and support across different languages.


To do that, each major language in our SDK (like Java, Python, C#, etc.) had a designated Champion. This individual was the primary expert and leader for development efforts in that language. Champions kept abreast of the latest developments in their language, guided the SDK's development in that language, and ensured best practices were followed.

Alongside being a Champion for one language, each expert also served as a supportive collaborator for another language. For example, the Java Champion might be a supportive expert in C#, providing a secondary layer of expertise. This structure ensured no single point of failure. It allowed for continuous progress and quick response in case of issues or bugs in the absence of the primary Champion.

Each Language Champion provided weekly updates to the entire team. These updates included new language features, changes, bug fixes, and any relevant insights. This practice kept the entire team informed and up-to-date, fostering a culture of continuous learning and adaptability.

Regular sessions were held where Champions and supporters shared insights, discussed challenges, and brainstormed solutions. These sessions not only helped in disseminating knowledge but also ensured that all team members were capable of contributing across different language domains.

Addressing the challenge of language diversity in our SDK development required a harmonious blend of individual expertise and collective learning. The cross-supportive structure among language experts, coupled with regular updates and knowledge-sharing sessions, created a dynamic environment where each language's nuances were respected and leveraged, significantly benefiting our SDK's evolution.

Conclusion: Harmonizing Complexity in Multi-Language SDK Development

In summary, the journey of maintaining an SDK over several years is a testament to the intricate balance of technical acumen, strategic planning, and collaborative team dynamics. Through addressing the Challenge of Uniform Functionality, we embraced the shift from REST to gRPC, harnessing the power of Protobuf for consistent and efficient communication across languages. In tackling the Challenge of Evolution, our rigorous approach to testing, continuous integration, and in-app versioning management ensured that our SDK stayed current and compatible across the ever-evolving landscape of programming languages. Finally, by confronting the Challenge of Polyglot Proficiency, we cultivated a cross-supportive network of language experts, fostering a culture of shared knowledge and adaptability.

This multi-faceted approach has not only enabled us to tame the multi-headed beast of SDK maintenance but has also positioned us to thrive in a dynamic technological ecosystem. Our experience underscores the importance of continuous evolution, proactive problem-solving, and the power of a unified team working towards a common goal. As we move forward, these lessons and strategies will continue to guide us in navigating the complexities of SDK development, ensuring that we deliver robust, versatile, and forward-thinking solutions to our diverse user base.
