Cloud Run with Pub/Sub: Key Lessons

Project Context

I’ve been working on a system built with a microservices architecture that uses Google Cloud Pub/Sub and Cloud Run to manage notifications and keep a continuous NLP analysis flow running. Each microservice has a specific role, and Pub/Sub carries the messages between them, so every layer of the system gets what it needs when it needs it.

What Happened

Everything was going fine until I started configuring the NLP microservice on Cloud Run. That’s when I ran into a frustrating issue: the microservice instance would go inactive whenever it wasn’t receiving HTTP requests, which meant it couldn’t receive Pub/Sub notifications. By default, Cloud Run only gives an instance CPU while it is handling a request and scales idle instances down to zero, so the message flow kept getting interrupted and processing fell behind.

At first, I thought the issue might be related to resource usage, as the NLP analysis process is pretty demanding in terms of CPU and memory. I figured some internal process might be blocking cores or draining resources, so I tried larger instances and more advanced configurations. But no matter what I did, the problem didn’t go away.

I spent hours combing through logs and monitoring every deployment, trying to figure out what was going wrong. I ended up doing 32 deployments, with around 75% of them involving code changes. Each time I checked the logs and saw the service still wasn’t working, it felt like I was going in circles.

Finally, after all that trial and error, I realized the problem wasn’t in my code or the resources I had allocated. It was something much simpler (and more frustrating): Cloud Run’s default behavior. Without incoming HTTP requests, instances are automatically suspended. The root cause was baked into the very nature of the service.

Diagnosis and Technical Explanation

After extensively reviewing the logs and the instance behavior, I confirmed that Cloud Run was idling the instance whenever there were no incoming HTTP requests: by default, CPU is only allocated while a request is being handled, and idle instances are eventually scaled down to zero. This is designed to optimize resource usage and cost, which is beneficial in many cases, but for this particular service it was interrupting the Pub/Sub message flow.

The Pub/Sub configuration included automatic acknowledgment (ack) and message republishing in case of failure, which helped temporarily but didn’t solve the underlying issue: if the Cloud Run instance wasn’t active, it simply couldn’t receive notifications.

The Solution

To resolve this issue, I made the following adjustments:

Configuration on Cloud Run:

• To stop the NLP instance from going idle when no HTTP requests arrive, I changed two Cloud Run settings so that it stays active and able to receive Pub/Sub messages at all times.

• Enable “CPU is always allocated”: this keeps CPU allocated to the instance even while it is not handling an HTTP request.

• Set a minimum number of instances: with a minimum of at least one, Cloud Run always keeps an instance running instead of scaling down to zero.

With this configuration, the service stays active, at the cost of paying for an instance that is always on.
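For anyone who prefers the command line to the console, the two settings above map to the `--no-cpu-throttling` and `--min-instances` flags of `gcloud run services update`. Here is a minimal Python sketch that only builds and prints that command so it can be reviewed before running it. The service name `nlp-service` and the region `us-central1` are placeholders I made up, not the real values from my project.

```python
import shlex
from typing import List


def keep_warm_command(service: str, region: str, min_instances: int = 1) -> List[str]:
    """Build a `gcloud run services update` command that keeps a service warm."""
    if min_instances < 1:
        raise ValueError("min_instances must be at least 1 to avoid scaling to zero")
    return [
        "gcloud", "run", "services", "update", service,
        "--region", region,
        "--no-cpu-throttling",                   # "CPU is always allocated"
        "--min-instances", str(min_instances),   # never scale below this count
    ]


if __name__ == "__main__":
    # Hypothetical service name and region; print the command instead of running it.
    print(shlex.join(keep_warm_command("nlp-service", "us-central1")))
```

Printing instead of executing is deliberate: the command changes billing, so it is worth reading once before pasting it into a terminal.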

Adjustments in Pub/Sub:

• I kept the configuration in which each message is acknowledged (ack) automatically and republished if processing fails, so that transient errors don’t cause lost messages. In my particular case, processing takes more than 10 minutes, and the longest acknowledgment deadline Pub/Sub allows is 600 seconds (10 minutes), so holding the ack until the work finished was not feasible.
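The ack-and-republish idea can be sketched in a few lines of Python. This is a simplified illustration of the pattern, not my production code: `ack`, `process` and `republish` are hypothetical callables standing in for the real Pub/Sub client calls and the NLP job.

```python
from typing import Callable


def handle_message(
    payload: bytes,
    ack: Callable[[], None],
    process: Callable[[bytes], None],
    republish: Callable[[bytes], None],
) -> bool:
    """Ack first, then run the long job; republish the payload if it fails.

    Returns True when processing succeeded, False when the payload was republished.
    """
    # Ack immediately so the long-running job is not bound by the ack deadline.
    ack()
    try:
        process(payload)
    except Exception:
        # The delivery is already acked, so Pub/Sub will not retry it on its own;
        # putting the payload back on the topic is the only retry path.
        republish(payload)
        return False
    return True
```

The trade-off is worth keeping in mind: once the message is acked, a crash in the middle of processing (as opposed to a caught exception) means nothing republishes it, so this pattern only protects against failures the code gets a chance to handle.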

These changes ensured that the NLP service remained continuously available, and Pub/Sub notifications began arriving without interruptions, allowing for a constant, delay-free processing flow.

Results

After applying this solution, the results were clear:

• Improved Availability: The NLP service remains active and available to receive Pub/Sub messages without interruptions.

• Increased Efficiency in Message Flow: The configuration enables a continuous message flow, eliminating processing delays and enhancing overall system stability.

Summary: Key Takeaways from the Lesson Learned

• In cloud-based microservice systems, it’s essential to review inactivity settings to prevent critical instances from being automatically suspended.

• Configure automatic acknowledgment in Pub/Sub and set retries to handle failures, but ensure the instance remains active if it needs to receive messages continuously.

• Cloud Run’s automatic suspensions can be useful in some projects, but it’s important to adjust this behavior for services that require constant availability.

Reflections and Next Steps

This experience gave me a better understanding of how inactivity settings work in Cloud Run and how they can impact services that rely on continuous availability. In the future, I plan to implement observability tools to monitor service activity and automatically adjust configurations in case of prolonged inactivity periods.

Bonus Track!!

While writing this post, I went to capture screenshots of the instances and discovered a new solution: I was able to set up a Pub/Sub trigger that automatically activates my microservice when something is published to the topic! The trigger delivers each message to the service as an HTTP request, and an incoming HTTP request is exactly what Cloud Run needs to start an instance. With this, I can go back to zero instances, and each time a Pub/Sub notification arrives, the service wakes up on its own. It’s like an “alarm clock” that wakes up the instance just in time to process the message. What a discovery! But hey, when you don’t know, you end up doing silly things. Now I have an efficient solution, even though it took me a while to figure it out.
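For the curious, this is roughly what the receiving side of a Pub/Sub push delivery looks like, using only the Python standard library. Pub/Sub sends a POST whose JSON body wraps the message, with the payload base64-encoded in `message.data`, and treats a success status as the ack. Take it as a minimal sketch under assumptions: `analyze` is a hypothetical stand-in for the NLP step, and the handler is not the code my service actually runs.

```python
import base64
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def decode_push_envelope(body: bytes) -> str:
    """Extract the UTF-8 payload from a Pub/Sub push request body."""
    envelope = json.loads(body)
    message = envelope["message"]
    return base64.b64decode(message.get("data", "")).decode("utf-8")


def analyze(text: str) -> None:
    """Hypothetical stand-in for the NLP step."""
    print(f"analyzing {len(text)} characters")


class PushHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        try:
            text = decode_push_envelope(self.rfile.read(length))
        except (ValueError, KeyError, TypeError, AttributeError):
            # Non-success status: Pub/Sub treats the delivery as failed and retries it.
            self.send_response(400)
            self.end_headers()
            return
        analyze(text)
        # A success status such as 204 acknowledges the message.
        self.send_response(204)
        self.end_headers()


def serve() -> None:
    """Listen on the port Cloud Run provides in PORT (8080 if unset)."""
    port = int(os.environ.get("PORT", "8080"))
    HTTPServer(("", port), PushHandler).serve_forever()
```

Calling `serve()` starts the listener. One caveat for a job as long as mine: a push delivery still has to be answered within the subscription’s ack deadline, so work that runs past it would be redelivered unless the handler responds first and the heavy lifting happens elsewhere.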

Finally, time for a beer! 🍻 See you in the next lesson learned.

Lesson Learned #01: Cloud Run with Cloud Pub/Sub