GCP - Professional Cloud Architect - Study Guide & Notes

Table of contents
- Background
- Overview
- Section 2 - Google Cloud Regions and Zones
- Section 3 - Google Compute Engine
- Section 4 - Instance Groups and Load Balancing
- Section 5 - Google Cloud Compute Engine and Load Balancing for Architects
- Section 6 - Getting Started with Gcloud
- Section 7 - Getting Started with Google Cloud Platform Managed Services
- Section 8 - Getting Started with Google Cloud App Engine
- Section 9 - Google Cloud Kubernetes Engine for Cloud Architects
- Section 10 - Getting Started with Google Cloud Functions
- Section 11 - Getting Started with Google Cloud Run
- Section 12 - Exploring Google Cloud Functions 2nd generation
- Section 13 - Getting Started with Cloud KMS
- Section 14 - Exploring Block and File Storage
- Section 15 - Exploring Object Storage (Cloud Storage)
- Section 16 - Authentication in Google Cloud with Cloud IAM
- Section 17 - Exploring Databases in Google Cloud Platform
- Section 18 - Relational Databases for Transactional Applications in Google Cloud Platform
- Section 19 - NoSQL Databases in Google Cloud Platform
- Section 20 - Creating Private Networks with Cloud VPC in Google Cloud Platform
- Section 21 - Operations in Google Cloud Platform
- Section 22 - Exploring IAM and Project Organization in Google Cloud Platform
- Section 23 - Compute Engine Virtual Machines
- Section 24 - Asynchronous Communication in Google Cloud with Cloud Pub Sub
- Section 25 - Implementing Hybrid Cloud with Google Cloud
- Section 26 - Exploring Data Warehouses in Google Cloud - BigQuery
- Section 27 - Exploring Data Lifecycle and Data Architectures in Google Cloud
- Section 28 - Caching
- Section 29 - Agile, DevOps, SRE and SDLC Evolution
- Section 30 - Release Management in Google Cloud Platform
- Section 31 - Compliance and Regulations for Your Google Cloud Solutions
- Section 32 - Planning Cloud Migrations to Google Cloud Platform
- Section 34 - Exploring More Google Cloud Platform (GCP) Services
- Section 35 - Understanding Google Cloud Architecture Framework
- Section 36 - Case Studies - Solutions - Professional Cloud Architect
Background
I recently (Feb 2025) recertified as a Professional Cloud Architect, within Google Cloud Platform. I’ve been working professionally with both AWS and GCP for a few years now, and I enjoy keeping up to date with their expansive product offering, especially since my day-to-day usage at work really only revolves around a small set of services, such as GKE, Cloud Storage, BigQuery and the cross-cutting products such as IAM and KMS.
This post contains the notes I used to study for the exam.
Overview
Notes for the GCP Professional Cloud Architect: Google Cloud Certification course.
Section 2 - Google Cloud Regions and Zones
20+ regions - specific geographical location where resources are hosted
Using multiple regions => high availability, low latency, global footprint, govt regulations
Zones => help achieve H/A in the same region
Each region has at least three zones
Each zone has one or more clusters (physical infrastructure)
Zones within a region are connected with low latency links
It’s important to know which resources are zonal, which are regional and which are global; for example
Zonal: VMs, PDs, local SSDs, GPUs
Regional: regional MIGs, regional PDs, Cloud SQL, Filestore, Internal LBs
Global: Cloud CDN, Cloud Spanner, buckets (depending on config), global LBs, BigQuery, Pub/Sub, IAM
Section 3 - Google Compute Engine
Virtual machines = virtual servers in GCP
Google Compute Engine (GCE) - service to provision and manage virtual machines
Support for load balancing (either through discrete VMs that are created as back-end services with network endpoint groups, or instance groups, managed or unmanaged)
Supports auto-scaling for multiple VM instances, through MIGs (managed instance groups)
You can attach storage to VM instances (persistent disks - PDs, or local SSDs)
You can manage network connectivity (through VPC firewall rules) and configuration of VM instances
Machine families and machine types define the type of hardware that VMs run on
Families
General Purpose (E2, N2, N2D, N1): best price/performance ratio; web/application servers, small/medium databases, dev environments
Memory Optimized (M2, M1): large databases, in-memory analytics
Compute Optimized (C2): gaming applications and other compute-intensive workloads
Storage Optimized (Z3)
Types - define exactly how much CPU and memory is provisioned
For each family, there are a number of available types, for example: e2-standard-16 (16 vCPUs), m2-megamem, m2-ultramem, etc.
Memory and networking capabilities increase with CPU (this is why you only see the CPU value in the type name)
Image - defines which OS and software you want installed
Public and custom (yours) images
This will be defined as part of the boot disk, for which you can also select the type (HDD, SSD, etc.) and provisioned IOPS
You can also use snapshots and existing disks to define your boot disk
External IPs are publicly addressable (access can still be restricted through firewall rules)
Internal IPs are internal and private, can only be used inside the VPC network (not across networks, unless VPC peering is enabled); an instance’s internal IP address must come from the range of the subnet it resides in
All VM instances are assigned at least one internal IP address; External IPs are optional and can be opted into; stopping a VM instance leads to losing its IPs, unless using static IPs
Static IP addresses (option to Reserve Static IP Address) allow you to assign a constant IP address (internal or external) to a VM instance
Regional or global (you should create the IP address in the region where you have the VMs)
IPv4 or IPv6
Standard or premium network tier
Billed even when not in use (at a higher rate than when attached to a running VM)
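For reference, a minimal sketch of reserving static addresses with gcloud (names and regions are placeholders):
# Regional external IP (for a VM or a regional forwarding rule)
gcloud compute addresses create my-regional-ip --region=europe-west1
# Global external IP (for a global HTTPS LB)
gcloud compute addresses create my-global-ip --global
# Inspect the reserved address
gcloud compute addresses describe my-regional-ip --region=europe-west1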
Startup script
Allows bootstrapping of VMs (install OS patches, software, etc.)
Configured when creating a virtual machine (Management → Startup Script)
Takes time, and needs to be executed every time you start the VM => better alternative is to create a custom image, that has everything you need already installed
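A hedged example of passing a startup script at creation time (instance name, zone and the script contents are illustrative):
gcloud compute instances create my-vm \
  --zone=europe-west1-b \
  --machine-type=e2-standard-2 \
  --metadata=startup-script='#! /bin/bash
    apt-get update
    apt-get install -y nginx'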
Instance template
Describe a VM instance once and then use that template to create groups of identical instances
Define the machine type, image, labels, startup script, etc. in a reusable template
Used to create individual instances or instance groups (useful as back-ends to LBs)
Immutable, can’t be changed once created, but can be copied and a new version created
Can specify an image family instead of a specific image (in which case, the latest version is used)
Can be used across regions/zones
- Global instance templates can be used in any region. To lower the impact of outages outside your region and gain data residency within your region, use a regional instance template
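A minimal sketch of creating a (global) instance template; all names and values are placeholders:
gcloud compute instance-templates create my-template \
  --machine-type=e2-standard-2 \
  --image-family=debian-12 \
  --image-project=debian-cloud \
  --tags=http-server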
Custom image
Installing OS patches and software at the launch of an instance (through startup scripts) increases boot time
The alternative is to create a custom image with OS patches and software pre-installed
An image is a replica of a disk that contains the applications and operating system needed to start a VM. You can create custom images or use public images pre-configured with Linux or Windows OSes
An image is a bootable data disk that you can use to create virtual machine (VM) instances. It contains the operating system (OS), configured settings, and installed applications
Can be created from
an existing VM instance
a persistent boot disk
a snapshot (point in time extract of the data on a PD - stored in GCS)
another image
a file in Cloud Storage
Can be shared across Projects
You can deprecate old images and specify replacement image
You can create a corporate image that incorporates all your security standards (a hardened image)
You should prefer custom images to startup scripts
Images can be stored regionally or multi-regionally
It’s not recommended to create images from disks attached to a running instance; stop the instance first
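As a sketch, a hardened custom image could be created from a (stopped) instance's boot disk, published under an image family, and older versions deprecated; every name here is a placeholder:
gcloud compute images create my-hardened-image-v2 \
  --source-disk=my-vm-boot-disk \
  --source-disk-zone=europe-west1-b \
  --family=my-hardened-family \
  --storage-location=eu
# Mark the previous version as deprecated and point to the replacement
gcloud compute images deprecate my-hardened-image-v1 \
  --state=DEPRECATED \
  --replacement=my-hardened-image-v2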
Section 4 - Instance Groups and Load Balancing
Instance Group - a group of VM instances, managed as a single entity
Can be zonal or regional (each instance is still zonal, but there are multiple instances across multiple zones), but NOT global
Can be managed or unmanaged
Managed (MIGs)
Identical VMs created using an Instance Template (same image, type, config) - you MUST use an instance template
Features: auto scaling, auto healing, managed releases
Can be used as a back-end for an LB
Unmanaged
Can have different configurations (image, hardware)
Does not offer auto scaling/healing and other services
Not recommended, unless you really need different kinds of VMs
You can use an instance template to create individual VMs, if you want, or you can add different instances to the IG manually
Can be used as a back-end for an LB
MIGs
Can maintain a certain number of instances
Can detect application failures using health checks (self healing)
You need to make sure that the health checkers can reach the instances, by adding the source IP addresses to firewall rules
Health checks determine if applications on your VMs respond to requests. They’re used for load balancing and with autohealing in managed instance groups
Increase and decrease instances based on load (auto scaling) - you configure min, max instances and the metrics to track; can also define a cooldown period to prevent frequent scale-up/scale-down
Can add Load Balancer in front of them, to distribute load
Can create instances in multiple zones (regional MIGs)
Can release new application versions without downtime (rolling updates, or canary deployments)
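A rough sketch of creating a regional MIG from a template and enabling autoscaling and autohealing (the template, health check and thresholds are assumptions):
gcloud compute instance-groups managed create my-mig \
  --template=my-template \
  --size=3 \
  --region=europe-west1
gcloud compute instance-groups managed set-autoscaling my-mig \
  --region=europe-west1 \
  --min-num-replicas=3 \
  --max-num-replicas=10 \
  --target-cpu-utilization=0.7 \
  --cool-down-period=120
# Autohealing, using an existing health check
gcloud compute instance-groups managed update my-mig \
  --region=europe-west1 \
  --health-check=my-health-check \
  --initial-delay=300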
Updating a MIG
Rolling update - gradual update of instances to a new instance template
Optionally you can specify a template for canary testing
Specify when the update should happen (proactive = immediately, or opportunistic)
Maximum surge = maximum number of instances added at a point in time
Maximum unavailable = how many instances can be offline during the update
Rolling restart/replace
- No change in template, but you want to restart/replace
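A hedged example of a rolling update and a canary, assuming the MIG and templates sketched above:
# Proactive rolling update to a new template
gcloud compute instance-groups managed rolling-action start-update my-mig \
  --region=europe-west1 \
  --version=template=my-template-v2 \
  --type=proactive \
  --max-surge=3 \
  --max-unavailable=0
# Canary: move only a portion of instances to the new template
gcloud compute instance-groups managed rolling-action start-update my-mig \
  --region=europe-west1 \
  --version=template=my-template-v1 \
  --canary-version=template=my-template-v2,target-size=10%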
Cloud Load Balancing
Distributes traffic across instances of an application in a single region or multiple regions
Enables H/A, auto scaling, resiliency
Fully distributed, software defined, managed service
Important features
Health check - routes to healthy instances (this can be a different health check to the one configured on MIGs, but they are configured in the same place in the UI)
Auto scaling (through back-end services, such as MIGs)
Supports global load balancing with single anycast IP (for global LBs)
Supports internal load balancing, as well as external
HTTP(S) load balancer configs
Backend services - can be instance groups, buckets or network endpoint groups
You can have multiple backend services configured for a single HTTP(S) LB, and have routing rules to each
You can enable Cloud CDN to cache static content
You configure the health check (mandatory)
Each back-end service points to one or more back-ends; for example, a back-end service for Service A may have two back-end MIGs, each one in a different region
The load balancer uses a set of policies and algorithms to distribute traffic among the backends. It does not simply use round-robin distribution; instead, it considers factors like backend health, capacity, and proximity to the client.
For instance, if backend service 1 has two backends (one MIG in europe-west2 and one MIG in us-east1), the load balancer will typically route the request to the backend that is geographically closest to the client, provided that the backend is healthy and not overloaded. The exact selection criteria can include factors like latency, load, and health checks.
Host and path rules (routing rules)
Supports advanced rules, like URL redirect and URL rewrite, based on host, path, and headers
URL Maps - route requests to backend services or backend buckets (for example, map /service-a to Backend Service A and /service-b to Backend Service B )
Frontend configuration
Protocol and port
For SSL, a certificate must also be selected
SSL/TLS termination/offloading
Client to load balancer: HTTPS
Load balancer to VM instance: HTTP (inside GCP network, so safer) - reduce the load on the instances, they don’t need to worry about TLS/HTTPS
Choosing an LB: https://cloud.google.com/static/load-balancing/images/lb-product-tree.svg → note that ALBs are always proxy, whereas with NLBs you can decide on proxy vs pass-through
Features
Load balancer | Scope | External/Internal | Traffic | Proxy/Passthrough |
External HTTP(S) | Global | External | HTTP(S) | Proxy only |
Internal HTTP(S) | Regional | Internal | HTTP(S) | Proxy only |
SSL Proxy | Global | External | TCP with SSL offload | Proxy |
TCP Proxy | Global | External | TCP without SSL offload | Proxy |
External Network | Regional | External | TCP/UDP | Passthrough |
Internal TCP/UDP | Regional | Internal | TCP/UDP | Passthrough |
LB across MIGs in multiple regions
You need to use the Premium networking tier, for global routing to work (in standard tier, the forwarding rule and external IP are regional)
Use HTTPS LB for this use case
You create multiple Regional MIGs in different regions, in the same Project
HTTPS LB can distribute load to the MIGs behind a single external IP address - requests are redirected to the nearest region
If all backends within a region are unhealthy, traffic is distributed to healthy instances in other regions
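To tie these pieces together, a condensed sketch of wiring a global HTTPS LB in front of two regional MIGs; every resource name, and the pre-existing SSL certificate and reserved global IP, are assumptions:
gcloud compute health-checks create http my-hc --port=80 --request-path=/health
gcloud compute backend-services create my-backend-service \
  --global --protocol=HTTP --port-name=http --health-checks=my-hc
gcloud compute backend-services add-backend my-backend-service --global \
  --instance-group=my-mig-eu --instance-group-region=europe-west2
gcloud compute backend-services add-backend my-backend-service --global \
  --instance-group=my-mig-us --instance-group-region=us-east1
gcloud compute url-maps create my-url-map --default-service=my-backend-service
gcloud compute target-https-proxies create my-https-proxy \
  --url-map=my-url-map --ssl-certificates=my-cert
gcloud compute forwarding-rules create my-https-rule \
  --global --target-https-proxy=my-https-proxy --ports=443 --address=my-global-ip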
Section 5 - Google Cloud Compute Engine and Load Balancing for Architects
For the architect certification, you not only need to know the services and how to use them, but also how to build resilient, H/A, scalable, secure, performant solutions
Availability
Percentage of time an application provides the expected operations
99.99% availability is an example (4.5 minutes of downtime per month) - most online apps aim for this value
Achieving H/A in Compute Engine
Use multiple regional instance groups for each microservice
Distribute load using a Global HTTPS Load Balancing (Cloud Load Balancing is H/A by default, managed service)
Configure Health Checks for Instance Groups and Load Balancing
Enable Live Migrations for VM instances
Live Migration
Your running instance is migrated to another host in the same zone
Supported even for instances with Local SSD
Not supported for GPUs and preemptible instances
Availability Policy (a config when you create VM instances)
On host maintenance - select either migration of instances, or restart
Automatic restart - GCP can auto restart instances if they were terminated for non-user initiated reasons
Scalability
Can we handle growth in users, traffic, or data size, without drop in performance?
Ability to adapt to changes in demand
Options
Deploy to a bigger instance, with bigger CPU/memory (vertical); vertical scaling has limits and can be expensive
- Machine type can be changed, to achieve vertical scaling: e2-standard-2 → e2-standard-4, for example; changing the type requires stopping the machine => drop in availability (see the command sketch after this list)
Increase the number of app instances and set up an LB (horizontal); also increases availability, whereas vertical scaling does not; but needs additional infrastructure, such as load balancers
Distribute VM instances (either in a single zone, or multiple zones, in a region, or even in multiple zones across regions)
Auto scale, using MIGs
Distribute load using LBs
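The vertical-scaling sketch referenced above (instance name, zone and machine types are placeholders; note the stop/start, hence the availability hit):
gcloud compute instances stop my-vm --zone=europe-west1-b
gcloud compute instances set-machine-type my-vm \
  --zone=europe-west1-b --machine-type=e2-standard-4
gcloud compute instances start my-vm --zone=europe-west1-b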
Security
Use firewall rules to restrict traffic (ingress and egress)
Use internal IP addresses as much as possible
Use sole-tenant nodes when you have regulatory needs
Create a hardened custom image to launch VMs
Performance
Choose the right machine family for your workload
Use GPUs (AI/ML) and TPUs (massive matrix operations) to increase performance
Prefer creating custom image, to installing custom software at boot time
Resiliency
Ability of system to provide acceptable behaviour even when one or more parts fail
Run VMs in MIGs behind global LB
Use Cloud Monitoring for monitoring
Install logging agent to send logs to Cloud Logging
Enable Live Migration and Automatic Restart
Up-to-date image copied to multiple regions, to support disaster recovery
GPUs
Used to accelerate math intensive and graphics-intensive workloads, for AI/ML, etc.
You can attach GPUs to VMs, but they are expensive, and you need to use images with GPU libraries
Not supported on all machine types
Cost
Sustained use discounts
Automatic (no action required on your part) discount for running VMs for a significant portion of the billing month
Discounts increase with usage
Applicable to instances created by GKE and Compute Engine
Not applicable on certain machine types, or VMs created by App Engine Flexible and Dataflow
Committed use discounts
Use these for workloads with predictable resource needs
You commit for 1 or 3 years
You commit for a region and a specific machine type
Up to 70% discount based on machine type and GPUs - higher than sustained use discounts
Applicable to instances created by GKE and Compute Engine
Not applicable on VMs created by App Engine Flexible and Dataflow
Preemptible VMs
Short-lived (last at most 24 hours), cheaper compute instances - similar to AWS spot instances (the newer version of Preemptible is actually called Spot)
Can be stopped by GCP at any time, before the 24 hours expire, and you get a 30-second warning
Use these if your apps are fault tolerant, you are cost sensitive, workload is not immediate (batch programs that can be delayed)
Not always available, no SLA, cannot be migrated to regular VMs, cannot be automatically restarted, free tier credits not available
Latest version of preemptible VMs are called Spot instances - these do not have a max runtime (24 hours), the other properties are the same
Billing for Compute Engine
You are billed by the second, after a minimum of 1 minute
You are NOT billed for stopped instances, but are still billed for attached storage
Always create a Budget and attach an Alert to get alerted of over spend
Cost efficiency
Use auto scaling - have optimal number and type of VM instances running
Understand sustained use discounts and committed use discounts
Use preemptible VMs
Section 6 - Getting Started with Gcloud
The gcloud command has a standard format: `gcloud GROUP SUBGROUP ACTION`, for example `gcloud compute instances list` or `gcloud compute instances create`
`gcloud config list` will show you configuration such as the default region to be used
You can use `--filter` to filter the results, for example `--filter zone:europe-west1`
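A few illustrative configuration commands (the project, region and zone values are placeholders):
gcloud config set project my-project
gcloud config set compute/region europe-west1
gcloud config set compute/zone europe-west1-b
gcloud config list
gcloud compute instances list --filter="zone:europe-west1-b"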
Cloud Shell
Backed by a VM instance that is automatically provisioned; 5 GB of persistent storage available in $HOME; prepackaged with the latest version of Cloud SDK, Docker, etc.; files in the $HOME directory persist between sessions; the instance is terminated after 20 mins of inactivity
Can be used to SSH into virtual machines using their private IP addresses
- Cloud Shell runs in a Google-managed VPC network which is VPC peered with the project’s VPC network (you can configure which VPC it should use)
Section 7 - Getting Started with Google Cloud Platform Managed Services
IaaS - Only use infrastructure from the cloud provider; for example, using a VM to deploy your applications and DBs
You are responsible for app code and runtime, load balancing, auto scaling, OS upgrades and patches, availability, etc.
Provider is responsible for virtualization, physical hardware, networking
PaaS
Cloud provider responsible for OS, application runtime, auto-scaling, availability and LB
You are responsible for app config, app code
Serverless
You don’t need to worry about where to deploy, the types of servers, OS, scaling, availability
Zero visibility into infrastructure
Pay for use - pay for requests, not servers
Focus on code
Managed services for Compute
Compute Engine - IaaS
App Engine - PaaS
Cloud Run - CaaS (container as a service); does not need a Cluster
GKE - CaaS; needs a Cluster
Cloud Functions - FaaS
Section 8 - Getting Started with Google Cloud App Engine
PaaS - lower flexibility, but fewer responsibilities compared to Compute Engine
Simplest way to deploy and scale apps in GCP; provides end-to-end app management
Supports Go, Java, .NET, Node.js, PHP, Python, Ruby (in Standard mode) and anything basically (containers) in Flexible mode
Has good integration to various storage products, like Cloud SQL
No usage charges - pay for resources provisioned
Provides automatic load balancing and auto-scaling
Provides app health monitoring
Provides app versioning + traffic splitting
Types
Standard
Apps run in language specific sandboxes; complete isolation from OS/disk/other apps
V1: Java, Python, PHP, Go (older versions)
V2: Java, Python, PHP, Node.js, Ruby, Go (newer versions)
Flexible
Apps run within Docker containers
Makes use of Compute Engine virtual machines
Support any runtime (built-in support for Python, Java, Node.js, Go, Ruby, PHP, or .NET)
Provides access to background processes and local disks
Component hierarchy
Application → Service → Version → Instance
Application
One App per Project; the container for everything created as part of App Engine
Limited to a single region, so be careful which region you select
app.yaml file is used to define configuration of the application
Service
You normally define one per microservice
Each service can have its own configuration, scaling settings, and runtime
Each service has its own URL
Each Service can have different versions
Version(s)
A Version is a specific iteration of a service
Code + configuration
Multiple versions can co-exist, and support traffic split
gcloud app deploy --version=v2 (traffic will automatically be switched over to this version; you can use --no-promote to avoid this)
gcloud app services set-traffic --splits=v3=.5,v2=.5
Instance(s)
- An Instance is a single copy of a version of a service. Instances are the actual compute resources that handle incoming requests and run your code
Standard vs Flexible
Feature | Standard | Flexible |
Pricing | Instance hours | vCPU, memory, persistent disks |
Scaling | Manual, Basic, Automatic | Manual, Automatic |
Scaling to 0 | Yes | No |
Instance startup time | Seconds | Minutes |
Rapid scaling | Yes | No |
Max request timeout | 1 to 10 mins | 60 mins |
Local disk | Mostly. Can write to /tmp | Yes. Ephemeral |
SSH for debugging | No | Yes |
Scaling App Engine instances
Automatic - recommended for continuously running workloads;
Can scale based on target CPU utilization, target throughput utilization, and max concurrent requests
Can configure max/min instances
Basic - instances are created as and when requests are received
Recommended for adhoc workloads
High latency is possible
Tries to keep costs low
Not supported by Flexible
Configure Max Instances and Idle Timeout
Manual - you configure a fixed number of instances that run continuously
Section 9 - Google Cloud Kubernetes Engine for Cloud Architects
Managed Kubernetes service
Each cluster can have different kinds of VMs (Nodes), defined in Node Pools (you can have multiple Node Pools per cluster) - you can use labels and affinity to define which Node Pool a Pod should be placed on
Provides auto-repair and auto-upgrade
Clusters can be zonal or regional (NOT global)
Provides Pod and Cluster Autoscaling
HPA automatically scales the number of pod replicas in a deployment or replication controller based on observed CPU utilization (or other select metrics)
VPA automatically adjusts the resource requests and limits (CPU and memory) for containers in your pods to match the required resources (useful for stateful apps, where horizontal scaling isn’t sufficient)
Integrates well with Cloud Logging and Monitoring
Uses Container-Optimized OS, a hardened OS, purpose built to run containers, built by Google
Supports mounting PDs and local SSDs; for example:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-ssd
provisioner: kubernetes.io/gce-pd   # dynamically provisions GCE Persistent Disks
parameters:
  type: pd-ssd
  replication-type: none            # zonal PD (use regional-pd for cross-zone replication)
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-dynamic-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard-ssd    # references the StorageClass above
---
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container
      image: nginx
      volumeMounts:
        - mountPath: "/mnt/data"
          name: my-dynamic-storage
  volumes:
    - name: my-dynamic-storage
      persistentVolumeClaim:
        claimName: my-dynamic-pvc   # binds the Pod to the PVC above
Two ways to create clusters: Autopilot (hands-off, pay per pod), Standard (full ownership)
Autopilot
It’s now the standard UI that pops up when you try to create a cluster; it’s hard to find the button that takes you to Standard
Reduce operational costs; hands-off experience, no managing of Nodes and Node Pools
Provides security hardened opinionated settings
Commands
gcloud container clusters resize my-cluster --node-pool default-pool --num-nodes 5 --zone europe-west1-b
kubectl autoscale deployment hello-world-deployment --max=4 --cpu-percent=70 (will create a Horizontal Pod Autoscaler which references a Deployment)
gcloud container clusters update my-cluster --enable-autoscaling --min-nodes=1 --max-nodes=10 - will add up to 10 Nodes to a cluster, depending on resource requests
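For completeness, a hedged sketch of creating Autopilot vs Standard clusters and fetching credentials (cluster names, locations and sizes are placeholders):
# Autopilot: regional, Google manages the nodes
gcloud container clusters create-auto my-autopilot-cluster --region=europe-west1
# Standard: you manage the node pools
gcloud container clusters create my-standard-cluster \
  --zone=europe-west1-b --num-nodes=3 --machine-type=e2-standard-4
# Configure kubectl to talk to the cluster
gcloud container clusters get-credentials my-autopilot-cluster --region=europe-west1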
Cluster
Master node(s)
Manages the cluster
Not visible to you, Google managed
Components
API Server - handles all communications for a K8s cluster (from nodes and from outside)
Scheduler - decides placement of Pods
Controller Manager - manages deployments and replica sets
etcd - distributed database, storing the cluster state
Worker node(s)
You can actually see them in Compute Engine, when using Standard, and not Autopilot
Runs Pods
Kubelet - manages communication with the Master node(s)
Cluster types
Zonal - single zone, single control plane; all Nodes run in the same zone; can be multi-zonal: single control plane, but nodes run in multiple zones
Regional - replicas of control plane run in multiple zones within a region; Nodes also run in the zones where the control plane Nodes run
Private - VPC-native cluster; Nodes only have internal IP addresses
Alpha - created with early feature APIs
GCR - Google Container Registry
Alternative to DockerHub
Fully managed container registry
Naming: HostName/ProjectId/Image:Tag => gcr.io/projectName/helloworld:1.0
Scenarios
Keep costs low and optimize GKE implementation => preemptible VMs, use correct region, committed-use discounts
Efficient, completely auto-scaling GKE solution => Use HPA and Cluster Autoscaler
Execute untrusted third-party code in K8s cluster => Create a separate Node Pool with GKE sandbox and run it there
Enable only internal communication between microservices => Service of type ClusterIP
Pod stays Pending => Pod cannot be scheduled because of insufficient resources
Pod stays Waiting => Problem with pulling the image
Section 10 - Getting Started with Google Cloud Functions
Run code in response to events
Node.js, Python, Go, Java, .NET, Ruby
You don’t need to worry about servers, scaling, availability; you can focus on code
Even though it’s serverless, you must still choose a region for the function to run in, as well as the memory (which implies CPU)
Pay for what you use - number of invocations, time of execution, memory and CPU used
Default timeout of 1 min, configurable to up to 60 mins (for HTTPS triggers)
Two versions: 1st gen, 2nd gen (built on Cloud Run and EventArc)
Event = something that happened
Trigger = respond to an event with a Function call; the Event data is passed to the function
Events can be triggered from: Cloud Storage, Cloud Pub/Sub, HTTP call, Firebase, Firestore, Cloud Logging (formerly Stackdriver); there is an Other option, where you can basically choose any kind of EventArc trigger
Cloud Functions is currently being integrated into the Cloud Run UI; in fact, the product is now called “Cloud Run Functions”
You can select memory and CPU, but memory is dependent on CPU (“Selected memory requires a minimum of N vCPUs”)
You can choose either unauthenticated access or authenticated access when selecting HTTP triggers
You can select the maximum number of concurrent requests that can reach a container instance
You can select the service account that the Function assumes (in case it needs to interact with other GCP products)
You can select the networking: allow all traffic, allow internal traffic (from VPC networks in the same project), internal traffic + from Cloud Load Balancing
For egress, the function can’t normally access VPCs, but you can set this up using a VPC connector
Supports traffic splitting between versions (gen 2)
Section 11 - Getting Started with Google Cloud Run
“Container to Production in Seconds”
Built on top of Knative
Pay per use (CPU, Memory, Requests and Networking)
Cloud Run supports yaml configuration to deploy containers
Supports traffic splitting (similar to Cloud Functions)
Anthos - run Kubernetes clusters anywhere, including on your data centres
- Cloud Run for Anthos - deploy workloads to Anthos clusters running on-prem or on Google Cloud
gcloud run deploy
gcloud run revisions list
gcloud run services update-traffic myservice --to-revisions=v2=10,v1=90
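A minimal deploy sketch (the image, region and instance limit are assumptions):
gcloud run deploy myservice \
  --image=gcr.io/my-project/myservice:1.0 \
  --region=europe-west1 \
  --allow-unauthenticated \
  --max-instances=10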
Cloud Run is getting more and more similar to Cloud Functions, especially now that Cloud Functions are actually powered by Cloud Run, but there are nuances:
Cloud Run primarily uses HTTP as triggers, whereas Functions offer more trigger options
Cloud Functions is fully serverless, you don’t manage anything, whereas with Cloud Run you get more flexibility
In Cloud Functions, you directly deploy code, whereas in Cloud Run you use containers
Cloud Functions supports predefined runtimes, whereas Cloud Run supports anything via containerization
Use Cloud Functions for simple event-driven functions and Cloud Run for flexible microservices and APIs
Feature | Cloud Functions | Cloud Run | App Engine |
Type | Function-as-a-Service (FaaS) | Container-as-a-Service (CaaS) | Platform-as-a-Service (PaaS) |
Use Case | Event-driven, single-purpose functions | Containerized microservices, APIs | Full web applications, backend services |
Deployment Unit | Individual functions | Containers | Applications with multiple services |
Language Support | Limited set of runtimes (Node.js, Python, Go, Java, .NET, Ruby, PHP) | Any language that can run in a container | Standard env: Limited runtimes Flexible env: Any via containers |
Statelessness | Stateless | Stateless | Can be stateful with datastore |
Request Timeout | Up to 9 minutes (1st gen and 2nd gen event-driven); up to 60 minutes for 2nd gen HTTP functions | Up to 60 minutes | Standard: 1-10 minutes Flexible: 60 minutes |
Cold Start Time | Low to medium | Medium | Standard: Very low Flexible: Higher |
Scaling | Automatic scaling to zero | Automatic scaling to zero | Standard: Automatic scaling to zero Flexible: Scales to minimum instances |
Resource Allocation | Memory: 256MB-8GB CPU: Limited options | Memory: 512MB-32GB CPU: Full control | Standard: Limited Flexible: Customizable |
Pricing Model | Pay per invocation and compute time | Pay per request and container instance time | Standard: Pay per instance hour Flexible: VM instance pricing |
Free Tier | 2M invocations/month 400,000 GB-seconds/month | 2M requests/month 360,000 vCPU-seconds/month 180,000 GiB-hours/month | 28 instance hours/day 9 instance hours/day (flexible) |
Execution Environment | Fully managed | Fully managed | Fully managed |
Custom Domain Support | Via API Gateway | Built-in | Built-in |
VPC Networking | Supported | Supported | Supported (flexible only) |
Startup Time | Fast | Medium | Standard: Very fast Flexible: Slower |
Dependencies | Limited | Any dependencies in container | Standard: Limited Flexible: Any |
Background Processes | Limited | Supported | Standard: Not supported Flexible: Supported |
Section 12 - Exploring Google Cloud Functions 2nd generation
Longer request timeout (up to 60 mins for HTTP-triggered functions)
Larger instance sizes (up to 32 GiB of memory with 8 vCPU)
Concurrency: up to 1000 concurrent requests per function instance (v1 only supported 1 request per instance)
Built on top of Cloud Run (supports revisions + traffic splitting)
Support for 90+ event types, via EventArc
When you deploy a Cloud Function, behind the scenes a Cloud Build build is triggered, that creates a container and deploys it
Scaling and concurrency
Autoscaling - in the past, as new invocations come in, new function instances are created; 1 request per instance (this is how 1st gen works)
- A function that has completed execution can be used by future requests
Typical problem is cold start (new function instance takes time to be ready)
- A typical solution is to configure a minimum number of instances, regardless of number of requests => increased cost
2nd gen supports concurrency - how many invocations a single instance can handle, up to 1000 (but your code should be able to handle concurrent requests)
gcloud functions deploy <name>
--source (Zip file in GS, or source repo, or local filesystem)
--trigger-bucket or --trigger-http or --trigger-topic or --trigger-event-filters (only for gen2, uses EventArc events)
--docker-registry (default is container-registry, an alternative is artifact-registry)
--docker-repository
--gen2
--runtime
--service-account (by default, 1st gen uses the App Engine default service account and 2nd gen uses the Compute Engine default service account)
--timeout
--min-instances
--max-instances
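Putting a few of these flags together, a hedged 2nd gen HTTP function deploy (the runtime, entry point, region and instance limits are placeholders):
gcloud functions deploy my-function \
  --gen2 \
  --runtime=python312 \
  --region=europe-west1 \
  --source=. \
  --entry-point=handler \
  --trigger-http \
  --allow-unauthenticated \
  --min-instances=1 \
  --max-instances=10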
Best practices
Set minimum number of instances, to avoid cold starts
Configure max number of instances (protect from abnormally high request levels)
Use Cloud Endpoints (or Apigee or API Gateway) for versioning
Use Cloud Run revisions for safer releases
Use Secret Manager to securely store secrets
Use individual Service Accounts for each function
Section 13 - Getting Started with Cloud KMS
Data can be at rest, in transit (in motion), either within the cloud or in/out of the cloud, and in use (non-persistent, in volatile memory)
Defense in depth => encrypt all data (hard disks, databases, in transit)
Symmetric key encryption uses the same data encryption key for encryption and decryption
You need to choose the right algorithm (strong)
You need to secure the encryption key
You need to figure out how to share the encryption key
Asymmetric key encryption uses a public and a private key; the public key is publicly available; encrypt with public, decrypt with private
- Finding the private key from a public key, when using a strong algorithm, is computationally infeasible
Cloud KMS creates and manages cryptographic keys (both symmetric and asymmetric)
You can control the keys' usage in your apps and GCP services
KMS provides APIs to encrypt, decrypt and sign data
Integrates with Google services
Google-managed key: no action on your part
Customer-managed key: you create a key in KMS and use it
Customer-supplied key: provide your own
Using KMS
Before creating a key, you need to create a keyring, where you can attach multiple keys (keyrings can be global or regional)
You can integrate an external key manager with KMS (only for regional keyrings)
You choose the protection level: Software, or HSM
You choose the purpose: symmetric encrypt/decrypt, asymmetric sign, asymmetric decrypt
GCP can automatically rotate keys
You need to configure IAM access accordingly; for example, if using a customer managed key to encrypt disks for Compute Engine, the Compute Engine service account needs access to the key
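A sketch of the keyring/key flow and encrypting a file with gcloud (names, location, rotation period and rotation date are placeholders):
gcloud kms keyrings create my-keyring --location=europe-west1
gcloud kms keys create my-key \
  --keyring=my-keyring --location=europe-west1 \
  --purpose=encryption --rotation-period=90d \
  --next-rotation-time=2025-06-01T00:00:00Z
gcloud kms encrypt \
  --key=my-key --keyring=my-keyring --location=europe-west1 \
  --plaintext-file=secret.txt --ciphertext-file=secret.txt.enc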
Section 14 - Exploring Block and File Storage
Block storage
Hard disks attached to computers
Typically, one block storage device is attached to one virtual server (a disk attached in read-only mode can be attached to multiple machines)
You can connect multiple block storage devices to one server
Used as: DAS (direct-attached storage), SAN (storage area network)
File storage
Shared by several servers
You can connect to it from GKE or Compute Engine instances
Block storage in GCP
Persistent Disks: network block storage
Zonal or regional (2x the cost of Zonal): defines how data is replicated across zones (when writing data to a regional PD, GCP ensures that it is synchronously written to the replica)
Durable, lifecycle not tied to the host
Provisioned capacity, you configure the size you want; you can increase size while attached to VM
Performance scales with size
By default, a boot disk is attached to your VMs, but you can add more
By default, disks are deleted when the VM is deleted; can be configured when creating the instance (Keep boot disk)
Scenarios
Improve performance: increase size, add more PDs, increase vCPUs of the instance
Increase durability: use regional PDs (replicated across 2 zones)
Hourly backups: schedule snapshots
Delete old snapshots: configure part of the schedule
Attaching a PD to a running instance
Attach disk to the running/stopped VM:
gcloud compute instances attach-disk <instance_name> --disk <disk_name>
Format the disk
Mount the disk
Resizing PDs (even when running)
gcloud compute disks resize <disk_name> --size <new_disk_size>
Take a snapshot, for restore, in case there are issues
Resize the file system and partitions
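The same flow as a hedged end-to-end sketch (disk/instance names, zone and the /dev/sdb device path are assumptions; on a real VM check the device under /dev/disk/by-id/):
gcloud compute instances attach-disk my-vm --disk=my-data-disk --zone=europe-west1-b
# On the VM: format (first time only) and mount
sudo mkfs.ext4 -F /dev/sdb
sudo mkdir -p /mnt/disks/data
sudo mount /dev/sdb /mnt/disks/data
# Later: grow the disk, then grow the filesystem
gcloud compute disks resize my-data-disk --size=200GB --zone=europe-west1-b
sudo resize2fs /dev/sdb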
Types
Type | Underlying storage | Sequential IOPS | Random IOPS | Cost | Use case |
Standard | Hard Disk Drive | Good | Bad | Cheap | Big Data (cost efficient) |
Balanced | SSD | Good | Good | Cheaper than SSD | Balance between cost and performance |
SSD | SSD | Very good | Very good | Expensive | High performance |
Snapshots
Point in time backups of PDs
Can be scheduled
Can be auto-deleted after X days
Can be multi-regional or regional => increase durability
Can be shared across projects
Disks can be created from snapshots; if it’s a boot disk, VM instances can be launched from it
Incremental - the 2nd snapshot only contains changes since the 1st snapshot; deleting a snapshot only deletes data that is not needed by other snapshots
You should keep similar data together on a PD: separate OS, volatile data and permanent data
Snapshots are encrypted in the same way as the main disk
Recommendations
Avoid taking them more often than once per hour → they can reduce performance; schedule them during off-peak hours
Creating snapshots from disk is faster than creating from images, but creating disks from images is faster than creating from snapshots; snapshots are incremental, require some calculation before creating a disk; if you are repeatedly using a snapshot to create disks, you should create an image first
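A sketch of an on-demand snapshot plus a daily snapshot schedule attached to a disk (all names, times and retention values are placeholders):
gcloud compute disks snapshot my-data-disk \
  --zone=europe-west1-b --snapshot-names=my-data-disk-snap-001
gcloud compute resource-policies create snapshot-schedule my-daily-schedule \
  --region=europe-west1 --daily-schedule --start-time=04:00 \
  --max-retention-days=14
gcloud compute disks add-resource-policies my-data-disk \
  --zone=europe-west1-b --resource-policies=my-daily-schedule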
Machine Images
Different from an Image
- A Compute Engine Image is a template for creating new VMs. It captures the boot disk of a VM and can include an OS, applications, and basic configurations.
A machine image is a Compute Engine resource that stores all the configuration, metadata, permissions, and data from multiple disks of a virtual machine (VM) instance. You can use a machine image in many system maintenance, backup and recovery, and instance cloning scenarios.
Local SSDs: present (physically attached) on the same host as the VM; not available for all machine types
Temporary data (ephemeral storage), does not survive maintenance events unless Live Migration is enabled
Lifecycle tied to the host, and they aren’t backed up by snapshots or by machine images
High IOPS
Data automatically encrypted but you can’t configure the keys
Only supported by some machines
Supports SCSI and NVMe interfaces
Performance depends on the size of the disk and the vCPUs of the instance
Advantages: very fast I/O (10-100x compared to PDs)
Disadvantages: lower durability, lower availability, snapshots not supported
File Storage in GCP - Filestore: high performance file storage
Shared cloud file storage
Supports NFSv3
Provisioned capacity (1 - 64 TB, or 100 TB for high capacity)
Performance scales with capacity
Supports HDD (general purpose) and SSD (performance-critical workloads)
Use cases: file share, media workflows, content management
Global, Regional and Zonal resources
Global: images, snapshots, instance templates
Regional: regional MIGs, regional PDs
Zonal: zonal MIGs, VM instances, PDs (you can only attach PDs to instances in the same zone)
Storage scenarios
Very high IOPS but fine to lose data - Local SSDs
High performance file sharing system, attachable to multiple VMs - Filestore
Backup VM configs along with all PDs - Machine Images
Make it easy to launch VMs with hardened OS and customized software - Custom Image
Section 15 - Exploring Object Storage (Cloud Storage)
Most popular, flexible, inexpensive
Serverless: autoscaling and infinite scale
Treats entire object as a unit (no partial updates)
Access control can be set at object level (ACL), but it’s not recommended (uniform access control is recommended, which acts at bucket level)
Provides REST API to access and modify objects
- Also provides CLI (gsutil, NOT gcloud) and client libraries (C++, C#, Java, Node.js, PHP, Python, Ruby)
Can be used to store any file types: media files, archives, backups, staging data during migrations
- Many GCP services support import/export to Cloud Storage
Objects (key + value) are stored in Buckets
Buckets
Bucket names are globally unique → used in the object URL, so there are some restrictions on the name (only lower case, numbers, hyphens, underscores and periods; max size 63, min size 3)
Can upload an unlimited number of objects
Each Bucket belongs to a specific Project
Keys are unique in a bucket
Max object size is 5 TB
Can be regional, dual-regional (but only a subset of combinations), multi-regional (but on the same continent) options for storage
You can set up cross-bucket replication between Buckets (on a best-effort basis, but Turbo Replication is available, which provides an SLA of 15-minute replication across regions)
You can choose the storage class - there are huge variations in access patterns (some accessed every day, others not at all, others once a month) => cheaper price for infrequently accessed objects;
Storage class can be defaulted at bucket level, but can be overwritten at object level
Some features are independent of storage class
11 9s of durability
Low latency (first byte typically in tens of ms) → this is different to AWS
Unlimited storage and autoscaling
No minimum object size → this is different to AWS
Standard
No minimum storage duration
Availability of > 99.99% in multi-region or 99.99% in single region
Frequently used, short period of time
Nearline
Minimum storage duration of 30 days
99.95% availability in multi region and dual region, 99.9% in regional
Read or modify once a month on average
Coldline
Minimum storage duration of 90 days
99.95% in multi/dual region, 99.9% in regional
Accessed at most once a quarter
Archive
Minimum storage duration of 365 days
99.95% in multi/dual region, 99.9% for regional
Accessed less than once a year
Autoclass
- Automatically transitions objects between storage classes (e.g. Standard and Nearline) based on access patterns
You can choose access control
ACLs - Fine grained (object level)
Uniform (bucket level) - works by attaching policies to the bucket, which define mappings between roles and principals
Versioning
Prevents accidental deletion and provides a historical view of an Object
Enabled at bucket level; can be turned on and off at any time
Live version is the latest version; if you delete a live object, it becomes non-current; if you delete non-current version, it is completely deleted
Versions are identified by the object key + a generation number; use this ID to perform operations with an older version
You can reduce costs by deleting older (noncurrent) versions - can use lifecycle management policies for this
Lifecycle management
Generally, usage reduces over time => you should automatically change storage classes/delete files that are old
Rules are configured at the bucket level
Identify objects using conditions (age, created time, live/noncurrent, number of newer versions); you can set multiple conditions
Supported actions
Set storage class, with supported transitions Standard → Nearline/Coldline/Archive, Nearline → Coldline/Archive, Coldline → Archive (basically, it flows downward only)
Deletion
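A sketch of a lifecycle configuration applied with gsutil (the thresholds are arbitrary): move Standard objects to Nearline after 30 days and delete noncurrent versions after 365 days.
cat > lifecycle.json <<'EOF'
{
  "lifecycle": {
    "rule": [
      {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
       "condition": {"age": 30, "matchesStorageClass": ["STANDARD"]}},
      {"action": {"type": "Delete"},
       "condition": {"age": 365, "isLive": false}}
    ]
  }
}
EOF
gsutil lifecycle set lifecycle.json gs://BKT_NAME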
Encrypting Cloud Storage data
Cloud Storage encrypts data on the server side by default
Server-side encryption (enforced by GCP, always done)
Performed by GCS after it receives the data
Use HTTPS to ensure data is safe in transit
Google-managed encryption: default, no config needed; GCP manages the KMS keys, not you (they are not CMKs)
Customer-managed: keys managed by you (CMKs), in Cloud KMS (symmetric encryption)
- The GCS Service Account will need access to your keys in KMS
Customer-supplied: supply the key with every GCS operation; GCP will not store the key, just use it to encrypt; when retrieving data, you need to send the right key
Add API headers: `x-goog-encryption-algorithm`, `x-goog-encryption-key` (base64 encryption key), `x-goog-encryption-key-sha256` (encryption key hash)
Or use gsutil
Client-side encryption (optional)
- You encrypt data before sending to GCP
Cloud Storage metadata
Every Object has Metadata, which is a set of key/value pairs, for example:
Content-Type
Content-Encoding
Content-Disposition (how the content should be displayed)
Cache-Control
Content-Language
Custom key/value tags
Storage class is also a key/value pair, for example (`storageClass: STANDARD`)
Types of metadata
Fixed-key (fixed key, like Cache-Control, you can only change value)
Custom: flexible key/value pairs
Non-editable (size, created, last modified, etc.)
Bucket Lock
Helps meet Compliance/Regulatory needs
Configure data retention policy - can only modify objects after the retention period expires
Can be configured when creating a bucket, or at a later point in time; applies to existing objects as well
You can lock a retention policy (optional) => prevents decreasing/deleting the retention policy by anyone, but you can increase it
Buckets cannot be deleted if they contain objects that are younger than the retention period
Transferring data from on-prem to cloud
GCS is the most popular destination (as a staging location)
Online Transfer: gsutil or GCS APIs
Useful for one-time transfers
< 1 TB and when transferring from on-prem, or another GCS bucket
Storage Transfer Service
Lots of data (petabytes) from on-prem, other clouds, or between GCS buckets
You can set up a repeating schedule
Supports incremental transfers (only transfer changed objects)
Reliable and fault tolerant - keeps track of where it left off
Transfer Appliance
Physical data transfer
Recommended when data size > 20 TB, or if transferring data takes > 1 week
TA40, TA300
Google Cloud Storage Transfer Service is best suited for large-scale, scheduled, and automated data transfers, especially when migrating data from other cloud providers or on-premises systems. It provides robust features for data validation, scheduling, and notifications.
gsutil is a versatile command-line tool that is ideal for ad hoc transfers, scripting, and automation. It provides a wide range of commands and options for managing GCS data and is suitable for smaller-scale or one-time transfers.
GCS best practices
Avoid sensitive info in bucket or object names
Store data in closest region to the users
Ramp up request rate gradually, above 1000 write/5000 read requests per second
Use exponential backoff when receiving 5xx or 429 (too many requests)
Do not use sequential numbers or timestamps as object keys (add hash value before sequence value)
Use Cloud Storage FUSE to enable file system access to Cloud Storage; allows you to mount Buckets as file systems
Cloud Storage CLI commands
gsutil mb gs://BKT_NAME - make bucket
gsutil ls -a gs://BKT_NAME - list objects in bucket (-a lists all versions, not only the current one)
gsutil cp source destination - copy objects
gsutil mv - move/rename objects
gsutil rewrite -s STORAGE_CLASS gs://BKT_NAME/OBJECT - change storage class
gsutil signurl -d 10m YOUR_KEY gs://BKT_NAME/OBJECT - signed URL for temporary access
ACLs (Access Control Lists)
Who has access to your buckets and objects, and at what level
Finer granularity compared to IAM (also called uniform), which works at bucket level
A user gets access if they are allowed by either ACLs or IAM (OR, not AND)
Use ACLs to customize access to individual objects
Signed URLs
Allow access to a user for a limited time to objects, when the user does not have a Google account
Create a key for a Service Account and use it to create a signed URL:
gsutil signurl -d 10m KEY gs://BUCKET_NAME
Exposing public static websites from Cloud Storage
Create a bucket with the same name as the website name (name of bucket should match DNS name of website)
- Verify that the domain is owned by you
Add files to the bucket (including index.html and error.html)
Add member `allUsers` and grant the Storage Object Viewer role, in the Permissions section of the bucket (fine grained/IAM based)
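A sketch of the same steps with gsutil (the bucket/domain name is a placeholder):
gsutil mb gs://www.example.com
gsutil cp index.html error.html gs://www.example.com
# Make objects publicly readable
gsutil iam ch allUsers:objectViewer gs://www.example.com
# Set the index and error pages
gsutil web set -m index.html -e error.html gs://www.example.com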
Section 16 - Authentication in Google Cloud with Cloud IAM
Identities can be human or non-human, and need to access resources and perform actions: Members perform Actions on Resources
Cloud IAM provides authentication and authorization
Identities can be
GCP User
Group of GCP Users
Application running in GCP
Application running in a non-GCP data center
Unauthenticated users
Cloud IAM provides very granular control
Member = who
- Member types are identified by a prefix: user, serviceAccount, group, or domain
Permission = actions that can be taken on resources; have the form `group.subgroup.action`, e.g. `compute.instances.create`
Role = set of permissions
Policy = assign permissions to members: what actions can be performed on which resource, by whom and under what condition
Permissions are NOT directly assigned to members, but via bindings; bindings can be applied at all levels of the GCP hierarchy (organization, folder, project, resource)
When you add members and assign roles at the appropriate level (organization, folder, project, or resource) through the Google Cloud Console, GCP automatically creates the necessary IAM bindings for you. These bindings associate the members with the roles you have selected.
{ "bindings": [ { "role": "roles/storage.objectViewer", "members": [ "user:example-user@example.com" ], "condition": { "title": "Time-based access", "description": "Allows access only between 8 AM and 5 PM UTC", "expression": "request.time >= timestamp('2025-01-15T08:00:00Z') && request.time <= timestamp('2025-01-15T17:00:00Z')" } } ] }
Role: set of permissions to perform specific access on specific resources -> different to AWS (where Roles can be assumed by trusted entities)
Types
Basic (Primitive) - Owner/Editor/Viewer: wide set of permissions
Earliest version, they existed before IAM
NOT recommended - do not normally use in production (can use to give an Auditor read only access to everything, for example)
Predefined: fine grained, managed by Google
Different roles for different purposes
Storage Admin, Storage Object Admin, Storage Object Viewer, Storage Object Creator
Custom: created by you
Policy: binding between Members and Roles
You can use Policy Troubleshooter to test permissions for a given Member (in case they complain that they can’t access a specific Resource that they should have access to)
Service Accounts
Used when an Application/VM needs access to a resource (you shouldn’t use personal credentials for this)
Service Accounts are identified by an e-mail address that ends in something like `gserviceaccount.com`
Some Service Accounts are automatically created, such as the App Engine and Compute Engine default service accounts
Service Accounts can access resources, but they can be accessed themselves as resources, by other Members: “You can allow specific users to have ownership and access to service accounts and their settings. Users with the Owner or Editor basic role on the project can already modify service accounts, but you might want to restrict access for some users so that they can take only specific actions against service account resources”
You can impersonate a service account, if your own principal has the Service Account Token Creator role (`roles/iam.serviceAccountTokenCreator`):
gcloud auth print-access-token --impersonate-service-account=[SERVICE_ACCOUNT_EMAIL]
They do not use passwords, but instead rely on private/public RSA keys for authentication
- You can’t use it to login via UIs or cookies
Types
Default service account - automatically created when some services are used; not recommended, because they have Editor role
User Managed - recommended, provide fine grained access control
Google-managed - created and managed by Google, to perform operations on the user’s behalf
Use cases
VM → Cloud Storage
Uses Google Cloud managed keys, which are automatically rotated
Do NOT delete Service Accounts for running instances => apps will lose access
On-prem → Cloud Storage (long lived)
You cannot directly assign Service Accounts
Create a Service Account user managed key and use the key to authenticate yourself as the Service Account
Make sure you keep the key file secure; you cannot regenerate the same key again => you need to generate a new key, if you lose the file
- Set env variable called GOOGLE_APPLICATION_CREDENTIALS to the path to the key file
Use Google Cloud Client libraries; they use Application Default Credentials (ADC); it uses the GOOGLE_APPLICATION_CREDENTIALS env variable
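A hedged sketch of that flow (the project, account name and role are placeholders; prefer short-lived credentials where possible):
gcloud iam service-accounts create my-app-sa --display-name="My on-prem app"
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:my-app-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"
gcloud iam service-accounts keys create key.json \
  --iam-account=my-app-sa@my-project.iam.gserviceaccount.com
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json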
On-prem → GCP APIs (short lived)
Less risk compared to sharing Service Account keys
Credential types
OAuth 2.0 access tokens
OpenID Connect ID tokens
Self-signed JWTs
Application Default Credentials (ADC) in Google Cloud Platform (GCP) provide a way to automatically find and use the credentials needed for your application to authenticate and make API calls to Google Cloud services. ADC can be used with client libraries, gcloud commands, and other tools to simplify the process of authentication. ADC looks for credentials in the following order:
1. Environment variable: if `GOOGLE_APPLICATION_CREDENTIALS` is set, ADC uses the credentials pointed to by this variable. It should contain the file path to a service account key JSON file.
2. Application Default Credentials file: if the environment variable is not set, ADC checks for the default credentials file in the user's home directory, located at ~/.config/gcloud/application_default_credentials.json
3. Google Cloud SDK: if the credentials file is not found, ADC checks if the Google Cloud SDK is installed and authenticated using `gcloud auth application-default login`.
4. Compute Engine, Kubernetes Engine, App Engine, Cloud Functions: if no credentials are found, ADC checks if the application is running on a Google Cloud service that provides built-in service accounts, such as Compute Engine, Kubernetes Engine, App Engine, or Cloud Functions.
Section 17 - Exploring Databases in Google Cloud Platform
Increasing availability
- Have multiple standby instances available, with synchronous replication
Increase durability
- Have multiple copies of the data available across multiple zones, regions
RTO - recovery time objective = maximum acceptable period to recover
RPO - recovery point objective = maximum acceptable data loss
Scenarios
Very small RPO and RTO → hot standby
Very small RPO, moderate RTO → warm standby with automatically synced data; scale up the standby
Very small RPO, longer RTO → regular snapshots and transaction logs + create DBs from snapshots and tx logs
Data can be lost (cached data) → failover from old server to completely new server
Database categories
Relational - predefined schema with tables and relationships; strong transactional capabilities
OLTP
Large number of users make large number of small transactions
Most traditional apps (ERP, CRM, e-commerce, banking apps)
MySQL, Oracle, SQL Server, etc.
Usually use row storage
GCP solutions
Cloud SQL (up to a few TB of data)
Cloud Spanner: unlimited scale (multiple PBs), and 99.999% availability; recommended for global apps with horizontal scaling
OLAP
Allow users to analyze petabytes of data: reporting apps, data warehouses, BI apps, analytics systems
Usually use column storage => high compression, easier to distribute data across nodes
GCP managed service: BigQuery (petabyte scale, distributed data warehouse)
Non-relational (NoSQL = not only SQL); flexible schema, horizontally scalable; typically trade-off strong consistency and SQL features for scalability and high performance
Document
Cloud Datastore: provides ACID transactions, SQL-like queries, indexes; small to medium DBs
Cloud Firestore (next version of Datastore): strong consistency, web and client libraries; small to medium DBs
Wide-column
- BigTable: not serverless, you create an instance and then tables; recommended for data size > 10 TB; not recommended for transactional workloads (does not support multi-row transactions)
Key/Value
Graph
In-memory
Data stored in memory => much faster than disk
Memorystore: caching, session management, gaming leaderboards, geospatial apps
Factors when choosing a type
Fixed schema?
Level of transactional properties?
Latency requirements?
Number of expected transactions?
Amount of data?
Section 18 - Relational Databases for Transactional Applications in Google Cloud Platform
Cloud SQL
Fully managed, relational DB service
Configure your needs and do not worry about managing the DB
Supports MySQL, Postgresql, and SQL Server
It’s regional, NOT global
Supports SSDs (best performance) or HDDs
Provisioned capacity: up to 96 vCPU, 624 GB RAM and 64 TB storage (supports automatic storage increase without downtime)
Important features
Automatic encryption, maintenance and updates
H/A with failover, by using a Standby
Choose Primary and Secondary zones within a region
Changes from Primary are replicated synchronously to Secondary
In case of a Zonal failure, automatic failover is done, but it does NOT revert automatically
You cannot connect to the Secondary, when Primary is running and vice versa; you can only connect to one at a time (if you want to be able to connect to others, use Read Replicas)
Read replicas (cross-zone, cross-region, external)
- Each read replica has its own endpoint that clients can connect to; if you try to write, you get an error:
ERROR 1290 (HY000): The MySQL server is running with the --read-only option so it cannot execute this statement
PITR: point in time recovery (down to the second), up to 35 days
- Cloud SQL offers Point-in-Time Recovery (PITR) by combining daily automated backups with continuous binary logging
Automated backups (daily), and on-demand backups
Migration from other data sources using DMS (Database Migration Service)
Export data in CSV or SQL format
Use Cloud SQL for simple relational use cases; use Spanner (very expensive) if you need:
Huge volumes of data (100s of TB)
Infinite scaling
Global database needs
Higher availability (99.999%)
Best practices
Use Cloud SQL Auth Proxy to securely connect to Cloud SQL from your apps (GAE, Functions, Cloud Run, GKE, etc.)
- Allows you to connect to your Cloud SQL instances securely without having to configure SSL certificates or IP whitelisting. It provides secure access to your Cloud SQL instances by authenticating and encrypting the connection using your Google Cloud credentials. The proxy can be run as a background process on your local machine, a VM, or as a sidecar container in Kubernetes.
Have multiple small instances, rather than one large, so you can horizontally scale writes (because Cloud SQL doesn’t do horizontal scaling)
Understand back-ups and export
Backups are lightweight and provide PITR; they are deleted when the instance is deleted and you can’t take a backup of a single DB or table
Exports take longer but are more flexible; you can export a single DB or table; exporting large DBs can impact the Cloud SQL instance => you can use a serverless export, using the --offload flag to reduce impact
- Import/export in multiple small batches
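Tying the Auth Proxy and serverless export practices above together, a hedged sketch (project, region, instance, database and bucket names are placeholders; the proxy binary and flags may differ by version):

```bash
# Run the Cloud SQL Auth Proxy (v2) locally; the app then connects to 127.0.0.1 as if the DB were local
./cloud-sql-proxy --port 5432 my-project:europe-west1:my-instance &

# Serverless export of a single database to Cloud Storage, offloaded from the primary instance
gcloud sql export sql my-instance gs://my-backup-bucket/my-db.sql.gz \
  --database=my-db --offload
```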
Cloud Spanner
Fully managed, mission critical, relational, globally distributed (supports regional, dual region, multi-region set-up) database with very high availability (99.999% compared to 99.95% Cloud SQL)
You choose compute capacity, either in Nodes or Processing Units (smaller units, more granular, increments of 100/1000); you can also choose autoscaling instead of selecting a predefined number of Nodes/PUs
Strong transactional consistency at global scale
Automatically distributes data across nodes (uses partitioning/clustering to decide nodes - see below)
Horizontally scales for both writes and reads (unlike Cloud SQL where only reads can scale, via Read Replica)
Cloud Spanner partitions data into splits, which are contiguous ranges of rows from a table. These splits are distributed across multiple servers (nodes) in the Spanner network. Each split can dynamically move between nodes based on load and size.
For joins that span across multiple nodes, Spanner employs a distributed query execution engine. It breaks down the query into sub-queries that can be processed in parallel on the respective nodes holding the data. The results are then combined to produce the final output.
Pay for nodes and storage: very expensive
Use Cloud Console to export data, there is no gcloud export option; you can also use Cloud Dataflow to automate exports
Google Cloud Spanner uses the gRPC (gRPC Remote Procedure Call) protocol for communication
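For reference, a hedged gcloud sketch of provisioning a small Spanner instance and database (instance, database and config names are placeholders):

```bash
# Create a regional instance with granular capacity (100 processing units)
gcloud spanner instances create my-instance \
  --config=regional-europe-west1 \
  --description="Demo instance" \
  --processing-units=100

# Create a database on that instance
gcloud spanner databases create my-db --instance=my-instance
```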
Section 19 - NoSQL Databases in Google Cloud Platform
Datastore
Single region (99.95% availability) or multi-region (99.999% availability)
Automatically scales and partitions data as it grows (you don’t provision anything)
Recommended for up to a few TBs of data, afterwards use Bigtable
Supports indexes, multi-row transactions and its flavour of SQL (GQL)
- Does not support joins, or aggregate operations
Use namespaces to group Entities
Export only works from CLI (gcloud), NOT console
- Exports contain a metadata file and a folder containing the data
Best practices
Store user profiles and other flexible schema requirements
Use as an index for objects in Cloud Storage (for example, metadata about pictures)
Design keys (no sequential values or timestamps) and indexes (only use indexes when you know they will be used in queries) carefully
Prefer batch operations
Firestore
Datastore++: optimized for multi device access
NoSQL document database built for automatic scaling, high performance, and ease of application development. To use Firestore, create one or more databases. Firestore databases come in two modes: Native mode and Datastore mode (Native mode is basically Firestore)
Offline mode, data sync across multiple devices
Client side libraries: web apps, iOS/Android
Single region (99.95% availability) or multi-region (99.999% availability)
Collection → Documents → Field name/type/value
- Hierarchical: you can add collections inside documents
Modes
Datastore compatible
Native
Bigtable
Bigtable is a fully managed, wide-column NoSQL database that offers low latency and replication for high availability
Petabyte scale, wide column (key value store), NoSQL database
Data is stored in tables, rows, and columns, similar to traditional relational databases but without a fixed schema. Each row can have a different set of columns.
Not normally serverless (but you can select autoscaling instead of provisioning a specific number of nodes): you create an instance and choose either SSD (use for most use cases) or HDD (only for huge volumes of non-latency-sensitive data); you then configure the Cluster, within the Instance
- A cluster handles application requests for an instance. It contains nodes, which determine your cluster's performance and storage limit. Additional clusters can be added at any time.
HBase API compatible
Designed for huge volumes of analytical and operational data: millions of txs per second
Only supports single row transactions (not good for transactional workloads)
Scales horizontally by adding nodes (can be done automatically via autoscaling, without downtime); you can also statically provision a fixed number of nodes
Automatically shards data into multiple tablets across nodes in a cluster
Export is only supported via a Java utility or HBase commands (no console or gcloud support)
The command line tool is cbt not gcloud
Each table is a sorted key/value map
Related columns are grouped into families
Each value in a row is indexed with the row key, there are no other indexes in Bigtable except the row key
Use cases: IoT streams, real time analytics
Use Dataflow to export data to Cloud Storage
Designing Bigtable tables
Know what data you need to store
Know what your frequently used queries look like, and design your row keys based on this
Best practices
- Create multiple clusters, and set-up replication (either cross-zone or cross-region), for H/A and durability + put data closer to customers; can automatically failover
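Since the CLI is cbt rather than gcloud, a minimal sketch of creating and querying a table (project, instance, table, column family and row key are placeholders):

```bash
# Create a table and a column family, write one cell, then read it back
cbt -project my-project -instance my-bt-instance createtable sensor-data
cbt -project my-project -instance my-bt-instance createfamily sensor-data readings
cbt -project my-project -instance my-bt-instance set sensor-data device-123#20250201 readings:temp=21.5
cbt -project my-project -instance my-bt-instance read sensor-data count=1
```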
Section 20 - Creating Private Networks with Cloud VPC in Google Cloud Platform
VPC
Create private networks in the Cloud
Traffic within a VPC is isolated, not visible to other VPCs
You control all traffic going in and coming out of a VPC, via firewall rules
VPC is a global resource (unlike in AWS, where VPCs are regional)
Every project has a default VPC
You can create (and should) your own custom VPCs and not rely on the default one
Auto mode => subnets are automatically created in each region
The total IP range for an auto mode VPC is 10.128.0.0/9, subdivided into pre-defined subnets, one for each region (note that there is no such thing as a VPC CIDR: CIDRs are only defined at the subnet level, and there is no restriction that subnet CIDRs need to be part of a wider VPC CIDR range)
Firewall rules automatically added to auto mode VPCs (including the default VPC):
Implied Allow Egress (this is applicable across any VPCs and can’t be removed, but has lowest priority): Allows all egress (outbound) traffic from instances to any destination.
Implied Deny Ingress (this is applicable across any VPCs and can’t be removed, but has lowest priority): Denies all ingress (inbound) traffic to instances unless explicitly allowed by other firewall rules.
Default-allow-internal: Allows internal traffic between instances on the same VPC network.
Default-allow-ssh: Allows incoming SSH (TCP port 22) connections from any source.
Default-allow-rdp: Allows incoming RDP (TCP port 3389) connections from any source.
Default-allow-icmp: Allows incoming ICMP traffic (used for ping) from any source.
Custom mode => no subnets automatically created (recommended for production), only the two implied firewall rules applied
When creating VPCs, you are asked to provide firewall rules
- Note that when you create a Compute Engine instance and select “Allow HTTP(S) traffic”, GCP adds a firewall rule to the VPC whose target is defined by the http-server tag, and tags your instance with http-server as well
Subnets
Regional
Allow implementing different network needs (separate public and private resources), depending on workload (for example, an LB may need to be public, but a DB should be private and accessible only from apps inside the network)
Another reason for using multiple subnets is to distribute resources across regions
When creating subnets, you can enable private Google access, which allows workloads inside the subnet to access GCP APIs via Google’s private network, without going over the public internet
You can choose to enable Flow Logs (logging of networking traffic)
In GCP, there isn't a direct equivalent to AWS's "public subnet" concept. Instead, what makes a subnet effectively "public" is a combination of factors:
External IP addresses on the VMs/instances in that subnet
Firewall rules that allow inbound traffic from the internet
Routes that enable internet connectivity (GCP provides a default internet gateway)
All GCP subnets have routes to the internet by default (unlike AWS where you need to attach an internet gateway)
Rules for subnet CIDRs
VPCs do NOT have a predefined CIDR, so subnets can have any ranges, as long as they don’t overlap
Subnet CIDRs must be within the private IP address space as defined by RFC 1918:
10.0.0.0/8
172.16.0.0/12
192.168.0.0/16
The subnet CIDR block must be between /9 (for large ranges) and /29 (for small ranges)
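Pulling the subnet options above together, a minimal sketch of creating a custom mode VPC and one subnet with Private Google Access and flow logs enabled (names and ranges are illustrative):

```bash
# Custom mode VPC: no subnets are created automatically
gcloud compute networks create my-vpc --subnet-mode=custom

# One regional subnet, with Private Google Access and VPC Flow Logs enabled
gcloud compute networks subnets create my-subnet \
  --network=my-vpc --region=europe-west1 --range=10.10.0.0/24 \
  --enable-private-ip-google-access --enable-flow-logs
```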
Firewall rules
In GCP, Firewall Rules are the primary mechanism for controlling traffic to and from resources in a VPC. GCP does not have an equivalent to AWS's Network Access Control Lists (NACLs)
- AWS does not have an exact equivalent to GCP's firewall rules. However, AWS provides similar functionality through a combination of Security Groups (SGs) and Network Access Control Lists (NACLs)
Control traffic going in and out of the VPC
Stateful (allowed egress means allowed ingress, and vice versa)
Each rule has a priority, the lower the more priority
Implicit rules: allow all egress and deny all ingress, both at priority 65535; implicit rules CANNOT be deleted, but they can be overridden with higher priority rules
The default (and auto-mode) VPC has additional four 65534 rules (default-allow-internal, default-allow-rdp, default-allow-ssh, default-allow-icmp)
- When creating your custom subnets, VMs won’t be able to talk to one another, unless you also add a rule equivalent to default-allow-internal
You can also add your own rules
Ingress
Select the target (by service account or network tag - this is how “Allow HTTP(S) traffic” works: it adds the http-server tag and a firewall rule that targets this tag)
Select the source (CIDR range, all instances, or instances with a specific tag/SA)
Egress
Select the source (all instances, instances with a specific tag/SA)
Select the destination (CIDR block)
You select the priority
Select an action (Allow/Deny)
Select protocol
Select port
Enable/disable the rule
Best practices
Use network tags combined with firewall rules to control traffic going in/out of a VM
Restrict access to VMs fronted by an LB, by only allowing the LB CIDRs: 130.211.0.0/22 and 35.191.0.0/16
Egress is allowed by default; if you want to restrict it, create a deny egress rule with a higher priority
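For example, a hedged sketch of an ingress rule that only allows the load balancer health check ranges above to reach instances tagged http-server (rule and network names are placeholders):

```bash
gcloud compute firewall-rules create allow-lb-to-http \
  --network=my-vpc --direction=INGRESS --action=ALLOW \
  --rules=tcp:80,tcp:443 \
  --source-ranges=130.211.0.0/22,35.191.0.0/16 \
  --target-tags=http-server \
  --priority=1000
```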
Shared VPC
Allow resources in different projects to talk to one another, as long as they are in the same Organization
One Host project, which contains the shared VPC and multiple Service projects, attached to the Host project, where specific resources can use subnets from the Shared VPC
Achieves separation of duties, by allowing network admins to be responsible for the Host project and Resource user to access the Service projects
VPC Peering
Connects VPC networks (same project, different projects, different Orgs)
Connectivity happens across Google’s private network (secure + free)
Requires non overlapping CIDR ranges for all subnets
VPC Service Controls
Networking feature provided by Google Cloud that helps protect your cloud resources against accidental or targeted data exfiltration. It allows you to create security perimeters around your cloud services to control access and protect data
Security Perimeters: VPC Service Controls allows you to define security perimeters to protect resources and data of services like Cloud Storage and BigQuery. These perimeters act as a firewall for your cloud resources.
Ingress and Egress Rules: You can configure rules to control the access to and from resources within the defined service perimeters. This includes configuring Private Google Access on subnetworks in a VPC, firewall rules for OSI Layer 4 network protection, and IAM best practices.
Adaptive Access Control: VPC Service Controls provides adaptive access control based on IP range or device trust for accessing Google Cloud resources from outside privileged networks.
Section 21 - Operations in Google Cloud Platform
Cloud Monitoring
Tools to monitor your infrastructure
Measures key aspects (Metrics)
Create visualizations around metrics (graphs and dashboards)
Configure Alerts (when metrics signify that services are not healthy); you define the alert policies (condition, notifications, documentation)
- Notification channels: Google Chat, PagerDuty, Slack, Webhooks, E-mail, SMS, Pub/Sub
Workspace
Cloud Monitoring can be used to monitor one or more GCP Projects AND AWS Accounts
You create a Workspace, to organize monitoring information
Created in a specific Project, called the Host Project
You can monitor VMs for
CPU utilization
Disk traffic
Network traffic
For more metrics, install the Cloud Monitoring agent (collectd-based daemon), to send data to Cloud Monitoring
Uptime checks
Allow you to monitor the availability and responsiveness of your applications or resources. They work by sending HTTP, HTTPS, or TCP requests to your application at regular intervals from multiple locations around the world. If the requests fail or the responses are not as expected, an alert can be triggered to notify you of the issue.
Target: HTTP URL, App Engine, Instance, LBs
Cloud Logging
Real time log management and analysis tool
Exabyte scale, fully managed service
Can ingest log data from any source
When logs are ingested via GCP Cloud Logging, they are stored in a resource called a Log Bucket. Log Buckets are part of the Cloud Logging service and are used to store, organize, and manage log data. Each Google Cloud project comes with a set of default Log Buckets, but you can also create custom Log Buckets to meet specific needs.
Logs Explorer: search, sort and analyze using flexible queries
Logs Analytics: rich visualization (requires upgrading the Log bucket, at no extra cost)
Logs Metrics: capture metrics from logs (using queries/matching strings)
Logs Router: route different log entries to different destinations
Collection
Most managed GCP services send logs to Cloud Logging (GKE, App Engine, Cloud Run)
From GCE VMs: install Logging Agent (based on fluentd)
From on-prem: use BindPlane tool from Blue Medora, or use the Cloud Logging API
Audit and security logs
Access Transparency Log: captures actions performed by the GCP team on your content (only for Gold support orgs, or higher)
Cloud Audit Logs: who did what, when and where (service name, operation, resource, principal)
Admin activity logs: API calls or other actions that modify the configuration of resources
Data access logs: reading configuration of resources (disabled by default)
System event audit logs: GCP admin actions
Policy denied audit logs: when a principal is denied access
Controlling and Routing
Logs reach the Log Router
It checks against various rules (what to ingest, what to discard, where to send the logs)
Two types of buckets
_Required: holds Admin activity, System Events and Access Transparency Logs (400 days retention); you are not billed
_Default: all other logs (retained for 30 days by default, can configure 1-3650); you are billed, but you can disable ingestion to this bucket
For long term retention, export to Cloud Storage, BigQuery dataset, Pub/Sub topic (Base64 encoded entries)
Create sinks, in the Log Router, to export to the above destinations
Same log entry can be sent to multiple Sinks
Sinks can be disabled
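A minimal sketch of a sink that routes error logs to a Cloud Storage bucket for long term retention (sink and bucket names are placeholders; the sink's writer identity still needs write access on the destination):

```bash
gcloud logging sinks create errors-to-gcs \
  storage.googleapis.com/my-log-archive-bucket \
  --log-filter='severity>=ERROR'
```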
Cloud Trace
Distributed tracing system
You leverage tracing libraries (C#, Go, Java, Node.js, PHP, Python, Ruby), that call the Cloud Trace API
Supported for GCE, GKE, AE, etc.
Cloud Debugger was shut down in 2023
Cloud Profiler
Identify performance bottlenecks
Low-overhead, collects CPU and memory usage
Two major components
Agent - collects info
Interface - visualization
Error Reporting
Centralized Error Management console
Realtime exception monitoring
Aggregates and displays errors reported from cloud services (using stack traces)
Use Firebase Crash Reporting for errors from Android and iOS client apps
Errors can be reported by sending them to Cloud Logging, or by calling the Error Reporting API
Section 22 - Exploring IAM and Project Organization in Google Cloud Platform
Resource hierarchy: Organization → Folder → Project → Resources
IAM policies can be applied at any level, and resources inherit policies from all parents (effective policy is the union of all applicable policies); if a permission is given at a higher level, you cannot restrict it at a lower level
Recommendations
Create separate projects for different environments
Create separate projects for different apps
Create separate folders for each department
Create shared folders for shared resources
One project per application per env
- Consider two apps A1, A2, each needing 2 envs => create 4 projects: A1-DEV, A1-PROD, A2-DEV, A2-PROD
Billing Accounts
Mandatory for creating resources
One account can be associated with one or more Projects
Each Project MUST have a Billing Account
Create Billing Accounts that reflect your organization structure
Types
Self Serve: billed directly to card/bank account
Invoiced: for large enterprises
Managing Billing
Set up a Cloud Billing Budget and configure Alerts
You can set up Billing export to either GCS (file export) or BigQuery
IAM best practices
Least Privilege to perform a role (this is why Basic roles are not recommended)
Use Service Accounts with minimal privileges; different Service Accounts for different apps/purposes
Separation of duties: involve at least 2 people in sensitive tasks (for example, separate deployer and traffic migrator roles, in App Engine)
Constant monitoring: Review Cloud Audit Logs; archive Logs in GCS for long term retention
Use Groups when possible
User identity management
For Enterprises, use Google Workspace (formerly G Suite) to manage users and their access
You can also use your own IdP, if you don’t use Google Workspaces, through federation
- SSO: users are redirected to the external IdP, and a SAML assertion is sent to Google Sign-in
Cloud Identity: unified identity, access, app and endpoint management platform (manages identities)
Google Cloud Identity is focused on identity and access management, offering robust security and management features. It is suitable for organizations that need to manage users and devices without the additional productivity tools.
Google Workspace provides a comprehensive suite of productivity tools along with basic identity management features. It is ideal for organizations that need collaboration and productivity tools integrated with user management.
Members and identities
Google Account - represents a person (an e-mail)
Service Account - application account (not person, but still represented with an e-mail)
Group - collection of Google and Service accounts
Google Workspace domain
Cloud Identity domain
Organization Policies
Centralized constraints on all resources created in an Organization (disable creation of Service Accounts, disable creation of resources in a specific region)
Requires the Organization Policy Administrator role
IAM focuses on WHO, Org Policies focus on WHAT
Org policies override IAM
GCP offers some predefined policies, for example “Enforce uniform bucket-level access”
Predefined BigQuery roles
Admin → can do everything
Data Owner → datasets.*, models.*, routines.*, tables.* (does NOT have access to jobs)
Data Editor → tables, models, routines, datasets (no access to jobs)
Data Viewer
Job User → can run queries
To see data, you need roles like User or Data Viewer; the Job User role alone is not enough
Section 23 - Compute Engine Virtual Machines
SSH-ing into Linux VMs
Metadata managed: manually create and manage SSH keys
OS Login: allows you to manage SSH access to your Compute Engine instances using IAM (Identity and Access Management) roles. This eliminates the need to create and manage individual SSH keys for each user
Linux account is linked to Google Identity
Recommended when multiple users SSH into VMs across different Projects
Set the enable-oslogin metadata key to true
You can import Linux accounts from on-prem AD/LDAP
Users need to have the appropriate roles
When using the GCP UI’s SSH button, a temporary ephemeral SSH key is used to log in; this works with both Metadata managed, and OS Login
gcloud compute ssh will create a username and a persistent SSH key pair, which will be re-used for future interactions
You can also use your own custom SSH keys, for both OS Login and Metadata managed; they are just different ways to upload the public keys
For Windows, passwords are used
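A hedged sketch of enabling OS Login on a single VM and granting a user SSH access via IAM (project, VM, zone and user are placeholders; the metadata key can also be set project-wide):

```bash
# Enable OS Login on one instance via metadata
gcloud compute instances add-metadata my-vm --zone=europe-west1-b \
  --metadata enable-oslogin=TRUE

# Grant the OS Login role (use roles/compute.osAdminLogin if sudo is needed)
gcloud projects add-iam-policy-binding my-project \
  --member=user:jane@example.com --role=roles/compute.osLogin
```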
Shutdown scripts
Runs on best effort basis (won’t work for hard resets for example)
Similar to startup script, except that it runs before stopping/terminating/restarting
Work for both preemptible and non-preemptible
Runs as root/system account
Stored as metadata (shutdown-script key, similar to startup-script)
You can also store your startup and shutdown scripts as files in Cloud Storage
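A minimal sketch of attaching a shutdown script, either from a local file or from Cloud Storage (VM, zone, file and bucket names are placeholders):

```bash
# Attach a local shutdown script as instance metadata
gcloud compute instances create my-vm --zone=europe-west1-b \
  --metadata-from-file shutdown-script=./cleanup.sh

# Or reference a script stored in Cloud Storage
gcloud compute instances create my-vm2 --zone=europe-west1-b \
  --metadata shutdown-script-url=gs://my-scripts-bucket/cleanup.sh
```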
Troubleshooting VM startup
Are there Quota errors? (can check this in Notifications)
Is boot disk full?
Check serial port output, from the console, gcloud, Compute Engine API, or Cloud Logging - if you send the output there (every VM has 4 virtual serial ports, where outputs are written to)
Does the disk have a valid file system? Attach the disk to another VM and investigate
Moving instances across zones/regions
Moving between zones is simple: gcloud compute instances move
Cannot use if instances are part of a MIG
Cannot use if instances have local SSDs
Cannot use if instances are in terminated status
Cannot use across regions
Manual approach
Create snapshots of all attached disks
Create copies of PDs in destination region/zone, using the previously created snapshot
Launch a new instance, attach the PD
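A hedged sketch of the manual cross-region move described above (disk, snapshot, instance and zone names are placeholders):

```bash
# 1. Snapshot the source disk
gcloud compute disks snapshot my-disk --zone=europe-west1-b --snapshot-names=my-disk-snap

# 2. Create a new disk from the snapshot in the destination zone
gcloud compute disks create my-disk-copy --zone=us-central1-a --source-snapshot=my-disk-snap

# 3. Launch a new instance with the copied disk as its boot disk
gcloud compute instances create my-vm-copy --zone=us-central1-a \
  --disk=name=my-disk-copy,boot=yes
```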
Section 24 - Asynchronous Communication in Google Cloud with Cloud Pub Sub
Advantages of async communication with message brokers
Decoupling
Availability
Scalability
Durability
Pub/Sub
Reliable, scalable, fully-managed, async messaging service
Auto-scales to billions of messages per day
Low-cost, pay for use (number of messages)
Use cases: event ingestion, delivery for streaming analytics pipelines
Supports both push and pull message delivery (unlike AWS, where these are separate services: SNS and SQS)
Create a topic, where publishers send messages and receivers process them through subscriptions
Supports dead-letter topics (DLTs)
Can configure retention period, expiration of subscription, retry policies, etc.
Pub Sub Lite is cheaper, but only replicates data in one Zone
Subscriptions
Pull - make HTTPS request to the pub sub API, to ask for messages
Push - Message automatically sent out to all subscribers, to the webhook endpoint registered by the receiver
- In push delivery, Pub/Sub initiates requests to your subscriber application to deliver messages. Messages are delivered to a publicly addressable server or a webhook, such as an HTTPS POST request.
Subscriptions
Discrete pool of messages from a topic
All messages on a topic are available to all subscriptions (they are not like partitions in Kafka, but more like topics)
However, if multiple clients pull from the same subscription, then messages are split between clients (in this respect, a subscription is similar to a consumer group)
Once a client consumes the message, it sends an ACK back to Pub Sub; the ACK is per subscription
After the ACK is received, the message is removed from the subscription
You can take snapshots of subscriptions
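A minimal end-to-end sketch with the gcloud CLI (topic and subscription names are placeholders):

```bash
gcloud pubsub topics create orders
gcloud pubsub subscriptions create orders-worker --topic=orders --ack-deadline=30

# Publish a message, then pull (and ack) it from the subscription
gcloud pubsub topics publish orders --message='{"orderId": 42}'
gcloud pubsub subscriptions pull orders-worker --auto-ack --limit=1
```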
Cloud Dataflow
Unified stream and batch processing
Based on Apache Beam (supports many languages)
Serverless and autoscaling
Fast and cost effective
You define jobs, that can be real time or batch
There are a lot of templates available to configure
Can do export/import to a variety of services
Section 25 - Implementing Hybrid Cloud with Google Cloud
Cloud VPN
Connect on-prem to GCP, over the public internet (NOT private)
IPSec VPN tunnel, with encrypted data traveling through it
IKE - internet key exchange protocol is used to encrypt the data
One or more gateways in GCP and one or more in your own data center
Cloud Router enables dynamic routing
Two types
H/A VPN (regional, NOT global)
99.99% availability
Redundant IP addresses: 2 interfaces and 2 IP addresses
Only dynamic routing (BGP) is supported
Classic VPN
99.9% availability
Single IP address
Supports static routing, as well as dynamic
Easy to establish, no extra contracts required
Use cases: lower cost, encryption requirement, experimenting with on-prem to GCP connectivity
Cloud Interconnect
High-speed, highly available, private connection (not over public internet)
Reduces egress cost, because traffic does not go over the internet
Two options
Dedicated
Very high bandwidth, for large transfers
10 Gbps or 100 Gbps
You can also have multiple circuits: 8 x 10 Gbps, or 2 x 100 Gbps, per connection
Can take a few weeks to establish
Partner
Bandwidth needs are lower
50 Mbps to 10 Gbps
Doesn’t require the same set-up time
Best practices for hybrid connectivity
Resources on the networks should use different CIDRs
Have fallback connections, if the primary fails (dedicated interconnect → VPN)
Section 26 - Exploring Data Warehouses in Google Cloud - BigQuery
Exabyte scale data warehouse solution
Relational database, supports SQL, consistency, etc
Data organized into datasets, which contain tables
Offers traditional (storage + compute) + modern (realtime + serverless) approaches
Importing and exporting
Load from a variety of sources, including streaming data, batch, federation, BigQuery Transfer Service
You can’t just add a Table and add records from the UI, like you can for Firestore for example
Batch is free, streaming is expensive; can batch import from Cloud Storage, or local files
You can preprocess data with Dataflow or Dataproc (managed Hadoop), before loading into BigQuery
Stream from Pub Sub, or use streaming inserts; can still use Dataflow/Dataproc to preprocess
Federated queries allow you to query external sources
BigQuery Data Transfer Service: load from external sources, such as RedShift, S3
Realtime inserts can contain duplicates; add an insertId to deduplicate (for up to 1 minute)
Max 1 GB/s per Project
If you are streaming millions of records, prefer Bigtable
Load from variety of formats: CSV, JSON, Avro, Parquet, ORC, etc.
Export to GCS (long term storage, archiving) and Data Studio (for visualization) - CSV, JSON, Avro
Data can be auto-expired, after a certain amount of time (configurable table expiration)
You can also query external data sources, without needing to store it in BigQuery (by using permanent or temporary external tables)
Queries can be expensive when running on large datasets; you should first estimate them (either via the UI, or bq query --dry_run); you pay for scanned data, not returned data
Partitioning helps divide tables into multiple segments, allowing more cost efficient and faster scans
You can partition on any column of type DATE, DATETIME, TIMESTAMP or INTEGER
Partitions share the same schema as the main table
Avoid creating partitions that are too small (<1 GB); in this case, prefer clustering
Clustering colocates related data: rows with the same cluster key value are stored close together
Expiring data
Hierarchy: Dataset → Table → Partition
Expiration can be set at all of these levels
BigQuery best practices
Estimate queries before running them
Use clustering and partitioning
Avoid streaming inserts, prefer batch (free)
Expire data automatically
Consider the long term storage option, similar to GCS Nearline (tables where data has not been edited for 90 days)
Not well optimized for narrow range queries (prefer Bigtable for those); use BigQuery for complex analytical queries
Use Audit Logs
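A hedged sketch of estimating a query and creating a partitioned, clustered table with the bq CLI (dataset, table and column names are placeholders):

```bash
# Dry run: reports how many bytes the query would scan, without running it
bq query --nouse_legacy_sql --dry_run \
  'SELECT user_id, COUNT(*) FROM my_dataset.events GROUP BY user_id'

# Create a table partitioned by day and clustered by user_id, with 90-day partition expiration
bq query --nouse_legacy_sql \
  'CREATE TABLE my_dataset.events_v2 (
     user_id STRING,
     event_ts TIMESTAMP,
     payload STRING
   )
   PARTITION BY DATE(event_ts)
   CLUSTER BY user_id
   OPTIONS (partition_expiration_days = 90)'
```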
Dataproc
Managed Spark and Hadoop service
Supports Spark, PySpark, Pig, Hive, Hadoop
Used for complex batch processing
You create clusters, and choose the characteristics of VMs (can use preemptible VMs)
You choose either GCE or GKE
Options
Standard (1 master, N nodes)
Single node (1 master, 0 nodes)
H/A (3 masters, N nodes)
You submit jobs, such as through jar files
You can also submit workflows
- A Workflow is an operation that runs a Directed Acyclic Graph (DAG) of jobs on a cluster
BigQuery is an alternative, if you need to use SQL queries on terabytes of data
Section 27 - Exploring Data Lifecycle and Data Architectures in Google Cloud
Data lifecycle
Ingest
Stream
- Pub Sub: you can use the Google Cloud Console, the gcloud command-line tool, the Pub/Sub API, or the Cloud Client Libraries to publish messages
Batch
- Storage Transfer Service, BigQuery Transfer Service, Transfer Appliance, gsutil
Database migration
Database Migration Service to simplify migrations to Cloud SQL
For non Cloud SQL, you can batch transfer DB backups to Cloud Storage and load data from GCS to DB using Dataflow
Store
Durably, and cost efficiently, in a convenient format
Cloud Storage (object storage, unstructured data)
Cloud SQL (managed MySQL, PostgreSQL and MS SQL dbs, pre-defined schema, strong transactions, regional, can only horizontally scale reads)
Cloud Spanner (global, horizontally scalable for both writes and reads, 99.999% availability, strong consistency)
Cloud Firestore (document NoSQL DB, scalable, supports txs)
Cloud Bigtable (wide-column NoSQL DB, huge data, large scale analytical time series workloads, only supports single-row txs)
BigQuery (complex analytics, managed data warehouse)
You can also create your own custom DB, from Cloud Marketplace (MongoDB, Cassandra, etc.); you are responsible for everything non functional related to it
Process and analyze
Convert data to information (normalization or aggregations): raw data > actionable information (clean, transform)
Dataprep (powered by Trifacta): clean and prepare data, No Ops, clean data from external sources, prepare data for ML; visual approach for non programmers
Data Loss Prevention (DLP): scan, discover, classify and report on data in GCS, BigQuery and Datastore (mask, tokenize and transform sensitive elements)
Dataflow: more flexible ETL pipelines (fully managed, NoOps, support both batch and streaming)
Dataproc: managed service for Spark and Hadoop (batch); needs a cluster; Machine Learning or existing Spark workloads that you want to move to the cloud
Explore and visualize
Get and share insights
BigQuery
ML - Pre built models: Vision API, Speech-to-Text, Natural Language API, Video Intelligence API, etc.
ML - Custom models: use AI Platform (managed service, based on TensorFlow); use Dataflow to preprocess
Datalab (was shut down and migrated to Vertex AI Workbench): web based tool to explore, analyze and visualize data; based on Jupyter notebooks (use Python, SQL queries, etc.); support for popular toolkits like pandas, numpy, scikit-learn
Data Catalog: data discovery and metadata management; unified view of all datasets; you can tag sensitive data using DLP (Data Catalog is deprecated as of February 3, 2025 and will be discontinued by January 2026. Transition to Dataplex Catalog by January 2026 for continued catalog usage)
Data Catalog allows you to discover, manage, and understand data assets across Google Cloud Platform. Data Catalog API natively indexes Cloud BigQuery, Cloud Storage, and Cloud Pub/Sub data assets.
The Data Catalog API can be used to:
Search for data assets across different projects and GCP resources
Create tags and tag templates to add structured business metadata unique to your organization
Automate the tagging process of data assets as part of a data ingestion pipeline
Big Data and Analytics in GCP
Pub/Sub: foundation for stream analytics and event driven systems
BigQuery: serverless data warehouse to analyze petabytes of data
Data Studio (now Looker Studio): managed visual analytics service
Dataflow: data pipelines (Stream + Batch), take data from a source and dump it to a sink
- Google Cloud Dataflow is ideal for unified stream and batch data processing at scale, creating data pipelines that can read, transform, and write data to various destinations. It is particularly suited for real-time analytics, machine learning, and ETL operations, leveraging Apache Beam's model for both batch and streaming tasks, making it perfect for complex data integration tasks that need robust scalability.
Dataproc: managed Spark and Hadoop clusters
- Google Cloud Dataproc is best for managing Hadoop and Spark clusters, making it suitable for big data processing, ETL, and machine learning relying on these ecosystems. It simplifies cluster creation, management, and scaling, supports tools like Hive and Pig, and integrates with Google Cloud services, ideal for migrating on-premises Hadoop and Spark workloads to the cloud with features like autoscaling.
Dataprep: clean and prepare data (structured and unstructured)
Datalab (deprecated and replaced with Vertex AI workbench)
Cloud Composer: managed workflow orchestration; based on Apache Airflow; create pipelines across clouds and on-prem data centres
- Ideal for managing complex workflows that involve multiple cloud services and require coordination and scheduling of tasks. Example: orchestrating a series of jobs across BigQuery, Dataflow, and other services.
Big Data flows
Batch ingest into BigQuery: Cloud Storage → Dataproc / Dataprep / Dataflow → BigQuery
Realtime querying based on streaming data: Pub Sub → Dataflow → BigQuery / Bigtable
IoT
- IoT Core (registration, AuthN/Z of devices) → Pub Sub → Dataflow → Datastore / BigQuery / Bigtable
Data lakes
Storage: GCS
Ingestion: Pub Sub + Dataflow, or Transfer Service / Transfer Appliance / gsutil
Processing/analytics: BigQuery (federated queries)
Section 28 - Caching
Cache data which changes infrequently
Returned data may be stale - you can configure a TTL
Use cases: infrequently changing DB data, user sessions, static content
Memorystore
Fully managed (provisioning, replication, failover and patching)
99.9% availability SLA
Redis
Use for low latency access with persistence and high availability
Tiers: Basic (no H/A), Standard (failover replica in a separate zone)
Supports up to 300 GB storage space
Memcached
Use for reference data, database query caching, session store
Configure region/zone
Configure 1-20 nodes, memory per node (1-256GB) and CPUs per node
Data is ephemeral, if a node crashes, the data is lost
Valkey
- Memorystore for Valkey is a fully-managed, horizontally scalable Valkey service for workloads that demand the lowest possible latencies
Can be accessed from most compute services: Compute Engine, App Engine, GKE, Cloud Functions
APIs/SDKs are available
App Engine has a legacy in-memory data cache
Shared memcache: free
Dedicated memcache: paid, predictable performance
Cloud CDN
Use Google’s global edge network to serve content with low latency
Integrates with External HTTP(S) Load Balancing
Supported back-ends
Cloud Storage bucket
Instance Groups
App Engine
Cloud Run
Cloud Functions
Endpoints outside of Google Cloud (custom origins)
External HTTP(S) LB uses proxies called Google Front Ends
Requests arrive at a GFE
If URL maps to a back-end with CDN configured, then GFE checks cache and either returns or forwards to back-end
When you configure the back-ends for an external HTTPS Load Balancer, there is a checkbox for “Enable Cloud CDN”, and you can configure the type of content to cache (static content, use origin settings, force cache all content)
You can choose which components of the HTTP requests to be used for caching (protocol, host, query string)
Supports TTL
Best practices
Cache static content
Be careful with expiring time-sensitive (dynamic) content - use shorter cache periods
Use custom cache keys to improve cache hit ratio - customize cache key
Use versioned URLs to update content (enables you to have high TTLs, because a new version causes a cache miss anyway)
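A minimal sketch of enabling Cloud CDN on an existing global backend service (the backend service name is a placeholder):

```bash
gcloud compute backend-services update web-backend --global \
  --enable-cdn --cache-mode=CACHE_ALL_STATIC
```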
Section 29 - Agile, DevOps, SRE and SDLC Evolution
DevOps
Getting better at the three elements of great software teams: Business, Development, Operations
Communication - get teams together
Feedback - find problems early
Automation - automate testing, infrastructure provisioning, deployment, monitoring
DevOps practices
Automated CI/CD pipelines
Tests - unit, integration, system, sanity/regression
Static code analysis, including static security checks
Runtime checks - vulnerability scanners
DevOps in Google Cloud
Cloud Source Repositories - fully-featured, private Git repositories
Container Registry - store Docker images
Cloud Build - build deployable artifacts (jars or Docker images) from source code and configuration; works off of a cloudbuild.yaml file, a Dockerfile, or Buildpacks
Spinnaker - supports deployment to Compute Engine, GKE, App Engine
Infrastructure-as-Code
Treat infrastructure the same way as application code
Track changes over time
Bring repeatability
Infrastructure provisioning: Terraform, Google Cloud Deployment Manager
Configuration management (installing software on provisioned resources): Chef, Puppet, Ansible, SaltStack
Cloud Deployment Manager
Version control for your environments
Equivalent to CloudFormation in AWS
Free to use, but you pay for the resources
Automate deployment and modification of GCP resources, in multiple environments
Avoid configuration drift
Tip: always modify resources via Deployment Manager, not manually!
Configuration defined in a YAML file; understands dependencies between resources
Automatic rollbacks on errors (easy to retry)
Terminology
Configuration file: YAML file with resource definitions for a deployment
Templates: reusable resource definitions that can be used in multiple configuration files
- Python (preferred) or Jinja2 (for simple scripts)
Deployment: collection of resources that are deployed and managed together
Manifests: read-only object containing the original deployment configuration; automatically generated and contains fully-expanded resource list
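A hedged sketch of a minimal configuration file and the corresponding deployment commands (resource names, zone and image are illustrative):

```bash
cat > config.yaml <<'EOF'
resources:
- name: demo-vm
  type: compute.v1.instance
  properties:
    zone: europe-west1-b
    machineType: zones/europe-west1-b/machineTypes/e2-micro
    disks:
    - deviceName: boot
      boot: true
      autoDelete: true
      initializeParams:
        sourceImage: projects/debian-cloud/global/images/family/debian-12
    networkInterfaces:
    - network: global/networks/default
EOF

# Create the deployment, and later update it from the same (modified) config
gcloud deployment-manager deployments create my-deployment --config=config.yaml
gcloud deployment-manager deployments update my-deployment --config=config.yaml
```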
Cloud Marketplace (Cloud Launcher)
Central repo of easily deployable apps and datasets
Similar to App Store/Play Store
You search and install a complete stack (SAP HANA, LAMP, WordPress, Cassandra, Jenkins, etc.)
SRE
SRE teams focus on all aspects of an application: availability, latency, performance, efficiency, change management, monitoring, capacity planning
Key principles
Manage by SLOs
Minimize toil (manual work) - automate as much as possible
Move fast by reducing the Cost of Failure (frequent small releases + automation)
Share ownership with developers (responsible for monitoring, not just developing)
Key metrics
SLI (Service Level Indicator): quantitative measure of an aspect of the service
Availability, latency, throughput, durability, error rate
Typically aggregated (over x minutes)
SLO = SLI + target
99.99% availability
99th percentile response time of 1 second
SLA = SLO + consequences (contract)
- Have stricter internal SLOs than external SLAs
Error budgets = 100% minus SLO
How well is a team meeting their reliability objectives
Used to manage development velocity
Best practices
Handling excess loads
Load shedding - API limits, drop some data in a streaming application
Reduce QoS - instead of talking to a recommendations API, return a hardcoded set of products
Avoid cascading failures
- Plan to avoid thrashing - circuit breaker, reduced QoS
Penetration testing
- No need to inform Google, but you are only allowed to test your projects
Load testing
Resilience testing
Chaos Testing
Add stress to some layers
Include the network in your testing (VPN, Cloud Interconnect)
Section 30 - Release Management in Google Cloud Platform
Goals - depending on these, there are different ways to perform releases
Zero downtime?
Only one version live?
Minimize costs?
Test V2 with some traffic before rolling out?
Best practices
Small incremental changes
Automation
Deployment approaches
Recreate
Replace V1 with V2 in-place (on the same machines)
Terminate V1, rollout V2 immediately
App will be down during the release
Rollback needs redeployment and more downtime
Cost effective and fast
You avoid the need for backward compatibility (with DBs and other apps), because only one version is live at one point in time
Canary
Initial small V2 rollout to a subset of instances (some live traffic will be served by V1, some by V2), followed by rollout to the rest
Fast
Zero downtime
No extra infrastructure
Minimize impact to users
Needs backward compatibility
Supports A/B testing (check if users like a feature)
Rolling
V2 rolled out to a percentage of instances, then gradually rolled out to the rest
Slow
Zero downtime
Requires automation and additional setup
No extra infra required
Minimize impact to users
Needs backward compatibility
An alternative is rolling with additional batch (a little bit of extra infrastructure, for the benefit of 0 reduction in capacity to handle traffic)
Blue/Green
Create a parallel environment, with V2
Switch all traffic from V1 to V2, at once
From the POV of the user, they only see one version at a point in time
Zero downtime
Easy rollback (you point traffic to the Blue env)
Needs additional infra
Needs backward compatibility
Allows for shadow testing: mirror traffic to both V1 and V2, to see how each responds; allows you to capture prod traffic and replay it; complicated
MIG deployment approaches
Rolling: gcloud compute instance-groups managed rolling-action start-update my-mig --version=template=v2-template (the --max-surge and --max-unavailable flags can be used)
Canary: gcloud compute instance-groups managed rolling-action start-update my-mig --version=template=v1-template --canary-version=template=v2-template,target-size=10%
Blue Green: manual approach; create a new MIG and adjust LB back-ends as needed
App Engine - releasing new versions
gcloud app deploy - deploy and shift all traffic at once
gcloud app deploy --no-promote - deploy without shifting traffic
gcloud app services set-traffic s1 --splits=v2=1 - shift all traffic to v2
gcloud app services set-traffic s1 --splits=v2=1 --migrate - gradual traffic migration
gcloud app services set-traffic s1 --splits=v2=.5,v1=.5 - A/B testing (50/50 split)
GKE - releasing new versions
Recreate (set strategy type on Deployment to Recreate)
Rolling update; default (RollingUpdate, configure maxSurge and maxUnavailable)
Blue Green: create new deployment and use Ingress or Service to control traffic
Canary: requires a service mesh, like Istio
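A minimal kubectl sketch of a rolling update and rollback (deployment, container and image names are placeholders; a Deployment's strategy defaults to RollingUpdate):

```bash
# Roll out v2 by updating the container image; Pods are replaced gradually
kubectl set image deployment/my-app app=europe-docker.pkg.dev/my-project/my-repo/app:v2
kubectl rollout status deployment/my-app

# Roll back to the previous revision if v2 misbehaves
kubectl rollout undo deployment/my-app
```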
Section 31 - Compliance and Regulations for Your Google Cloud Solutions
GCP is compliant with ISO/IEC 27001, 27017, 27018 and 27701, PCI DSS, SOC 1, SOC 2, COPPA, HIPAA, GDPR
HIPAA compliance
Shared responsibility: GCP supports HIPAA compliance, customer must evaluate compliance with HIPAA
Execute a Google Cloud Business Associate Agreement (BAA), from your account manager
Do not use GCP products that are not covered by BAA
Follow IAM best practices
Enable object versioning
Export audit logs to GCS and BigQuery
Disable request caching for PHI in Cloud CDN
PCI DSS compliance
Create a new Google Cloud account for payments processing
Restrict access to this env by using the principle of least privilege
Control inbound and outbound traffic (firewall rules to only allow HTTPS requests from customers and TPPs)
GCE and GKE are recommended, because App Engine and Cloud Functions do not support egress firewall rules
Create an HTTPS load balancer with signed SSL certificate
Use hardened VM images
Implement Forseti Security, by using the following tools: Inventory, Scanner, Enforcer, Explain
Enable VPC flow logs, access transparency logs, firewall rules logging and configure monitoring alerts
Use DLP to sanitize data
Section 32 - Planning Cloud Migrations to Google Cloud Platform
Phase 1: assess the workloads to be migrated
Take inventory and catalog apps
Experiment and design proofs of concept (move a sample app/DB to cloud)
Calculate TCO (total cost of ownership)
Choose which workloads to migrate first
Phase 2: plan the foundation
Design resource org hierarchy (folders, project), how to configure IAM (users, groups, AD integration), VPCs, networks, subnets and how to connect from on-prem (VPN, Interconnect)
Plan for security, monitoring and governance
Plan your migration team
Phase 3: gradually deploy the workloads
Migrate data: take into account cost, time, offline/online transfers
Deploy apps (choose the right compute options)
Prefer automation (Ansible, Chef, Puppet, Jenkins, Spinnaker, Terraform, Deployment Manager)
Phase 4: optimize environment
Logging, monitoring, alerting
Use managed services
Enable auto-scaling
Approaches
Rehosting (lift and shift)
Replatforming (few adjustments to suit the cloud, for example containerizing)
Repurchasing: move to a new Cloud product, like a managed DB
Refactoring: make use of serverless architectures
Retiring: end of service
Retaining: keep as is, on-prem
MS SQL Server to GCP migration
Create a Cloud SQL instance
Move backup of your DB to Cloud Storage
Import data into Cloud SQL for SQL Server
Deploying containers to GCP
App Engine flexible: highly scalable, but doesn’t scale down to 0; can’t customize GCE VMs
Cloud Run and Cloud Run for Anthos: highly scalable and can scale down to 0; cannot customize the env
GKE and Anthos clusters (K8s in different clouds and on-prem): lot of management overhead, but lots of flexibility
Compute Engine: use COS image with Docker installed; not recommended, because you need to manage everything
Understand business requirements such as
Reduce costs - CapEx vs OpEx; TCO (licensing, compute, storage, networking, ingress, egress, personnel)
Use managed services
Use auto-scaling
Use preemptible VMs
Increase pace of innovation
- Evaluate and adopt emerging processes like DevOps and SRE, microservices
Reduce mean time to recovery
Improve compliance
More visibility into apps and infra
More intelligence from available data
KPIs for business requirements
Key Performance Indicators: measure of how well a business is doing with respect to an objective
Examples: number of new customers onboarded, % of VMs running in the cloud, operational cost per customer
Define technical requirements
Functional: must use containers, must use hardened OS, must be auto-scaling, private network, huge volumes of data etc.
Non-functional
Availability
Geographical distribution (Global → Multi-regional → Regional → Zonal)
Compute Engine: Live Migration, MIGs with Auto Scaling and Health Checks; distribute instances using LBs
GKE: multi-master, regional clusters with Pod and cluster autoscaling
Managed services: App Engine, Cloud Functions, GCS, Filestore, Datastore, BigQuery
Persistent Disks: live resizing, use regional PDs
Bigtable: place clusters in different zones or regions
Datastore: use multi-region locations
Cloud SQL: HA configuration (regional)
Network Tier: prefer Premium Network Tier
Hybrid: Dedicated IC > Partner IC > VPN
Scalability
VMs in MIG, configured with Instance Template + Load Balancing
Pod and Cluster Autoscaling in GKE
Be cautious with resources that cannot scale fast (Cloud SQL)
Persistent disks can be scaled both horizontally and vertically (increase size, attach more disks)
All serverless offerings are autoscaling (Bigtable, Cloud Spanner, Cloud SQL, Dataproc are NOT serverless, you need to provision compute power)
Durability
Security
Confidentiality: follow IAM best practices, encrypt data in transit and at rest
Integrity: protect data from unauthorized change; RBAC, separation of duties, hash verifications and digital signatures
Availability: system/data is available for the users when they need them; firewalls (permit only the right kind of traffic and prevent attacks), redundancy, auto-failover, DDoS protection (Cloud Armor)
Digital signatures - Cloud KMS
Verification of the integrity of signed data + non-repudiation (protection in case the signer claims the signature is not authentic)
Workflow:
Sender performs private key operation on the data, to create a digital signature
Recipient uses the public key to verify the digital signature
Cloud KMS can be used to create an asymmetric key pair that supports digital signing
gcloud kms asymmetric-sign
Use cases: validating code builds
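A hedged sketch of signing a file's digest with an asymmetric signing key in Cloud KMS (key ring, key, version and file names are placeholders):

```bash
gcloud kms asymmetric-sign \
  --location=global --keyring=my-keyring --key=my-signing-key --version=1 \
  --digest-algorithm=sha256 \
  --input-file=./artifact.bin \
  --signature-file=./artifact.sig
```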
Cloud Armor
Google Cloud Armor is a Web Application Firewall (WAF). It provides protection for web applications by filtering and monitoring HTTP traffic between a web application and the Internet. Cloud Armor is part of Google Cloud Platform and offers features similar to other WAFs, including protection against DDoS attacks, SQL injection, and cross-site scripting (XSS)
Protect apps from DDoS and OWASP top 10 attacks
Protects apps deployed in GCP, hybrid, or in a multi-cloud architecture
Preconfigured security policies exist, such as OWASP Top 10 risks
Use cases
Enable access for users at specific IP addresses, using allowlists
Block access from IP addresses with denylists
Secret Manager
Store API keys, passwords, etc.
Supports multiple versions of secrets, secret rotation using Cloud Functions, auditing with Cloud Audit Logs
Encrypted by default
Secrets can be stored in multiple regions
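A minimal sketch of creating, versioning and accessing a secret with gcloud (secret name and value are placeholders):

```bash
gcloud secrets create db-password --replication-policy=automatic
echo -n "s3cr3t-value" | gcloud secrets versions add db-password --data-file=-
gcloud secrets versions access latest --secret=db-password
```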
Data Management
How does the data come in?
What rate will we receive it?
What kind of data?
How much data?
For how long?
Who will have access?
How will we use it?
Section 34 - Exploring More Google Cloud Platform (GCP) Services
Cloud Scheduler
Batch, big data jobs, cloud infra operations, etc.
Cron format
Supported targets: HTTP, Pub/Sub, App Engine, Workflows
Provides automated retries
Use case: you have a set of GCE instances and want to execute a task across all of them: use Cloud Scheduler to put a message on Pub/Sub and have the instances listen to it
Needs an App Engine App in the Project (it’s built on an older App Engine Cron Service)
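For example, a hedged sketch of a job that publishes to a Pub/Sub topic every 5 minutes (job name, topic and location are placeholders):

```bash
gcloud scheduler jobs create pubsub periodic-task-trigger \
  --location=europe-west1 \
  --schedule="*/5 * * * *" \
  --topic=task-topic \
  --message-body='{"action": "run"}'
```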
Cloud Emulators
Develop GCP apps locally
Supports emulation of: Bigtable, Datastore, Firestore, Pub Sub, Spanner
Cloud DNS
Global Domain Name System
Private and public managed DNS zones (container for records)
gcloud dns managed-zones create ZONE_NAME, with the --description, --dns-name, --visibility and --networks flags
Three steps to add records
Start transaction for Zone
Make changes
End transaction for Zone
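A minimal sketch of that transaction flow for adding an A record (zone name, record name and IP are placeholders):

```bash
gcloud dns record-sets transaction start --zone=my-zone
gcloud dns record-sets transaction add "203.0.113.10" \
  --name=www.example.com. --ttl=300 --type=A --zone=my-zone
gcloud dns record-sets transaction execute --zone=my-zone
```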
Pricing Calculator
Estimate the cost of a GCP solution
Estimates for 40+ services: GCE, GKE, Cloud Run, AE
Just estimates
Anthos
GCP Anthos, introduced by Google Cloud, is a hybrid and multi-cloud platform designed to build and manage modern hybrid applications across different environments. It is powered by Kubernetes and other open-source technologies, allowing you to run applications unmodified on existing on-premises hardware or in public cloud environments. Anthos provides a container environment that works across multiple clouds and bridges the gap between legacy software and cloud hardware.
Consistent cluster management, development and ops experience
Centralized config management (Git repo): logically group and normalize clusters as environs
- Use Namespaces, labels and annotations to decide which clusters to apply changes on
Provides a Service Mesh (based on Istio)
Sidecar to implement common features (logging, AuthN/Z, tracing, metrics, etc.)
A/B testing, canary rollouts
Machine Learning
Prebuilt APIs: no in-house ML expertise; easy to use; example: Vision API, Video API, Natural Language API, Speech-to-text API, Text-to-speech API, Translation API
Cloud AutoML: build custom ML models with developers having limited ML expertise
AI Platform: based on Tensorflow Enterprise; serverless and autoscaling; training + serving phases are supported; Explainable AI - interpret models with confidence (why a decision was made)
Data management: Cloud Storage and BigQuery; BigQuery ML - build models directly from data in BigQuery
Automation and instrumentation: AI Platform Pipelines & Cloud Composer to orchestrate and automate data/ML pipelines
Apigee API Management
Design, secure, publish, analyze, monitor, monetize and scale API anywhere
APIs can be on-prem, GCP, or hybrid
Provides AI-powered API monitoring
Enable caching with Cloud CDN
Create Developer Portals
Use cases: abstraction layer on top of legacy services, expose ML models as APIs
Identity Platform
Customer Identity and Access Management
IAM is for controlling access to Google Cloud resources; works with Members, Roles, Policies and Service Accounts
Identity Platform is CIAM (customer identity and access management)
- Authentication and Authorization for end users
Key features
AuthN/Z for web and mobile applications
Multiple methods: SAML, OIDC, email/password, social
User sign-up and sign-in
MFA
An upgrade from Firebase Authentication Legacy
Integrates well with IAP
Scenarios
App on GCE VM needs access to GCS → IAM Service Account
Enterprise user needs access to upload objects to GCS → IAM
Manage end users for apps → Identity Platform
Login via Facebook → Identity Platform
Sign-up and sign-in workflows → Identity Platform
Eventarc
Simplifies event driven architectures
Adheres to CloudEvents (cloudevents.io)
Event provider: who can trigger events?
Direct: Pub Sub, Cloud Storage, Cloud Functions, IoT, Memorystore
Indirect: Cloud Audit Logs entries
Event destination
- Cloud Functions (2nd gen), Cloud Run, GKE services
Uses Pub Sub topics in the background
Observability and OpenTelemetry
Observability: measure the internal state of a system by examining its outputs
Three pillars of observability: logs, metrics, traces (these are the outputs)
Earlier, there were different standards for each of these
And different approaches across different programming languages
OpenTelemetry aims to be a unified standard: collection of technologies to collect and export telemetry (part of CNCF)
GCP supports OpenTelemetry
Steps to use
Add OTel libraries for your language
Instrument your code to export telemetry
Once your app is deployed to GCP, you’ll be able to see the data in the cloud specific tools
Service Directory
A single place to publish, discover and connect services
Workloads can be running in GCP, on-prem or other clouds
Managed service: H/A and scalable
Register/resolve services using DNS, HTTP and gRPC
Service Directory client libraries are available for multiple languages
Audit logging and request/response logs
Section 35 - Understanding Google Cloud Architecture Framework
Best practices and implementation recommendations to help your GCP deployment
Focus on designing robust, secure and scalable systems
Four principles
Operational excellence
Ensure operational readiness and performance using CloudOps: Ensure that cloud solutions meet operational and performance requirements by defining service level objectives (SLOs) and by performing comprehensive monitoring, performance testing, and capacity planning.
Define SMART SLOs and SLAs
Implement comprehensive observability
Implement performance and load testing
Plan and manage capacity
Continuously monitor and optimize
Manage incidents and problems: Minimize the impact of cloud incidents and prevent recurrence through comprehensive observability, clear incident response procedures, thorough retrospectives, and preventive measures.
Establish clear incident response procedures
Centralize incident management
Conduct thorough post-incident reviews
Maintain a knowledge base
Automate incident response
Manage and optimize cloud resources: Optimize and manage cloud resources through strategies like right-sizing, autoscaling, and by using effective cost monitoring tools.
Right-size resources
Use autoscaling (MIGs, GKE cluster/Pod autoscaler, Cloud Run is serverless and autoscales already)
Leverage cost-optimization strategies (committed use, sustained use, preemptible)
Track resource usage and cost (tags and labels)
Establish cost allocation and budgeting
Automate and manage change: Automate processes, streamline change management, and alleviate the burden of manual labor.
Adopt IaC
Implement version control
Build CI/CD pipelines
Use configuration management tools
Automate testing
Continuously improve and innovate: Focus on ongoing enhancements and the introduction of new solutions to stay competitive.
Foster a culture of learning
Conduct regular retrospectives
Stay up-to-date with cloud technologies
Actively seek and incorporate feedback
Measure and track progress
Security, privacy and compliance
Security principles
Build a layered security approach
Design secure, decoupled systems
Automate deployment of sensitive tasks
Automate security monitoring
Meet the compliance requirements for your regions
Comply with data residency and sovereignty requirements
Shift security left
Manage your assets
Use cloud asset management tools
Automate asset management (IaC)
Monitor for deviations from your compliance policies
Integrate with your existing asset management monitoring systems
Use data analysis to enrich your monitoring
Manage identity and access
Use a single IdP
Protect the super admin account
Plan your use of service accounts
Update identity processes for the cloud
Set up SSO and MFA
Implement least privilege and separation of duties
Audit access
Automate your policy controls
Set restrictions on resources
Implement compute and container security
Use hardened and curated VM images
Use Confidential Computing for processing sensitive data: Confidential Computing builds on Shielded VM. It protects data in use by performing computation in a hardware-based trusted execution environment
Protect VMs and Containers
Disable external IP addresses unless they are necessary
Monitor your compute instance and GKE usage
Keep images and clusters up to date
Control access to images and clusters
Isolate containers in a sandbox: Use GKE Sandbox to deploy multi-tenant applications that need an extra layer of security and isolation from their host kernel. For example, use GKE Sandbox when you are executing unknown or untrusted code. GKE Sandbox is a container isolation solution that provides a second layer of defense between containerized workloads on GKE.
Secure your network
Deploy zero trust networks
Secure connections to on-prem or multi-cloud environments
Disable default networks
Secure your perimeter
Inspect network traffic
Use a web application firewall
Automate infrastructure provisioning
Monitor your network
Implement data security
Automatically classify your data
Manage data governance using metadata
Protect data according to its lifecycle phase and classification
Encrypt your data
Control cloud admin’s access to your data
Configure where your data is stored and where users can access it from
Manage secrets using Secret Manager (see the sketch after this list)
Monitor your data
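A minimal sketch of reading a secret at runtime with Secret Manager, assuming the google-cloud-secret-manager package; project and secret IDs are placeholders:

```python
# Hedged sketch: fetch the latest version of a secret from Secret Manager.
# Assumes `pip install google-cloud-secret-manager`; names are placeholders.
from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()

name = "projects/my-project/secrets/db-password/versions/latest"
response = client.access_secret_version(request={"name": name})
db_password = response.payload.data.decode("utf-8")
```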
Deploy applications securely
Automate secure releases
Ensure that application deployments follow an approved process
Scan for known vulnerabilities before deployment
Monitor application code for known vulnerabilities
Control movement of data across perimeters
Encrypt your container images
Manage compliance obligations
Assess your compliance needs
Deploy Assured Workloads: Assured Workloads is the Google Cloud tool that builds on the controls within Google Cloud to help you meet your compliance obligations.
Review blueprints for templates and best practices that apply to your compliance regime
Monitor your compliance
Automate your compliance
Implement data residency and sovereignty requirements
Manage your data sovereignty
Manage your operational sovereignty
Manage software sovereignty
Control data residency
Implement privacy requirements
Classify your confidential data
Lock down access to sensitive data
Monitor for phishing attacks
Extend zero trust security to your hybrid workforce
Implement logging and detective controls
Monitor network performance
Monitor and prevent data exfiltration
Centralize monitoring
Monitor for threats
Reliability
Set a clear expectation that the entire organization is responsible for reliability, including development, product management, operations, platform engineering, and site reliability engineering (SRE). Even the business-focused groups, like marketing and sales, can influence reliability.
Principles
Define reliability based on user-experience goals
Measure user experience
Analyze user journeys (Cloud Trace)
Set realistic targets for reliability
Accept some failure and prioritize components
Balance reliability and cost (remember to consider your RTOs and RPOs)
Build high availability through redundancy
Identify failure domains and replicate services
Detect and address issues promptly (health checks)
Take advantage of horizontal scalability
Use managed services (MIGs, Cloud Run)
Promote modular design
Implement a stateless design
Detect potential failure by using observability
Metrics, logs and traces
Gain comprehensive insights (Cloud Logging, Cloud Monitoring)
Perform proactive troubleshooting (Cloud Storage access logs, VPC Flow Logs)
Optimize resource utilization
Prioritize alerts
Design for graceful degradation
Implement throttling (Apigee)
Drop excess requests early
Handle partial errors and retries
Test overload scenarios
Monitor traffic spikes
Perform testing for recovery from failures
Define testing objectives and scope
Prepare the environment for testing
Simulate failure scenarios
Monitor system behaviour
Verify recovery against RTO and RPO
Document and analyze results
Iterate and improve
Perform testing for recovery from data loss
Verify backup consistency and test restoration processes
Schedule regular and frequent backups
Define and monitor RPO
Monitor backup health
Plan for scenarios beyond backup
Conduct thorough postmortems
Conduct blameless postmortems
Make postmortem report readable by all intended audiences
Avoid complex or over-engineered solutions
Share the postmortem as widely as possible
Cost optimization
Align spending with business value
Prioritize managed services and serverless products
Balance cost efficiency with business agility (adopt DORA metrics, follow SRE practices)
Enable self-service optimization
Adopt and implement FinOps
Promote a value-driven and TCO-aware mindset
Foster a culture of cost awareness
Provide org-wide cost visibility
Understand how cloud resources are billed
Understand resource-based cost optimization options (always-allocated CPUs for Cloud Run, BigQuery slot commitments)
Understand discount-based cost optimization options (CUDs, sustained use discounts, Spot VMs)
Incorporate cost estimates into architecture blueprints
Use a consistent and standard set of labels for all resources
Share cost reports with team members
Optimize resource usage
Choose environment-specific resources (prod, dev/test, others)
Choose workload-specific resources (mission-critical, non-critical, event-driven, experimental)
Select regions based on cost requirements
Use built-in cost optimization options
Optimize resource sharing
Develop and maintain reference architectures
Enforce cost discipline by using organization policies
Estimate realistic budgets and set financial boundaries
Optimize continuously
Focus on business-relevant metrics (latency, error rates, throughput, customer satisfaction, DORA, SRE)
Use observability for resource optimization
Balance troubleshooting needs with cost
Tailor data collection to roles and set role-specific retention policies
Consider regulatory and compliance requirements
Implement smart alerting (customer impacting first)
Performance optimization
Core principles
Define requirements
Design and deploy
Monitor and analyze
Optimize
Plan resource allocation
Configure and manage quotas
Educate and promote awareness
Monitor performance metrics
Take advantage of elasticity
Plan for peak load periods
Use predictive scaling
Implement serverless architectures
Use Autopilot mode for GKE
Promote modular design
Design for loose coupling
Design for concurrency and parallelism
Balance modularity for flexible resource allocation
Use stateless models
Choose complementary technologies
Continuously monitor and improve performance
Define clear performance goals and metrics
Monitor performance
Incentivize continuous improvement
Section 36 - Case Studies - Solutions - Professional Cloud Architect
EHR Healthcare
https://services.google.com/fh/files/blogs/master_case_study_ehr_healthcare.pdf
Anthos - allows you to run K8s clusters anywhere; centralized management; service mesh; CI/CD - watch Git repo for changes and apply changes to all clusters
Cloud Logging & Monitoring - alerting policies and notifications
Cloud Logging sinks to Cloud Storage/BigQuery - log retention
BigQuery - make predictions and generate reports; can get data into it via Pub/Sub (streaming, see the sketch after this list) or GCS (batch)
Dedicated Interconnect - secure and high performance connection between on-prem and GCP
Cloud CDN - reduced latency
ADFS and Cloud Directory Sync - SSO
MySQL / SQL Server → Cloud SQL
Redis → Memorystore
MongoDB → https://cloud.google.com/mongodb?hl=en or Marketplace/Datastore/Firestore
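For the streaming path into BigQuery, a minimal publisher sketch, assuming the google-cloud-pubsub package and a topic that a Dataflow job (or BigQuery subscription) drains into a table; project, topic and payload are placeholders:

```python
# Hedged sketch: publish events to Pub/Sub; a Dataflow pipeline or BigQuery
# subscription on this topic would stream them into BigQuery.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "ehr-events")  # placeholders

event = {"record_id": "123", "event_type": "admission"}
future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
print(future.result())  # message ID once the publish is acknowledged
```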
Helicopter Racing League
https://services.google.com/fh/files/blogs/master_case_study_helicopter_racing_league.pdf
BigQuery + AI platform → ML/AI + analytics
Cloud Storage + Cloud CDN → move realtime and recorded content closer to the users
Apigee → Expose predictive models to partners
Pub/Sub + Dataflow + BigQuery → realtime transformations, messaging
Transcoder API → increase transcoding performance
Video Intelligence API
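A minimal sketch of the Video Intelligence API on recorded footage (label detection), assuming the google-cloud-videointelligence package; the GCS URI is a placeholder:

```python
# Hedged sketch: label detection on a recorded video with the
# Video Intelligence API. Assumes `pip install google-cloud-videointelligence`.
from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()

operation = client.annotate_video(
    request={
        "features": [videointelligence.Feature.LABEL_DETECTION],
        "input_uri": "gs://my-bucket/races/race-001.mp4",  # placeholder
    }
)
result = operation.result(timeout=300)  # long-running operation

for label in result.annotation_results[0].segment_label_annotations:
    print(label.entity.description)
```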
Mountkirk Games
https://services.google.com/fh/files/blogs/master_case_study_mountkirk_games.pdf
Agones + Kubernetes → game servers (https://agones.dev/site/)
MIGs + VMs + Global LB + GPUs
Cloud Storage → store activity logs
Cloud Spanner or Memorystore → realtime leaderboard (see the Redis sketch after this list)
Cloud Build + Spinnaker/Jenkins → CI/CD
Spanner + Memorystore → match history DB with regional caching
Bigtable → time series events
Firestore → player DB
BigQuery/AI Platform → analytics and intelligence
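A minimal sketch of the Memorystore-backed realtime leaderboard using a Redis sorted set, assuming the standard redis-py client; the host IP is a placeholder for the Memorystore instance:

```python
# Hedged sketch: realtime leaderboard on Memorystore (Redis) via a sorted set.
# Assumes `pip install redis`; host/port are placeholders.
import redis

r = redis.Redis(host="10.0.0.3", port=6379)

# Record or update scores; the sorted set stays ordered by score.
r.zadd("leaderboard", {"player:alice": 4200, "player:bob": 3900})

# Top 10 players, highest score first.
for rank, (player, score) in enumerate(
    r.zrevrange("leaderboard", 0, 9, withscores=True), start=1
):
    print(rank, player.decode(), int(score))
```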
TerramEarth
https://services.google.com/fh/files/blogs/master_case_study_terramearth.pdf
IoT Core, Pub/Sub, Dataflow, BigQuery → streaming telemetry ingestion and analytics
Cloud Storage, Dataflow, BigQuery → batch upload and processing of field data
Apigee
Cloud Build, Spinnaker, Jenkins
Google Workspace - collaboration
Secret Manager - cloud native solutions for keys and secret management
GKE - container based workloads
Dedicated Interconnect - networking, high data volume