GCP - Professional Cloud Architect - Study Guide & Notes

Table of contents
- Background
- Overview
- Section 2 - Google Cloud Regions and Zones
- Section 3 - Google Compute Engine
- Section 4 - Instance Groups and Load Balancing
- Section 5 - Google Cloud Compute Engine and Load Balancing for Architects
- Section 6 - Getting Started with Gcloud
- Section 7 - Getting Started with Google Cloud Platform Managed Services
- Section 8 - Getting Started with Google Cloud App Engine
- Section 9 - Google Cloud Kubernetes Engine for Cloud Architects
- Section 10 - Getting Started with Google Cloud Functions
- Section 11 - Getting Started with Google Cloud Run
- Section 12 - Exploring Google Cloud Functions 2nd generation
- Section 13 - Getting Started with Cloud KMS
- Section 14 - Exploring Block and File Storage
- Section 15 - Exploring Object Storage (Cloud Storage)
- Section 16 - Authentication in Google Cloud with Cloud IAM
- Section 17 - Exploring Databases in Google Cloud Platform
- Section 18 - Relational Databases for Transactional Applications in Google Cloud Platform
- Section 19 - NoSQL Databases in Google Cloud Platform
- Section 20 - Creating Private Networks with Cloud VPC in Google Cloud Platform
- Section 21 - Operations in Google Cloud Platform
- Section 22 - Exploring IAM and Project Organization in Google Cloud Platform
- Section 23 - Compute Engine Virtual Machines
- Section 24 - Asynchronous Communication in Google Cloud with Cloud Pub Sub
- Section 25 - Implementing Hybrid Cloud with Google Cloud
- Section 26 - Exploring Data Warehouses in Google Cloud - BigQuery
- Section 27 - Exploring Data Lifecycle and Data Architectures in Google Cloud
- Section 28 - Caching
- Section 29 - Agile, DevOps, SRE and SDLC Evolution
- Section 30 - Release Management in Google Cloud Platform
- Section 31 - Compliance and Regulations for Your Google Cloud Solutions
- Section 32 - Planning Cloud Migrations to Google Cloud Platform
- Section 34 - Exploring More Google Cloud Platform (GCP) Services
- Section 35 - Understanding Google Cloud Architecture Framework
- Section 36 - Case Studies - Solutions - Professional Cloud Architect
Background
I recently (Feb 2025) recertified as a Professional Cloud Architect, within Google Cloud Platform. I’ve been working professionally with both AWS and GCP for a few years now, and I enjoy keeping up to date with their expansive product offering, especially since my day-to-day usage at work really only revolves around a small set of services, such as GKE, Cloud Storage, BigQuery and the cross-cutting products such as IAM and KMS.
This post contains the notes I used to study for the exam.
Overview
Notes for the GCP Professional Cloud Architect: Google Cloud Certification course.
Section 2 - Google Cloud Regions and Zones
20+ regions - specific geographical location where resources are hosted
Using multiple regions => high availability, low latency, global footprint, govt regulations
Zones => help achieve H/A in the same region
Each region has at least three zones
Each zone has one or more clusters (physical infrastructure)
Zones within a region are connected with low latency links
It’s important to know which resources are zonal, which are regional and which are global; for example
Zonal: VMs, PDs, local SSDs, GPUs
Regional: regional MIGs, regional PDs, Cloud SQL, Filestore, Internal LBs
Global: Cloud CDN, Cloud Spanner, buckets (depending on config), global LBs, BigQuery, Pub/Sub, IAM
Section 3 - Google Compute Engine
Virtual machines = virtual servers in GCP
Google Compute Engine (GCE) - service to provision and manage virtual machines
Support for load balancing (either through discrete VMs that are created as back-end services with network endpoint groups, or instance groups, managed or unmanaged)
Supports auto-scaling for multiple VM instances, through MIGs (managed instance groups)
You can attach storage to VM instances (persistent disks - PDs, or local SSDs)
You can manage network connectivity (through VPC firewall rules) and configuration of VM instances
Machine families and machine types define the type of hardware that VMs run on
Families
General Purpose (E2, N2, N2D, N1): best price/performance ratio; web/application servers, small/medium databases, dev environments
Memory Optimized (M2, M1): large databases, in-memory analytics
Compute Optimized (C2): gaming applications and other compute-intensive workloads
Storage Optimized (Z3)
Types - define exactly how much CPU and memory is provisioned
For each family, there are a number of available types, for example: e2-standard-16 (16 vCPUs), m2-megamem, m2-ultramem, etc.
Memory and networking capabilities increase with CPU (this is why you only see the CPU value in the type name)
Image - defines which OS and software you want installed
Public and custom (yours) images
This will be defined as part of the boot disk, for which you can also select the type (HDD, SSD, etc.) and provisioned IOPS
You can also use snapshots and existing disks to define your boot disk
External IPs are publicly addressable (access can still be restricted through firewall rules)
Internal IPs are internal and private, can only be used inside the VPC network (not across networks, unless VPC peering is enabled); an instance’s internal IP address must come from the range of the subnet it resides in
All VM instances are assigned at least one internal IP address; External IPs are optional and can be opted into; stopping a VM instance leads to losing its IPs, unless using static IPs
Static IP addresses (option to Reserve Static IP Address) allow you to assign a constant IP address (internal or external) to a VM instance
Regional or global (you should create the IP address in the region where you have the VMs)
IPv4 or IPv6
Standard or premium network tier
Billed even when not in use (at a higher rate than when attached to a running VM)
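For reference, a minimal sketch of reserving static addresses with gcloud (names and regions are placeholders):
# Regional external IP (for a VM or a regional forwarding rule)
gcloud compute addresses create my-regional-ip --region=europe-west1
# Global external IP (for a global HTTPS LB)
gcloud compute addresses create my-global-ip --global
# Inspect the reserved address
gcloud compute addresses describe my-regional-ip --region=europe-west1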
Startup script
Allows bootstrapping of VMs (install OS patches, software, etc.)
Configured when creating a virtual machine (Management → Startup Script)
Takes time, and needs to be executed every time you start the VM => better alternative is to create a custom image, that has everything you need already installed
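A hedged example of passing a startup script at creation time (instance name, zone and the script contents are illustrative):
gcloud compute instances create my-vm \
  --zone=europe-west1-b \
  --machine-type=e2-standard-2 \
  --metadata=startup-script='#! /bin/bash
    apt-get update
    apt-get install -y nginx'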
Instance template
Describe a VM instance once and then use that template to create groups of identical instances
Define the machine type, image, labels, startup script, etc. in a reusable template
Used to create individual instances or instance groups (useful as back-ends to LBs)
Immutable, can’t be changed once created, but can be copied and a new version created
Can specify an image family instead of a specific image (in which case, the latest version is used)
Can be used across regions/zones
- Global instance templates can be used in any region. To lower the impact of outages outside your region and gain data residency within your region, use a regional instance template
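A minimal sketch of creating a (global) instance template; all names and values are placeholders:
gcloud compute instance-templates create my-template \
  --machine-type=e2-standard-2 \
  --image-family=debian-12 \
  --image-project=debian-cloud \
  --tags=http-server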
Custom image
Installing OS patches and software at the launch of an instance (through startup scripts) increases boot time
The alternative is to create a custom image with OS patches and software pre-installed
An image is a replica of a disk that contains the applications and operating system needed to start a VM. You can create custom images or use public images pre-configured with Linux or Windows OSes
An image is a bootable data disk that you can use to create virtual machine (VM) instances. It contains the operating system (OS), configured settings, and installed applications
Can be created from
an existing VM instance
a persistent boot disk
a snapshot (point in time extract of the data on a PD - stored in GCS)
another image
a file in Cloud Storage
Can be shared across Projects
You can deprecate old images and specify replacement image
You can create a corporate image that incorporates all your security standards (a hardened image)
You should prefer custom images to startup scripts
Images can be stored regionally or multi-regionally
It’s not recommended to create images from disks attached to a running instance; stop the instance first
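As a sketch, a hardened custom image could be created from a (stopped) instance's boot disk, published under an image family, and older versions deprecated; every name here is a placeholder:
gcloud compute images create my-hardened-image-v2 \
  --source-disk=my-vm-boot-disk \
  --source-disk-zone=europe-west1-b \
  --family=my-hardened-family \
  --storage-location=eu
# Mark the previous version as deprecated and point to the replacement
gcloud compute images deprecate my-hardened-image-v1 \
  --state=DEPRECATED \
  --replacement=my-hardened-image-v2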
Section 4 - Instance Groups and Load Balancing
Instance Group - a group of VM instances, managed as a single entity
Can be zonal or regional (each instance is still zonal, but there are multiple instances across multiple zones), but NOT global
Can be managed or unmanaged
Managed (MIGs)
Identical VMs created using an Instance Template (same image, type, config) - you MUST use an instance template
Features: auto scaling, auto healing, managed releases
Can be used as a back-end for an LB
Unmanaged
Can have different configurations (image, hardware)
Does not offer auto scaling/healing and other services
Not recommended, unless you really need different kinds of VMs
You can use an instance template to create individual VMs, if you want, or you can add different instances to the IG manually
Can be used as a back-end for an LB
MIGs
Can maintain a certain number of instances
Can detect application failures using health checks (self healing)
You need to make sure that the health checkers can reach the instances, by adding the source IP addresses to firewall rules
Health checks determine if applications on your VMs respond to requests. They’re used for load balancing and with autohealing in managed instance groups
Increase and decrease instances based on load (auto scaling) - you configure min, max instances and the metrics to track; can also define a cooldown period to prevent frequent scale-up/scale-down
Can add Load Balancer in front of them, to distribute load
Can create instances in multiple zones (regional MIGs)
Can release new application versions without downtime (rolling updates, or canary deployments)
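A rough sketch of creating a regional MIG from a template and enabling autoscaling and autohealing (the template, health check and thresholds are assumptions):
gcloud compute instance-groups managed create my-mig \
  --template=my-template \
  --size=3 \
  --region=europe-west1
gcloud compute instance-groups managed set-autoscaling my-mig \
  --region=europe-west1 \
  --min-num-replicas=3 \
  --max-num-replicas=10 \
  --target-cpu-utilization=0.7 \
  --cool-down-period=120
# Autohealing, using an existing health check
gcloud compute instance-groups managed update my-mig \
  --region=europe-west1 \
  --health-check=my-health-check \
  --initial-delay=300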
Updating a MIG
Rolling update - gradual update of instances to a new instance template
Optionally you can specify a template for canary testing
Specify when the update should happen (proactive = immediately, or opportunistic)
Maximum surge = maximum number of instances added at a point in time
Maximum unavailable = how many instances can be offline during the update
Rolling restart/replace
- No change in template, but you want to restart/replace
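A hedged example of a rolling update and a canary, assuming the MIG and templates sketched above:
# Proactive rolling update to a new template
gcloud compute instance-groups managed rolling-action start-update my-mig \
  --region=europe-west1 \
  --version=template=my-template-v2 \
  --type=proactive \
  --max-surge=3 \
  --max-unavailable=0
# Canary: move only a portion of instances to the new template
gcloud compute instance-groups managed rolling-action start-update my-mig \
  --region=europe-west1 \
  --version=template=my-template-v1 \
  --canary-version=template=my-template-v2,target-size=10%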
Cloud Load Balancing
Distributes traffic across instances of an application in a single region or multiple regions
Enables H/A, auto scaling, resiliency
Fully distributed, software defined, managed service
Important features
Health check - routes to healthy instances (this can be a different health check to the one configured on MIGs, but they are configured in the same place in the UI)
Auto scaling (through back-end services, such as MIGs)
Supports global load balancing with single anycast IP (for global LBs)
Supports internal load balancing, as well as external
HTTP(S) load balancer configs
Backend services - can be instance groups, buckets or network endpoint groups
You can have multiple backend services configured for a single HTTP(S) LB, and have routing rules to each
You can enable Cloud CDN to cache static content
You configure the health check (mandatory)
Each back-end service points to one or more back-ends; for example, a back-end service for Service A may have two back-end MIGs, each one in a different region
The load balancer uses a set of policies and algorithms to distribute traffic among the backends. It does not simply use round-robin distribution; instead, it considers factors like backend health, capacity, and proximity to the client.
For instance, if backend service 1 has two backends (one MIG in europe-west2 and one MIG in us-east1), the load balancer will typically route the request to the backend that is geographically closest to the client, provided that the backend is healthy and not overloaded. The exact selection criteria can include factors like latency, load, and health checks.
Host and path rules (routing rules)
Supports advanced rules, like URL redirect and URL rewrite, based on host, path, and headers
URL Maps - route requests to backend services or backend buckets (for example, map /service-a to Backend Service A and /service-b to Backend Service B )
Frontend configuration
Protocol and port
For SSL, a certificate must also be selected
SSL/TLS termination/offloading
Client to load balancer: HTTPS
Load balancer to VM instance: HTTP (inside GCP network, so safer) - reduce the load on the instances, they don’t need to worry about TLS/HTTPS
Choosing an LB: https://cloud.google.com/static/load-balancing/images/lb-product-tree.svg → note that ALBs are always proxy, whereas with NLBs you can decide on proxy vs pass-through
Features
Load balancer | Scope | External/Internal | Traffic | Proxy/Passthrough |
External HTTP(S) | Global | External | HTTP(S) | Proxy only |
Internal HTTP(S) | Regional | Internal | HTTP(S) | Proxy only |
SSL Proxy | Global | External | TCP with SSL offload | Proxy |
TCP Proxy | Global | External | TCP without SSL offload | Proxy |
External Network | Regional | External | TCP/UDP | Passthrough |
Internal TCP/UDP | Regional | Internal | TCP/UDP | Passthrough |
LB across MIGs in multiple regions
You need to use the Premium networking tier, for global routing to work (in standard tier, the forwarding rule and external IP are regional)
Use HTTPS LB for this use case
You create multiple Regional MIGs in different regions, in the same Project
HTTPS LB can distribute load to the MIGs behind a single external IP address - requests are redirected to the nearest region
If all backends within a region are unhealthy, traffic is distributed to healthy instances in other regions
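To tie these pieces together, a condensed sketch of wiring a global HTTPS LB in front of two regional MIGs; every resource name, and the pre-existing SSL certificate and reserved global IP, are assumptions:
gcloud compute health-checks create http my-hc --port=80 --request-path=/health
gcloud compute backend-services create my-backend-service \
  --global --protocol=HTTP --port-name=http --health-checks=my-hc
gcloud compute backend-services add-backend my-backend-service --global \
  --instance-group=my-mig-eu --instance-group-region=europe-west2
gcloud compute backend-services add-backend my-backend-service --global \
  --instance-group=my-mig-us --instance-group-region=us-east1
gcloud compute url-maps create my-url-map --default-service=my-backend-service
gcloud compute target-https-proxies create my-https-proxy \
  --url-map=my-url-map --ssl-certificates=my-cert
gcloud compute forwarding-rules create my-https-rule \
  --global --target-https-proxy=my-https-proxy --ports=443 --address=my-global-ip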
Section 5 - Google Cloud Compute Engine and Load Balancing for Architects
For the architect certification, you not only need to know the services and how to use them, but also how to build resilient, H/A, scalable, secure, performant solutions
Availability
Percentage of time an application provides the expected operations
99.99% availability is an example (4.5 minutes of downtime per month) - most online apps aim for this value
Achieving H/A in Compute Engine
Use multiple regional instance groups for each microservice
Distribute load using a Global HTTPS Load Balancing (Cloud Load Balancing is H/A by default, managed service)
Configure Health Checks for Instance Groups and Load Balancing
Enable Live Migrations for VM instances
Live Migration
Your running instance is migrated to another host in the same zone
Supported even for instances with Local SSD
Not supported for GPUs and preemptible instances
Availability Policy (a config when you create VM instances)
On host maintenance - select either migration of instances, or restart
Automatic restart - GCP can auto restart instances if they were terminated for non-user initiated reasons
Scalability
Can we handle growth in users, traffic, or data size, without drop in performance?
Ability to adapt to changes in demand
Options
Deploy to a bigger instance, with bigger CPU/memory (vertical); vertical scaling has limits and can be expensive
- Machine type can be changed, to achieve vertical scaling: e2-standard-2 → e2-standard-4, for example; changing the type requires stopping the machine => drop in availability (see the command sketch after this list)
Increase the number of app instances and set up an LB (horizontal); also increases availability, whereas vertical scaling does not; but needs additional infrastructure, such as load balancers
Distribute VM instances (either in a single zone, or multiple zones, in a region, or even in multiple zones across regions)
Auto scale, using MIGs
Distribute load using LBs
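The vertical-scaling sketch referenced above (instance name, zone and machine types are placeholders; note the stop/start, hence the availability hit):
gcloud compute instances stop my-vm --zone=europe-west1-b
gcloud compute instances set-machine-type my-vm \
  --zone=europe-west1-b --machine-type=e2-standard-4
gcloud compute instances start my-vm --zone=europe-west1-b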
Security
Use firewall rules to restrict traffic (ingress and egress)
Use internal IP addresses as much as possible
Use sole-tenant nodes when you have regulatory needs
Create a hardened custom image to launch VMs
Performance
Choose the right machine family for your workload
Use GPUs (AI/ML) and TPUs (massive matrix operations) to increase performance
Prefer creating custom image, to installing custom software at boot time
Resiliency
Ability of system to provide acceptable behaviour even when one or more parts fail
Run VMs in MIGs behind global LB
Use Cloud Monitoring for monitoring
Install logging agent to send logs to Cloud Logging
Enable Live Migration and Automatic Restart
Up-to-date image copied to multiple regions, to support disaster recovery
GPUs
Used to accelerate math intensive and graphics-intensive workloads, for AI/ML, etc.
You can attach GPUs to VMs, but they are expensive, and you need to use images with GPU libraries
Not supported on all machine types
Cost
Sustained use discounts
Automatic (no action required on your part) discount for running VMs for a significant portion of the billing month
Discounts increase with usage
Applicable to instances created by GKE and Compute Engine
Not applicable on certain machine types, or VMs created by App Engine Flexible and Dataflow
Committed use discounts
Use these for workloads with predictable resource needs
You commit for 1 or 3 years
You commit for a region and a specific machine type
Up to 70% discount based on machine type and GPUs - higher than sustained use discounts
Applicable to instances created by GKE and Compute Engine
Not applicable on VMs created by App Engine Flexible and Dataflow
Preemptible VMs
Short-lived (last at most 24 hours), cheaper compute instances - similar to AWS spot instances (the newer version of Preemptible is actually called Spot)
Can be stopped by GCP at any time, before the 24 hours expire, and you get a 30-second warning
Use these if your apps are fault tolerant, you are cost sensitive, workload is not immediate (batch programs that can be delayed)
Not always available, no SLA, cannot be migrated to regular VMs, cannot be automatically restarted, free tier credits not available
Latest version of preemptible VMs are called Spot instances - these do not have a max runtime (24 hours), the other properties are the same
Billing for Compute Engine
You are billed by the second, after a minimum of 1 minute
You are NOT billed for stopped instances, but are still billed for attached storage
Always create a Budget and attach an Alert to get alerted of over spend
Cost efficiency
Use auto scaling - have optimal number and type of VM instances running
Understand sustained use discounts and committed use discounts
Use preemptible VMs
Section 6 - Getting Started with Gcloud
The gcloud command has a standard format: `gcloud GROUP SUBGROUP ACTION`, for example `gcloud compute instances list` or `gcloud compute instances create`
`gcloud config list` will show you configuration such as the default region to be used
You can use `--filter` to filter the results, for example `--filter zone:europe-west1`
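A few illustrative configuration commands (the project, region and zone values are placeholders):
gcloud config set project my-project
gcloud config set compute/region europe-west1
gcloud config set compute/zone europe-west1-b
gcloud config list
gcloud compute instances list --filter="zone:europe-west1-b"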
Cloud Shell
Backed by a VM instance that is automatically provisioned; 5 GB of persistent storage available in $HOME; prepackaged with the latest version of Cloud SDK, Docker, etc.; files in the $HOME directory persist between sessions; the instance is terminated after 20 mins of inactivity
Can be used to SSH into virtual machines using their private IP addresses
- Cloud Shell runs in a Google-managed VPC network which is VPC peered with the project’s VPC network (you can configure which VPC it should use)
Section 7 - Getting Started with Google Cloud Platform Managed Services
IaaS - Only use infrastructure from the cloud provider; for example, using a VM to deploy your applications and DBs
You are responsible for app code and runtime, load balancing, auto scaling, OS upgrades and patches, availability, etc.
Provider is responsible for virtualization, physical hardware, networking
PaaS
Cloud provider responsible for OS, application runtime, auto-scaling, availability and LB
You are responsible for app config, app code
Serverless
You don’t need to worry about where to deploy, the types of servers, OS, scaling, availability
Zero visibility into infrastructure
Pay for use - pay for requests, not servers
Focus on code
Managed services for Compute
Compute Engine - IaaS
App Engine - PaaS
Cloud Run - CaaS (container as a service); does not need a Cluster
GKE - CaaS; needs a Cluster
Cloud Functions - FaaS
Section 8 - Getting Started with Google Cloud App Engine
PaaS - lower flexibility, but fewer responsibilities compared to Compute Engine
Simplest way to deploy and scale apps in GCP; provides end-to-end app management
Supports Go, Java, .NET, Node.js, PHP, Python, Ruby (in Standard mode) and anything basically (containers) in Flexible mode
Has good integration to various storage products, like Cloud SQL
No usage charges - pay for resources provisioned
Provides automatic load balancing and auto-scaling
Provides app health monitoring
Provides app versioning + traffic splitting
Types
Standard
Apps run in language specific sandboxes; complete isolation from OS/disk/other apps
V1: Java, Python, PHP, Go (older versions)
V2: Java, Python, PHP, Node.js, Ruby, Go (newer versions)
Flexible
Apps run within Docker containers
Makes use of Compute Engine virtual machines
Support any runtime (built-in support for Python, Java, Node.js, Go, Ruby, PHP, or .NET)
Provides access to background processes and local disks
Component hierarchy
Application → Service → Version → Instance
Application
One App per Project; the container for everything created as part of App Engine
Limited to a single region, so be careful which region you select
app.yaml file is used to define configuration of the application
Service
You normally define one per microservice
Each service can have its own configuration, scaling settings, and runtime
Each service has its own URL
Each Service can have different versions
Version(s)
A Version is a specific iteration of a service
Code + configuration
Multiple versions can co-exist, and support traffic split
gcloud app deploy --version=v2 (traffic will automatically be switched over to this version; you can use --no-promote to avoid this)
gcloud app services set-traffic --splits=v3=.5,v2=.5
Instance(s)
- An Instance is a single copy of a version of a service. Instances are the actual compute resources that handle incoming requests and run your code
Standard vs Flexible
Feature | Standard | Flexible |
Pricing | Instance hours | vCPU, memory, persistent disks |
Scaling | Manual, Basic, Automatic | Manual, Automatic |
Scaling to 0 | Yes | No |
Instance startup time | Seconds | Minutes |
Rapid scaling | Yes | No |
Max request timeout | 1 to 10 mins | 60 mins |
Local disk | Mostly. Can write to /tmp | Yes. Ephemeral |
SSH for debugging | No | Yes |
Scaling App Engine instances
Automatic - recommended for continuously running workloads;
Can scale based on target CPU utilization, target throughput utilization, and max concurrent requests
Can configure max/min instances
Basic - instances are created as and when requests are received
Recommended for adhoc workloads
High latency is possible
Tries to keep costs low
Not supported by Flexible
Configure Max Instances and Idle Timeout
Manual - you configure a fixed number of instances that run continuously
Section 9 - Google Cloud Kubernetes Engine for Cloud Architects
Managed Kubernetes service
Each cluster can have different kinds of VMs (Nodes), defined in Node Pools (you can have multiple Node Pools per cluster) - you can use labels and affinity to define which Node Pool a Pod should be placed on
Provides auto-repair and auto-upgrade
Clusters can be zonal or regional (NOT global)
Provides Pod and Cluster Autoscaling
HPA automatically scales the number of pod replicas in a deployment or replication controller based on observed CPU utilization (or other select metrics)
VPA automatically adjusts the resource requests and limits (CPU and memory) for containers in your pods to match the required resources (useful for stateful apps, where horizontal scaling isn’t sufficient)
Integrates well with Cloud Logging and Monitoring
Uses Container-Optimized OS, a hardened OS, purpose built to run containers, built by Google
Supports mounting PDs and local SSDs; for example:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-ssd
provisioner: kubernetes.io/gce-pd   # dynamically provisions GCE Persistent Disks
parameters:
  type: pd-ssd
  replication-type: none            # zonal PD (use regional-pd for cross-zone replication)
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-dynamic-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard-ssd    # references the StorageClass above
---
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container
      image: nginx
      volumeMounts:
        - mountPath: "/mnt/data"
          name: my-dynamic-storage
  volumes:
    - name: my-dynamic-storage
      persistentVolumeClaim:
        claimName: my-dynamic-pvc   # binds the Pod to the PVC above
Two ways to create clusters: Autopilot (hands-off, pay per pod), Standard (full ownership)
Autopilot
It’s now the standard UI that pops up when you try to create a cluster; it’s hard to find the button that takes you to Standard
Reduce operational costs; hands-off experience, no managing of Nodes and Node Pools
Provides security hardened opinionated settings
Commands
gcloud container clusters resize my-cluster --node-pool default-pool --num-nodes 5 --zone europe-west1-b
kubectl autoscale deployment hello-world-deployment --max=4 --cpu-percent=70 (will create a Horizontal Pod Autoscaler which references a Deployment)
gcloud container clusters update my-cluster --enable-autoscaling --min-nodes=1 --max-nodes=10 - will add up to 10 Nodes to a cluster, depending on resource requests
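For completeness, a hedged sketch of creating Autopilot vs Standard clusters and fetching credentials (cluster names, locations and sizes are placeholders):
# Autopilot: regional, Google manages the nodes
gcloud container clusters create-auto my-autopilot-cluster --region=europe-west1
# Standard: you manage the node pools
gcloud container clusters create my-standard-cluster \
  --zone=europe-west1-b --num-nodes=3 --machine-type=e2-standard-4
# Configure kubectl to talk to the cluster
gcloud container clusters get-credentials my-autopilot-cluster --region=europe-west1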
Cluster
Master node(s)
Manages the cluster
Not visible to you, Google managed
Components
API Server - handles all communications for a K8s cluster (from nodes and from outside)
Scheduler - decides placement of Pods
Controller Manager - manages deployments and replica sets
etcd - distributed database, storing the cluster state
Worker node(s)
You can actually see them in Compute Engine, when using Standard, and not Autopilot
Runs Pods
Kubelet - manages communication with the Master node(s)
Cluster types
Zonal - single zone, single control plane; all Nodes run in the same zone; can be multi-zonal: single control plane, but nodes run in multiple zones
Regional - replicas of control plane run in multiple zones within a region; Nodes also run in the zones where the control plane Nodes run
Private - VPC-native cluster; Nodes only have internal IP addresses
Alpha - created with early feature APIs
GCR - Google Container Registry
Alternative to DockerHub
Fully managed container registry
Naming: HostName/ProjectId/Image:Tag => gcr.io/projectName/helloworld:1.0
Scenarios
Keep costs low and optimize GKE implementation => preemptible VMs, use correct region, committed-use discounts
Efficient, completely auto-scaling GKE solution => Use HPA and Cluster Autoscaler
Execute untrusted third-party code in K8s cluster => Create a separate Node Pool with GKE sandbox and run it there
Enable only internal communication between microservices => Service of type ClusterIP
Pod stays Pending => Pod cannot be scheduled because of insufficient resources
Pod stays Waiting => Problem with pulling the image
Section 10 - Getting Started with Google Cloud Functions
Run code in response to events
Node.js, Python, Go, Java, .NET, Ruby
You don’t need to worry about servers, scaling, availability; you can focus on code
Even though it’s serverless, you must still choose a region for the function to run in, as well as the memory (which implies CPU)
Pay for what you use - number of invocations, time of execution, memory and CPU used
Default timeout of 1 min, configurable to up to 60 mins (for HTTPS triggers)
Two versions: 1st gen, 2nd gen (built on Cloud Run and EventArc)
Event = something that happened
Trigger = respond to an event with a Function call; the Event data is passed to the function
Events can be triggered from: Cloud Storage, Cloud Pub/Sub, HTTP call, Firebase, Firestore, Cloud Logging (formerly Stackdriver); there is an Other option, where you can basically choose any kind of EventArc trigger
Cloud Functions is currently being integrated into the Cloud Run UI; in fact, the product is now called “Cloud Run Functions”
You can select memory and CPU, but memory is dependent on CPU (“Selected memory requires a minimum of N vCPUs”)
You can choose either unauthenticated access or authenticated access when selecting HTTP triggers
You can select the maximum number of concurrent requests that can reach a container instance
You can select the service account that the Function assumes (in case it needs to interact with other GCP products)
You can select the networking: allow all traffic, allow internal traffic (from VPC networks in the same project), internal traffic + from Cloud Load Balancing
For egress, the function can’t normally access VPCs, but you can set this up using a VPC connector
Supports traffic splitting between versions (gen 2)
Section 11 - Getting Started with Google Cloud Run
“Container to Production in Seconds”
Built on top of Knative
Pay per use (CPU, Memory, Requests and Networking)
Cloud Run supports yaml configuration to deploy containers
Supports traffic splitting (similar to Cloud Functions)
Anthos - run Kubernetes clusters anywhere, including on your data centres
- Cloud Run for Anthos - deploy workloads to Anthos clusters running on-prem or on Google Cloud
gcloud run deploy
gcloud run revisions list
gcloud run services update-traffic myservice --to-revisions=v2=10,v1=90
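A minimal deploy sketch (the image, region and instance limit are assumptions):
gcloud run deploy myservice \
  --image=gcr.io/my-project/myservice:1.0 \
  --region=europe-west1 \
  --allow-unauthenticated \
  --max-instances=10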
Cloud Run is getting more and more similar to Cloud Functions, especially now that Cloud Functions are actually powered by Cloud Run, but there are nuances:
Cloud Run primarily uses HTTP as triggers, whereas Functions offer more trigger options
Cloud Functions is fully serverless, you don’t manage anything, whereas with Cloud Run you get more flexibility
In Cloud Functions, you directly deploy code, whereas in Cloud Run you use containers
Cloud Functions supports predefined runtimes, whereas Cloud Run supports anything via containerization
Use Cloud Functions for simple event-driven functions and Cloud Run for flexible microservices and APIs
Feature | Cloud Functions | Cloud Run | App Engine |
Type | Function-as-a-Service (FaaS) | Container-as-a-Service (CaaS) | Platform-as-a-Service (PaaS) |
Use Case | Event-driven, single-purpose functions | Containerized microservices, APIs | Full web applications, backend services |
Deployment Unit | Individual functions | Containers | Applications with multiple services |
Language Support | Limited set of runtimes (Node.js, Python, Go, Java, .NET, Ruby, PHP) | Any language that can run in a container | Standard env: Limited runtimes Flexible env: Any via containers |
Statelessness | Stateless | Stateless | Can be stateful with datastore |
Request Timeout | Up to 9 minutes (1st gen and 2nd gen event-driven); up to 60 minutes for 2nd gen HTTP functions | Up to 60 minutes | Standard: 1-10 minutes Flexible: 60 minutes |
Cold Start Time | Low to medium | Medium | Standard: Very low Flexible: Higher |
Scaling | Automatic scaling to zero | Automatic scaling to zero | Standard: Automatic scaling to zero Flexible: Scales to minimum instances |
Resource Allocation | Memory: 256MB-8GB CPU: Limited options | Memory: 512MB-32GB CPU: Full control | Standard: Limited Flexible: Customizable |
Pricing Model | Pay per invocation and compute time | Pay per request and container instance time | Standard: Pay per instance hour Flexible: VM instance pricing |
Free Tier | 2M invocations/month 400,000 GB-seconds/month | 2M requests/month 360,000 vCPU-seconds/month 180,000 GiB-hours/month | 28 instance hours/day 9 instance hours/day (flexible) |
Execution Environment | Fully managed | Fully managed | Fully managed |
Custom Domain Support | Via API Gateway | Built-in | Built-in |
VPC Networking | Supported | Supported | Supported (flexible only) |
Startup Time | Fast | Medium | Standard: Very fast Flexible: Slower |
Dependencies | Limited | Any dependencies in container | Standard: Limited Flexible: Any |
Background Processes | Limited | Supported | Standard: Not supported Flexible: Supported |
Section 12 - Exploring Google Cloud Functions 2nd generation
Longer request timeout (up to 60 mins for HTTP-triggered functions)
Larger instance sizes (up to 32 GiB of memory with 8 vCPU)
Concurrency: up to 1000 concurrent requests per function instance (v1 only supported 1 request per instance)
Built on top of Cloud Run (supports revisions + traffic splitting)
Support for 90+ event types, via EventArc
When you deploy a Cloud Function, behind the scenes a Cloud Build build is triggered, that creates a container and deploys it
Scaling and concurrency
Autoscaling - in the past, as new invocations come in, new function instances are created; 1 request per instance (this is how 1st gen works)
- A function that has completed execution can be used by future requests
Typical problem is cold start (new function instance takes time to be ready)
- A typical solution is to configure a minimum number of instances, regardless of number of requests => increased cost
2nd gen supports concurrency - how many invocations a single instance can handle, up to 1000 (but your code should be able to handle concurrent requests)
gcloud functions deploy <name>
--source (Zip file in GS, or source repo, or local filesystem)
--trigger-bucket or --trigger-http or --trigger-topic or --trigger-event-filters (only for gen2, uses EventArc events)
--docker-registry (default is container-registry, an alternative is artifact-registry)
--docker-repository
--gen2
--runtime
--service-account (by default, 1st gen uses the App Engine default service account and 2nd gen uses the Compute Engine default service account)
--timeout
--min-instances
--max-instances
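Putting a few of these flags together, a hedged 2nd gen HTTP function deploy (the runtime, entry point, region and instance limits are placeholders):
gcloud functions deploy my-function \
  --gen2 \
  --runtime=python312 \
  --region=europe-west1 \
  --source=. \
  --entry-point=handler \
  --trigger-http \
  --allow-unauthenticated \
  --min-instances=1 \
  --max-instances=10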
Best practices
Set minimum number of instances, to avoid cold starts
Configure max number of instances (protect from abnormally high request levels)
Use Cloud Endpoints (or Apigee or API Gateway) for versioning
Use Cloud Run revisions for safer releases
Use Secret Manager to securely store secrets
Use individual Service Accounts for each function
Section 13 - Getting Started with Cloud KMS
Data can be at rest, in transit (in motion), either within the cloud or in/out of the cloud, and in use (non-persistent, in volatile memory)
Defense in depth => encrypt all data (hard disks, databases, in transit)
Symmetric key encryption uses the same data encryption key for encryption and decryption
You need to choose the right algorithm (strong)
You need to secure the encryption key
You need to figure out how to share the encryption key
Asymmetric key encryption uses a public and a private key; the public key is publicly available; encrypt with public, decrypt with private
- Finding the private key from a public key, when using a strong algorithm, is computationally infeasible
Cloud KMS creates and manages cryptographic keys (both symmetric and asymmetric)
You can control the keys' usage in your apps and GCP services
KMS provides APIs to encrypt, decrypt and sign data
Integrates with Google services
Google-managed key: no action on your part
Customer-managed key: you create a key in KMS and use it
Customer-supplied key: provide your own
Using KMS
Before creating a key, you need to create a keyring, where you can attach multiple keys (keyrings can be global or regional)
You can integrate an external key manager with KMS (only for regional keyrings)
You choose the protection level: Software, or HSM
You choose the purpose: symmetric encrypt/decrypt, asymmetric sign, asymmetric decrypt
GCP can automatically rotate keys
You need to configure IAM access accordingly; for example, if using a customer managed key to encrypt disks for Compute Engine, the Compute Engine service account needs access to the key
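A sketch of the keyring/key flow and encrypting a file with gcloud (names, location, rotation period and rotation date are placeholders):
gcloud kms keyrings create my-keyring --location=europe-west1
gcloud kms keys create my-key \
  --keyring=my-keyring --location=europe-west1 \
  --purpose=encryption --rotation-period=90d \
  --next-rotation-time=2025-06-01T00:00:00Z
gcloud kms encrypt \
  --key=my-key --keyring=my-keyring --location=europe-west1 \
  --plaintext-file=secret.txt --ciphertext-file=secret.txt.enc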
Section 14 - Exploring Block and File Storage
Block storage
Hard disks attached to computers
Typically, one block storage device is attached to one virtual server (a disk attached in read-only mode can be attached to multiple machines)
You can connect multiple block storage devices to one server
Used as: DAS (direct-attached storage), SAN (storage area network)
File storage
Shared by several servers
You can connect to it from GKE or Compute Engine instances
Block storage in GCP
Persistent Disks: network block storage
Zonal or regional (2x the cost of Zonal): defines how data is replicated across zones (when writing data to a regional PD, GCP ensures that it is synchronously written to the replica)
Durable, lifecycle not tied to the host
Provisioned capacity, you configure the size you want; you can increase size while attached to VM
Performance scales with size
By default, a boot disk is attached to your VMs, but you can add more
By default, disks are deleted when the VM is deleted; can be configured when creating the instance (Keep boot disk)
Scenarios
Improve performance: increase size, add more PDs, increase vCPUs of the instance
Increase durability: use regional PDs (replicated across 2 zones)
Hourly backups: schedule snapshots
Delete old snapshots: configure part of the schedule
Attaching a PD to a running instance
Attach disk to the running/stopped VM:
gcloud compute instances attach-disk <instance_name> --disk <disk_name>
Format the disk
Mount the disk
Resizing PDs (even when running)
gcloud compute disks resize <disk_name> --size <new_disk_size>
Take a snapshot, for restore, in case there are issues
Resize the file system and partitions
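The same flow as a hedged end-to-end sketch (disk/instance names, zone and the /dev/sdb device path are assumptions; on a real VM check the device under /dev/disk/by-id/):
gcloud compute instances attach-disk my-vm --disk=my-data-disk --zone=europe-west1-b
# On the VM: format (first time only) and mount
sudo mkfs.ext4 -F /dev/sdb
sudo mkdir -p /mnt/disks/data
sudo mount /dev/sdb /mnt/disks/data
# Later: grow the disk, then grow the filesystem
gcloud compute disks resize my-data-disk --size=200GB --zone=europe-west1-b
sudo resize2fs /dev/sdb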
Types
Type | Underlying storage | Sequential IOPS | Random IOPS | Cost | Use case |
Standard | Hard Disk Drive | Good | Bad | Cheap | Big Data (cost efficient) |
Balanced | SSD | Good | Good | Cheaper than SSD | Balance between cost and performance |
SSD | SSD | Very good | Very good | Expensive | High performance |
Snapshots
Point in time backups of PDs
Can be scheduled
Can be auto-deleted after X days
Can be multi-regional or regional => increase durability
Can be shared across projects
Disks can be created from snapshots; if it’s a boot disk, VM instances can be launched from it
Incremental - the 2nd snapshot only contains changes since the 1st snapshot; deleting a snapshot only deletes data that is not needed by other snapshots
You should keep similar data together on a PD: separate OS, volatile data and permanent data
Snapshots are encrypted in the same way as the main disk
Recommendations
Avoid taking them more often than once per hour → they can reduce performance; schedule them during off-peak hours
Creating snapshots from disk is faster than creating from images, but creating disks from images is faster than creating from snapshots; snapshots are incremental, require some calculation before creating a disk; if you are repeatedly using a snapshot to create disks, you should create an image first
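A sketch of an on-demand snapshot plus a daily snapshot schedule attached to a disk (all names, times and retention values are placeholders):
gcloud compute disks snapshot my-data-disk \
  --zone=europe-west1-b --snapshot-names=my-data-disk-snap-001
gcloud compute resource-policies create snapshot-schedule my-daily-schedule \
  --region=europe-west1 --daily-schedule --start-time=04:00 \
  --max-retention-days=14
gcloud compute disks add-resource-policies my-data-disk \
  --zone=europe-west1-b --resource-policies=my-daily-schedule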
Machine Images
Different from an Image
- A Compute Engine Image is a template for creating new VMs. It captures the boot disk of a VM and can include an OS, applications, and basic configurations.
A machine image is a Compute Engine resource that stores all the configuration, metadata, permissions, and data from multiple disks of a virtual machine (VM) instance. You can use a machine image in many system maintenance, backup and recovery, and instance cloning scenarios.
Local SSDs: present (physically attached) on the same host as the VM; not available for all machine types
Temporary data (ephemeral storage), does not survive maintenance events unless Live Migration is enabled
Lifecycle tied to the host, and they aren’t backed up by snapshots or by machine images
High IOPS
Data automatically encrypted but you can’t configure the keys
Only supported by some machines
Supports SCSI and NVMe interfaces
Performance depends on the size of the disk and the vCPUs of the instance
Advantages: very fast I/O (10-100x compared to PDs)
Disadvantages: lower durability, lower availability, snapshots not supported
File Storage in GCP - Filestore: high performance file storage
Shared cloud file storage
Supports NFSv3
Provisioned capacity (1 - 64 TB, or 100 TB for high capacity)
Performance scales with capacity
Supports HDD (general purpose) and SSD (performance-critical workloads)
Use cases: file share, media workflows, content management
Global, Regional and Zonal resources
Global: images, snapshots, instance templates
Regional: regional MIGs, regional PDs
Zonal: zonal MIGs, VM instances, PDs (you can only attach PDs to instances in the same zone)
Storage scenarios
Very high IOPS but fine to lose data - Local SSDs
High performance file sharing system, attachable to multiple VMs - Filestore
Backup VM configs along with all PDs - Machine Images
Make it easy to launch VMs with hardened OS and customized software - Custom Image
Section 15 - Exploring Object Storage (Cloud Storage)
Most popular, flexible, inexpensive
Serverless: autoscaling and infinite scale
Treats entire object as a unit (no partial updates)
Access control can be set at object level (ACL), but it’s not recommended (uniform access control is recommended, which acts at bucket level)
Provides REST API to access and modify objects
- Also provides CLI (gsutil, NOT gcloud) and client libraries (C++, C#, Java, Node.js, PHP, Python, Ruby)
Can be used to store any file types: media files, archives, backups, staging data during migrations
- Many GCP services support import/export to Cloud Storage
Objects (key + value) are stored in Buckets
Buckets
Bucket names are globally unique → used in the object URL, so there are some restrictions on the name (only lower case, numbers, hyphens, underscores and periods; max size 63, min size 3)
Can upload an unlimited number of objects
Each Bucket belongs to a specific Project
Keys are unique in a bucket
Max object size is 5 TB
Can be regional, dual-regional (but only a subset of combinations), multi-regional (but on the same continent) options for storage
You can set up cross-bucket replication between Buckets (on a best-effort basis, but Turbo Replication is available, which provides an SLA of 15-minute replication across regions)
You can choose the storage class - there are huge variations in access patterns (some accessed every day, others not at all, others once a month) => cheaper price for infrequently accessed objects;
Storage class can be defaulted at bucket level, but can be overwritten at object level
Some features are independent of storage class
11 9s of durability
Low latency (first byte typically in tens of ms) → this is different to AWS
Unlimited storage and autoscaling
No minimum object size → this is different to AWS
Standard
No minimum storage duration
Availability of > 99.99% in multi-region or 99.99% in single region
Frequently used, short period of time
Nearline
Minimum storage duration of 30 days
99.95% availability in multi region and dual region, 99.9% in regional
Read or modify once a month on average
Coldline
Minimum storage duration of 90 days
99.95% in multi/dual region, 99.9% in regional
Accessed at most once a quarter
Archive
Minimum storage duration of 365 days
99.95% in multi/dual region, 99.9% for regional
Accessed less than once a year
Autoclass
- Automatically transitions objects between storage classes (e.g. Standard and Nearline) based on access patterns
You can choose access control
ACLs - Fine grained (object level)
Uniform (bucket level) - works by attaching policies to the bucket, which define mappings between roles and principals
Versioning
Prevents accidental deletion and provides a historical view of an Object
Enabled at bucket level; can be turned on and off at any time
Live version is the latest version; if you delete a live object, it becomes non-current; if you delete non-current version, it is completely deleted
Versions are identified by the object key + a generation number; use this ID to perform operations with an older version
You can reduce costs by deleting older (noncurrent) versions - can use lifecycle management policies for this
Lifecycle management
Generally, usage reduces over time => you should automatically change storage classes/delete files that are old
Rules are configured at the bucket level
Identify objects using conditions (age, created time, live/noncurrent, number of newer versions); you can set multiple conditions
Supported actions
Set storage class, with supported transitions Standard → Nearline/Coldline/Archive, Nearline → Coldline/Archive, Coldline → Archive (basically, it flows downward only)
Deletion
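A sketch of a lifecycle configuration applied with gsutil (the thresholds are arbitrary): move Standard objects to Nearline after 30 days and delete noncurrent versions after 365 days.
cat > lifecycle.json <<'EOF'
{
  "lifecycle": {
    "rule": [
      {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
       "condition": {"age": 30, "matchesStorageClass": ["STANDARD"]}},
      {"action": {"type": "Delete"},
       "condition": {"age": 365, "isLive": false}}
    ]
  }
}
EOF
gsutil lifecycle set lifecycle.json gs://BKT_NAME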
Encrypting Cloud Storage data
Cloud Storage encrypts data on the server side by default
Server-side encryption (enforced by GCP, always done)
Performed by GCS after it receives the data
Use HTTPS to ensure data is safe in transit
Google-managed encryption: default, no config needed; GCP manages the KMS keys, not you (they are not CMKs)
Customer-managed: keys managed by you (CMKs), in Cloud KMS (symmetric encryption)
- The GCS Service Account will need access to your keys in KMS
Customer-supplied: supply the key with every GCS operation; GCP will not store the key, just use it to encrypt; when retrieving data, you need to send the right key
Add API headers: `x-goog-encryption-algorithm`, `x-goog-encryption-key` (base64 encryption key), `x-goog-encryption-key-sha256` (encryption key hash)
Or use gsutil
Client-side encryption (optional)
- You encrypt data before sending to GCP
Cloud Storage metadata
Every Object has Metadata, which is a set of key/value pairs, for example:
Content-Type
Content-Encoding
Content-Disposition (how the content should be displayed)
Cache-Control
Content-Language
Custom key/value tags
Storage class is also a key/value pair, for example (`storageClass: STANDARD`)
Types of metadata
Fixed-key (fixed key, like Cache-Control, you can only change value)
Custom: flexible key/value pairs
Non-editable (size, created, last modified, etc.)
Bucket Lock
Helps meet Compliance/Regulatory needs
Configure data retention policy - can only modify objects after the retention period expires
Can be configured when creating a bucket, or at a later point in time; applies to existing objects as well
You can lock a retention policy (optional) => prevents decreasing/deleting the retention policy by anyone, but you can increase it
Buckets cannot be deleted if they contain objects that are younger than the retention period
Transferring data from on-prem to cloud
GCS is the most popular destination (as a staging location)
Online Transfer: gsutil or GCS APIs
Useful for one-time transfers
< 1 TB and when transferring from on-prem, or another GCS bucket
Storage Transfer Service
Lots of data (petabytes) from on-prem, other clouds, or between GCS buckets
You can set up a repeating schedule
Supports incremental transfers (only transfer changed objects)
Reliable and fault tolerant - keeps track of where it left off
Transfer Appliance
Physical data transfer
Recommended when data size > 20 TB, or if transferring data takes > 1 week
TA40, TA300
Google Cloud Storage Transfer Service is best suited for large-scale, scheduled, and automated data transfers, especially when migrating data from other cloud providers or on-premises systems. It provides robust features for data validation, scheduling, and notifications.
gsutil is a versatile command-line tool that is ideal for ad hoc transfers, scripting, and automation. It provides a wide range of commands and options for managing GCS data and is suitable for smaller-scale or one-time transfers.
GCS best practices
Avoid sensitive info in bucket or object names
Store data in closest region to the users
Ramp up request rate gradually, above 1000 write/5000 read requests per second
Use exponential backoff when receiving 5xx or 429 (too many requests)
Do not use sequential numbers or timestamps as object keys (add hash value before sequence value)
Use Cloud Storage FUSE to enable file system access to Cloud Storage; allows you to mount Buckets as file systems
Cloud Storage CLI commands
gsutil mb gs://BKT_NAME - make bucket
gsutil ls -a gs://BKT_NAME - list objects in bucket (-a lists all versions, not only the current one)
gsutil cp source destination - copy objects
gsutil mv - move/rename objects
gsutil rewrite -s STORAGE_CLASS gs://BKT_NAME/OBJECT - change storage class
gsutil signurl -d 10m YOUR_KEY gs://BKT_NAME/OBJECT - signed URL for temporary access
ACLs (Access Control Lists)
Who has access to your buckets and objects, and at what level
Finer granularity compared to IAM (also called uniform), which works at bucket level
A user gets access if they are allowed by either ACLs or IAM (OR, not AND)
Use ACLs to customize access to individual objects
Signed URLs
Allow access to a user for a limited time to objects, when the user does not have a Google account
Create a key for a Service Account and use it to create a signed URL:
gsutil signurl -d 10m KEY gs://BUCKET_NAME
Exposing public static websites from Cloud Storage
Create a bucket with the same name as the website name (name of bucket should match DNS name of website)
- Verify that the domain is owned by you
Add files to the bucket (including index.html and error.html)
Add member `allUsers` and grant the Storage Object Viewer role, in the Permissions section of the bucket (fine grained/IAM based)
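A sketch of the same steps with gsutil (the bucket/domain name is a placeholder):
gsutil mb gs://www.example.com
gsutil cp index.html error.html gs://www.example.com
# Make objects publicly readable
gsutil iam ch allUsers:objectViewer gs://www.example.com
# Set the index and error pages
gsutil web set -m index.html -e error.html gs://www.example.com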
Section 16 - Authentication in Google Cloud with Cloud IAM
Identities can be human or non-human, and need to access resources and perform actions: Members perform Actions on Resources
Cloud IAM provides authentication and authorization
Identities can be
GCP User
Group of GCP Users
Application running in GCP
Application running in a non-GCP data center
Unauthenticated users
Cloud IAM provides very granular control
Member = who
- Member types are identified by a prefix: user, serviceAccount, group, or domain
Permission = actions that can be taken on resources; have the form `group.subgroup.action`, e.g. `compute.instances.create`
Role = set of permissions
Policy = assign permissions to members: what actions can be performed on which resource, by whom and under what condition
Permissions are NOT directly assigned to members, but via bindings; bindings can be applied at all levels of the GCP hierarchy (organization, folder, project, resource)
When you add members and assign roles at the appropriate level (organization, folder, project, or resource) through the Google Cloud Console, GCP automatically creates the necessary IAM bindings for you. These bindings associate the members with the roles you have selected.
{ "bindings": [ { "role": "roles/storage.objectViewer", "members": [ "user:example-user@example.com" ], "condition": { "title": "Time-based access", "description": "Allows access only between 8 AM and 5 PM UTC", "expression": "request.time >= timestamp('2025-01-15T08:00:00Z') && request.time <= timestamp('2025-01-15T17:00:00Z')" } } ] }
Role: set of permissions to perform specific access on specific resources -> different to AWS (where Roles can be assumed by trusted entities)
Types
Basic (Primitive) - Owner/Editor/Viewer: wide set of permissions
Earliest version, they existed before IAM
NOT recommended - do not normally use in production (can use to give an Auditor read only access to everything, for example)
Predefined: fine grained, managed by Google
Different roles for different purposes
Storage Admin, Storage Object Admin, Storage Object Viewer, Storage Object Creator
Custom: created by you
Policy: binding between Members and Roles
You can use Policy Troubleshooter to test permissions for a given Member (in case they complain that they can’t access a specific Resource that they should have access to)
Service Accounts
Used when an Application/VM needs access to a resource (you shouldn’t use personal credentials for this)
Service Accounts are identified by an e-mail address that ends in something like `gserviceaccount.com`
Some Service Accounts are automatically created, such as the App Engine and Compute Engine default service accounts
Service Accounts can access resources, but they can be accessed themselves as resources, by other Members: “You can allow specific users to have ownership and access to service accounts and their settings. Users with the Owner or Editor basic role on the project can already modify service accounts, but you might want to restrict access for some users so that they can take only specific actions against service account resources”
You can impersonate a service account, if your own principal has the Service Account Token Creator role (`roles/iam.serviceAccountTokenCreator`):
gcloud auth print-access-token --impersonate-service-account=[SERVICE_ACCOUNT_EMAIL]
They do not use passwords, but instead rely on private/public RSA keys for authentication
- You can’t use it to login via UIs or cookies
Types
Default service account - automatically created when some services are used; not recommended, because they have Editor role
User Managed - recommended, provide fine grained access control
Google-managed - created and managed by Google, to perform operations on the user’s behalf
Use cases
VM → Cloud Storage
Uses Google Cloud managed keys, which are automatically rotated
Do NOT delete Service Accounts for running instances => apps will lose access
On-prem → Cloud Storage (long lived)
You cannot directly assign Service Accounts
Create a Service Account user managed key and use the key to authenticate yourself as the Service Account
Make sure you keep the key file secure; you cannot regenerate the same key again => you need to generate a new key, if you lose the file
- Set env variable called GOOGLE_APPLICATION_CREDENTIALS to the path to the key file
Use Google Cloud Client libraries; they use Application Default Credentials (ADC); it uses the GOOGLE_APPLICATION_CREDENTIALS env variable
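A hedged sketch of that flow (the project, account name and role are placeholders; prefer short-lived credentials where possible):
gcloud iam service-accounts create my-app-sa --display-name="My on-prem app"
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:my-app-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"
gcloud iam service-accounts keys create key.json \
  --iam-account=my-app-sa@my-project.iam.gserviceaccount.com
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json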
On-prem → GCP APIs (short lived)
Less risk compared to sharing Service Account keys
Credential types
OAuth 2.0 access tokens
OpenID Connect ID tokens
Self-signed JWTs
Application Default Credentials (ADC) in Google Cloud Platform (GCP) provide a way to automatically find and use the credentials needed for your application to authenticate and make API calls to Google Cloud services. ADC can be used with client libraries, gcloud commands, and other tools to simplify the process of authentication. ADC looks for credentials in the following order:
1. Environment variable: if `GOOGLE_APPLICATION_CREDENTIALS` is set, ADC uses the credentials pointed to by this variable. It should contain the file path to a service account key JSON file.
2. Application Default Credentials file: if the environment variable is not set, ADC checks for the default credentials file in the user's home directory, located at ~/.config/gcloud/application_default_credentials.json
3. Google Cloud SDK: if the credentials file is not found, ADC checks if the Google Cloud SDK is installed and authenticated using `gcloud auth application-default login`.
4. Compute Engine, Kubernetes Engine, App Engine, Cloud Functions: if no credentials are found, ADC checks if the application is running on a Google Cloud service that provides built-in service accounts, such as Compute Engine, Kubernetes Engine, App Engine, or Cloud Functions.
Section 17 - Exploring Databases in Google Cloud Platform
Increasing availability
- Have multiple standby instances available, with synchronous replication
Increase durability
- Have multiple copies of the data available across multiple zones, regions
RTO - recovery time objective = maximum acceptable period to recover
RPO - recovery point objective = maximum acceptable data loss
Scenarios
Very small RPO and RTO → hot standby
Very small RPO, moderate RTO → warm standby with automatically synced data; scale up the standby
Very small RPO, longer RTO → regular snapshots and transaction logs + create DBs from snapshots and tx logs
Data can be lost (cached data) → failover from old server to completely new server
Database categories
Relational - predefined schema with tables and relationships; strong transactional capabilities
OLTP
Large number of users make large number of small transactions
Most traditional apps (ERP, CRM, e-commerce, banking apps)
MySQL, Oracle, SQL Server, etc.
Usually use row storage
GCP solutions
Cloud SQL (up to a few TB of data)
Cloud Spanner: unlimited scale (multiple PBs), and 99.999% availability; recommended for global apps with horizontal scaling
OLAP
Allow users to analyze petabytes of data: reporting apps, data warehouses, BI apps, analytics systems
Usually use column storage => high compression, easier to distribute data across nodes
GCP managed service: BigQuery (petabyte scale, distributed data warehouse)
Non-relational (NoSQL = not only SQL); flexible schema, horizontally scalable; typically trade-off strong consistency and SQL features for scalability and high performance
Document
Cloud Datastore: provides ACID transactions, SQL-like queries, indexes; small to medium DBs
Cloud Firestore (next version of Datastore): strong consistency, web and client libraries; small to medium DBs
Wide-column
- BigTable: not serverless, you create an instance and then tables; recommended for data size > 10 TB; not recommended for transactional workloads (does not support multi-row transactions)
Key/Value
Graph
In-memory
Data stored in memory => much faster than disk
Memorystore: caching, session management, gaming leaderboards, geospatial apps
Factors when choosing a type
Fixed schema?
Level of transactional properties?
Latency requirements?
Number of expected transactions?
Amount of data?
Section 18 - Relational Databases for Transactional Applications in Google Cloud Platform
Cloud SQL
Fully managed, relational DB service
Configure your needs and do not worry about managing the DB
Supports MySQL, Postgresql, and SQL Server
It’s regional, NOT global
Supports SSDs (best performance) or HDDs
Provisioned capacity: up to 96 vCPU, 624 GB RAM and 64 TB storage (supports automatic storage increase without downtime)
Important features
Automatic encryption, maintenance and updates
H/A with failover, by using a Standby
Choose Primary and Secondary zones within a region
Changes from Primary are replicated synchronously to Secondary
In case of a Zonal failure, automatic failover is done, but it does NOT revert automatically
You cannot connect to the Secondary, when Primary is running and vice versa; you can only connect to one at a time (if you want to be able to connect to others, use Read Replicas)
Read replicas (cross-zone, cross-region, external)
- Each read replica has its own endpoint that clients can connect to; if you try to write, you get an error:
ERROR 1290 (HY000): The MySQL server is running with the --read-only option so it cannot execute this statement
PITR: point in time recovery (down to the second), up to 35 days
- Cloud SQL offers Point-in-Time Recovery (PITR) by combining daily automated backups with continuous binary logging
Automated backups (daily), and on-demand backups
Migration from other data sources using DMS (Database Migration Service)
Export data in CSV or SQL format
Use Cloud SQL for simple relational use cases; use Spanner (very expensive) if you need:
Huge volumes of data (100s of TB)
Infinite scaling
Global database needs
Higher availability (99.999%)
Best practices
Use Cloud SQL Auth Proxy to securely connect to Cloud SQL from your apps (GAE, Functions, Cloud Run, GKE, etc.)
- Allows you to connect to your Cloud SQL instances securely without having to configure SSL certificates or IP whitelisting. It provides secure access to your Cloud SQL instances by authenticating and encrypting the connection using your Google Cloud credentials. The proxy can be run as a background process on your local machine, a VM, or as a sidecar container in Kubernetes.
Have multiple small instances, rather than one large, so you can horizontally scale writes (because Cloud SQL doesn’t do horizontal scaling)
Understand back-ups and export
Backups are lightweight and provide PITR; they are deleted when the instance is deleted and you can’t take a backup of a single DB or table
Exports take longer but are more flexible; you can export a single DB or table; exporting large DBs can impact the Cloud SQL instance => you can use a serverless export, using the --offload flag to reduce impact
- Import/export in multiple small batches
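Tying the Auth Proxy and serverless export practices above together, a hedged sketch (project, region, instance, database and bucket names are placeholders; the proxy binary and flags may differ by version):

```bash
# Run the Cloud SQL Auth Proxy (v2) locally; the app then connects to 127.0.0.1 as if the DB were local
./cloud-sql-proxy --port 5432 my-project:europe-west1:my-instance &

# Serverless export of a single database to Cloud Storage, offloaded from the primary instance
gcloud sql export sql my-instance gs://my-backup-bucket/my-db.sql.gz \
  --database=my-db --offload
```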
Cloud Spanner
Fully managed, mission critical, relational, globally distributed (supports regional, dual region, multi-region set-up) database with very high availability (99.999% compared to 99.95% Cloud SQL)
You choose compute capacity, either in Nodes or Processing Units (smaller units, more granular, increments of 100/1000); you can also choose autoscaling instead of selecting a predefined number of Nodes/PUs
Strong transactional consistency at global scale
Automatically distributes data across nodes (uses partitioning/clustering to decide nodes - see below)
Horizontally scales for both writes and reads (unlike Cloud SQL where only reads can scale, via Read Replica)
Cloud Spanner partitions data into splits, which are contiguous ranges of rows from a table. These splits are distributed across multiple servers (nodes) in the Spanner network. Each split can dynamically move between nodes based on load and size.
For joins that span across multiple nodes, Spanner employs a distributed query execution engine. It breaks down the query into sub-queries that can be processed in parallel on the respective nodes holding the data. The results are then combined to produce the final output.
Pay for nodes and storage: very expensive
Use Cloud Console to export data, there is no gcloud export option; you can also use Cloud Dataflow to automate exports
Google Cloud Spanner uses the gRPC (gRPC Remote Procedure Call) protocol for communication
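For reference, a hedged gcloud sketch of provisioning a small Spanner instance and database (instance, database and config names are placeholders):

```bash
# Create a regional instance with granular capacity (100 processing units)
gcloud spanner instances create my-instance \
  --config=regional-europe-west1 \
  --description="Demo instance" \
  --processing-units=100

# Create a database on that instance
gcloud spanner databases create my-db --instance=my-instance
```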
Section 19 - NoSQL Databases in Google Cloud Platform
Datastore
Single region (99.95% availability) or multi-region (99.999% availability)
Automatically scales and partitions data as it grows (you don’t provision anything)
Recommended for up to a few TBs of data, afterwards use Bigtable
Supports indexes, multi-row transactions and its flavour of SQL (GQL)
- Does not support joins, or aggregate operations
Use namespaces to group Entities
Export only works from CLI (gcloud), NOT console
- Exports contain a metadata file and a folder containing the data
Best practices
Store user profiles and other flexible schema requirements
Use as an index for objects in Cloud Storage (for example, metadata about pictures)
Design keys (no sequential values or timestamps) and indexes (only use indexes when you know they will be used in queries) carefully
Prefer batch operations
Firestore
Datastore++: optimized for multi device access
NoSQL document database built for automatic scaling, high performance, and ease of application development. To use Firestore, create one or more databases. Firestore databases come in two modes: Native mode and Datastore mode (Native mode is basically Firestore)
Offline mode, data sync across multiple devices
Client side libraries: web apps, iOS/Android
Single region (99.95% availability) or multi-region (99.999% availability)
Collection → Documents → Field name/type/value
- Hierarchical: you can add collections inside documents
Modes
Datastore compatible
Native
Bigtable
Bigtable is a fully managed, wide-column NoSQL database that offers low latency and replication for high availability
Petabyte scale, wide column (key value store), NoSQL database
Data is stored in tables, rows, and columns, similar to traditional relational databases but without a fixed schema. Each row can have a different set of columns.
Not normally serverless (but you can select autoscaling instead of provisioning a specific number of nodes): you create an instance and choose either SSD (use for most use cases) or HDD (only for huge volumes of non-latency-sensitive data); you then configure the Cluster, within the Instance
- A cluster handles application requests for an instance. It contains nodes, which determine your cluster's performance and storage limit. Additional clusters can be added at any time.
HBase API compatible
Designed for huge volumes of analytical and operational data: millions of txs per second
Only supports single row transactions (not good for transactional workloads)
Scales horizontally by adding nodes (can be done automatically via autoscaling, without downtime); you can also statically provision a fixed number of nodes
Automatically shards data into multiple tablets across nodes in a cluster
Export is only supported via a Java utility or HBase commands (no console or gcloud support)
The command line tool is cbt not gcloud
Each table is a sorted key/value map
Related columns are grouped into families
Each value in a row is indexed with the row key, there are no other indexes in Bigtable except the row key
Use cases: IoT streams, real time analytics
Use Dataflow to export data to Cloud Storage
Designing Bigtable tables
Know what data you need to store
Know what your frequently used queries look like, and design your row keys based on this
Best practices
- Create multiple clusters, and set-up replication (either cross-zone or cross-region), for H/A and durability + put data closer to customers; can automatically failover
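Since the CLI is cbt rather than gcloud, a minimal sketch of creating and querying a table (project, instance, table, column family and row key are placeholders):

```bash
# Create a table and a column family, write one cell, then read it back
cbt -project my-project -instance my-bt-instance createtable sensor-data
cbt -project my-project -instance my-bt-instance createfamily sensor-data readings
cbt -project my-project -instance my-bt-instance set sensor-data device-123#20250201 readings:temp=21.5
cbt -project my-project -instance my-bt-instance read sensor-data count=1
```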
Section 20 - Creating Private Networks with Cloud VPC in Google Cloud Platform
VPC
Create private networks in the Cloud
Traffic within a VPC is isolated, not visible to other VPCs
You control all traffic going in and coming out of a VPC, via firewall rules
VPC is a global resource (unlike in AWS, where VPCs are regional)
Every project has a default VPC
You can create (and should) your own custom VPCs and not rely on the default one
Auto mode => subnets are automatically created in each region
The total IP range for an auto mode VPC is 10.128.0.0/9, subdivided into pre-defined subnets, one for each region (note that there is no such thing as a VPC CIDR: CIDRs are only defined at the subnet level, and there is no restriction that subnet CIDRs need to be part of a wider VPC CIDR range)
Firewall rules automatically added to auto mode VPCs (including the default VPC):
Implied Allow Egress (this is applicable across any VPCs and can’t be removed, but has lowest priority): Allows all egress (outbound) traffic from instances to any destination.
Implied Deny Ingress (this is applicable across any VPCs and can’t be removed, but has lowest priority): Denies all ingress (inbound) traffic to instances unless explicitly allowed by other firewall rules.
Default-allow-internal: Allows internal traffic between instances on the same VPC network.
Default-allow-ssh: Allows incoming SSH (TCP port 22) connections from any source.
Default-allow-rdp: Allows incoming RDP (TCP port 3389) connections from any source.
Default-allow-icmp: Allows incoming ICMP traffic (used for ping) from any source.
Custom mode => no subnets automatically created (recommended for production), only the two implied firewall rules applied
When creating VPCs, you are asked to provide firewall rules
- Note that when you create a Compute Engine instance and select “Allow HTTP(S) traffic”, GCP adds a firewall rule to the VPC whose target is defined by the http-server tag, and tags your instance with http-server as well
Subnets
Regional
Allow implementing different network needs (separate public and private resources), depending on workload (for example, an LB may need to be public, but a DB should be private and accessible only from apps inside the network)
Another reason for using multiple subnets is to distribute resources across regions
When creating subnets, you can enable private Google access, which allows workloads inside the subnet to access GCP APIs via Google’s private network, without going over the public internet
You can choose to enable Flow Logs (logging of networking traffic)
In GCP, there isn't a direct equivalent to AWS's "public subnet" concept. Instead, what makes a subnet effectively "public" is a combination of factors:
External IP addresses on the VMs/instances in that subnet
Firewall rules that allow inbound traffic from the internet
Routes that enable internet connectivity (GCP provides a default internet gateway)
All GCP subnets have routes to the internet by default (unlike AWS where you need to attach an internet gateway)
Rules for subnet CIDRs
VPCs do NOT have a predefined CIDR, so subnets can have any ranges, as long as they don’t overlap
Subnet CIDRs must be within the private IP address space as defined by RFC 1918:
10.0.0.0/8
172.16.0.0/12
192.168.0.0/16
The subnet CIDR block must be between /9 (for large ranges) and /29 (for small ranges)
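Pulling the subnet options above together, a minimal sketch of creating a custom mode VPC and one subnet with Private Google Access and flow logs enabled (names and ranges are illustrative):

```bash
# Custom mode VPC: no subnets are created automatically
gcloud compute networks create my-vpc --subnet-mode=custom

# One regional subnet, with Private Google Access and VPC Flow Logs enabled
gcloud compute networks subnets create my-subnet \
  --network=my-vpc --region=europe-west1 --range=10.10.0.0/24 \
  --enable-private-ip-google-access --enable-flow-logs
```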
Firewall rules
In GCP, Firewall Rules are the primary mechanism for controlling traffic to and from resources in a VPC. GCP does not have an equivalent to AWS's Network Access Control Lists (NACLs)
- AWS does not have an exact equivalent to GCP's firewall rules. However, AWS provides similar functionality through a combination of Security Groups (SGs) and Network Access Control Lists (NACLs)
Control traffic going in and out of the VPC
Stateful (allowed egress means allowed ingress, and vice versa)
Each rule has a priority, the lower the more priority
Implicit rules: allow all egress and deny all ingress, both at priority 65535; implicit rules CANNOT be deleted, but they can be overridden with higher priority rules
The default (and auto-mode) VPC has additional four 65534 rules (default-allow-internal, default-allow-rdp, default-allow-ssh, default-allow-icmp)
- When creating your custom subnets, VMs won’t be able to talk to one another, unless you also add a rule equivalent to default-allow-internal
You can also add your own rules
Ingress
Select the target (by service account or network tag - this is how “Allow HTTP(S) traffic” works: it adds the http-server tag and a firewall rule that targets this tag)
Select the source (CIDR range, all instances, or instances with a specific tag/SA)
Egress
Select the source (all instances, instances with a specific tag/SA)
Select the destination (CIDR block)
You select the priority
Select an action (Allow/Deny)
Select protocol
Select port
Enable/disable the rule
Best practices
Use network tags combined with firewall rules to control traffic going in/out of a VM
Restrict access to VMs fronted by an LB, by only allowing the LB CIDRs: 130.211.0.0/22 and 35.191.0.0/16
Egress is allowed by default; if you want to restrict it, create a deny egress rule with a higher priority
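For example, a hedged sketch of an ingress rule that only allows the load balancer health check ranges above to reach instances tagged http-server (rule and network names are placeholders):

```bash
gcloud compute firewall-rules create allow-lb-to-http \
  --network=my-vpc --direction=INGRESS --action=ALLOW \
  --rules=tcp:80,tcp:443 \
  --source-ranges=130.211.0.0/22,35.191.0.0/16 \
  --target-tags=http-server \
  --priority=1000
```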
Shared VPC
Allow resources in different projects to talk to one another, as long as they are in the same Organization
One Host project, which contains the shared VPC and multiple Service projects, attached to the Host project, where specific resources can use subnets from the Shared VPC
Achieves separation of duties, by allowing network admins to be responsible for the Host project and Resource user to access the Service projects
VPC Peering
Connects VPC networks (same project, different projects, different Orgs)
Connectivity happens across Google’s private network (secure + free)
Requires non overlapping CIDR ranges for all subnets
VPC Service Controls
Networking feature provided by Google Cloud that helps protect your cloud resources against accidental or targeted data exfiltration. It allows you to create security perimeters around your cloud services to control access and protect data
Security Perimeters: VPC Service Controls allows you to define security perimeters to protect resources and data of services like Cloud Storage and BigQuery. These perimeters act as a firewall for your cloud resources.
Ingress and Egress Rules: You can configure rules to control the access to and from resources within the defined service perimeters. This includes configuring Private Google Access on subnetworks in a VPC, firewall rules for OSI Layer 4 network protection, and IAM best practices.
Adaptive Access Control: VPC Service Controls provides adaptive access control based on IP range or device trust for accessing Google Cloud resources from outside privileged networks.
Section 21 - Operations in Google Cloud Platform
Cloud Monitoring
Tools to monitor your infrastructure
Measures key aspects (Metrics)
Create visualizations around metrics (graphs and dashboards)
Configure Alerts (when metrics signify that services are not healthy); you define the alert policies (condition, notifications, documentation)
- Notification channels: Google Chat, PagerDuty, Slack, Webhooks, E-mail, SMS, Pub/Sub
Workspace
Cloud Monitoring can be used to monitor one or more GCP Projects AND AWS Accounts
You create a Workspace, to organize monitoring information
Created in a specific Project, called the Host Project
You can monitor VMs for
CPU utilization
Disk traffic
Network traffic
For more metrics, install the Cloud Monitoring agent (collectd-based daemon), to send data to Cloud Monitoring
Uptime checks
Allow you to monitor the availability and responsiveness of your applications or resources. They work by sending HTTP, HTTPS, or TCP requests to your application at regular intervals from multiple locations around the world. If the requests fail or the responses are not as expected, an alert can be triggered to notify you of the issue.
Target: HTTP URL, App Engine, Instance, LBs
Cloud Logging
Real time log management and analysis tool
Exabyte scale, fully managed service
Can ingest log data from any source
When logs are ingested via GCP Cloud Logging, they are stored in a resource called a Log Bucket. Log Buckets are part of the Cloud Logging service and are used to store, organize, and manage log data. Each Google Cloud project comes with a set of default Log Buckets, but you can also create custom Log Buckets to meet specific needs.
Logs Explorer: search, sort and analyze using flexible queries
Logs Analytics: rich visualization (requires upgrading the Log bucket, at no extra cost)
Logs Metrics: capture metrics from logs (using queries/matching strings)
Logs Router: route different log entries to different destinations
Collection
Most managed GCP services send logs to Cloud Logging (GKE, App Engine, Cloud Run)
From GCE VMs: install Logging Agent (based on fluentd)
From on-prem: use BindPlane tool from Blue Medora, or use the Cloud Logging API
Audit and security logs
Access Transparency Log: captures actions performed by the GCP team on your content (only for Gold support orgs, or higher)
Cloud Audit Logs: who did what, when and where (service name, operation, resource, principal)
Admin activity logs: API calls or other actions that modify the configuration of resources
Data access logs: reading configuration of resources (disabled by default)
System event audit logs: GCP admin actions
Policy denied audit logs: when a principal is denied access
Controlling and Routing
Logs reach the Log Router
It checks against various rules (what to ingest, what to discard, where to send the logs)
Two types of buckets
_Required: holds Admin activity, System Events and Access Transparency Logs (400 days retention); you are not billed
_Default: all other logs (retained for 30 days by default, can configure 1-3650); you are billed, but you can disable ingestion to this bucket
For long term retention, export to Cloud Storage, BigQuery dataset, Pub/Sub topic (Base64 encoded entries)
Create sinks, in the Log Router, to export to the above destinations
Same log entry can be sent to multiple Sinks
Sinks can be disabled
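A minimal sketch of a sink that routes error logs to a Cloud Storage bucket for long term retention (sink and bucket names are placeholders; the sink's writer identity still needs write access on the destination):

```bash
gcloud logging sinks create errors-to-gcs \
  storage.googleapis.com/my-log-archive-bucket \
  --log-filter='severity>=ERROR'
```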
Cloud Trace
Distributed tracing system
You leverage tracing libraries (C#, Go, Java, Node.js, PHP, Python, Ruby), that call the Cloud Trace API
Supported for GCE, GKE, AE, etc.
Cloud Debugger was shut down in 2023
Cloud Profiler
Identify performance bottlenecks
Low-overhead, collects CPU and memory usage
Two major components
Agent - collects info
Interface - visualization
Error Reporting
Centralized Error Management console
Realtime exception monitoring
Aggregates and displays errors reported from cloud services (using stack traces)
Use Firebase Crash Reporting for errors from Android and iOS client apps
Errors can be reported by sending them to Cloud Logging, or by calling the Error Reporting API
Section 22 - Exploring IAM and Project Organization in Google Cloud Platform
Resource hierarchy: Organization → Folder → Project → Resources
IAM policies can be applied at any level, and resources inherit policies from all parents (effective policy is the union of all applicable policies); if a permission is given at a higher level, you cannot restrict it at a lower level
Recommendations
Create separate projects for different environments
Create separate projects for different apps
Create separate folders for each department
Create shared folders for shared resources
One project per application per env
- Consider two apps A1, A2, each needing 2 envs => create 4 projects: A1-DEV, A1-PROD, A2-DEV, A2-PROD
Billing Accounts
Mandatory for creating resources
One account can be associated with one or more Projects
Each Project MUST have a Billing Account
Create Billing Accounts that reflect your organization structure
Types
Self Serve: billed directly to card/bank account
Invoiced: for large enterprises
Managing Billing
Set up a Cloud Billing Budget and configure Alerts
You can set up Billing export to either GCS (file export) or BigQuery
IAM best practices
Least Privilege to perform a role (this is why Basic roles are not recommended)
Use Service Accounts with minimal privileges; different Service Accounts for different apps/purposes
Separation of duties: involve at least 2 people in sensitive tasks (for example, separate deployer and traffic migrator roles, in App Engine)
Constant monitoring: Review Cloud Audit Logs; archive Logs in GCS for long term retention
Use Groups when possible
User identity management
For Enterprises, use Google Workspace (formerly G Suite) to manage users and their access
You can also use your own IdP, if you don’t use Google Workspaces, through federation
- SSO: users are redirected to the external IdP, and a SAML assertion is sent to Google Sign-in
Cloud Identity: unified identity, access, app and endpoint management platform (manages identities)
Google Cloud Identity is focused on identity and access management, offering robust security and management features. It is suitable for organizations that need to manage users and devices without the additional productivity tools.
Google Workspace provides a comprehensive suite of productivity tools along with basic identity management features. It is ideal for organizations that need collaboration and productivity tools integrated with user management.
Members and identities
Google Account - represents a person (an e-mail)
Service Account - application account (not person, but still represented with an e-mail)
Group - collection of Google and Service accounts
Google Workspace domain
Cloud Identity domain
Organization Policies
Centralized constraints on all resources created in an Organization (disable creation of Service Accounts, disable creation of resources in a specific region)
Requires the Organization Policy Administrator role
IAM focuses on WHO, Org Policies focus on WHAT
Org policies override IAM
GCP offers some predefined policies, for example “Enforce uniform bucket-level access”
Predefined BigQuery roles
Admin → can do everything
Data Owner → datasets.*, models.*, routines.*, tables.* (does NOT have access to jobs)
Data Editor → tables, models, routines, datasets (no access to jobs)
Data Viewer
Job User → can run queries
To see data, you need roles like User or Data Viewer; the Job User role alone is not enough
Section 23 - Compute Engine Virtual Machines
SSH-ing into Linux VMs
Metadata managed: manually create and manage SSH keys
OS Login: allows you to manage SSH access to your Compute Engine instances using IAM (Identity and Access Management) roles. This eliminates the need to create and manage individual SSH keys for each user
Linux account is linked to Google Identity
Recommended when multiple users SSH into VMs across different Projects
Set the enable-oslogin metadata key to true
You can import Linux accounts from on-prem AD/LDAP
Users need to have the appropriate roles
When using the GCP UI’s SSH button, a temporary ephemeral SSH key is used to log in; this works with both Metadata managed, and OS Login
gcloud compute ssh will create a username and a persistent SSH key pair, which will be re-used for future interactions
You can also use your own custom SSH keys, for both OS Login and Metadata managed; they are just different ways to upload the public keys
For Windows, passwords are used
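A hedged sketch of enabling OS Login on a single VM and granting a user SSH access via IAM (project, VM, zone and user are placeholders; the metadata key can also be set project-wide):

```bash
# Enable OS Login on one instance via metadata
gcloud compute instances add-metadata my-vm --zone=europe-west1-b \
  --metadata enable-oslogin=TRUE

# Grant the OS Login role (use roles/compute.osAdminLogin if sudo is needed)
gcloud projects add-iam-policy-binding my-project \
  --member=user:jane@example.com --role=roles/compute.osLogin
```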
Shutdown scripts
Runs on best effort basis (won’t work for hard resets for example)
Similar to startup script, except that it runs before stopping/terminating/restarting
Work for both preemptible and non-preemptible
Runs as root/system account
Stored as metadata (shutdown-script key, similar to startup-script)
You can also store your startup and shutdown scripts as files in Cloud Storage
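A minimal sketch of attaching a shutdown script, either from a local file or from Cloud Storage (VM, zone, file and bucket names are placeholders):

```bash
# Attach a local shutdown script as instance metadata
gcloud compute instances create my-vm --zone=europe-west1-b \
  --metadata-from-file shutdown-script=./cleanup.sh

# Or reference a script stored in Cloud Storage
gcloud compute instances create my-vm2 --zone=europe-west1-b \
  --metadata shutdown-script-url=gs://my-scripts-bucket/cleanup.sh
```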
Troubleshooting VM startup
Are there Quota errors? (can check this in Notifications)
Is boot disk full?
Check serial port output, from the console, gcloud, Compute Engine API, or Cloud Logging - if you send the output there (every VM has 4 virtual serial ports, where outputs are written to)
Does the disk have a valid file system? Attach the disk to another VM and investigate
Moving instances across zones/regions
Moving between zones is simple: gcloud compute instances move
Cannot use if instances are part of a MIG
Cannot use if instances have local SSDs
Cannot use if instances are in terminated status
Cannot use across regions
Manual approach
Create snapshots of all attached disks
Create copies of PDs in destination region/zone, using the previously created snapshot
Launch a new instance, attach the PD
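A hedged sketch of the manual cross-region move described above (disk, snapshot, instance and zone names are placeholders):

```bash
# 1. Snapshot the source disk
gcloud compute disks snapshot my-disk --zone=europe-west1-b --snapshot-names=my-disk-snap

# 2. Create a new disk from the snapshot in the destination zone
gcloud compute disks create my-disk-copy --zone=us-central1-a --source-snapshot=my-disk-snap

# 3. Launch a new instance with the copied disk as its boot disk
gcloud compute instances create my-vm-copy --zone=us-central1-a \
  --disk=name=my-disk-copy,boot=yes
```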
Section 24 - Asynchronous Communication in Google Cloud with Cloud Pub Sub
Advantages of async communication with message brokers
Decoupling
Availability
Scalability
Durability
Pub/Sub
Reliable, scalable, fully-managed, async messaging service
Auto-scales to billions of messages per day
Low-cost, pay for use (number of messages)
Use cases: event ingestion, delivery for streaming analytics pipelines
Supports both push and pull message delivery (unlike AWS, where these are separate services: SNS and SQS)
Create a topic, where publishers send messages and receivers process them through subscriptions
Supports dead-letter topics (DLTs)
Can configure retention period, expiration of subscription, retry policies, etc.
Pub Sub Lite is cheaper, but only replicates data in one Zone
Subscriptions
Pull - make HTTPS request to the pub sub API, to ask for messages
Push - Message automatically sent out to all subscribers, to the webhook endpoint registered by the receiver
- In push delivery, Pub/Sub initiates requests to your subscriber application to deliver messages. Messages are delivered to a publicly addressable server or a webhook, such as an HTTPS POST request.
Subscriptions
Discrete pool of messages from a topic
All messages on a topic are available to all subscriptions (they are not like partitions in Kafka, but more like topics)
However, if multiple clients pull from the same subscription, then messages are split between clients (in this respect, a subscription is similar to a consumer group)
Once a client consumes the message, it sends an ACK back to Pub Sub; the ACK is per subscription
After the ACK is received, the message is removed from the subscription
You can take snapshots of subscriptions
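A minimal end-to-end sketch with the gcloud CLI (topic and subscription names are placeholders):

```bash
gcloud pubsub topics create orders
gcloud pubsub subscriptions create orders-worker --topic=orders --ack-deadline=30

# Publish a message, then pull (and ack) it from the subscription
gcloud pubsub topics publish orders --message='{"orderId": 42}'
gcloud pubsub subscriptions pull orders-worker --auto-ack --limit=1
```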
Cloud Dataflow
Unified stream and batch processing
Based on Apache Beam (supports many languages)
Serverless and autoscaling
Fast and cost effective
You define jobs, that can be real time or batch
There are a lot of templates available to configure
Can do export/import to a variety of services
Section 25 - Implementing Hybrid Cloud with Google Cloud
Cloud VPN
Connect on-prem to GCP, over the public internet (NOT private)
IPSec VPN tunnel, with encrypted data traveling through it
IKE - internet key exchange protocol is used to encrypt the data
One or more gateways in GCP and one or more in your own data center
Cloud Router enables dynamic routing
Two types
H/A VPN (regional, NOT global)
99.99% availability
Redundant IP addresses: 2 interfaces and 2 IP addresses
Only dynamic routing (BGP) is supported
Classic VPN
99.9% availability
Single IP address
Supports static routing, as well as dynamic
Easy to establish, no extra contracts required
Use cases: lower cost, encryption requirement, experimenting with on-prem to GCP connectivity
Cloud Interconnect
High-speed, highly available, private connection (not over public internet)
Reduces egress cost, because traffic does not go over the internet
Two options
Dedicated
Very high bandwidth, for large transfers
10 Gbps or 100 Gbps
You can also have multiple circuits: 8 x 10 Gbps, or 2 x 100 Gbps, per connection
Can take a few weeks to establish
Partner
Bandwidth needs are lower
50 Mbps to 10 Gbps
Doesn’t require the same set-up time
Best practices for hybrid connectivity
Resources on the networks should use different CIDRs
Have fallback connections, if the primary fails (dedicated interconnect → VPN)
Section 26 - Exploring Data Warehouses in Google Cloud - BigQuery
Exabyte scale data warehouse solution
Relational database, supports SQL, consistency, etc
Data organized into datasets, which contain tables
Offers traditional (storage + compute) + modern (realtime + serverless) approaches
Importing and exporting
Load from a variety of sources, including streaming data, batch, federation, BigQuery Transfer Service
You can’t just add a Table and add records from the UI, like you can for Firestore for example
Batch is free, streaming is expensive; can batch import from Cloud Storage, or local files
You can preprocess data with Dataflow or Dataproc (managed Hadoop), before loading into BigQuery
Stream from Pub Sub, or use streaming inserts; can still use Dataflow/Dataproc to preprocess
Federated queries allow you to query external sources
BigQuery Data Transfer Service: load from external sources, such as RedShift, S3
Realtime inserts can contain duplicates; add an insertId to deduplicate (for up to 1 minute)
Max 1 GB/s per Project
If you are streaming millions of records, prefer Bigtable
Load from variety of formats: CSV, JSON, Avro, Parquet, ORC, etc.
Export to GCS (long term storage, archiving) and Data Studio (for visualization) - CSV, JSON, Avro
Data can be auto-expired, after a certain amount of time (configurable table expiration)
You can also query external data sources, without needing to store it in BigQuery (by using permanent or temporary external tables)
Queries can be expensive when running on large datasets; you should first estimate them (either via the UI, or bq query --dry_run); you pay for scanned data, not returned data
Partitioning helps divide tables into multiple segments, allowing more cost efficient and faster scans
You can partition on any column of type DATE, DATETIME, TIMESTAMP or INTEGER
Partitions share the same schema as the main table
Avoid creating partitions that are too small (<1 GB); in this case, prefer clustering
Clustering colocates related data: rows with the same cluster key value are stored close together
Expiring data
Hierarchy: Dataset → Table → Partition
Expiration can be set at all of these levels
BigQuery best practices
Estimate queries before running them
Use clustering and partitioning
Avoid streaming inserts, prefer batch (free)
Expire data automatically
Consider the long term storage option, similar to GCS Nearline (tables where data has not been edited for 90 days)
Not well optimized for narrow range queries (prefer Bigtable for those); use BigQuery for complex analytical queries
Use Audit Logs
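A hedged sketch of estimating a query and creating a partitioned, clustered table with the bq CLI (dataset, table and column names are placeholders):

```bash
# Dry run: reports how many bytes the query would scan, without running it
bq query --nouse_legacy_sql --dry_run \
  'SELECT user_id, COUNT(*) FROM my_dataset.events GROUP BY user_id'

# Create a table partitioned by day and clustered by user_id, with 90-day partition expiration
bq query --nouse_legacy_sql \
  'CREATE TABLE my_dataset.events_v2 (
     user_id STRING,
     event_ts TIMESTAMP,
     payload STRING
   )
   PARTITION BY DATE(event_ts)
   CLUSTER BY user_id
   OPTIONS (partition_expiration_days = 90)'
```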
Dataproc
Managed Spark and Hadoop service
Supports Spark, PySpark, Pig, Hive, Hadoop
Used for complex batch processing
You create clusters, and choose the characteristics of VMs (can use preemptible VMs)
You choose either GCE or GKE
Options
Standard (1 master, N nodes)
Single node (1 master, 0 nodes)
H/A (3 masters, N nodes)
You submit jobs, such as through jar files
You can also submit workflows
- A Workflow is an operation that runs a Directed Acyclic Graph (DAG) of jobs on a cluster
BigQuery is an alternative, if you need to use SQL queries on terabytes of data
Section 27 - Exploring Data Lifecycle and Data Architectures in Google Cloud
Data lifecycle
Ingest
Stream
- Pub Sub: you can use the Google Cloud Console, the gcloud command-line tool, the Pub/Sub API, or the Cloud Client Libraries to publish messages
Batch
- Storage Transfer Service, BigQuery Transfer Service, Transfer Appliance, gsutil
Database migration
Database Migration Service to simplify migrations to Cloud SQL
For non Cloud SQL, you can batch transfer DB backups to Cloud Storage and load data from GCS to DB using Dataflow
Store
Durably, and cost efficiently, in a convenient format
Cloud Storage (object storage, unstructured data)
Cloud SQL (managed MySQL, PostgreSQL and MS SQL dbs, pre-defined schema, strong transactions, regional, can only horizontally scale reads)
Cloud Spanner (global, horizontally scalable for both writes and reads, 99.999% availability, strong consistency)
Cloud Firestore (document NoSQL DB, scalable, supports txs)
Cloud Bigtable (wide-column NoSQL DB, huge data, large scale analytical time series workloads, only supports single-row txs)
BigQuery (complex analytics, managed data warehouse)
You can also create your own custom DB, from Cloud Marketplace (MongoDB, Cassandra, etc.); you are responsible for everything non functional related to it
Process and analyze
Convert data to information (normalization or aggregations): raw data > actionable information (clean, transform)
Dataprep (powered by Trifacta): clean and prepare data, No Ops, clean data from external sources, prepare data for ML; visual approach for non programmers
Data Loss Prevention (DLP): scan, discover, classify and report on data in GCS, BigQuery and Datastore (mask, tokenize and transform sensitive elements)
Dataflow: more flexible ETL pipelines (fully managed, NoOps, support both batch and streaming)
Dataproc: managed service for Spark and Hadoop (batch); needs a cluster; Machine Learning or existing Spark workloads that you want to move to the cloud
Explore and visualize
Get and share insights
BigQuery
ML - Pre built models: Vision API, Speech-to-Text, Natural Language API, Video Intelligence API, etc.
ML - Custom models: use AI Platform (managed service, based on TensorFlow); use Dataflow to preprocess
Datalab (was shut down and migrated to Vertex AI Workbench): web based tool to explore, analyze and visualize data; based on Jupyter notebooks (use Python, SQL queries, etc.); support for popular toolkits like pandas, numpy, scikit-learn
Data Catalog: data discovery and metadata management; unified view of all datasets; you can tag sensitive data using DLP (Data Catalog is deprecated as of February 3, 2025 and will be discontinued by January 2026. Transition to Dataplex Catalog by January 2026 for continued catalog usage)
Data Catalog allows you to discover, manage, and understand data assets across Google Cloud Platform. Data Catalog API natively indexes Cloud BigQuery, Cloud Storage, and Cloud Pub/Sub data assets.
The Data Catalog API can be used to:
Search for data assets across different projects and GCP resources
Create tags and tag templates to add structured business metadata unique to your organization
Automate the tagging process of data assets as part of a data ingestion pipeline
Big Data and Analytics in GCP
Pub/Sub: foundation for stream analytics and event driven systems
BigQuery: serverless data warehouse to analyze petabytes of data
Data Studio (now Looker Studio): managed visual analytics service
Dataflow: data pipelines (Stream + Batch), take data from a source and dump it to a sink
- Google Cloud Dataflow is ideal for unified stream and batch data processing at scale, creating data pipelines that can read, transform, and write data to various destinations. It is particularly suited for real-time analytics, machine learning, and ETL operations, leveraging Apache Beam's model for both batch and streaming tasks, making it perfect for complex data integration tasks that need robust scalability.
Dataproc: managed Spark and Hadoop clusters
- Google Cloud Dataproc is best for managing Hadoop and Spark clusters, making it suitable for big data processing, ETL, and machine learning relying on these ecosystems. It simplifies cluster creation, management, and scaling, supports tools like Hive and Pig, and integrates with Google Cloud services, ideal for migrating on-premises Hadoop and Spark workloads to the cloud with features like autoscaling.
Dataprep: clean and prepare data (structured and unstructured)
Datalab (deprecated and replaced with Vertex AI workbench)
Cloud Composer: managed workflow orchestration; based on Apache Airflow; create pipelines across clouds and on-prem data centres
- Ideal for managing complex workflows that involve multiple cloud services and require coordination and scheduling of tasks. Example: orchestrating a series of jobs across BigQuery, Dataflow, and other services.
Big Data flows
Batch ingest into BigQuery: Cloud Storage → Dataproc / Dataprep / Dataflow → BigQuery
Realtime querying based on streaming data: Pub Sub → Dataflow → BigQuery / Bigtable
IoT
- IoT Core (registration, AuthN/Z of devices) → Pub Sub → Dataflow → Datastore / BigQuery / Bigtable
Data lakes
Storage: GCS
Ingestion: Pub Sub + Dataflow, or Transfer Service / Transfer Appliance / gsutil
Processing/analytics: BigQuery (federated queries)
Section 28 - Caching
Cache data which changes infrequently
Returned data may be stale - you can configure a TTL
Use cases: infrequently changing DB data, user sessions, static content
Memorystore
Fully managed (provisioning, replication, failover and patching)
99.9% availability SLA
Redis
Use for low latency access with persistence and high availability
Tiers: Basic (no H/A), Standard (failover replica in a separate zone)
Supports up to 300 GB storage space
Memcached
Use for reference data, database query caching, session store
Configure region/zone
Configure 1-20 nodes, memory per node (1-256GB) and CPUs per node
Data is ephemeral, if a node crashes, the data is lost
Valkey
- Memorystore for Valkey is a fully-managed, horizontally scalable Valkey service for workloads that demand the lowest possible latencies
Can be accessed from most compute services: Compute Engine, App Engine, GKE, Cloud Functions
APIs/SDKs are available
App Engine has a legacy in-memory data cache
Shared memcache: free
Dedicated memcache: paid, predictable performance
Cloud CDN
Use Google’s global edge network to serve content with low latency
Integrates with External HTTP(S) Load Balancing
Supported back-ends
Cloud Storage bucket
Instance Groups
App Engine
Cloud Run
Cloud Functions
Endpoints outside of Google Cloud (custom origins)
External HTTP(S) LB uses proxies called Google Front Ends
Requests arrive at a GFE
If URL maps to a back-end with CDN configured, then GFE checks cache and either returns or forwards to back-end
When you configure the back-ends for an external HTTPS Load Balancer, there is a checkbox for “Enable Cloud CDN”, and you can configure the type of content to cache (static content, use origin settings, force cache all content)
You can choose which components of the HTTP requests to be used for caching (protocol, host, query string)
Supports TTL
Best practices
Cache static content
Be careful with expiring time-sensitive (dynamic) content - use shorter cache periods
Use custom cache keys to improve cache hit ratio - customize cache key
Use versioned URLs to update content (enables you to have high TTLs, because a new version causes a cache miss anyway)
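A minimal sketch of enabling Cloud CDN on an existing global backend service (the backend service name is a placeholder):

```bash
gcloud compute backend-services update web-backend --global \
  --enable-cdn --cache-mode=CACHE_ALL_STATIC
```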
Section 29 - Agile, DevOps, SRE and SDLC Evolution
DevOps
Getting better at the three elements of great software teams: Business, Development, Operations
Communication - get teams together
Feedback - find problems early
Automation - automate testing, infrastructure provisioning, deployment, monitoring
DevOps practices
Automated CI/CD pipelines
Tests - unit, integration, system, sanity/regression
Static code analysis, including static security checks
Runtime checks - vulnerability scanners
DevOps in Google Cloud
Cloud Source Repositories - fully-featured, private Git repositories
Container Registry - store Docker images
Cloud Build - build deployable artifacts (jars or Docker images) from source code and configuration; works off of a cloudbuild.yaml file, a Dockerfile, or Buildpacks
Spinnaker - supports deployment to Compute Engine, GKE, App Engine
Infrastructure-as-Code
Treat infrastructure the same way as application code
Track changes over time
Bring repeatability
Infrastructure provisioning: Terraform, Google Cloud Deployment Manager
Configuration management (installing software on provisioned resources): Chef, Puppet, Ansible, SaltStack
Cloud Deployment Manager
Version control for your environments
Equivalent to CloudFormation in AWS
Free to use, but you pay for the resources
Automate deployment and modification of GCP resources, in multiple environments
Avoid configuration drift
Tip: always modify resources via Deployment Manager, not manually!
Configuration defined in a YAML file; understands dependencies between resources
Automatic rollbacks on errors (easy to retry)
Terminology
Configuration file: YAML file with resource definitions for a deployment
Templates: reusable resource definitions that can be used in multiple configuration files
- Python (preferred) or Jinja2 (for simple scripts)
Deployment: collection of resources that are deployed and managed together
Manifests: read-only object containing the original deployment configuration; automatically generated and contains fully-expanded resource list
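A hedged sketch of a minimal configuration file and the corresponding deployment commands (resource names, zone and image are illustrative):

```bash
cat > config.yaml <<'EOF'
resources:
- name: demo-vm
  type: compute.v1.instance
  properties:
    zone: europe-west1-b
    machineType: zones/europe-west1-b/machineTypes/e2-micro
    disks:
    - deviceName: boot
      boot: true
      autoDelete: true
      initializeParams:
        sourceImage: projects/debian-cloud/global/images/family/debian-12
    networkInterfaces:
    - network: global/networks/default
EOF

# Create the deployment, and later update it from the same (modified) config
gcloud deployment-manager deployments create my-deployment --config=config.yaml
gcloud deployment-manager deployments update my-deployment --config=config.yaml
```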
Cloud Marketplace (Cloud Launcher)
Central repo of easily deployable apps and datasets
Similar to App Store/Play Store
You search and install a complete stack (SAP HANA, LAMP, WordPress, Cassandra, Jenkins, etc.)
SRE
SRE teams focus on all aspects of an application: availability, latency, performance, efficiency, change management, monitoring, capacity planning
Key principles
Manage by SLOs
Minimize toil (manual work) - automate as much as possible
Move fast by reducing the Cost of Failure (frequent small releases + automation)
Share ownership with developers (responsible for monitoring, not just developing)
Key metrics
SLI (Service Level Indicator): quantitative measure of an aspect of the service
Availability, latency, throughput, durability, error rate
Typically aggregated (over x minutes)
SLO = SLI + target
99.99% availability
99th percentile response time of 1 second
SLA = SLO + consequences (contract)
- Have stricter internal SLOs than external SLAs
Error budgets = 100% minus SLO
How well is a team meeting their reliability objectives
Used to manage development velocity
Best practices
Handling excess loads
Load shedding - API limits, drop some data in a streaming application
Reduce QoS - instead of talking to a recommendations API, return a hardcoded set of products
Avoid cascading failures
- Plan to avoid thrashing - circuit breaker, reduced QoS
Penetration testing
- No need to inform Google, but you are only allowed to test your projects
Load testing
Resilience testing
Chaos Testing
Add stress to some layers
Include the network in your testing (VPN, Cloud Interconnect)
Section 30 - Release Management in Google Cloud Platform
Goals - depending on these, there are different ways to perform releases
Zero downtime?
Only one version live?
Minimize costs?
Test V2 with some traffic before rolling out?
Best practices
Small incremental changes
Automation
Deployment approaches
Recreate
Replace V1 with V2 in-place (on the same machines)
Terminate V1, rollout V2 immediately
App will be down during the release
Rollback needs redeployment and more downtime
Cost effective and fast
You avoid the need for backward compatibility (with DBs and other apps), because only one version is live at one point in time
Canary
Initial small V2 rollout to a subset of instances (some live traffic will be served by V1, some by V2), followed by rollout to the rest
Fast
Zero downtime
No extra infrastructure
Minimize impact to users
Needs backward compatibility
Supports A/B testing (check if users like a feature)
Rolling
V2 rolled out to a percentage of instances, then gradually rolled out to the rest
Slow
Zero downtime
Requires automation and additional setup
No extra infra required
Minimize impact to users
Needs backward compatibility
An alternative is rolling with additional batch (a little bit of extra infrastructure, for the benefit of 0 reduction in capacity to handle traffic)
Blue/Green
Create a parallel environment, with V2
Switch all traffic from V1 to V2, at once
From the POV of the user, they only see one version at a point in time
Zero downtime
Easy rollback (you point traffic to the Blue env)
Needs additional infra
Needs backward compatibility
Allows for shadow testing: mirror traffic to both V1 and V2, to see how each responds; allows you to capture prod traffic and replay it; complicated
MIG deployment approaches
Rolling: gcloud compute instance-groups managed rolling-action start-update my-mig --version=template=v2-template (the --max-surge and --max-unavailable flags can be used)
Canary: gcloud compute instance-groups managed rolling-action start-update my-mig --version=template=v1-template --canary-version=template=v2-template,target-size=10%
Blue Green: manual approach; create a new MIG and adjust LB back-ends as needed
App Engine - releasing new versions
gcloud app deploy - deploy and shift all traffic at once
gcloud app deploy --no-promote - deploy without shifting traffic
gcloud app services set-traffic s1 --splits=v2=1 - shift all traffic to v2
gcloud app services set-traffic s1 --splits=v2=1 --migrate - gradual traffic migration
gcloud app services set-traffic s1 --splits=v2=.5,v1=.5 - A/B testing (50/50 split)
GKE - releasing new versions
Recreate (set strategy type on Deployment to Recreate)
Rolling update; default (RollingUpdate, configure maxSurge and maxUnavailable)
Blue Green: create new deployment and use Ingress or Service to control traffic
Canary: requires a service mesh, like Istio
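A minimal kubectl sketch of a rolling update and rollback (deployment, container and image names are placeholders; a Deployment's strategy defaults to RollingUpdate):

```bash
# Roll out v2 by updating the container image; Pods are replaced gradually
kubectl set image deployment/my-app app=europe-docker.pkg.dev/my-project/my-repo/app:v2
kubectl rollout status deployment/my-app

# Roll back to the previous revision if v2 misbehaves
kubectl rollout undo deployment/my-app
```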
Section 31 - Compliance and Regulations for Your Google Cloud Solutions
GCP is compliant with ISO/IEC 27001, 27017, 27018 and 27701, PCI DSS, SOC 1, SOC 2, COPPA, HIPAA, GDPR
HIPAA compliance
Shared responsibility: GCP supports HIPAA compliance, customer must evaluate compliance with HIPAA
Execute a Google Cloud Business Associate Agreement (BAA), from your account manager
Do not use GCP products that are not covered by BAA
Follow IAM best practices
Enable object versioning
Export audit logs to GCS and BigQuery
Disable request caching for PHI in Cloud CDN
PCI DSS compliance
Create a new Google Cloud account for payments processing
Restrict access to this env by using the principle of least privilege
Control inbound and outbound traffic (firewall rules to only allow HTTPS requests from customers and TPPs)
GCE and GKE are recommended, because App Engine and Cloud Functions do not support egress firewall rules
Create an HTTPS load balancer with signed SSL certificate
Use hardened VM images
Implement Forseti Security, by using the following tools: Inventory, Scanner, Enforcer, Explain
Enable VPC flow logs, access transparency logs, firewall rules logging and configure monitoring alerts
Use DLP to sanitize data
Section 32 - Planning Cloud Migrations to Google Cloud Platform
Phase 1: assess the workloads to be migrated
Take inventory and catalog apps
Experiment and design proofs of concept (move a sample app/DB to cloud)
Calculate TCO (total cost of ownership)
Choose which workloads to migrate first
Phase 2: plan the foundation
Design resource org hierarchy (folders, project), how to configure IAM (users, groups, AD integration), VPCs, networks, subnets and how to connect from on-prem (VPN, Interconnect)
Plan for security, monitoring and governance
Plan your migration team
Phase 3: gradually deploy the workloads
Migrate data: take into account cost, time, offline/online transfers
Deploy apps (choose the right compute options)
Prefer automation (Ansible, Chef, Puppet, Jenkins, Spinnaker, Terraform, Deployment Manager)
Phase 4: optimize environment
Logging, monitoring, alerting
Use managed services
Enable auto-scaling
Approaches
Rehosting (lift and shift)
Replatforming (few adjustments to suit the cloud, for example containerizing)
Repurchasing: move to a new Cloud product, like a managed DB
Refactoring: make use of serverless architectures
Retiring: end of service
Retaining: keep as is, on-prem
MS SQL Server to GCP migration
Create a Cloud SQL instance
Move backup of your DB to Cloud Storage
Import data into Cloud SQL for SQL Server
Deploying containers to GCP
App Engine flexible: highly scalable, but doesn’t scale down to 0; can’t customize GCE VMs
Cloud Run and Cloud Run for Anthos: highly scalable and can scale down to 0; cannot customize the env
GKE and Anthos clusters (K8s in different clouds and on-prem): lot of management overhead, but lots of flexibility
Compute Engine: use COS image with Docker installed; not recommended, because you need to manage everything
Understand business requirements such as
Reduce costs - CapEx vs OpEx; TCO (licensing, compute, storage, networking, ingress, egress, personnel)
Use managed services
Use auto-scaling
Use preemptible VMs
Increase pace of innovation
- Evaluate and adopt emerging processes like DevOps and SRE, microservices
Reduce mean time to recovery
Improve compliance
More visibility into apps and infra
More intelligence from available data
KPIs for business requirements
Key Performance Indicators: measure of how well a business is doing with respect to an objective
Examples: number of new customers onboarded, % of VMs running in the cloud, operational cost per customer
Define technical requirements
Functional: must use containers, must use hardened OS, must be auto-scaling, private network, huge volumes of data etc.
Non-functional
Availability
Geographical distribution (Global → Multi-regional → Regional → Zonal)
Compute Engine: Live Migration, MIGs with Auto Scaling and Health Checks; distribute instances using LBs
GKE: multi-master, regional clusters with Pod and cluster autoscaling
Managed services: App Engine, Cloud Functions, GCS, Filestore, Datastore, BigQuery
Persistent Disks: live resizing, use regional PDs
Bigtable: place clusters in different zones or regions
Datastore: use multi-region locations
Cloud SQL: HA configuration (regional)
Network Tier: prefer Premium Network Tier
Hybrid: Dedicated IC > Partner IC > VPN
Scalability
VMs in MIG, configured with Instance Template + Load Balancing
Pod and Cluster Autoscaling in GKE
Be cautious with resources that cannot scale fast (Cloud SQL)
Persistent disks can be scaled both horizontally and vertically (increase size, attach more disks)
All serverless offerings are autoscaling (Bigtable, Cloud Spanner, Cloud SQL, Dataproc are NOT serverless, you need to provision compute power)
Durability
Security
Confidentiality: follow IAM best practices, encrypt data in transit and at rest
Integrity: protect data from unauthorized change; RBAC, separation of duties, hash verifications and digital signatures
Availability: system/data is available for the users when they need them; firewalls (permit only the right kind of traffic and prevent attacks), redundancy, auto-failover, DDoS protection (Cloud Armor)
Digital signatures - Cloud KMS
Verification of the integrity of signed data + non-repudiation (protection in case the signer claims the signature is not authentic)
Workflow:
Sender performs private key operation on the data, to create a digital signature
Recipient uses the public key to verify the digital signature
Cloud KMS can be used to create an asymmetric key pair that supports digital signing
gcloud kms asymmetric-sign
Use cases: validating code builds
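A hedged sketch of signing a file's digest with an asymmetric signing key in Cloud KMS (key ring, key, version and file names are placeholders):

```bash
gcloud kms asymmetric-sign \
  --location=global --keyring=my-keyring --key=my-signing-key --version=1 \
  --digest-algorithm=sha256 \
  --input-file=./artifact.bin \
  --signature-file=./artifact.sig
```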
Cloud Armor
Google Cloud Armor is a Web Application Firewall (WAF). It provides protection for web applications by filtering and monitoring HTTP traffic between a web application and the Internet. Cloud Armor is part of Google Cloud Platform and offers features similar to other WAFs, including protection against DDoS attacks, SQL injection, and cross-site scripting (XSS)
Protect apps from DDoS and OWASP top 10 attacks
Protects apps deployed in GCP, hybrid, or in a multi-cloud architecture
Preconfigured security policies exist, such as OWASP Top 10 risks
Use cases
Enable access for users at specific IP addresses, using allowlists
Block access from IP addresses with denylists
Secret Manager
Store API keys, passwords, etc.
Supports multiple versions of secrets, secret rotation using Cloud Functions, auditing with Cloud Audit Logs
Encrypted by default
Secrets can be stored in multiple regions
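A minimal sketch of creating, versioning and accessing a secret with gcloud (secret name and value are placeholders):

```bash
gcloud secrets create db-password --replication-policy=automatic
echo -n "s3cr3t-value" | gcloud secrets versions add db-password --data-file=-
gcloud secrets versions access latest --secret=db-password
```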
Data Management
How does the data come in?
What rate will we receive it?
What kind of data?
How much data?
For how long?
Who will have access?
How will we use it?
Section 34 - Exploring More Google Cloud Platform (GCP) Services
Cloud Scheduler
Batch, big data jobs, cloud infra operations, etc.
Cron format
Supported targets: HTTP, Pub/Sub, App Engine, Workflows
Provides automated retries
Use case: you have a set of GCE instances and want to execute a task across all of them: use Cloud Scheduler to put a message on Pub/Sub and have the instances listen to it
Needs an App Engine App in the Project (it’s built on an older App Engine Cron Service)
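For example, a hedged sketch of a job that publishes to a Pub/Sub topic every 5 minutes (job name, topic and location are placeholders):

```bash
gcloud scheduler jobs create pubsub periodic-task-trigger \
  --location=europe-west1 \
  --schedule="*/5 * * * *" \
  --topic=task-topic \
  --message-body='{"action": "run"}'
```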
Cloud Emulators
Develop GCP apps locally
Supports emulation of: Bigtable, Datastore, Firestore, Pub Sub, Spanner
Cloud DNS
Global Domain Name System
Private and public managed DNS zones (container for records)
gcloud dns managed-zones create ZONE_NAME, with the --description, --dns-name, --visibility and --networks flags
Three steps to add records
Start transaction for Zone
Make changes
End transaction for Zone
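A minimal sketch of that transaction flow for adding an A record (zone name, record name and IP are placeholders):

```bash
gcloud dns record-sets transaction start --zone=my-zone
gcloud dns record-sets transaction add "203.0.113.10" \
  --name=www.example.com. --ttl=300 --type=A --zone=my-zone
gcloud dns record-sets transaction execute --zone=my-zone
```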
Pricing Calculator
Estimate the cost of a GCP solution
Estimates for 40+ services: GCE, GKE, Cloud Run, AE
Just estimates
Anthos
GCP Anthos, introduced by Google Cloud, is a hybrid and multi-cloud platform designed to build and manage modern hybrid applications across different environments. It is powered by Kubernetes and other open-source technologies, allowing you to run applications unmodified on existing on-premises hardware or in public cloud environments. Anthos provides a container environment that works across multiple clouds and bridges the gap between legacy software and cloud hardware.
Consistent cluster management, development and ops experience
Centralized config management (Git repo): logically group and normalize clusters as environs
- Use Namespaces, labels and annotations to decide which clusters to apply changes on
Provides a Service Mesh (based on Istio)
Sidecar to implement common features (logging, AuthN/Z, tracing, metrics, etc.)
A/B testing, canary rollouts
Machine Learning
Prebuilt APIs: no in-house ML expertise; easy to use; example: Vision API, Video API, Natural Language API, Speech-to-text API, Text-to-speech API, Translation API
Cloud AutoML: build custom ML models with developers having limited ML expertise
AI Platform: based on Tensorflow Enterprise; serverless and autoscaling; training + serving phases are supported; Explainable AI - interpret models with confidence (why a decision was made)
Data management: Cloud Storage and BigQuery; BigQuery ML - build models directly from data in BigQuery
Automation and instrumentation: AI Platform Pipelines & Cloud Composer to orchestrate and automate data/ML pipelines
Apigee API Management
Design, secure, publish, analyze, monitor, monetize and scale API anywhere
APIs can be on-prem, GCP, or hybrid
Provides AI-powered API monitoring
Enable caching with Cloud CDN
Create Developer Portals
Use cases: abstraction layer on top of legacy services, expose ML models as APIs
Identity Platform
Customer Identity and Access Management
IAM is for controlling access to Google Cloud resources; works with Members, Roles, Policies and Service Accounts
Identity Platform is CIAM (customer identity and access management)
- Authentication and Authorization for end users
Key features
AuthN/Z for web and mobile applications
Multiple methods: SAML, OIDC, email/password, social
User sign-up and sign-in
MFA
An upgrade from Firebase Authentication Legacy
Integrates well with IAP
Scenarios
App on GCE VM needs access to GCS → IAM Service Account
Enterprise user needs access to upload objects to GCS → IAM
Manage end users for apps → Identity Platform
Login via Facebook → Identity Platform
Sign-up and sign-in workflows → Identity Platform
Eventarc
Simplifies event driven architectures
Adheres to CloudEvents (cloudevents.io)
Event provider: who can trigger events?
Direct: Pub Sub, Cloud Storage, Cloud Functions, IoT, Memorystore
Indirect: Cloud Audit Logs entries
Event destination
- Cloud Functions (2nd gen), Cloud Run, GKE services
Uses Pub Sub topics in the background
Observability and OpenTelemetry
Observability: measure the internal state of a system by examining its outputs
Three pillars of observability: logs, metrics, traces (these are the outputs)
Earlier, there were different standards for each of these
And different approaches across different programming languages
OpenTelemetry aims to be a unified standard: collection of technologies to collect and export telemetry (part of CNCF)
GCP supports OpenTelemetry
Steps to use
Add OTel libraries for your language
Instrument your code to export telemetry
Once your app is deployed to GCP, you’ll be able to see the data in the cloud specific tools
Service Directory
A single place to publish, discover and connect services
Workloads can be running in GCP, on-prem or other clouds
Managed service: H/A and scalable
Register/resolve services using DNS, HTTP and gRPC
Service Directory client libraries are available for multiple languages
Audit logging and request/response logs
Section 35 - Understanding Google Cloud Architecture Framework
Best practices and implementation recommendations to help your GCP deployment
Focus on designing robust, secure and scalable systems
Four principles
Operational excellence
Ensure operational readiness and performance using CloudOps: Ensure that cloud solutions meet operational and performance requirements by defining service level objectives (SLOs) and by performing comprehensive monitoring, performance testing, and capacity planning.
Define SMART SLOs and SLAs
Implement comprehensive observability
Implement performance and load testing
Plan and manage capacity
Continuously monitor and optimize
Manage incidents and problems: Minimize the impact of cloud incidents and prevent recurrence through comprehensive observability, clear incident response procedures, thorough retrospectives, and preventive measures.
Establish clear incident response procedures
Centralize incident management
Conduct thorough post-incident reviews
Maintain a knowledge base
Automate incident response
Manage and optimize cloud resources: Optimize and manage cloud resources through strategies like right-sizing, autoscaling, and by using effective cost monitoring tools.
Right-size resources
Use autoscaling (MIGs, GKE cluster/Pod autoscaler, Cloud Run is serverless and autoscales already)
Leverage cost-optimization strategies (committed use, sustained use, preemptible)
Track resource usage and cost (tags and labels)
Establish cost allocation and budgeting
Automate and manage change: Automate processes, streamline change management, and alleviate the burden of manual labor.
Adopt IaC
Implement version control
Build CI/CD pipelines
Use configuration management tools
Automate testing
Continuously improve and innovate: Focus on ongoing enhancements and the introduction of new solutions to stay competitive.
Foster a culture of learning
Conduct regular retrospectives
Stay up-to-date with cloud technologies
Actively seek and incorporate feedback
Measure and track progress
Security, privacy and compliance
Security principles
Build a layered security approach
Design secure, decoupled systems
Automate deployment of sensitive tasks
Automate security monitoring
Meet the compliance requirements for your regions
Comply with data residency and sovereignty requirements
Shift security left
Manage your assets
Use cloud asset management tools
Automate asset management (IaC)
Monitor for deviations from your compliance policies
Integrate with your existing asset management monitoring systems
Use data analysis to enrich your monitoring
Manage identity and access
Use a single IdP
Protect the super admin account
Plan your use of service accounts
Update identity processes for the cloud
Set up SSO and MFA
Implement least privilege and separation of duties
Audit access
Automate your policy controls
Set restrictions on resources
Implement compute and container security
Use hardened and curated VM images
Use Confidential Computing for processing sensitive data: Confidential Computing builds on Shielded VM. It protects data in use by performing computation in a hardware-based trusted execution environment
Protect VMs and Containers
Disable external IP addresses unless they are necessary
Monitor your compute instance and GKE usage
Keep images and clusters up to date
Control access to images and clusters
Isolate containers in a sandbox: Use GKE Sandbox to deploy multi-tenant applications that need an extra layer of security and isolation from their host kernel. For example, use GKE Sandbox when you are executing unknown or untrusted code. GKE Sandbox is a container isolation solution that provides a second layer of defense between containerized workloads on GKE.
Secure your network
Deploy zero trust networks
Secure connections to on-prem or multi-cloud environments
Disable default networks
Secure your perimeter
Inspect network traffic
Use a web application firewall
Automate infrastructure provisioning
Monitor your network
Implement data security
Automatically classify your data
Manage data governance using metadata
Protect data according to its lifecycle phase and classification
Encrypt your data
Control cloud admin’s access to your data
Configure where your data is stored and where users can access it from
Manage secrets using Secret Manager (see the sketch after this list)
Monitor your data
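A minimal sketch of reading a secret at runtime with Secret Manager, assuming the google-cloud-secret-manager package; project and secret IDs are placeholders:

```python
# Hedged sketch: fetch the latest version of a secret from Secret Manager.
# Assumes `pip install google-cloud-secret-manager`; names are placeholders.
from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()

name = "projects/my-project/secrets/db-password/versions/latest"
response = client.access_secret_version(request={"name": name})
db_password = response.payload.data.decode("utf-8")
```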
Deploy applications securely
Automate secure releases
Ensure that application deployments follow an approved process
Scan for known vulnerabilities before deployment
Monitor application code for known vulnerabilities
Control movement of data across perimeters
Encrypt your container images
Manage compliance obligations
Assess your compliance needs
Deploy Assured Workloads: Assured Workloads is the Google Cloud tool that builds on the controls within Google Cloud to help you meet your compliance obligations.
Review blueprints for templates and best practices that apply to your compliance regime
Monitor your compliance
Automate your compliance
Implement data residency and sovereignty requirements
Manage your data sovereignty
Manage your operational sovereignty
Manage software sovereignty
Control data residency
Implement privacy requirements
Classify your confidential data
Lock down access to sensitive data
Monitor for phishing attacks
Extend zero trust security to your hybrid workforce
Implement logging and detective controls
Monitor network performance
Monitor and prevent data exfiltration
Centralize monitoring
Monitor for threats
Reliability
Set a clear expectation that the entire organization is responsible for reliability, including development, product management, operations, platform engineering, and site reliability engineering (SRE). Even the business-focused groups, like marketing and sales, can influence reliability.
Principles
Define reliability based on user-experience goals
Measure user experience
Analyze user journeys (Cloud Trace)
Set realistic targets for reliability
Accept some failure and prioritize components
Balance reliability and cost (remember to consider your RTOs and RPOs)
Build high availability through redundancy
Identify failure domains and replicate services
Detect and address issues promptly (health checks)
Take advantage of horizontal scalability
Use managed services (MIGs, Cloud Run)
Promote modular design
Implement a stateless design
Detect potential failure by using observability
Metrics, logs and traces
Gain comprehensive insights (Cloud Logging, Cloud Monitoring)
Perform proactive troubleshooting (Cloud Storage access logs, VPC Flow Logs)
Optimize resource utilization
Prioritize alerts
Design for graceful degradation
Implement throttling (Apigee)
Drop excess requests early
Handle partial errors and retries
Test overload scenarios
Monitor traffic spikes
Perform testing for recovery from failures
Define testing objectives and scope
Prepare the environment for testing
Simulate failure scenarios
Monitor system behaviour
Verify recovery against RTO and RPO
Document and analyze results
Iterate and improve
Perform testing for recovery from data loss
Verify backup consistency and test restoration processes
Schedule regular and frequent backups
Define and monitor RPO
Monitor backup health
Plan for scenarios beyond backup
Conduct thorough postmortems
Conduct blameless postmortems
Make postmortem report readable by all intended audiences
Avoid complex or over-engineered solutions
Share the postmortem as widely as possible
Cost optimization
Align spending with business value
Prioritize managed services and serverless products
Balance cost efficiency with business agility (adopt DORA metrics, follow SRE practices)
Enable self-service optimization
Adopt and implement FinOps
Promote a value-driven and TCO-aware mindset
Foster a culture of cost awareness
Provide org-wide cost visibility
Understand how cloud resources are billed
Understand resource-based cost optimization options (always-allocated CPUs for Cloud Run, BigQuery slot commitments)
Understand discount-based cost optimization options (CUDs, sustained use discounts, Spot VMs)
Incorporate cost estimates into architecture blueprints
Use a consistent and standard set of labels for all resources
Share cost reports with team members
Optimize resource usage
Choose environment-specific resources (prod, dev/test, others)
Choose workload-specific resources (mission-critical, non-critical, event-driven, experimental)
Select regions based on cost requirements
Use built-in cost optimization options
Optimize resource sharing
Develop and maintain reference architectures
Enforce cost discipline by using organization policies
Estimate realistic budgets and set financial boundaries
Optimize continuously
Focus on business-relevant metrics (latency, error rates, throughput, customer satisfaction, DORA, SRE)
Use observability for resource optimization
Balance troubleshooting needs with cost
Tailor data collection to roles and set role-specific retention policies
Consider regulatory and compliance requirements
Implement smart alerting (customer impacting first)
Performance optimization
Core principles
Define requirements
Design and deploy
Monitor and analyze
Optimize
Plan resource allocation
Configure and manage quotas
Educate and promote awareness
Monitor performance metrics
Take advantage of elasticity
Plan for peak load periods
Use predictive scaling
Implement serverless architectures
Use Autopilot mode for GKE
Promote modular design
Design for loose coupling
Design for concurrency and parallelism
Balance modularity for flexible resource allocation
Use stateless models
Choose complementary technologies
Continuously monitor and improve performance
Define clear performance goals and metrics
Monitor performance
Incentivize continuous improvement
Section 36 - Case Studies - Solutions - Professional Cloud Architect
EHR Healthcare
https://services.google.com/fh/files/blogs/master_case_study_ehr_healthcare.pdf
Anthos - allows you to run K8s clusters anywhere; centralized management; service mesh; CI/CD - watch Git repo for changes and apply changes to all clusters
Cloud Logging & Monitoring - alerting policies and notifications
Cloud Logging sinks to Cloud Storage/BigQuery - log retention
BigQuery - make predictions and generate reports; can get data into it via Pub/Sub (streaming, see the sketch after this list) or GCS (batch)
Dedicated Interconnect - secure and high performance connection between on-prem and GCP
Cloud CDN - reduced latency
ADFS and Cloud Directory Sync - SSO
MySQL / SQL Server → Cloud SQL
Redis → Memorystore
MongoDB → https://cloud.google.com/mongodb?hl=en or Marketplace/Datastore/Firestore
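For the streaming path into BigQuery, a minimal publisher sketch, assuming the google-cloud-pubsub package and a topic that a Dataflow job (or BigQuery subscription) drains into a table; project, topic and payload are placeholders:

```python
# Hedged sketch: publish events to Pub/Sub; a Dataflow pipeline or BigQuery
# subscription on this topic would stream them into BigQuery.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "ehr-events")  # placeholders

event = {"record_id": "123", "event_type": "admission"}
future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
print(future.result())  # message ID once the publish is acknowledged
```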
Helicopter Racing League
https://services.google.com/fh/files/blogs/master_case_study_helicopter_racing_league.pdf
BigQuery + AI platform → ML/AI + analytics
Cloud Storage + Cloud CDN → move realtime and recorded content closer to the users
Apigee → Expose predictive models to partners
Pub/Sub + Dataflow + BigQuery → realtime transformations, messaging
Transcoder API → increase transcoding performance
Video Intelligence API
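A minimal sketch of the Video Intelligence API on recorded footage (label detection), assuming the google-cloud-videointelligence package; the GCS URI is a placeholder:

```python
# Hedged sketch: label detection on a recorded video with the
# Video Intelligence API. Assumes `pip install google-cloud-videointelligence`.
from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()

operation = client.annotate_video(
    request={
        "features": [videointelligence.Feature.LABEL_DETECTION],
        "input_uri": "gs://my-bucket/races/race-001.mp4",  # placeholder
    }
)
result = operation.result(timeout=300)  # long-running operation

for label in result.annotation_results[0].segment_label_annotations:
    print(label.entity.description)
```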
Mountkirk Games
https://services.google.com/fh/files/blogs/master_case_study_mountkirk_games.pdf
Agones + Kubernetes → game servers (https://agones.dev/site/)
MIGs + VMs + Global LB + GPUs
Cloud Storage → store activity logs
Cloud Spanner or Memorystore → realtime leaderboard (see the Redis sketch after this list)
Cloud Build + Spinnaker/Jenkins → CI/CD
Spanner + Memorystore → match history DB with regional caching
Bigtable → time series events
Firestore → player DB
BigQuery/AI Platform → analytics and intelligence
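A minimal sketch of the Memorystore-backed realtime leaderboard using a Redis sorted set, assuming the standard redis-py client; the host IP is a placeholder for the Memorystore instance:

```python
# Hedged sketch: realtime leaderboard on Memorystore (Redis) via a sorted set.
# Assumes `pip install redis`; host/port are placeholders.
import redis

r = redis.Redis(host="10.0.0.3", port=6379)

# Record or update scores; the sorted set stays ordered by score.
r.zadd("leaderboard", {"player:alice": 4200, "player:bob": 3900})

# Top 10 players, highest score first.
for rank, (player, score) in enumerate(
    r.zrevrange("leaderboard", 0, 9, withscores=True), start=1
):
    print(rank, player.decode(), int(score))
```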
TerramEarth
https://services.google.com/fh/files/blogs/master_case_study_terramearth.pdf
IoT Core, Pub/Sub, Dataflow, BigQuery → streaming telemetry ingestion and analytics
Cloud Storage, Dataflow, BigQuery → batch upload and processing of field data
Apigee
Cloud Build, Spinnaker, Jenkins
Google Workspace - collaboration
Secret Manager - cloud native solutions for keys and secret management
GKE - container based workloads
Dedicated Interconnect - networking, high data volume