Google Cloud - Professional Data Engineer - Study Guide and Exam Experience (2025)

Radu Pana

Background

I’ve been working with cloud service providers for a few years now, first using AWS and then moving on to GCP, which is the provider I currently use professionally. During this time, I attained a few certifications, such as the AWS Architect Associate & Professional, as well as the GCP Engineer & Architect.

Recently, I’ve found myself increasingly involved with data engineering, and figured I’d dig deeper into GCP’s data product catalog, as well as obtain their Data Engineer certification.

Exam experience

Let’s start with the most important aspect, the exam itself. I sat and passed the exam on Friday, Aug 22, 2025.

There were 50 multiple-choice questions in total (four or five options each). Most were single answer, but for about five of them I was asked to select two options instead of one. I got through all the questions in about 40 minutes (the maximum allowed time is 120 minutes). I personally find that I do better when I don’t go back to check my answers, because I end up second-guessing myself. Instead, I spend just enough time on each question to be as confident as I can be about my choice.

The questions were slightly more complex than the samples provided by Google, but not by much. Most of them were fairly easily solved by process of elimination, combined with spotting the two answers which mention the same product/service, but have a slight difference between them. In these cases, I would discard the other two options and focus on the difference(s) between the remaining ones.

As expected, BigQuery, Dataflow and Dataproc appeared most often, between them accounting for more than half of the total questions.

Encountered topics:

  • A lot of BigQuery, Dataflow, Dataproc

  • A lot of governance: there were a lot of questions on Dataplex and unified governance/visibility over data lakes across GCS and BigQuery.

  • Quite a few questions on de-identifying PII in BigQuery and preventing teams from accessing specific rows/columns

  • Dataform appeared on two questions

  • HDFS on GCS appeared once or twice

  • Memorystore appeared in two questions

Things I didn’t expect/didn’t know much about:

  • Dataform

  • Detailed networking setup (SSH tunnel, private connectivity) to use Datastream between an on-prem SQL DB and BigQuery

Study guide I followed

Having had a fair few years’ experience with AWS and GCP, I felt fairly confident in my knowledge of their product portfolios, including the services pertaining to data engineering. I signed up for a couple of Udemy courses through my employer and also made extensive use of ChatGPT and Claude to create detailed study plans per product. I ended up studying about one hour per day, for around a month, before sitting the exam.

I found that the knowledge required to pass the exam is fairly surface level. You don’t need in-depth knowledge of subjects like Colossus, Capacitor, Andromeda, BigQuery/Bigtable architecture, etc. It’s sufficient to know which products are relevant for a specific use case, when to choose one over the other, and some common synergies/architectures that leverage multiple GCP products.

You can find the guide I followed below. Note that you should also cover Dataform, which I hadn’t really prepared for.

Service comparison cheat sheet

Database solutions comparison

| Service | Workload type & data model | Compute abstraction (& separation?) | Physical storage layout | Key Visualizer? |
|---|---|---|---|---|
| Cloud Bigtable | NoSQL wide-column, petabyte OLTP/IoT | Nodes (≈10k ops / 250 MB/s each); nodes scale up/down independently of Colossus storage (true separation) | Rows stored lexicographically by row key; each column family lives in its own SSTable files | Yes – Bigtable Key Visualizer |
| Cloud Spanner | Globally consistent relational OLTP | Nodes or Processing Units; compute layer separate from replicated storage shards (“splits”) | Tables broken into contiguous key-range splits ordered by primary-key bytes | Yes – Spanner Key Visualizer |
| BigQuery | Serverless columnar OLAP warehouse | Slots (virtual CPU-seconds); fully decoupled from Colossus columnar storage | Columnar format (Capacitor); each column stored separately | No (uses Execution Details & slot-usage graphs) |
| Cloud Firestore (Native mode) | NoSQL document store for mobile/web | Serverless auto-scaling; compute invisible to users, fully separated from storage | Documents in collections; underlying shards ordered by document ID (document-oriented) | Yes – Firestore Key Visualizer |
| Cloud SQL | Managed MySQL/PostgreSQL/SQL Server | Fixed VM machine types (vCPU + RAM); compute coupled to persistent disks | Row-oriented (engine default) on block storage | No |
| Memorystore for Redis | In-memory key-value cache | Shards & replica nodes (RAM-backed); compute = storage (in-memory) | All data held in RAM; persistence optional | No |
| Cloud Storage | Object storage (data lakes, backups) | Fully serverless; pay per GB and per operation; compute totally hidden | Immutable objects on erasure-coded disks across zones (11 nines durability) | No |

Dataflow vs Dataproc vs Data Fusion

  • All are GCP-native data processing services.

  • All support ETL/ELT pipelines.

  • All integrate with Cloud Storage, BigQuery, Pub/Sub, etc.

  • All support IAM, VPC, monitoring via Cloud Monitoring, and logging via Cloud Logging.

  • All can be orchestrated via Cloud Composer or Workflows.

| Feature | Cloud Dataflow | Cloud Dataproc | Cloud Data Fusion |
|---|---|---|---|
| Engine | Apache Beam (Java/Python SDK) | Apache Spark, Hadoop, Hive, Presto | CDAP (Cask) |
| Interface | Code-first | Code-first (Spark, Hive, PySpark) | UI (drag-and-drop) |
| Processing Type | Batch + Streaming | Primarily batch (some streaming via Spark Streaming) | Mostly batch (some streaming support) |
| Abstraction Level | High-level (managed Beam SDK) | Lower-level (you manage Spark/Hadoop jobs) | High-level (UI-based pipeline creation) |
| Custom Transformations | Fully supported via Beam SDK | Fully supported via Spark/Hive | Limited (Java plugins needed for custom logic) |
| Worker Management | Fully managed autoscaling (with overrides) | Cluster-based (you manage cluster size and config) | Managed instance (you pay per Fusion instance tier) |
| Ease of Use | Moderate (requires dev knowledge) | Advanced (requires infra + Spark/Hadoop expertise) | Easy (low-code) |
| Use Case Focus | Real-time data pipelines, event processing, large-scale ETL | Legacy Hadoop/Spark migration, ML on Spark, batch analytics | Rapid data ingestion/movement, citizen data integration |
| Streaming Support | ✅ First-class (windowing, watermarking, etc.) | ⚠️ Limited (Spark Structured Streaming only) | ⚠️ Basic (some plugins support real-time ingest) |
| Dimension | Dataflow | Dataproc | Data Fusion |
|---|---|---|---|
| Streaming | ✅ Excellent | ⚠️ Limited | ⚠️ Basic |
| Batch | ✅ Yes | ✅ Yes | ✅ Yes |
| Code Required | ✅ Yes (Java/Python) | ✅ Yes (Spark, Hive) | ❌ No (drag-and-drop) |
| Custom Logic | ✅ Native support | ✅ Native support | ⚠️ Java plugins only |
| Fully Managed | ✅ Serverless | ⚠️ You manage the cluster | ✅ Instance-based |
| Use Case Type | Streaming + scalable ETL | Open-source pipelines, ML on Spark | Simple data movement + ingestion |
| Users | Engineers | Engineers/Data Scientists | Analysts/Integrators |

Composer vs Workflows vs Scheduler

| Feature / Service | Cloud Scheduler | Cloud Workflows | Cloud Composer |
|---|---|---|---|
| Type | Cron-like job scheduler | Serverless API/service orchestrator | Managed Apache Airflow (workflow orchestrator) |
| Best for | Simple time-based triggers | Medium-complexity orchestrations of APIs & services | Complex, dependency-aware data pipelines |
| Triggers / Targets | Pub/Sub, HTTP(S), App Engine | GCP services & HTTP APIs via YAML/JSON steps | GCP + external systems via Airflow operators |
| Complexity | Single job per schedule | Multi-step, retries, branching, parallelism | DAGs with dependencies, sensors, SLAs, backfills |
| Execution Model | Stateless cron → target | Serverless (pay per execution) | Persistent environment with schedulers/workers |
| State / Backfill | None | No backfill, minimal state | Supports backfills, task state (XCom), sensors |
| Ops Overhead | Very low | Low | High (managing environment, Airflow concepts) |
| Cost | Cheapest (per job) | Pay per execution step | Costly (environment runs even when idle) |
| Exam Cues | “Run X at Y time,” “Trigger one function/job daily” | “Glue APIs together serverlessly,” “Retry logic in YAML” | “Orchestrate multi-step pipelines,” “Backfill partitions,” “DAGs & dependencies” |

BigQuery – Cloud Data Warehouse (Highest Importance)

Google BigQuery is a fully managed, serverless analytics data warehouse that operates at petabyte scale. It excels at running SQL queries on vast datasets with high speed and minimal operational overhead. BigQuery decouples storage and compute, storing data in a compressed columnar format and allocating compute (slots) on demand. By default, you pay per volume of data scanned (on-demand pricing), but BigQuery also offers capacity-based pricing via BigQuery Editions (Standard, Enterprise, Enterprise Plus) to reserve slots with committed pricing. Each edition provides different features and SLAs – for example, Enterprise editions include features like BI Engine (in-memory query acceleration) and cross-user result caching, while Enterprise Plus adds compliance controls (Assured Workloads) and managed disaster recovery.

Data Modeling and Storage: BigQuery is schema-oriented and supports structured and semi-structured data (including JSON columns). It’s an OLAP (analytical) engine designed for aggregations and scans rather than transactional updates. Denormalization (storing nested or repeated fields) is common to optimize read performance, since JOINs can be expensive. Partitioning and clustering are crucial techniques for managing large tables: partition on a date or ingestion timestamp to prune the data scanned, and cluster on frequently filtered columns to organize data for faster reads. BigQuery ensures ACID compliance for each single SQL statement (it uses snapshot isolation) – every DML statement is an implicit transaction – and supports explicit multi-statement transactions if needed, though these are uncommon. Unlike in relational OLTP systems, BigQuery updates and inserts are best done in batches (streaming inserts are supported for real-time use cases, but may add a slight latency before data is fully consistent).
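For illustration, here is a minimal sketch using the google-cloud-bigquery Python client (dataset and table names are hypothetical) that creates a day-partitioned, clustered table via DDL and then queries it with a partition filter:

```python
from google.cloud import bigquery

client = bigquery.Client()

# DDL: a table partitioned by event date and clustered by commonly filtered columns.
# Assumes a dataset named "analytics" already exists.
client.query("""
CREATE TABLE IF NOT EXISTS analytics.events
(
  event_ts   TIMESTAMP,
  user_id    STRING,
  event_type STRING,
  payload    JSON
)
PARTITION BY DATE(event_ts)
CLUSTER BY user_id, event_type
OPTIONS (partition_expiration_days = 365);
""").result()

# Filtering on the partition column means only matching partitions are scanned.
rows = client.query("""
SELECT event_type, COUNT(*) AS events
FROM analytics.events
WHERE DATE(event_ts) = CURRENT_DATE()
GROUP BY event_type
""").result()
for row in rows:
    print(row.event_type, row.events)
```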

Security and Governance: Data in BigQuery is encrypted at rest and in transit by default. For sensitive data, you can enforce Customer-Managed Encryption Keys (CMEK) via Cloud KMS for BigQuery storage to meet compliance needs. Access control is handled by Cloud IAM – you can grant predefined roles like BigQuery Data Viewer or Data Editor on projects, datasets, or even at the table level. BigQuery also supports fine-grained access: Column-level security (using policy tags in Data Catalog) and Row-level security (using authorized views or row access policies) to restrict sensitive data. For example, you might tag a column as PII and restrict access to it via a Data Catalog policy tag. When sharing data, instead of giving direct table access, you can create authorized views or materialized views that expose only allowed subsets of data. BigQuery’s integration with Analytics Hub further allows publishing datasets in a controlled way – organizations can create exchanges and listings to share data products internally or externally in a privacy-safe manner. A subscriber to an Analytics Hub listing gets a linked dataset in their BigQuery project that they can query, without the data being copied (and providers can monitor usage and revoke access if needed). This data sharing as a product concept (data exchanges, clean rooms for joint analysis without data movement) is increasingly important.
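As a hedged example (table, group, and column names below are made up), row-level security and a PII-free view can be created with plain DDL through the Python client; note that an authorized view additionally requires authorizing it on the source dataset, which is a separate step not shown here:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Row-level security: only the named group can see US rows of this table.
client.query("""
CREATE OR REPLACE ROW ACCESS POLICY us_rows_only
ON analytics.sales
GRANT TO ("group:us-analysts@example.com")
FILTER USING (region = "US");
""").result()

# Column-level alternative: expose a view that omits PII columns, then authorize
# the view on the source dataset so consumers never touch the base table.
client.query("""
CREATE OR REPLACE VIEW analytics_shared.sales_no_pii AS
SELECT order_id, region, amount   -- email, address, etc. deliberately omitted
FROM analytics.sales;
""").result()
```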

Performance and Features: BigQuery is designed to scan large tables efficiently. It uses columnar storage and can scale the compute slots automatically. You should leverage features like table partitioning (by date or integer range) and clustering (e.g., cluster by user_id, or other high-cardinality columns) to reduce query cost and latency. BigQuery has materialized views for pre-aggregating frequent query results, and result caching that caches query results for 24 hours (so repeated queries by the same user are free if underlying data hasn’t changed). For BI dashboards, BigQuery’s BI Engine can be used to cache query results in-memory for sub-second performance – this especially benefits tools like Looker Studio or Tableau connecting live to BigQuery. BigQuery can also execute machine learning with SQL using BigQuery ML, allowing data engineers to train models (like linear regression, k-means, XGBoost, TensorFlow models) on data in place. This may appear in exam scenarios where simple ML (e.g., forecasting or classification) needs to be integrated into a data pipeline without exporting data to a separate ML platform.
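A rough BigQuery ML sketch (hypothetical dataset, tables, and columns): train a regression model in place with SQL, then score new rows with ML.PREDICT, all without moving data out of BigQuery:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Train a simple linear regression model directly on warehouse data.
client.query("""
CREATE OR REPLACE MODEL analytics.sales_forecast
OPTIONS (model_type = 'linear_reg', input_label_cols = ['weekly_sales']) AS
SELECT store_id, promo_flag, week_of_year, weekly_sales
FROM analytics.sales_history;
""").result()

# Batch predictions; the output column is named predicted_<label>.
rows = client.query("""
SELECT store_id, predicted_weekly_sales
FROM ML.PREDICT(MODEL analytics.sales_forecast,
                (SELECT store_id, promo_flag, week_of_year FROM analytics.next_week))
""").result()
for row in rows:
    print(row.store_id, row.predicted_weekly_sales)
```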

Use Cases: The exam will expect you to recognize when BigQuery is the best solution. BigQuery shines for analytical queries (OLAP) on large datasets, e.g. analyzing millions of records for trends, aggregating logs, or as an enterprise data warehouse for BI reporting. It’s less suitable for high-frequency single-row updates or transactional workloads – those fit a database like Cloud SQL or Spanner. If a use case involves complex SQL analytics, ad-hoc queries, or large-scale join/aggregate operations, BigQuery is likely the right choice. BigQuery can ingest streaming data (via Pub/Sub or its streaming insert API) and make data queryable within seconds, making it useful for near-real-time analytics (e.g., monitoring dashboards). It also now supports BigQuery Omni to query data in other clouds, and BigLake tables to query files in Cloud Storage or across data lakes with consistent security.

BigQuery Editions introduced tiered capacity plans – Standard (baseline), Enterprise, Enterprise Plus – which affect available features and pricing commitments. For example, higher editions add capabilities such as a 99.99% SLA, advanced security and compliance controls, and (for Enterprise Plus) managed disaster recovery with cross-region failover. Another newer capability is very low-latency streaming ingestion – for example, BigQuery can ingest from Pub/Sub directly via BigQuery subscriptions, and the Storage Write API provides high-throughput, exactly-once streaming into native storage.

Dataflow & Apache Beam – Unified Batch/Streaming Processing

Cloud Dataflow is Google’s fully-managed service for unified stream and batch data processing, built on the open-source Apache Beam programming model. In Dataflow, you create pipelines (in Java, Python, etc. using Beam SDKs) that define transformations on data, and the service handles provisioning and scaling of worker instances to execute the pipeline steps. For the exam, Dataflow is critical as it’s the primary tool for building ETL/ELT pipelines and streaming data processing in GCP.

Batch vs. Streaming: Dataflow handles both modes with the same API. In batch jobs (processing bounded data like files on Cloud Storage or a BigQuery export), Dataflow reads the entire dataset, processes it (e.g., transforms, aggregates), and writes to a sink (like BigQuery, Cloud Storage, etc.). In streaming jobs (unbounded data like events from Pub/Sub), Dataflow pipelines run continuously, handling infinite event streams with a windowing mechanism. You must understand Beam’s time model: events have timestamps and are grouped into windows (e.g., fixed 1-minute windows, sliding windows, session windows) for aggregation. Late-arriving data is handled via watermarks and allowed lateness – Dataflow can trigger updates for late data (using triggers) or discard late events beyond a threshold. Dataflow ensures exactly-once processing when used with sources/sinks that support it (for example, reading from Pub/Sub and writing to BigQuery can be exactly-once). By default, Pub/Sub deliveries are at-least-once, but Dataflow’s checkpointing and deduplication can achieve an end-to-end exactly-once result. A recent update even introduced an explicit at-least-once streaming mode for lower latency if occasional duplicates are acceptable (though exactly-once is usually desired for correctness).
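To make Beam’s windowing model concrete, here is a minimal streaming sketch in the Beam Python SDK (topic, table, and field names are placeholders, not from the exam material): it reads from Pub/Sub, applies one-minute fixed event-time windows with allowed lateness and a late-data trigger, counts events, and writes to BigQuery:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window, trigger

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/clicks")
        | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
        | "Window" >> beam.WindowInto(
            window.FixedWindows(60),                       # 1-minute event-time windows
            trigger=trigger.AfterWatermark(
                late=trigger.AfterCount(1)),               # re-emit when late data arrives
            allowed_lateness=600,                          # accept data up to 10 min late
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
        )
        | "PairWithOne" >> beam.Map(lambda event: (event, 1))
        | "CountPerWindow" >> beam.CombinePerKey(sum)
        | "ToRow" >> beam.Map(lambda kv: {"event": kv[0], "count": kv[1]})
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:analytics.click_counts",
            schema="event:STRING,count:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```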

Transforms and Connectors: Beam pipelines consist of PCollections (data sets) and PTransforms (operations). Common transforms include element-wise mapping, filtering, aggregations (groupByKey, combine), and streaming-specific ones like windowing and triggering. Dataflow comes with many connectors: it can read from Cloud Storage, Pub/Sub, Bigtable, Kafka, JDBC databases, etc., and write to many sinks (BigQuery, Pub/Sub, Cloud Storage, Spanner, etc.). For example, a pipeline might read from Pub/Sub, window events by minute, compute some aggregate, and then write to BigQuery – all managed by Dataflow’s runner. The exam may give scenario descriptions like “ingest and transform a stream of user click events then load to storage” – Dataflow is often the correct choice for such tasks, especially if the volume is high or complex processing (like sessionization or joins of streams) is needed.

Scaling and Reliability: Dataflow is serverless in that it autoscales the number of worker VMs. It can scale horizontally for throughput, and with Dataflow Prime (an enhanced version of the service), it can also scale vertically (increasing worker memory for skewed workloads) and make use of accelerators such as GPUs where needed. Dataflow pipelines are fault-tolerant – if a worker fails, Dataflow transparently reruns the needed work on another worker. For streaming, state and checkpoints are managed so that failure doesn’t lead to data loss or duplication beyond what the model allows. As a managed service, Dataflow handles provisioning and teardown; you just submit a job with a given worker machine type and it will allocate resources. You can also use Dataflow Templates to parameterize pipelines for reuse, and Flex Templates for custom packaging – useful for scheduling recurring jobs without writing code each time (e.g., a nightly batch pipeline triggered by Cloud Scheduler). Ensure you know that Cloud Dataflow is the recommended service for data pipelines where you need custom transformations in code, especially for streaming data or complex ETL logic. Google may contrast it with Dataproc (Spark) or Data Fusion – Dataflow often wins when you want fully managed, auto-scaling pipelines and you can express your logic in Beam’s paradigms.

Latency and processing guarantees: Dataflow (Beam) provides processing guarantees: event-time processing with allowed lateness ensures results per event-time window, and you can trade off latency vs. completeness using triggers (for example, emit early results and then final ones). Exactly-once sinks like BigQuery or Cloud Storage (with atomic file writes) mean results won’t be duplicated. If the exam scenario is about real-time analytics with minimal lag, Dataflow with streaming (possibly using Streaming Engine for low-latency pipeline processing) is the go-to. There is a slight startup time to Dataflow jobs (a few minutes to provision), but once running, it’s low-latency. For ultra-low latency (< ~1s) event responses, sometimes alternatives like Cloud Functions might be considered, but those are more for event-driven simple tasks, not large-scale processing. Dataflow is built for high-throughput (millions of events per second) with sub-second to seconds latency per event through the pipeline.

Monitoring: Dataflow pipelines integrate with Cloud Monitoring – you can track elements in progress, system lag (how far behind the pipeline is in event-time), and watermark timestamps. The exam may expect that you set up alerts on Dataflow lag or errors. For instance, if a streaming pipeline’s lag exceeds some threshold (meaning it’s falling behind real-time, maybe due to not enough CPU), you’d consider adding more workers or checking for bottlenecks. Dataflow’s UI provides a lot of insight (pipeline graph, throughput, latency metrics).

When not to use Dataflow: If the task is simple moving of data without transformation, sometimes native services or simpler tools suffice (e.g., BigQuery Data Transfer Service or even scheduled queries). If a pipeline is mostly orchestrating other services (like unload from one DB and load to BigQuery), Cloud Composer might orchestrate such steps. But for any complex data processing, Dataflow is usually the correct choice over writing your own Spark cluster or custom code on VMs.

Dataflow has introduced Vertical Autoscaling (the service can automatically increase memory on workers if needed for a heavy step) and improved Streaming Engine features. Also, Dataflow SQL (in preview) lets you run simple streaming transforms using SQL queries written in the BigQuery UI that execute on Dataflow. Additionally, schema-aware pipeline transforms now allow easier integration with Pub/Sub schemas. Keep in mind Dataflow Prime – the exam might not dive deep, but it’s essentially an enhanced mode focused on automatically optimizing resource usage and pipeline performance.

Pub/Sub – Real-Time Messaging and Ingestion

Cloud Pub/Sub is GCP’s scalable, global messaging service for event ingestion and delivery. It follows a publish/subscribe model: producers send messages to a topic, and consumers subscribe to the topic to receive those messages. Pub/Sub decouples components in data pipelines, enabling reliable, asynchronous event processing. Key features include high throughput (millions of messages per second), auto-scaling, at-least-once delivery (with optional exactly-once in integration with Dataflow), and durable storage of messages for a retention period (by default 7 days).

Concepts: A message in Pub/Sub is a small data payload (up to 10 MB) plus optional attributes. Publishers publish messages to topics. Subscribers subscribe via either pull (explicitly poll for messages) or push (Pub/Sub sends an HTTP request to a subscriber endpoint). Each subscription has an independent message stream and acknowledgment deadline – subscribers must ACK a message within e.g. 10 seconds (configurable) or it will be re-delivered. This ensures at-least-once delivery: if a subscriber fails to process, the message isn’t lost. Pub/Sub also supports dead-letter topics (unacknowledged messages can be routed after a number of attempts) to handle poison messages gracefully. Order of messages is not guaranteed by default, but you can use ordering keys such that messages with the same key are delivered in order (requires enabling message ordering on the topic).
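A small sketch with the google-cloud-pubsub Python client showing the basic publish/ack loop (project, topic, and subscription IDs are placeholders):

```python
from concurrent import futures
from google.cloud import pubsub_v1

project_id = "my-project"
topic_id, subscription_id = "orders", "orders-analytics"

# Publish a message with an attribute; the future resolves to the server-assigned message ID.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)
publish_future = publisher.publish(topic_path, b'{"order_id": 123}', source="checkout")
print("Published message", publish_future.result())

# Pull subscriber: process each message, then ack it before the ack deadline expires.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

def callback(message):
    print("Received", message.data, dict(message.attributes))
    message.ack()   # unacked messages are redelivered (at-least-once delivery)

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull.result(timeout=30)   # block for a while to receive messages
except futures.TimeoutError:
    streaming_pull.cancel()
```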

Use Cases: Pub/Sub is central to streaming architectures on GCP. It’s used to ingest event streams from user activity logs, IoT sensor readings, application event logs, etc. For example, an e-commerce site could publish order events to a topic, and multiple subscribers can independently consume: one pipeline (Dataflow) updates analytics in BigQuery, another triggers downstream processing. Because producers and consumers are decoupled, you can add new consumers without modifying the producers, which is a highly flexible design. The exam may describe scenarios where a system needs to handle streaming data or asynchronous processing – Pub/Sub is likely part of the solution.

Streaming to Storage: Pub/Sub messages can be directly exported to storage systems: for example, you can create a BigQuery subscription that writes incoming messages into BigQuery tables automatically (as they arrive, with a slight batching) – this is great for quick ingestion if minimal transform is needed. Similarly, Cloud Storage subscriptions can dump messages as files. These capabilities mean some simple pipelines can be done without coding: e.g., publish JSON events and have Pub/Sub write them to BigQuery continuously. Pub/Sub has recently added no-code import integrations: you can ingest from AWS Kinesis Data Streams into Pub/Sub (one-click import topic), or ingest batch files from Cloud Storage into Pub/Sub (import topics). This shows up in multi-cloud or migration contexts (exam might mention bridging AWS data into GCP – Pub/Sub’s Kinesis integration is a clue).

Schematization: Pub/Sub optionally enforces schemas on messages (using Avro or Protocol Buffers). If a schema is attached to a topic, producers must send data that validates against it – any invalid message is rejected. This helps maintain data quality in pipelines (e.g., if an event field type changes unexpectedly, you catch it). It also allows subscribers to decode messages consistently.

Exactly-Once and Integration: Traditionally Pub/Sub delivers at-least-once, meaning consumers must handle potential duplicates. However, Pub/Sub now supports exactly-once delivery on pull subscriptions, and Dataflow’s deduplication can achieve effectively-once processing end to end. Pub/Sub Lite is a separate, lower-cost alternative where you manage capacity (pre-provisioned throughput, message ordering per partition, and longer retention of up to months). Lite sacrifices some convenience (no automatic scaling, zonal only) for cost – typically used if you have very large volumes but want to pay less and can tolerate more ops overhead. Standard Pub/Sub is fully managed and multi-zone resilient, so for exam answers, unless the question is specifically about cost at extreme scale, regular Pub/Sub is the answer.

Retention and Replay: Pub/Sub retains unacknowledged messages for up to 7 days (the subscription default); topics can additionally be configured to retain acknowledged messages for up to 31 days to support replay. Subscribers can seek to a timestamp or snapshot within the retention window to replay past messages. This is useful for recovery or backfills. For example, if a bug in a pipeline caused data loss, you could create a new subscription and seek it to an earlier time to replay recent events. Pub/Sub messages are stored encrypted at rest; by default Google manages the keys, but you can use CMEK for Pub/Sub if needed for compliance (a newer feature allowing you to bring your own encryption key).

Integration: Pub/Sub is often combined with Dataflow for streaming pipelines: Dataflow can be a subscriber, reading from Pub/Sub and doing transformations, then writing to sinks like BigQuery or Bigtable. Pub/Sub can also trigger Cloud Functions or Cloud Run services directly (via push subscriptions or Eventarc), making it a backbone for event-driven serverless architectures. For instance, a Pub/Sub topic can receive storage upload notifications (e.g., via Cloud Storage’s notification integration) and then a Cloud Function subscribed to that topic processes the file.

Monitoring: You should ensure your systems monitor Pub/Sub throughput and backlog. Cloud Monitoring provides metrics like oldest unacked message age, which indicate if subscribers are falling behind. If backlog is growing, you might need to add more subscribers or investigate a bottleneck.

Pub/Sub has enhanced multi-cloud ingestion (AWS import, as mentioned), and deeper integration with BigQuery Iceberg tables (so streaming data can land in open formats). Also introduced is stream sharing: using Pub/Sub with Analytics Hub to publish streaming data exchanges (in preview), though for exam it’s enough to know Analytics Hub mainly for BigQuery datasets. Pub/Sub’s ecosystem expanded, but the fundamental usage remains: it’s the go-to for decoupling components with an asynchronous, scalable queue. Anytime the exam scenario is “many producers, many consumers, need to handle spikes reliably and decouple processing”, Pub/Sub is the likely answer.

Cloud Storage – Data Lake and File Storage

Cloud Storage (GCS) is a unified object storage service for any amount of unstructured data. It’s essentially Google’s version of an infinite file system in the cloud (like Amazon S3). Data is stored as objects (files/blobs) in buckets. Cloud Storage is crucial in data engineering as the backbone of data lakes, staging areas for pipelines, and for storing any binary large objects.

Storage Classes and Cost: GCS offers multiple storage classes: Standard (frequently accessed data, highest storage cost, no retrieval fees), Nearline (cheaper, for data accessed ~monthly), Coldline (cheaper still, for ~quarterly access), and Archive (cheapest, for rarely accessed data, e.g., yearly). All classes have the same performance; they differ in storage price, retrieval costs, and minimum storage durations. As a data engineer, you plan lifecycle policies to move objects to cheaper classes as they age (for cost optimization). For example, raw logs might stay in Standard while they are actively used, then auto-transition to Coldline after 90 days for long-term archive.
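As a sketch with the google-cloud-storage Python client (bucket name and age thresholds are made up), lifecycle rules like the ones just described can be set programmatically:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-raw-logs")

# Age-based lifecycle rules: demote objects to Coldline at 90 days,
# then delete them after roughly three years.
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.add_lifecycle_delete_rule(age=1095)
bucket.patch()   # persist the updated lifecycle configuration

for rule in bucket.lifecycle_rules:
    print(rule)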

Use Cases: Cloud Storage is ideal for data lake architectures: you land raw data files (CSV, JSON, Avro, Parquet, images, etc.) in GCS. It’s durable (11 9’s durability) and can store virtually unlimited data. Downstream systems like BigQuery or Spark on Dataproc can load or query data from GCS. GCS is also used for backup and recovery – e.g., you might export database backups or BigQuery table snapshots to GCS. It’s also common for machine learning datasets (storing training data, images, etc.). For exam scenarios: if you have a lot of unstructured or semi-structured data (like logs, media, sensor recordings) and need a scalable store before processing, Cloud Storage is the answer. It’s also the target for batch data imports/exports (e.g., transferring data from on-prem via Storage Transfer Service ends up in GCS, then you process it).

Consistency and Performance: GCS is strongly consistent for object creation, overwrites, and listings (strong global consistency). This simplifies pipelines (you don’t have to worry about eventually consistent listings). However, listing very large buckets can be slow – it’s best practice to partition data in a directory-like structure (prefixes) by date or category so you can list only relevant subsets. Reading from GCS is high-throughput – you can achieve very high parallel read/write rates by using multiple threads/workers. (Dataflow and Dataproc can both read GCS with parallelization.) If a comparison arises between Cloud Storage and Bigtable or another store: Cloud Storage is cheaper per GB and great for large binary objects and sequential read/write (throughput), but not good for random single-record lookups with low latency (that’s what databases are for).

Integration: Many services integrate with GCS. BigQuery can query data directly in GCS via external tables (especially if in Parquet/ORC/CSV format) or using BigLake tables (a newer concept to apply BigQuery security on GCS files). Dataflow and Dataproc treat GCS as a first-class source/sink (Beam’s TextIO, FileIO, etc. read/write to GCS with paths like gs://bucket/...). GCS can also trigger events: adding a file can trigger a Cloud Function or Workflow (through Eventarc or Cloud Storage notifications to Pub/Sub) for processing pipelines. For example, the exam might have “a daily file arrives on GCS, how to kick off a pipeline?” – the answer: use Cloud Storage’s Pub/Sub notification feature or Cloud Functions to detect the file and then launch a Dataflow job or other process.
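For example, a sketch (placeholder bucket and topic names, and assuming the caller may publish notifications to that topic) of wiring a bucket’s new-object notifications to Pub/Sub with the Python client, so a subscriber such as a Cloud Function can launch the downstream pipeline:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-landing-bucket")

# Send an OBJECT_FINALIZE notification to the Pub/Sub topic whenever a new
# object lands under the "daily/" prefix.
notification = bucket.notification(
    topic_name="gcs-daily-files",
    payload_format="JSON_API_V1",
    event_types=["OBJECT_FINALIZE"],
    blob_name_prefix="daily/",
)
notification.create()
```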

Security: Cloud Storage uses IAM for controlling access to buckets (and optionally fine-grained object ACLs, though IAM is preferred). You can set bucket-level permissions (Storage Admin, Storage Object Viewer, etc.) or even define uniform bucket-level access to avoid per-object ACLs. For sensitive data, you can use CMEK (provide a Cloud KMS key for the bucket) so that all objects are encrypted with your key (which you can revoke if needed). GCS also supports Bucket Lock (WORM storage, for regulatory archives) and Object Versioning to keep old copies of objects when overwritten or deleted (useful for recovery or audit).

Data Lifecycle and Processing: In a typical data lake on GCP, you might ingest raw data to a GCS landing zone. Then Dataflow/Dataproc jobs read those files, parse/clean them, and output results to either BigQuery or back to GCS (perhaps in a refined format like Parquet in a curated zone). Cloud Storage is very flexible – it holds raw and processed data, and you can keep even the final outputs there if using something like Dataplex to query across lakes and warehouses. The exam’s “Storing the data” domain expects you to know when to choose GCS over other storage: if data is unstructured or you want a schema-less repository with cheap storage and can do schema-on-read, GCS is good. E.g., storing billions of log files that you analyze later via Spark or BigQuery external tables is a GCS use case. Conversely, if you need complex queries or low-latency key-value access, GCS by itself isn’t ideal (it’s not a database).

Transfers and Migrations: Getting data into GCS can be done via the Storage Transfer Service (for scheduled transfers from other clouds or web, also from on-prem via an agent), or for huge offline datasets, via Transfer Appliance (a shipped hardware device that you fill with data and send to Google to import). The exam might mention migrating a PB-scale on-prem Hadoop cluster – one approach is to export the HDFS data to a Transfer Appliance or use parallel gsutil copy to GCS, then process in GCP.

Cloud Bigtable – NoSQL Wide-Column Database

Cloud Bigtable is a fully-managed NoSQL database designed for massive scale and ultra-low latency. It’s a sparsely populated, wide-column store (inspired by Google’s internal Bigtable paper and HBase API) that can handle billions of rows, thousands of columns, and heavy read/write throughput with sub-10ms latency per operation in many cases. Bigtable is ideal for scenarios like time-series data, IoT sensor data, analytics that require fast lookups by key, or any workload where you need high throughput and you can design data access around a primary key.

Data Model: Bigtable stores data in tables with rows identified by a single row key (often a string or binary). Each row can have many columns grouped into column families. Each cell (row x column) can have multiple timestamped versions (Bigtable retains older versions as configured, allowing you to do historical lookups or rollups). Importantly, data is stored lexicographically by row key, and that ordering is used to distribute data across nodes. So row key design is crucial: a poor key (like a timestamp that monotonically increases) can cause hot-spotting on one node. Good designs often prepend a hash or use reverse timestamps to spread writes. The exam may test knowledge of designing Bigtable keys (e.g., “to avoid hot spots, incorporate a random prefix”). Also, Bigtable has only one index – the row key. No secondary indexes or ad-hoc querying on other columns (you’d have to design your own indexing by making those values part of the row key, or scanning which is expensive). This means you retrieve data by key or key ranges very fast, but you cannot do complex SQL queries directly. You can, however, integrate Bigtable with tools like BigQuery (via external table connector) or Dataflow for analytics.
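A short sketch with the google-cloud-bigtable Python client (instance, table, and column-family names are hypothetical) showing a keyed write with a reverse-timestamp row key and a prefix scan over one device’s readings:

```python
import time
from google.cloud import bigtable
from google.cloud.bigtable.row_set import RowSet

client = bigtable.Client(project="my-project")
instance = client.instance("iot-instance")
table = instance.table("sensor_readings")

# Row key: <device_id>#<reverse_timestamp>, so recent readings sort first and
# writes from many devices spread across the key space (avoids hot-spotting).
reverse_ts = 10**10 - int(time.time())
row = table.direct_row(f"device-42#{reverse_ts}".encode())
row.set_cell("metrics", "temperature", b"21.5")
row.commit()

# Read a contiguous key range: every reading for one device
# ('$' is the byte right after '#', closing the prefix range).
row_set = RowSet()
row_set.add_row_range_from_keys(start_key=b"device-42#", end_key=b"device-42$")
for partial_row in table.read_rows(row_set=row_set):
    cell = partial_row.cells["metrics"][b"temperature"][0]
    print(partial_row.row_key, cell.value)
```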

Consistency and Availability: Within a single cluster, Bigtable reads and writes are strongly consistent; consistency between replicated clusters is eventual. A single row is the atomic unit – reads and writes to one row are atomic (row-level consistency) – but there are no multi-row transactions: Bigtable trades cross-row transactional guarantees for high availability and partition tolerance (in CAP terms, it leans AP for replicated deployments). If you need cross-row ACID, consider Spanner or Cloud SQL. Bigtable can be deployed as a single cluster (within one zone) or with multi-cluster replication (across zones or regions). Multi-cluster gives high availability (and even multi-region reads), but with eventual consistency between clusters. Typically, you’d use multi-zone within a region for 99.99% availability, or multi-region if a global presence is needed. Bigtable’s SLA is 99.9% for a single-cluster instance and 99.99% with multi-cluster routing, and it provides automatic failover in multi-cluster configurations. The exam may pose a scenario like “need a database for IoT readings globally, must be highly available with low-latency reads in each region, not strongly consistent globally” – Bigtable fits (multi-region Bigtable with eventual consistency).

Scaling: Bigtable performance scales with the number of nodes in the cluster. Each node can handle a certain throughput and amount of data (roughly 10,000 QPS or more and about 2.5 TB of storage by default before you add more nodes for performance). You can add or remove nodes on the fly to scale, and Bigtable also supports autoscaling of nodes based on CPU and storage utilization targets. This is important: previously you had to scale manually, but now Bigtable can scale up/down within limits – great for cost optimization. Bigtable tables can grow to petabytes. It’s also a relatively expensive service per node, so you use it when single-digit-millisecond responses at large scale are needed. For moderate data volumes, Cloud SQL or Firestore might be cheaper.

Access Patterns: Optimal use of Bigtable is accessing data by row key or scanning contiguous row ranges. For example, a time series might use a row key like <deviceID>#<timestamp> so that you can retrieve a time range of data for one device by scanning that key range. For an ad-tech app, the row key might be a user ID, storing the user profile or click counts for quick retrieval. Aggregations are not done inside Bigtable (no GROUP BY queries natively), but you can write a Dataproc or Dataflow job to read from Bigtable and aggregate externally. If an exam question says “need to count events per user per day in real time as events come in” – one approach is to design Bigtable row keys as user#day and increment counters (Bigtable supports atomic row mutations and counter increments). Bigtable can handle millions of ops/sec of throughput, far beyond what traditional relational DBs handle.

Backups and Migration: Bigtable now offers snapshots and backups (you can take a point-in-time snapshot of a table and restore later). This is a relatively recent feature that helps with data protection (previously, to back up Bigtable you had to export to GCS). Bigtable also introduced change streams (in preview/GA by 2023) which allow Dataflow to capture all data changes in Bigtable and stream them out for analytics or replication use cases (similar to how one might stream database CDC changes). If you see mention of feeding Bigtable changes to BigQuery in real-time, know that Bigtable change streams + Dataflow could be a solution.

Comparisons: The exam expects you to differentiate Bigtable vs. other storage:

  • Bigtable vs. BigQuery: BigQuery is for analytic queries (scans, joins) with SQL on huge datasets, but has higher latency (seconds) and not for single-row reads. Bigtable is for fast key-value access and high throughput of small operations (and it’s NoSQL without joins). If data requires aggregation of many records, BigQuery or Dataflow might be used; if data is accessed by key (like looking up a customer’s recent IoT readings to display on a dashboard), Bigtable is great.

  • Bigtable vs. Spanner/SQL: Spanner provides SQL and transactions, but has lower throughput for heavy writes (each write is transactional and replicated). Bigtable is better for simple operations at extreme scale (e.g., time-series ingestions). But Bigtable doesn’t do multi-row transactions or complex queries. If scenario says “needs SQL and strong consistency for >10TB data” – that’s Spanner; if “needs to ingest millions of writes per second and query by key with eventual consistency” – Bigtable.

  • Bigtable vs. Firestore: Firestore is also a NoSQL, but intended for app development (document store with simpler usage, but not as high throughput). Bigtable is much more scalable and lower-level (HBase API, etc.), whereas Firestore has richer querying (limited query by fields, but still not SQL). Firestore auto-scales but with some limits like write rates per document key. Bigtable is for truly large-scale analytics or operational workloads where you control schema design.

Use Cases on exam: IoT data storage, user analytics with large scale (e.g., storing user click streams keyed by user, then using Dataflow to aggregate), time-series monitoring (like metrics storage), recommendation system data (large matrices that need quick read by ID). Also, Bigtable often underpins ML feature stores for large volume of features that are indexed by an ID for online prediction lookup (though GCP also has Vertex AI Feature Store for that use-case specifically). If they mention HBase migration or using open-source HBase, Bigtable is the managed alternative (and you can use the HBase API with it).

Cloud Spanner – Globally Distributed SQL Database

Cloud Spanner is Google Cloud’s fully-managed, globally-distributed relational database with strong consistency and high availability. It’s a unique offering that combines the relational semantics (schemas, SQL, ACID transactions) with horizontal scalability across data centers and even continents. Spanner achieves this via Google’s TrueTime API (which leverages atomic clocks for globally synchronized time). For the exam, Spanner is important for scenarios needing enterprise OLTP at scale – think of it as an alternative to traditional relational databases when you outgrow them, or when you need multi-region writes.

Capabilities: Spanner offers ANSI SQL (with two dialects: GoogleSQL and a PostgreSQL interface). You can define schemas, primary keys, secondary indexes, etc. Spanner provides strong consistency for reads and writes (reads are strongly consistent by default, or can be tuned to stale reads for performance). It supports distributed ACID transactions across rows and even across regions – this sets it apart from most NoSQL stores, which often drop transactions. Spanner’s consistency and transactions mean you can use it for financial systems, inventory management, etc., where correctness is critical. Spanner automatically shards data across nodes (into “splits”) based on primary-key ranges. It scales by adding nodes, which increases processing and storage capacity. Spanner now also supports autoscaling and more granular compute units – you no longer need to add a full node; you can add smaller increments of compute called Processing Units (1,000 PUs = 1 node). This makes it more flexible and accessible for smaller workloads too.

Global and Regional: Spanner can be configured as single-region, multi-region (synchronous replication across zones), or multi-continental (with read-write instances across continents). In multi-region configurations, it uses consensus replication (Paxos) – typically 3 or more replicas – to commit transactions. This yields a 5 9’s availability (99.999%) for multi-region Spanner. The tradeoff for global Spanner is write latency – commits have to be confirmed by a majority of replicas, so writes might be slower (e.g. a few tens of ms across regions). But it’s still often < 100 ms globally, which is incredible for what it does. For exam, if the scenario says “globally consistent database for user account transactions, must survive a regional outage without data loss” – Spanner is the answer (Spanner will keep working if a region goes down, as long as majority of replicas are up, due to synchronous replication).

Use Cases: Spanner is best when you need a relational schema and transactional consistency and need to scale beyond what a single-instance database can handle, or need multi-region. For example, a worldwide inventory database, a banking ledger system, or any service where downtime is not acceptable and data consistency is paramount. It’s used under the hood by Google for services like AdWords, etc. If exam scenario is an IoT time-series with simple queries – that might be better Bigtable. But if it’s “financial transactions with high volume and need for consistency and SQL queries on that data” – Spanner. Spanner can handle a high volume of writes (not as high as Bigtable’s unstructured throughput, but extremely high for a relational system). It also can do reads and complex queries (though BigQuery would outperform for heavy analytics). It’s not meant for petabyte-scale analytics (that’s BigQuery), but can easily handle terabyte-scale OLTP.

Schema and Indexing: In Spanner, you design tables with primary keys. A key design feature is interleaved tables: you can define a child table that is physically co-located with the parent table rows (like hierarchical data) – this optimizes join performance for one-to-many relationships by storing them together. Spanner supports secondary indexes for querying by columns other than primary key, but every secondary index also has to be maintained with strong consistency (so writes that affect an index are a bit heavier). The exam might test knowledge like “how to optimize Spanner schema for locality – answer: use interleaved tables for one-to-many relationships that are always accessed together.”
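For illustration, hypothetical Spanner DDL for an interleaved parent/child pair plus a secondary index, applied via the Python client’s update_ddl call (instance and database IDs are placeholders):

```python
from google.cloud import spanner

client = spanner.Client()
instance = client.instance("my-instance")
database = instance.database("orders-db")

# Interleave line items under their parent order so related rows are stored together.
operation = database.update_ddl([
    """CREATE TABLE Orders (
         OrderId  INT64 NOT NULL,
         Customer STRING(MAX),
       ) PRIMARY KEY (OrderId)""",
    """CREATE TABLE OrderItems (
         OrderId  INT64 NOT NULL,
         ItemId   INT64 NOT NULL,
         Sku      STRING(36),
         Quantity INT64,
       ) PRIMARY KEY (OrderId, ItemId),
       INTERLEAVE IN PARENT Orders ON DELETE CASCADE""",
    # Secondary index for lookups by SKU (kept strongly consistent on every write).
    "CREATE INDEX OrderItemsBySku ON OrderItems(Sku)",
])
operation.result()   # update_ddl returns a long-running operation
```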

Transactions: Spanner uses two-phase commit and TrueTime for timestamps to ensure external consistency (transactions commit in a globally serializable order). You as a developer just use transactions (Spanner even supports read-write transactions and read-only snapshot transactions). For very high throughput, they suggest keeping transactions short and favoring single-row transactions when possible (since those can be handled by a single server).
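A minimal sketch of a read-write transaction with the google-cloud-spanner Python client (table and column names are made up); both updates commit atomically or not at all:

```python
from google.cloud import spanner

client = spanner.Client()
instance = client.instance("my-instance")
database = instance.database("orders-db")

def transfer_funds(transaction):
    # Both statements run in one read-write transaction with external consistency.
    transaction.execute_update(
        "UPDATE Accounts SET Balance = Balance - 100 WHERE AccountId = 1")
    transaction.execute_update(
        "UPDATE Accounts SET Balance = Balance + 100 WHERE AccountId = 2")

database.run_in_transaction(transfer_funds)
```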

Backup and Export: Spanner has backup capability (you can take consistent backups without downtime). It also has a change streams feature to stream changes out (for feeding analytics, similar to Bigtable’s change streams). Dataflow’s SpannerIO connector can read from Spanner in batch or consume change streams for streaming pipelines.

Costs: Spanner is a premium service – historically you had to provision at least 1 node (~$700/month). With granular PUs, you can provision less (100 PUs ~ $70/month), making it reachable for smaller projects. But it’s still more expensive than something like Cloud SQL for small loads. So cost is a consideration: exam might hint “small startup with modest SQL needs” – likely Cloud SQL is enough, not Spanner. Spanner is for massive or mission-critical SQL needs.

Comparison:

  • Spanner vs Cloud SQL: Cloud SQL runs MySQL/Postgres on a single server (with replication possible but not scaling writes or horizontal sharding). It has a size and QPS limit beyond which it can’t scale well. Spanner can scale out and handle more throughput and larger size (tens of TB or more) seamlessly. Cloud SQL has simpler usage (existing tools, engines), Spanner is a Google-proprietary tech (with PostgreSQL interface now to ease adoption).

  • Spanner vs Bigtable: Already discussed above – Spanner if you need SQL and multi-row transactions, Bigtable if you just need key-value at insane scale and can tolerate eventual consistency or do not need structured schema across rows.

  • Spanner vs BigQuery: BigQuery for analytics (OLAP), Spanner for transactions (OLTP) with high consistency demands. They are often complementary (you might periodically export Spanner data to BigQuery for analytics).

In summary, Cloud Spanner is your answer for globally distributed consistent database. If an exam question involves global user data that must remain consistent (for example, banking or account info replication worldwide) – Spanner is likely the expected solution.

Cloud Dataproc – Managed Hadoop/Spark for Data Processing

Cloud Dataproc is GCP’s managed service for running Apache Hadoop and Spark clusters (and related big data frameworks like Hive, Presto, Pig). It allows you to create clusters on demand with a specific number of VMs (master and workers) pre-configured with the Hadoop/Spark ecosystem. Dataproc is often used for lift-and-shift of existing on-prem Hadoop workloads or when you need the flexibility of the Spark ecosystem for data processing.

Key Features: Dataproc clusters can be ephemeral (created to run a job, then shut down) or long-running (like a persistent Hadoop cluster). They can scale up or down, and you can use cheaper preemptible VMs for workers to save cost on batch jobs. Dataproc has a fast startup time (around 90 seconds or less for a cluster, which is much quicker than unmanaged Hadoop setup). It integrates with Cloud Storage (you typically store data and scripts in GCS instead of HDFS, though Dataproc clusters do have HDFS for ephemeral storage if needed).

Spark on Dataproc: Spark is a general-purpose engine for batch and stream (Spark Streaming, though Dataflow is generally preferred for streaming in GCP). You might choose Spark if you have existing Spark jobs or need advanced libraries (Spark MLlib, GraphX, etc.) or prefer using the Spark SQL/DataFrame API. For example, some workflows might extract data from BigQuery using the Spark BigQuery connector, perform complex in-memory transformations in Spark, then load results back. Dataproc also supports Jupyter Notebooks, so data scientists can use PySpark in notebooks on Dataproc Hub.
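A sketch of a PySpark job you might submit to Dataproc (project, dataset, and bucket names are placeholders), using the spark-bigquery connector to read, transform, and write back:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-transform").getOrCreate()

# Read a BigQuery table into a Spark DataFrame. The spark-bigquery connector ships
# with recent Dataproc images or can be added via --jars when submitting the job.
orders = (
    spark.read.format("bigquery")
    .option("table", "my-project.analytics.orders")
    .load()
)

# In-memory transformation with the DataFrame API.
daily = (
    orders.groupBy("order_date")
    .sum("amount")
    .withColumnRenamed("sum(amount)", "revenue")
)

# Write results back to BigQuery; a temporary GCS bucket stages the load.
(
    daily.write.format("bigquery")
    .option("table", "my-project.analytics.daily_revenue")
    .option("temporaryGcsBucket", "my-staging-bucket")
    .mode("overwrite")
    .save()
)
```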

Hadoop ecosystem: Dataproc allows use of Hive (for SQL on Hadoop), Pig (older scripting), Presto/Trino, Flink (though now Dataflow covers that too), and more. If a question scenario is about using open-source tools or migrating an existing Hadoop cluster, Dataproc is the answer. It is basically “Hadoop/Spark as a service”.

When to use Dataproc vs Dataflow: This comparison comes up often. Dataflow is serverless and great for new pipelines, especially streaming, but requires writing code against the Beam model. Dataproc might be used if you have existing Spark or Hadoop jobs or need specific libraries not available in Beam. It’s also useful for exploratory data analysis on big data using notebooks. If you need a custom environment (say, custom Python libraries or GPU support in Spark), Dataproc can accommodate that. Dataproc can also run Hive/Spark jobs that process data and output to BigQuery or GCS – a similar end result to Dataflow but using Spark SQL or HiveQL. The exam might expect: Dataflow for fully managed simplicity and streaming or simpler batch; Dataproc if the question specifically references Hadoop/Spark, on-prem migration, or custom big data frameworks.

Scaling: You can autoscale Dataproc clusters (set policies to add workers if CPU high, remove if low). Also Dataproc has autoscaling clusters for jobs (especially with ephemeral clusters). There’s Dataproc on Kubernetes and Serverless Dataproc (called Dataproc Serverless for Spark) where you submit a Spark job and Google runs it without you managing the cluster explicitly (this uses ephemeral compute in the background). This is analogous to Dataflow (serverless) but for Spark jobs. It’s a newer feature (GA in 2021/22). The exam might not delve deep into that, but it’s good to know it exists: if they mention “run Spark without cluster management overhead” – Dataproc Serverless is the solution.

Data Lake and Dataproc: In a GCP data lake, Dataproc might be used to run Spark SQL queries on files in GCS (which can be an alternative or complement to BigQuery). E.g., using Spark to ETL raw files into Parquet. Dataproc can also integrate with Dataplex for auto-discovery of data and manage multiple lakes.

Security: Dataproc integrates with Kerberos, etc., but many prefer using Cloud IAM and keeping things simple (for instance, storing data in GCS with IAM controls, rather than HDFS with Kerberos). If a cluster is processing sensitive data, you might enable Shielded VMs, use VPC Service Controls, and CMEK on data in GCS.

Exam tip: If you see a mention of needing Hadoop ecosystem tools (like Pig script, or using HDFS, or migrating from on-prem Cloudera, etc.), think Dataproc. If it’s a generic “we need to transform data at scale and store in BigQuery but no mention of Hadoop/Spark specifically”, Dataflow is likely. Google’s guidance often pushes Dataflow for new development because it’s easier to maintain (no clusters). But Dataproc is still very relevant, especially for custom or existing workloads.

Cost: Dataproc is billed per VM (plus a small management fee per cluster). You can use preemptibles to cut costs for non-critical tasks (like spot instances that may be reclaimed – good for transient big jobs). Also turning off clusters when not in use is key.

Other: Dataproc allows customization via initialization actions (scripts to run on startup to install extra packages). Also, one can run Dask, RAPIDS, Jupyter on Dataproc by customizing it.

In summary, Cloud Dataproc gives you the flexibility of open-source big data tools on GCP with less ops effort. It’s important for exam to know when it’s the best fit versus Cloud Data Fusion or Dataflow.

Cloud Composer (Airflow) – Workflow Orchestration

Cloud Composer is a fully-managed workflow orchestration service built on Apache Airflow. It allows you to author, schedule, and monitor complex workflows (DAGs – Directed Acyclic Graphs of tasks). In data engineering, Composer is used to automate data pipelines end-to-end, especially when there are multiple steps and dependencies across various services.

Airflow Basics: In Airflow (and thus Composer), you define DAGs in Python code. A DAG might have tasks like “run a Dataflow job, then run a BigQuery query, then send an email”. Airflow has many pre-built operators for GCP (e.g., BigQueryOperator, DataflowOperator, DataprocSubmitJobOperator), which make it easy to call those services from a workflow script. Cloud Composer runs Airflow for you – you don’t manage the Airflow server or database; Google handles that (on GKE under the hood). You get an Airflow UI to monitor runs, set schedules, and manage connections.
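A minimal, hypothetical Airflow DAG sketch using operators from the Google provider package (bucket, dataset, and table names are made up) to load a daily file from GCS and then run a BigQuery aggregation:

```python
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="nightly_sales_pipeline",
    schedule_interval="0 2 * * *",       # every day at 02:00
    start_date=datetime(2025, 1, 1),
    catchup=False,
    default_args={"retries": 2},
) as dag:

    load_raw = GCSToBigQueryOperator(
        task_id="load_raw",
        bucket="my-landing-bucket",
        source_objects=["sales/{{ ds }}.csv"],    # templated with the run date
        destination_project_dataset_table="analytics.sales_raw",
        write_disposition="WRITE_TRUNCATE",
        autodetect=True,
    )

    aggregate = BigQueryInsertJobOperator(
        task_id="aggregate",
        configuration={
            "query": {
                "query": "SELECT region, SUM(amount) AS revenue "
                         "FROM analytics.sales_raw GROUP BY region",
                "destinationTable": {
                    "projectId": "my-project",
                    "datasetId": "analytics",
                    "tableId": "sales_by_region",
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )

    load_raw >> aggregate   # the aggregation runs only after the load succeeds
```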

Use Cases: Use Composer when you need to coordinate multiple jobs or handle retries and dependencies that span services. For example, a nightly pipeline might: extract data from an API (perhaps via Cloud Function), load to Cloud Storage, trigger a Dataflow job to process it, then load results to BigQuery, and finally kick off a BigQuery ML training query. Composer can orchestrate all those steps with proper ordering and error handling. If any step fails, Airflow can retry it or alert. Without Composer, you might have to chain together Cloud Functions with Pub/Sub triggers or use Cloud Scheduler to time things, which gets hard to manage as complexity grows. Airflow is a standard tool for workflow management, so familiarity is expected.

Cloud Composer vs. Cloud Scheduler vs. Workflows: These are related:

  • Cloud Scheduler is a cron-like service to trigger jobs or HTTP endpoints on a schedule. If you just have a single task (like “trigger this Dataflow pipeline every day at 1 AM”), Scheduler might suffice. It’s very simple, not aware of multi-step dependencies.

  • Cloud Workflows is another GCP service to orchestrate serverless tasks using a YAML-defined workflow – it’s good for calling Cloud Functions/Cloud Run or HTTP APIs in sequence, with conditional logic, loops, etc. Workflows is lightweight and can be used for simpler orchestrations or event-driven flows. However, it’s not specifically tailored to data pipelines (it has connectors for calling Google APIs, but nothing like Airflow’s rich operator ecosystem for data work).

  • Cloud Composer (Airflow) is heavier but more powerful for complex data pipelines, especially where many GCP services are involved. It shines when you have dependencies that are not just time-based but data-based (e.g., run task B after task A outputs a file).

The exam might explicitly mention Airflow or Composer if they expect that as the answer. E.g., “We need to manage monthly and daily jobs with dependencies, with an audit trail and failure alerts, which service to use?” – Cloud Composer. Or if a pipeline has conditional branching (like if data size > threshold do X else Y), Airflow can handle that logic. Airflow also integrates with Slack/email for notifications, and can keep track of historical runs.

High Availability: Composer v2 (current) runs Airflow 2.x on GKE with better architecture – it separates the Airflow scheduler and workers, and can scale them. It supports multi-zone for high availability of Airflow components (multiple schedulers in HA mode). This ensures workflows trigger reliably. If exam touches on “ensuring the orchestrator isn’t a single point of failure”, mention that Composer uses multiple schedulers for HA.

Monitoring and Logging: Composer tasks log to Cloud Logging by default, and can be monitored through Cloud Monitoring. You should remember that Airflow’s scheduler orchestrates but doesn’t do heavy data work itself – tasks are either executed in Airflow (like small Python ops) or trigger external services (Dataflow jobs, etc.). Offloading heavy lifting to managed services is ideal, and Airflow is just the conductor.

Security: Cloud Composer environments run in a VPC, you can configure private IP only, etc. Airflow connection settings (like database passwords) can be stored in Airflow’s connections securely (potentially referencing Secret Manager).

Workflows vs. Composer summary: Google Workflows is newer and great for event-driven or microservice orchestration (with low overhead), while Composer/Airflow is the tried-and-true choice for complex data workflows and schedules. The exam could present a scenario of chaining multiple Cloud Functions or HTTP calls – either Workflows or Composer could do it. If it’s purely serverless calls without much complexity, Workflows might be the answer; if it’s clearly a data pipeline on a schedule with many steps, Composer is the answer.

All in all, use Cloud Composer for enterprise data pipeline orchestration where you need an Airflow DAG to manage tasks. It’s an important service to know for professional data engineers, as real-world pipelines often need orchestration beyond a single Dataflow job.

Cloud Data Fusion – Managed ETL (GUI-Driven)

Cloud Data Fusion is a fully-managed, code-free data integration service. It provides a visual drag-and-drop interface to design ETL/ELT pipelines (built on the open-source CDAP platform). It’s essentially GCP’s GUI tool for building data pipelines without writing code – allowing you to connect to various data sources, transform data, and load into targets using a graphical pipeline canvas.

Features: Data Fusion comes with many connectors and transforms. For example, it can connect to on-prem databases (via direct connect or through Datastream for change data capture), SaaS APIs, Cloud Storage, Pub/Sub, etc. You can then visually add transformations like join, aggregate, filter, parse, etc., similar to tools like Informatica or Talend. Under the hood, Data Fusion pipelines run on Dataproc – when you execute a pipeline, it spins up a Dataproc Spark or MapReduce job to perform the tasks at scale. But the user doesn’t have to manage that; it’s abstracted away.

Use Cases: Data Fusion is great for organizations that have data engineers or integrators who prefer a UI for building pipelines rather than coding in Python/Java. It’s also very useful for data migration and integration use cases. For example, migrating an on-prem Oracle database to BigQuery – Data Fusion has plugins to read from Oracle and write to BigQuery, handling schema mapping, etc. The exam blueprint specifically calls it out (in domains 1.2 and 2.2), meaning you should know it exists as an option. If a question scenario is “a team with limited coding skills needs to build and schedule data pipelines from various sources to GCP data targets,” Data Fusion is a likely answer (or possibly Cloud Dataflow templates, but Data Fusion is explicitly the no-code approach). Also, if the question mentions “quickly integrate a new data source with minimal custom code”, Data Fusion fits.

Comparison with Dataflow: Dataflow requires coding in Beam; Data Fusion is GUI-based. Dataflow can be more flexible for complex logic and custom operations, but Data Fusion covers many common transformations and can also include custom Python/Java code via plugins if absolutely needed. For a simple task like “join these two CSV datasets, apply some functions, then load to BigQuery”, either Dataflow or Data Fusion could do it – if the exam angle is about ease and no-code, Data Fusion is correct.

Deployment: Data Fusion has the concept of instances (Basic or Enterprise). Enterprise instances can connect to on-prem data via a secure gateway and have advanced features like pipelines with DLP integrations or scheduling. Basic is cheaper and aimed at simpler, cloud-only use. Typically, you’ll create an instance and then design pipelines in its UI. At runtime, ephemeral Dataproc clusters are created for pipeline runs (the Enterprise edition allows private-IP Dataproc, etc.). The exam likely won’t go into that depth, just the high level.

Scheduling: Data Fusion pipelines can be scheduled (it has an internal scheduler and can also be triggered via its REST API). However, for complex multi-pipeline coordination, you might still use Composer to orchestrate multiple Data Fusion pipelines (there’s an operator for that). But Data Fusion itself can handle simple scheduling, e.g., run this pipeline every hour.
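
As a sketch of the REST trigger mentioned above (not an official recipe): a deployed batch pipeline can be started through the instance’s CDAP API endpoint. The endpoint URL and pipeline name below are placeholders – in practice you would read the apiEndpoint field from the Data Fusion instance and authenticate with an OAuth access token.

```python
# Hedged sketch: starting a deployed Data Fusion batch pipeline via the CDAP REST API.
import google.auth
import google.auth.transport.requests
import requests

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())

# Placeholder endpoint; real value comes from the instance's apiEndpoint field.
CDAP_ENDPOINT = "https://my-instance-my-project-dot-europe-west1.datafusion.googleusercontent.com/api"
PIPELINE = "orders_to_bigquery"  # placeholder pipeline name

resp = requests.post(
    f"{CDAP_ENDPOINT}/v3/namespaces/default/apps/{PIPELINE}/workflows/DataPipelineWorkflow/start",
    headers={"Authorization": f"Bearer {credentials.token}"},
    timeout=30,
)
resp.raise_for_status()
print("Pipeline start requested:", resp.status_code)
```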

Wrangler: Data Fusion includes Wrangler, a feature for interactively cleaning a sample of data and generating transformation recipes (like splitting columns, converting formats) that then can be applied at scale in the pipeline. This is helpful for rapid development of cleaning steps by data analysts.

Exam scenario fit: The exam might mention a need for an ETL tool that multiple analysts can use, or where maintainability and speed of development are important. Also, if a question involves integrating a database where you might do change capture – Data Fusion can incorporate CDC via Datastream (it can use Datastream’s stream of changes as a source). Data Fusion also has a replication feature (still evolving) for some databases, providing near-real-time replication to BigQuery.

Remember: Data Fusion = GUI ETL on Dataproc. In contrast, Cloud Dataprep (a separate no-code tool based on Trifacta, now Alteryx) is for interactive data preparation at small scale or for prototyping, while Data Fusion’s Wrangler offers a similar interactive cleaning experience when building production pipelines. The exam blueprint mentions Dataprep in the context of preparing and cleaning data, meaning they expect you to know that GCP offers Dataprep/Trifacta for visually cleaning data (especially by analysts), and Data Fusion for building production pipelines with similar transformations without coding.

BigQuery Data Transfer Service & Data Migration Tools

In any data engineering project, getting data into the platform is critical. Google Cloud offers specialized services for migrating and transferring data from various sources:

  • BigQuery Data Transfer Service (DTS): This is a managed service that regularly loads data into BigQuery from external sources – including Google products (Google Ads, Google Analytics, YouTube, etc.), Cloud Storage (CSV/JSON from GCS), and SaaS apps (Salesforce, etc.). It can be scheduled (say, a daily import) and handles schema updates and errors reasonably well. The exam might mention “ingesting marketing analytics data from Google Ads/Analytics to BigQuery daily” – the answer is to use BigQuery DTS rather than writing a custom pipeline (a minimal configuration sketch follows this list).

  • Storage Transfer Service: Used for large-scale transfers to Cloud Storage. It can pull data from AWS S3, Azure Blob, another GCS bucket, or even SFTP, and place it into GCS on a schedule or as a one-time job. It’s important for moving existing data lakes or doing initial bulk loads. For on-prem, it has an agent to move data from a local filesystem to GCS. If you’re moving a ton of files, use this instead of writing your own scripts.

  • Transfer Appliance: A physical appliance for offline data transfer. You order an appliance, load it with tens or hundreds of terabytes from your data center, ship it back to Google, and they import it into Cloud Storage. This is used when network transfer would be too slow (e.g., you have 500 TB and a slow link). If the exam mentions huge volumes and limited bandwidth, Transfer Appliance is the answer.

  • Database Migration Service (DMS): This service helps migrate operational databases to Cloud SQL or AlloyDB. It supports MySQL, PostgreSQL, and SQL Server migration to Cloud SQL with minimal downtime (using replication). It can also do PostgreSQL-to-AlloyDB migrations, etc. If a question says “migrate a MySQL DB to GCP with minimal downtime”, DMS is designed for that. It often uses continuous replication under the hood to catch the target up with the source. DMS doesn’t support everything (Oracle migration is not fully GA via DMS – for Oracle they often mention Datastream).

  • Datastream: This is a Change Data Capture (CDC) service for streaming data replication from relational databases (Oracle, MySQL, recently PostgreSQL) into Google Cloud targets (like BigQuery, Cloud SQL, Spanner). It reads the source DB’s transaction logs and streams changes. For example, replicating an on-prem Oracle to BigQuery in near real-time – use Datastream, which will write change events to Cloud Storage or Pub/Sub, and you typically use Dataflow to apply them to BigQuery (there is a template for that). The exam could have a scenario: “we need to continuously replicate data from an operational MySQL database into BigQuery for analytics” – answer: Datastream (CDC) into BigQuery. Or if migrating Oracle to Cloud SQL, one approach is using Datastream to minimize downtime, since DMS doesn’t natively support Oracle (as of now, Oracle -> Cloud SQL uses Datastream in preview). Datastream combined with Dataflow can also feed Spanner or others. Keep in mind Datastream is serverless CDC. It decouples extract (Datastream) from apply (Dataflow). This is a modern pipeline for data warehousing where you want nearly live data from a source database without heavy custom ETL.

  • Firestore/BigQuery Export/Import: Note that BigQuery has native export (to GCS) and import, and Firestore has export to GCS as well. The exam might not go into these specifics, but for completeness: if you needed to migrate data out of BigQuery, you can extract it to GCS and then transfer it.
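
Here is the BigQuery DTS configuration sketch referenced above, using the google-cloud-bigquery-datatransfer Python client to set up a scheduled Cloud Storage load. The project, dataset, bucket path, and schedule are placeholder assumptions.

```python
# Hedged sketch: create a scheduled GCS -> BigQuery transfer with BigQuery DTS.
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="analytics",            # placeholder dataset
    display_name="daily-gcs-load",
    data_source_id="google_cloud_storage",         # GCS -> BigQuery transfer
    params={
        "data_path_template": "gs://my-bucket/exports/*.csv",   # placeholder
        "destination_table_name_template": "events",
        "file_format": "CSV",
        "skip_leading_rows": "1",
    },
    schedule="every 24 hours",
)

created = client.create_transfer_config(
    parent=client.common_project_path("my-project"),  # placeholder project
    transfer_config=transfer_config,
)
print("Created transfer config:", created.name)
```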

AlloyDB: While not exactly a migration tool, it’s a newer destination. AlloyDB is Google’s advanced PostgreSQL-compatible database, offering high performance (with a vectorized columnar engine for analytical queries and improved scalability). It’s positioned as an alternative to Cloud SQL Postgres for more demanding workloads (particularly HTAP workloads – Hybrid Transactional/Analytical Processing – since AlloyDB can handle analytical queries faster). The exam might mention AlloyDB as an option if you need a PostgreSQL interface with much higher read performance (but note, AlloyDB is not globally distributed like Spanner; it’s regional, scaling reads via read replicas and a separate analytic engine). If they mention “need PostgreSQL but several times the performance of standard Postgres for heavy analytical joins”, AlloyDB fits – just remember it is still a single primary with replicas, not multi-primary for writes. They likely just expect recognition that AlloyDB is a newer choice among relational DBs on GCP, besides Cloud SQL and Spanner.

Summary for the exam:

  • Use BigQuery DTS for scheduled imports into BigQuery from known supported sources (no code).

  • Use Storage Transfer or Transfer Appliance for bulk data transfer to cloud.

  • Use DMS for one-time migrations of MySQL/Postgres/SQL Server to Cloud SQL/AlloyDB (especially minimal-downtime migrations).

  • Use Datastream for ongoing replication of database changes (or for Oracle migrations) to analytics or cloud databases.

  • Recognize AlloyDB as an option for high-performance PostgreSQL in cloud, perhaps mentioned along with Spanner as advanced DB solutions.

Cloud SQL – Managed Relational Databases (MySQL, PostgreSQL, SQL Server)

Cloud SQL is GCP’s managed database service for popular relational engines: it offers fully-managed MySQL, PostgreSQL, and SQL Server instances. It handles tasks like automated backups, patching, high-availability failover, and read replicas. Cloud SQL is often the go-to for migrating existing applications that use these databases or setting up new small-to-medium databases quickly.

Features: Cloud SQL provides a familiar environment – you have a database instance with certain vCPU/RAM, storage size (up to 30 TB per instance currently), and you can connect via standard drivers or sockets. For High Availability, Cloud SQL offers a configuration with a primary and a standby in another zone (for MySQL/Postgres). It uses synchronous replication to the standby and automatic failover if primary fails, giving ~99.95% availability SLA. There are also read replicas (asynchronous) for scaling reads (including cross-region replicas if needed for geographic distribution of reads). Cloud SQL supports point-in-time recovery (PITR): continuous WAL archiving for Postgres or binlog for MySQL, so you can restore to any time within a retention window (e.g., last 7 days). This is useful for recovering from accidental data deletion by restoring to just before it happened.

Use Cases: Use Cloud SQL when you need a traditional relational database for an application – typically OLTP workloads with moderate scale. For example, a web app backend storing user data, or a small analytics dataset that needs complex SQL joins but is only a few hundred GB. The exam might present a scenario like “we have a 1 TB MySQL on-prem, need to lift-and-shift to cloud with minimal changes” – Cloud SQL MySQL is clearly intended there, possibly with DMS to migrate it. Or if an application specifically requires PostgreSQL features (extensions, etc.), Cloud SQL Postgres is the path.

Limits vs. Spanner/AlloyDB: Cloud SQL is essentially one VM (or a primary/standby pair) handling the DB. It can scale vertically up to 64 or 96 vCPUs depending on the edition, but it cannot automatically shard across multiple write nodes, so there is a performance/scale ceiling. For high scale (tens of thousands of writes per second, more than ~30 TB, or multi-region writes), Cloud SQL is not enough – that’s where Spanner or AlloyDB come in. But Cloud SQL covers a huge range of general use cases with less operational effort than running your own database on a VM.

Security: Cloud SQL data is encrypted at rest and in transit. It can use Cloud IAM database authentication for Postgres, or you can manage users as usual in the DB. It also now supports CMEK for storage if needed. Connectivity can be through public IP (with SSL) or private IP (over VPC). A common design is to use private IP Cloud SQL inside a VPC for better security.
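
As a small illustration of the private-IP connectivity pattern (a sketch, not the only approach): the Cloud SQL Python Connector can open a connection over the VPC without exposing a public IP. The instance connection name, user, password, and database below are placeholders – in production you would pull the credential from Secret Manager or use IAM database authentication.

```python
# Hedged sketch: connect to a private-IP Cloud SQL Postgres instance using the
# Cloud SQL Python Connector (pip install "cloud-sql-python-connector[pg8000]").
from google.cloud.sql.connector import Connector, IPTypes

connector = Connector()

conn = connector.connect(
    "my-project:europe-west1:orders-db",  # placeholder instance connection name
    "pg8000",
    user="app_user",                      # placeholder
    password="change-me",                 # better: fetch from Secret Manager
    db="orders",                          # placeholder
    ip_type=IPTypes.PRIVATE,              # stay on the VPC, no public IP needed
)

cur = conn.cursor()
cur.execute("SELECT NOW()")
print(cur.fetchone())
cur.close()

conn.close()
connector.close()
```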

Maintenance: Google auto-applies minor version upgrades, and you can set maintenance windows. The exam may not go deep here, but know that “managed” means less ops work and also less control (for instance, certain flags can’t be changed), though Cloud SQL still allows many configuration adjustments and extensions, especially for Postgres.

AlloyDB vs Cloud SQL: If an exam scenario says “the team’s Postgres is suffering performance issues on analytics queries and they'd like to scale it, what do you do?”, AlloyDB might be the answer – because AlloyDB is Postgres-compatible but adds a columnar engine that accelerates analytic queries (up to 100x in some cases) and can scale out read nodes. So AlloyDB is a good choice if a single-node Cloud SQL Postgres can’t handle the workload. However, AlloyDB is newer and not as likely to be deeply covered yet. The key is recognizing that GCP has a spectrum: Cloud SQL (simple, single-node), AlloyDB (more performance for Postgres, multi-node for reads, still regional), Spanner (different engine, global, horizontal scale, though no MySQL compatibility).

Microsoft SQL Server: Cloud SQL also supports SQL Server, making migrations from SQL Server easier. It has similar features for backups and high availability. It might appear in an exam question if they mention MSSQL specifically – just know that Cloud SQL provides it.

Scaling: Cloud SQL can scale vertically (change machine type) with some downtime. It cannot shard horizontally, but you can distribute reads to replicas; for writes, there is only one primary. Throughput can reach roughly tens of thousands of IOPS on high-end instances with good storage, but beyond that, Spanner or another distributed DB is needed.

In exam context, Cloud SQL is the default choice for typical relational database needs up to medium scale. If no special need for Spanner and the data size/transactions are moderate, Cloud SQL should be your answer for deploying a relational DB in GCP (especially if existing app uses MySQL/Postgres). And use Database Migration Service to get data into Cloud SQL if coming from similar engines with minimal downtime.

Firestore – NoSQL Document Database (Serverless)

Cloud Firestore is a fully-managed NoSQL document database optimized for web, mobile, and IoT applications that need to store JSON-like data, sync data in real-time, and scale automatically. Firestore is part of Google’s Firebase platform and Google Cloud. It offers semi-structured data storage with documents (key-value pairs, possibly nested) organized into collections. It’s schema-less (you can have different fields in different docs), but typically similar docs share structure for your app’s logic.

Modes: Firestore has Native mode and Datastore mode. Native mode is the recommended one – it supports real-time updates, mobile offline sync, etc. Datastore mode is for compatibility with the older Google Datastore (used with App Engine), which had some different constraints (like eventual consistency for global queries in legacy Datastore). On the exam, unless they mention Datastore specifically, think in terms of Firestore Native mode.

Consistency and Transactions: Firestore in native mode is strongly consistent for reads and writes in a single document or across a collection (if in the same database instance). It even supports multi-document ACID transactions (with some limits like 500 document max in a transaction) – which is unique among many NoSQL databases. This strong consistency is a big improvement over the old Datastore which was eventually consistent for global queries. Firestore’s multi-region instances (like nam5 (us multi-region) or eur3 (eu multi-region)) ensure high availability, and still provide strong consistency for documents and queries (basically, reads default to strongly consistent in Cloud Firestore). You can also do offline data access on mobile and sync later.
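
A minimal sketch of such a multi-document transaction with the Firestore Python client (google-cloud-firestore); the collection, document, and field names are illustrative placeholders.

```python
# Hedged sketch: a multi-document ACID transaction in Firestore (Native mode).
from google.cloud import firestore

db = firestore.Client()

@firestore.transactional
def transfer_credits(transaction, from_ref, to_ref, amount):
    # Reads must happen inside the transaction before any writes.
    from_snap = from_ref.get(transaction=transaction)
    to_snap = to_ref.get(transaction=transaction)
    transaction.update(from_ref, {"credits": from_snap.get("credits") - amount})
    transaction.update(to_ref, {"credits": to_snap.get("credits") + amount})

from_ref = db.collection("users").document("alice")  # placeholder documents
to_ref = db.collection("users").document("bob")
transfer_credits(db.transaction(), from_ref, to_ref, 10)
```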

Querying: Firestore allows queries on fields (with automatic indexing on every field by default). You can filter and sort on fields, but queries are somewhat limited – e.g., you can’t do complex JOINs or aggregations server-side beyond simple count/average (recently sum and avg were introduced in queries). You often denormalize data for Firestore (like store redundant data to avoid joins). It’s great for retrieving documents by key or by attribute range quickly. It’s not meant to scan huge collections for analytics (that’s BigQuery’s job, though you can export Firestore data to BigQuery for analysis).

Scaling and Performance: Firestore auto-scales to very high traffic. It can handle thousands or tens of thousands of writes per second across a database, though there are per-document limits (a single document shouldn’t sustain more than roughly one write per second, for example – hotspotting one document is an anti-pattern). It’s managed behind the scenes so you don’t allocate instances; you just pay for storage and operations (reads/writes).

Use Cases: The exam might present scenarios like “mobile app needs to store user preferences and sync in real-time across devices” or “a gaming app needs a flexible database to store game state with real-time updates”. Firestore would be ideal. It’s often used for user-facing applications that require low latency and offline support. For IoT, Firestore can ingest moderate streams of data but for very high frequency telemetry, Bigtable might be better. Firestore is simpler to use and integrate (especially if a client-side needs direct DB access, as in Firebase).

Firestore vs. Bigtable vs. BigQuery:

  • Firestore is good for structured data of an app (like user profiles, orders, chat messages), not huge analytics or heavy aggregations.

  • Bigtable is better for time-series and analytics where you manage keys and need extreme scale (and don’t require rich querying or real-time client SDKs).

  • BigQuery is analytical, not for serving single record lookups in an app due to latency.

Firestore vs. Cloud SQL: Firestore is NoSQL and flexible; Cloud SQL is relational. If data fits a relational schema and you need SQL queries or transactions across many rows, Cloud SQL might be better. If the schema is flexible or unknown upfront, or you want simpler scaling, Firestore is great.

Firestore introduced TTL (time-to-live) policies – you can mark a field as an “expiry timestamp” and Firestore will auto-delete documents when they expire. This is useful for automatic data retention (for PII or simply to control DB size). Another big feature: Point-in-Time Recovery (PITR), which keeps old versions of documents for up to 7 days and allows you to restore or read the state of the database at an earlier time. This is crucial for disaster recovery beyond the daily backups – e.g., if someone accidentally deleted a bunch of data, you could do a PITR to yesterday. Firestore PITR became GA in late 2023. Also, count(), sum(), avg() queries are now available (so you can do aggregations without needing an external system).
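
A sketch of those server-side aggregations with the Python client, assuming a recent google-cloud-firestore version that exposes count() and sum(); the collection and field names are placeholders.

```python
# Hedged sketch: server-side aggregation queries in Firestore.
from google.cloud import firestore
from google.cloud.firestore_v1.base_query import FieldFilter

db = firestore.Client()
shipped = db.collection("orders").where(filter=FieldFilter("status", "==", "shipped"))

# Aggregations run on the server: only the aggregate value comes back,
# instead of every matching document.
count_results = shipped.count(alias="shipped_orders").get()
revenue_results = shipped.sum("total_amount", alias="revenue").get()

for result in count_results:
    print(result[0].alias, result[0].value)   # e.g. shipped_orders 1234
for result in revenue_results:
    print(result[0].alias, result[0].value)   # e.g. revenue 56789.0
```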

Firestore in Datastore mode: If the exam mentions a legacy App Engine app using Datastore, know that underneath it’s Firestore in Datastore mode. The differences: Datastore mode has no real-time updates and no mobile/web client SDKs, but (unlike legacy Datastore) its queries are now strongly consistent and the old per-entity-group write limits are lifted.

Security and integration: Firestore integrates with Firebase Authentication for client access; on server side, you use IAM. It can be used with Cloud Functions (trigger a function on data changes in Firestore, often used in serverless backends). It also can stream updates to BigQuery for analytics (via Firebase Extensions or custom Dataflow).

In summary, Cloud Firestore is the choice when you need a NoSQL DB for application data with simple queries and require scalability and real-time updates. If you see scenario about a shopping cart service, user settings, etc., Firestore likely fits. If they mention needing relational features or huge analytics, Firestore is not the answer.

Data Governance and Metadata: Data Catalog & Dataplex

As data volume and usage grow, managing and governing that data is crucial. Google Cloud provides Data Catalog and Dataplex to help with data governance, cataloging, quality, and a “data mesh” approach.

Data Catalog: Google Cloud Data Catalog is a fully-managed, centralized metadata catalog service. It automatically indexes metadata from BigQuery (datasets, tables, schemas), Pub/Sub topics, and Cloud Storage (if configured with filesets or via Dataplex). It allows you to tag data assets with custom tags (which can have attributes like “PII: true/false” or “Owner: Finance Dept”). These tags can be used for search/discovery and for access control (Policy Tags a.k.a. tag-based access control in BigQuery can restrict access to columns labeled sensitive). Data Catalog provides a search interface – e.g., an analyst can search “customer transactions” and find relevant datasets and tables, with descriptions, schema, and tags. Essentially, it’s like a data dictionary and business glossary in one. The exam expects that you know using Data Catalog is best practice for data discovery and metadata management – so people don’t create duplicate data due to not knowing something exists, and to comply with regulations by tagging PII.
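
To make tagging concrete, here is a hedged sketch with the Data Catalog Python client (google-cloud-datacatalog): create a simple tag template with a boolean “pii” field, look up the catalog entry that was auto-created for a BigQuery table, and attach a tag. The project, location, dataset, and table names are placeholders.

```python
# Hedged sketch: tag a BigQuery table as containing PII via Data Catalog.
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()
project, location = "my-project", "europe-west1"   # placeholders

# 1. A tag template with a single boolean "pii" field.
template = datacatalog_v1.TagTemplate()
template.display_name = "Data governance"
template.fields["pii"] = datacatalog_v1.TagTemplateField(
    display_name="Contains PII",
    type_=datacatalog_v1.FieldType(
        primitive_type=datacatalog_v1.FieldType.PrimitiveType.BOOL
    ),
)
template = client.create_tag_template(
    parent=f"projects/{project}/locations/{location}",
    tag_template_id="governance",
    tag_template=template,
)

# 2. Look up the entry Data Catalog indexed for the BigQuery table.
entry = client.lookup_entry(
    request={
        "linked_resource": (
            f"//bigquery.googleapis.com/projects/{project}"
            "/datasets/analytics/tables/customers"
        )
    }
)

# 3. Attach a tag marking the table as containing PII.
tag = datacatalog_v1.Tag(template=template.name)
tag.fields["pii"] = datacatalog_v1.TagField(bool_value=True)
client.create_tag(parent=entry.name, tag=tag)
```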

Dataplex: Cloud Dataplex is an intelligent data fabric – essentially a data lake management and governance solution. It can organize data into lakes and zones – e.g., a raw zone and a curated zone – even if the data is spread across Cloud Storage and BigQuery. It provides a unified view and catalog (it actually uses Data Catalog under the hood for metadata). Dataplex focuses on enabling a data mesh architecture, where different domains (departments/teams) manage their own data but under central governance. With Dataplex, you can set up data lakes for each domain, apply central policies, and have visibility across them. It also has built-in features like data quality checks (AutoDQ) and data profiling. For example, you can define rules like “no more than 5% nulls in column X” and Dataplex can scan and report quality issues. It also offers serverless Spark for ad-hoc data exploration on the lake without managing Dataproc clusters.

For the exam, note these likely contexts:

  • If a question is about data discovery: answer is Data Catalog (so analysts can find data with search, and see metadata like descriptions and tags).

  • If about governing a data lake or implementing data mesh: answer is Dataplex. For instance, “How to manage data across silos with central governance” – Dataplex organizes data and enforces policies across distributed storage.

  • If about managing data quality at scale: mention Dataplex’s data quality feature or using Cloud DLP for sensitive data detection plus tagging.

  • For metadata-driven access control: Data Catalog policy tags on BigQuery columns can restrict access to sensitive fields (e.g., only the compliance team can see columns tagged “PII”) – see the sketch after this list.
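
Here is the policy-tag sketch referenced in the last bullet: attaching an existing policy tag to a sensitive BigQuery column with the Python client. The taxonomy/policy-tag resource name, table, and column are placeholders, and the taxonomy itself would be created beforehand (in Data Catalog/Dataplex) with fine-grained read access granted to the compliance group.

```python
# Hedged sketch: apply column-level security by attaching a policy tag to a column.
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my_project.analytics.customers")  # placeholder table

policy_tag = (
    "projects/my-project/locations/eu/taxonomies/1234567890/policyTags/9876543210"
)  # placeholder resource name

new_schema = []
for field in table.schema:
    if field.name == "email":  # the sensitive column (placeholder)
        field = bigquery.SchemaField(
            name=field.name,
            field_type=field.field_type,
            mode=field.mode,
            policy_tags=bigquery.PolicyTagList(names=[policy_tag]),
        )
    new_schema.append(field)

table.schema = new_schema
client.update_table(table, ["schema"])  # only fine-grained readers can now query `email`
```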

Data Catalog is now effectively integrated into Dataplex (called Dataplex Data Catalog) – meaning if you use Dataplex, it uses Data Catalog for metadata, but you might see references to them together. But one can use Data Catalog alone, e.g., if mainly working with BigQuery, you might tag datasets and not use full Dataplex.

Analytics Hub (already covered above) also ties into governance – it makes sharing controlled (you can see who subscribed, revoke access centrally, etc.). So governance is also about safe data sharing, which Analytics Hub provides for BigQuery.

Security Policies: Governance includes things like Org Policies (for example, restricting data to certain regions, requiring CMEK, disallowing public datasets) – for exam, know Org Policies exist and can enforce these platform-wide restrictions.

In summary, Data Catalog and Dataplex ensure that as a data engineer, you not only build pipelines but also make data findable, understandable, and trustworthy. The exam will appreciate answers that mention tagging data for governance, using Catalog for discovery, and Dataplex for central lake management especially in modern data mesh contexts.

Business Intelligence and Visualization: Looker Studio, Looker, and More

After data is stored and processed, the last mile is making it available for analysis and visualization. GCP’s ecosystem provides integration with BI tools:

  • Looker Studio (formerly Google Data Studio): This is a free, web-based BI/dashboarding tool. It allows you to create interactive reports and dashboards with charts and filters, and it connects to BigQuery (native connector), Cloud SQL, spreadsheets, etc. Looker Studio is easy for analysts to use to share insights. It’s not as enterprise-feature-rich as paid BI tools, but it’s commonly used. For the exam, if they mention creating dashboards for business users on top of BigQuery, Looker Studio is a valid solution (especially since it’s Google’s own tool and doesn’t cost per user). When using it with BigQuery, consider enabling BI Engine (BigQuery’s in-memory cache) and optimizing queries (e.g., using pre-aggregated tables) so interactive filtering stays fast. If a question asks for a simple way to let analysts create reports on BigQuery data, the answer is to connect Looker Studio to BigQuery – just make sure it uses the viewer’s credentials or an appropriate service account. Looker Studio can also “extract” data (cache a subset) for performance or for sharing with external viewers who don’t have BigQuery access.

  • Looker (Google Cloud’s enterprise BI platform, acquired in 2020): Not to be confused with Looker Studio, Looker is a more powerful tool where developers define a data model (LookML) and business users explore data via that model. Looker connects to databases (especially BigQuery) and is great for governed metric definitions and complex dashboards. The exam blueprint doesn’t explicitly mention Looker, but it’s good to know what it is in case. If a scenario says “need a governed, enterprise BI solution with embedded analytics” – Looker might be the intended answer. If the question is about a large organization standardizing metrics with a central semantic layer, mention Looker (LookML models provide a single source of truth for metrics, and business users can slice-and-dice without writing SQL, via Looker’s Explore interface).

  • Connected Sheets: This is a feature that connects Google Sheets to BigQuery so that you can analyze large datasets with pivot tables and formulas in a spreadsheet interface. It basically lets Sheets users pull millions of rows from BigQuery without writing SQL (they use the UI). It’s great for non-SQL-savvy folks. For exam, in a scenario “business users are comfortable with spreadsheets and need to do ad-hoc analysis on BigQuery data” – Connected Sheets is a solution to mention.

  • Third-party BI: GCP works well with Tableau, Power BI, etc., usually through connectors or ODBC/JDBC drivers. The exam may not focus on those beyond acknowledging that you can use them with BigQuery. But it might test that you should use BigQuery BI Engine to accelerate BI tools (like Tableau) by caching data in memory. BI Engine supports Looker Studio natively, and its SQL interface extends the acceleration to other BI tools as well.

Best Practices: For visualizing data on BigQuery:

  • Use authorized views or custom data mart tables to ensure BI users only see appropriate data.

  • Leverage BI Engine for sub-second query response on dashboards.

  • Ensure queries are optimized (pre-aggregate large raw data to daily summaries for the dashboard rather than scanning raw detail on every view).

  • If multiple BI tools or users repeatedly run similar queries, consider materialized views or result caching (a materialized view sketch follows this list).
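
The materialized-view idea from the list above, sketched as DDL run through the BigQuery Python client (you could equally paste the DDL into the BigQuery console); the project, dataset, and column names are placeholders.

```python
# Hedged sketch: pre-aggregate dashboard data with a BigQuery materialized view.
from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE MATERIALIZED VIEW IF NOT EXISTS `my_project.reporting.daily_sales`
AS
SELECT
  DATE(order_ts) AS order_day,
  store_id,
  SUM(amount)    AS revenue,
  COUNT(*)       AS orders
FROM `my_project.raw.orders`
GROUP BY order_day, store_id
"""

client.query(ddl).result()  # dashboards now hit the small, auto-refreshed view
```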

Analytics Hub: For sharing datasets with external organizations for analytics, Analytics Hub lets them access the data via their own BI tools without you emailing files around. The exam might bring this up in a question about sharing data externally and securely.

Scheduling and Alerts: Use something like Cloud Scheduler to schedule report refreshes if needed (though most BI tools fetch live or cached data), and possibly use Looker Studio’s built-in email scheduling for PDF reports.

Ad-hoc Analysis: Google also had Cloud Datalab (now deprecated), or you can simply use notebooks/Colab for data exploration – but this is unlikely to appear on the exam.

In essence, know that data engineers might be responsible for enabling BI: preparing data (like creating aggregated tables, enabling BI Engine, setting up Data Catalog with helpful descriptions so BI folks understand data), and setting up appropriate tools (Connected Sheets for Excel-minded users, Looker Studio for interactive dashboards, Looker for governed enterprise reporting). The exam’s section 4.1/4.2 covers these areas, so be prepared to suggest the right tool for the right user need.
