From Basics to Brilliance: My Week 1 Journey with PostgreSQL for Data Engineering

Ankit Raj
7 min read

Introduction

Background

As a data engineer working full-time in the industry, I’m always looking for ways to enhance my skills and stay ahead of the curve with the latest technologies. Recently, I decided to focus more on data engineering and dive deeper into PostgreSQL. My goal is to expand my knowledge, improve my ability to design and optimize data pipelines, and gain hands-on experience managing databases more efficiently. I’m excited to integrate SQL with other data engineering tools and tackle real-world data challenges to become a more effective data professional.

Learning Objectives:

  • Develop a strong understanding of SQL and PostgreSQL.

  • Learn how to efficiently query and manage data using basic and advanced SQL techniques.

  • Understand the integration of PostgreSQL with other data engineering tools like Python and PySpark in future learning stages.

Week 1 Overview

  1. Topics Covered:

  • Introduction to SQL and Databases:

    • What is a database?

    • Different SQL dialects

    • Installing PostgreSQL and pgAdmin on Windows and macOS

    • Setting up and managing databases

  • Basic SQL Operations:

    • SELECT: Fetching data from a table

    • ORDER BY: Sorting query results

    • SELECT DISTINCT: Removing duplicate entries

    • LIMIT: Restricting the number of rows returned
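The basics above can be combined in a single query. The `employees` table and its columns here are hypothetical, used only for illustration:

```sql
-- Top 5 earners, sorted from highest to lowest salary
SELECT first_name, salary
FROM employees
ORDER BY salary DESC   -- sort before limiting
LIMIT 5;               -- return only the first 5 rows

-- Unique department names, duplicates removed
SELECT DISTINCT department
FROM employees;
```

Note that `LIMIT` is applied after `ORDER BY`, so the sort determines *which* rows survive the cut.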

  • Filtering Data:

    • WHERE: Filtering rows based on conditions

    • WHERE operators: Using operators in WHERE clauses

    • AND/OR: Combining multiple conditions

    • BETWEEN: Filtering values within a range

    • IN: Checking if a value matches any in a list

    • LIKE: Pattern matching in text data
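A sketch of how these filters compose in one `WHERE` clause, again against a hypothetical `employees` table:

```sql
SELECT first_name, department, salary
FROM employees
WHERE salary BETWEEN 40000 AND 80000   -- inclusive range
  AND department IN ('Sales', 'IT')    -- matches any value in the list
  AND first_name LIKE 'A%';            -- '%' matches any sequence of characters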

  • Aggregate Functions:

    • COUNT(): Counting rows

    • SUM(): Summing numeric values

    • GROUP BY: Grouping rows by column values

    • HAVING: Filtering groups based on conditions
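Putting the aggregate pieces together (table and columns assumed for illustration):

```sql
SELECT department,
       COUNT(*)    AS headcount,
       SUM(salary) AS total_payroll
FROM employees
GROUP BY department
HAVING COUNT(*) > 10;   -- keep only departments with more than 10 people
```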

  • String and Date Functions:

    • LENGTH, LOWER, UPPER: String manipulation

    • LEFT & RIGHT: Extracting portions of a string

    • CONCAT / the || operator: Joining strings

    • POSITION: Finding a substring's position

    • SUBSTRING: Extracting parts of a string

    • EXTRACT: Extracting date or time parts

    • TO_CHAR: Converting date/time to string format

    • Intervals & Timestamps: Working with time-based data
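A minimal sketch of several of these functions in one query; the `employees` table and its `email` and `hire_date` columns are assumed:

```sql
SELECT UPPER(first_name)                                        AS name_upper,
       LENGTH(first_name)                                       AS name_length,
       -- everything before the '@' in the email address
       SUBSTRING(email FROM 1 FOR POSITION('@' IN email) - 1)   AS email_user,
       EXTRACT(YEAR FROM hire_date)                             AS hire_year,
       TO_CHAR(hire_date, 'DD Mon YYYY')                        AS hire_date_text
FROM employees;
```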

  • Advanced Functions:

    • Mathematical functions and operators: Performing calculations in SQL

    • CASE WHEN: Conditional logic in SQL queries

    • COALESCE: Handling NULL values

    • CAST: Converting between data types

    • REPLACE: Replacing parts of a string
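These advanced functions also combine naturally in one query. A sketch, with the table, columns, and salary bands being illustrative assumptions:

```sql
SELECT first_name,
       CASE
         WHEN salary >= 80000 THEN 'senior band'
         WHEN salary >= 50000 THEN 'mid band'
         ELSE 'junior band'
       END                                     AS pay_band,
       COALESCE(phone, 'n/a')                  AS phone,    -- substitute for NULL
       CAST(salary AS NUMERIC(10, 2))          AS salary_2dp,
       REPLACE(email, '@old.com', '@new.com')  AS migrated_email
FROM employees;
```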

  2. Resources Used

    • 15 Days of SQL: The Complete SQL Masterclass 2024

    • Data Engineering Essentials using SQL, Python, and PySpark

  3. Key Concepts

    • Understanding database setup, structure, and SQL operations.

    • Mastery of SELECT, ORDER BY, DISTINCT, and LIMIT for data retrieval.

    • Proficiency in filtering data using WHERE, logical operators, and pattern matching with LIKE.

    • Working with aggregate functions like COUNT(), SUM(), and GROUP BY to summarize data.

    • String manipulation and date/time functions to refine and manipulate data.

    • Advanced SQL techniques like CASE WHEN and handling NULL values with COALESCE.

Challenges Faced

  • Date Operations:

    • Issue: Understanding and working with date and time functions in SQL, specifically with complex date formats, intervals, and timestamp operations.

    • Solution: I spent time experimenting with the EXTRACT and TO_CHAR functions and the BETWEEN operator to better handle date operations and format data as needed for my queries. Practice with INTERVAL also helped me understand how to perform date arithmetic.
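The interval arithmetic I practiced looks roughly like this; these expressions need no table and can be run directly in psql:

```sql
SELECT CURRENT_DATE + INTERVAL '30 days'   AS due_date,       -- date + interval
       NOW() - INTERVAL '1 hour'           AS one_hour_ago,   -- timestamp - interval
       AGE(TIMESTAMP '2024-01-01',
           TIMESTAMP '2023-06-15')         AS gap;            -- difference as an interval
```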

  • HAVING Clause:

    • Issue: Initially, it was confusing to differentiate between the WHERE and HAVING clauses. While both are used to filter data, WHERE filters rows before aggregation, and HAVING filters after the aggregation process.

    • Solution: I practiced using the HAVING clause alongside GROUP BY in multiple queries, which made it clearer how to filter grouped data based on aggregated values, not just raw data.
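The distinction becomes concrete in a query that uses both clauses (hypothetical `employees` table):

```sql
SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE hire_date >= DATE '2020-01-01'   -- filters individual rows BEFORE grouping
GROUP BY department
HAVING AVG(salary) > 60000;            -- filters whole groups AFTER aggregation
```

The `WHERE` clause cannot reference `AVG(salary)` because the aggregate does not exist yet at that stage; that is exactly what `HAVING` is for.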

Key Learnings and Takeaways

  1. Understanding Databases:

    • Databases are essential for storing and managing data in structured formats. In PostgreSQL, tables, schemas, and databases are crucial components that help organize and retrieve information efficiently.
  2. Mastery of Basic SQL Operations:

    • The SELECT statement is the foundation of querying databases, and I now have a solid grasp of retrieving and filtering data using conditions such as WHERE, ORDER BY, and LIMIT.

    • I’ve learned the importance of SELECT DISTINCT for eliminating duplicate records and how to control the output with LIMIT to prevent overload when querying large datasets.

  3. Filtering and Searching Data:

    • Using operators like AND, OR, BETWEEN, IN, and LIKE has enabled me to build more precise queries and retrieve relevant data based on specific conditions, allowing me to optimize data retrieval for reporting or analysis tasks.
  4. Using Aggregate Functions:

    • I now understand how to summarize and analyze large datasets by using aggregate functions like COUNT(), SUM(), and GROUP BY. Additionally, HAVING is a powerful tool for filtering grouped data based on aggregate results, which is critical for reporting and analytics.
  5. String Manipulation and Date Functions:

    • String functions such as LENGTH, LOWER, UPPER, SUBSTRING, and CONCAT help in cleaning, formatting, and extracting specific portions of text data, which is commonly needed when preparing data for analysis.

    • Date and time functions like EXTRACT, TO_CHAR, and INTERVAL are essential for handling time-sensitive data and performing date arithmetic in reports and analyses.

  6. Advanced SQL Techniques:

    • The CASE WHEN statement has proven to be very useful for implementing conditional logic directly in SQL queries. This enables creating dynamic reports based on multiple conditions without needing external application logic.

    • The COALESCE and CAST functions are critical for dealing with missing data and ensuring that data types align across queries, reducing type-mismatch errors.

  7. Database Setup and Environment Configuration:

    • I successfully set up PostgreSQL and pgAdmin, gaining a deeper understanding of how to install and configure a database environment on both Windows and macOS. This hands-on experience also included troubleshooting common issues like setting binary paths and handling performance-related problems.

Next Week Plan: Joins, Unions, and Subqueries

Building on the foundational skills from Week 1, the focus for Week 2 will be on mastering more advanced SQL concepts: Joins, Unions, and Subqueries. These topics are critical for working with relational databases and will significantly enhance my ability to query, combine, and analyze data from multiple tables effectively.

Key Topics to Cover

  1. Joins

    • Understanding the purpose of joins in SQL and how they allow us to combine data from multiple tables.

    • Types of Joins to be covered:

      • INNER JOIN: Retrieving rows with matching values in both tables.

      • LEFT JOIN (LEFT OUTER JOIN): Including all rows from the left table, even if there are no matches in the right table.

      • RIGHT JOIN (RIGHT OUTER JOIN): Including all rows from the right table, even if there are no matches in the left table.

      • FULL JOIN (FULL OUTER JOIN): Combining results from both left and right tables, including unmatched rows.

      • SELF JOIN: Joining a table to itself for hierarchical or complex data relationships.

      • CROSS JOIN: Creating a Cartesian product of two tables.

    • Practical scenarios and challenges to solidify understanding of joins.
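As a preview, the two join types I expect to use most often can be sketched like this; the `employees` and `departments` tables and the `department_id` foreign key are assumed for illustration:

```sql
-- INNER JOIN: only employees that have a matching department
SELECT e.first_name, d.name AS department
FROM employees e
INNER JOIN departments d ON d.id = e.department_id;

-- LEFT JOIN: all employees, with d.name returned as NULL
-- for anyone whose department_id has no match
SELECT e.first_name, d.name AS department
FROM employees e
LEFT JOIN departments d ON d.id = e.department_id;
```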

  2. Unions

    • Exploring the UNION and UNION ALL operators to combine results from multiple SELECT statements into a single dataset.

    • Key differences between UNION (removes duplicates) and UNION ALL (includes duplicates).

    • Best practices for using UNION effectively, such as ensuring the same number of columns and compatible data types in combined queries.
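The duplicate-handling difference is easiest to see side by side (hypothetical `customers` and `suppliers` tables, each with a compatible `city` column):

```sql
-- UNION: duplicate cities appear once in the combined result
SELECT city FROM customers
UNION
SELECT city FROM suppliers;

-- UNION ALL: duplicates are kept, and no deduplication step is needed,
-- which typically makes it the faster option
SELECT city FROM customers
UNION ALL
SELECT city FROM suppliers;
```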

  3. Subqueries

    • Introduction to subqueries and their role in simplifying complex queries.

    • Types of subqueries:

      • Single-row subqueries: Returning a single value for comparisons.

      • Multi-row subqueries: Returning multiple values, often used with IN or ANY.

      • Correlated subqueries: Dependent on the outer query and evaluated for each row.

    • Use cases of subqueries in SELECT, WHERE, and FROM clauses to create dynamic and efficient queries.
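Two of these subquery types can be sketched as follows, again against a hypothetical `employees` table:

```sql
-- Single-row subquery: everyone earning above the company-wide average
SELECT first_name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

-- Correlated subquery: the inner query is re-evaluated for each outer row,
-- finding the top earner within each employee's own department
SELECT first_name, salary
FROM employees e
WHERE salary = (SELECT MAX(salary)
                FROM employees m
                WHERE m.department_id = e.department_id);
```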

Conclusion

The first week of my PostgreSQL journey laid a solid foundation in SQL and database management. By completing practical exercises and overcoming challenges, I gained confidence in retrieving, aggregating, and manipulating data using SQL.

I developed a clear understanding of filtering data, working with string and date functions, and applying advanced concepts like CASE WHEN for complex query logic. Additionally, setting up PostgreSQL and tackling challenges such as date handling and the HAVING clause significantly strengthened my skills.

This week’s progress has prepared me to tackle more advanced topics in the upcoming weeks, and I’m eager to apply these skills to real-world data engineering tasks.

You can check out the scripts and exercises I’ve worked on this week in my GitHub repository: https://github.com/ankit1519/postgreSQL

Written by

Ankit Raj

Data Engineer