Why Pandas is Essential to Efficient Data Engineering

Jason Malefakis
4 min read

Python is the quintessential data analytics language thanks to its simplicity, versatility, and huge library ecosystem. While Python on its own can tackle many data tasks, the Pandas library (its name derived from "panel data," a term for multi-dimensional data sets, and "Python data analysis") builds on that simplicity while supercharging Python's power. Pandas saves time by packing potent operations into remarkably few lines of code.

Why is Pandas the clear winner?

  1. Data Structure Utility: Pandas's two main data structures are the DataFrame and the Series. Not only do they make it easy to visualize what the data is doing, but they also enable complex data manipulations. Data movement, organization, transformation, and analysis are all performed on rows and columns, just like a standard relational database table or Excel spreadsheet. At least to me, this is far more intuitive than trying to visualize nested lists or dictionaries.

  2. Data Handling and Cleaning: Pandas has built-in functions to handle missing values, data type conversions, and normalizations. Large, messy data sets can be quickly turned into standardized, usable formats. This lets users focus on generating valuable insights instead of wasting time scrubbing the data.

  3. Straightforward Grouping and Aggregation: Grouping and aggregating data is central to data analysis and reporting. Pandas provides many functions (and pairs well with libraries like NumPy and Seaborn) to transform and summarize data in minimal steps; much more code would be required in pure Python to achieve the same results.

  4. Database and File Integrations: Pandas connects easily to SQL databases, Excel files, CSV files, text files, and more, so you can open, read, and start analyzing complex data swiftly. It handles data arriving from multiple sources and can export transformed data just as quickly. It also integrates easily with visualization tools like Tableau and Power BI, or with libraries like Plotly and Matplotlib (or D3 with a React front end).

  5. Faster with Comprehensive Time Series Support: Built on top of NumPy, Pandas executes vectorized operations in optimized, compiled code, making it substantially faster than equivalent pure-Python loops. With many date/time options that enable time series analysis, Pandas makes charting, trending, and forecasting easier.
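To illustrate the cleaning functions from point 2, here is a minimal sketch; the column names and values are invented for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical messy data set: string IDs, missing categories, missing numbers
df = pd.DataFrame({
    "user_id": ["1", "2", "3", "4"],
    "signup_source": ["web", None, "mobile", "web"],
    "spend": [19.99, np.nan, 5.00, 12.50],
})

df["user_id"] = df["user_id"].astype(int)                     # type conversion
df["signup_source"] = df["signup_source"].fillna("unknown")   # fill missing categories
df["spend"] = df["spend"].fillna(df["spend"].mean())          # impute with the mean

print(df)
```

Three lines standardize types and eliminate every missing value, which would take noticeably more bookkeeping with plain dictionaries and loops.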
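For point 3, a single agg() call can group and summarize a table in one step; the sales figures below are made up for illustration:

```python
import pandas as pd

# Hypothetical per-order sales records
sales = pd.DataFrame({
    "region": ["east", "west", "east", "west", "east"],
    "amount": [100, 250, 75, 300, 125],
})

# One grouped aggregation produces total, average, and count per region
summary = sales.groupby("region").agg(
    total=("amount", "sum"),
    avg=("amount", "mean"),
    orders=("amount", "size"),
)
print(summary)
```

Reproducing this in pure Python would mean hand-rolled dictionaries of running totals and counts for each statistic.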
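And for the time series support in point 5, a date-indexed Series can be bucketed by month with resample(); the daily signup data here is synthetic:

```python
import pandas as pd

# One synthetic signup per day across two months
idx = pd.date_range("2024-01-01", "2024-02-29", freq="D")
signups = pd.Series(1, index=idx)

# Aggregate into month-end buckets ("M"; newer pandas versions spell this "ME")
monthly = signups.resample("M").sum()
print(monthly)
```

The same resample() pattern powers the month-by-month growth example later in this article.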

Sample Code and Applications

  1. Counting Active Users Over Time:

    This is a common application using SQLAlchemy and Pandas to determine the number of active users who have logged in over the past 30 days. pd.read_sql() takes a SQL query and loads the results into a DataFrame ready to be manipulated.

     from datetime import datetime, timedelta
     from sqlalchemy import func
     import pandas as pd
     from config import db
     from models import User
    
     def get_active_users():
         thirty_days_ago = datetime.now() - timedelta(days=30)
         active_user_count = db.session.query(func.count(User.id)).filter(User.last_login >= thirty_days_ago).scalar()
         return {"active_users": active_user_count}
    
     def get_active_users_detailed():
         # Recompute the cutoff; thirty_days_ago above is local to get_active_users
         thirty_days_ago = datetime.now() - timedelta(days=30)
         query = db.session.query(User.id, User.name, User.last_login).filter(User.last_login >= thirty_days_ago)
         # Pass the SQLAlchemy selectable directly so bound parameters are preserved
         df = pd.read_sql(query.statement, db.engine)
    
         active_users_count = df['id'].nunique()
         active_users_data = {
             "total_active_users": active_users_count,
             "user_details": df.to_dict(orient='records')
         }
         return active_users_data
    
  2. Analyzing Business Growth Over Time:

    Looking at cumulative metrics over time is straightforward given Pandas' date tools and aggregation functions. Here's a way to track business growth month by month: expanding() computes a cumulative count over an ever-growing window, and setting the date column as the index lets resample('M') group the results by month.

     def get_business_growth():
         query = db.session.query(
             Business.id,
             Business.created_at
         ).filter(Business.created_at != None)
    
         # Pass the selectable directly so bound parameters are preserved
         df = pd.read_sql(query.statement, db.engine)
         df['created_at'] = pd.to_datetime(df['created_at'])
         df['cumulative_businesses'] = df['created_at'].sort_values().expanding().count()
         growth_data = df.set_index('created_at').resample('M')['cumulative_businesses'].max().reset_index()
    
         growth_data['created_at'] = growth_data['created_at'].dt.strftime('%Y-%m-%d')
         business_growth = {
             "business_growth": growth_data.to_dict(orient='records')
         }
         return business_growth
    
  3. Aggregating Listing Prices:

    Pricing analysis is another common data engineering task that Pandas handles with ease (inplace=True modifies the existing DataFrame rather than returning a copy). Here is a way to organize, sort, and convert dates to identify pricing trends over time.

     def get_listing_price():
         query = db.session.query(Listing.start_time, Listing.price).filter(Listing.start_time != None, Listing.price != None)
         df = pd.read_sql(query.statement, db.engine)
    
         df['start_time'] = pd.to_datetime(df['start_time'])
         df.sort_values(by='start_time', inplace=True)
         df['start_time'] = df['start_time'].dt.strftime('%Y-%m-%d')
    
         listing_price = {
             "listing_price": df.to_dict(orient='records')
         }
         return listing_price
    
  4. Calculating Average Review Ratings:

    Pandas makes it easy to group, aggregate, and run calculations, and fillna() cleans up missing values as discussed earlier.

     def get_location_review():
         query = db.session.query(
             Review.listing_id,
             Review.overall
         )

         df = pd.read_sql(query.statement, db.engine)

         review_stats = df.groupby('listing_id').agg(
             average_overall=('overall', 'mean'),
             total_reviews=('overall', 'size'),
             median_rating=('overall', 'median'),
             rating_std_dev=('overall', 'std')
         ).reset_index()

         # Assign back rather than chained inplace fillna, which newer pandas deprecates
         review_stats['rating_std_dev'] = review_stats['rating_std_dev'].fillna(0)

         review_stats = review_stats.sort_values(by='average_overall', ascending=False)

         location_review = {
             "location_review": review_stats.to_dict(orient='records')
         }

         return location_review

In Summary

For fast and effective data engineering, Pandas is a powerful tool that enables a vast array of use cases with minimal code. It enhances Python's immense capabilities and empowers users to build the agile data analysis visualizations, tools, and applications critical to the success of organizations and individuals. The development speed that Pandas facilitates can provide a real competitive edge.
