Why Pandas is Essential to Efficient Data Engineering
Python is the quintessential data analytics language thanks to its simplicity, versatility, and huge library ecosystem. Python on its own can tackle many data tasks, but the Pandas library (its name derived from "panel data," an econometrics term for multi-dimensional data sets, and "Python data analysis") builds on that simplicity and supercharges it. Pandas saves time, packing powerful operations into remarkably few lines of code.
Why is Pandas the clear winner?
Data Structure Utility: Pandas's two main data structures are the DataFrame and the Series. Not only do they make it easy to visualize what the data is doing, but they enable complex data manipulation. Data can be moved, organized, transformed, and analyzed in rows and columns, just like a standard relational database table or Excel spreadsheet. At least to me, this is far more intuitive than trying to visualize nested lists or dictionaries.
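As a quick illustration (a minimal sketch with made-up listing data, not from the examples later in this post), a Series is a one-dimensional labeled array and a DataFrame is a table of named columns:

import pandas as pd

# A Series is a one-dimensional labeled array
ratings = pd.Series([4.5, 3.8, 4.9], index=["listing_a", "listing_b", "listing_c"])

# A DataFrame is a two-dimensional table of named columns, like a SQL table or spreadsheet
listings = pd.DataFrame({
    "listing_id": [1, 2, 3],
    "city": ["Austin", "Denver", "Boston"],
    "price": [120.0, 95.5, 210.0],
})

# Rows and columns can be selected by label, much like querying a table
print(ratings["listing_b"])
print(listings[listings["price"] > 100])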
Data Handling and Cleaning: Pandas has innate functions to handle missing values, data type conversions, and normalizations. Large and messy data sets can be quickly turned into standardized usable formats. This allows users to focus on generating valuable insights instead of wasting time on scrubbing the data.
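For example (a small sketch with illustrative column names), missing values, type conversions, and normalization can each be handled in a call or two:

import pandas as pd

df = pd.DataFrame({
    "price": ["120", "95.5", None, "210"],
    "city": ["Austin", None, "Denver", "Boston"],
})

# Convert strings to numeric, coercing unparseable values to NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Fill missing values with sensible defaults
df["price"] = df["price"].fillna(df["price"].median())
df["city"] = df["city"].fillna("unknown")

# Normalize price to a 0-1 range
df["price_normalized"] = (df["price"] - df["price"].min()) / (df["price"].max() - df["price"].min())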
Straight-forward Grouping and Aggregation: Grouping and aggregating data is central to analysis and reporting. Pandas provides groupby and aggregation functions (and pairs well with NumPy for numerics and Seaborn for visualization) that transform and summarize data in minimal steps; achieving the same results in pure Python would take far more code.
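A minimal groupby sketch (hypothetical column names) shows how little code the summarization takes:

import pandas as pd

sales = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "revenue": [100, 150, 80, 120, 90],
})

# One chained call groups, aggregates, and names the output columns
summary = sales.groupby("region").agg(
    total_revenue=("revenue", "sum"),
    average_revenue=("revenue", "mean"),
    order_count=("revenue", "size"),
).reset_index()

print(summary)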
Database and File Integrations: Pandas very easily connects to SQL databases, Excel files, CSV files, text files and more to open, read, and start analyzing complex data swiftly. Pandas can handle data coming from multiple places easily and can send data quickly once transformed. It can also easily integrate with visualization tools like Tableau and PowerBI or libraries like Plotly or Matplotlib (or D3 with a React front-end).
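A few representative readers and writers (file names and the connection string below are placeholders; Excel and Parquet support require optional dependencies such as openpyxl and pyarrow):

import pandas as pd
from sqlalchemy import create_engine

# Read from common sources
csv_df = pd.read_csv("listings.csv")
excel_df = pd.read_excel("bookings.xlsx", sheet_name="2024")

engine = create_engine("postgresql://user:password@localhost/mydb")
sql_df = pd.read_sql("SELECT * FROM users", engine)

# Write transformed data back out just as easily
csv_df.to_parquet("listings.parquet")
sql_df.to_sql("users_clean", engine, if_exists="replace", index=False)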
Faster with Comprehensive Time Series Support: Because Pandas is built on top of NumPy, its column operations are vectorized and run in optimized compiled code, making it far faster than equivalent pure-Python loops. Its rich date/time support (datetime indexing, resampling, rolling windows) makes time series charting, trending, and forecasting much easier.
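A small time series sketch (synthetic daily data, purely for illustration) shows the date handling and resampling:

import numpy as np
import pandas as pd

# Synthetic daily signup counts
dates = pd.date_range("2024-01-01", periods=90, freq="D")
signups = pd.DataFrame({"date": dates, "count": np.random.randint(5, 50, size=90)})

# Setting a DatetimeIndex unlocks resampling and rolling windows
signups = signups.set_index("date")
monthly = signups.resample("M").sum()                 # monthly totals ("ME" on newer pandas)
rolling = signups["count"].rolling(window=7).mean()   # 7-day moving average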
Sample Code and Applications
Counting Active Users Over Time:
This is a common application using SQLAlchemy and Pandas to determine the number of active users who have logged in over the past 30 days. pd.read_sql() takes a SQL query and loads the results into a DataFrame ready to be manipulated.
from datetime import datetime, timedelta

import pandas as pd
from sqlalchemy import func

from config import db
from models import User


def get_active_users():
    thirty_days_ago = datetime.now() - timedelta(days=30)
    active_user_count = (
        db.session.query(func.count(User.id))
        .filter(User.last_login >= thirty_days_ago)
        .scalar()
    )
    return {"active_users": active_user_count}


def get_active_users_detailed():
    # Recompute the cutoff here so the function is self-contained
    thirty_days_ago = datetime.now() - timedelta(days=30)
    query = db.session.query(User.id, User.name, User.last_login).filter(
        User.last_login >= thirty_days_ago
    )
    # Load the query results directly into a DataFrame
    df = pd.read_sql(str(query.statement), db.engine)

    active_users_count = df['id'].nunique()
    active_users_data = {
        "total_active_users": active_users_count,
        "user_details": df.to_dict(orient='records'),
    }
    return active_users_data
Analyzing Business Growth Over Time:
Looking at cumulative metrics over time is straightforward given Pandas' date tools and aggregation functions. Here's a way to track business growth month-by-month: expanding() computes a running count over an ever-growing window, and setting a datetime index with resample('M') groups the results by month.
def get_business_growth():
    query = db.session.query(
        Business.id,
        Business.created_at
    ).filter(Business.created_at != None)
    df = pd.read_sql(str(query.statement), db.engine)

    df['created_at'] = pd.to_datetime(df['created_at'])
    # Running count of businesses in chronological order
    df['cumulative_businesses'] = df['created_at'].sort_values().expanding().count()

    # Group by month and keep the running total at each month's end
    growth_data = (
        df.set_index('created_at')
        .resample('M')['cumulative_businesses']
        .max()
        .reset_index()
    )
    growth_data['created_at'] = growth_data['created_at'].dt.strftime('%Y-%m-%d')

    business_growth = {
        "business_growth": growth_data.to_dict(orient='records')
    }
    return business_growth
Aggregating Listing Prices:
Pricing analysis is another common data engineering task that Pandas handles with ease (inplace=True modifies the existing DataFrame rather than returning a copy). Here is a way to organize, sort, and convert dates to identify pricing trends over time.
def get_listing_price():
    query = db.session.query(Listing.start_time, Listing.price).filter(
        Listing.start_time != None,
        Listing.price != None
    )
    df = pd.read_sql(str(query.statement), db.engine)

    df['start_time'] = pd.to_datetime(df['start_time'])
    df.sort_values(by='start_time', inplace=True)
    df['start_time'] = df['start_time'].dt.strftime('%Y-%m-%d')

    listing_price = {
        "listing_price": df.to_dict(orient='records')
    }
    return listing_price
Calculating Average Review Ratings:
Pandas makes it easy to group, aggregate, and run calculations. fillna() can easily clean data as discussed earlier.
def get_location_review():
    query = db.session.query(
        Review.listing_id,
        Review.overall
    )
    df = pd.read_sql(str(query.statement), db.engine)

    # Group by listing and compute summary statistics in one pass
    review_stats = df.groupby('listing_id').agg(
        average_overall=('overall', 'mean'),
        total_reviews=('overall', 'size'),
        median_rating=('overall', 'median'),
        rating_std_dev=('overall', 'std')
    ).reset_index()

    # Listings with a single review have no standard deviation; treat that as 0
    review_stats['rating_std_dev'] = review_stats['rating_std_dev'].fillna(0)
    review_stats = review_stats.sort_values(by='average_overall', ascending=False)

    location_review = {
        "location_review": review_stats.to_dict(orient='records')
    }
    return location_review
In Summary
For fast and easy data engineering, Pandas is a powerful tool that enables a vast array of use cases with minimal code. It enhances Python's immense capabilities and empowers users to build agile data analysis visualizations, tools, and applications critical to the success of organizations and individuals. The development speed that Pandas facilitates can provide a real competitive edge.