Python XLSX Reader Guide

Spreadsheets often store critical data that powers reports, dashboards, and data pipelines. Yet, when it comes to reading XLSX files with Python, developers sometimes overlook cell types like dates, formulas, and styles. How can you handle these varied data types seamlessly when parsing an XLSX workbook in your scripts?
The answer lies in choosing the right Python library and using its features effectively. By understanding how openpyxl, pandas, and xlrd treat cell types and memory usage, you can prevent data loss and speed up your workflows. In the next sections, we’ll uncover practical tips for each approach, ensuring you make informed decisions and avoid common surprises.
Why Read XLSX Files
Working with Excel files is common in business, science, and education. Python scripts can automate data intake, cleanup, and analysis, saving hours of manual work. An XLSX reader helps extract rows, columns, and metadata without opening Excel itself.
Beyond basic cell values, spreadsheets often include dates, formulas, merged cells, and custom number formats. A good XLSX reader preserves these nuances so your automation respects the original document. That means your scripts can handle financial reports with correct currency symbols, log dates for time-series analysis, and update cells without breaking formulas.
Choosing a Library
Three libraries come up most often when reading Excel files in Python: openpyxl, pandas, and xlrd. Each has its strengths:
Library | Memory Use | Cell Types | Speed |
openpyxl | Medium | Full support | Moderate |
pandas | High (DataFrame) | Values only | Fast |
xlrd (legacy, .xls only) | Low | Values only | Fastest |
openpyxl reads and writes .xlsx files with full styling, formulas, and data types. pandas uses openpyxl or xlrd under the hood to load sheets into DataFrames quickly but drops styling and formulas. xlrd is optimized for speed and low memory, but since version 2.0 it reads only the legacy .xls format and no longer handles .xlsx.
Using openpyxl
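To get started, install openpyxl (for example, with pip install openpyxl). Then load a workbook and iterate over its rows:
from openpyxl import load_workbook

# data_only=True returns cached formula results instead of formula strings
wb = load_workbook('data.xlsx', data_only=True)
ws = wb.active

# Skip the header row and print each data row as a list of values
for row in ws.iter_rows(min_row=2):
    values = [cell.value for cell in row]
    print(values)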
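- data_only=True returns the values Excel last calculated and cached instead of the formula strings. A file that was never opened and saved in Excel has no cached values, so those cells come back as None.
- Check cell.is_date to detect date cells and format them.
- Pass read_only=True to load_workbook() to stream large files.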
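As a rough illustration of date detection, the sketch below builds a small workbook in memory (a stand-in for data.xlsx, invented for this example), reloads it, and formats every cell that cell.is_date flags. The sample data and the dates dictionary are assumptions, not part of any real file.

```python
from datetime import datetime
from io import BytesIO

from openpyxl import Workbook, load_workbook

# Build a small sample workbook in memory (stand-in for data.xlsx)
sample = Workbook()
sample.active.append(["item", "shipped"])
sample.active.append(["widget", datetime(2024, 1, 15, 9, 30)])
buffer = BytesIO()
sample.save(buffer)
buffer.seek(0)

ws = load_workbook(buffer, data_only=True).active

# Collect date cells as ISO date strings, keyed by cell coordinate
dates = {}
for row in ws.iter_rows(min_row=2):
    for cell in row:
        if cell.is_date and cell.value is not None:
            dates[cell.coordinate] = cell.value.strftime("%Y-%m-%d")

print(dates)
```

Checking is_date before formatting keeps ordinary numbers and strings untouched, since only cells with a date number format are flagged.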
If you prefer working with dictionaries, you can map headers to values by zipping the header row with each data row.
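One way to sketch that mapping is shown below. It assumes the first row holds the column names; the helper name rows_as_dicts and the in-memory sample workbook are hypothetical and exist only for this example.

```python
from io import BytesIO

from openpyxl import Workbook, load_workbook


def rows_as_dicts(ws):
    """Yield one dict per data row, keyed by the header row (hypothetical helper)."""
    rows = ws.iter_rows(values_only=True)
    headers = next(rows)  # assumes row 1 contains the column names
    for row in rows:
        yield dict(zip(headers, row))


# Build a tiny sample workbook in memory so the sketch runs without data.xlsx
sample = Workbook()
sample.active.append(["name", "qty"])
sample.active.append(["apple", 3])
sample.active.append(["pear", 5])
buffer = BytesIO()
sample.save(buffer)
buffer.seek(0)

ws = load_workbook(buffer).active
records = list(rows_as_dicts(ws))
print(records)
```

Passing values_only=True makes iter_rows return plain tuples of values rather than cell objects, which is all the mapping needs.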
Using pandas
pandas makes reading sheets into DataFrames a breeze:
import pandas as pd
df = pd.read_excel('data.xlsx', sheet_name='Sheet1', engine='openpyxl')
print(df.head())
You can:
- Specify dtype or converters to control types.
- Use usecols to limit columns and save memory.
- Read only a subset of rows with nrows and skiprows.
Once loaded, pandas lets you filter, group, and export data. To save your DataFrame as JSON, call df.to_json().
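The read options above can be combined in a single call. The following is a minimal sketch under stated assumptions: the sheet contents, the column names id, price, and note, and the in-memory buffer are all invented for illustration.

```python
from io import BytesIO

import pandas as pd

# Write a small sample sheet to memory (stand-in for data.xlsx)
buffer = BytesIO()
pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "price": ["1.50", "2.00", "3.25", "4.75"],
        "note": ["a", "b", "c", "d"],
    }
).to_excel(buffer, sheet_name="Sheet1", index=False, engine="openpyxl")
buffer.seek(0)

df = pd.read_excel(
    buffer,
    sheet_name="Sheet1",
    engine="openpyxl",
    usecols=["id", "price"],      # skip the "note" column entirely
    dtype={"id": "int64"},        # force an integer id column
    converters={"price": float},  # parse price strings into floats
    nrows=3,                      # read only the first three data rows
)
print(df)
print(df.to_json(orient="records"))
```

Here dtype and converters target different columns on purpose, because pandas warns when both are given for the same column.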
Advanced Features
Beyond basic reads, you can access formulas, comments, and styles with openpyxl:
- Formulas: Load with data_only=False and read cell.value as the formula string.
- Comments: Read cell.comment on each cell to extract the author and text. It is None for cells without a comment.
- Styles: Inspect cell.font, cell.fill, and cell.number_format for formatting details.
Use these features to audit complex workbooks, validate cell formulas, or generate styled reports automatically. Handling these advanced parts ensures your scripts interact fully with Excel's power.
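To make this concrete, here is a small sketch that writes a formula, a comment, and some styling to an in-memory workbook and reads them back. The cell contents, the comment text, and the author name "Reviewer" are assumptions made up for the example. Note that openpyxl attaches comments to individual cells through cell.comment.

```python
from io import BytesIO

from openpyxl import Workbook, load_workbook
from openpyxl.comments import Comment
from openpyxl.styles import Font

# Build a sample workbook with a formula, a comment, and some styling
sample = Workbook()
sheet = sample.active
sheet["A1"] = 10
sheet["A2"] = 20
sheet["A3"] = "=SUM(A1:A2)"
sheet["A3"].font = Font(bold=True)
sheet["A3"].number_format = "0.00"
sheet["A3"].comment = Comment("Check this total", "Reviewer")
buffer = BytesIO()
sample.save(buffer)
buffer.seek(0)

# data_only=False (the default) keeps formula strings instead of cached results
ws = load_workbook(buffer, data_only=False).active
cell = ws["A3"]

formula = cell.value                 # the formula text, not its result
is_formula = cell.data_type == "f"   # "f" marks formula cells
comment = (cell.comment.author, cell.comment.text) if cell.comment else None
style = (cell.font.bold, cell.number_format)
print(formula, is_formula, comment, style)
```

The same loop over ws.iter_rows() can apply these checks to every cell when auditing a whole sheet.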
Performance Tips
Working with large spreadsheets can hit memory and speed bottlenecks. Try:
- Using read_only=True in openpyxl to stream rows instead of loading all cells.
- Limiting columns with usecols in pandas.
- Reading in batches with skiprows and nrows. pd.read_excel() has no chunksize parameter; that option belongs to read_csv().
- Converting dates and strings on the fly using converters.
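Because pd.read_excel() does not accept chunksize, batch reading has to be approximated. The sketch below does so with skiprows and nrows; the helper name read_in_batches and the sample data are hypothetical. Each call re-parses the file, so this bounds the size of each DataFrame rather than the total parsing time.

```python
from io import BytesIO

import pandas as pd


def read_in_batches(source, batch_size, **kwargs):
    """Yield DataFrames of up to batch_size rows (hypothetical helper).

    Each batch is a separate read_excel call that skips the data rows
    already consumed while keeping the header row (row 0).
    """
    start = 0
    while True:
        if hasattr(source, "seek"):
            source.seek(0)  # rewind in-memory buffers between calls
        batch = pd.read_excel(
            source,
            engine="openpyxl",
            skiprows=range(1, start + 1),  # skip consumed rows, keep header
            nrows=batch_size,
            **kwargs,
        )
        if batch.empty:
            break
        yield batch
        start += batch_size


# Sample sheet with five data rows, written to memory for the demo
buffer = BytesIO()
pd.DataFrame({"id": [1, 2, 3, 4, 5]}).to_excel(
    buffer, index=False, engine="openpyxl"
)

sizes = []
ids = []
for batch in read_in_batches(buffer, batch_size=2):
    sizes.append(len(batch))
    ids.extend(batch["id"].tolist())
print(sizes, ids)
```

For very large files, streaming rows with openpyxl's read-only mode is likely the better fit, since it avoids re-reading the workbook for every batch.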
Tip: For files with tens of thousands of rows, streaming modes can cut memory usage considerably and speed up iteration. The exact gain depends on the workbook, so measure on your own files.
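A streaming read with openpyxl might look like the sketch below. The 1,000-row sample workbook and its id and amount columns are assumptions created in memory for the example; with a real file you would pass its path to load_workbook instead.

```python
from io import BytesIO

from openpyxl import Workbook, load_workbook

# Build a sample workbook with a header and 1,000 data rows
sample = Workbook()
sample.active.append(["id", "amount"])
for i in range(1, 1001):
    sample.active.append([i, i * 2])
buffer = BytesIO()
sample.save(buffer)
buffer.seek(0)

# read_only=True streams rows lazily instead of building every cell in memory
wb = load_workbook(buffer, read_only=True)
ws = wb.active

total = 0
row_count = 0
for _id, amount in ws.iter_rows(min_row=2, values_only=True):
    total += amount
    row_count += 1

wb.close()  # read-only workbooks keep the file open until closed
print(row_count, total)
```

Aggregating while iterating, as done here, keeps only one row in memory at a time.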
Common Pitfalls
Even seasoned developers run into issues:
- Engine errors: Older pandas releases may default to xlrd, which no longer supports .xlsx as of xlrd 2.0. Setting engine='openpyxl' explicitly avoids the problem.
- Missing dependencies: Ensure openpyxl and pandas are installed in your environment.
- Empty cells: Excel may leave gaps, resulting in None or NaN values you need to handle.
- Version mismatch: Newer Excel features sometimes break older library versions.
Handling these pitfalls early saves debugging time and keeps your data pipelines robust.
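The empty-cell pitfall is easy to reproduce. The sketch below uses an invented two-row sheet with one gap to show what each library reports and one possible way to handle it. Filling with 0 is an assumption for the example; the right default depends on your data.

```python
from io import BytesIO

import pandas as pd
from openpyxl import Workbook, load_workbook

# Sample sheet with a gap in the "qty" column
sample = Workbook()
sample.active.append(["name", "qty"])
sample.active.append(["apple", 3])
sample.active.append(["pear", None])
buffer = BytesIO()
sample.save(buffer)

# openpyxl reports the empty cell as None
buffer.seek(0)
ws = load_workbook(buffer).active
raw = list(ws.iter_rows(min_row=2, values_only=True))

# pandas reports the same cell as NaN; fill or drop it explicitly
buffer.seek(0)
df = pd.read_excel(buffer, engine="openpyxl")
missing = int(df["qty"].isna().sum())
filled = df["qty"].fillna(0).astype(int).tolist()
print(raw, missing, filled)
```

Counting missing values before filling them gives you a chance to log or reject incomplete files instead of silently patching them.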
Conclusion
Reading XLSX files in Python unlocks powerful automation and data analysis. By selecting the right library—openpyxl for full-featured reads, pandas for fast DataFrame processing, or xlrd for legacy .xls support—you control performance, accuracy, and memory use. Using openpyxl’s advanced APIs helps preserve formulas and styles, while pandas can slice and dice large sheets quickly.
Armed with these tools, you’re ready to build scripts that load, transform, and export spreadsheets into your workflows. Next time you face an Excel file, pick the method that fits your needs and tweak the parameters covered here. That way, you’ll avoid surprises and deliver reliable results every time.
Written by Mateen Kiani
