Web Scraping Applied to Healthcare Management: Continuous Economic Monitoring


Introduction
In this project, I performed web scraping via the Central Bank of Brazil's (BCB) public API and automated the collection of historical data for a set of indicators grouped into four analytical categories:
Sectoral Costs: IPCA Health, Medical Services, and Health Insurance
Macroeconomic Indicators: General IPCA, IGP-M, and GDP
Healthcare Demand: Unemployment, Household Income, and Consumer Confidence
Financial Costs: SELIC Rate, Default Rate, and Public Health Expenditures
The data were extracted directly from the BCB's public API using Python (the full script is available at the end of the article), processed with pandas, and smoothed with 6- and 12-month moving averages. Custom visualizations were created using matplotlib and seaborn. I also explored lagged correlations between variables, specifically monthly lags between the SELIC rate and IPCA Health, to investigate temporal relationships.
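As a minimal sketch of the moving-average step, assuming a monthly series in a pandas DataFrame with the same `data` date column used in the script at the end of this article. The values are synthetic and the helper name `add_moving_averages` is hypothetical:

```python
import pandas as pd

def add_moving_averages(df, value_col, windows=(6, 12)):
    """Return a copy of df with one rolling-mean column per window (in rows)."""
    out = df.sort_values("data").copy()
    for window in windows:
        # rolling(window) yields NaN until `window` observations are available
        out[f"{value_col} MA{window}"] = out[value_col].rolling(window).mean()
    return out

# Synthetic monthly series (illustrative values, not BCB data)
demo = pd.DataFrame({
    "data": pd.date_range("2023-01-01", periods=12, freq="MS"),
    "IPCA Health (%)": [float(v) for v in range(1, 13)],
})
demo_ma = add_moving_averages(demo, "IPCA Health (%)")
print(demo_ma.tail(3))
```

The window counts observations, not calendar months, so a 6-row window only corresponds to six months when the series is monthly.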
The final dataset was exported to Power BI, enabling periodic updates and providing a visual and strategic view of the indicators within the organization.
Results and Discussion
Figure 1 shows the time series of the macroeconomic and health indicators, plotted in Python before the export to Power BI.
Figure 1. Macroeconomic and Health Indicators (Central Bank of Brazil)
1. Macroeconomic Overview
Economic Activity and Employment:
Quarterly GDP data highlights the economic cycles of the past decade, emphasizing the 2015–2016 recession and the sharp contraction during the COVID-19 pandemic in 2020, followed by a recovery. The unemployment rate reflects this context, peaking after the 2016 recession and showing a downward trend in recent years.
Inflation and Interest Rates:
Inflation, measured by the general IPCA, peaked in 2015–2016 and again in 2021–2022. The IGP-M, in turn, showed much greater volatility, with a significant spike in 2020–2021 driven by wholesale prices and exchange rate fluctuations. In response to inflationary surges, the SELIC rate was sharply increased starting in 2021, after reaching its historic low in 2020 as an economic stimulus measure.
2. Consumer Indicators
Confidence and Income:
Consumer confidence proved sensitive to economic cycles, with sharp declines during recessions and gradual recoveries. Despite the crises, household income shows a general upward trend over the period, albeit with fluctuations.
Default Rate:
The default rate appears to follow cycles of monetary tightening and economic crises, showing a recent upward trend likely correlated with rising SELIC rates and the resulting increase in credit costs.
3. Health Sector Analysis
Health Inflation:
Health-specific inflation, represented by IPCA Health, Medical Services, and Health Insurance, exhibits persistent volatility. Notably, health costs do not always follow the same path as general inflation, reflecting unique sector dynamics.
Government Spending:
Public health expenditure shows a clear upward trend over the decade and accelerates from 2020 onward, consistent with growing demand and increased public investment to combat the COVID-19 pandemic.
Figure 2. Lagged Correlation Analysis | IPCA Health and SELIC
Analysis: Relationship between IPCA Health and SELIC Rate
1. Visual Chart Analysis:
The chart shows that IPCA Health (red line) is highly volatile, with sharp monthly peaks and valleys. In contrast, the SELIC rate (blue line) follows smoother and longer monetary policy cycles. Visually, there is no clear or immediate relationship between the two curves.
2. Statistical Correlation Analysis:
The lagged correlation analysis checks whether a change in the SELIC rate today is associated with a change in IPCA Health in later months.
Immediate Correlation (Lag 0):
The correlation is 0.027, which is virtually zero. This is consistent with the visual impression that SELIC has no immediate association with health inflation.
Lagged Correlation (Lags 1–24):
The strongest result is a positive correlation of 0.281 at a 12-month lag. Although still low, it is the highest value observed across the lags tested.
The relationship between the SELIC rate and IPCA Health is complex and neither direct nor immediate. The main finding of this analysis is that the association between SELIC and health inflation is strongest approximately 12 months later. Since the correlation is weak, this is best read as a hypothesis rather than proof of causation: monetary policy mechanisms, such as increased credit costs and financing expenses for hospitals, laboratories, and suppliers, may take about a year to translate into the final prices of health-related services and products.
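To illustrate how a lag scan like this behaves, here is a small sketch on synthetic data in which the delay is known in advance. The function name `lagged_correlations` and the generated series are assumptions for illustration only; they are not the BCB data or the exact routine in the script at the end of the article:

```python
import numpy as np
import pandas as pd

def lagged_correlations(driver, target, max_lag=24):
    """Pearson correlation between driver shifted by k periods and target, for k = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        # shift(lag) aligns driver[t - lag] with target[t]; Series.corr skips NaN pairs
        result[lag] = driver.shift(lag).corr(target)
    return pd.Series(result, name="correlation")

rng = np.random.default_rng(0)
n = 120  # ten years of monthly observations
driver = pd.Series(rng.normal(size=n))
# The target is built to follow the driver with a 12-period delay plus a little noise
target = driver.shift(12) + rng.normal(scale=0.1, size=n)

corrs = lagged_correlations(driver, target)
best_lag = int(corrs.idxmax())
print(f"Strongest correlation at lag {best_lag}")
```

Because the scan picks the maximum out of 25 candidate lags, a modest peak can also appear by chance, which is one more reason to treat a value such as 0.281 as exploratory.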
Finally, the data were processed and consolidated into a structured file (.xlsx), which serves as a reliable and regularly updated data source for our Power BI dashboards.
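The export step is not part of the script at the end of this article, so the following is only a sketch of one way it could look. It assumes a long layout (one row per date and indicator), which is often convenient for Power BI; the function names, the sheet name, and the file name `bcb_indicators.xlsx` are hypothetical:

```python
import pandas as pd

def to_long_format(df_wide, date_col="data"):
    """Reshape one-column-per-indicator data into date / indicator / value rows."""
    long_df = df_wide.melt(id_vars=date_col, var_name="indicator", value_name="value")
    # Drop the gaps left by merging series with different frequencies
    long_df = long_df.dropna(subset=["value"])
    return long_df.sort_values(["indicator", date_col]).reset_index(drop=True)

def export_for_power_bi(df_wide, path="bcb_indicators.xlsx"):
    """Write the long table to an Excel file (pandas needs openpyxl installed for .xlsx)."""
    to_long_format(df_wide).to_excel(path, index=False, sheet_name="indicators")

# Tiny synthetic example (illustrative values, not BCB data)
wide = pd.DataFrame({
    "data": pd.to_datetime(["2024-01-01", "2024-02-01"]),
    "SELIC (%)": [0.97, 0.80],
    "Quarterly GDP (%)": [None, 1.1],
})
long_df = to_long_format(wide)
print(long_df)
```

With the merged table from the script, the call would be `export_for_power_bi(df_merged)`, and Power BI can then refresh from the same file path on each run.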
Figure 3. Data Exported to Power BI
Conclusion
This work highlights the impact of a structured data cycle. Starting from web scraping of data from the Central Bank of Brazil, it was possible to extract relevant insights in a simple way, such as the lagged correlation between health inflation and the SELIC interest rate. The automation of the process and the integration of data into Power BI ensured continuous updates, reliability, and a strong focus on strategic decision-making.
Code (Python | Google Colab)
import pandas as pd
import requests
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from datetime import datetime
from dateutil.relativedelta import relativedelta
import numpy as np
sns.set_style("whitegrid")
plt.rcParams['figure.dpi'] = 120
plt.rcParams['font.size'] = 10
plt.rcParams['axes.titlesize'] = 12
plt.rcParams['axes.labelsize'] = 10
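def fetch_bcb_series(code, start_date, end_date):
    """Fetch one BCB SGS series as a DataFrame, or None if the request fails."""
    url = f"https://api.bcb.gov.br/dados/serie/bcdata.sgs.{code}/dados?formato=json&dataInicial={start_date}&dataFinal={end_date}"
    try:
        # A timeout keeps a stalled request from hanging the whole run
        response = requests.get(url, timeout=30)
        data = response.json()
    except (requests.RequestException, ValueError):
        data = None
    if isinstance(data, list) and len(data) > 0:
        df = pd.DataFrame(data)
        # The API returns dates as dd/mm/yyyy and values as strings
        df["data"] = pd.to_datetime(df["data"], dayfirst=True)
        df["valor"] = df["valor"].astype(float)
        df = df.sort_values("data")
        return df
    else:
        print(f"Error retrieving data from series {code}")
        return None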
# Display name -> BCB SGS series code
series_codes = {
    "IPCA Health (%)": 1641,
    "Medical Services (%)": 27838,
    "Health Insurance (%)": 27864,
    "General IPCA (%)": 433,
    "IGP-M (%)": 189,
    "Quarterly GDP (%)": 4389,
    "Unemployment (%)": 24369,
    "Household Income (R$)": 22099,
    "Consumer Confidence": 28143,
    "SELIC (%)": 4391,
    "Default Rate (%)": 20714,
    "Gov Health Spending (R$)": 13762
}
end_date = datetime.today()
start_date = end_date - relativedelta(years=10)
start_date_str = start_date.strftime("%d/%m/%Y")
end_date_str = end_date.strftime("%d/%m/%Y")
# Download each series and rename its value column to the display name
dfs = {}
for name, code in series_codes.items():
    dfs[name] = fetch_bcb_series(code, start_date_str, end_date_str)
    if dfs[name] is not None:
        dfs[name] = dfs[name].rename(columns={"valor": name})
# Outer-merge on date so series with different frequencies keep all their rows
df_merged = None
for name, df in dfs.items():
    if df is not None:
        if df_merged is None:
            df_merged = df
        else:
            df_merged = pd.merge(df_merged, df, on="data", how="outer")
df_merged = df_merged.sort_values("data")
df_last10 = df_merged[df_merged["data"] >= (pd.to_datetime("today") - pd.DateOffset(years=10))]
# Indicators per analytical category, with one color per indicator
groups = {
    "Sector Costs": [
        "IPCA Health (%)",
        "Medical Services (%)",
        "Health Insurance (%)"
    ],
    "Macroeconomic": [
        "General IPCA (%)",
        "IGP-M (%)",
        "Quarterly GDP (%)"
    ],
    "Health Demand": [
        "Unemployment (%)",
        "Household Income (R$)",
        "Consumer Confidence"
    ],
    "Financial Costs": [
        "SELIC (%)",
        "Default Rate (%)",
        "Gov Health Spending (R$)"
    ]
}
palettes = {
    "Sector Costs": ["#1f77b4", "#ff7f0e", "#2ca02c"],
    "Macroeconomic": ["#d62728", "#9467bd", "#8c564b"],
    "Health Demand": ["#e377c2", "#7f7f7f", "#bcbd22"],
    "Financial Costs": ["#17becf", "#aec7e8", "#ffbb78"]
}
# One figure per category, one subplot per indicator
for group, variables in groups.items():
    fig, axes = plt.subplots(len(variables), 1, figsize=(14, 4.5 * len(variables)))
    fig.subplots_adjust(hspace=0.4)
    fig.suptitle(f"{group} Indicators - Last 10 Years", y=0.99, fontsize=14, fontweight='bold')
    if len(variables) == 1:
        axes = [axes]
    for i, col in enumerate(variables):
        ax = axes[i]
        series = df_last10[["data", col]].dropna()
        if series.empty:
            ax.text(0.5, 0.5, f"No data for '{col}'", ha='center', va='center')
            continue
        freq = pd.infer_freq(series["data"]) or 'MS'
        use_marker = freq in ['MS', 'M', 'W']
        ax.plot(series["data"], series[col],
                color=palettes[group][i],
                linewidth=2,
                label=col,
                marker='o' if use_marker else '',
                markersize=3 if use_marker else 0)
        # Label individual points only when the series is short enough to stay readable
        if len(series) <= 60 and use_marker:
            for x, y in zip(series["data"], series[col]):
                if pd.notnull(y):
                    text = f"{y:.1f}"
                    if "%" in col:
                        text += "%"
                    elif "R$" in col:
                        text = f"R${y:,.0f}".replace(",", ".")
                    ax.annotate(text,
                                (x, y),
                                textcoords="offset points",
                                xytext=(0, 6),
                                ha='center',
                                fontsize=8,
                                color=palettes[group][i])
        # The window counts observations; match the English column name "Quarterly GDP (%)"
        window = 6 if "Quarterly" in col else 12
        moving_avg = series[col].rolling(window).mean()
        ax.plot(series["data"], moving_avg,
                color="gray",
                linestyle='--',
                linewidth=1.5,
                label=f"{window}-period Avg")
        ax.set_title(col)
        ax.set_ylabel("%" if "%" in col else "R$" if "R$" in col else "")
        ax.legend(loc='upper left')
        ax.grid(True, linestyle=':', alpha=0.6)
        if "R$" in col:
            ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'R${x/1e9:,.1f} B'))
        ax.xaxis.set_major_locator(mdates.YearLocator())
        ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
        ax.tick_params(axis='x', rotation=45)
    plt.tight_layout(rect=[0, 0, 1, 0.97])
    plt.show()
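# Lagged correlation between period-over-period changes in SELIC and IPCA Health
df_corr = df_merged[["data", "IPCA Health (%)", "SELIC (%)"]].dropna()
df_corr.set_index("data", inplace=True)
# pct_change returns inf when the previous value is 0, so treat those as missing
df_corr = df_corr.pct_change().replace([np.inf, -np.inf], np.nan).dropna()
for lag in range(0, 25):
    # shift(lag) pairs SELIC from `lag` months earlier with the current IPCA Health
    corr = df_corr["SELIC (%)"].shift(lag).corr(df_corr["IPCA Health (%)"])
    print(f"Correlation with {lag}-month lag: {corr:.3f}")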
df_plot = df_last10[["data", "IPCA Health (%)", "SELIC (%)"]].dropna()
plt.figure(figsize=(12, 5))
plt.plot(df_plot["data"], df_plot["IPCA Health (%)"], label="IPCA Health (%)", color="red")
plt.plot(df_plot["data"], df_plot["SELIC (%)"], label="SELIC (%)", color="blue")
plt.title("IPCA Health vs SELIC - Last 10 Years")
plt.legend()
plt.grid(True)
plt.show()
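Since the download function above depends on a live API, one option is to keep the parsing in a separate function that can be checked offline. The sketch below assumes the record shape handled by the script (a `data` field in dd/mm/yyyy format and a `valor` field stored as a string); the function name `parse_sgs_payload` and the sample values are hypothetical:

```python
import pandas as pd

def parse_sgs_payload(payload, name):
    """Convert a list of {"data": "dd/mm/yyyy", "valor": "..."} records into a sorted DataFrame."""
    if not isinstance(payload, list) or not payload:
        return None
    df = pd.DataFrame(payload)
    df["data"] = pd.to_datetime(df["data"], format="%d/%m/%Y")
    # errors="coerce" turns empty or malformed values into NaN instead of raising
    df[name] = pd.to_numeric(df["valor"], errors="coerce")
    return df[["data", name]].sort_values("data").reset_index(drop=True)

# Hand-written sample in the assumed response shape (not real BCB values)
sample = [
    {"data": "01/02/2024", "valor": "0.80"},
    {"data": "01/01/2024", "valor": "0.97"},
    {"data": "01/03/2024", "valor": ""},
]
parsed = parse_sgs_payload(sample, "SELIC (%)")
print(parsed)
```

Separating the two steps means the HTTP call can fail or time out without affecting how well-formed responses are converted.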
Written by

Bernardo Ribeiro de Moura
Senior Data Analyst at Unimed Rio Preto, working with predictive models, cost optimization, and data-driven decision-making. Bachelor’s in Chemistry (UNESP), transitioning to Data Science (UNIVESP), combining science and technology to solve real-world problems. Specialized in Google Data Analytics. I write about predictive analysis, data visualization, and statistical modeling. Let’s exchange ideas on Python, SQL, and the impact of data in our daily lives!