Time Series Analysis using Python

Ahmad W Khan
10 min read

Time series analysis is an essential tool for data scientists, enabling them to analyze and forecast data points collected over time. This guide covers the techniques, libraries, and models used for time series analysis in Python, and aims to equip you to tackle complex time series data in various domains, including finance.

Table of Contents

  1. Introduction to Time Series Analysis

  2. Key Concepts and Terminology

  3. Tools, Libraries, and Technologies

  4. Data Preprocessing and Exploration

  5. Decomposition of Time Series

  6. Stationarity and Differencing

  7. Advanced Models for Time Series Forecasting

    • ARIMA

    • SARIMA

    • Prophet

    • LSTM

  8. Model Evaluation and Selection

  9. Practical Example: Stock Price Prediction

  10. Conclusion

1. Introduction to Time Series Analysis

Time series analysis involves analyzing time-ordered data to extract meaningful statistics and characteristics. It's used in various domains such as finance, economics, environmental science, and more. The goal is often to forecast future values based on historical data.

The complexity of time series analysis lies in understanding and dealing with the intrinsic patterns and structures within the data, such as trends, seasonality, cycles, and noise. Accurate time series forecasting can lead to significant advantages in decision-making processes, from predicting stock prices to anticipating demand in supply chain management.

2. Key Concepts and Terminology

Before diving into the technical details, it's crucial to understand some key concepts and terminology in time series analysis:

  • Trend: The long-term movement or direction in the data. It represents the underlying pattern that indicates a persistent increase or decrease in the series.

  • Seasonality: The repeating patterns or cycles in the data that occur at regular intervals, such as daily, monthly, or yearly.

  • Noise: Random variations or fluctuations in the data that cannot be attributed to the trend or seasonality.

  • Stationarity: A property of a time series where statistical properties such as the mean, the variance, and the autocorrelation structure are constant over time. Stationarity is crucial for many time series forecasting methods.

  • Autocorrelation: The correlation of a time series with its own past values.

  • Lag: The time step difference between observations in a time series.

Understanding these concepts is essential for selecting the appropriate models and methods for analyzing and forecasting time series data.
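
As a rough illustration of lag and autocorrelation, the sketch below builds a synthetic series with a 12-step cycle (an assumption made for the example, not data from this guide) and measures how strongly it correlates with itself at two different lags using pandas' Series.autocorr.

```python
import numpy as np
import pandas as pd

# Synthetic series (illustrative assumption): a 12-step seasonal cycle plus mild noise
rng = np.random.default_rng(0)
t = np.arange(120)
series = pd.Series(np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.1, size=t.size))

# Autocorrelation at lag k: correlation between the series and itself shifted by k steps
acf_lag_12 = series.autocorr(lag=12)  # one full cycle apart, so strongly positive
acf_lag_6 = series.autocorr(lag=6)    # half a cycle apart, so strongly negative

print(f"lag 12: {acf_lag_12:.2f}, lag 6: {acf_lag_6:.2f}")
```

A strong positive autocorrelation at the seasonal lag is one of the simplest signs of seasonality in a series.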

3. Tools, Libraries, and Technologies

Programming Language

  • Python: Python is widely used for time series analysis due to its rich ecosystem of libraries and ease of use.

Libraries and Tools

  • Pandas: For data manipulation and analysis. It provides data structures like DataFrame, which are essential for handling time series data.

  • NumPy: For numerical operations. It supports a wide array of mathematical functions and operations on arrays.

  • Matplotlib and Seaborn: For data visualization. These libraries help in creating insightful plots and charts.

  • Statsmodels: For statistical modeling. It provides classes and functions for the estimation of many different statistical models.

  • pmdarima: For automating ARIMA model selection. It simplifies the process of building and tuning ARIMA models.

  • Prophet: For time series forecasting developed by Facebook. It's designed to handle missing data and seasonal variations automatically.

  • Keras and TensorFlow: For building neural network models such as LSTMs. They provide tools to create and train deep learning models.

  • Scikit-learn: For model evaluation and metrics. It offers tools for splitting data, validating models, and calculating performance metrics.

These tools and libraries are essential for implementing the advanced techniques discussed in this guide. They provide robust functionality for handling, analyzing, and visualizing time series data, making Python a preferred choice for time series analysis.

4. Data Preprocessing and Exploration

Data preprocessing is a critical step in time series analysis. It involves cleaning the data, handling missing values, and exploring the data to understand its underlying structure.

Importing Libraries

First, let's import the necessary libraries. These libraries will help us with data manipulation, visualization, and modeling.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
from pmdarima import auto_arima
import warnings
warnings.filterwarnings("ignore")

Loading the Data

Loading the data is the first step in any analysis. Here, we will use a CSV file containing time series data.

data = pd.read_csv('your_time_series_data.csv', index_col='date', parse_dates=True)
data.head()

Visualizing the Data

Visualizing the data helps us understand its structure and identify any apparent trends or seasonality.

plt.figure(figsize=(10, 6))
plt.plot(data)
plt.title('Time Series Data')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()

Handling Missing Values

Missing values can significantly affect the analysis. We can handle them by using forward fill or backward fill methods.

# fillna(method='ffill') is deprecated in recent pandas versions; ffill() is the equivalent call
data = data.ffill()
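
To make the difference between the fill strategies concrete, here is a small sketch on a toy daily series (the values are invented for illustration). It compares forward fill, backward fill, and linear interpolation, which is often a reasonable alternative when the series changes smoothly.

```python
import numpy as np
import pandas as pd

# Toy daily series with two gaps (values are illustrative only)
idx = pd.date_range("2024-01-01", periods=6, freq="D")
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0], index=idx)

forward = s.ffill()                      # carry the last known value forward
backward = s.bfill()                     # pull the next known value backward
linear = s.interpolate(method="linear")  # straight line between known neighbours

print(pd.DataFrame({"ffill": forward, "bfill": backward, "linear": linear}))
```

Forward fill never uses future values, which matters when the filled series is later used for forecasting.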

Exploring Statistical Properties

Exploring the statistical properties of the data gives us insights into its distribution and variability.

print(data.describe())
data.info()  # info() prints its report directly and returns None

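Box plots grouped by calendar period can reveal patterns over time. Grouping by year shows how the level and spread of the series change from year to year, while grouping by month (data.index.month) highlights seasonality.

plt.figure(figsize=(12, 8))
sns.boxplot(x=data.index.year, y=data['value'])
plt.title('Distribution of Values by Year')
plt.show()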
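
The same grouping idea can be checked numerically. The sketch below uses synthetic monthly data (an assumption for illustration) with an upward trend and a mid-year peak, then averages by month and by year with pandas groupby.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data (illustrative): upward trend plus a mid-year peak
idx = pd.date_range("2015-01-01", periods=72, freq="MS")
months = idx.month.to_numpy()
values = 0.5 * np.arange(72) + 10 * np.sin(2 * np.pi * (months - 4) / 12)
data = pd.DataFrame({"value": values}, index=idx)

by_month = data["value"].groupby(data.index.month).mean()  # seasonal profile
by_year = data["value"].groupby(data.index.year).mean()    # year-over-year level

print("Peak month:", by_month.idxmax())
print("Yearly means rising:", by_year.is_monotonic_increasing)
```

The monthly averages expose the seasonal shape, while the yearly averages expose the trend.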

5. Decomposition of Time Series

Time series decomposition involves splitting the series into its components: trend, seasonality, and residual. This helps us understand the underlying patterns in the data.

Additive vs. Multiplicative Decomposition

  • Additive: When the components add up to form the time series.

  • Multiplicative: When the components multiply to form the time series.
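
A minimal sketch of the two forms, using made-up trend and seasonal components and leaving out the residual term for brevity. In the additive case the seasonal swing has a fixed size; in the multiplicative case it grows with the level of the series.

```python
import numpy as np

t = np.arange(48)
trend = 100 + 2.0 * t                                # steadily rising level
seasonal_add = 10 * np.sin(2 * np.pi * t / 12)       # fixed swing of +/-10 units
seasonal_mul = 1 + 0.1 * np.sin(2 * np.pi * t / 12)  # swing of +/-10% of the level

additive = trend + seasonal_add        # y = T + S
multiplicative = trend * seasonal_mul  # y = T * S

# Taking logs turns the multiplicative form into an additive one
log_is_additive = np.allclose(
    np.log(multiplicative), np.log(trend) + np.log(seasonal_mul)
)
print(log_is_additive)
```

This is why a log transform is a common first step when the seasonal amplitude grows with the trend.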

Decomposing the Series

Using the seasonal_decompose function from statsmodels, we can decompose the time series into its components.

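# period=12 assumes monthly data with a yearly cycle; set it to match your data
decomposition = seasonal_decompose(data['value'], model='multiplicative', period=12)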
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

plt.figure(figsize=(12, 8))
plt.subplot(411)
plt.plot(data, label='Original')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(seasonal, label='Seasonality')
plt.legend(loc='best')
plt.subplot(414)
plt.plot(residual, label='Residuals')
plt.legend(loc='best')
plt.tight_layout()
plt.show()

In the above code, the seasonal_decompose function breaks down the time series into three components: trend, seasonality, and residual. The resulting plots help us visualize these components separately.

6. Stationarity and Differencing

A stationary time series has a constant mean and variance over time, making it easier to model. Many time series models require the series to be stationary.

Augmented Dickey-Fuller Test

The Augmented Dickey-Fuller (ADF) test is a statistical test used to check if a time series is stationary.

def adf_test(series):
    # Run the Augmented Dickey-Fuller test and report the key statistics
    result = adfuller(series)
    print('ADF Statistic:', result[0])
    print('p-value:', result[1])
    print('Critical Values:')
    for key, value in result[4].items():
        print(f'   {key}: {value}')

adf_test(data['value'])

The ADF test provides a way to check for stationarity. Its null hypothesis is that the series has a unit root, meaning it is non-stationary. A small p-value (typically ≤ 0.05) is evidence against that null hypothesis, suggesting the series is stationary.

Differencing

Differencing is a technique used to make a time series stationary. It involves subtracting the previous observation from the current observation.

data_diff = data.diff().dropna()
adf_test(data_diff['value'])

In this step, we apply differencing to the data and then perform the ADF test again to check for stationarity. This process may need to be repeated more than once to achieve stationarity.
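
The arithmetic behind differencing is easy to verify by hand. The sketch below uses a short invented series with a linear trend, takes first and second differences, and then shows that a cumulative sum restores the original values.

```python
import pandas as pd

# Illustrative series with a linear trend: 5, 8, 11, 14, 17
y = pd.Series([5.0, 8.0, 11.0, 14.0, 17.0])

first_diff = y.diff().dropna()          # y[t] - y[t-1]
second_diff = y.diff().diff().dropna()  # difference of the differences

# Undo first-order differencing: cumulative sum plus the first original value
restored = first_diff.cumsum() + y.iloc[0]

print(first_diff.tolist())   # [3.0, 3.0, 3.0, 3.0]
print(second_diff.tolist())  # [0.0, 0.0, 0.0]
print(restored.tolist())     # [8.0, 11.0, 14.0, 17.0]
```

One difference removes a linear trend. Being able to invert the operation is what lets a model trained on differenced data produce forecasts on the original scale.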

7. Advanced Models for Time Series Forecasting

ARIMA Model

The ARIMA (AutoRegressive Integrated Moving Average) model is a popular choice for time series forecasting. It combines autoregression, differencing, and moving average components.

Fitting the ARIMA Model

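from statsmodels.tsa.arima.model import ARIMA

# order=(p, d, q): 5 autoregressive lags, 1 difference, 0 moving-average terms
model = ARIMA(data['value'], order=(5, 1, 0))
model_fit = model.fit()  # the current ARIMA.fit() does not accept a disp argument
print(model_fit.summary())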

In this code, we fit an ARIMA model to the data. The order parameter specifies the AR, I, and MA terms. The summary of the model provides insights into the coefficients and their significance.
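
To give a feel for what the autoregressive part of the order captures, here is a minimal sketch that simulates an AR(1) process with NumPy and recovers its coefficient by least squares. The coefficient 0.7 and the sample size are arbitrary choices for illustration. This is not how statsmodels estimates ARIMA parameters.

```python
import numpy as np

# Simulate an AR(1) process y[t] = phi * y[t-1] + noise (phi chosen for illustration)
rng = np.random.default_rng(42)
phi_true = 0.7
n = 2000
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi_true * y[t - 1] + rng.normal()

# Least-squares estimate of phi: regress y[t] on y[t-1]
prev, curr = y[:-1], y[1:]
phi_hat = np.dot(prev, curr) / np.dot(prev, prev)

print(f"true phi = {phi_true}, estimated phi = {phi_hat:.3f}")
```

An ARIMA(p, d, q) model generalizes this idea to p lagged values, applied after differencing the series d times, plus q lagged forecast errors.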

Plotting ARIMA Forecasts

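# Forecast 365 steps beyond the end of the series. Forecasts are on the original
# (undifferenced) scale, so they are plotted against the original data.
# Assumes a regularly spaced date index so the forecast receives matching dates.
predictions = model_fit.forecast(steps=365)
plt.figure(figsize=(10, 6))
plt.plot(data['value'], label='Actual')
plt.plot(predictions, label='Forecast')
plt.legend(loc='best')
plt.show()

The ARIMA model can then be used to generate forecasts, which we plot after the historical data to see how the model extends the series.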

SARIMA Model

The SARIMA (Seasonal ARIMA) model extends ARIMA by adding support for modeling seasonality. This makes it suitable for data with seasonal patterns.

Fitting the SARIMA Model

from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(data, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
model_fit = model.fit(disp=0)
print(model_fit.summary())

The SARIMA model includes additional parameters to capture seasonal effects. The seasonal_order parameter specifies the seasonal components.
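
The seasonal_order tuple is (P, D, Q, s), where s is the length of the seasonal cycle and D is the number of seasonal differences. As a small sketch of what one seasonal difference with s=12 does, the example below uses an invented monthly series with a linear trend and a repeating 12-month pattern.

```python
import numpy as np
import pandas as pd

# Illustrative monthly series: linear trend plus a repeating 12-month pattern
t = np.arange(48)
pattern = np.tile([0, 2, 5, 9, 12, 14, 15, 13, 10, 6, 3, 1], 4)
y = pd.Series(0.5 * t + pattern, dtype=float)

# Seasonal difference at lag 12 (the D=1, s=12 part): y[t] - y[t-12]
seasonal_diff = y.diff(12).dropna()

# The repeating pattern cancels; only the trend's 12-step increase (0.5 * 12) remains
print(seasonal_diff.unique())
```

Real data will not cancel this cleanly, but the same operation removes most of a stable seasonal pattern before the remaining structure is modeled.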

Prophet

Prophet is a powerful forecasting tool developed by Facebook. It is designed to handle missing data and seasonal variations automatically.

Fitting the Prophet Model

from prophet import Prophet  # the package was renamed from fbprophet to prophet in version 1.0

data_reset = data.reset_index().rename(columns={'date': 'ds', 'value': 'y'})
model = Prophet()
model.fit(data_reset)
future = model.make_future_dataframe(periods=365)
forecast = model.predict(future)

model.plot(forecast)
plt.show()

Prophet simplifies the process of creating and fitting a time series model. It fits seasonal patterns automatically, and it can model holiday effects when you supply them, for example with add_country_holidays.

LSTM (Long Short-Term Memory)

LSTM networks are a type of recurrent neural network (RNN) capable of learning long-term dependencies. They are particularly useful for modeling time series data.

Preparing the Data for LSTM

from keras.models import Sequential
from keras.layers import LSTM, Dense

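# Prepare the data: split first, then scale with training statistics only
# so that no information from the test period leaks into training
train_size = int(len(data) * 0.80)
train_mean, train_std = data[:train_size].mean(), data[:train_size].std()
data_scaled = (data - train_mean) / train_std
train, test = data_scaled[:train_size], data_scaled[train_size:]

def create_dataset(dataset, look_back=1):
    """Build (input window, next value) pairs from a 2-D array of shape (n, 1)."""
    X, Y = [], []
    for i in range(len(dataset) - look_back):
        X.append(dataset[i:(i + look_back), 0])
        Y.append(dataset[i + look_back, 0])
    return np.array(X), np.array(Y)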

look_back = 1
trainX, trainY = create_dataset(train.values, look_back)
testX, testY = create_dataset(test.values, look_back)

trainX = np.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
testX = np.reshape(testX, (testX.shape[0], 1, testX.shape[1]))

Building and Training the LSTM Model

model = Sequential()
model.add(LSTM(50, input_shape=(1, look_back)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=100, batch_size=1, verbose=2)

In this code, we prepare the data for the LSTM model by scaling it and creating sequences. We then build and train the LSTM model using the Keras library.
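
The windowing step is easier to follow on a tiny array. The sketch below uses a hypothetical helper, make_windows, and treats each window as 3 timesteps of 1 feature, which is a common alternative to the (samples, 1, look_back) layout used above.

```python
import numpy as np

def make_windows(values, look_back):
    """Turn a 1-D array into (samples, look_back) inputs and next-step targets."""
    X = np.array([values[i:i + look_back] for i in range(len(values) - look_back)])
    y = values[look_back:]
    return X, y

values = np.arange(10, dtype=float)  # stand-in for a scaled series
X, y = make_windows(values, look_back=3)

# Keras LSTM layers expect input shaped (samples, timesteps, features)
X_lstm = X.reshape(X.shape[0], 3, 1)

print(X[0], y[0])             # first window [0. 1. 2.] and its target 3.0
print(X_lstm.shape, y.shape)  # (7, 3, 1) (7,)
```

Each row of X holds look_back consecutive observations, and the matching entry of y is the value that immediately follows them.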

Plotting LSTM Forecasts

predictions = model.predict(testX)
plt.figure(figsize=(10, 6))
plt.plot(testY, label='Actual')
plt.plot(predictions, label='Forecast')
plt.legend(loc='best')
plt.show()

The trained LSTM model can be used to make predictions, which we plot against the actual values to evaluate the model's performance.

8. Model Evaluation and Selection

Evaluating the performance of your time series model is crucial to ensure its accuracy and reliability.

Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE)

MAE and RMSE are common metrics for evaluating the performance of time series models.

from sklearn.metrics import mean_absolute_error, mean_squared_error

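# Hold out the last 20% of the unscaled series and refit on the training part only,
# so the model is scored on data it has not seen
series = data['value']
train_size = int(len(series) * 0.80)
train, test = series[:train_size], series[train_size:]
model_fit = ARIMA(train, order=(5, 1, 0)).fit()

predictions = model_fit.forecast(steps=len(test))
mae = mean_absolute_error(test, predictions)
rmse = np.sqrt(mean_squared_error(test, predictions))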

print(f'MAE: {mae}')
print(f'RMSE: {rmse}')
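
Both metrics are simple enough to compute by hand, which helps when interpreting them. A minimal sketch with invented numbers:

```python
import numpy as np

actual = np.array([100.0, 102.0, 105.0, 107.0])
predicted = np.array([98.0, 103.0, 105.0, 111.0])

errors = actual - predicted           # [2, -1, 0, -4]
mae = np.mean(np.abs(errors))         # (2 + 1 + 0 + 4) / 4 = 1.75
rmse = np.sqrt(np.mean(errors ** 2))  # sqrt((4 + 1 + 0 + 16) / 4) = sqrt(5.25)

print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}")
```

RMSE squares the errors before averaging, so the single large miss of 4 pulls it above the MAE. A wide gap between the two usually points to a few large errors rather than many small ones.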

Visualizing Residuals

Residuals are the differences between the actual and predicted values. Analyzing residuals helps in understanding the model's performance and identifying any patterns the model failed to capture.

residuals = pd.DataFrame(model_fit.resid)
plt.figure(figsize=(10, 6))
plt.plot(residuals)
plt.title('Residuals')
plt.show()
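
Beyond a visual check, two quick numbers are often informative: the residual mean, which should be near zero, and the lag-1 autocorrelation, which should be near zero if no structure is left. The sketch below uses synthetic stand-ins for residuals, not output from the models above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Stand-ins for model residuals (synthetic, for illustration only)
good_resid = pd.Series(rng.normal(0, 1, 500))  # no structure left
bad_resid = pd.Series(np.sin(np.arange(500) / 10.0)) + 0.1 * good_resid  # leftover pattern

for name, resid in [("good", good_resid), ("bad", bad_resid)]:
    print(name, "mean:", round(resid.mean(), 3),
          "lag-1 autocorr:", round(resid.autocorr(lag=1), 3))
```

Residuals with high autocorrelation suggest that the model missed a pattern, such as seasonality or an additional lag.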

9. Practical Example: Stock Price Prediction

Data Collection

For this example, we will use stock price data from a well-known stock. You can obtain stock price data using various APIs such as Yahoo Finance, Alpha Vantage, or directly from financial data providers.

import yfinance as yf

ticker = 'AAPL'
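data = yf.download(ticker, start='2010-01-01', end='2020-01-01')
# squeeze() yields a Series whether 'Close' comes back as a Series or a one-column DataFrame
data = data['Close'].squeeze()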
data.head()

Data Preprocessing

data = data.ffill()  # fillna(method='ffill') is deprecated in recent pandas versions

plt.figure(figsize=(10, 6))
plt.plot(data)
plt.title(f'{ticker} Stock Price')
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()

Decomposing the Time Series

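# Trading-day data has no fixed calendar frequency, so the period must be given explicitly;
# 252 (roughly the number of trading days in a year) is an assumed value
decomposition = seasonal_decompose(data, model='multiplicative', period=252)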
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

plt.figure(figsize=(12, 8))
plt.subplot(411)
plt.plot(data, label='Original')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(seasonal, label='Seasonality')
plt.legend(loc='best')
plt.subplot(414)
plt.plot(residual, label='Residuals')
plt.legend(loc='best')
plt.tight_layout()
plt.show()

Checking for Stationarity

adf_test(data)
data_diff = data.diff().dropna()
adf_test(data_diff)

Building the ARIMA Model

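model = ARIMA(data, order=(5, 1, 0))
model_fit = model.fit()  # the current ARIMA.fit() does not accept a disp argument
print(model_fit.summary())

# Forecast 365 steps ahead; forecasts are on the original price scale
predictions = model_fit.forecast(steps=365)
# The downloaded index has no set frequency, so attach business-day dates for plotting
forecast_index = pd.bdate_range(start=data.index[-1], periods=366)[1:]
plt.figure(figsize=(10, 6))
plt.plot(data, label='Actual')
plt.plot(forecast_index, np.asarray(predictions), label='Forecast')
plt.legend(loc='best')
plt.show()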

Building the LSTM Model

Preparing the Data for LSTM

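from keras.preprocessing.sequence import TimeseriesGenerator  # deprecated in recent Keras releases

# Split first, then scale with training statistics only to avoid leaking test data
train_size = int(len(data) * 0.80)
train_mean, train_std = data[:train_size].mean(), data[:train_size].std()
# The generator and the LSTM expect a NumPy array of shape (n, 1)
data_scaled = ((data - train_mean) / train_std).to_numpy().reshape(-1, 1)
train, test = data_scaled[:train_size], data_scaled[train_size:]

train_generator = TimeseriesGenerator(train, train, length=10, batch_size=1)
test_generator = TimeseriesGenerator(test, test, length=10, batch_size=1)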

Building and Training the LSTM Model

model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(10, 1)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')

model.fit(train_generator, epochs=50)

predictions = model.predict(test_generator)
plt.figure(figsize=(10, 6))
plt.plot(test[10:], label='Actual')
plt.plot(predictions, label='Forecast')
plt.legend(loc='best')
plt.show()

10. Conclusion

Time series analysis is an invaluable tool for data scientists and analysts. This advanced guide has covered critical techniques and models that can be applied to various domains, with a practical example focusing on stock price prediction. By mastering these methods, you can unlock powerful insights and make data-driven decisions to drive success in your projects and organizations.

Feel free to reach out to me at AhmadWKhan.com with any questions or further discussions on advanced time series analysis techniques. Happy forecasting!
