Risk Analytics in Python: Building Models Using VS Code, ML Python Libraries, Kedro and Docker

Introduction

In today's financial ecosystem, where change is constant, risk analytics is a key building block in every decision-making process at banks and fintech companies. Over more than six years in banking, specializing in risk analytics, modeling, and validation, I have watched risk modeling techniques evolve and seen the growing need for productive, scalable, and reproducible ways of building risk models.

In this article, we will review how Python is used in risk analytics, focusing on model development with templates. This approach not only streamlines development but also ensures consistent, maintainable results for any risk modeling project. We will walk through examples drawn from different parts of the industry and present best practices for professionals working in this field.

Theory

Why Risk Analytics in Finance Matters

Risk in the Financial Sector

The finance industry inherently involves risk. From credit and market risk to operational and liquidity risk, financial firms must continuously assess and manage many kinds of risk to maintain stability and profitability. Risk analytics provides the toolkits and methodologies for quantifying, analyzing, and ultimately mitigating these risks effectively.

The Role of Data Science in Risk Management

Data science has changed the paradigm of risk management in finance. Using advanced statistical techniques, machine learning algorithms, and big data technologies, financial institutions can now build more accurate predictive models. This lets them uncover complex patterns and relationships in the data, automate risk assessment processes, and deliver real-time risk insights. For example, most banks today use machine learning models to predict the probability of default for loan applicants, enabling more accurate credit decisions.

Python: The Go-To Language for Risk Analytics

Why Python?

Python has become the de facto language of data science and risk analytics thanks to its simplicity, versatility, and powerful ecosystem of libraries. Many financial institutions have migrated from legacy SAS-based systems to Python and reaped numerous benefits:

  1. Cost-effectiveness: Open-source in nature, reducing licensing costs

  2. Flexibility: Easier integrations with modern data science tools and frameworks

  3. Talent pool: Wider availability of Python developers

  4. Community support: Access to a large community and continuous enhancements

Essential Tools for Risk Analytics

A handful of Python libraries form the cornerstone of risk analytics:

  • Pandas: Data manipulation and analysis

  • NumPy: Numerical computing

  • Scikit-learn: ML algorithms

  • OptBinning: Data binning

  • Statsmodels: Statistical modeling

  • Matplotlib, Seaborn, Plotly: Data visualization

Kedro is a popular framework for building ML pipelines, while containerization has long been Docker’s domain. For any new risk analytics project, list all libraries with their version numbers in a requirements.txt file; this ensures reproducibility across environments.

Building Risk Models Using Templates

The Concept of Model Templates

Model templates are pre-defined structures or frameworks that form the basis from which a risk model is built. A template encapsulates standard practices, best practices in model development, and common functionality. The benefits of using templates include:

  1. Consistency across various models and teams

  2. Shorter development time

  3. Easier maintenance and updates

  4. Better code quality and readability

Key Components of a Risk Model Template

A typical comprehensive template will contain the following components (illustrated in the sketch after the list):

  1. Data pre-processing module

  2. A feature engineering framework

  3. A model training and evaluation pipeline

  4. Model performance metrics calculation features

  5. Reporting and visualization functions

  6. Model validation checks
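
To make this concrete, here is a minimal sketch of what such a template's interface might look like; the function names and bodies are illustrative placeholders, not part of any specific library:

# A minimal sketch of a risk model template's interface; names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Data pre-processing: fix types, drop duplicates, handle missing values."""
    return df.drop_duplicates()


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Feature engineering: derive additional model variables."""
    return df


def train_and_evaluate(df: pd.DataFrame, target: str, seed: int = 1234) -> dict:
    """Model training and evaluation pipeline with a basic performance metric."""
    X, y = df.drop(columns=[target]), df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed, stratify=y)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return {"model": model, "roc_auc_test": auc}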

Advanced Techniques in Risk Modeling

Incorporating Macroeconomic Factors

Stress testing of credit portfolios typically centers on macroeconomic modeling. It is therefore necessary to identify the key macroeconomic variables, such as GDP growth, unemployment, or inflation. The analyst then builds a time series model to project these indicators and feeds the projections into the risk model to estimate how the portfolio would perform under different economic conditions. For example, during the COVID-19 pandemic, banks had to adapt their stress testing models quickly to economic conditions never seen before; flexibility in macroeconomic modeling is paramount in times of crisis.
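
A minimal sketch of this workflow using statsmodels; the GDP series, ARIMA order, baseline PD, and sensitivity below are purely illustrative assumptions:

# Illustrative sketch: project a macro indicator and stress a baseline PD.
# The series, ARIMA order, and PD sensitivity are assumptions, not real data.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical quarterly GDP growth (%, year-on-year)
gdp_growth = pd.Series([2.1, 1.8, 2.4, 2.0, 1.5, 0.9, 1.2, 1.7])

# Fit a simple time-series model and project the next four quarters
projection = ARIMA(gdp_growth, order=(1, 0, 0)).fit().forecast(steps=4)

# Feed the projection into the risk model: here a toy linear sensitivity of PD to growth
baseline_pd = 0.035
sensitivity = -0.004          # PD change per percentage point of GDP growth
stressed_pd = baseline_pd + sensitivity * (projection - gdp_growth.mean())
print(stressed_pd.round(4))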

Ensemble Methods for Improved Accuracy

Ensemble methods generally perform well in risk analytics because they combine the strengths of several algorithms to improve accuracy. Effective ensemble techniques include Random Forests, Gradient Boosting Machines such as XGBoost and LightGBM, and stacking for additional predictive power. When using ensemble methods, however, it is important to weigh the trade-off between performance and interpretability, particularly in regulated environments where explainability matters as much as accuracy.
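
As a hedged illustration, a stacked ensemble can be assembled in a few lines with scikit-learn (synthetic data and arbitrary hyperparameters, for demonstration only):

# Illustrative sketch: a stacked ensemble combining a random forest and
# gradient boosting behind a logistic regression meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9], random_state=0)

ensemble = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("gbm", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
print(cross_val_score(ensemble, X, y, scoring="roc_auc", cv=3).mean())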

Neural Networks for Complex Risk Patterns

Neural networks are well suited for modeling the complex, nonlinear relationships that may appear in the data, which makes them a good fit for uncovering intricate risk patterns. They handle high-dimensional feature spaces well and learn features automatically. In finance, deep learning models have been applied to real-time transaction fraud detection, achieving much higher detection rates than traditional rule-based approaches.
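
A small sketch with scikit-learn's MLPClassifier on synthetic, heavily imbalanced data; a production fraud model would use a dedicated deep learning stack and far richer features:

# Illustrative sketch: a small feed-forward network on fraud-like imbalanced data.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, n_features=30, weights=[0.98], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0),
)
clf.fit(X_train, y_train)
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))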

Implementing Models in Production

Model Monitoring and Validation

Building a successful model involves several key components. To monitor and validate models effectively, you need automated performance tracking with defined thresholds for model degradation, alerting on material performance drops, and regular backtesting. A good practice for robust monitoring is a dashboard with real-time data, so that any issue can be pinpointed and addressed quickly.
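
One widely used degradation metric is the Population Stability Index (PSI); below is a minimal sketch of how it can be computed and checked against an alert threshold (the 0.25 cut-off is a common rule of thumb, not a regulatory constant):

# Illustrative PSI calculation between expected (development) and actual
# (recent) score distributions; binning scheme and threshold are assumptions.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    # Decile cut points derived from the expected (development) distribution
    cut_points = np.quantile(expected, np.linspace(0, 1, n_bins + 1))[1:-1]
    e_pct = np.bincount(np.digitize(expected, cut_points), minlength=n_bins) / len(expected)
    a_pct = np.bincount(np.digitize(actual, cut_points), minlength=n_bins) / len(actual)
    e_pct, a_pct = np.clip(e_pct, 1e-6, None), np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
value = psi(rng.normal(0, 1, 10_000), rng.normal(0.2, 1, 10_000))
print(round(value, 3), "ALERT" if value > 0.25 else "OK")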

Integrating Models into Risk Decision Systems

When integrating models into production systems for risk decisions, it is important to design modular and scalable code, use data pipelines efficiently, and support real-time scoring. In addition, exposing models through API endpoints facilitates seamless integration. For example, banks run real-time credit scoring systems that combine several risk models to make instant lending decisions on online loan applications.
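
As a sketch of such an endpoint, assuming FastAPI (which is not part of this article's requirements) and a previously pickled scorecard model:

# Minimal sketch of a real-time scoring endpoint; FastAPI and the model
# path are assumptions, not part of this article's stack.
import pickle
import pandas as pd
from fastapi import FastAPI

app = FastAPI()

with open("scorecard.pkl", "rb") as f:      # hypothetical path to a trained model
    scorecard = pickle.load(f)

@app.post("/score")
def score(application: dict) -> dict:
    """Score one loan application passed as a JSON object of model features."""
    df = pd.DataFrame([application])
    probability_of_default = float(scorecard.predict_proba(df)[:, 1][0])
    return {"probability_of_default": probability_of_default}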

Challenges and Best Practices in Risk Analytics

Data Quality and Availability

The accuracy of any risk analytics effort depends on the availability of high-quality data. Best practices include data validation checks, data quality scorecards, data governance policies, and alternative data sources where traditional data is in short supply. Automating data quality checks within ETL pipelines helps surface potential issues early.
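
A small sketch of such an automated check, with an arbitrary threshold and toy data:

# Illustrative sketch of automated data-quality checks that could run
# inside an ETL pipeline; threshold and column names are assumptions.
import pandas as pd

def data_quality_report(df: pd.DataFrame, max_missing_share: float = 0.2) -> pd.DataFrame:
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_share": df.isna().mean(),
        "n_unique": df.nunique(),
    })
    report["flag_too_many_missing"] = report["missing_share"] > max_missing_share
    return report

df = pd.DataFrame({"age": [25, 40, None, 31], "income": [3000, None, None, 5200]})
print(data_quality_report(df))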

Model Interpretability and Explainability

As models grow in number and complexity, so does the need for interpretability. Techniques range from SHAP values and feature importance, through Partial Dependence Plots that show how features affect model outcomes, to LIME, which explains individual predictions locally and model-agnostically. In the financial sector, regulators demand transparent explanations for credit decisions, so interpretable AI is favored.
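
SHAP and LIME require their own libraries; as a sketch that stays within this article's stack, scikit-learn already offers permutation importance and partial dependence plots:

# Illustrative sketch: feature importance and a partial dependence plot with
# scikit-learn (synthetic data; output path is hypothetical).
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance, PartialDependenceDisplay

X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

imp = permutation_importance(model, X, y, scoring="roc_auc", n_repeats=5, random_state=0)
print(imp.importances_mean.round(3))

PartialDependenceDisplay.from_estimator(model, X, features=[0, 1])
plt.savefig("partial_dependence.png")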

Regulatory Compliance

Institutions operating in highly regulated environments face extensive documentation requirements covering model development, routine validation, regulatory standards such as Basel, orderly model inventories, and version control. The most pragmatic route to compliance is a model governance framework with a periodic review cycle and broad documentation standards.

Practice

Project Plan

  1. Create an environment for model development using Docker

  2. Create a model template using Kedro

  3. Create a PD model - an example from the Kaggle competition using our template

Create an Environment for Model Development

  1. Install Docker

  2. Install VS Code

  3. Install the VS Code extensions for working with containers (for example, Docker and Dev Containers), plus the Python extension

  4. Create the folder and file structure on disk:

risk_modeling
├── docker-compose.yaml
└── env
    ├── Dockerfile
    └── requirements.txt

  5. Create the Docker configuration files:

In ./risk_modeling/docker-compose.yaml:

services:
  p310:
    build: env/
    ports:
      - 4141:4141
    volumes:
      - ./:/docker_disk
    tty: true

In ./risk_modeling/env/Dockerfile:

FROM python:3.10-bookworm
COPY requirements.txt /requirements.txt
RUN apt-get update && apt-get upgrade -y && apt-get install -y cmake libopenblas-dev
RUN pip install -r requirements.txt

Installing cmake and libopenblas-dev is necessary for building some of the libraries in requirements.txt.

In ./risk_modeling/env/requirements.txt:

pandas==1.4.1
numpy==1.26.3
scikit-learn==1.4.0
scipy==1.11.4
catboost==1.2.2
jupyterlab==4.3.4
ks-metric==0.2.0
lightgbm==3.2.1
matplotlib==3.8.2
matplotlib-inline==0.1.6
missingno==0.5.2
mlxtend==0.21.0
optbinning==0.19.0
optuna==3.5.0
plotly==5.18.0
tqdm==4.66.1
kedro==0.19.10
kedro-viz==10.1.0
kedro-datasets==6.0.0
notebook==7.3.2
openpyxl==3.0.10

  6. Build the Docker image:

You can use the system command prompt or the terminal in VS Code.

Open the risk_modeling directory (the one containing docker-compose.yaml) and run in the terminal:

docker-compose build

Or:

docker-compose -f docker-compose.yaml build

We can check the created Docker image with the docker command in the terminal:

docker images

Next, start the development service container from the risk_modeling-p310 image with the following command in the terminal:

docker-compose up

Or:

docker-compose -f docker-compose.yaml up

And finally, attach the IDE (VS Code in our case) to the created Docker container, for example via the Dev Containers: Attach to Running Container command.

The next steps of model development will take place inside this container.

Inside the container, install the extensions you need again (for example, Python and Jupyter), since extensions inside a container are separate from those installed locally.

Create a Model Template Using Kedro

Use the following command in the container’s terminal:

kedro new

Set the project name:

risk_model

In this example we will not use PySpark, so when prompted for project tools, include:

1,2,3,4,5,7

We don’t need the example pipeline:

no

Now we have a model project structure based on the Kedro template:

  • ./conf: folder for credentials, input parameters and directories (data paths)

  • ./data: data storage folder

  • ./docs: project documentation

  • ./logs: project logs

  • ./notebooks: for code drafts or hypothesis testing using the Jupyter notebook format

  • ./src: main folder with all model code and pipelines

  • ./tests: folder for tests, including unit tests

Create a PD Model (Kaggle Competition Example Using Our Template)

We will solve a Kaggle task (Give Me Some Credit). Banks play a crucial role in market economies. They decide who can get finance and on what terms, and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.

Credit scoring algorithms, which estimate the probability of default, are the method banks use to determine whether a loan should be granted. This competition asks participants to improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years.

Download the Data

Put the unpacked Kaggle data into the directory data/01_raw.

Set Up Configuration Files

The catalog specifies the saving paths, file types, and formats for everything produced during model creation: datasets, binning files, trained model files, tables with statistics, graphs, etc.

In /docker_disk/risk-model/conf/base/catalog.yml:

# https://www.kaggle.com/competitions/GiveMeSomeCredit/data
01_raw_development:
  type: pandas.CSVDataset
  filepath: data/01_raw/cs-training.csv

01_raw_validation:
  type: pandas.CSVDataset
  filepath: data/01_raw/cs-test.csv

01_raw_example_to_submit:
  type: pandas.CSVDataset
  filepath: data/01_raw/sampleEntry.csv

# 01_preprocessing pipeline data: preprocessing stages

## fix index
02_intermediate_development:
  type: pickle.PickleDataset
  filepath: data/02_intermediate/df_dev.pkl

02_intermediate_validation:
  type: pickle.PickleDataset
  filepath: data/02_intermediate/df_validation.pkl

## new features

04_feature_development:
  type: pickle.PickleDataset
  filepath: data/04_feature/df_dev.pkl

04_feature_validation:
  type: pickle.PickleDataset
  filepath: data/04_feature/df_validation.pkl

## train/test & WOE-binning

05_model_input_X_train:
  type: pickle.PickleDataset
  filepath: data/05_model_input/X_train.pkl

05_model_input_X_test:
  type: pickle.PickleDataset
  filepath: data/05_model_input/X_test.pkl

05_model_input_y_train:
  type: pickle.PickleDataset
  filepath: data/05_model_input/y_train.pkl

05_model_input_y_test:
  type: pickle.PickleDataset
  filepath: data/05_model_input/y_test.pkl
### long list
05_model_input_X_train_woe:
  type: pickle.PickleDataset
  filepath: data/05_model_input/X_train_woe.pkl

05_model_input_X_test_woe:
  type: pickle.PickleDataset
  filepath: data/05_model_input/X_test_woe.pkl

05_model_input_development_woe:
  type: pickle.PickleDataset
  filepath: data/05_model_input/df_dev_woe.pkl

05_model_input_validation_woe:
  type: pickle.PickleDataset
  filepath: data/05_model_input/df_validation_woe.pkl

### short list

05_model_input_X_train_woe_short_list:
  type: pickle.PickleDataset
  filepath: data/05_model_input/X_train_woe_short_list.pkl

05_model_input_X_test_woe_short_list:
  type: pickle.PickleDataset
  filepath: data/05_model_input/X_test_woe_short_list.pkl

05_model_input_development_woe_short_list:
  type: pickle.PickleDataset
  filepath: data/05_model_input/df_dev_woe_short_list.pkl

05_model_input_validation_woe_short_list:
  type: pickle.PickleDataset
  filepath: data/05_model_input/df_validation_woe_short_list.pkl


# 02_modeling pipeline data: modeling stages and final model

## model files

06_models_feature_selection:
  type: pickle.PickleDataset
  filepath: data/06_models/sfs.pkl

06_models_binning:
  type: pickle.PickleDataset
  filepath: data/06_models/bp.pkl

06_models_binning_short_list:
  type: pickle.PickleDataset
  filepath: data/06_models/bp_short_list.pkl

06_models_lr:
  type: pickle.PickleDataset
  filepath: data/06_models/lr.pkl

06_models_scorecard:
  type: pickle.PickleDataset
  filepath: data/06_models/scorecard.pkl

## scored files

07_model_output_sample_to_kaggle:
  type: pandas.CSVDataset
  filepath: data/07_model_output/sampleEntry.csv

07_model_output_development_scored:
  type: pickle.PickleDataset
  filepath: data/07_model_output/df_dev_scored.pkl

07_model_output_validation_scored:
  type: pickle.PickleDataset
  filepath: data/07_model_output/df_validation_scored.pkl

# 03_reporting pipeline data: reports

08_reporting_variables_binning_table:
  type: pandas.ExcelDataset
  filepath: data/08_reporting/table_variables_binning.xlsx
  save_args:
    index: True
    sheet_name: Sheet1
  metadata:
    kedro-viz:
      preview_args:
          nrows: 1000

08_reporting_variables_binning_table_short_list:
  type: pandas.ExcelDataset
  filepath: data/08_reporting/table_variables_binning_short_list.xlsx
  save_args:
    index: True
    sheet_name: Sheet1
  metadata:
    kedro-viz:
      preview_args:
          nrows: 1000

08_reporting_variables_summary_table:
  type: pandas.ExcelDataset
  filepath: data/08_reporting/table_variables_summary.xlsx
  save_args:
    sheet_name: Sheet1
  metadata:
    kedro-viz:
      preview_args:
          nrows: 1000

08_reporting_variables_summary_table_short_list:
  type: pandas.ExcelDataset
  filepath: data/08_reporting/table_variables_summary_short_list.xlsx
  save_args:
    sheet_name: Sheet1
  metadata:
    kedro-viz:
      preview_args:
          nrows: 1000

08_reporting_model_scorecard_table:
  type: pandas.ExcelDataset
  filepath: data/08_reporting/table_model_scorecard.xlsx
  save_args:
    sheet_name: Sheet1 
  metadata:
    kedro-viz:
      preview_args:
          nrows: 1000

08_reporting_model_feature_correlation:
  type: pandas.ExcelDataset
  filepath: data/08_reporting/table_model_lr_features_correlation.xlsx
  save_args:
    sheet_name: Sheet1 
  metadata:
    kedro-viz:
      preview_args:
          nrows: 1000

08_reporting_model_feature_correlation_plot:
  type: matplotlib.MatplotlibWriter
  filepath: data/08_reporting/plot_model_feature_correlation.jpeg

08_reporting_model_scorecard_feature_coefs:
  type: tracking.JSONDataset
  filepath: data/08_reporting/model_scorecard_features_coefs.json

08_reporting_model_scorecard_feature_selection_plot:
  type: matplotlib.MatplotlibWriter
  filepath: data/08_reporting/plot_model_scorecard_feature_selection.jpeg

08_reporting_statistics_monitoring_train_test:
  type: pickle.PickleDataset
  filepath: data/08_reporting/scorecard_monitoring_train_test.pkl


08_reporting_statistics_monitoring_dev_valid:
  type: pickle.PickleDataset
  filepath: data/08_reporting/scorecard_monitoring_dev_valid.pkl

#

08_reporting_model_scorecard_features_psi_detailed_train_test:
  type: pandas.ExcelDataset
  filepath: data/08_reporting/table_model_scorecard_features_psi_detailed_train_test.xlsx
  save_args:
    sheet_name: Sheet1 
  metadata:
    kedro-viz:
      preview_args:
          nrows: 1000

08_reporting_model_scorecard_features_psi_summary_train_test:
  type: pandas.ExcelDataset
  filepath: data/08_reporting/table_model_scorecard_features_psi_summary_train_test.xlsx
  save_args:
    sheet_name: Sheet1 
  metadata:
    kedro-viz:
      preview_args:
          nrows: 1000


08_reporting_model_scorecard_features_psi_detailed_dev_valid:
  type: pandas.ExcelDataset
  filepath: data/08_reporting/table_model_scorecard_features_psi_detailed_dev_valid.xlsx
  save_args:
    sheet_name: Sheet1 
  metadata:
    kedro-viz:
      preview_args:
          nrows: 1000

08_reporting_model_scorecard_features_psi_summary_dev_valid:
  type: pandas.ExcelDataset
  filepath: data/08_reporting/table_model_scorecard_features_psi_summary_dev_valid.xlsx
  save_args:
    sheet_name: Sheet1 
  metadata:
    kedro-viz:
      preview_args:
          nrows: 1000

08_reporting_model_scorecard_psi_summary_train_test:
  type: pandas.ExcelDataset
  filepath: data/08_reporting/table_model_scorecard_psi_summary_train_test.xlsx
  save_args:
    sheet_name: Sheet1 
  metadata:
    kedro-viz:
      preview_args:
          nrows: 1000

08_reporting_model_scorecard_psi_summary_dev_valid:
  type: pandas.ExcelDataset
  filepath: data/08_reporting/table_model_scorecard_psi_summary_dev_valid.xlsx
  save_args:
    sheet_name: Sheet1 
  metadata:
    kedro-viz:
      preview_args:
          nrows: 1000

08_reporting_model_scorecard_psi_summary_train_test_plot:
  type: matplotlib.MatplotlibWriter
  filepath: data/08_reporting/plot_model_scorecard_psi_summary_train_test.jpeg

08_reporting_model_scorecard_psi_summary_dev_valid_plot:
  type: matplotlib.MatplotlibWriter
  filepath: data/08_reporting/plot_model_scorecard_psi_summary_dev_valid.jpeg

08_reporting_model_scorecard_statistical_tests:
  type: pandas.ExcelDataset
  filepath: data/08_reporting/table_model_scorecard_statistical_tests.xlsx
  save_args:
    sheet_name: Sheet1 
  metadata:
    kedro-viz:
      preview_args:
          nrows: 1000

08_reporting_model_scorecard_roc_auc_train_test_plot:
  type: matplotlib.MatplotlibWriter
  filepath: data/08_reporting/plot_model_scorecard_roc_auc_train_test.jpeg

08_reporting_model_scorecard_ks_plot:
  type: matplotlib.MatplotlibWriter
  filepath: data/08_reporting/plot_model_scorecard_ks_train_test.jpeg


08_reporting_model_scorecard_coefs_plot:
  type: matplotlib.MatplotlibWriter
  filepath: data/08_reporting/plot_model_scorecard_coefs.jpeg

We will also set the names of the fields from the task file, define global variables and model hyperparameters, and specify manual binning for the features that need it.

In /docker_disk/risk-model/conf/base/parameters.yml:

cols:
  col_raw_id: 'Unnamed: 0'
  col_final_id: 'Id'
  col_final_prediction: 'Prediction'
  cols_variables: ['RevolvingUtilizationOfUnsecuredLines',
                    'age',
                    'NumberOfTime30-59DaysPastDueNotWorse',
                    'DebtRatio',
                    'MonthlyIncome',
                    'NumberOfOpenCreditLinesAndLoans',
                    'NumberOfTimes90DaysLate',
                    'NumberRealEstateLoansOrLines',
                    'NumberOfTime60-89DaysPastDueNotWorse',
                    'NumberOfDependents',
                    'Debt',
                    'NumberOfTime30+DaysPastDueNotWorse',
                    'avg_debt_per_credit',
                    'feat_killer']

  cols_categorial: ['feat_killer']
  cols_to_manual_binning_object: ['feat_killer']
  col_target: 'SeriousDlqin2yrs'


global_parameters:
  seed: 1234
  test_ratio: 0.33

binning_parameters:
  metric: 'woe'
  metric_missing: 'empirical'

feature_selection_parameters:
  selection_feature_metric: 'roc_auc'
  selection_forward: True

manual_binning:
  splits: {'feat_killer':
  [
    [
          [ 'flag_bad_utilization=0&flag_bad_dlq=0&flag_bad_noopencreds=0&' ],

          [ 'flag_bad_utilization=0&flag_bad_dlq=0&flag_bad_noopencreds=1&',
            'flag_bad_utilization=1&flag_bad_dlq=0&flag_bad_noopencreds=0&',
            'flag_bad_utilization=1&flag_bad_dlq=0&flag_bad_noopencreds=1&' ],

          [ 'flag_bad_utilization=0&flag_bad_dlq=1&flag_bad_noopencreds=0&' ],

          [ 'flag_bad_utilization=0&flag_bad_dlq=1&flag_bad_noopencreds=1&',
            'flag_bad_utilization=1&flag_bad_dlq=1&flag_bad_noopencreds=0&',
            'flag_bad_utilization=1&flag_bad_dlq=1&flag_bad_noopencreds=1&' ]
    ],
          [True,  True, True, True]
  ]
          }

monitoring_parameters:
  psi_n_bins: 10
  psi_method: 'cart'
  inplace_y_actual: 0
  none_type: 'None'

Create the Pipelines

Next, we code our model in three stages.

In /docker_disk/risk-model/src/risk_model/pipelines:

Create 3 pipelines in 3 folders:

  • 01_preprocessing

  • 02_modeling

  • 03_reporting

In the directories, create files:

  • __init__.py

  • nodes.py

  • pipeline.py

Pipeline 1: Preprocessing

At the preprocessing stage, we will create additional features to use in modeling. For feature preprocessing, we will apply WOE (weight of evidence) binning with the OptBinning library.
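
For readers new to OptBinning, here is a standalone sketch of WOE binning on a single toy variable; the pipeline below applies the same idea to all features at once through BinningProcess:

# Standalone illustration of WOE binning with OptBinning on toy data.
import numpy as np
from optbinning import OptimalBinning

rng = np.random.default_rng(0)
x = rng.normal(40, 12, 5000)                                          # e.g. an age-like variable
y = (rng.random(5000) < 1 / (1 + np.exp((x - 30) / 5))).astype(int)   # toy default flag

optb = OptimalBinning(name="age", dtype="numerical")
optb.fit(x, y)
print(optb.binning_table.build())                  # bins, WoE and IV per bin
x_woe = optb.transform(x, metric="woe")            # WOE-encoded feature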

In ~/pipelines/01_preprocessing/__init__.py:

from .pipeline import create_pipeline

In ~/pipelines/01_preprocessing/nodes.py:

import numpy as np
import pandas as pd

from typing import Dict, Tuple
from optbinning import BinningProcess
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings('ignore')

def get_index(df:pd.DataFrame, col_to_index:str, index_name:str)->pd.DataFrame:
    """
    rebuild indexes
    """
    df_indexed = df.rename(columns={col_to_index:index_name}).set_index(index_name)
    return df_indexed



def _get_killer_feature(df:pd.DataFrame)->pd.Series:
    """
    combine the three bad-client flags into a single categorical feature,
    e.g. 'flag_bad_utilization=1&flag_bad_dlq=0&flag_bad_noopencreds=0&'
    """
    df_ = df.copy(deep=True)
    for x in ['flag_bad_utilization', 'flag_bad_dlq', 'flag_bad_noopencreds']:
        df_[x+'_modified'] = x +'='+ df_[x].astype(str) + '&'
    df_['feat_killer'] = df_[[x for x in df_ if '_modified' in x]].sum(axis=1)
    return df_['feat_killer']


def get_extra_features(df:pd.DataFrame)->pd.DataFrame:
    """
    feature engineering
        Debt: DebtRatio * MonthlyIncome
        NumberOfTime30+DaysPastDueNotWorse: sum('NumberOfTime30-59DaysPastDueNotWorse', 'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfTimes90DaysLate')
        avg_debt_per_credit: Debt / NumberOfOpenCreditLinesAndLoans where NumberOfOpenCreditLinesAndLoans > 0 else NULL
        feat_killer: 3 flags of bad clients (flag_bad_utilization, flag_bad_dlq, flag_bad_noopencreds)

    """
    df['Debt'] =df['DebtRatio']*df['MonthlyIncome'].fillna(1)
    df['NumberOfTime30+DaysPastDueNotWorse'] = df[['NumberOfTime30-59DaysPastDueNotWorse','NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfTimes90DaysLate']].sum(axis=1)
    df['avg_debt_per_credit'] =  np.where( df['NumberOfOpenCreditLinesAndLoans']>0, df['Debt']/df['NumberOfOpenCreditLinesAndLoans'], np.nan)

    df['flag_bad_utilization'] = np.where(df['RevolvingUtilizationOfUnsecuredLines']>1, 1, 0)
    df['flag_bad_dlq'] = np.where(df['NumberOfTime30+DaysPastDueNotWorse']>1, 1, 0)
    df['flag_bad_noopencreds'] = np.where(df['NumberOfOpenCreditLinesAndLoans']== 0, 1, 0)
    df['feat_killer'] = _get_killer_feature(df)
    return df


def get_train_test(df: pd.DataFrame, target, test_size:float, random_state:int)->Tuple:
    """
    divide development sample on train/test
    """
    X_train, X_test, y_train, y_test = train_test_split(
    df[[x for x in df.columns if x!=target ]], df[target], test_size=test_size, random_state=random_state)
    return X_train, X_test, y_train, y_test


def get_binning(df_X_train:pd.DataFrame,
            series_y_train:pd.Series,
            variable_names:list,
            categorical_variables:list,
            manual_cols_to_bin_splits:dict,
            ):
    """
    binning with manual splits corrections
    """

    binning_fit_params={}
    if len(manual_cols_to_bin_splits)>0:
        for feat in manual_cols_to_bin_splits.keys():
            binning_fit_params[feat] = {
                'user_splits': np.array(manual_cols_to_bin_splits[feat][0], dtype=object),
                'user_splits_fixed': manual_cols_to_bin_splits[feat][1]
            }
    else:
        pass

    bp = BinningProcess(variable_names=variable_names,
                        categorical_variables=categorical_variables,
                        binning_fit_params=binning_fit_params)

    bp.fit(df_X_train[variable_names], series_y_train)

    return bp

def get_woe_binned_features(df:pd.DataFrame, bp, metric:str, metric_missing:str ):
    """
    transformation to binned
    """
    df_woe = bp.transform(df[bp.variable_names], metric=metric, metric_missing=metric_missing )
    return df_woe

In ~/pipelines/01_preprocessing/pipeline.py:

from kedro.pipeline import Pipeline, node, pipeline
from .nodes import *


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func = get_index,
                inputs = dict(df = "01_raw_development",
                              col_to_index = "params:cols.col_raw_id",
                              index_name = "params:cols.col_final_id"),
                outputs="02_intermediate_development",
                name="get_index_development",
            ),

            node(
                func = get_index,
                inputs = dict(df = "01_raw_validation",
                              col_to_index = "params:cols.col_raw_id",
                              index_name = "params:cols.col_final_id"),
                outputs="02_intermediate_validation",
                name="get_index_validation",
            ),

            node(
                func = get_extra_features,
                inputs = "02_intermediate_development",
                outputs="04_feature_development",
                name="get_extra_features_development",
            ),

            node(
                func = get_extra_features,
                inputs = "02_intermediate_validation",
                outputs="04_feature_validation",
                name="get_extra_features_validation",

            ),

            node(
                func = get_train_test,
                inputs = dict(df = "04_feature_development",
                              target="params:cols.col_target",
                              test_size="params:global_parameters.test_ratio",
                              random_state="params:global_parameters.seed"),
                outputs=["05_model_input_X_train",
                         "05_model_input_X_test",
                         "05_model_input_y_train",
                         "05_model_input_y_test"],
                name="get_train_test",
            ),

            node(
                func = get_binning,
                inputs = dict(df_X_train = "05_model_input_X_train",
                              series_y_train="05_model_input_y_train",
                              variable_names="params:cols.cols_variables",
                              categorical_variables="params:cols.cols_categorial",
                              manual_cols_to_bin_splits="params:manual_binning.splits"),
                outputs="06_models_binning",
                name="get_binning",
            ),

            node(
                func = get_woe_binned_features,
                inputs = dict(df = "05_model_input_X_train",
                              bp="06_models_binning",
                              metric="params:binning_parameters.metric",
                          metric_missing="params:binning_parameters.metric_missing",
                             ),
                outputs="05_model_input_X_train_woe",
                name="get_woe_binned_features_X_train",
            ),

            node(
                func = get_woe_binned_features,
                inputs = dict(df = "05_model_input_X_test",
                              bp="06_models_binning",
                              metric="params:binning_parameters.metric",
                              metric_missing="params:binning_parameters.metric_missing",
                             ),
                outputs="05_model_input_X_test_woe",
                name="get_woe_binned_features_X_test",
             ),

            node(
                func = get_woe_binned_features,
                inputs = dict(df = "04_feature_development",
                              bp="06_models_binning",
                              metric="params:binning_parameters.metric",
                          metric_missing="params:binning_parameters.metric_missing",
                             ),
                outputs="05_model_input_development_woe",
                name="get_woe_binned_features_development",
             ),

            node(
                func = get_woe_binned_features,
                inputs = dict(df = "04_feature_validation",
                              bp="06_models_binning",
                              metric="params:binning_parameters.metric",
                          metric_missing="params:binning_parameters.metric_missing",
                             ),
                outputs="05_model_input_validation_woe",
                name="get_woe_binned_features_validation",
            ),           
            ]
    )

Pipeline 2: Modeling

At the second stage, we will select binned variables using the forward feature selection method from the mlxtend library (https://rasbt.github.io/mlxtend/), optimizing the ROC AUC metric.

After the selection, we will manually adjust the binning (the features and group boundaries are specified in /docker_disk/risk-model/conf/base/parameters.yml). Then we will train the model itself on the final dataset.

In ~/pipelines/02_modeling/__init__.py:

from .pipeline import create_pipeline

In ~/pipelines/02_modeling/nodes.py:

import numpy as np
import pandas as pd

from optbinning import BinningProcess
from optbinning import Scorecard
from typing import Dict, Tuple
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
from mlxtend.evaluate import PredefinedHoldoutSplit
from sklearn.linear_model import LogisticRegression


import warnings
warnings.filterwarnings('ignore')


def feature_selection_logreg_predefined_test( df_dev_woe:pd.DataFrame,
                        df_dev:pd.DataFrame,
                        target:str,
                        X_test:pd.DataFrame,
                        features_in: list,
                        seed:int,
                        selection_forward:bool,
                        selection_feature_metric:str,
                        n_jobs=-1
                        ):
    """
    feature selection with predefined test sample
    """
    test_index = PredefinedHoldoutSplit(X_test.index.to_list())
    y_dev = df_dev[target]
    np.bool = np.bool_ # old numpy
    lr = LogisticRegression(random_state=seed)
    sfs = SFS( estimator=lr,
                k_features=(1, df_dev_woe[features_in].shape[1]),
                forward=selection_forward,
                floating=False,
                scoring=selection_feature_metric,
                cv=test_index)
    sfs.fit(df_dev_woe[features_in], y_dev)
    return sfs


def get_binning_short_list(df_X_train:pd.DataFrame,
            series_y_train:pd.Series,
            sfs,
            categorical_variables:list,
            manual_cols_to_bin_splits:dict,
            ):
    """
    binning with manual splits corrections
    """

    variable_names = list(sfs.k_feature_names_)
    categorical_variables_in = [x for x in categorical_variables if x in variable_names]
    binning_fit_params={}
    if len(manual_cols_to_bin_splits)>0:
        for feat in manual_cols_to_bin_splits.keys():
            binning_fit_params[feat] = {
                'user_splits': np.array(manual_cols_to_bin_splits[feat][0], dtype=object),
                'user_splits_fixed': manual_cols_to_bin_splits[feat][1]
            }
    else:
        pass

    bp_short = BinningProcess(variable_names=variable_names,
                        categorical_variables=categorical_variables_in,
                        binning_fit_params=binning_fit_params)

    bp_short.fit(df_X_train[bp_short.variable_names], series_y_train)

    return bp_short   



def get_woe_binned_features_short_list(df:pd.DataFrame, bp_short, metric:str, metric_missing:str ):
    """
    transformation to binned
    """
    df_woe = bp_short.transform(df[bp_short.variable_names], metric=metric, metric_missing=metric_missing )
    return df_woe



def model_scorecard_logreg_sfs(X_train:pd.DataFrame,
                     y_train:pd.Series,
                     bp_short,
                     seed:int):
    """
    modeling
    """                    
    lr = LogisticRegression(random_state=seed)

    scorecard = Scorecard(binning_process=bp_short,
                      estimator=lr,
                      scaling_method=None,
                      )

    scorecard.fit(X_train[list(bp_short.variable_names)], y_train)

    return lr, scorecard

In ~/pipelines/02_modeling/pipeline.py:

from kedro.pipeline import Pipeline, node, pipeline
from .nodes import *

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func = feature_selection_logreg_predefined_test,
                inputs = dict(  df_dev_woe = "05_model_input_development_woe",
                                df_dev = "04_feature_development",
                                target = "params:cols.col_target", 
                                X_test ="05_model_input_X_test_woe",
                                features_in = "params:cols.cols_variables" ,
                                seed="params:global_parameters.seed",
                                selection_forward="params:feature_selection_parameters.selection_forward",
                                selection_feature_metric ="params:feature_selection_parameters.selection_feature_metric"),
                outputs = "06_models_feature_selection",
                name = "feature_selection_logreg_predefined_test"
            ),

            node(
                func = get_binning_short_list,
                inputs = dict(df_X_train = "05_model_input_X_train",
                              series_y_train = "05_model_input_y_train",
                              sfs = "06_models_feature_selection",
                              categorical_variables = "params:cols.cols_categorial",
                              manual_cols_to_bin_splits = "params:manual_binning.splits"),
                outputs = "06_models_binning_short_list",
                name = "get_binning_short_list"
            ),

            node(
                func = get_woe_binned_features_short_list,
                inputs = dict(df = "05_model_input_X_train",
                              bp_short="06_models_binning_short_list",
                              metric="params:binning_parameters.metric",
                              metric_missing="params:binning_parameters.metric_missing",
                             ),
                outputs="05_model_input_X_train_woe_short_list",
                name="get_woe_binned_features_X_train_short_list",
            ),

            node(
                func = get_woe_binned_features_short_list,
                inputs = dict(df = "05_model_input_X_test",
                              bp_short="06_models_binning_short_list",
                              metric="params:binning_parameters.metric",
                              metric_missing="params:binning_parameters.metric_missing",
                             ),
                outputs="05_model_input_X_test_woe_short_list",
                name="get_woe_binned_features_X_test_short_list",
            ),

            node(
                func = get_woe_binned_features_short_list,
                inputs = dict(df = "04_feature_development",
                              bp_short="06_models_binning_short_list",
                              metric="params:binning_parameters.metric",
                              metric_missing="params:binning_parameters.metric_missing",
                             ),
                outputs="05_model_input_development_woe_short_list",
                name="get_woe_binned_features_development_short_list",
            ),

            node(
                func = get_woe_binned_features_short_list,
                inputs = dict(df = "04_feature_validation",
                              bp_short="06_models_binning_short_list",
                              metric="params:binning_parameters.metric",
                              metric_missing="params:binning_parameters.metric_missing",
                             ),
                outputs="05_model_input_validation_woe_short_list",
                name="get_woe_binned_features_validation_short_list",
            ),


            node(
                func = model_scorecard_logreg_sfs,
                inputs = dict(  X_train = "05_model_input_X_train",
                                y_train = "05_model_input_y_train",
                                bp_short = "06_models_binning_short_list",
                                seed = "params:global_parameters.seed"),
                outputs = ["06_models_lr", "06_models_scorecard"],
                name = "model_scorecard_logreg_sfs"
            ),             
        ]
    )

Pipeline 3: Reporting

At the last stage, we will use the final model to score the following samples:

  • Train, Test

  • Development (Train + Test)

  • Validation Sample for Kaggle.com

We will also conduct all the necessary validation tests and generate tables and graphs for the model report.

In ~/pipelines/03_reporting/__init__.py:

from .pipeline import create_pipeline

In ~/pipelines/03_reporting/nodes.py:

import numpy as np
import pandas as pd
from typing import Dict, Tuple
import matplotlib.pyplot as plt
from optbinning import Scorecard
from optbinning.scorecard import ScorecardMonitoring
from optbinning.scorecard import plot_auc_roc, plot_ks
from optbinning import BinningProcess
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
import seaborn as sns
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.ticker as mtick
import plotly.graph_objects as go



def get_scored_sample(df:pd.DataFrame, X_train, X_test, scorecard, metric:str, metric_missing:str):
    """
    scoring function (binning included)
    """
    df['type_sample'] = np.NaN
    df.loc[df.index.intersection(X_train.index), 'type_sample'] = 'train'
    df.loc[df.index.intersection(X_test.index), 'type_sample'] = 'test'

    df[[x + '_woe' for x in scorecard.binning_process_.variable_names]] =  scorecard.binning_process_.transform(df,  metric = metric,
                                                                                                                    metric_missing = metric_missing
                                                                                                                    )
    df['prediction'] = scorecard.predict_proba(df)[:,1]

    return df

def get_result_to_kaggle(df:pd.DataFrame, scorecard):
    """
    result: upload to Kaggle
    """
    df['Probability'] = scorecard.predict_proba(df)[:,1]

    return df.reset_index()[['Id', 'Probability']]



def get_table_binning(bp):
    """get binning long_list
    """
    t=[]
    for feature in list(bp.variable_names):
        t.append(bp.get_binned_variable(feature).binning_table.build(show_digits=4))
    table=pd.concat(t , keys=list(bp.variable_names))

    return table


def get_table_features_quality(bp):
    """get features quality (IV, Gini)
    """
    table_info_from_binning = bp.summary()
    return table_info_from_binning


def get_table_features_quality_short_list(bp_short, df_dev_scored, target):
    """get features quality (IV, Gini)
    """
    table_info_from_binning = bp_short.summary()

    table_gini_train_test = pd.DataFrame([])
    for type_sample in ['train', 'test']:
        table_info_gini = {}

        for feature_woe in [x for x in df_dev_scored.columns if 'woe' in x]:
            table_info_gini[feature_woe] = (2*roc_auc_score(df_dev_scored[df_dev_scored.type_sample == type_sample][target],
                                                            df_dev_scored[df_dev_scored.type_sample == type_sample][feature_woe]) - 1) * (-1)

        table_gini_train_test = pd.concat([table_gini_train_test, pd.DataFrame.from_dict(table_info_gini, orient='index', columns=['gini_{}'.format(type_sample)])], axis=1)
    table_gini_train_test = table_gini_train_test.reset_index().rename(columns = {'index':'name'})
    table_gini_train_test['name'] = table_gini_train_test['name'].str.replace('_woe','', regex=True)

    table = table_info_from_binning.merge(table_gini_train_test, how='left', on = 'name')
    return table

def get_correlation_table_short_list(X_train_woe):
    """get corr table
    """
    table_corr = X_train_woe.corr().round(2)
    fig = plt.figure(figsize=(15, 10))
    fig = sns.heatmap(table_corr, annot=True).get_figure()

    return table_corr, fig


def get_table_model_scorecard(model):
    """
    get model scorecard info
    """
    table = model.table(style='detailed')
    return table

def get_table_model_scorecard_feature_coefs(model):
    """
    get model scorecard info
    """
    weights = {}
    weights['const'] = np.round(model.estimator_.intercept_[0],4)
    weights['features'] = dict(zip(model.estimator_.feature_names_in_, np.round(model.estimator_.coef_[0],4)))
    return weights   


def get_plot_feature_selection(sfs):
    """
    plot quality depending on the features
    """
    plot = plot_sfs(sfs.get_metric_dict(),figsize=(10, 8))[0]
    return plot


class ScorecardMonitoringMod(ScorecardMonitoring):
    def psi_plot(self, savefig=None):
        """Plot Population Stability Index (PSI).

        Parameters
        ----------
        return plot
        """
        self._check_is_fitted()

        fig, ax1 = plt.subplots()

        n_bins = len(self._n_records_a)
        indices = np.arange(n_bins)
        width = np.min(np.diff(indices))/3

        p_records_a = self._n_records_a / self._n_records_a.sum() * 100.0
        p_records_e = self._n_records_e / self._n_records_e.sum() * 100.0

        p1 = ax1.bar(indices-width, p_records_a, width, color='tab:red',
                     label="Records Actual", alpha=0.75)
        p2 = ax1.bar(indices, p_records_e, width, color='tab:blue',
                     label="Records Expected", alpha=0.75)

        handles = [p1[0], p2[0]]
        labels = ['Actual', 'Expected']

        ax1.set_xlabel("Bin ID", fontsize=12)
        ax1.set_ylabel("Population distribution", fontsize=13)
        ax1.yaxis.set_major_formatter(mtick.PercentFormatter())

        ax2 = ax1.twinx()

        if self._target_dtype == "binary":
            metric_label = "Event rate"
        elif self._target_dtype == "continuous":
            metric_label = "Mean"

        ax2.plot(indices, self._metric_a, linestyle="solid", marker="o",
                 color='tab:red')
        ax2.plot(indices, self._metric_e,  linestyle="solid", marker="o",
                 color='tab:blue')

        ax2.set_ylabel(metric_label, fontsize=13)
        ax2.xaxis.set_major_locator(mtick.MultipleLocator(1))

        ax2.set_xlim(-width * 2, n_bins - width * 2)

        plt.legend(handles, labels, loc="upper center",
                   bbox_to_anchor=(0.5, -0.2), ncol=2, fontsize=12)

        plt.tight_layout()

        if savefig is None:
            #plt.show()
            pass
        else:
            if not isinstance(savefig, str):
                raise TypeError("savefig must be a string path; got {}."
                                .format(savefig))
            plt.savefig(savefig)
            plt.close()
        return fig

def get_statistics_monitoring(scorecard, psi_method:str, psi_n_bins:int, target,
                                X_actual, y_actual, X_expected, y_expected, inplace_y_actual):
    """
    file with scorecard metrics and statics (Gini, KS, PSI, etc.)
    inplace_y_actual: value
    """
    scorecard_monitoring = ScorecardMonitoringMod(scorecard=scorecard, psi_method=psi_method,
                                 psi_n_bins = psi_n_bins, verbose=False)

    if  (type(y_actual) == str) and (target != None):
        y_actual = X_actual[target].fillna(inplace_y_actual)
    else:
        pass
    if  (type(y_expected) == str) and (target != None):
        y_expected = X_expected[target]
    else:
        pass   
    scorecard_monitoring.fit(X_actual, y_actual, X_expected, y_expected)                                
    return scorecard_monitoring

def get_tables_psi_features_detailed(scorecard_monitoring):
    """
    get table detailed
    """
    return scorecard_monitoring.psi_variable_table(style="detailed")

def get_tables_psi_features_summary(scorecard_monitoring):
    """
    get table summary
    """
    return scorecard_monitoring.psi_variable_table()

def get_table_psi_scorecard(scorecard_monitoring):
    """
    get table psi scorecard
    """
    return scorecard_monitoring.psi_table()

def get_plot_psi_scorecard(scorecard_monitoring):
    """
    get plot psi_scorecard
    """
    return scorecard_monitoring.psi_plot()

def get_table_statistical_tests(scorecard_monitoring):
    """ Null hypothesis: actual == expected
        Chi-square test - binary target
    """
    return scorecard_monitoring.tests_table()

def _calc_plot_auc_roc_mod(y, y_pred, title=None, xlabel=None, ylabel=None,
                    fname=None, **kwargs):
        """Plot Area Under the Receiver Operating Characteristic Curve (AUC ROC).

        """

        fpr, tpr, _ = roc_curve(y, y_pred)
        auc_roc = roc_auc_score(y, y_pred)

        # Define the plot settings
        if title is None:
            title = "ROC curve"
        if xlabel is None:
            xlabel = "False Positive Rate"
        if ylabel is None:
            ylabel = "True Positive Rate"

        plt.plot(fpr, fpr, linestyle="--", color="k", label="Random Model")
        plt.plot(fpr, tpr,  label="Model (AUC: {:.5f})".format(auc_roc))
        plt.title(title, fontdict={"fontsize": 14})
        plt.xlabel(xlabel, fontdict={"fontsize": 12})
        plt.ylabel(ylabel, fontdict={"fontsize": 12})
        plt.legend(loc='lower right')


        return plt.gcf()

def get_plot_roc_auc_train_test(scorecard, X_train, y_train, X_test, y_test):
    """ plot train test roc_auc plots
    """
    train_roc_auc = round(roc_auc_score(y_train, pd.Series(scorecard.predict_proba(X_train)[:, 1])),2)
    test_roc_auc = round(roc_auc_score(y_test, pd.Series(scorecard.predict_proba(X_test)[:, 1])),2)
    g1 = _calc_plot_auc_roc_mod(y_train, pd.Series(scorecard.predict_proba(X_train)[:, 1]))
    g1 = _calc_plot_auc_roc_mod(y_test, scorecard.predict_proba(X_test)[:, 1], title='ROC Train {} Test {}'.format(train_roc_auc, test_roc_auc))
    return g1


def get_plot_ks_train_test(scorecard, X_train, y_train, X_test, y_test ):
    plt.figure(figsize=(6.4, 4.8))
    plt.subplot(121)
    plot_ks(y_train, scorecard.predict_proba(X_train)[:, 1], title='KS Train')
    plt.subplot(122)
    plot_ks(y_test, scorecard.predict_proba(X_test)[:, 1], title='KS Test')
    #plt.suptitle('KS')
    g1 = plt.gcf()
    return g1


def get_plot_coefs_scorecard_model (scorecard_monitoring):
    """
    get plot of coefs scorecard
    """
    #plt.figure(figsize=(3.2, 2.4))
    reg_coef = pd.DataFrame((zip(scorecard_monitoring.scorecard.estimator_.feature_names_in_, scorecard_monitoring.scorecard.estimator_.coef_[0])), columns=['name', 'reg_coef'])
    g1 = sns.barplot(round(reg_coef,2), x="reg_coef", y="name", errorbar=None)
    g1.bar_label(g1.containers[0], fontsize=8)


    return g1.get_figure()

def get_general_statistics_report(scorecard_monitoring):
    scorecard_monitoring.system_stability_report()

In ~/pipelines/03_reporting/pipeline.py:

from kedro.pipeline import Pipeline, node, pipeline
from .nodes import *

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func = get_scored_sample,
                inputs = dict( df = "04_feature_development",
                               X_train = "05_model_input_X_train",
                               X_test = "05_model_input_X_test",
                               scorecard = "06_models_scorecard",
                               metric="params:binning_parameters.metric",
                               metric_missing="params:binning_parameters.metric_missing",
                ),
                outputs = "07_model_output_development_scored",
                name = "get_scored_sample_development"
            ),

            node(
                func = get_scored_sample,
                inputs = dict( df = "04_feature_validation",
                               X_train = "05_model_input_X_train",
                               X_test = "05_model_input_X_test",
                               scorecard = "06_models_scorecard",
                               metric="params:binning_parameters.metric",
                               metric_missing="params:binning_parameters.metric_missing",
                ),
                outputs = "07_model_output_validation_scored",
                name = "get_scored_sample_validation"
            ),        

            node(
                func = get_result_to_kaggle,
                inputs = dict( df = "04_feature_validation",
                               scorecard = "06_models_scorecard"

                ),
                outputs = "07_model_output_sample_to_kaggle",
                name = "get_result_to_kaggle"
            ), 

            node(
                func = get_table_binning,
                inputs = dict( bp = "06_models_binning"

                ),
                outputs = "08_reporting_variables_binning_table",
                name = "get_table_binning_long_list",
                tags="info_table"
            ), 

            node(
                func = get_table_binning,
                inputs = dict( bp = "06_models_binning_short_list"

                ),
                outputs = "08_reporting_variables_binning_table_short_list",
                name = "get_table_binning_short_list",
                tags="info_table"
            ), 

            node(
                func = get_table_features_quality,
                inputs = dict( bp = "06_models_binning"

                ),
                outputs = "08_reporting_variables_summary_table",
                name = "get_table_features_quality_long_list",
                tags="info_table"
            ), 

            node(
                func = get_table_features_quality_short_list,
                inputs = dict( bp_short = "06_models_binning_short_list",
                               df_dev_scored = "07_model_output_development_scored",
                               target = "params:cols.col_target"

                ),
                outputs = "08_reporting_variables_summary_table_short_list",
                name = "get_table_features_quality_short_list",
                tags="info_table"
            ), 

            node(
                func = get_correlation_table_short_list,
                inputs = dict( X_train_woe = "05_model_input_X_train_woe_short_list",

                ),
                outputs = ["08_reporting_model_feature_correlation", "08_reporting_model_feature_correlation_plot" ],
                name = "get_correlation_table_short_list",
                tags="info_plot",
            ), 

            node(
                func = get_table_model_scorecard,
                inputs = dict( model = "06_models_scorecard"

                ),
                outputs = "08_reporting_model_scorecard_table",
                name = "get_table_model_scorecard",
                tags="info_table"
            ), 
            node(
                func = get_table_model_scorecard_feature_coefs,
                inputs = dict( model = "06_models_scorecard"

                ),
                outputs = "08_reporting_model_scorecard_feature_coefs",
                name = "get_table_model_scorecard_feature_coefs",
                tags="info_table"
            ), 

            node(
                func = get_plot_feature_selection,
                inputs = dict( sfs = "06_models_feature_selection"

                ),
                outputs = "08_reporting_model_scorecard_feature_selection_plot",
                name = "get_plot_feature_selection",
                tags="info_plot"
            ), 

            node(
                func = get_statistics_monitoring,
                inputs = dict(  scorecard = "06_models_scorecard",
                                psi_method = "params:monitoring_parameters.psi_method",
                                psi_n_bins = "params:monitoring_parameters.psi_n_bins",
                                target = "params:cols.col_target",
                                X_actual = "05_model_input_X_test",
                                y_actual = "05_model_input_y_test",
                                X_expected = "05_model_input_X_train",
                                y_expected = "05_model_input_y_train",
                                inplace_y_actual = "params:monitoring_parameters.inplace_y_actual"

                ),
                outputs = "08_reporting_statistics_monitoring_train_test",
                name = "get_statistics_monitoring_train_test"
            ), 

            node(
                func = get_statistics_monitoring,
                inputs = dict(  scorecard = "06_models_scorecard",
                                psi_method = "params:monitoring_parameters.psi_method",
                                psi_n_bins = "params:monitoring_parameters.psi_n_bins",
                                target = "params:cols.col_target",
                                X_actual = "04_feature_validation",
                                y_actual = "params:monitoring_parameters.none_type",
                                X_expected = "04_feature_development",
                                y_expected = "params:monitoring_parameters.none_type",
                                inplace_y_actual = "params:monitoring_parameters.inplace_y_actual"

                ),
                outputs = "08_reporting_statistics_monitoring_dev_valid",
                name = "get_statistics_monitoring_dev_valid"
            ), 

            node(
                func = get_tables_psi_features_detailed,
                inputs = dict( scorecard_monitoring = "08_reporting_statistics_monitoring_train_test"
                ),
                outputs = "08_reporting_model_scorecard_features_psi_detailed_train_test",
                name = "get_tables_psi_features_detailed_train_test",
                tags="info_table"
            ), 

            node(
                func = get_tables_psi_features_detailed,
                inputs = dict( scorecard_monitoring = "08_reporting_statistics_monitoring_dev_valid"
                ),
                outputs = "08_reporting_model_scorecard_features_psi_detailed_dev_valid",
                name = "get_tables_psi_features_detailed_dev_valid",
                tags="info_table"
            ), 

            node(
                func = get_tables_psi_features_summary,
                inputs = dict( scorecard_monitoring = "08_reporting_statistics_monitoring_train_test"
                ),
                outputs = "08_reporting_model_scorecard_features_psi_summary_train_test",
                name = "get_tables_psi_features_summary_train_test",
                tags="info_table"
            ), 

            node(
                func = get_tables_psi_features_summary,
                inputs = dict( scorecard_monitoring = "08_reporting_statistics_monitoring_dev_valid"
                ),
                outputs = "08_reporting_model_scorecard_features_psi_summary_dev_valid",
                name = "get_tables_psi_features_summary_dev_valid",
                tags="info_table"
            ), 

            node(
                func = get_table_psi_scorecard,
                inputs = dict( scorecard_monitoring = "08_reporting_statistics_monitoring_train_test"
                ),
                outputs = "08_reporting_model_scorecard_psi_summary_train_test",
                name = "get_table_psi_scorecard_train_test",
                tags="info_table"
            ), 

            node(
                func = get_table_psi_scorecard,
                inputs = dict( scorecard_monitoring = "08_reporting_statistics_monitoring_dev_valid"
                ),
                outputs = "08_reporting_model_scorecard_psi_summary_dev_valid",
                name = "get_table_psi_scorecard_dev_valid",
                tags="info_table"
            ), 

            node(
                func = get_plot_psi_scorecard,
                inputs = dict( scorecard_monitoring = "08_reporting_statistics_monitoring_train_test"
                ),
                outputs = "08_reporting_model_scorecard_psi_summary_train_test_plot",
                name = "get_plot_psi_scorecard_train_test",
                tags="info_plot"
            ), 

            node(
                func = get_plot_psi_scorecard,
                inputs = dict( scorecard_monitoring = "08_reporting_statistics_monitoring_dev_valid"
                ),
                outputs = "08_reporting_model_scorecard_psi_summary_dev_valid_plot",
                name = "get_plot_psi_scorecard_dev_valid",
                tags="info_plot"
            ), 

            node(
                func = get_table_statistical_tests,
                inputs = dict( scorecard_monitoring = "08_reporting_statistics_monitoring_train_test"
                ),
                outputs = "08_reporting_model_scorecard_statistical_tests",
                name = "get_table_statistical_tests",
                tags="info_table"
            ), 

            node(
                func = get_plot_roc_auc_train_test,
                inputs = dict(  scorecard = "06_models_scorecard",
                                X_train = "05_model_input_X_train",
                                y_train = "05_model_input_y_train",
                                X_test = "05_model_input_X_test",
                                y_test = "05_model_input_y_test"
                ),
                outputs = "08_reporting_model_scorecard_roc_auc_train_test_plot",
                name = "get_plot_roc_auc_train_test",
                tags="info_plot"
            ),

            node(
                func = get_plot_ks_train_test,
                inputs = dict(  scorecard = "06_models_scorecard",
                                X_train = "05_model_input_X_train",
                                y_train = "05_model_input_y_train",
                                X_test = "05_model_input_X_test",
                                y_test = "05_model_input_y_test"
                ),
                outputs = "08_reporting_model_scorecard_ks_plot",
                name = "get_plot_ks_train_test",
                tags="info_plot"
            ),

            node(
                func = get_plot_coefs_scorecard_model,
                inputs = dict(scorecard_monitoring = "08_reporting_statistics_monitoring_train_test"),
                outputs = "08_reporting_model_scorecard_coefs_plot",
                name = "get_plot_coefs_scorecard_model",
                tags="info_plot"
            ),

            node(
                func = get_general_statistics_report,
                inputs = dict(scorecard_monitoring = "08_reporting_statistics_monitoring_train_test"),
                outputs = None,
                name = "get_general_statistics_report"
            ),

        ]
    )

Run the Pipelines

To run all the pipelines, open a terminal in /docker_disk/risk-model and run the following command:

kedro run

Alternatively, it is possible to run just one of them:

kedro run --pipeline 01_preprocessing

Or even choose the node to start from:

kedro run --pipeline 01_preprocessing --from-nodes get_index_development
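Kedro supports a few other handy filters as well, for example running a pipeline only up to a given node, or running only the nodes carrying a certain tag. The exact flag spelling (--tags vs. --tag) depends on your Kedro version, so check kedro run --help first; the commands below are illustrative sketches:

kedro run --pipeline 01_preprocessing --to-nodes get_index_development

kedro run --tags info_table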

After running all the pipelines, we can inspect the connections between nodes, pipelines, and outputs in one place - Kedro Viz - instead of digging through output folders and files.

Visualization: Kedro Viz

Kedro Viz visualizes the pipelines of a Kedro project, showing the data, the nodes, and the connections between them. To launch it:

Open a terminal in /docker_disk/risk-model and run:

kedro viz run --host=0.0.0.0 --port=4141 --autoreload

Then open the browser and go to http://0.0.0.0:4141/.
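The --host=0.0.0.0 option matters because the project lives inside a Docker container: the Viz server has to listen on all interfaces to be reachable from the host machine, and the container's port also has to be published when it is started. A minimal sketch, assuming a hypothetical image name risk-model-env and a mount of the project directory (adjust both to your actual setup):

docker run -it -p 4141:4141 -v $(pwd):/docker_disk/risk-model risk-model-env

With the port published this way, the same UI is also available from the host browser at http://localhost:4141/.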

Here you can see the overall visualization of the model pipelines:

From this view you can access every piece of code used in the project and inspect its inputs, outputs, parameters, and results.

For easier navigation of the results, it is recommended to use tags in the node() calls in pipeline.py.

Here we use two tags, "info_plot" and "info_table", so we can get fast access to the outputs by filtering on them.
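Tags are plain string labels attached to nodes, and a node can carry more than one: the tags argument accepts either a single string or a list. A minimal sketch based on one of the nodes above (the extra "reporting" tag is added only for illustration):

node(
    func = get_table_model_scorecard,
    inputs = dict(model = "06_models_scorecard"),
    outputs = "08_reporting_model_scorecard_table",
    name = "get_table_model_scorecard",
    tags = ["info_table", "reporting"],
),

In Kedro Viz the same tags appear in the filter panel, so selecting info_plot or info_table quickly narrows the graph to the reporting outputs.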

Result

We upload the file sampleEntry.csv from ~/data/07_model_output to Kaggle and submit our entry.
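As an option, the submission can also be made from the terminal with the official kaggle CLI instead of the web UI. A sketch, assuming the Kaggle API token is already configured, with the competition slug left as a placeholder:

kaggle competitions submit -c <competition-name> -f ~/data/07_model_output/sampleEntry.csv -m "scorecard model submission"

After submitting, the score can be checked on the competition leaderboard.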

The directory ~/data/08_reporting collects all the files that can be used in the model documentation:

This template provides the basic skeleton for any risk modeling exercise: data preprocessing, model training, and reporting stages. Adapt it to your needs, for example by incorporating business logic, more advanced feature engineering, or different types of models.

Conclusion

Risk analytics in Python gives any financial organization powerful tools to understand, manage, and mitigate risk. Using the templates and best practices presented here makes model development more efficient and consistent, and lets teams adapt rapidly to changing market conditions.

As the financial world keeps changing, so will the techniques and technologies we use in risk analytics. Staying up to date with developments in data science, machine learning, and the Python ecosystem will be fundamental for any risk professional who intends to bring additional value to their organization.

Remember, while templates and tools are essential, successful risk analytics combines these technical components with domain expertise, critical thinking, and deep business context. As we continue to push the boundaries of what is possible in risk modeling, let us not forget what we are trying to achieve: making informed decisions that protect and grow our financial institutions.

Written by Alexey Khoroshilov