From Theory to Practice: My Machine Learning Development Journey

Mrinal Sharan
13 min read

Hey everyone, I’m an incoming senior at the University of Oxford. I’ve been involved in the ML field for over a year now, following the latest developments and participating in various competitions. While I had gained a lot of knowledge, I felt my progress had recently come to a standstill, and that was when I decided to take a focused approach towards breaking into the field.

In this series of blogs, I will be documenting my journey as I try to cross the barrier from being a beginner-intermediate to becoming an expert in the field. It will be something like a self-help guide: I’ll try to teach you the concepts I have learned through practical examples and, in a way, reflect on my own learning.

I hope to complete this in 6 weeks, starting from the fundamental core machine learning skills, followed by deep learning, data handling pipelines, cloud deployment and the like. Occasionally, I’ll also share my understanding of some of the latest literature and the latest news happening in the ML world.

Week 1: Core Machine Learning Skills - Supervised Learning

Any machine learning beginner will have come across the terms supervised learning and unsupervised learning when starting out. The distinction is fairly straightforward: in supervised learning you provide labelled data, i.e. every training example has an output which you expect your model to learn from. In unsupervised learning, on the other hand, you have no labels; you feed the input data into the algorithm with the hope that it will find an underlying pattern within it.

I’ll begin by taking a supervised learning problem from Kaggle, in particular the Mercedes-Benz Greener Manufacturing competition (Competition Link). Many of the methods I’ll be applying are based on the book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition.

Problem Description: We need to predict the time it takes for a car to pass testing, given a set of anonymized variables.

A screenshot of the data:

A brief look at the problem statement and the data suggests that this is a regression problem. We are provided with both categorical and numerical values, and our performance measure is the coefficient of determination \(R^2\).

Next, we take an overview of our numerical features by plotting a histogram.
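
Here is a minimal sketch of how this might look, assuming the Kaggle training CSV has been loaded into a DataFrame called train_data (the file path is an assumption, and the first two numerical columns are taken to be ID and y):

import pandas as pd
import matplotlib.pyplot as plt

train_data = pd.read_csv("train.csv")  # hypothetical path to the Kaggle training file

# Plot a small excerpt of the ~377 numerical features (skipping ID and y)
numeric_cols = train_data.select_dtypes(include="number")
numeric_cols.iloc[:, 2:14].hist(bins=3, figsize=(12, 8))
plt.show()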

The above figures are a small excerpt from the 377 numerical features. All the features seem to be binary features, possibly indicating the presence or absence of a car feature in a Mercedes Benz model.

The presence of binary features is useful for us, as they require no feature scaling. When this is not the case, we need to standardize the data so that our model weighs each feature equally and is not biased by the different scales.
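
For completeness, here is a tiny, hedged illustration of standardization with StandardScaler on toy data (not the competition data, which doesn’t need it):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy example: two features on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)  # each column now has zero mean and unit variance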

Creating a Test Set

In our case, the test set has already been provided by the Kaggle host. For cases when this isn’t true, we can create our test set from the original data by three possible routes:

  1. Setting aside 20% (for a 20:80 split) of the data after randomly shuffling it. This may be done in the following way:
import numpy as np

def create_test_and_train(df, ratio):
    np.random.seed(42)  # fix the seed so the same split is reproduced on every run
    shuffled_indices = np.random.permutation(len(df))
    test_set_size = int(len(df) * ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return df.iloc[test_indices], df.iloc[train_indices]

However, there is a problem with this approach: the split breaks down when the dataset gets updated, i.e. some examples that were previously in the test set will leak into the new training set, leading to an over-optimistic evaluation of the model.

To prevent this problem, we can use one of the following two approaches instead:

  2. Using hashing: we compute a hash of each data point’s unique identifier using a hash function. The advantage is that a data point’s hash value doesn’t change unless the data point itself changes, so the split stays consistent when the dataset is updated.
from zlib import crc32
import numpy as np

def is_id_in_test_set(identifier, ratio):
    return crc32(np.int64(identifier)) < ratio * 2**32

def split_data_with_id_hash(df, ratio, id_column):
    ids = df[id_column]
    in_test_set = ids.apply(lambda id_: is_id_in_test_set(id_, ratio))
    return df.loc[~in_test_set], df.loc[in_test_set]

This approach works by hashing a unique identifier column (in this case, the ID column) and putting the examples whose hash value falls in the lowest 20% of the range into the test set. Note that this works because the hash function (technically a checksum algorithm, crc32 in our case) generates almost uniform values between 0 and 2³² - 1.
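
A short usage sketch of the helper above, assuming the training DataFrame is called train_data and has an ID column (as the Kaggle data does); if your data has no stable identifier, the row index can be promoted to one, as long as new rows are only ever appended at the end:

# Split on the Kaggle ID column
train_set, test_set = split_data_with_id_hash(train_data, 0.2, "ID")

# Or, if there is no identifier column, promote the row index to one
df_with_index = train_data.reset_index()  # adds an "index" column
train_set, test_set = split_data_with_id_hash(df_with_index, 0.2, "index")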

  3. Using built-in functions:

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

Stratified Sampling

Although this is not so useful in our case, since we don’t know the feature importances (and we are already provided with a test set), we may sometimes need to create a test set that is representative of the original dataset. To do that, we can stratify on one of the features (say X100 in our case).

Our feature has the following distribution:

In order to create our test data, we would ideally want a similar distribution of the sample w.r.t X100.

This can be done using a built-in class in the following way:

from sklearn.model_selection import StratifiedShuffleSplit

splitter = StratifiedShuffleSplit(n_splits= 10, test_size=0.2, random_state=42)
strat_splits = []
for train_index, test_index in splitter.split(train_data, train_data['X100']):
    strat_train_set_n = train_data.iloc[train_index]
    strat_test_set_n = train_data.iloc[test_index]
    strat_splits.append([strat_train_set_n, strat_test_set_n])

strat_train_split, strat_test_split = strat_splits[0]
strat_test_split['X100'].value_counts()/ len(strat_test_split)

This will yield a test and training set with similar distribution as the original data.

Note: A shorter way to do the above is:

strat_train_set, strat_test_set = train_test_split(train_data, test_size=0.2, stratify=train_data['X100'], random_state=42)

Exploring & Visualizing the Data

Since the features here are anonymized, there isn’t much feature engineering that can be done. However, we can still do some basic analysis as follows:

We start by computing the Pearson correlation between the various features and the target; this will help us identify which of the features are useful for predicting our target variable.

corr_matrix = train_data.select_dtypes(include = ['number']).corr()
corr_matrix['y'].sort_values(ascending=False)

We notice that some of the columns have a NaN correlation; this happens when a column has zero variance. We can drop these columns, as they carry no useful information.
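
Here is one possible way to find and drop those constant columns (a sketch, reusing train_data from above):

# Identify numerical columns with only a single unique value (zero variance) and drop them
nunique = train_data.select_dtypes(include="number").nunique()
zero_var_cols = nunique[nunique == 1].index
train_data = train_data.drop(columns=zero_var_cols)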

We could also have plotted the scatter_matrix using the following code:

import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

attributes = ['y', 'X314', 'X261', 'X263', 'X136']
scatter_matrix(train_data[attributes], figsize=(12, 8))
plt.show()

Preparing the Data for Machine Learning Algorithms

Now that we have done some basic data analysis, it is time to transform our data so that it can be directly fed into any ML algorithm of our choice. Although the specific steps depend on the data and the choice of algorithm, there are certain steps which are almost always employed.

A note on Scikit-learn design

Before jumping into transformers, here’s a brief overview of the types of objects available in Scikit-Learn:

  • Estimators

  • Transformers

  • Predictors

Estimators: As the name suggests, any object that estimates parameters based on a dataset is an estimator, e.g. SimpleImputer. The estimation is performed by the fit() method, which stores the learned values internally.

Transformers: Estimators like SimpleImputer can also transform a dataset; these are known as transformers. This is done via the transform() method, which takes the dataset as a parameter, and the transformation relies on the learned parameters. All transformers also have a fit_transform() method which performs both operations in a single call.

Predictors: Some estimators are capable of making predictions; e.g. a LinearRegression model is a predictor with a predict() method. Additionally, it has a score() method that measures the quality of its predictions.
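
To make the three roles concrete, here is a small, hedged illustration on toy data (the numbers are made up):

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

imputer = SimpleImputer(strategy="median")   # an estimator
imputer.fit(X)                               # learns the per-column medians
X_filled = imputer.transform(X)              # a transformer: applies what was learned

model = LinearRegression()                   # a predictor
model.fit(X_filled, y)
model.predict(X_filled)                      # makes predictions
model.score(X_filled, y)                     # measures prediction quality (R^2)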

Further notes:

1) All of an estimator’s hyperparameters are accessible via public instance variables (e.g. imputer.strategy); similarly, its learned parameters are accessible via public instance variables with an underscore suffix (e.g. imputer.statistics_).

2) Scikit-Learn transformers output NumPy arrays (or SciPy sparse matrices) by default.

Since our data doesn’t contain any null values, we don’t need the built-in SimpleImputer; but real-life data will often have missing values, and in that case it comes in handy.
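
A quick, hedged sanity check, plus how SimpleImputer could be used if missing values were present:

from sklearn.impute import SimpleImputer

# Confirm there are no missing values in this dataset
assert train_data.isnull().sum().sum() == 0

# If there were any, an imputer could fill them, e.g. with the column median:
imputer = SimpleImputer(strategy="median")
# train_data_num_filled = imputer.fit_transform(train_data.select_dtypes(include="number"))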

Handling Text and Categorical Attributes

Most machine learning algorithms expect numerical data, hence it is necessary to handle the text and categorical attributes and transform them into numerical features. This can be done using inbuilt functions like OrdinalEncoder or OneHotEncoder.

from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
train_data_cat = train_data.select_dtypes(include='object')  # the categorical columns
train_data_1hot = cat_encoder.fit_transform(train_data_cat)

The output of the above code is a sparse matrix.

train_data_1hot.toarray() #code to convert sparse matrix into a Numpy array
cat_encoder.categories_ # for getting the categories for each categorical feature

Note: We could also have used pandas’ get_dummies() function to get a one-hot representation. However, unlike OneHotEncoder, get_dummies doesn’t remember the categories it was trained on, so any new category encountered in production will simply lead to a new column, with no exception raised.
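
A small, hedged illustration of that difference on toy data:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df_train = pd.DataFrame({"X0": ["a", "b"]})
df_new = pd.DataFrame({"X0": ["a", "c"]})   # "c" was never seen during training

pd.get_dummies(df_new)                       # columns X0_a, X0_c: silently differs from training

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(df_train)
encoder.transform(df_new).toarray()          # keeps the learned columns; "c" becomes all zeros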

When an estimator is fitted on a DataFrame, it stores the column names in the feature_names_in_ attribute, and Scikit-Learn then checks that any DataFrame fed to the estimator afterwards has the same columns.

Transformers also provide a get_feature_names_out() method, which you can use to build a DataFrame around the transformer’s output.
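
For example, a hedged sketch that wraps the encoder output from above back into a DataFrame (assuming a scikit-learn version that provides get_feature_names_out()):

import pandas as pd

encoded_df = pd.DataFrame(
    train_data_1hot.toarray(),
    columns=cat_encoder.get_feature_names_out(),
    index=train_data.index,
)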

Feature Scaling and Transformation

As mentioned earlier, features should be appropriately scaled before being fed to a machine learning algorithm. The two most common approaches are min-max scaling and standardization.

Important Note: It is important to fit() any scaler on the training data only, and then use the learned parameters to transform the training, validation, and test data. Never call fit() or fit_transform() on the validation or test sets.

Although it isn’t needed for our dataset, I’ll still apply MinMaxScaler just for bookkeeping purposes (it won’t change our data, as the features are already scaled between 0 and 1).

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
train_data_num_scaled = scaler.fit_transform(train_data_num)  # train_data_num holds the numerical feature columns

Note: If a feature’s distribution is heavily skewed with a long tail, we should aim to make it roughly symmetrical. This can be done with a logarithmic transformation or by bucketizing (not needed for our data).

Now that the features are transformed, we may also need to transform our target variable to make its distribution symmetrical. This can be done using FunctionTransformer in Scikit-Learn (which works just like a normal Python function):

from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(func = np.log, 
                                     inverse_func = np.exp)
log_y = log_transformer.transform(train_data['y'])

The log_transformer can then be coupled with TransformedTargetRegressor, which takes a regressor and the transformer to apply to the labels. It transforms the labels, trains the model on them, and then produces the final predictions by calling inverse_transform() internally. For example (in this simple case, we just predict the target variable from the target itself, purely to show how TransformedTargetRegressor can be used):

# Example code
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor

car_data_labels = train_data['y']  # the target column

model = TransformedTargetRegressor(LinearRegression(),
                                   transformer=log_transformer)

model.fit(train_data[['y']], car_data_labels)
predictions = model.predict(train_data[['y']])

Custom Transformers

To make a custom trainable transformer, we need to write a custom class. Scikit-Learn relies on duck typing, i.e. if the class has the right methods and behaves correctly, then it’s a valid input; so what matters is having the following three methods in the definition: fit(), transform(), and fit_transform().

Here’s how to create a custom transformer which acts like the StandardScaler():

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted

class StandardScalerClone(BaseEstimator, TransformerMixin):
    def __init__(self, with_mean=True):
        self.with_mean = with_mean

    def fit(self, X, y=None):  # y is required even though we don't use it
        X = check_array(X)  # checks that X is an array with finite float values
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        self.n_features_in_ = X.shape[1]  # every estimator stores this in fit()

        return self  # always return self in fit()

    def transform(self, X):
        check_is_fitted(self)  # looks for learned attributes (ending with _)
        X = check_array(X)
        assert self.n_features_in_ == X.shape[1]
        if self.with_mean:
            X = X - self.mean_
        return X / self.scale_

Important points:

  1. The sklearn.utils.validation package contains several functions which can be used to validate inputs.

  2. All Scikit-Learn estimators set n_features_in_ in the fit() method, and they ensure that the data passed to transform() or predict() has this number of features.

  3. You can check whether a custom estimator respects Scikit-Learn’s API by passing an instance of it to check_estimator() from the sklearn.utils.estimator_checks package, for example:
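
A hedged sketch of that check on the custom scaler above; it raises an error if any of the API checks fails:

from sklearn.utils.estimator_checks import check_estimator

check_estimator(StandardScalerClone())  # runs Scikit-Learn's API compliance checks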

Transformation Pipelines

In order to complete all the transformations above, we can create a pipeline using the Pipeline class.

Some examples:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("minmax", MinMaxScaler()),
])

The Pipeline constructor takes a list of name/estimator pairs defining the sequence in which the transformations are carried out. All of the estimators must be transformers except the last one, which can be a transformer, a predictor, or any other type of estimator.

Tip: If you import sklearn and run sklearn.set_config(display="diagram"), all estimators will be rendered as interactive diagrams. This is particularly useful for visualizing pipelines: simply evaluate num_pipeline on its own to see its diagram.

You can also use make_pipeline() if you don’t want to name the transformers. Eg.

from sklearn.pipeline import make_pipeline

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), MinMaxScaler())

# Just use fit_transform() method of the Pipeline class to carry out the transformations.

So far, we have handled the categorical and numerical columns separately. In order to carry both transformations using a single transformer, we can use ColumnTransformer.

For example, the following ColumnTransformer will apply num_pipeline to the numerical attributes and cat_pipeline to the categorical attributes:

from sklearn.compose import ColumnTransformer

num_attributes = list(map(lambda x: 'X' + str(x), range(9, 386)))
cat_attributes = list(map(lambda x: 'X' + str(x), range(0, 10)))

cat_pipeline = make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OneHotEncoder(handle_unknown="ignore"))

preprocessing = ColumnTransformer([
        ("num", num_pipeline, num_attributes),
        ("cat", cat_pipeline, cat_attributes),
])

The ColumnTransformer requires a list of 3-tuples, each containing a name, a transformer and a list of names (or indices) of the columns on which the transformations need to be carried out.

Now, in order to shorten the code further, i.e. to choose the columns without specifying them separately, or when there is no need to name the pipelines, we can use the make_column_selector() and make_column_transformer() functions.

from sklearn.compose import make_column_selector, make_column_transformer

preprocessing = make_column_transformer(
        (num_pipeline, make_column_selector(dtype_include= np.number)),
        (cat_pipeline, make_column_selector(dtype_include= object)),
)

# car_data refers to the training features (i.e. train_data without the target column)
car_data_prepared = preprocessing.fit_transform(car_data)

Select and Train a Model

We can do a very basic linear regression for a start. The following code block will be useful:

from sklearn.linear_model import LinearRegression

lin_reg = make_pipeline(preprocessing, LinearRegression())
lin_reg.fit(car_data, car_data_labels)

The most important things to note are the cost function used to evaluate the results and the choice of model based on its underlying assumptions (more on these in future blogs).
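
Since the competition metric is \(R^2\), and score() on a regressor returns exactly that, a quick (and optimistic, because it is measured on the training set) check could look like this sketch, reusing car_data and car_data_labels from above:

from sklearn.metrics import r2_score

# score() on a regression pipeline returns the coefficient of determination R^2
lin_reg.score(car_data, car_data_labels)

# Equivalently, via the r2_score helper
r2_score(car_data_labels, lin_reg.predict(car_data))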

Cross-validation

Scikit-Learn provides the k-fold cross-validation feature, which splits the training set into k different non-overlapping subsets called folds, then it trains and evaluates the model k times, each time picking a different fold for evaluation.

Example code:

from sklearn.model_selection import cross_val_score

model_rmse = -cross_val_score(model, car_data, car_data_labels,
                              scoring="neg_root_mean_squared_error", cv=10)

Tip: To decide which model to go ahead with, look at both the training error and the cross-validation score; a significant jump from the training error to the cross-validation score suggests the model is overfitting. Shortlist 2-5 promising models with similar scores before doing any hyperparameter tuning.

Fine-Tuning our Model

The first method is grid search, which basically automates the process of trying out all the combinations of hyperparameter values that you find promising.

Eg.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

full_pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("random_forest", RandomForestRegressor(random_state=42)),
])
param_grid = [{}]  # placeholder: a list of dicts mapping hyperparameter names to the values to try

grid_search = GridSearchCV(full_pipeline, param_grid, cv=3,
                           scoring="neg_root_mean_squared_error")

grid_search.fit(car_data, car_data_labels)

# To get the best parameters:
grid_search.best_params_

# For evaluation scores:
grid_search.cv_results_

Tip: Wrapping the preprocessing steps in a Scikit-Learn pipeline allows you to tune the preprocessing hyperparameters along with the model hyperparameters.

The above method works well when we have only a few parameter combinations to search. When the hyperparameter search space is large, we use RandomizedSearchCV instead: rather than trying out all possible combinations, it evaluates a fixed number of combinations, selecting a random value for each hyperparameter at every iteration.

For each hyperparameter, we must provide either a list of possible values or a probability distribution for continuous values.
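
Here is a hedged sketch of how that might look with the pipeline above; the hyperparameter names and ranges are placeholders, not tuned values:

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

# Illustrative distributions only; the step name "random_forest" matches full_pipeline above
param_distribs = {
    "random_forest__max_features": randint(low=2, high=20),
    "random_forest__n_estimators": randint(low=50, high=300),
}

rnd_search = RandomizedSearchCV(
    full_pipeline, param_distributions=param_distribs,
    n_iter=10, cv=3, scoring="neg_root_mean_squared_error", random_state=42,
)
rnd_search.fit(car_data, car_data_labels)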

Scikit-Learn also has HalvingRandomSearchCV and HalvingGridSearchCV hyperparameter search classes. Their whole point is to use computational resources more efficiently, either to train faster or to explore a larger hyperparameter space.

Analyzing the Best Models and Their Errors

Briefly: after finding the best model and fine-tuning it, it is a good idea to look at the relative feature importances, drop the least useful features, clean up outliers, or try new attribute combinations based on the insights gained.
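
A hedged sketch of how the relative feature importances could be pulled out of the best pipeline found by the search (assuming a recent scikit-learn where ColumnTransformer exposes get_feature_names_out()):

# Best pipeline found by the search (grid_search.best_estimator_ works the same way)
final_model = rnd_search.best_estimator_

importances = final_model["random_forest"].feature_importances_
feature_names = final_model["preprocessing"].get_feature_names_out()

# Top 10 most important features
sorted(zip(importances, feature_names), reverse=True)[:10]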

Additionally, before evaluating on the test set, it is important to ensure that the model works well for each major category of the data, depending on the type of data you are analysing. To evaluate that, we can create subsets of the validation set for the major categories and then analyse the results on each.

Once all this is done, our model is ready to be evaluated on the test set, and subsequently ready for launch into production…

…. To be continued

Thank You!

