Titanic Project!!

Kartavay

This article guides absolute beginners on how to submit their first machine learning project on Kaggle. Generally, most people choose the Titanic project as their first one, which gives them a decent understanding of the basics of machine learning.

This article is divided into 3 parts: EDA, DATA CLEANING, and MODEL CHOOSING.

This article provides you with a decent idea of how things work when building simple projects like this.

Gearing up!!

As usual, we begin by importing some basic Python libraries for our project. Libraries act as helpers when working with a lot of data; we will be handling the whole dataset through them.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import plotly.express as px

After importing these basic libraries, it is time to load our dataset into variables. If you are working in a Jupyter or Colab notebook, you have to download the data from the Kaggle website. You will get two files, a training CSV and a test CSV, which we will use throughout this project.

We store train.csv in the variable data and test.csv in test_data; these are the dataframes we will be working with.

data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

Now it is time to explore your data. We will look at the relationships between different variables, which helps us understand the data better. This process of understanding data by plotting graphs between various variables is called exploratory data analysis (EDA).

EDA

Before diving in, you should try these two commands yourself → data.describe() and data.info(). The first gives you a statistical summary of the numerical columns; the second shows the column names, their data types, and how many non-null values each one contains.
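For reference, running them back to back looks like this:

# Statistical summary (count, mean, std, min, quartiles, max) of the numerical columns
data.describe()

# Column names, data types, and non-null counts, useful for spotting missing values
data.info()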

Now let us begin with understanding how different variables relate to each other, starting with the relation between “Age” and “Fare”. First, running the command below gives you the list of all the columns present in the dataset.

data.columns.tolist()

Now, out of those columns, we are trying to figure out the relationship between two variables: “Age” and “Fare”. I am using the plotly library for this. So when you do this (below), you get a beautiful scatter plot showing how fare varies with age.

fig = px.scatter(data, x="Age", y="Fare")
fig.show()

We can see in this graph that the highest fare was paid by someone around 35 years old.

In this way, we can plot various graphs between different variables to see how one impacts another. Some of the examples are shown below.

fig = px.scatter(data, x="Age", color="Survived", color_discrete_sequence=['red', 'green'])
fig.show()

Now, here is the cool one: let us find out how the gender of a person impacts their chances of survival.

fig = px.histogram(data, x="Sex", color="Survived", color_discrete_sequence=['red','green'])
fig.show()

Here, you can see that if you were a man aboard the Titanic, you had a higher chance of dying than a woman; green means survived and red means died here.

You can compare the different ratios of survival for each category mathematically.
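For example, one quick way to get those ratios directly (the Survived column is 0/1, so its mean within each group is the survival rate):

# Survival rate per sex: mean of the 0/1 Survived column within each group
data.groupby("Sex")["Survived"].mean()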

DATA CLEANING

At this point, we know a lot about how different features relate to one another and how one impacts another. When working with data, there is always a chance that it contains values you do not want, and our task is to handle or remove the samples, rows, or columns that would cause problems.

Before that, we have to figure out the different types of columns present in our data.

num_cols = data.select_dtypes(np.number).columns.tolist()
cat_cols = data.select_dtypes("object").columns.tolist()
num_cols = [c for c in num_cols if c not in ('Survived', 'PassengerId')]  # keep the target and the ID out of the features

select_dtypes allows us to select columns of the type mentioned inside the brackets, in this case np.number and "object". The first selects the numerical columns; the second selects the categorical ones, in short those that contain something other than numbers. We also remove “Survived” (the target) and “PassengerId” from the numerical columns: they are not features we want to scale or feed to the model, and the test data has no Survived column at all, so keeping it would break the later steps.

.columns gives us the columns of the selected data as an object of type Index, and tolist() converts that Index object into a plain list, which is easier to work with.
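One thing to note: the code from here onwards works with train_df and val_df, which assumes the training data has already been split into a training part and a validation part. A minimal sketch of such a split using scikit-learn's train_test_split (the variable names are chosen to match the later code; the 80/20 split and the random_state are just reasonable defaults):

from sklearn.model_selection import train_test_split

# Split the Kaggle training data into a training part and a validation part
train_df, val_df = train_test_split(data, test_size=0.2, random_state=42)
train_df, val_df = train_df.copy(), val_df.copy()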

Now, there might be missing values in your dataset. These empty values will distort your predictions; to prevent this, we use imputing. Here, I am using SimpleImputer from the scikit-learn library.
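A quick way to see which columns actually contain missing values (using the training split from above):

# Number of missing values per column in the training data
train_df.isna().sum()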

from sklearn.impute import SimpleImputer

impute = SimpleImputer(strategy='mean').fit(train_df[['Age']])

Since we have figured out that there are empty values in the Age column, we fit the imputer on that column. We use strategy='mean', which means the imputer will fill the empty places in the Age column with the column's mean.

train_df['Age'] = impute.transform(train_df[['Age']])
val_df['Age'] = impute.transform(val_df[['Age']])

test_data['Age'] =  impute.transform(test_data[['Age']])

Here, we are filling the empty values of the age column of the train, validation, and test data and then updating the Age column of the respective data.

The model's predictions can be thrown off if the value ranges of different features differ a lot. To prevent this, we use scaling.

I am using MinMaxScaler from scikit-learn. It rescales each column to the 0-1 range using the column's own minimum and maximum: (x - min) / (max - min). Since only numerical values can be scaled, we are dealing with the numerical columns only.

from sklearn.preprocessing import MinMaxScaler

scaleme = MinMaxScaler().fit(train_df[num_cols])

train_df[num_cols] =  scaleme.transform(train_df[num_cols])
val_df[num_cols] =  scaleme.transform(val_df[num_cols])
test_data[num_cols] = scaleme.transform(test_data[num_cols])

After fitting the scaler on the numerical columns of the training data, we transform the numerical columns of each part of our data (train, validation, and test) and write the scaled values back into the respective dataframes.
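If you want to confirm that the scaling worked, a quick check is to look at the minimum and maximum of the scaled training columns, which should now be 0 and 1:

# After scaling, every numerical column of the training data lies in the 0-1 range
train_df[num_cols].describe().loc[["min", "max"]]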

We have dealt with numerical columns, but there is a possibility that there are empty values in categorical columns. Now, it is time to deal with that.

train_df[cat_cols]  = train_df[cat_cols].fillna("UNKNOWN")
val_df[cat_cols]  = val_df[cat_cols].fillna("UNKNOWN")

test_data[cat_cols]  = test_data[cat_cols].fillna("UNKNOWN")

Here, we are selecting the categorical columns of our respective dataframes and then filling the empty values with the “UNKNOWN” string.

But a machine learning model takes only numerical data, and our categorical columns don’t have numerical data, so to work with them, we have to convert our categorical columns into numerical columns.

This process of converting categorical data into numerical data is called ENCODING.

I love mapped encoding, but we are using OneHotEncoder here.

from sklearn.preprocessing import OneHotEncoder

encode = OneHotEncoder(handle_unknown='ignore').fit(train_df[cat_cols])

We fit the OneHotEncoder on the categorical columns, since those are the ones we want to convert into numerical data.

We use handle_unknown='ignore' to account for the possibility of encountering a value the encoder has never seen; if that happens, the encoder simply ignores it.

Our encoder turns each distinct value into its own column, forming a new set of features; 1 marks the presence and 0 the absence of that value for a particular row. For instance, if the passenger in row 121 is male, the encoder puts a 1 in the Sex_male column for that row.
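As a tiny illustration of the idea (a toy dataframe, not the real dataset):

# A toy example: one categorical column with two values becomes two 0/1 columns
toy = pd.DataFrame({"Sex": ["male", "female", "female"]})
toy_encoder = OneHotEncoder(handle_unknown='ignore').fit(toy)
print(toy_encoder.get_feature_names_out(["Sex"]))  # ['Sex_female' 'Sex_male']
print(toy_encoder.transform(toy).toarray())        # male -> [0, 1], female -> [1, 0]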

Since we have fitted our encoder on the necessary data, it has created the new columns, which can be seen with the following command.

encoded_cols = list(encode.get_feature_names_out(cat_cols))
encoded_cols

Now is the time to add them to our dataframes.

train_df[encoded_cols] = encode.transform(train_df[cat_cols]).toarray()
val_df[encoded_cols] = encode.transform(val_df[cat_cols]).toarray()

test_data[encoded_cols] = encode.transform(test_data[cat_cols]).toarray()

Now our different dataframes have 3 types of columns → categorical variables, numerical variables and encoded columns.

We are using toarray() here because, unlike MinMaxScaler, the encoder's transform function returns a sparse matrix; toarray() converts it into a plain NumPy array that can be added to the dataframe.

Now that we have dealt with all data cleaning, it’s time to use a model to make some predictions.

MODEL CHOOSING

There can be different possible models to solve a particular problem. However, I used the logistic regression model, because the problem we are dealing with is a classification problem: we need to put each sample into a particular category, in our case whether someone survives or not.

Here, we are making new dataframes that contain only numerical data and encoded data, as we don’t need categorical data anymore because our model cannot work with that.

x_train = train_df[num_cols+encoded_cols].copy()
train_targets = train_df['Survived'].copy()

val_train = val_df[num_cols+encoded_cols].copy()
val_targets = val_df['Survived'].copy()

test_inputs = test_data[num_cols+encoded_cols].copy()

Now we have made x_train, val_train, and test_inputs dataframes containing only the necessary columns.

We have also created a copy of the targets of train_df and val_df and stored them into train_targets and val_targets, respectively.
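A quick sanity check, if you like, is to confirm that the three feature matrices all share the same columns:

# Same number of columns everywhere; only the row counts differ
print(x_train.shape, val_train.shape, test_inputs.shape)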

from sklearn.linear_model import LogisticRegression

regression = LogisticRegression().fit(x_train, train_targets)

Now that we have trained our model on the training data, it is time to see our predictions and how accurate they are.

predictions = regression.predict(val_train)

We have stored our predictions in the predictions variable; now it is time to measure how accurate they are, which we do using accuracy_score from the scikit-learn library.

from sklearn.metrics import accuracy_score

score = accuracy_score(val_targets, predictions)
print(score)

This prints the accuracy score, which tells you how well your model performed on the validation data: it is simply the fraction of validation rows the model predicted correctly, so the better the model, the higher the score.

Now is the time to store our predictions on test data into a CSV file so that we can submit it.

test_predictions = regression.predict(test_inputs)

#creating a dataframe
submission = pd.DataFrame(
    {
        "PassengerId": test_data['PassengerId'],
        "Survived": test_predictions
    }
)

#saving that submission dataframe as a CSV file
submission.to_csv("submission.csv", index=False)

After getting the predictions from the model, we store them in a dataframe together with the “PassengerId” column; the predictions go under the column name “Survived” as 0s and 1s, where 1 means the passenger survived and 0 means they did not.

After making the dataframe, we convert it into a CSV file named submission.csv. We set index=False because pandas, by default, adds an index column (a column of serial numbers), and we don't want that in our submission file.
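Before uploading, it doesn't hurt to read the file back and confirm it contains just those two columns:

# Optional sanity check on the saved submission file
print(pd.read_csv("submission.csv").head())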

Then we submit this file to the Kaggle competition!!
