Kickstart Your Machine Learning Journey: A Summary of Kaggle's Intro Course

I recently completed the "Intro to Machine Learning" course on Kaggle. I believe it's an excellent starting point for anyone interested in machine learning, particularly with Python libraries like pandas and scikit-learn. In this and the following articles, I aim to summarize the most important things I learned during the course.

In general, our goal in machine learning is to create a model that can predict specific values based on data.

The first important step in data analysis is to understand the data you are working with.

import pandas as pd

titanic_file_path = './data/Titanic.csv'
# read the CSV file and store it in a DataFrame named titanic_data
titanic_data: pd.DataFrame = pd.read_csv(titanic_file_path)
# print summary statistics for the numeric columns of titanic_data
print(titanic_data.describe())
# print the first 20 rows of titanic_data
print(titanic_data.head(20))

In this example, I am using Titanic passenger data downloaded from https://www.kaggle.com/datasets

Here is the script output of the describe method:

It's important to understand the meaning of each row.

  • count - the number of rows in each column that have a non-missing value. Comparing it with the total number of rows quickly shows how many rows are missing data in that column

  • mean - the average value, e.g. for PassengerId it is about 500, as the max value is 1000

  • std - the standard deviation, which measures how spread out the values are around the mean

  • min - the smallest value in the column

  • 25% - the 25th percentile - a number that is bigger than 25% of the values and smaller than 75% of the values

  • 50% - the 50th percentile (the median) - a number that is bigger than 50% of the values and smaller than 50% of the values

  • 75% - the 75th percentile - a number that is bigger than 75% of the values and smaller than 25% of the values

  • max - the largest value in the column
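To make these rows more concrete, here is a minimal sketch on a tiny made-up DataFrame. The values below are hypothetical and are not taken from the Titanic file; one Age value is left missing on purpose so that count differs from the number of rows.

```python
import pandas as pd

# Hypothetical toy data (NOT the Titanic file); one Age value is missing on purpose.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Age": [22.0, 38.0, None, 35.0, 54.0],
})

summary = toy.describe()
print(summary)

# count ignores missing values, so the gap to len(toy)
# is the number of missing entries in that column.
missing_age = len(toy) - int(summary.loc["count", "Age"])
print("missing Age values:", missing_age)

# The 50% row is the median of the column.
print(summary.loc["50%", "PassengerId"] == toy["PassengerId"].median())
```

Individual statistics can be read from the result with summary.loc[row, column], which is handy when you want to check a single number instead of scanning the whole table.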

It's also beneficial to use the head method, which lets us specify how many of the first rows we want to see in the output. This is useful for directly inspecting the data in our DataFrame, helping us spot null values or data that may require format changes, such as dates.
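As a rough illustration of that kind of inspection, the sketch below uses a small made-up DataFrame. The column names and values are hypothetical and not part of the Titanic data. It looks at the first rows with head, counts nulls per column with isnull().sum(), and converts a text column to dates with pd.to_datetime.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
toy = pd.DataFrame({
    "Name": ["Anna", "Ben", "Cara", "Dan"],
    "Age": [22.0, None, 35.0, 54.0],
    "SignupDate": ["2024-01-05", "2024-01-06", "2024-02-10", "2024-03-01"],
})

# Look at the first 3 rows only.
first_rows = toy.head(3)
print(first_rows)

# Count missing values per column.
null_counts = toy.isnull().sum()
print(null_counts)

# The dates were read as plain strings; convert them to a datetime column.
toy["SignupDate"] = pd.to_datetime(toy["SignupDate"])
print(toy.dtypes)
```

Calling head() without an argument returns the first 5 rows, which is often enough for a first look.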

Written by

Jakub Sokolowski