Wine Quality Prediction using Random Forest Classifier
Embark on a captivating tutorial as we uncork the world of wine quality prediction using the potent Random Forest Classifier. In this exploration, we'll navigate the intricate landscape of predictive analytics, demonstrating how to leverage this powerful machine learning tool to assess and classify the quality of wines. From understanding the critical features influencing wine quality to fine-tuning the Random Forest model, this tutorial will guide you through the steps of implementing data-driven insights into the art of winemaking. Whether you're a seasoned oenophile or a data enthusiast, join us in this tutorial to master the fusion of technology and viticulture, ultimately gaining a deeper understanding of how the Random Forest algorithm reshapes the landscape of wine production. Cheers to blending tradition with innovation!
What is Random Forest Classification?
Random Forest Classification is a machine learning algorithm that belongs to the ensemble learning family. It operates by constructing a multitude of decision trees during training and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Each tree in the "forest" is built using a random subset of the training data and a random subset of features, which helps reduce overfitting and enhances the model's robustness. The final prediction is then determined by aggregating the predictions of all the trees, making Random Forest Classification a versatile and powerful method for various classification tasks, including predicting the quality of wines based on specific features in the context of viticulture.
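To make the idea concrete, here is a minimal sketch (separate from the project code, using a toy dataset purely for illustration) of what "many trees plus a majority vote" looks like. Scikit-learn's RandomForestClassifier does all of this, and more, internally:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# toy data purely for illustration
X_demo, y_demo = make_classification(n_samples=200, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(5):
    # each tree sees a different bootstrap sample (rows drawn with replacement)
    idx = rng.integers(0, len(X_demo), len(X_demo))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    tree.fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# each tree votes on the first 5 samples; the "forest" takes the majority
votes = np.array([t.predict(X_demo[:5]) for t in trees])
print((votes.mean(axis=0) >= 0.5).astype(int))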
Let's Start Coding!
Importing Modules
We will use Anaconda to launch a Jupyter (not Jupiter!) Notebook, an interactive Python notebook where we can write our code in blocks and execute them independently. You can download Anaconda from their website!
Install Anaconda and open the Jupyter notebook. In the first cell, type the following import statements, which we will need throughout the project.
import numpy as np                                     # numerical arrays
import pandas as pd                                    # DataFrames for tabular data
import matplotlib.pyplot as plt                        # plotting
import seaborn as sns                                  # statistical visualization
from sklearn.model_selection import train_test_split  # splitting the data
from sklearn.ensemble import RandomForestClassifier   # our model
from sklearn.metrics import accuracy_score            # evaluation metric
Note: If you get an error such as "Module not found", use "pip" to install the required packages. If you are working inside a conda environment, use "conda" instead of "pip". You can search online for the exact install command for your setup.
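For example, in a notebook cell (assuming a standard pip-based setup; note that the package name is scikit-learn even though it is imported as sklearn):

# install everything from inside Jupyter; use "conda install ..." in a conda environment
!pip install numpy pandas matplotlib seaborn scikit-learn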
Downloading and Importing the data
Now we need to download the data on which we will analyse, train, and predict. The best place to find datasets is the Kaggle website. It has a variety of datasets that you can download for free just by signing in. I will provide the link to the dataset required for this project.
To import the data and store it in the form of a DataFrame we will use the "Pandas" library.
Pandas is a Python library designed for data manipulation and analysis. It provides easy-to-use data structures, such as DataFrame and Series, which allow for efficient handling and manipulation of structured data. With powerful tools for cleaning, filtering, and transforming data, Pandas is widely used for tasks like data cleaning, exploration, and preparation in data science and analysis projects.
dataset = pd.read_csv("winequality-red.csv")
This way we store the CSV (Comma-Separated Values) file named "winequality-red" in a DataFrame named "dataset".
A DataFrame is a two-dimensional, tabular data structure in the Pandas library for Python. It resembles a spreadsheet or SQL table, where data is organized in rows and columns. It allows for efficient manipulation, analysis, and cleaning of structured data, offering a powerful tool for data scientists and analysts to work with heterogeneous and labeled data in a familiar tabular format.
Analysing the Data and Plotting Some Graphs to Understand the Relations
Now we will analyse the data: we will look at the data's information, size, and dimensions, check whether there are any null values, and plot some graphs.
dataset.shape
The ".shape" function will give us the number of rows and columns in (rows,columns) format.
dataset.head()
The ".head()" function shows us the first 5 rows of the dataset. Similarly we can use ".tail()" function to see the last 5 rows of the dataset.
dataset['quality'].value_counts()
The ".values_counts()" function shows the number of values for 'quality' column, distinctively. If you are familiar with SQL this becomes more clear and relatable.
dataset.isnull().sum()
Here we are chaining two functions: ".isnull()" to flag null values and ".sum()" to count them per column. If every count is 0, there are no null values.
dataset.describe()
The ".describe()" function gives a variety of outputs under the different columns. It shows the :
Count
Mean
Standard Deviation
Minimum Value in a Column
25 Percentile of a Column
50 Percentile of a Column
75 Percentile of a Column
Maximum Value of a Column
Let's plot some graphs.
We will use Seaborn and Matplotlib library to plot some graphs and show relations visually.
Matplotlib is a Python 2D plotting library that produces high-quality figures, charts, and visualizations. It provides a wide range of customizable plots and can be used for various purposes, from simple line plots to complex 3D visualizations.
Seaborn, built on top of Matplotlib, is a statistical data visualization library. It provides a high-level interface for creating attractive and informative statistical graphics. Seaborn simplifies the process of generating complex visualizations and adds aesthetic enhancements, making it particularly useful for exploring relationships in datasets.
First, we will see the number of values for each quality. We will use seaborn to plot a catplot.
sns.catplot(x='quality', data=dataset, kind='count')
Next we will do a "Volatile Acidity vs Quality" barplot. Please note the quality column is our target column and this will play a crucial role in prediction.
plot = plt.figure(figsize=(5,5))
sns.barplot(x='quality', y='volatile acidity', data=dataset)
Next we will do a "Citric Acid Content vs quality" barplot.
plot = plt.figure(figsize=(5,5))
sns.barplot(x='quality', y='citric acid', data=dataset)
Then we will compute the correlation matrix and construct a heatmap to visualize it.
Correlation is a statistical measure that describes the extent to which two variables change in relation to each other. It quantifies the direction and strength of a linear relationship between two sets of data, indicating how changes in one variable are associated with changes in another. A positive correlation implies that as one variable increases, the other tends to increase as well, while a negative correlation suggests that as one variable increases, the other tends to decrease. Correlation values range from -1 to 1, with 0 indicating no linear correlation.
A heatmap is a graphical representation of data in a matrix format, where values are represented by colors. Typically used to visualize the intensity of relationships between two variables, a heatmap assigns colors to different values in a matrix, making it easy to identify patterns or trends. Brighter or warmer colors often represent higher values, while darker or cooler colors represent lower values. Heatmaps are commonly used in data analysis and data visualization to highlight patterns, correlations, or variations within a dataset.
#Correlation
correlation = dataset.corr()
correlation
#constructing a heatmap to understand correlation
plt.figure(figsize=(10,10))
sns.heatmap(correlation, cbar=True, square=True, fmt='.1f', annot=True, annot_kws={"size": 8}, cmap="Blues")
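If you mainly want to know how each feature relates to the target, an optional shortcut is to sort just the 'quality' column of the correlation matrix:

# features most positively correlated with quality appear at the top
correlation['quality'].sort_values(ascending=False)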
Data Preprocessing and Train-Test-Predict
Now we will pre-process the data and train the model so that we can make predictions.
Data preprocessing involves the preparation and cleaning of raw data to make it suitable for analysis or machine learning tasks. This essential step includes handling missing values, addressing outliers, normalizing or scaling data, and converting categorical variables into a format suitable for algorithms. Data preprocessing aims to enhance the quality of data, reduce noise, and ensure that the dataset is well-structured and ready for effective analysis or modeling. It plays a crucial role in improving the performance and reliability of machine learning models by ensuring that they are trained on accurate, consistent, and appropriately formatted data.
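For this particular dataset there are no missing values, and tree-based models do not need feature scaling, so our preprocessing stays light. Purely as an illustrative sketch of what handling missing values could look like on a messier dataset:

# illustrative only -- our dataset has no nulls, so neither line is needed here
cleaned = dataset.dropna()                    # option 1: drop rows with missing values
filled = dataset.fillna(dataset.median())     # option 2: impute with column medians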
X = dataset.drop('quality', axis=1)
X
The above code stores every column except 'quality' in 'X'. Note that ".drop()" returns a new DataFrame; the original 'dataset' is left unchanged.
#Label Binarization
Y = dataset['quality'].apply(lambda y_value: 1 if y_value >= 7 else 0)
Y
The above code puts only the 'quality' column into Y. We are also binarizing the label into 0 and 1 (a quality of 7 or above becomes 1, everything else becomes 0) so that it becomes easier for the model to train and predict. Here is a bookish definition of label binarization:
Label binarization is a process in machine learning where categorical labels are converted into binary (0 or 1) format. In the context of binary classification, it involves transforming a set of categorical labels into a binary representation to make them suitable for training machine learning models. Each unique category or class is assigned a binary value, usually 0 or 1, indicating its presence or absence. This transformation is particularly useful when dealing with algorithms that require numerical input, enabling the efficient handling of categorical data in tasks such as classification.
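As a small aside, the same binarization can be written without a lambda, as a vectorized comparison; both produce identical results:

# equivalent to the .apply(lambda ...) version above
Y = (dataset['quality'] >= 7).astype(int)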
Training the data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=3)
The code above uses the train_test_split function, which we imported at the beginning, to split the dataset into training and testing sets. The variables X and Y are the preprocessed data we built above. The test_size parameter allocates 20% of the data to the testing set, and random_state makes the split reproducible.
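One optional refinement, since good wines (quality >= 7) are a small minority in this dataset: passing stratify=Y keeps the good/bad ratio the same in both splits. This is a suggestion, not part of the original code:

# optional: preserve class proportions in both the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=3, stratify=Y
)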
print(Y.shape, y_test.shape, y_train.shape)
model = RandomForestClassifier()
model.fit(X_train, y_train)
Here we create an instance of the RandomForestClassifier class and fit the training data to it, which trains the model.
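We used the default settings (100 trees in recent scikit-learn versions). If you want to experiment, here is a sketch with a few commonly tuned hyperparameters; the values are arbitrary starting points, not tuned for this dataset:

model = RandomForestClassifier(
    n_estimators=200,     # number of trees in the forest
    max_features='sqrt',  # features considered at each split
    random_state=3,       # reproducible results
)
model.fit(X_train, y_train)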
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(y_test, X_test_prediction)  # accuracy_score(y_true, y_pred)
print(test_data_accuracy)
So now we are predicting on the test dataset. Using accuracy_score, we find the accuracy by comparing the model's predictions against the y_test labels.
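Because our labels are imbalanced (most wines fall below quality 7), accuracy alone can look better than the model really is. As an optional extra, per-class metrics give a fuller picture:

from sklearn.metrics import classification_report, confusion_matrix

# rows = actual classes, columns = predicted classes
print(confusion_matrix(y_test, X_test_prediction))
# per-class precision, recall, and F1
print(classification_report(y_test, X_test_prediction))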
Building Prediction Model
input_data = (8.9,0.62,0.19,3.9,0.17,51.0,148.0,0.9986,3.17,0.93,9.2)
# convert the input to a numpy array and reshape it into a single row
input_data_as_np_array = np.asarray(input_data)
input_data_reshaped = input_data_as_np_array.reshape(1,-1)
prediction = model.predict(input_data_reshaped)
# print(prediction)
if (prediction[0] == 1):
    print("Good quality wine")
else:
    print("Bad quality wine")
Let's take some random values from the test dataset, excluding the value of the 'quality' column. That 'quality' value will tell us whether our model predicted correctly. Provide those values as a tuple (or list) and convert it into a numpy array.
Reshape the numpy array into a single row so the model can predict on it. Using a basic if-else conditional statement, we can see whether the prediction is right.
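If you would rather not type the values by hand, an optional sanity check is to pull a row straight out of X_test and compare the prediction with the true label:

# double brackets keep the row 2-D, which is what the model expects
sample = X_test.iloc[[0]]
print("Predicted:", model.predict(sample)[0], "| Actual:", y_test.iloc[0])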
And this ends the project. Hope you liked it. I have linked some YouTube videos below for you to learn about Seaborn and Matplotlib. Thank you :)
Important Links
Seaborn by NeuralNine - https://youtu.be/ooqXQ37XHMM?si=rWXnpaNk2HKX7fje
Matplotlib by NeuralNine - https://youtu.be/OZOOLe2imFo?si=1YmCRjovLm4ijqet
Github link for the project - https://github.com/PredictiveSwap/WineQualityPrediction/tree/main
Written by Swapnil Bhattacharya
Hello, I am Swapnil. I am currently pursuing my Master's in Data Science at the University of Birmingham. Read my latest articles to stay updated. Thank you :)