How to Create and Train a Machine Learning Model from Scratch
Machine learning (ML) has revolutionized the way we approach problem-solving, enabling computers to learn from data and make decisions or predictions. In this blog, we’ll walk through the process of creating and training a machine learning model from scratch using Python and its popular libraries like scikit-learn.
1. Setting Up the Environment
Before we begin, we need to set up the tools necessary to build a machine learning model. We'll use popular Python libraries: numpy and pandas for data handling, scikit-learn for machine learning algorithms, and matplotlib and seaborn for visualization.
Run the following command in your terminal or command prompt to install these libraries:
pip install numpy pandas scikit-learn matplotlib seaborn
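To confirm everything installed correctly, you can print the library versions from a Python session (a quick optional check):
import numpy, pandas, sklearn, matplotlib
print(numpy.__version__, pandas.__version__, sklearn.__version__, matplotlib.__version__)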
2. Understanding the Machine Learning Process
Creating a machine learning model can be broken down into these core steps (a condensed code sketch follows the list):
Problem Definition: What are you trying to solve or predict?
Data Collection: Gather or load the dataset.
Data Preprocessing: Clean and prepare the data.
Model Selection: Choose the appropriate algorithm.
Model Training: Train the model using your data.
Model Evaluation: Test the model's performance.
Model Tuning: Optimize the model for better accuracy.
Deployment: Use the model in real-world applications.
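To make these steps concrete before unpacking them one by one, here's a minimal end-to-end sketch of steps 2 through 6 on the Iris dataset. It uses scikit-learn's Pipeline helper to chain preprocessing and training; the rest of this guide performs the same steps individually for clarity:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# Load, split, preprocess + train, and evaluate in a few lines
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = make_pipeline(StandardScaler(), KNeighborsClassifier())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out test set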
3. Loading and Exploring the Data
For this guide, we'll use the famous Iris dataset from scikit-learn. This dataset contains information about flowers and their classifications into three species based on the length and width of their petals and sepals.
Here’s how you can load and view the data:
import pandas as pd
from sklearn.datasets import load_iris
# Load the dataset
iris = load_iris()
# Convert to DataFrame for easy visualization
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target
# Display the first few rows
print(df.head())
Explanation:
- We load the Iris dataset and convert it into a pandas DataFrame so we can easily explore it. The dataset includes columns like petal and sepal measurements and a target column indicating the flower species.
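Since we installed matplotlib and seaborn earlier, you can also take a quick visual look at the data. Here's an optional sketch that plots every pair of features, colored by species, to see how well the classes separate:
import seaborn as sns
import matplotlib.pyplot as plt
# Pairwise scatter plots colored by species
sns.pairplot(df, hue='species')
plt.show()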
4. Preprocessing the Data
Data is rarely perfect, so we need to clean it before training the model. This includes handling missing data, scaling the features so they share a comparable range, and encoding categorical variables.
Luckily, the Iris dataset is already clean, so we only need to scale the features:
from sklearn.preprocessing import StandardScaler
# Separate the features and target
X = df.drop('species', axis=1)
y = df['species']
# Standardize the features (scaling)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Explanation:
- Feature scaling ensures that the machine learning model doesn’t give more weight to features just because they have a larger range of values (e.g., petal length vs. sepal length).
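For your own datasets, it's worth running a few quick checks before scaling. A minimal sketch using pandas (Iris will report no missing values):
# Count missing values per column and summarize the feature ranges
print(df.isnull().sum())
print(df.describe())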
5. Splitting the Data
Now that the data is clean and scaled, we need to split it into two parts:
Training set (80%): This data will be used to train the model.
Test set (20%): This data will be used to evaluate the model.
from sklearn.model_selection import train_test_split
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
print(f"Training samples: {X_train.shape[0]}, Testing samples: {X_test.shape[0]}")
Explanation:
- We split the dataset into training and testing sets. By keeping a separate test set, we can evaluate how well the model generalizes to new data.
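One optional refinement: on small datasets like Iris, a random split can leave the classes unevenly represented. Passing stratify=y to train_test_split keeps the species proportions the same in both sets:
# Stratified split: preserves the class balance in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)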
6. Choosing and Training a Model
We’ll start by using a simple algorithm, K-Nearest Neighbors (KNN), which classifies a new point based on its closest neighbors in the dataset.
from sklearn.neighbors import KNeighborsClassifier
# Initialize the model
knn = KNeighborsClassifier(n_neighbors=3)
# Train the model on the training data
knn.fit(X_train, y_train)
Explanation:
- K-Nearest Neighbors is easy to understand: for each new data point, it looks at the 'k' nearest points in the training data and assigns the majority class label.
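Once trained, the model can classify a brand-new measurement. A small illustration (the flower measurements below are made up for demonstration; note that new inputs must be scaled with the same scaler fitted earlier):
# Hypothetical flower: sepal length, sepal width, petal length, petal width (cm)
new_flower = pd.DataFrame([[5.1, 3.5, 1.4, 0.2]], columns=iris.feature_names)
new_flower_scaled = scaler.transform(new_flower)  # reuse the fitted scaler
print(iris.target_names[knn.predict(new_flower_scaled)])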
7. Evaluating the Model
After training, it's crucial to test the model on the unseen test data to measure its accuracy and other performance metrics.
from sklearn.metrics import accuracy_score
# Make predictions on the test data
y_pred = knn.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
Explanation:
- Accuracy is a basic metric to measure how many predictions were correct. Here, we predict the species of flowers in the test set and compare them with the actual values to compute accuracy.
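Accuracy alone can hide which species the model confuses. scikit-learn also provides a per-class breakdown; here's a short sketch:
from sklearn.metrics import classification_report, confusion_matrix
# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=iris.target_names))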
8. Tuning the Model
If you’re not satisfied with the initial accuracy, you can tune the model by adjusting its parameters (hyperparameters). For example, we can experiment with different values of k in the KNN model.
# Try with different number of neighbors (k)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
# Evaluate the model again
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy with k=5: {accuracy * 100:.2f}%")
Explanation:
- By changing the number of neighbors (k), we can potentially improve the accuracy of the KNN model. You can experiment with different values to see what works best.
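Rather than checking values of k one at a time, you can loop over a range of candidates. The sketch below uses 5-fold cross-validation on the training set only, so the test set stays untouched for the final evaluation:
from sklearn.model_selection import cross_val_score
# Try k = 1..15 and keep the value with the best mean cross-validation accuracy
best_k, best_score = 1, 0.0
for k in range(1, 16):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    if scores.mean() > best_score:
        best_k, best_score = k, scores.mean()
print(f"Best k: {best_k} (mean CV accuracy: {best_score * 100:.2f}%)")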
9. Deploying the Model
Once you're happy with the model's performance, you can save it and deploy it for real-world applications. In Python, you can use the joblib library to save the trained model and later load it when needed.
import joblib
# Save the model to a file
joblib.dump(knn, 'knn_model.pkl')
# Load the model back
knn_loaded = joblib.load('knn_model.pkl')
# Now you can use it to make predictions
y_new_pred = knn_loaded.predict(X_test)
Explanation:
- Saving the model allows you to use it later without having to retrain it. This is useful when deploying the model in production.
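One caveat: at prediction time, raw inputs must be scaled exactly as the training data was, so in practice you'd save the fitted scaler alongside the model. A sketch using the same joblib approach:
# Save the fitted scaler so future inputs can be preprocessed identically
joblib.dump(scaler, 'scaler.pkl')
# Later: load it and transform raw inputs before calling predict
scaler_loaded = joblib.load('scaler.pkl')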
Conclusion
Here’s a recap of the process:
Prepare the data: Clean, scale, and split it into training and test sets.
Choose a model: Select an algorithm like KNN, Decision Trees, etc.
Train the model: Use the training set to fit the model.
Evaluate the model: Check its accuracy and performance on the test set.
Tune the model: Adjust parameters to improve accuracy.
Deploy the model: Save it for later use in real-world applications.
Machine learning is a powerful tool that can solve a wide range of problems, and by mastering the basics, you’re well on your way to building intelligent systems. Happy coding!