Human Activity Recognition Using Logistic Regression


1. Introduction
Overview of Human Activity Recognition with Smartphones
Human Activity Recognition (HAR) is an exciting field in machine learning that focuses on identifying physical activities based on sensor data collected from smartphones, smartwatches, and other wearable devices. With advancements in mobile technology, HAR has found applications in health monitoring, fitness tracking, and human-computer interaction.
The Objective of This Project
In this project, we aim to predict one of six human activities based on motion sensor data from smartphones:
Walking
Walking Upstairs
Walking Downstairs
Sitting
Standing
Laying
By analyzing the sensor readings, our model will classify each observation into one of these activity categories.
Why Logistic Regression?
Logistic Regression is a widely used classification algorithm that predicts the probability of an instance belonging to a particular class. Since our problem involves multi-class classification (six activity labels), we will use the One-vs-Rest (OvR) approach, where a separate logistic regression model is trained for each class.
Dataset: Human Activity Recognition with Smartphones
We will use the Human Activity Recognition with Smartphones dataset, which contains:
561 sensor features extracted from accelerometer and gyroscope data.
A target variable (Activity) representing the six activity categories.
Data collected from 30 volunteers performing daily activities while carrying a smartphone.
Now, let's begin by loading and exploring the dataset.
2. Data Import and Exploration
Importing the Data
We start by loading the dataset and inspecting its structure.
Load the Dataset
import pandas as pd

# Load the dataset
data = pd.read_csv('Human_Activity_Recognition_Using_Smartphones_Data.csv')

# Display the first few rows
print(data.head())
tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z tBodyAcc-std()-X \
0 0.288585 -0.020294 -0.132905 -0.995279
1 0.278419 -0.016411 -0.123520 -0.998245
2 0.279653 -0.019467 -0.113462 -0.995380
3 0.279174 -0.026201 -0.123283 -0.996091
4 0.276629 -0.016570 -0.115362 -0.998139
tBodyAcc-std()-Y tBodyAcc-std()-Z tBodyAcc-mad()-X tBodyAcc-mad()-Y \
0 -0.983111 -0.913526 -0.995112 -0.983185
1 -0.975300 -0.960322 -0.998807 -0.974914
2 -0.967187 -0.978944 -0.996520 -0.963668
3 -0.983403 -0.990675 -0.997099 -0.982750
4 -0.980817 -0.990482 -0.998321 -0.979672
tBodyAcc-mad()-Z tBodyAcc-max()-X ... fBodyBodyGyroJerkMag-skewness() \
0 -0.923527 -0.934724 ... -0.298676
1 -0.957686 -0.943068 ... -0.595051
2 -0.977469 -0.938692 ... -0.390748
3 -0.989302 -0.938692 ... -0.117290
4 -0.990441 -0.942469 ... -0.351471
fBodyBodyGyroJerkMag-kurtosis() angle(tBodyAccMean,gravity) \
0 -0.710304 -0.112754
1 -0.861499 0.053477
2 -0.760104 -0.118559
3 -0.482845 -0.036788
4 -0.699205 0.123320
angle(tBodyAccJerkMean),gravityMean) angle(tBodyGyroMean,gravityMean) \
0 0.030400 -0.464761
1 -0.007435 -0.732626
2 0.177899 0.100699
3 -0.012892 0.640011
4 0.122542 0.693578
angle(tBodyGyroJerkMean,gravityMean) angle(X,gravityMean) \
0 -0.018446 -0.841247
1 0.703511 -0.844788
2 0.808529 -0.848933
3 -0.485366 -0.848649
4 -0.615971 -0.847865
angle(Y,gravityMean) angle(Z,gravityMean) Activity
0 0.179941 -0.058627 STANDING
1 0.180289 -0.054317 STANDING
2 0.180637 -0.049118 STANDING
3 0.181935 -0.047663 STANDING
4 0.185151 -0.043892 STANDING
[5 rows x 562 columns]
Feature Overview:
The dataset consists of 561 sensor-based numerical features derived from smartphone accelerometer and gyroscope readings.
Feature names follow a pattern, such as tBodyAcc-mean()-X or tBodyAcc-std()-Y, indicating time-domain (tBodyAcc) and frequency-domain (fBodyAcc, fBodyGyro) features.
Features represent statistical metrics (mean, standard deviation, skewness, kurtosis) and angles derived from motion sensor signals.
Feature Scaling:
Values range between -1 and 1, confirming that the dataset is already normalized.
This ensures compatibility with machine learning models like logistic regression, which benefits from scaled input features.
Target Variable (Activity):
The Activity column is categorical, representing six human activity types.
In this sample, all five rows belong to the STANDING class, but further exploration will confirm the dataset balance.
Potential Preprocessing Steps:
Convert Activity into numerical labels using LabelEncoder.
No missing values were detected, so imputation is not needed (verified in the quick check below).
Features may exhibit high correlations, requiring feature selection techniques.
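Two of these observations are easy to verify directly. The following quick check (a minimal sketch, assuming data is loaded as above) confirms the [-1, 1] feature range and the absence of missing values:
# Verify the documented feature range and check for missing values
features = data.iloc[:, :-1]  # all columns except the Activity label
print("Feature range:", features.min().min(), "to", features.max().max())
print("Missing values:", data.isnull().sum().sum())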
Check Dataset Structure
# Count the column data types
print(data.dtypes.value_counts())
float64 561
object 1
dtype: int64
All sensor features are floating-point numbers, while the target variable Activity is a categorical object.
Activity Label Distribution
Before training the model, we analyze the distribution of activity labels to check if the dataset is balanced.
Compute Class Distribution
# Count occurrences of each activity
activity_counts = data['Activity'].value_counts()
# Display distribution
print(activity_counts)
LAYING 1944
STANDING 1906
SITTING 1777
WALKING 1722
WALKING_UPSTAIRS 1544
WALKING_DOWNSTAIRS 1406
Name: Activity, dtype: int64
Visualize Class Distribution
import matplotlib.pyplot as plt
import seaborn as sns
# Plot activity distribution
plt.figure(figsize=(8, 5))
sns.barplot(x=activity_counts.index, y=activity_counts.values, palette="viridis")
plt.xticks(rotation=45)
plt.xlabel("Activity")
plt.ylabel("Count")
plt.title("Distribution of Activity Labels")
plt.show()
The dataset is fairly balanced, meaning we don't need to apply any resampling techniques.
Encoding Activity Labels
Since scikit-learn classifiers require numerical target values, we need to convert the categorical Activity column into numerical form. Label encoding is a straightforward choice here: it assigns a unique integer to each activity category.
Why Label Encoding Won't Affect the Model
A common concern with Label Encoding is that it introduces an ordinal relationship between the classes (e.g., 0 < 1 < 2). However, this is not a problem in our case because:
Scikit-learn’s Logistic Regression does not assume order in categorical labels. It treats them as distinct classes in a One-vs-Rest (OvR) or Softmax (multinomial mode) framework.
Unlike ordinal regression, multi-class logistic regression learns separate decision boundaries for each category, meaning the numerical values assigned to labels do not impose a ranking.
One-hot encoding is unnecessary for the target variable (y) in classification tasks, because LogisticRegression expects a single column of integer labels. A quick toy check of the first point follows below.
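As a sanity check of this claim, here is a small toy experiment (illustrative only, using synthetic data rather than our dataset): arbitrarily permuting the integer labels leaves the classifier's accuracy unchanged, because logistic regression treats the labels as unordered class identities.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data with three classes labeled 0, 1, 2
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5, random_state=0)

# Remap labels (0, 1, 2) -> (2, 0, 1) and refit
remap = np.array([2, 0, 1])
acc_original = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)
acc_remapped = LogisticRegression(max_iter=1000).fit(X, remap[y]).score(X, remap[y])
print(acc_original, acc_remapped)  # the two accuracies match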
Encode Activity Labels with LabelEncoder
from sklearn.preprocessing import LabelEncoder
# Initialize LabelEncoder
le = LabelEncoder()
# Encode the Activity column
data["Activity"] = le.fit_transform(data["Activity"])
# Display mapping of labels to numerical values
label_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
print("Activity Label Mapping:", label_mapping)
Activity Label Mapping: {'LAYING': 0, 'SITTING': 1, 'STANDING': 2, 'WALKING': 3, 'WALKING_DOWNSTAIRS': 4, 'WALKING_UPSTAIRS': 5}
This ensures that our Activity labels are numeric without introducing unintended ordering effects that could mislead the model.
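A convenient by-product of the fitted encoder: inverse_transform maps encoded predictions back to the original activity names, which will be handy when reporting results. A quick illustration:
# Map encoded labels back to their original activity names
print(le.inverse_transform([0, 3, 5]))
# ['LAYING' 'WALKING' 'WALKING_UPSTAIRS']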
3. Feature Analysis and Correlation
Understanding the relationships between features is crucial for building a strong and interpretable model. In this section, we perform a correlation analysis to identify any redundant features that may affect the performance of our logistic regression model.
Visualize the Correlation Distribution
We begin by computing the correlation matrix for all features and visualizing it using a histogram. The correlation matrix shows how each feature is related to others, with values ranging from -1 to 1. A correlation close to 1 or -1 indicates a strong relationship, while a value near 0 suggests no significant relationship.
# Compute the correlation matrix for all features
correlation_matrix = data.iloc[:, :-1].corr()
# Plot a histogram of the absolute correlation values
import matplotlib.pyplot as plt
import seaborn as sns
# Get the absolute values of the correlations for the histogram
abs_correlation = correlation_matrix.abs().stack().reset_index(name="correlation")
plt.figure(figsize=(10, 6))
sns.histplot(abs_correlation["correlation"], bins=30, kde=True)
plt.title("Histogram of Absolute Correlation Values")
plt.xlabel("Correlation Value")
plt.ylabel("Frequency")
plt.show()
The histogram of absolute correlation values shows that most feature pairs have a low to moderate correlation, with a significant peak at 0. This indicates that the majority of features in the dataset are weakly correlated with each other. As the correlation value increases toward 1, the frequency of feature pairs decreases, suggesting that strong correlations (either positive or negative) are less common. The plot suggests that most features provide unique information with minimal redundancy, while a small portion of features may be highly correlated and potentially redundant.
Identify Highly Correlated Features
Next, we identify feature pairs that have a high correlation (greater than 0.8). These features may be redundant, and keeping both in the model can lead to multicollinearity, which affects the stability and interpretability of the regression model. Note that the raw table below includes each feature paired with itself (correlation exactly 1) and lists every pair in both orders; these trivial entries can be filtered out before acting on the results.
# Build a tidy table of pairwise correlations
corr_values = correlation_matrix.stack().reset_index()
corr_values.columns = ['feature1', 'feature2', 'correlation']
corr_values['abs_correlation'] = corr_values['correlation'].abs()

# Sort the correlation values and filter for those above 0.8
highly_correlated = corr_values.sort_values('correlation', ascending=False).query('abs_correlation > 0.8')
# Display the most highly correlated feature pairs
print(highly_correlated)
feature1 feature2 correlation \
0 tBodyAcc-mean()-X tBodyAcc-mean()-X 1.000000
114648 tBodyAccMag-min() tBodyAccMag-min() 1.000000
114086 tBodyAccMag-max() tBodyAccMag-max() 1.000000
113537 tBodyAccMag-mad() tGravityAccMag-mad() 1.000000
113524 tBodyAccMag-mad() tBodyAccMag-mad() 1.000000
... ... ... ...
151896 fBodyAcc-std()-Z fBodyGyro-std()-X 0.800002
114712 tBodyAccMag-min() fBodyAcc-std()-X 0.800001
150565 fBodyAcc-std()-X tGravityAccMag-min() 0.800001
122005 tGravityAccMag-min() fBodyAcc-std()-X 0.800001
150552 fBodyAcc-std()-X tBodyAccMag-min() 0.800001
abs_correlation
0 1.000000
114648 1.000000
114086 1.000000
113537 1.000000
113524 1.000000
... ...
151896 0.800002
114712 0.800001
150565 0.800001
122005 0.800001
150552 0.800001
[46191 rows x 4 columns]
Implications of Correlated Features in Logistic Regression
In logistic regression, multicollinearity occurs when two or more predictors are highly correlated with each other. This can cause unstable coefficients, making it difficult to interpret the model. Multicollinearity can also lead to overfitting, where the model memorizes the training data and performs poorly on unseen data.
By identifying and analyzing the highly correlated features, we can ensure that the features chosen for the logistic regression model are independent and informative. This step helps to improve model stability and avoid issues with overfitting.
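One common remedy, shown here as an optional sketch rather than a step we apply in this walkthrough, is to drop one feature from each highly correlated pair. The snippet builds on the corr_values table computed above:
# Greedy filter: for each pair with |correlation| > 0.8 (excluding self-pairs),
# mark the second feature of the pair for removal
pairs = corr_values.query('abs_correlation > 0.8 and feature1 != feature2')
to_drop = set(pairs['feature2'])
print(f"Candidate features to drop: {len(to_drop)} of {data.shape[1] - 1}")
# data_reduced = data.drop(columns=list(to_drop))  # optional reduced dataset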
4. Splitting Data for Model Training
Properly splitting the data into training and testing sets is a crucial step in model training. It ensures that the model is evaluated on unseen data, providing an unbiased assessment of its performance. In this section, we use StratifiedShuffleSplit to maintain the class balance, ensuring that the proportion of each activity class is preserved in both the training and testing sets.
Train-Test Split
To split the dataset, we use StratifiedShuffleSplit from Scikit-learn. This technique ensures that each class is represented proportionally in both the training and test sets, which is especially important when dealing with imbalanced datasets. In this case, since the activities are balanced, it will help ensure the proportions are consistent between the splits.
# Import StratifiedShuffleSplit from sklearn
from sklearn.model_selection import StratifiedShuffleSplit

# Initialize StratifiedShuffleSplit with a test size of 30%
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)

# Create train-test splits
for train_index, test_index in sss.split(data, data['Activity']):
    train_set, test_set = data.iloc[train_index], data.iloc[test_index]

# Define the features and target variable
X_train = train_set.drop("Activity", axis=1)
y_train = train_set["Activity"]
X_test = test_set.drop("Activity", axis=1)
y_test = test_set["Activity"]

# Display class distribution in the training set
print("Training Set Class Distribution:")
print(y_train.value_counts() / len(y_train))

# Display class distribution in the testing set
print("\nTesting Set Class Distribution:")
print(y_test.value_counts() / len(y_test))
Explanation:
StratifiedShuffleSplit is used to split the data into one training set (70%) and one test set (30%) while maintaining the relative distribution of the target variable (activity classes).
StratifiedShuffleSplit is used to split the data into one training set (70%) and one test set (30%) while maintaining the relative distribution of the target variable (activity classes).
We then define the features (X_train, X_test) by dropping the target variable Activity, and the targets (y_train, y_test) by extracting the Activity column.
We verify that the distribution of classes in both the training and testing sets is consistent with the original dataset.
Verify the Class Distribution in Train and Test Sets
After the split, we display the class distribution of the training and testing sets to ensure that the proportion of each activity is maintained across both sets.
Output:
Training Set Class Distribution:
0 0.188792
2 0.185046
1 0.172562
3 0.167152
5 0.149951
4 0.136496
Name: Activity, dtype: float64
Testing Set Class Distribution:
0 0.188673
2 0.185113
1 0.172492
3 0.167314
5 0.149838
4 0.136570
Name: Activity, dtype: float64
This confirms that the class distribution is maintained in both the train and test sets.
5. Logistic Regression Model Training
Training a Baseline Logistic Regression Model
We begin by training a baseline logistic regression model with scikit-learn's default settings. (Note that LogisticRegression applies a mild L2 penalty with C=1.0 by default, so this baseline is not literally unregularized; it simply does not tune the regularization.) The model uses the liblinear solver and is fitted on the training data (X_train, y_train). This solver is suitable for small to medium datasets and handles multi-class problems via a One-vs-Rest scheme.
from sklearn.linear_model import LogisticRegression
# Standard Logistic regression
lr = LogisticRegression(solver='liblinear').fit(X_train, y_train)
This model fits the training data, providing coefficients for each feature that the model uses to classify the data into one of the activity categories.
One-vs-Rest (OvR) Approach for Multi-Class Classification
Logistic Regression is naturally a binary classifier, but in the case of multi-class classification, such as the one in this project with six possible activities (labels), a strategy called One-vs-Rest (OvR) is used. In the OvR approach, a separate binary classifier is trained for each class, where the class is treated as the positive class, and all other classes are treated as the negative class. This method ensures that the model can classify multiple classes.
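To see what OvR looks like in the fitted model, note that lr stores one weight vector per class, and prediction picks the class whose binary classifier returns the highest score. A small illustration (assuming the lr model and the X_test split defined earlier):
import numpy as np

# One OvR weight vector per activity class: shape (6, 561)
print("Coefficient matrix shape:", lr.coef_.shape)

# decision_function yields one score per class; predict takes the argmax
scores = lr.decision_function(X_test.iloc[:5])
print("Argmax of scores:", np.argmax(scores, axis=1))
print("predict() output: ", lr.predict(X_test.iloc[:5]))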
Hyperparameter Tuning with Cross-Validation
Next, we apply hyperparameter tuning using LogisticRegressionCV from sklearn. This estimator uses cross-validation internally to determine the best hyperparameters (such as the regularization strength, C).
We will fit models with both L1 (Lasso) and L2 (Ridge) regularization:
from sklearn.linear_model import LogisticRegressionCV
# L1 regularized Logistic regression
lr_l1 = LogisticRegressionCV(Cs=10, penalty='l1', solver='liblinear', cv=4).fit(X_train, y_train)
# L2 regularized Logistic regression
lr_l2 = LogisticRegressionCV(Cs=10, penalty='l2', solver='liblinear', cv=4).fit(X_train, y_train)
Comparing L1 and L2 Regularization
L1 Regularization (Lasso): This approach tends to shrink some coefficients to zero, effectively performing feature selection. It's particularly useful when dealing with high-dimensional datasets.
L2 Regularization (Ridge): L2 regularization, on the other hand, does not shrink coefficients to zero but rather penalizes large coefficients, helping to prevent overfitting.
The regularization strength parameter C is the inverse of the regularization penalty: smaller values indicate stronger regularization. The best value for C is selected using cross-validation, ensuring that the model neither overfits nor underfits the data.
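LogisticRegressionCV exposes the cross-validated choice through its C_ attribute, one selected value per class under the OvR scheme. A minimal inspection (assuming the lr_l1 and lr_l2 models fitted above):
# Regularization strengths chosen by cross-validation, per activity class
print("Best C per class (L1):", lr_l1.C_)
print("Best C per class (L2):", lr_l2.C_)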
The next steps would involve comparing the performance of these models on the test data, as well as analyzing their coefficients, which reflect the importance of each feature in predicting the activity classes.
6. Model Interpretation and Coefficients
After training the logistic regression models with and without tuned regularization, we examine their coefficients. Comparing the coefficients across models reveals how each regularization scheme shapes the solution and which features drive the activity predictions; the performance metrics themselves (accuracy, precision, recall, F1-score) are covered in the next section.
Comparing the Magnitudes of the Coefficients
The following code compares the magnitude of the coefficients for the baseline logistic regression model (lr), L1 regularized model (lr_l1), and L2 regularized model (lr_l2). By examining the coefficients for each model, we can understand the contribution of each feature to the predictions.
# Combine all the coefficients into a dataframe
coefficients = list()

coeff_labels = ['lr', 'l1', 'l2']
coeff_models = [lr, lr_l1, lr_l2]

for lab, mod in zip(coeff_labels, coeff_models):
    coeffs = mod.coef_  # shape: (6 classes, 561 features)
    coeff_label = pd.MultiIndex.from_product([[lab], range(6)],
                                             names=['model', 'class'])
    coefficients.append(pd.DataFrame(coeffs.T, columns=coeff_label))

coefficients = pd.concat(coefficients, axis=1)
coefficients.sample(10)
The coefficients highlight the relationship between the features and the activity classes. Notably, the L1 regularization shrinks some coefficients towards zero, which helps in feature selection, while the L2 regularization smooths the coefficients to prevent overfitting.
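To make the sparsity effect concrete, we can count how many coefficients each model drives exactly to zero (a small check over the three fitted models):
import numpy as np

# Fraction of coefficients that are exactly zero in each model;
# L1 should zero out many weights, L2 and the baseline almost none
for name, model in [('lr', lr), ('l1', lr_l1), ('l2', lr_l2)]:
    print(f"{name}: {np.mean(model.coef_ == 0):.1%} of coefficients are zero")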
Visualizing the Coefficients
To better understand how the coefficients of the models differ, we can plot them:
fig, axList = plt.subplots(nrows=3, ncols=2)
axList = axList.flatten()
fig.set_size_inches(10, 10)

for loc, ax in enumerate(axList):
    # Coefficients of activity class `loc` across all three models
    coef_set = coefficients.xs(loc, level=1, axis=1)
    coef_set.plot(marker='o', ls='', ms=2.0, ax=ax, legend=False)

    if ax is axList[0]:
        ax.legend(loc=4)

    ax.set(title='Coefficient Set ' + str(loc))
    ax.set_xlabel("Feature Index")
    ax.set_ylabel("Coefficient Value")

plt.tight_layout()
This plot shows the coefficient values for each feature, one panel per activity class, for the three approaches: standard logistic regression (lr), L1 regularization (l1), and L2 regularization (l2). With matplotlib's default color cycle and the column order used above, the lr points appear in blue, l1 in orange, and l2 in green.
From the plots, we can observe that the magnitude and distribution of the coefficients vary depending on the regularization method. L1 regularization tends to shrink some coefficients to exactly zero, making it useful for feature selection, while L2 regularization generally results in smaller, more evenly distributed coefficients. The standard logistic regression model shows more variability in coefficient magnitudes. These plots are essential for comparing the effects of regularization and understanding how each method affects the model's complexity.
7. Model Evaluation and Predictions
Making Predictions
In this part, we use the trained models to predict the activity labels on the test set. Additionally, we generate the probability scores for each activity class. These predictions and probabilities are stored for later evaluation.
# Predict the class and the maximum class probability for each model
y_pred = list()
y_prob = list()

coeff_labels = ['lr', 'l1', 'l2']
coeff_models = [lr, lr_l1, lr_l2]

for lab, mod in zip(coeff_labels, coeff_models):
    y_pred.append(pd.Series(mod.predict(X_test), name=lab))
    y_prob.append(pd.Series(mod.predict_proba(X_test).max(axis=1), name=lab))

y_pred = pd.concat(y_pred, axis=1)
y_prob = pd.concat(y_prob, axis=1)
print(y_pred.head())
lr l1 l2
0 3 3 3
1 5 5 5
2 3 3 3
3 1 1 1
4 0 0 0
The output above shows the predicted class for each model across the test samples. Alongside the class predictions, we stored the maximum predicted probability for each sample; inspecting y_prob.head() gives:
lr l1 l2
0 0.998939 0.998996 0.999998
1 0.988165 0.999799 0.999477
2 0.987592 0.995806 0.999697
3 0.981381 0.999181 0.999865
4 0.998277 0.999921 0.999997
Evaluating Model Performance
To assess the model's performance, we calculate various classification metrics, including precision, recall, F1-score, accuracy, and ROC-AUC. This code calculates each of the metrics:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score
from sklearn.preprocessing import label_binarize

metrics = list()
cm = dict()

for lab in coeff_labels:
    # Precision, recall, f-score from the multi-class support function
    precision, recall, fscore, _ = score(y_test, y_pred[lab], average='weighted')

    # The usual way to calculate accuracy
    accuracy = accuracy_score(y_test, y_pred[lab])

    # ROC-AUC scores can be calculated by binarizing the data
    auc = roc_auc_score(label_binarize(y_test, classes=[0,1,2,3,4,5]),
                        label_binarize(y_pred[lab], classes=[0,1,2,3,4,5]),
                        average='weighted')

    # Last, the confusion matrix
    cm[lab] = confusion_matrix(y_test, y_pred[lab])

    metrics.append(pd.Series({'precision': precision, 'recall': recall,
                              'fscore': fscore, 'accuracy': accuracy,
                              'auc': auc},
                             name=lab))

metrics = pd.concat(metrics, axis=1)
print(metrics)
lr l1 l2
precision 0.984144 0.983514 0.984477
recall 0.984142 0.983495 0.984466
fscore 0.984143 0.983492 0.984464
accuracy 0.984142 0.983495 0.984466
auc 0.990384 0.989949 0.990553
Let’s visualize the results using a bar chart:
metrics.plot(kind='bar', figsize=(12, 6), rot=0, legend=False)
plt.title("Comparison of Evaluation Metrics with Different Regularization Methods")
plt.xlabel("Metrics")
plt.ylabel("Score")
plt.legend(loc='upper center', bbox_to_anchor=(0.5, 1.3), ncol=3)
plt.tight_layout()
plt.show()
The evaluation metrics for the three models (lr, l1, and l2) are very similar, with precision, recall, F1-score, and accuracy all between 0.9835 and 0.9845. The AUC values are also high, indicating strong model performance. The lr and l2 models show slightly better results than l1, with l2 achieving the highest AUC (0.990553). Overall, all models perform similarly well, with minimal differences in the metrics.
8. Confusion Matrix for Each Model
Displaying the Confusion Matrix
A confusion matrix is a useful tool to evaluate the performance of a classification model by showing the counts of actual versus predicted labels for each class. It helps us understand how well the model is performing in terms of its predictions.
We can display the confusion matrix for each model: logistic regression (lr), L1-regularized logistic regression (l1), and L2-regularized logistic regression (l2). The following code generates and plots the confusion matrix for each model:
# Plot the confusion matrix for each model
fig, axList = plt.subplots(nrows=2, ncols=2)
axList = axList.flatten()
fig.set_size_inches(12, 10)
axList[-1].axis('off')

# Confusion matrix labels
labels = ['0', '1', '2', '3', '4', '5']  # Adjust if you have more/fewer classes

for ax, lab in zip(axList[:-1], coeff_labels):
    # Reuse the confusion matrices computed in the evaluation step
    sns.heatmap(cm[lab], ax=ax, annot=True, fmt='d', cmap='Blues',
                xticklabels=labels, yticklabels=labels)
    ax.set_title(f'Confusion Matrix - {lab}')
    ax.set_xlabel('Predicted')
    ax.set_ylabel('True')

plt.tight_layout()
plt.show()
The confusion matrices for the three models (logistic regression lr, L1-regularized logistic regression l1, and L2-regularized logistic regression l2) show that the models perform similarly, with high counts on the diagonal (correct predictions). A concise analysis:
lr model:
The majority of the predictions are correct (high values on the diagonal).
Misclassifications are minimal but present in some classes, with classes 1 and 5 occasionally confused with others (e.g., 21 samples misclassified as class 1 and 22 as class 2).
l1 model:
Also performs well, with high diagonal values, though it has slightly more misclassifications, especially for class 1: 506 samples are predicted correctly, but 27 are misclassified as class 2.
l2 model:
Shows minimal misclassifications, similar to the lr model. Class 2 has a few more errors (e.g., 20 samples predicted as class 3), but otherwise the performance is strong.
Overall, all three models perform similarly well, with the lr and l2 models exhibiting slightly fewer misclassifications than the l1 model. The confusion matrices indicate strong behavior across the classes, with only minor errors that could be addressed through further tuning or analysis.
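For a per-class view that complements these matrices, scikit-learn's classification_report summarizes precision, recall, and F1-score per activity. A short sketch (assuming y_test, y_pred, and the label encoder le from earlier sections):
from sklearn.metrics import classification_report

# Per-class metrics for the L2-regularized model, with readable activity names
print(classification_report(y_test, y_pred['l2'], target_names=le.classes_))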
9. Conclusion
In this article, we successfully built and evaluated a multi-class logistic regression model for predicting human activity using smartphone data. We started by preparing the data, encoding activity labels, and performing feature analysis and correlation. The results indicated that the dataset contained many highly correlated features, which we carefully considered during model training.
We then trained baseline models using standard logistic regression, L1 (Lasso) regularization, and L2 (Ridge) regularization. We evaluated these models using accuracy, precision, recall, F1-score, and AUC, finding that all models performed similarly, with marginal differences in their ability to classify the activities correctly.
In addition, we visualized the confusion matrix for each model, which showed the number of correct and incorrect predictions across different activity classes. The model performance was high across the board, demonstrating the effectiveness of logistic regression for this classification task.
Finally, we observed that all models performed well with similar metrics, and the regularization did not drastically affect the model's ability to predict activity. Regularization could potentially help prevent overfitting when applied to more complex datasets. For future work, exploring other models, like random forests or gradient boosting machines, might provide further improvements in accuracy.
In summary, logistic regression, with or without regularization, proved to be a solid choice for activity recognition, achieving high accuracy and strong performance across multiple metrics.