Objective:

The goal of this project is to predict a startup's profit from four inputs:

Research & Development (R&D) spend
Administration spend
Marketing spend
The State it operates in

With these predictions, we aim to help businesses make informed investment decisions.

Dataset Description: `50_Startups.csv`

This dataset contains 50 records of startups, each with the following attributes:

Feature	Description
`R&D Spend`	Money spent on research and development
`Administration`	Administrative expenses (rent, salaries, etc.)
`Marketing Spend`	Money spent on marketing and advertising
`State`	Categorical feature representing the startup location
`Profit`	Target variable – the net profit earned
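
Because `State` is categorical, it has to be converted to numeric columns before a linear regression can use it. The sketch below shows one plausible way to do that with pandas `get_dummies` and `drop_first=True`, which drops one dummy column to avoid the dummy variable trap. The post does not show this step, so treat it as an assumption rather than the author's exact code. The rows and state names are hypothetical stand-ins for the real file.

```python
import pandas as pd

# Hypothetical rows in the shape of 50_Startups.csv (all values are made up).
# With the real file you would instead use: df = pd.read_csv("50_Startups.csv")
df = pd.DataFrame({
    "R&D Spend": [165000.0, 160000.0, 150000.0, 140000.0],
    "Administration": [135000.0, 150000.0, 100000.0, 120000.0],
    "Marketing Spend": [470000.0, 440000.0, 400000.0, 380000.0],
    "State": ["New York", "California", "Florida", "New York"],
    "Profit": [192000.0, 191000.0, 190000.0, 182000.0],
})

# One-hot encode State. drop_first=True removes the first category
# (alphabetically) so the remaining dummies are not perfectly collinear
# with the intercept.
df = pd.get_dummies(df, columns=["State"], drop_first=True, dtype=float)

print(df.columns.tolist())
```

The dropped category becomes the baseline: each remaining `State_*` coefficient is read as a difference relative to it.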

2. Splitting the Data

Defined the independent variables (X) and dependent variable (y = Profit)
Split the dataset into training and testing sets (80/20 ratio)

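from sklearn.model_selection import train_test_split
# Assumes df is already loaded and State is already encoded as numeric dummy columns
X = df.drop("Profit", axis=1)  # independent variables
y = df["Profit"]               # dependent variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)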

3. Model Training

Used Multiple Linear Regression to fit the training data

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

The model tries to fit a linear equation:
Profit = b0 + b1*(R&D Spend) + b2*(Administration) + b3*(Marketing Spend) + b4*(State_X) + ... + error
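
To connect the equation to the fitted object: in scikit-learn, `b0` is exposed as `intercept_` and `b1, b2, ...` as `coef_`, in the same order as the feature columns. The sketch below uses synthetic data (not the real dataset) built from known, made-up coefficients, so you can see the model recover them.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data, not the real dataset: profit is constructed from known
# coefficients (hypothetical values) with no noise added.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "R&D Spend": rng.uniform(0, 150_000, 50),
    "Administration": rng.uniform(50_000, 180_000, 50),
    "Marketing Spend": rng.uniform(0, 450_000, 50),
})
y = 50_000 + 0.8 * X["R&D Spend"] + 0.03 * X["Marketing Spend"]

model = LinearRegression().fit(X, y)

# intercept_ is b0; coef_ holds one coefficient per column of X
print("b0 (intercept):", round(model.intercept_, 2))
for name, coef in zip(X.columns, model.coef_):
    print(f"{name}: {coef:.4f}")
```

Since `Administration` was left out of the synthetic formula, its fitted coefficient comes out at essentially zero.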

4. Model Evaluation

Evaluated using:
- R² Score: The proportion of the variation in profit that the model explains (1.0 is a perfect fit)
- Mean Squared Error (MSE): The average squared difference between predicted and actual profits

from sklearn.metrics import r2_score, mean_squared_error

y_pred = model.predict(X_test)
print("R² Score:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
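
To make the two metrics concrete, here is a hand computation with NumPy on a few made-up numbers (hypothetical, not results from this project). It also derives RMSE, which is an addition of mine and not something the post reports. It is handy because it is in the same units as profit.

```python
import numpy as np

# Hypothetical actual and predicted profits (made-up numbers)
y_true = np.array([100.0, 120.0, 140.0, 160.0])
y_hat = np.array([110.0, 115.0, 135.0, 165.0])

residuals = y_true - y_hat

# MSE: mean of the squared residuals
mse = np.mean(residuals ** 2)

# R²: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# RMSE: square root of MSE, in the same units as the target
rmse = np.sqrt(mse)

print("MSE:", mse)    # 43.75
print("R²:", r2)      # 0.9125
print("RMSE:", rmse)
```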

Also visualized predicted vs. actual profits with matplotlib:

import matplotlib.pyplot as plt
plt.scatter(y_test, y_pred)
plt.xlabel("Actual Profit")
plt.ylabel("Predicted Profit")
plt.title("Actual vs Predicted Profit")
plt.show()

5. Feature Optimization (Optional)

Used Backward Elimination with statsmodels to find which features are statistically significant.

import statsmodels.api as sm
# Cast to float so boolean dummy columns do not produce an object-dtype array
X_opt = sm.add_constant(X.astype(float))  # Adds intercept column
model_ols = sm.OLS(y, X_opt).fit()
print(model_ols.summary())

Based on the p-values, we can iteratively remove the least significant feature (highest p-value above a chosen threshold such as 0.05) and refit, which simplifies the model with little loss of accuracy.

Insights & Learnings

R&D Spend had the most significant positive impact on profit.
Marketing Spend also played a role, but to a lesser extent.
Administration costs and State had very little influence.
The project gave me hands-on exposure to:
- Feature engineering
- Dummy variable trap avoidance
- Model building and evaluation
- Statistical analysis with statsmodels

🔧 Tools & Libraries Used

Python
Pandas & NumPy
Scikit-learn
Statsmodels
Matplotlib & Seaborn
Google Colab

Conclusion

This project is a great real-world example of how machine learning models can help in business decision making. It allowed me to explore both technical ML implementation and business reasoning.

Predicting Startup Profit using Multiple Linear Regression

Lokesh Patidar