Predicting Startup Profit using Multiple Linear Regression


Objective:
The goal of this project is to predict a startup's profit from how much it spends on:
- Research & Development (R&D)
- Administration
- Marketing
along with the State it operates in.
These predictions are meant to help businesses make informed investment decisions.
Dataset Description: 50_Startups.csv
This dataset contains 50 records of startups, each with the following attributes:
| Feature | Description |
| --- | --- |
| R&D Spend | Money spent on research and development |
| Administration | Administrative expenses (rent, salaries, etc.) |
| Marketing Spend | Money spent on marketing and advertising |
| State | Categorical feature representing the startup location |
| Profit | Target variable: the net profit earned |
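Because State is categorical, it has to be turned into numeric columns before a linear regression can use it. Here is a minimal sketch of one way to do that, assuming pandas' `get_dummies`. The rows and the state labels A, B and C are made up for illustration and are not taken from 50_Startups.csv.

```python
import pandas as pd

# Made-up rows that only mimic the column layout described above;
# in the real project this would come from pd.read_csv("50_Startups.csv").
df_raw = pd.DataFrame({
    "R&D Spend": [165000.0, 162000.0, 153000.0, 144000.0],
    "Administration": [136000.0, 151000.0, 101000.0, 118000.0],
    "Marketing Spend": [471000.0, 443000.0, 407000.0, 383000.0],
    "State": ["A", "B", "C", "A"],  # hypothetical state labels
    "Profit": [192000.0, 191000.0, 190000.0, 182000.0],
})

# One-hot encode State. drop_first=True drops one dummy column so the
# remaining dummies are not perfectly collinear with the intercept
# (this is the "dummy variable trap").
df = pd.get_dummies(df_raw, columns=["State"], drop_first=True, dtype=float)

print(df.columns.tolist())
```

With three states, two dummy columns remain; the dropped state becomes the baseline that the other two are compared against.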
2. Splitting the Data
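- Defined the independent variables (X) and the dependent variable (y = Profit)
- Split the dataset into training and testing sets (80/20 ratio)
from sklearn.model_selection import train_test_split
# Assumes df is the loaded 50_Startups.csv DataFrame with State already encoded as numeric columns
X = df.drop("Profit", axis=1)  # every column except the target
y = df["Profit"]               # target variable
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)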
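To make the 80/20 ratio concrete: with 50 records, the split leaves 40 rows for training and 10 for testing. The sketch below checks this on placeholder arrays of the same length (the array contents are arbitrary stand-ins, not the real data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 50 rows, matching the dataset's record count
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # (40, 2) (10, 2)

# The same random_state reproduces exactly the same split
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te_again))  # True
```

A test set of only 10 rows is small, so the reported scores can shift noticeably with a different `random_state`.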
3. Model Training
- Used Multiple Linear Regression to fit the training data
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
- The model fits a linear equation of the form:
Profit = b0 + b1*(R&D Spend) + b2*(Administration) + b3*(Marketing Spend) + b4*(State_X) + ... + error
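In scikit-learn, b0 is exposed as `intercept_` and the remaining coefficients as `coef_`, in the same order as the feature columns. The sketch below shows the mapping on synthetic, noise-free data with known coefficients. The column names `rd_spend` and `marketing_spend` and the numbers are illustrative assumptions, not values from the project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic features (hypothetical names), 30 rows each
X_demo = pd.DataFrame({
    "rd_spend": rng.uniform(0, 100, 30),
    "marketing_spend": rng.uniform(0, 100, 30),
})

# Target built from a known equation: y = 5 + 2*rd_spend - 1*marketing_spend
y_demo = 5.0 + 2.0 * X_demo["rd_spend"] - 1.0 * X_demo["marketing_spend"]

demo_model = LinearRegression().fit(X_demo, y_demo)

# Pair each coefficient with its feature name
coefficients = pd.Series(demo_model.coef_, index=X_demo.columns)
print("b0 (intercept):", round(demo_model.intercept_, 6))
print(coefficients.round(6))
```

Because the synthetic target has no noise, the fit recovers the intercept 5 and the coefficients 2 and -1 up to floating-point error. On real data the estimates would only approximate the underlying relationship.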
4. Model Evaluation
Evaluated using:
R² Score: Tells how well the model explains the variation in profit
Mean Squared Error (MSE): Measures average squared difference between predicted and actual profits
from sklearn.metrics import r2_score, mean_squared_error
y_pred = model.predict(X_test)
print("R² Score:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
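To show what these two numbers measure, here is a small worked example that computes them by hand and compares the results with scikit-learn. The four actual and predicted values are made up for the arithmetic and are not results from the project.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values, for illustration only
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 210.0, 240.0])

residuals = y_true - y_hat                       # [-10, 10, -10, 10]
mse = np.mean(residuals ** 2)                    # 400 / 4 = 100
rmse = np.sqrt(mse)                              # 10, in the same units as profit

ss_res = np.sum(residuals ** 2)                  # 400
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 12500
r2 = 1 - ss_res / ss_tot                         # 1 - 400/12500 = 0.968

print(mse, rmse, r2)
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))  # True
print(np.isclose(r2, r2_score(y_true, y_hat)))             # True
```

MSE is in squared units, so its square root (RMSE) is often easier to read against actual profit figures.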
- Also visualized predicted vs actual profits using matplotlib or seaborn:
import matplotlib.pyplot as plt
plt.scatter(y_test, y_pred)
plt.xlabel("Actual Profit")
plt.ylabel("Predicted Profit")
plt.title("Actual vs Predicted Profit")
plt.show()
5. Feature Optimization (Optional)
- Used Backward Elimination with statsmodels to find which features are statistically significant.
import statsmodels.api as sm
X_opt = sm.add_constant(X)  # Adds intercept
model_ols = sm.OLS(y, X_opt).fit()
print(model_ols.summary())
- Based on the p-values, we can iteratively remove the least significant feature and refit, simplifying the model, ideally without a meaningful loss in accuracy.
Insights & Learnings
R&D Spend had the most significant positive impact on profit.
Marketing Spend also played a role, but to a lesser extent.
Administration costs and State had very little influence.
The project gave me hands-on exposure to:
- Feature engineering
- Dummy variable trap avoidance
- Model building and evaluation
- Statistical analysis with statsmodels
Tools & Libraries Used
Python
Pandas & NumPy
Scikit-learn
Statsmodels
Matplotlib & Seaborn
Google Colab
Conclusion
This project is a great real-world example of how machine learning models can help in business decision making. It allowed me to explore both technical ML implementation and business reasoning.
Written by

Lokesh Patidar
Hey, I'm Lokesh Patidar! I'm a 2nd-year student at SATI Vidisha, passionate about AI, Machine Learning, Full-Stack Development, and DSA. What I'm learning: currently exploring Machine Learning; completed DSA and Frontend Development; now exploring Backend Development. Interests: I love solving problems, building projects, and integrating AI into real-world applications. Excited to contribute to tech communities and share my learning journey! Follow my blog for insights on AI, ML, and Full-Stack projects!