Predicting Startup Profit using Multiple Linear Regression

Lokesh Patidar

Objective:

The goal of this project is to predict the profit of a startup from four inputs:

  • Research & Development (R&D) spend

  • Administration spend

  • Marketing spend

  • The State it operates in

With these predictions, we aim to help businesses make informed investment decisions.

Dataset Description: 50_Startups.csv

This dataset contains 50 records of startups, each with the following attributes:

| Feature | Description |
| --- | --- |
| R&D Spend | Money spent on research and development |
| Administration | Administrative expenses (rent, salaries, etc.) |
| Marketing Spend | Money spent on marketing and advertising |
| State | Categorical feature representing the startup location |
| Profit | Target variable: the net profit earned |
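
Because State is categorical, it has to be converted to numbers before a linear model can use it. The sketch below shows one common way to do that with `pd.get_dummies`, dropping one dummy column to avoid the dummy variable trap. The rows and the state labels `A`, `B`, `C` are made-up placeholders, not values from 50_Startups.csv, and this may differ from the preprocessing actually used in the project.

```python
import pandas as pd

# Made-up stand-in rows with the same columns as the dataset (not real data)
sample = pd.DataFrame({
    "R&D Spend": [100000.0, 80000.0, 60000.0, 40000.0],
    "Administration": [120000.0, 110000.0, 130000.0, 90000.0],
    "Marketing Spend": [300000.0, 250000.0, 200000.0, 150000.0],
    "State": ["A", "B", "C", "A"],  # placeholder state labels
    "Profit": [150000.0, 130000.0, 110000.0, 90000.0],
})

# One-hot encode State; drop_first=True removes one dummy column so the
# remaining dummies are not perfectly collinear with the intercept
encoded = pd.get_dummies(sample, columns=["State"], drop_first=True, dtype=float)

print(encoded.columns.tolist())
```

With three states, two dummy columns remain; the dropped state becomes the baseline that the other two are compared against.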

2. Splitting the Data

  • Defined the independent variables (X) and dependent variable (y = Profit)

  • Split the dataset into training and testing sets (80/20 ratio)

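```python
from sklearn.model_selection import train_test_split

# Features are every column except the target; df is the prepared DataFrame
X = df.drop("Profit", axis=1)
y = df["Profit"]
# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```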
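
To make the 80/20 ratio concrete, here is a small sketch on placeholder arrays with 50 rows, the same number of records as the dataset. The array contents are arbitrary; only the shapes matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 50 rows and 4 feature columns (values are arbitrary)
X_demo = np.arange(50 * 4).reshape(50, 4)
y_demo = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # (40, 4) (10, 4)

# Repeating the call with the same random_state reproduces the same split
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
same_split = np.array_equal(X_te, X_te_again)
```

With only 10 test rows, the evaluation scores can shift noticeably from one split to another, so they are best read as rough estimates.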

3. Model Training

  • Used Multiple Linear Regression to fit the training data

```python
from sklearn.linear_model import LinearRegression

# Fit an ordinary least squares model on the training split
model = LinearRegression()
model.fit(X_train, y_train)
```

  • The model fits a linear equation of the form:
    Profit = b0 + b1*(R&D Spend) + b2*(Administration) + b3*(Marketing Spend) + b4*(State_X) + ... + error
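
After fitting, the terms of this equation can be read off the model: `intercept_` holds b0 and `coef_` holds b1, b2, and so on, in column order. The sketch below illustrates this on synthetic, noise-free data whose coefficients (50000, 0.8, -0.02, 0.03) are invented for the demonstration and are not results from the project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 50
X_syn = pd.DataFrame({
    "R&D Spend": rng.uniform(0, 150_000, n),
    "Administration": rng.uniform(50_000, 180_000, n),
    "Marketing Spend": rng.uniform(0, 400_000, n),
})
# Synthetic target built from known, made-up coefficients and no noise
y_syn = (
    50_000
    + 0.8 * X_syn["R&D Spend"]
    - 0.02 * X_syn["Administration"]
    + 0.03 * X_syn["Marketing Spend"]
)

syn_model = LinearRegression().fit(X_syn, y_syn)

# Rebuild the fitted equation from intercept_ (b0) and coef_ (b1, b2, ...)
terms = [
    f"{coef:+.3f}*({name})"
    for name, coef in zip(X_syn.columns, syn_model.coef_)
]
equation = f"Profit = {syn_model.intercept_:.1f} " + " ".join(terms)
print(equation)
```

Since the synthetic target has no noise, the fitted values match the coefficients used to generate it. On real data they are estimates only.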

4. Model Evaluation

  • Evaluated using:

    • R² Score: The share of the variation in profit that the model explains (1.0 is a perfect fit)

    • Mean Squared Error (MSE): Measures average squared difference between predicted and actual profits

```python
from sklearn.metrics import r2_score, mean_squared_error

# Predict on the held-out test rows and score against the true profits
y_pred = model.predict(X_test)
print("R² Score:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
```

  • Also visualized predicted vs actual profits using matplotlib:

```python
import matplotlib.pyplot as plt

# Points close to the diagonal y = x are accurate predictions
plt.scatter(y_test, y_pred)
plt.xlabel("Actual Profit")
plt.ylabel("Predicted Profit")
plt.title("Actual vs Predicted Profit")
plt.show()
```
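
To show what the two metrics in this section measure, the sketch below computes them by hand with NumPy and checks the result against scikit-learn. The four profit values are made up so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Made-up actual and predicted profits (not from the dataset)
actual = np.array([100.0, 200.0, 300.0, 400.0])
predicted = np.array([110.0, 190.0, 310.0, 390.0])

residuals = actual - predicted

# MSE: the mean of the squared residuals
mse_manual = np.mean(residuals ** 2)                 # 100.0

# R² = 1 - SS_res / SS_tot
ss_res = np.sum(residuals ** 2)                      # 400.0
ss_tot = np.sum((actual - actual.mean()) ** 2)       # 50000.0
r2_manual = 1 - ss_res / ss_tot                      # 0.992

# The manual values agree with scikit-learn's implementations
mse_matches = np.isclose(mse_manual, mean_squared_error(actual, predicted))
r2_matches = np.isclose(r2_manual, r2_score(actual, predicted))
```

MSE is expressed in squared profit units, so it is hard to interpret alone; R² is unitless and easier to compare across models.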

5. Feature Optimization (Optional)

  • Used Backward Elimination with statsmodels to find which features are statistically significant.

```python
import statsmodels.api as sm

X_opt = sm.add_constant(X)  # Adds intercept
model_ols = sm.OLS(y, X_opt).fit()
print(model_ols.summary())
```

  • Based on the p-values, features that are not statistically significant can be removed one at a time, refitting the model after each removal, to simplify it with little or no loss of accuracy.

Insights & Learnings

  • R&D Spend had the most significant positive impact on profit.

  • Marketing Spend also played a role, but to a lesser extent.

  • Administration costs and State had very little influence.

  • The project gave me hands-on exposure to:

    • Feature engineering

    • Dummy variable trap avoidance

    • Model building and evaluation

    • Statistical analysis with statsmodels


🔧 Tools & Libraries Used

  • Python

  • Pandas & NumPy

  • Scikit-learn

  • Statsmodels

  • Matplotlib & Seaborn

  • Google Colab


Conclusion

This project is a good real-world example of how machine learning models can support business decision-making. It allowed me to explore both the technical ML implementation and the business reasoning behind it.


