Data Analytics and Python


Introduction
What is Data Analytics?
As the name suggests, data analytics is the analysis of data.
But what is this data, and why do we analyze it?
The data can be any digital data collected over time. It can come from many sources, such as social media, sales records, or customer data from e-commerce platforms.
Why do we need this analysis?
The main reason for this analysis is to extract insights from the data. Take a retail store like Costco: suppose that in a particular location, on a particular day every year, demand for a particular product suddenly rises. Using this insight, Costco can increase its supply of that product at that location ahead of that date.
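The retail scenario above can be sketched with pandas. This is only an illustration on made-up numbers, not real Costco data: the column names (`date`, `units_sold`), the sample values, and the "twice the median" threshold are all assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical daily sales for one product at one store (made-up numbers).
sales = pd.DataFrame({
    "date": pd.to_datetime([
        "2022-07-02", "2022-07-03", "2022-07-04", "2022-07-05",
        "2023-07-02", "2023-07-03", "2023-07-04", "2023-07-05",
    ]),
    "units_sold": [100, 110, 320, 105, 95, 120, 350, 100],
})

# Group by calendar day (month-day) so the same date lines up across years.
sales["month_day"] = sales["date"].dt.strftime("%m-%d")
avg_by_day = sales.groupby("month_day")["units_sold"].mean()

# Flag days whose average demand is at least twice the typical (median) day.
# The factor of 2 is an arbitrary threshold used only for illustration.
threshold = 2 * avg_by_day.median()
spike_days = avg_by_day[avg_by_day >= threshold]

print(spike_days)
```

On this sample, only July 4 stands out, which is the kind of recurring pattern a store could use to plan its stock.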
According to Google, Python is mentioned in data analytics job listings more often than any other language.
Percentage of job listings that mention each language, according to Google:
| Language | Percentage |
| Python | 70% |
| SQL | 69% |
| Java | 32% |
| R | 10-15% |
Why do data analysts love Python?
The answer is ease of use, rich libraries, and strong community support. Other popular languages, such as Java and R, have drawbacks that make them less suited to data analysis. Java requires a lot of boilerplate code for simple tasks, and it has no dataframe-like structure in its standard library, whereas Python offers one through pandas. R is geared more toward statistics and academic use than toward industry.
Python in Data Analysis
Data collection
Python has a simple syntax for data analysis tasks, and it offers a wide range of packages for data collection.
Python can read CSV files and other tabular files using pandas, communicate with websites, APIs, and web services using the requests package, and extract data from web pages through web scraping with the bs4 package (Beautiful Soup).
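As a rough sketch of these collection paths, the example below reads CSV text and a JSON payload into pandas and joins them. Everything here is made up for illustration: the inline strings stand in for a real file and a real API response, and the column names are hypothetical. The HTTP call itself is left out so the sketch runs offline.

```python
import io
import json

import pandas as pd

# Stand-in for a CSV file on disk; pd.read_csv accepts a path or a file-like object.
csv_text = "order_id,product,price\n1,laptop,900\n2,mouse,25\n"
orders = pd.read_csv(io.StringIO(csv_text))

# Stand-in for a JSON body returned by a web API.
# With the requests package this would typically come from requests.get(url).json().
api_body = '[{"order_id": 1, "country": "US"}, {"order_id": 2, "country": "IN"}]'
customers = pd.DataFrame(json.loads(api_body))

# Combine both sources on their shared key.
combined = orders.merge(customers, on="order_id")
print(combined)
```

Once both sources are DataFrames, the rest of the analysis does not need to care where each one came from.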
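Code in Python vs Java
Python:
import pandas as pd
df = pd.read_csv("data.csv")  # load the whole CSV into a DataFrame
Java:
// Needs: import java.io.BufferedReader; import java.io.FileReader;
try (BufferedReader br = new BufferedReader(new FileReader("data.csv"))) {
    String line;
    while ((line = br.readLine()) != null) {
        String[] values = line.split(",");  // naive split; ignores quoted commas
    }
}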
Data Preparation using Python
Data preparation is one of the crucial steps in data analysis and is also used in machine learning. Popular libraries for it include pandas, numpy, sklearn.preprocessing, and openpyxl. Python has strong community support and a good number of packages for data preparation.
Data preparation includes handling missing values, formatting data types, transforming data, etc.
Sample code for handling missing data:
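# Count the missing values in each column
print(df.isnull().sum())
# Option 1: drop every row that contains a missing value
df_dropped = df.dropna()
# Option 2: fill missing values with per-column defaults instead
df_filled = df.fillna(value={"price": 0, "category": "Unknown"})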
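The other two preparation steps mentioned above, data type formatting and transformation, can be sketched in the same way. The table, its column names, and the `total` column are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical raw data in which every column arrived as text.
raw = pd.DataFrame({
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-07"],
    "price": ["19.99", "5.50", "n/a"],
    "quantity": ["2", "1", "3"],
})

clean = raw.copy()
# Data type formatting: parse dates and numbers; unparseable prices become NaN.
clean["order_date"] = pd.to_datetime(clean["order_date"])
clean["price"] = pd.to_numeric(clean["price"], errors="coerce")
clean["quantity"] = clean["quantity"].astype(int)

# Transformation: derive a new column from existing ones.
clean["total"] = clean["price"] * clean["quantity"]

print(clean.dtypes)
print(clean)
```

Using `errors="coerce"` turns the bad price into a missing value, which can then be dropped or filled like any other missing value.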
Data Visualization
Data visualization helps us see the insights and patterns in the data. It includes showing the data as bar graphs, scatter plots, heatmaps, pie charts, etc.
Popular packages include Matplotlib, Seaborn, and pandas (through its built-in plotting). All of these are open source, and they work easily with data loaded from CSV files, Excel files, SQL databases, etc.
Sample Python code for data visualization
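import matplotlib.pyplot as plt
import seaborn as sns

# Load seaborn's sample "tips" dataset (downloaded on first use)
df = sns.load_dataset("tips")
# Box plot of the bill amount for each day of the week
sns.boxplot(x="day", y="total_bill", data=df)
plt.title("Total Bill Distribution by Day")
plt.show()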
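For a chart that does not depend on a downloaded sample dataset, here is a minimal sketch using only pandas and Matplotlib. The sales table, its column names, and the output file name are made up for illustration, and the non-interactive `Agg` backend is an assumption so that the script also runs on machines without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sales records (made-up numbers).
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South", "East"],
    "revenue": [120.0, 80.0, 100.0, 60.0, 90.0, 40.0],
})

# Aggregate first, then plot: total revenue per region, largest first.
revenue_by_region = (
    sales.groupby("region")["revenue"].sum().sort_values(ascending=False)
)

fig, ax = plt.subplots()
ax.bar(revenue_by_region.index, revenue_by_region.values)
ax.set_title("Revenue by Region")
ax.set_xlabel("Region")
ax.set_ylabel("Revenue")
fig.savefig("revenue_by_region.png")  # write the chart to an image file
```

In an interactive session, dropping the `matplotlib.use("Agg")` line and calling `plt.show()` would display the chart instead of saving it.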
Machine learning
Machine learning is a branch of AI in which a model learns patterns from training data and uses them to make predictions, without being explicitly programmed with rules.
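A very small sketch of "learning from data instead of explicit rules": the code below is never told the formula behind the numbers, yet it recovers it from examples. It uses a plain least-squares line fit from NumPy rather than a dedicated machine learning library, and the data is made up.

```python
import numpy as np

# Made-up training data that follows y = 3x + 2 exactly.
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = 3.0 * x_train + 2.0

# "Training": fit a degree-1 polynomial (a straight line) by least squares.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# "Prediction": apply the learned line to an input that was not in the training data.
x_new = 10.0
y_pred = slope * x_new + intercept

print(f"learned slope={slope:.2f}, intercept={intercept:.2f}, prediction={y_pred:.2f}")
```

Real datasets are noisy and have many features, but the workflow is the same: fit on known examples, then predict on new ones.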
Why is Python dominating the AI field?
Python has an easy syntax and huge community support, and it integrates well with databases and APIs, which is where the data lives.
Popular machine learning packages include scikit-learn, xgboost, lightgbm, and catboost. Deep learning libraries include TensorFlow, Keras, and PyTorch.
Sample Python code
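import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumes iris.csv contains the feature columns plus a "species" label column
df = pd.read_csv("iris.csv")
X = df.drop("species", axis=1)
y = df["species"]

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model and evaluate it on the held-out rows
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
preds = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, preds))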
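A single train/test split can give an optimistic or pessimistic score depending on which rows land in the test set. One common way to get a steadier estimate is k-fold cross-validation. The sketch below is a variation on the sample above, not part of it: it uses the copy of the iris dataset that ships with scikit-learn, so no CSV file is needed, and the choice of 5 folds and 100 trees is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Built-in copy of the iris dataset: feature matrix X and class labels y.
X, y = load_iris(return_X_y=True)

# random_state fixes the forest's randomness so reruns give the same scores.
model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validation: train on four folds, score on the fifth, and rotate.
scores = cross_val_score(model, X, y, cv=5)

print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", round(scores.mean(), 3))
```

Reporting the mean over the folds, along with the spread between them, says more about the model than one accuracy number from one split.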
Conclusion
In conclusion, Python is one of the most powerful and easiest-to-use languages for data analysis. It has good community support, balances scalability with simplicity, and is beginner-friendly.
Compared to Java, Python needs far less boilerplate code, which makes it more readable. Java also lacks a dataframe-like data structure in its standard library, which puts it at a disadvantage for data analysis. Python, by contrast, has a rich ecosystem of libraries, including pandas, NumPy, scikit-learn, and Matplotlib.
R is less versatile than Python and is mostly used for statistical analysis. Python can also be used for machine learning, automation, web development, etc., and it integrates well with production environments, which makes it a go-to language for data analysis.
Written by ramnisanth simhadri
