A Beginner’s Guide to Data Analysis in Python
Learning the many libraries and tools available in a programming language can be overwhelming for a beginner. Python is a popular language with a robust ecosystem for data analysis and processing. This article is a beginner's guide to data analysis in Python: it goes through the libraries, tools, and concepts that you will need to get started.
Prerequisites
Before we dive into data analysis in Python, there are some fundamental skills that you must have. This guide assumes that you already have a good understanding of basic programming concepts like variables, loops, and functions. If you lack a solid foundation in these concepts, take a beginner's course in Python before attempting data analysis.
Libraries
Pandas
When it comes to data analysis in Python, Pandas is the de facto library. It is a powerful library with a broad range of tools, and it is well suited to any dataset that can be represented as a table. It provides data structures for representing data, along with operations for transforming, cleaning, manipulating, and visualizing it.
A Pandas dataframe is a two-dimensional table that has rows and columns, and each column can have a different data type. The following example shows how to import the Pandas library and create a dataframe:
import pandas as pd
data = {
    "name": ["Python", "Java", "C++"],
    "popularity": [2, 1, 3]
}
df = pd.DataFrame(data)
print(df)
This will output:
     name  popularity
0  Python           2
1    Java           1
2     C++           3
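To make the point about per-column data types concrete, here is a minimal sketch that rebuilds the same dataframe and inspects it. The exact type names printed by dtypes depend on the installed Pandas version, so they are not shown here.

```python
import pandas as pd

data = {
    "name": ["Python", "Java", "C++"],
    "popularity": [2, 1, 3],
}
df = pd.DataFrame(data)

# Each column carries its own data type (text for name, integers for popularity)
print(df.dtypes)

# shape is (number of rows, number of columns)
print(df.shape)  # (3, 2)

# Selecting a single column returns a pandas Series
print(df["popularity"].tolist())  # [2, 1, 3]
```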
NumPy
NumPy is another powerful library, suited to scientific computing and data analysis. It has many tools for vector and matrix operations as well as statistical computation. A NumPy ndarray can hold tabular data much like a Pandas dataframe, but every element of an ndarray has the same type. Here is how to create an ndarray:
import numpy as np
arr = np.array([1, 2, 3])
print(arr)
The output will be:
[1 2 3]
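The vector, matrix, and statistical operations mentioned above can be sketched in a few lines. The array names v, m, and mixed are only illustrative.

```python
import numpy as np

v = np.array([1, 2, 3])
m = np.array([[1, 0, 0],
              [0, 2, 0],
              [0, 0, 3]])

# Arithmetic is applied element by element
print(v * 2)     # [2 4 6]

# The @ operator performs matrix multiplication (here matrix times vector)
print(m @ v)     # [1 4 9]

# Basic statistics are available as methods
print(v.mean())  # 2.0

# An ndarray has a single type: mixing ints and a float converts everything to float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64
```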
Matplotlib
Matplotlib is a widely used visualization library for Python. It offers a broad range of plot types, from simple line charts and scatter plots to many more.
import matplotlib.pyplot as plt
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
plt.plot(x, y, 'ro')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
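The output will be a plot of four red circle markers. The 'ro' format string asks for red ('r') circles ('o') and draws no connecting line.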
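The format string passed to plot() decides whether the points are joined by a line. The sketch below draws the same data twice, once as a solid line and once as markers only. It assumes the non-interactive "Agg" backend so that it runs without a display; that choice is an assumption of this sketch, not something the example above requires.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

fig, ax = plt.subplots()

# 'b-' draws a solid blue line through the points
line, = ax.plot(x, y, 'b-', label='line')

# 'ro' draws red circle markers with no connecting line
markers, = ax.plot(x, y, 'ro', label='markers')

ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.legend()
```

In an interactive session, remove the matplotlib.use("Agg") line and call plt.show() to see the figure.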
Data Analysis in Python
The pandas library provides many tools that are very useful for data analysis. In this section, we will discuss some of these tools.
Data Access
Using pandas to access data is simple, as shown in the following example:
import pandas as pd
df = pd.read_csv("data.csv")
print(df)
The read_csv function creates a pandas dataframe from a CSV file. The resulting dataframe can then be processed however we want.
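Because data.csv is not included with this article, here is a self-contained sketch that feeds read_csv an in-memory string instead of a file. The CSV contents are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file such as data.csv
csv_text = """name,age
Bob,25
Jane,32
Mike,18
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# head(n) shows the first n rows, which is handy for large files
print(df.head(2))

# The first line of the CSV becomes the column names
print(df.columns.tolist())  # ['name', 'age']
print(len(df))              # 3
```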
Data Filtering
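We can filter a dataframe using boolean indexing. A condition such as df['age'] > 20 is evaluated for every row and produces a column of True/False values. Indexing the dataframe with that column keeps only the rows where the condition is True. Here is an example: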
>>> import pandas as pd
>>> data = {
... 'name': ['Bob', 'Jane', 'Mike', 'Zac'],
... 'age': [25, 32, 18, 10]
... }
>>> df = pd.DataFrame(data)
>>> df[df['age'] > 20]
   name  age
0   Bob   25
1  Jane   32
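A short sketch of what happens behind the filter, and of how two conditions can be combined. The variable names mask and adults_under_30 are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Bob', 'Jane', 'Mike', 'Zac'],
    'age': [25, 32, 18, 10],
})

# The condition alone yields one True/False value per row
mask = df['age'] > 20
print(mask.tolist())  # [True, True, False, False]

# Combine conditions with & (and) or | (or); wrap each one in parentheses
adults_under_30 = df[(df['age'] >= 18) & (df['age'] < 30)]
print(adults_under_30['name'].tolist())  # ['Bob', 'Mike']
```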
Data Manipulation
Pandas provides many tools for data manipulation. These include:
Grouping
The groupby() method groups the rows of a dataframe by the values in one or more columns, so that an aggregate such as a mean can be computed for each group.
>>> import pandas as pd
>>> data = {
... 'name': ['Bob', 'Jane', 'Mike', 'Zac', 'Zara'],
... 'age': [25, 32, 18, 10, 24],
... 'gender': ['M', 'F', 'M', 'M', 'F']
... }
>>> df = pd.DataFrame(data)
>>> df.groupby('gender')['age'].mean()
gender
F    28.000000
M    17.666667
Name: age, dtype: float64
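Several aggregates can be computed per group in one call. This is a minimal sketch using the agg() method on the same data; the name summary is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Bob', 'Jane', 'Mike', 'Zac', 'Zara'],
    'age': [25, 32, 18, 10, 24],
    'gender': ['M', 'F', 'M', 'M', 'F'],
})

# agg() takes a list of aggregation names and returns one column per aggregate
summary = df.groupby('gender')['age'].agg(['mean', 'min', 'max', 'count'])
print(summary)

# Individual values can be read back with .loc[row_label, column_label]
print(summary.loc['F', 'mean'])   # 28.0
print(summary.loc['M', 'count'])  # 3
```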
Visualizations
Matplotlib is the usual choice for data visualization, and Pandas integrates well with it. A dataframe has a plot() method for creating customizable graphs, and its columns can also be passed straight to Matplotlib functions, as in the following example:
import pandas as pd
import matplotlib.pyplot as plt
data = {
    'name': ['Bertie', 'Sandra', 'Chris', 'Peter', 'Amy', 'Lana', 'Mila'],
    'age': [28, 30, 33, 25, 35, 40, 20],
    'salary': [60000, 90000, 80000, 50000, 120000, 100000, 55000]
}
df = pd.DataFrame(data)
plt.scatter(df.age, df.salary, color='r')
plt.title('Age vs Salary')
plt.xlabel('Age')
plt.ylabel('Salary')
plt.grid(True)
plt.show()
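The example above calls Matplotlib directly. The same chart can also be drawn through the dataframe's own plot() method, which returns a Matplotlib Axes object for further customization. This sketch assumes the non-interactive "Agg" backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'name': ['Bertie', 'Sandra', 'Chris', 'Peter', 'Amy', 'Lana', 'Mila'],
    'age': [28, 30, 33, 25, 35, 40, 20],
    'salary': [60000, 90000, 80000, 50000, 120000, 100000, 55000],
})

# x and y name the dataframe columns to plot; the call returns a Matplotlib Axes
ax = df.plot(kind='scatter', x='age', y='salary', color='r',
             title='Age vs Salary', grid=True)

# The returned Axes can be customized like any other Matplotlib plot
ax.set_xlabel('Age')
ax.set_ylabel('Salary')
```

In an interactive session, remove the matplotlib.use("Agg") line and call plt.show() to see the figure.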
Conclusion
In conclusion, Python provides a vast array of tools and libraries that make data analysis approachable. We have covered some of the libraries, tools, and concepts that you will need to get started with data analysis using Python. This is just the tip of the iceberg: the libraries mentioned here have a great deal of functionality that can take years to master. Still, this article should provide a good starting point for beginners.
Written by
Aniket Potabatti