Pandas is a popular open-source library for data manipulation and analysis in Python. It is built on top of NumPy, another popular Python library for numerical computation, which we will cover in a future post. Pandas provides data structures and functions that make it easy and efficient to work with large and complex data sets. In this blog post, we introduce Pandas and walk through some basic operations on its two most common data structures.
Main data structures
Pandas provides two primary data structures: Series and DataFrame. A Series is a one-dimensional array-like object that can hold any data type. A DataFrame is a two-dimensional tabular data structure consisting of rows and columns, where each column can have a different data type.
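To make the difference between the two structures concrete, here is a minimal sketch that builds one of each and inspects its dimensions. The variable names and sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
prices = pd.Series([1.5, 2.0, 3.25])
table = pd.DataFrame({"item": ["pen", "book", "bag"],
                      "price": [1.5, 2.0, 3.25]})

print(prices.ndim)   # 1: a Series has a single dimension
print(table.ndim)    # 2: a DataFrame has rows and columns
print(table.shape)   # (3, 2): three rows, two columns
print(table.dtypes)  # each column reports its own dtype
```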
Series data structure
To create a Series, we can pass a Python list or dictionary to the Series constructor. Here’s an example:
import pandas as pd
# Create a Series from a list
s = pd.Series([10, 20, 30, 40, 50])
print(s)
0    10
1    20
2    30
3    40
4    50
dtype: int64
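The example above uses a list. As a sketch of the dictionary case mentioned earlier, the keys become the index labels and the values become the data. The city names and numbers here are made-up sample data.

```python
import pandas as pd

# Create a Series from a dictionary: keys become the index labels
temps = pd.Series({"Madrid": 31, "Oslo": 18, "Cairo": 38})
print(temps)

# Look up a value by its label
print(temps["Oslo"])  # 18
```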
We can access Series data by index:
# Get the first element of the Series
print(s[0])
10
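Note that s[0] works here because the default index labels are the integers 0 to 4, so the label 0 happens to match the first position. When a Series has custom labels, it is usually clearer to be explicit: .loc looks up by label and .iloc by position. A small sketch, using a hypothetical Series named s2:

```python
import pandas as pd

# A Series with custom string labels (illustrative data)
s2 = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(s2.loc["b"])   # 20: lookup by label
print(s2.iloc[0])    # 10: lookup by position (first element)
print(s2.iloc[-1])   # 30: negative positions count from the end
```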
And we can also perform operations over the whole Series:
mean_value = s.mean()
max_value = s.max()
min_value = s.min()
print(f"mean: {mean_value}, max: {max_value}, min: {min_value}")
mean: 30.0, max: 50, min: 10
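Besides aggregations such as mean(), arithmetic and comparisons also apply to every element at once, with no explicit loop. The following sketch reuses the same values as above; the names doubled and above_mean are just illustrative.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

# Arithmetic is applied element by element
doubled = s * 2
print(doubled.tolist())     # [20, 40, 60, 80, 100]

# A comparison yields a boolean Series that can be used as a filter
above_mean = s[s > s.mean()]
print(above_mean.tolist())  # [40, 50]
```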
DataFrame data structure
To create a DataFrame, we can pass a dictionary of lists to the DataFrame constructor. Each key in the dictionary represents a column name, and each value is a list of values for that column. Here’s an example:
import pandas as pd
# create a DataFrame from a dictionary
data = {'name': ['Pablo', 'David', 'Alice', 'Bob'],
        'age': [32, 31, 35, 40],
        'gender': ['M', 'M', 'F', 'M']}
df = pd.DataFrame(data)
print(df)
    name  age gender
0  Pablo   32      M
1  David   31      M
2  Alice   35      F
3    Bob   40      M
A DataFrame allows us to access the data in different ways:
We can access DataFrame data by column name:
# Get the 'name' column of the DataFrame
print(df['name'])
0    Pablo
1    David
2    Alice
3      Bob
Name: name, dtype: object
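Selecting a single column returns a Series, as shown above. To select several columns at once, we can pass a list of column names, which returns a smaller DataFrame. A minimal sketch using the same data (the name subset is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Pablo', 'David', 'Alice', 'Bob'],
                   'age': [32, 31, 35, 40],
                   'gender': ['M', 'M', 'F', 'M']})

# A list of column names (note the double brackets) returns a DataFrame
subset = df[['name', 'age']]
print(type(df['name']).__name__)  # Series
print(type(subset).__name__)      # DataFrame
print(subset.shape)               # (4, 2)
```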
We can access DataFrame data by row label using .loc (with the default index, the labels are simply the row numbers):
# Get the first row of the DataFrame
print(df.loc[0])
name      Pablo
age          32
gender        M
Name: 0, dtype: object
We can access DataFrame data by row index and column name:
# Get the value in the 'name' column of the first row
print(df.loc[0, 'name'])
Pablo
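Since .loc works with labels, it becomes more expressive once the rows have meaningful labels, while .iloc always works with integer positions. The sketch below assumes we want to use the 'name' column as the index; by_name is a hypothetical variable name.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Pablo', 'David', 'Alice', 'Bob'],
                   'age': [32, 31, 35, 40],
                   'gender': ['M', 'M', 'F', 'M']})

# Use the 'name' column as the row labels
by_name = df.set_index('name')

# .loc selects by label: row 'Alice', column 'age'
print(by_name.loc['Alice', 'age'])  # 35

# .iloc selects by position: first row, first column ('age')
print(by_name.iloc[0, 0])           # 32
```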
Of course, we also want to perform operations on the data in a DataFrame. This works much like it does for a Series:
# Calculate the mean of the 'age' column
mean_age = df['age'].mean()
# Count the number of males and females in the 'gender' column
gender_counts = df['gender'].value_counts()
# Get the names of people older than 33
names = df.loc[df['age'] > 33, 'name'].tolist()
print(f"mean_age: {mean_age}")
print(gender_counts)
print(f"People over 33: {names}")  # names is already a plain Python list
mean_age: 34.5
M 3
F 1
People over 33: ['Alice', 'Bob']
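The filter in the last step is worth unpacking: the comparison df['age'] > 33 produces a boolean Series with one True or False per row, and that Series selects the matching rows. The following sketch breaks it into steps and combines two conditions; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Pablo', 'David', 'Alice', 'Bob'],
                   'age': [32, 31, 35, 40],
                   'gender': ['M', 'M', 'F', 'M']})

# Step 1: the comparison yields one boolean per row
mask = df['age'] > 33
print(mask.tolist())  # [False, False, True, True]

# Step 2: the boolean mask keeps only the rows where it is True
over_33 = df[mask]
print(over_33['name'].tolist())  # ['Alice', 'Bob']

# Conditions can be combined with & (and) or | (or);
# each condition needs its own parentheses
males_over_33 = df[(df['age'] > 33) & (df['gender'] == 'M')]
print(males_over_33['name'].tolist())  # ['Bob']
```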
Conclusion
In this blog post, we provided an introduction to Pandas and some basic operations with the two most used data structures. Pandas is a powerful tool for data manipulation and analysis, and we only scratched the surface of its capabilities. We encourage you to explore more and experiment with the various functions and methods provided by Pandas. Happy coding!