Pandas is a popular open-source Python library for data manipulation and analysis. It is built on top of NumPy, another popular library for numerical computing, which we will cover in a future post. Pandas provides data structures and functions for working efficiently with large and complex data sets. In this blog post, we introduce Pandas and walk through some basic operations on its two most common data structures.

Main data structures

Pandas provides two primary data structures: Series and DataFrame. A Series is a one-dimensional, labeled array that can hold values of any data type. A DataFrame is a two-dimensional tabular data structure made up of rows and columns, where each column can have a different data type.

Series data structure

To create a Series, we can pass a Python list or dictionary to the Series constructor. Here’s an example using a list:

Python
import pandas as pd

# Create a Series from a list
s = pd.Series([10, 20, 30, 40, 50])

print(s)
0    10
1    20
2    30
3    40
4    50
dtype: int64
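
The constructor also accepts a dictionary, as mentioned above. The following is a minimal sketch of that case; the fruit names and prices are made-up values used only for illustration. The dictionary keys become the index labels of the Series:

```python
import pandas as pd

# Hypothetical data: keys become the index labels, values become the data
prices = pd.Series({"apple": 1.5, "banana": 0.5, "cherry": 3.0})

print(prices["banana"])    # look up a value by its label
print(list(prices.index))  # the dictionary keys, in insertion order
```

Using a dictionary is handy when each value already has a natural name, because we can then look values up by that name instead of by position.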

We can access Series data by index label. With the default index, the labels are the integers 0, 1, 2, and so on:

Python
# Get the first element of the Series
print(s[0])
10
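
When a Series has a custom index, it helps to be explicit about whether we mean a label or a position. The sketch below assumes a small Series with the hypothetical labels 'a', 'b' and 'c', and uses `loc` for label-based access and `iloc` for position-based access:

```python
import pandas as pd

# Hypothetical labels, chosen only for illustration
s2 = pd.Series([10, 20, 30], index=["a", "b", "c"])

by_label = s2.loc["b"]     # access by index label
by_position = s2.iloc[0]   # access by integer position

print(by_label)     # 20
print(by_position)  # 10
```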

And we can also perform operations over the whole Series:

Python
mean_value = s.mean()
max_value = s.max()
min_value = s.min()

print(f"mean: {mean_value}, max: {max_value}, min: {min_value}")
mean: 30.0, max: 50, min: 10
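
Besides aggregations such as the mean, operations can also be applied element by element. As a small sketch using the same Series as above, we can multiply every value by a number or keep only the values that satisfy a condition:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

doubled = s * 2      # arithmetic is applied to every element
large = s[s > 25]    # keep only the elements greater than 25

print(doubled.tolist())  # [20, 40, 60, 80, 100]
print(large.tolist())    # [30, 40, 50]
```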

DataFrame data structure

To create a DataFrame, we can pass a dictionary of lists to the DataFrame constructor. Each key in the dictionary represents a column name, and each value is a list of values for that column. Here’s an example:

Python
import pandas as pd

# create a DataFrame from a dictionary
data = {'name': ['Pablo', 'David', 'Alice', 'Bob'],
        'age': [32, 31, 35, 40],
        'gender': ['M', 'M', 'F', 'M']}

df = pd.DataFrame(data)

print(df)
    name  age gender
0  Pablo   32      M
1  David   31      M
2  Alice   35      F
3    Bob   40      M
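
After creating a DataFrame, it is often useful to check its size and column names. The sketch below rebuilds the same DataFrame and inspects it; it also checks that the 'age' column holds integers, which illustrates that each column carries its own data type:

```python
import pandas as pd

data = {'name': ['Pablo', 'David', 'Alice', 'Bob'],
        'age': [32, 31, 35, 40],
        'gender': ['M', 'M', 'F', 'M']}
df = pd.DataFrame(data)

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # the column names
print(df.dtypes)         # the data type of each column

# The 'age' column is stored as integers, unlike the text columns
age_is_integer = pd.api.types.is_integer_dtype(df['age'])
print(age_is_integer)
```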

A DataFrame allows us to access the data in different ways:

We can access DataFrame data by column name:

Python
# Get the 'name' column of the DataFrame
print(df['name'])
0    Pablo
1    David
2    Alice
3      Bob
Name: name, dtype: object
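
Selecting a single column returns a Series. If we pass a list of column names instead, we get back a smaller DataFrame. A minimal sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Pablo', 'David', 'Alice', 'Bob'],
                   'age': [32, 31, 35, 40],
                   'gender': ['M', 'M', 'F', 'M']})

# Note the double brackets: the inner pair is a Python list of column names
subset = df[['name', 'age']]

print(type(subset).__name__)  # DataFrame
print(subset.shape)           # (4, 2)
```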

We can access DataFrame data by row label using loc. With the default index, the row labels are the integers 0, 1, 2, and so on:

Python
# Get the first row of the DataFrame
print(df.loc[0])
name     Pablo
age         32
gender       M
Name: 0, dtype: object
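
The difference between labels and positions becomes visible once the index is no longer the default one. As an illustration (using the 'name' column as the index is our own choice here, not something required by Pandas), `loc` looks rows up by label while `iloc` looks them up by integer position:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Pablo', 'David', 'Alice', 'Bob'],
                   'age': [32, 31, 35, 40],
                   'gender': ['M', 'M', 'F', 'M']})

# Use the 'name' column as the row labels
by_name = df.set_index('name')

alice_age = by_name.loc['Alice', 'age']   # row selected by label
first_age = by_name.iloc[0]['age']        # row selected by position

print(alice_age)  # 35
print(first_age)  # 32
```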

We can access DataFrame data by row index and column name:

Python
# Get the value in the 'name' column of the first row
print(df.loc[0, 'name'])
Pablo
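
The same `loc` syntax also works with ranges of rows and lists of columns. One detail worth knowing is that, unlike regular Python slices, a `loc` slice includes its end label. A short sketch:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Pablo', 'David', 'Alice', 'Bob'],
                   'age': [32, 31, 35, 40],
                   'gender': ['M', 'M', 'F', 'M']})

# Rows labeled 0 through 1 (both included), columns 'name' and 'age'
first_two = df.loc[0:1, ['name', 'age']]

print(first_two.shape)             # (2, 2)
print(first_two['name'].tolist())  # ['Pablo', 'David']
```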

Of course, we also want to perform operations on the data in the DataFrame. We can do this in much the same way as with the Series data structure:

Python
# Calculate the mean of the 'age' column
mean_age = df['age'].mean()

# Count the number of males and females in the 'gender' column
gender_counts = df['gender'].value_counts()

# Get the names of people older than 33
names = df.loc[df['age'] > 33, 'name'].tolist()

print(f"mean_age: {mean_age}")
print(gender_counts)
print(f"People over 33: {names}")
mean_age: 34.5
M    3
F    1
People over 33: ['Alice', 'Bob']
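
The filter used to find people older than 33 relies on a boolean mask: the comparison produces one True or False value per row, and `loc` keeps only the rows marked True. The sketch below shows the mask on its own and then combines two conditions with `&`; the second condition is our own addition for illustration:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Pablo', 'David', 'Alice', 'Bob'],
                   'age': [32, 31, 35, 40],
                   'gender': ['M', 'M', 'F', 'M']})

# One True/False value per row
mask = df['age'] > 33
print(mask.tolist())  # [False, False, True, True]

# Combine conditions with & and wrap each one in parentheses
older_males = df.loc[(df['age'] > 33) & (df['gender'] == 'M'), 'name'].tolist()
print(older_males)    # ['Bob']
```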

Conclusion

In this blog post, we provided an introduction to Pandas and some basic operations with its two most commonly used data structures. Pandas is a powerful tool for data manipulation and analysis, and we have only scratched the surface of its capabilities. We encourage you to explore further and experiment with the many functions and methods that Pandas provides. Happy coding!
