Your Practical Guide to Data Wrangling in Python
Pandas is a Python library for storing, manipulating, cleaning, and analyzing tabular data (like Excel sheets or SQL tables). It’s a core tool in everyday data science work.
pip install pandas
import pandas as pd
ages = pd.Series([22, 38, 26])
print(ages)
df = pd.DataFrame({'Age': [10, 12], 'Name': ['Alice', 'Bob']})
print(df.head())
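To make the relationship between the two structures concrete, here is a minimal sketch using the same toy values as above. The variable name `people` is only illustrative.

```python
import pandas as pd

# A Series is a single labeled column of values.
ages = pd.Series([22, 38, 26], name="Age")

# A DataFrame is a table; each of its columns is itself a Series.
people = pd.DataFrame({"Age": [10, 12], "Name": ["Alice", "Bob"]})

print(ages.index.tolist())    # default integer labels: [0, 1, 2]
print(people.shape)           # (rows, columns): (2, 2)
print(type(people["Age"]))    # selecting one column returns a Series
```

If you don't pass an index, pandas labels the rows 0, 1, 2, and so on, which is why positions and labels often look identical in small examples.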
We’ll use the Titanic dataset, a classic dataset containing information about passengers aboard the Titanic. You can download the Titanic dataset from here.
# to read the titanic dataset
df = pd.read_csv('train.csv')
df.shape # Number of rows & columns
df.columns # Column names
df.info() # Data types & non-null counts
df.describe() # Stats for numeric columns
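If you don't have `train.csv` at hand yet, the inspection calls can be tried on a tiny stand-in table. The four rows below are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical four-row sample; the real train.csv is much larger.
sample = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [22.0, 38.0, None, 35.0],
    "Pclass": [3, 1, 3, 1],
})

print(sample.shape)               # (4, 3)
sample.info()                     # prints dtypes and non-null counts
stats = sample.describe()         # numeric columns only by default
print(stats.loc["count", "Age"])  # 3.0, because the missing age is skipped
```

Note that `describe()` leaves out the text column `Name` and counts only the non-missing ages.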
df['Age'] # Single column
df[['Name', 'Age']] # Multiple columns
df.iloc[0] # First row
df.loc[0] # Row whose index label is 0 (the first row with the default index)
df.iloc[0:5] # First 5 rows
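With the default index, `iloc` and `loc` can look interchangeable. A small sketch with a non-default index shows where they differ; the index labels 10, 20, 30 are made up for illustration.

```python
import pandas as pd

sample = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Cara"], "Age": [30, 25, 41]},
    index=[10, 20, 30],  # custom labels, so position != label
)

first_by_position = sample.iloc[0]   # position 0 -> Alice
row_by_label = sample.loc[20]        # label 20 -> Bob
# sample.loc[0] would raise a KeyError here: no row is labeled 0.

pos_slice = sample.iloc[0:2]         # positions 0 and 1 (end excluded)
label_slice = sample.loc[10:20]      # labels 10 through 20 (end included)

print(first_by_position["Name"], row_by_label["Name"])
print(len(pos_slice), len(label_slice))
```

The slicing difference is easy to miss: `iloc` slices exclude the end position like regular Python slices, while `loc` slices include the end label.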
df[df['Age'] > 60]
df[(df['Sex'] == 'female') & (df['Pclass'] == 1)]
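The filters above work because a comparison on a column returns a boolean Series, which pandas then uses as a row mask. Here is a small sketch on made-up rows that also shows `|` (or) and `~` (not).

```python
import pandas as pd

# Hypothetical rows, using the same column names as the Titanic data.
sample = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],
    "Pclass": [1, 3, 3, 1],
    "Age": [62.0, 70.0, None, 24.0],
})

over_60 = sample["Age"] > 60   # boolean mask; a missing age compares as False
print(over_60.tolist())        # [True, True, False, False]

# Each condition needs its own parentheses because & and | bind tighter than ==.
first_class_women = sample[(sample["Sex"] == "female") & (sample["Pclass"] == 1)]
women_or_over_60 = sample[(sample["Sex"] == "female") | over_60]
not_first_class = sample[~(sample["Pclass"] == 1)]

print(len(first_class_women), len(women_or_over_60), len(not_first_class))
```

Use `&`, `|`, and `~` rather than Python's `and`, `or`, and `not`, which raise an error on a Series.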
df.isnull().sum()
df.dropna(inplace=False) # Returns a new DataFrame without rows that contain missing values; df itself is unchanged
mean_age = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_age) # Assign the result back; inplace=True on a selected column is deprecated in recent pandas
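The missing-data steps can be checked end to end on a tiny frame. The values below are made up so the results are easy to verify by hand.

```python
import pandas as pd

# Hypothetical three-row sample with gaps in both columns.
sample = pd.DataFrame({
    "Age": [20.0, None, 40.0],
    "Cabin": ["C85", None, None],
})

missing = sample.isnull().sum()              # Age: 1, Cabin: 2
dropped = sample.dropna()                    # keeps only fully populated rows
dropped_age = sample.dropna(subset=["Age"])  # only requires Age to be present

mean_age = sample["Age"].mean()              # 30.0; mean() skips missing values
sample["Age"] = sample["Age"].fillna(mean_age)

print(missing.to_dict())
print(len(dropped), len(dropped_age))
print(sample["Age"].tolist())                # [20.0, 30.0, 40.0]
```

Dropping every row with any gap can throw away a lot of data, so `subset=` is often the gentler choice when only one column matters.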
df.rename(columns={'Pclass': 'PassengerClass'}, inplace=True)
df['AgeGroup'] = df['Age'].apply(lambda x: 'Child' if x < 18 else 'Adult')
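For a simple two-way split like this, a vectorized alternative to `apply` is `numpy.where`. The sketch below compares both on made-up ages and is only one possible way to write it.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value.
sample = pd.DataFrame({"Age": [5.0, 17.0, 18.0, 40.0, None]})

# Row-by-row version, as in the line above.
with_apply = sample["Age"].apply(lambda x: "Child" if x < 18 else "Adult")

# Vectorized version: one comparison over the whole column.
with_where = pd.Series(
    np.where(sample["Age"] < 18, "Child", "Adult"),
    index=sample.index,
)

print(with_apply.tolist())
print(with_where.tolist())
```

Both versions label the missing age as "Adult", because comparing a missing value with `< 18` is False. That is one reason to fill or drop missing ages before deriving a column from them.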
df.groupby('PassengerClass')['Age'].mean()
df.groupby('Sex')['Survived'].sum()
df['PassengerClass'].value_counts()
df.sort_values(by='Age', ascending=True).head()
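Several summaries can be computed in a single pass by combining `groupby()` with `agg()`. This is a minimal sketch on made-up rows; the output column names (`mean_age`, `survivors`, and so on) are my own choice.

```python
import pandas as pd

# Hypothetical rows using the renamed PassengerClass column.
sample = pd.DataFrame({
    "PassengerClass": [1, 1, 3, 3, 3],
    "Age": [40.0, 60.0, 20.0, 30.0, 10.0],
    "Survived": [1, 0, 0, 1, 1],
})

# Named aggregation: new_column=(source_column, function)
summary = sample.groupby("PassengerClass").agg(
    mean_age=("Age", "mean"),
    survivors=("Survived", "sum"),
    survival_rate=("Survived", "mean"),
    passengers=("Survived", "size"),
)

print(summary)
```

Because `Survived` is 0 or 1, its sum is the number of survivors and its mean is the survival rate, which is usually the fairer comparison between groups of different sizes.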
# pd.cut() groups a numeric column into labeled bins
bins = [0, 20, 30, 40, 50, 60, 80]
labels = ['0-20', '21-30', '31-40', '41-50', '51-60', '61+']
df['AgeBin'] = pd.cut(df['Age'], bins=bins, labels=labels)
df['AgeBin'].value_counts().sort_index()
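It helps to see exactly where the bin edges fall. The sketch below runs the same bins and labels on a handful of made-up ages chosen to sit on or near the edges.

```python
import pandas as pd

# Hypothetical ages picked to land on or near bin edges.
ages = pd.Series([0.42, 20.0, 20.5, 60.0, 80.0, None])

bins = [0, 20, 30, 40, 50, 60, 80]
labels = ["0-20", "21-30", "31-40", "41-50", "51-60", "61+"]

# By default each bin includes its right edge and excludes its left edge.
binned = pd.cut(ages, bins=bins, labels=labels)
counts = binned.value_counts().sort_index()

print(binned.tolist())
print(counts)
```

An age of exactly 20 lands in "0-20" and an age of exactly 60 lands in "51-60", while missing ages stay missing. With these defaults an age of exactly 0 would fall outside the first bin; passing `include_lowest=True` to `pd.cut` covers that case.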
Task | Function / Method |
---|---|
Load CSV | pd.read_csv() |
Inspect data | df.info(), df.describe() |
Select data | df['col'], df.iloc[], df.loc[] |
Filter rows | Boolean indexing |
Handle missing data | isnull(), dropna(), fillna() |
Group & summarize | groupby(), agg() |
Sort data | sort_values() |
Bucket data | pd.cut() |
In Day 3, we’ll explore:
Code for Day 2 can be found in my GitHub repository: Day 2 of Data Science
All content is licensed under the CC BY-SA 4.0 License unless otherwise specified