Combining datasets is crucial in real-world data science. Pandas offers SQL-style merging, index-based joining, and simple concatenation.
```python
import pandas as pd

# Load the Titanic dataset
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

# Create simplified DataFrames for merging
passenger_info = df[['PassengerId', 'Name', 'Pclass', 'Sex', 'Age']]
fare_info = df[['PassengerId', 'Fare', 'Embarked']]
```
### merge() – SQL-style join

```python
merged_df = pd.merge(passenger_info, fare_info, on='PassengerId', how='inner')
```

- `how='inner'`: keeps only rows whose `PassengerId` appears in both DataFrames.
- `how='left'`: keeps all rows from `passenger_info` and matches from `fare_info`.
- `how='right'`: keeps all rows from `fare_info` and matches from `passenger_info`.

```python
# LEFT JOIN example
left_joined = pd.merge(passenger_info, fare_info, on='PassengerId', how='left')

# OUTER JOIN example, with NaN where there is no match
outer_joined = pd.merge(passenger_info, fare_info, on='PassengerId', how='outer')

# If the key columns are named differently (df1/df2 are placeholders here)
pd.merge(df1, df2, left_on='id1', right_on='id2')

# Merge aggregated per-class info back onto the passenger table
pclass_info = df[['Pclass', 'Fare']].groupby('Pclass').mean().reset_index()
pd.merge(passenger_info, pclass_info, on='Pclass', how='left')
```
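If you want to double-check which side each row came from, `merge()` also accepts `indicator=True`. A minimal sketch (not part of the original walkthrough, using the same DataFrames):

```python
# indicator=True adds a '_merge' column with values 'both', 'left_only',
# or 'right_only', which makes join results easy to sanity-check
checked = pd.merge(passenger_info, fare_info, on='PassengerId',
                   how='outer', indicator=True)
print(checked['_merge'].value_counts())
```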
### join() – Index-based join

```python
fare_info_indexed = fare_info.set_index('PassengerId')
joined_df = passenger_info.join(fare_info_indexed, on='PassengerId')
```
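`join()` is at its most natural when both frames are already indexed by the key; a small sketch assuming the variables above:

```python
# Both DataFrames indexed by PassengerId, so join() aligns purely on the index
passenger_indexed = passenger_info.set_index('PassengerId')
both_indexed = passenger_indexed.join(fare_info_indexed, how='left')
```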
### concat() – Stack vertically or horizontally

```python
# Vertically
df_part1 = df.iloc[:100]
df_part2 = df.iloc[100:200]
combined = pd.concat([df_part1, df_part2], axis=0)

# Horizontally
df1 = df[['PassengerId', 'Sex']]
df2 = df[['PassengerId', 'Fare']]
combined_cols = pd.concat([df1, df2.drop('PassengerId', axis=1)], axis=1)
```
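Two `concat()` options worth knowing (a sketch, not part of the original code): `ignore_index=True` renumbers the combined rows, while `keys=` labels which piece each row came from.

```python
# Renumber the combined index from 0
combined_reset = pd.concat([df_part1, df_part2], ignore_index=True)

# Or label each piece in a hierarchical index
labelled = pd.concat([df_part1, df_part2], keys=['first_100', 'next_100'])
print(labelled.loc['first_100'].shape)  # just the first chunk
```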
Window functions apply calculations across ordered subsets of your data, much like SQL window functions, and are ideal for feature engineering.

```python
# Sort so the window functions below run in a meaningful order
df_sorted = df.sort_values(['Pclass', 'Fare'], ascending=[True, False])
```
### rank() – Rank passengers by Fare within Pclass

```python
df_sorted['FareRank'] = df_sorted.groupby('Pclass')['Fare'].rank(method='dense', ascending=False)
```
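The `method` argument controls how ties are numbered. For comparison, a sketch using `method='min'` (the extra column is my own addition, not in the original):

```python
# 'min' gives tied fares the same rank but leaves gaps afterwards,
# whereas 'dense' (above) does not skip ranks
df_sorted['FareRankMin'] = df_sorted.groupby('Pclass')['Fare'].rank(method='min', ascending=False)
```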
### cumsum() – Cumulative Fare sum by Pclass

```python
df_sorted['FareCumSum'] = df_sorted.groupby('Pclass')['Fare'].cumsum()
```
### cumcount() – Order within groups

```python
df_sorted['CountWithinPclass'] = df_sorted.groupby('Pclass').cumcount() + 1
```
### rolling() – Moving average of Fare (window=3)

```python
df_sorted['FareRollingMean'] = df_sorted['Fare'].rolling(window=3).mean()

# Rolling mean within each class
df_sorted['RollingMeanWithinClass'] = (
    df_sorted.groupby('Pclass')['Fare'].transform(lambda x: x.rolling(3, min_periods=1).mean())
)
```
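As an aside (a sketch, not part of the original code), `groupby().rolling()` does the same job without a lambda; its result carries a MultiIndex, so the group level has to be dropped before assigning back:

```python
# Alternative to transform + lambda: groupby().rolling() returns a Series
# indexed by (Pclass, original row index); dropping level 0 realigns it
rolling_alt = (
    df_sorted.groupby('Pclass')['Fare']
    .rolling(3, min_periods=1)
    .mean()
    .reset_index(level=0, drop=True)
)
df_sorted['RollingMeanAlt'] = rolling_alt
```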
### expanding() – Expanding window average

```python
df_sorted['FareExpandingMean'] = (
    df_sorted.groupby('Pclass')['Fare'].transform(lambda x: x.expanding().mean())
)
```
### shift() – Get previous row’s fare

```python
df_sorted['FarePrev'] = df_sorted.groupby('Pclass')['Fare'].shift(1)
df_sorted['FareChange'] = df_sorted['Fare'] - df_sorted['FarePrev']
```
### diff() – Fare difference from previous row

```python
df_sorted['FareDiff'] = df_sorted.groupby('Pclass')['Fare'].diff()
```
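Closely related (a sketch, not in the original code): `pct_change()` expresses the same row-to-row change as a ratio rather than an absolute difference.

```python
# Relative fare change within each class: (Fare - previous Fare) / previous Fare
df_sorted['FarePctChange'] = df_sorted.groupby('Pclass')['Fare'].pct_change()
```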
| Task | Function |
|---|---|
| SQL joins | `merge()` |
| Index-based joins | `join()` |
| Stack/append | `concat()` |
| Rank within group | `rank()` |
| Cumulative stats | `cumsum()`, `cumcount()` |
| Rolling window | `rolling()` |
| Expanding window | `expanding()` |
| Previous row reference | `shift()` |
| Delta between rows | `diff()` |
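To see a few of these pieces working together, here is a short end-to-end sketch (my own combination, assuming the same `df`): the average fare per class is merged back onto the passenger table, then each fare is ranked within its class.

```python
# Per-class average fare, merged back as a feature, plus a within-class rank
class_avg = (
    df.groupby('Pclass', as_index=False)['Fare']
    .mean()
    .rename(columns={'Fare': 'ClassAvgFare'})
)
enriched = pd.merge(df, class_avg, on='Pclass', how='left')
enriched['FareRankInClass'] = enriched.groupby('Pclass')['Fare'].rank(method='dense', ascending=False)
```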
Code for Day 3 can be found in my GitHub repository: Day 3 of Data Science
All content is licensed under the CC BY-SA 4.0 License unless otherwise specified