Day 3 - Advanced Pandas – Merging, Joining & Window Functions

29 Jul 2025 5 mins read
Data Science NumPy Python 60DayDSChallenge

Day 3: Advanced Pandas – Merging, Joining & Window Functions

Part 1: Merging and Joining in Pandas

Combining datasets is crucial in real-world data science. Pandas offers SQL-style merging, index-based joining, and simple concatenation.


Setup

import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

# Create simplified DataFrames for merging
passenger_info = df[['PassengerId', 'Name', 'Pclass', 'Sex', 'Age']]
fare_info = df[['PassengerId', 'Fare', 'Embarked']]

merge() – SQL-style join

merged_df = pd.merge(passenger_info, fare_info, on='PassengerId', how='inner')
# LEFT JOIN example
left_joined = pd.merge(passenger_info, fare_info, on='PassengerId', how='left')
# OUTER JOIN example with NaN for missing
outer_joined = pd.merge(passenger_info, fare_info, on='PassengerId', how='outer')

Join on different column names

# If IDs are named differently
pd.merge(df1, df2, left_on='id1', right_on='id2')

Many-to-One join (Pclass info)

pclass_info = df[['Pclass', 'Fare']].groupby('Pclass').mean().reset_index()
pd.merge(passenger_info, pclass_info, on='Pclass', how='left')

join() – Index-based join

fare_info_indexed = fare_info.set_index('PassengerId')
joined_df = passenger_info.join(fare_info_indexed, on='PassengerId')

concat() – Stack vertically or horizontally

# Vertically
df_part1 = df.iloc[:100]
df_part2 = df.iloc[100:200]
combined = pd.concat([df_part1, df_part2], axis=0)

# Horizontally
df1 = df[['PassengerId', 'Sex']]
df2 = df[['PassengerId', 'Fare']]
combined_cols = pd.concat([df1, df2.drop('PassengerId', axis=1)], axis=1)

Part 2: Window Functions in Pandas

Window functions apply calculations across subsets of your data (like SQL). Ideal for feature engineering.


Setup: Sort by Pclass and Fare

df_sorted = df.sort_values(['Pclass', 'Fare'], ascending=[True, False])

rank() – Rank passengers by Fare within Pclass

df_sorted['FareRank'] = df_sorted.groupby('Pclass')['Fare'].rank(method='dense', ascending=False)

cumsum() – Cumulative Fare sum by Pclass

df_sorted['FareCumSum'] = df_sorted.groupby('Pclass')['Fare'].cumsum()

cumcount() – Order within groups

df_sorted['CountWithinPclass'] = df_sorted.groupby('Pclass').cumcount() + 1

rolling() – Moving average of Fare (window=3)

df_sorted['FareRollingMean'] = df_sorted['Fare'].rolling(window=3).mean()
# Rolling mean within each class
df_sorted['RollingMeanWithinClass'] = (
    df_sorted.groupby('Pclass')['Fare'].transform(lambda x: x.rolling(3, min_periods=1).mean())
)

expanding() – Expanding window average

df_sorted['FareExpandingMean'] = (
    df_sorted.groupby('Pclass')['Fare'].transform(lambda x: x.expanding().mean())
)

shift() – Get previous row’s fare

df_sorted['FarePrev'] = df_sorted.groupby('Pclass')['Fare'].shift(1)
df_sorted['FareChange'] = df_sorted['Fare'] - df_sorted['FarePrev']

diff() – Fare difference from previous row

df_sorted['FareDiff'] = df_sorted.groupby('Pclass')['Fare'].diff()

Summary Table

Task Function
SQL joins merge()
Index-based joins join()
Stack/append concat()
Rank within group rank()
Cumulative stats cumsum(), cumcount()
Rolling window rolling()
Expanding window expanding()
Previous row reference shift()
Delta between rows diff()

Code for Day 3 can be found on my github repository: Day 3 of Data Science


All content is licensed under the CC BY-SA 4.0 License unless otherwise specified