S30 AI Lab (www.thes30.com)
Lab #2

Pandas Lab

Easy · 🐍 Python & Data · W1 D3


Build strong intuition for Pandas transformations used in real ML pipelines (feature engineering, joins, groupby, leakage-safe logic).

By the end, students can write correct, efficient Pandas code, explain tradeoffs, and avoid common interview and production gotchas.


Tasks

1. DataFrame Fundamentals
2. GroupBy
3. Joins & Merge
4. Leakage-safe Time Features
5. Apply vs Vectorization

Interview Angles

  • Conditional aggregation best practices
  • Counting purchases via `event == 'purchase'` vs `amount > 0`
  • Why `.apply` is usually slow (Python calls per row)

FAANG Gotchas

  • How to detect many-to-many join explosion early (rowcount checks, key uniqueness)
  • Why `cumsum` without `shift` leaks labels on purchase rows

Asked At

Google, GitHub
Python 3 — Notebook
Dataset & Setup

Pandas Lab — Student Notebook

Complete each `# TODO` cell. Run the checks to verify your work. Use Shift+Enter to run each cell.
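The real dataset loads inside the hosted notebook, so it isn't shown here. For readers working offline, a minimal stand-in with the schema the later tasks imply could look like the sketch below; the column names and values are assumptions, not the lab's actual data.

```python
import pandas as pd

# Hypothetical stand-in for the lab's dataset; schema inferred from
# the tasks below (an event log plus a small users table).
events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "event": ["view", "purchase", "view", "purchase", "view"],
    "event_ts": pd.to_datetime([
        "2024-01-01 09:15", "2024-01-01 10:30", "2024-01-02 11:00",
        "2024-01-03 12:45", "2024-01-03 14:20",
    ]),
    "amount": [0.0, 19.99, 0.0, 5.50, 0.0],
})

users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["IN", "US", "IN"],
})

print(events.dtypes)
```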


Section 1 — DataFrame Fundamentals

Task 1.1: Filter purchases and extract event day

Filter `events` to only purchase rows, then create an `event_day` column by flooring `event_ts` to the day.

HINT: Use `events[events['event'] == 'purchase']` and `.dt.floor('D')`
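A sketch of one accepted shape, using a toy `events` frame since the lab's data isn't reproduced here:

```python
import pandas as pd

# Toy events frame (column names assumed from the task description).
events = pd.DataFrame({
    "event": ["view", "purchase", "purchase"],
    "event_ts": pd.to_datetime(["2024-01-01 09:15",
                                "2024-01-01 10:30",
                                "2024-01-02 23:59"]),
})

# Boolean mask keeps purchase rows; .copy() avoids a
# SettingWithCopyWarning when we add the new column.
purchases = events[events["event"] == "purchase"].copy()
purchases["event_day"] = purchases["event_ts"].dt.floor("D")
print(purchases)
```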


Section 2 — GroupBy

Task 2.1: Build user-level features with groupby

Group `events` by `user_id` and compute:

  • `n_events`: count of events
  • `n_purchases`: sum of purchase flags
  • `total_revenue`: sum of `amount`

HINT: Use `.assign()` to create `is_purchase`, then `.groupby().agg()` with named aggregations
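One common way to write this with named aggregations (toy data; the column names are the same assumed schema as above):

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event": ["view", "purchase", "purchase"],
    "amount": [0.0, 19.99, 5.50],
})

# Named aggregations give each output column an explicit
# (source column, aggregation) pair.
user_features = (
    events.assign(is_purchase=events["event"].eq("purchase"))
          .groupby("user_id")
          .agg(n_events=("event", "size"),
               n_purchases=("is_purchase", "sum"),
               total_revenue=("amount", "sum"))
          .reset_index()
)
print(user_features)
```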


Task 2.2: Row-level feature with transform

Add a `user_event_count` column to `events` using `transform('size')`, which keeps the original row count.

HINT: `groupby('user_id')['event'].transform('size')`
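A minimal sketch of the idea, on a toy frame: `transform` returns one value per input row, so the result aligns back onto the original frame instead of collapsing it.

```python
import pandas as pd

# Toy events frame; column names assumed from the task text.
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event": ["view", "purchase", "view"],
})

# Unlike .agg (one row per group), transform('size') broadcasts the
# group size back to every row, preserving the frame's length.
events["user_event_count"] = (
    events.groupby("user_id")["event"].transform("size")
)
print(events)
```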


Section 3 — Joins & Merge

Task 3.1: Left-join users with user_features

Merge `users` with `user_features` on `user_id` (left join). Fill NaN numeric columns with 0.

HINT: `users.merge(user_features, on='user_id', how='left')`, then `.fillna(0)`
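A sketch with toy frames (user 3 deliberately has no feature rows, so the left join surfaces NaN):

```python
import pandas as pd

# Toy frames; names assumed from the task description.
users = pd.DataFrame({"user_id": [1, 2, 3],
                      "country": ["IN", "US", "IN"]})
user_features = pd.DataFrame({"user_id": [1, 2],
                              "n_purchases": [1, 1],
                              "total_revenue": [19.99, 5.50]})

# Left join keeps every user; users missing from user_features get NaN.
out = users.merge(user_features, on="user_id", how="left")

# Fill only the numeric columns, leaving string columns untouched.
num_cols = out.select_dtypes("number").columns
out[num_cols] = out[num_cols].fillna(0)
print(out)
```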


Task 3.2: Diagnose and fix a many-to-many join explosion

Join `left` and `right` on key `k`. Notice the row explosion. Fix it by deduplicating the right side first.

FAANG gotcha: Always validate row counts after joins to catch many-to-many explosions early.
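One way to sketch both the diagnosis and the fix (toy frames; the real lab data may differ). `merge`'s `validate='many_to_one'` argument turns a silent explosion into an immediate error, which pairs well with the rowcount check.

```python
import pandas as pd

# Toy frames with a duplicated key on the right side.
left = pd.DataFrame({"k": [1, 1, 2], "x": [10, 20, 30]})
right = pd.DataFrame({"k": [1, 1, 2], "y": ["a", "b", "c"]})

# Key 1 appears twice on both sides, so those rows multiply: 2 x 2 + 1.
exploded = left.merge(right, on="k")
print(len(left), "->", len(exploded))   # rowcount check exposes the blowup

# Fix: make the right side unique per key before joining.
right_dedup = right.drop_duplicates("k")
fixed = left.merge(right_dedup, on="k", how="left",
                   validate="many_to_one")  # raises if right isn't unique
assert len(fixed) == len(left)
```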


Section 4 — Leakage-safe Time Features

Task 4.1: Cumulative purchase count (strictly before current row)

Sort events by user and time, compute the cumulative purchase count, then shift by 1 so each row only sees past data.

FAANG gotcha: Using `cumsum()` without `shift(1)` leaks the current row's label into the feature.

HINT: `groupby().cumsum()` then `groupby().shift(1).fillna(0)`
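A sketch of the leakage-safe version on a toy event log (schema assumed from the task text):

```python
import pandas as pd

# Toy per-user event log.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "event_ts": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-01"]),
    "event": ["purchase", "view", "purchase", "purchase"],
})

events = events.sort_values(["user_id", "event_ts"])
events["is_purchase"] = events["event"].eq("purchase").astype(int)

# cumsum() alone includes the current row, so a purchase row would
# "see" its own label. shift(1) within each user pushes the count one
# row back: every row now reflects strictly earlier events only.
events["past_purchases"] = (
    events.groupby("user_id")["is_purchase"].cumsum()
          .groupby(events["user_id"]).shift(1)
          .fillna(0).astype(int)
)
print(events[["user_id", "event", "past_purchases"]])
```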


Section 5 — Apply vs Vectorization

Task 5.1: Compare apply vs vectorized boolean

Create a `country_is_in` column two ways: with `.apply(lambda ...)` and with a vectorized `==`. Compare the timings.

Use a larger DataFrame so timing is meaningful.

HINT:

  • Build `u` with many rows using `pd.concat([users]*25000, ignore_index=True)`
  • Vectorized: `u['country'] == 'IN'`
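One way to run the comparison; the 4-row `users` frame here is a stand-in for the lab's own data. `.apply` pays a Python function call per row, while the vectorized comparison runs in one C-level pass.

```python
import time
import pandas as pd

users = pd.DataFrame({"country": ["IN", "US", "IN", "DE"]})
u = pd.concat([users] * 25000, ignore_index=True)   # 100,000 rows

t0 = time.perf_counter()
via_apply = u["country"].apply(lambda c: c == "IN")  # Python call per row
t_apply = time.perf_counter() - t0

t0 = time.perf_counter()
via_vec = u["country"] == "IN"                       # single C-level pass
t_vec = time.perf_counter() - t0

assert via_apply.equals(via_vec)                     # same answer, both ways
print(f"apply: {t_apply:.4f}s  vectorized: {t_vec:.4f}s")
```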