Pandas Lab
Build strong intuition for Pandas transformations used in real ML pipelines (feature engineering, joins, groupby, leakage-safe logic).
By the end, students should be able to write correct, efficient Pandas code, explain tradeoffs, and avoid common interview and production gotchas.
Interview Angles
- Conditional aggregation best practices
- Counting purchases via `event == 'purchase'` vs `amount > 0`
- Why `.apply` is usually slow (one Python function call per row)
FAANG Gotchas
- How to detect many-to-many join explosion early (rowcount checks, key uniqueness)
- Why `cumsum` without `shift` leaks labels on purchase rows
Pandas Lab — Student Notebook
Complete each # TODO cell. Run the checks to verify your work.
Use Shift+Enter to run each cell.
Section 1 — DataFrame Fundamentals
Task 1.1: Filter purchases and extract event day
Filter events to only purchase rows, then create an event_day column by flooring event_ts to day.
HINT: Use events[events['event'] == 'purchase'] and .dt.floor('D')
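A minimal sketch of the hint above. The `events` schema here (user_id, event, event_ts, amount) is an assumption to make the example self-contained; the lab's real dataset may differ.

```python
import pandas as pd

# Hypothetical events frame with the columns the lab assumes.
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event": ["view", "purchase", "purchase"],
    "event_ts": pd.to_datetime(
        ["2024-01-01 09:15", "2024-01-01 13:40", "2024-01-02 08:05"]
    ),
    "amount": [0.0, 19.99, 5.50],
})

# Boolean mask keeps only purchase rows; .copy() avoids SettingWithCopyWarning.
purchases = events[events["event"] == "purchase"].copy()

# .dt.floor('D') drops the time-of-day part, leaving a midnight timestamp.
purchases["event_day"] = purchases["event_ts"].dt.floor("D")
```

Note that `.dt.floor('D')` keeps the datetime dtype (useful for further date math), whereas `.dt.date` would downgrade the column to Python `date` objects.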
Section 2 — GroupBy
Task 2.1: Build user-level features with groupby
Group events by user_id and compute:
- n_events: count of events
- n_purchases: sum of purchase flags
- total_revenue: sum of amount
HINT: Use .assign() to create is_purchase, then .groupby().agg() with named aggregations
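One way the hint can be realized with named aggregations. The sample `events` data is an assumption for illustration:

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event": ["view", "purchase", "purchase"],
    "amount": [0.0, 19.99, 5.50],
})

user_features = (
    events
    # is_purchase is 1 on purchase rows, 0 otherwise, so sum() counts purchases.
    .assign(is_purchase=(events["event"] == "purchase").astype(int))
    .groupby("user_id")
    .agg(
        n_events=("event", "size"),          # rows per user
        n_purchases=("is_purchase", "sum"),  # purchase rows per user
        total_revenue=("amount", "sum"),     # revenue per user
    )
    .reset_index()
)
```

Named aggregation (`new_col=(source_col, func)`) keeps the output columns flat, avoiding the MultiIndex columns that `.agg({...})` with multiple functions produces.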
Task 2.2: Row-level feature with transform
Add a user_event_count column to events using transform('size') — this keeps the original row count.
HINT: groupby('user_id')['event'].transform('size')
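The key property of `transform` is that it broadcasts the per-group result back to every row, so the output aligns with the original index. A small sketch (sample data assumed):

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event": ["view", "purchase", "purchase"],
})

# transform('size') returns one value per ROW (the size of that row's group),
# unlike .agg('size'), which returns one value per GROUP.
events["user_event_count"] = (
    events.groupby("user_id")["event"].transform("size")
)
```

Because the row count is unchanged, the result can be assigned straight back as a feature column with no merge step.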
Section 3 — Joins & Merge
Task 3.1: Left-join users with user_features
Merge users with user_features on user_id (left join). Fill NaN numeric columns with 0.
HINT: users.merge(user_features, on='user_id', how='left') then .fillna(0)
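A sketch of the left join. One subtlety worth showing beyond the hint: a blanket `.fillna(0)` would also fill string columns, so restricting the fill to numeric columns is safer. Sample frames are assumptions:

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "country": ["IN", "US", "IN"]})
user_features = pd.DataFrame({"user_id": [1, 2], "total_revenue": [19.99, 5.50]})

# Left join: every user survives; users without features get NaN.
joined = users.merge(user_features, on="user_id", how="left")

# Fill NaN only in numeric columns, leaving string columns untouched.
num_cols = joined.select_dtypes("number").columns
joined[num_cols] = joined[num_cols].fillna(0)
```

With `how='left'`, `len(joined) == len(users)` only if `user_id` is unique on the right side, which is exactly the check Task 3.2 drills into.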
Task 3.2: Diagnose and fix a many-to-many join explosion
Join left and right on key k. Notice the row explosion. Fix it by deduplicating the right side first.
FAANG gotcha: Always validate row counts after joins to catch many-to-many explosions early.
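The explosion and the fix can be demonstrated on a tiny example (frames assumed). The rowcount assertion and `merge(validate=...)` are the two early-warning mechanisms the gotcha refers to:

```python
import pandas as pd

left = pd.DataFrame({"k": [1, 2], "x": ["a", "b"]})
right = pd.DataFrame({"k": [1, 1, 1, 2], "y": [10, 11, 12, 20]})

# Naive join: each left row matches EVERY right row with the same key,
# so 2 left rows become 4 output rows.
exploded = left.merge(right, on="k")
assert len(exploded) > len(left)  # rowcount check catches the blow-up

# Fix: deduplicate the right side to one row per key before joining.
dedup = right.drop_duplicates(subset="k", keep="first")

# validate='many_to_one' makes pandas raise if the right key is not unique.
fixed = left.merge(dedup, on="k", how="left", validate="many_to_one")
```

`keep="first"` is an arbitrary dedup policy for illustration; in a real pipeline you would pick the row deliberately (e.g. latest timestamp) before dropping duplicates.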
Section 4 — Leakage-safe Time Features
Task 4.1: Cumulative purchase count (strictly before current row)
Sort events by user and time, compute cumulative purchase count, then shift by 1 so each row only sees past data.
FAANG gotcha: Using cumsum() without shift(1) leaks the current row's label into the feature.
HINT: groupby().cumsum() then groupby().shift(1).fillna(0)
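A sketch of the leakage-safe pattern on assumed sample data. The middle line is the whole point: `cumsum` includes the current row, so `shift(1)` within each user group is what makes the feature strictly past-only:

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "event_ts": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-01"]
    ),
    "event": ["purchase", "view", "purchase", "purchase"],
})

# Sort so cumulative counts follow each user's timeline.
events = events.sort_values(["user_id", "event_ts"]).reset_index(drop=True)

is_purchase = (events["event"] == "purchase").astype(int)

# cumsum alone would count the CURRENT row's purchase (label leakage);
# shift(1) pushes everything down one row so each row sees only the past.
events["past_purchases"] = (
    is_purchase.groupby(events["user_id"]).cumsum()
               .groupby(events["user_id"]).shift(1)
               .fillna(0)
)
```

For user 1 the purchase flags are [1, 0, 1]; `cumsum` gives [1, 1, 2] and the shift turns that into [0, 1, 1], so the first event correctly sees zero prior purchases.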
Section 5 — Apply vs Vectorization
Task 5.1: Compare apply vs vectorized boolean
Create a country_is_in column two ways: with .apply(lambda) and with a vectorized == comparison. Compare the timings.
Use a larger DataFrame so timing is meaningful.
HINT:
- Build u with many rows using pd.concat([users]*25000, ignore_index=True)
- Vectorized: u['country'] == 'IN'
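Putting the hints together, a sketch of the comparison (the `users` frame is an assumption). `.apply` runs one Python function call per row, while `==` runs a single loop in compiled code over the whole column, which is why the vectorized version is typically orders of magnitude faster:

```python
import time
import pandas as pd

users = pd.DataFrame({"user_id": range(4), "country": ["IN", "US", "IN", "DE"]})
u = pd.concat([users] * 25000, ignore_index=True)  # 100_000 rows

# Row-wise: one Python-level lambda call per row.
t0 = time.perf_counter()
slow = u["country"].apply(lambda c: c == "IN")
t_apply = time.perf_counter() - t0

# Vectorized: one C-level comparison over the whole column.
t0 = time.perf_counter()
fast = u["country"] == "IN"
t_vec = time.perf_counter() - t0

u["country_is_in"] = fast
assert (slow == fast).all()  # same answer, very different cost
```

In a notebook, `%timeit` on each expression gives more stable numbers than a single `perf_counter` measurement.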