Pandas Lab
Goal: Build strong intuition for the Pandas transformations used in real ML pipelines (feature engineering, joins, groupby, leakage-safe logic).
Outcome: Students can write correct, efficient Pandas code, explain tradeoffs, and avoid common interview and production gotchas.
Progress — 0/8 tasks
Interview Angles
- Conditional aggregation best practices
- Counting purchases via `event == 'purchase'` vs `amount > 0`
- Why `.apply` is usually slow (one Python call per row)
FAANG Gotchas
- How to detect many-to-many join explosion early (rowcount checks, key uniqueness)
- Why `cumsum` without `shift` leaks labels on purchase rows
Section 0 — Synthetic Dataset (Interview-Friendly)
We’ll use synthetic tables to mirror typical ML feature engineering tasks.
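A minimal sketch of such tables; the column names (`user_id`, `event`, `event_ts`, `amount`, `country`) are assumptions that the exercises below will refer to:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200

# One row per user, with a categorical attribute.
users = pd.DataFrame({
    "user_id": range(1, 21),
    "country": rng.choice(["US", "DE", "IN", "BR"], size=20),
})

# One row per event; timestamps spread over January 2024.
events = pd.DataFrame({
    "user_id": rng.integers(1, 21, size=n),
    "event": rng.choice(["view", "click", "purchase"], size=n, p=[0.6, 0.3, 0.1]),
    "event_ts": pd.Timestamp("2024-01-01")
                + pd.to_timedelta(rng.integers(0, 30 * 24 * 3600, size=n), unit="s"),
})
# Only purchase rows carry revenue.
events["amount"] = np.where(events["event"] == "purchase",
                            rng.uniform(5, 100, size=n).round(2), 0.0)
```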
DataFrame Fundamentals
Create:
- `purchases`: the rows where `event == 'purchase'`
- an `event_day` column = the date only (no time component)
Hints:
- Use boolean indexing
- Use `.dt.floor('D')` or `.dt.date` (but prefer keeping `datetime64`)
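One possible solution sketch, on a tiny hand-built `events` frame (column names are the assumed ones from Section 0):

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event": ["view", "purchase", "purchase"],
    "event_ts": pd.to_datetime(["2024-01-01 09:30", "2024-01-01 10:00",
                                "2024-01-02 18:45"]),
    "amount": [0.0, 25.0, 40.0],
})

# Boolean indexing: keep only the purchase rows.
purchases = events[events["event"] == "purchase"]

# Strip the time-of-day but keep the datetime64 dtype (preferred over
# .dt.date, which downgrades the column to Python objects).
events["event_day"] = events["event_ts"].dt.floor("D")
```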
GroupBy (Core for Features)
Compute per-user features:
- `n_events`
- `n_purchases`
- `total_revenue` (sum of `amount`)
Hints:
- Use `groupby('user_id').agg(...)`
- For `n_purchases`, use conditional aggregation
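A sketch of the conditional-aggregation approach: build the boolean column first and sum it, instead of filtering inside `.apply`. Named aggregation gives each feature an explicit name:

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event": ["view", "purchase", "purchase", "view", "click"],
    "amount": [0.0, 25.0, 10.0, 0.0, 0.0],
})

user_feats = (
    events.assign(is_purchase=events["event"].eq("purchase"))
          .groupby("user_id")
          .agg(n_events=("event", "size"),        # row count per user
               n_purchases=("is_purchase", "sum"),  # conditional count
               total_revenue=("amount", "sum"))
          .reset_index()
)
```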
Add a column to `events` called `user_event_count` = the number of events for that user.
Hint:
- Use `groupby(...).transform('count')`
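A sketch of the `transform` pattern; unlike `.agg`, `transform` returns a result aligned to the original index, so it can be assigned straight back as a column:

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "event": ["view", "purchase", "view", "click", "purchase"],
})

# Every row gets its user's total event count.
events["user_event_count"] = events.groupby("user_id")["event"].transform("count")
```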
Joins & Merge (Feature Table Building)
Create `user_table` by joining `users` with `user_features` on `user_id`.
Hints:
- Use `merge(..., how='left')`
- Fill missing aggregates with 0
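A sketch, assuming `user_features` holds per-user aggregates such as `n_purchases` and `total_revenue`:

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "country": ["US", "DE", "IN"]})
user_features = pd.DataFrame({"user_id": [1, 2],
                              "n_purchases": [2, 1],
                              "total_revenue": [35.0, 40.0]})

# Left join keeps every user, even those with no feature rows.
user_table = users.merge(user_features, on="user_id", how="left")

# Users with no events get NaN from the left join; fill aggregates with 0.
user_table[["n_purchases", "total_revenue"]] = (
    user_table[["n_purchases", "total_revenue"]].fillna(0)
)
```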
Construct a tiny example where a join becomes many-to-many and explodes rows. Then fix it.
Steps:
- Create `left` with duplicate keys
- Create `right` with duplicate keys
- Merge and observe the rowcount
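One way this can play out; note that `merge`'s built-in `validate` argument guards against exactly this class of bug:

```python
import pandas as pd

# Duplicate keys on both sides: each left row matches every right row
# with the same key, so the rowcount multiplies.
left = pd.DataFrame({"key": [1, 1, 2], "l": ["a", "b", "c"]})
right = pd.DataFrame({"key": [1, 1, 2, 2], "r": ["x", "y", "z", "w"]})

exploded = left.merge(right, on="key")
# key=1: 2 * 2 = 4 rows; key=2: 1 * 2 = 2 rows.

# Detection: check key uniqueness before merging (the red flag).
assert not right["key"].is_unique

# Fix: deduplicate (or aggregate) the side that should be one-row-per-key,
# and assert the expected cardinality with validate=.
right_dedup = right.drop_duplicates("key")
fixed = left.merge(right_dedup, on="key", how="left", validate="many_to_one")
```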
Time-based Features & Leakage (FAANG System Thinking)
For each user and each event row, compute `purchases_before` = the number of purchases strictly BEFORE that row's `event_ts`.
Constraints:
- No Python loops over rows
- Must be time-aware
Hints:
- Sort by `user_id`, `event_ts`
- Create a boolean `is_purchase` column
- Use groupby + `cumsum`, then `shift`
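A leakage-safe sketch of the hints above. The `shift(1)` is what makes the count strictly "before this event": `cumsum` alone includes the current row, which leaks the label on purchase rows.

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event": ["view", "purchase", "purchase", "purchase", "view"],
    "event_ts": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03",
                                "2024-01-01", "2024-01-02"]),
})

# Time-aware: sort within each user before any cumulative operation.
events = events.sort_values(["user_id", "event_ts"]).reset_index(drop=True)
events["is_purchase"] = events["event"].eq("purchase").astype(int)

# Per-user running count of purchases, shifted so the current row is excluded.
events["purchases_before"] = (
    events.groupby("user_id")["is_purchase"]
          .transform(lambda s: s.cumsum().shift(1))
          .fillna(0)          # first event per user has no history
          .astype(int)
)
```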
Apply vs Vectorization
Create a `country_is_in` boolean column on `users`.
Steps:
- First: do it with `.apply(...)` (slow style)
- Then: replace it with vectorized code