Pandas Lab (#2)
S30 AI Lab (www.thes30.com)

Easy · 🐍 Python & Data · W1 D3

Build strong intuition for Pandas transformations used in real ML pipelines (feature engineering, joins, groupby, leakage-safe logic).

Students can write correct, efficient Pandas code, explain tradeoffs, and avoid common interview/production gotchas.


Tasks

  1. DataFrame Fundamentals
  2. GroupBy (Core for Features)
  3. Joins & Merge (Feature Table Building)
  4. Time-based Features & Leakage (FAANG System Thinking)
  5. Apply vs Vectorization

Interview Angles

  • Conditional aggregation best practices
  • Counting purchases via `event == 'purchase'` vs `amount > 0`
  • Why `.apply` is usually slow (Python calls per row)
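
A minimal sketch of the purchase-counting comparison, assuming a toy events table with event and amount columns (names inferred from the tasks below); the two counts can disagree whenever non-purchase rows also carry a nonzero amount:

```python
import pandas as pd

# Toy events table (hypothetical schema; the lab's real data may differ).
events = pd.DataFrame({
    "event":  ["view", "purchase", "purchase", "view"],
    "amount": [0.0, 25.0, 40.0, 0.0],
})

# Count purchases by event type: robust even if other event types carry amounts.
n_by_event = (events["event"] == "purchase").sum()

# Count purchases by amount: silently wrong if refunds, fees, etc. have amount > 0.
n_by_amount = (events["amount"] > 0).sum()

print(n_by_event, n_by_amount)  # 2 2 here, but they can diverge on real data
```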

FAANG Gotchas

  • How to detect many-to-many join explosion early (rowcount checks, key uniqueness)
  • Why `cumsum` without `shift` leaks labels on purchase rows
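
A small sketch of catching join explosion early with key-uniqueness and rowcount checks (merge's validate argument raises if the expected cardinality is violated):

```python
import pandas as pd

left = pd.DataFrame({"user_id": [1, 1, 2], "x": [10, 11, 20]})
right = pd.DataFrame({"user_id": [1, 2], "country": ["US", "DE"]})

# Cheap pre-merge checks: is the join key unique on the side you expect?
print(right["user_id"].is_unique)          # True  -> right is one row per key
print(left["user_id"].duplicated().any())  # True  -> left repeats keys

# Fail fast instead of silently exploding: validate enforces the cardinality.
merged = left.merge(right, on="user_id", how="left", validate="many_to_one")

# Rowcount check: a many-to-one left join must not grow the left table.
assert len(merged) == len(left)
```
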
Python 3 — Notebook

Task 1 — Dataset & Setup

Pandas Lab — FAANG-Level Hands-On

Goal: Build strong intuition for Pandas transformations used in ML pipelines (joins, groupby, leakage-safe features).

Outcome: You can write correct, scalable Pandas code and explain common gotchas (join explosion, leakage, apply misuse).

Section 0 — Synthetic Dataset (Interview-Friendly)

We’ll use synthetic tables to mirror typical ML feature engineering tasks.
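
The notebook generates the real tables; as a rough sketch of their shape (column names are assumptions inferred from the tasks below), they might look like:

```python
import pandas as pd

# Minimal synthetic tables: a sketch only. The generated lab data will be
# larger and may have extra columns, but these names match the tasks below.
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["US", "DE", "IN"],
})

events = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3],
    "event":    ["view", "purchase", "view", "purchase", "view"],
    "amount":   [0.0, 30.0, 0.0, 12.5, 0.0],
    "event_ts": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 10:30",
        "2024-01-02 08:15", "2024-01-03 19:45",
        "2024-01-02 12:00",
    ]),
})
```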

Section 1 — DataFrame Fundamentals

Task 2 — Basic filtering + new columns

Create:

  • purchases: rows where event == 'purchase'

  • add a column event_day = the event date with the time of day removed

  • Use boolean indexing

  • Use .dt.floor('D') or .dt.date (but prefer datetime64)

Explain: Why does vectorized `.dt` beat per-row parsing?
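
One possible solution sketch, assuming the events table and event_ts column from the setup sketch above:

```python
# Boolean indexing: keep only purchase rows.
purchases = events[events["event"] == "purchase"].copy()

# Vectorized date truncation: still datetime64[ns], just floored to midnight.
events["event_day"] = events["event_ts"].dt.floor("D")

# .dt runs over the whole column in compiled code; a per-row
# apply(lambda ts: ts.date()) pays Python-call overhead on every row.
```
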
Section 2 — GroupBy (Core for Features)

Task 3 — Per-user aggregates

Compute per-user features:

  • n_events

  • n_purchases

  • total_revenue (sum of amount)

  • Use groupby('user_id').agg(...)

  • For n_purchases, use conditional aggregation
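
A possible sketch using named aggregation (column names assumed from the setup above):

```python
user_features = (
    events
    .assign(is_purchase=lambda d: d["event"].eq("purchase"))
    .groupby("user_id")
    .agg(
        n_events=("event", "size"),          # every row counts
        n_purchases=("is_purchase", "sum"),  # conditional aggregation via boolean sum
        total_revenue=("amount", "sum"),
    )
    .reset_index()
)
```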

Task 4 — GroupBy transform (row-level feature)

Add a column to events called user_event_count = number of events for that user.

  • Use groupby(...).transform('count')
Explain: Why is `transform` different from `agg`?
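
A one-line sketch; transform returns a value aligned to every original row, while agg collapses each group to a single row:

```python
# transform keeps the original length of events (one value per row);
# agg would instead return one row per user_id.
events["user_event_count"] = events.groupby("user_id")["event"].transform("count")
```
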
Section 3 — Joins & Merge (Feature Table Building)

Task 5 — Join user features back to users

Create user_table by joining users with user_features on user_id.

  • Use merge(..., how='left')
  • Fill missing aggregates with 0
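
A sketch, reusing the user_features frame from the previous task:

```python
user_table = users.merge(user_features, on="user_id", how="left")

# Users with no events come back as NaN after the left join; fill with 0.
agg_cols = ["n_events", "n_purchases", "total_revenue"]
user_table[agg_cols] = user_table[agg_cols].fillna(0)
```
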
Task 6 — Join explosion debugging (mini)

Construct a tiny example where a join becomes many-to-many and explodes rows. Then fix it.

  • Create left with duplicate keys
  • Create right with duplicate keys
  • Merge and observe rowcount
Explain: What join cardinality do you expect (1:1, 1:n, n:1, n:n)?
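
A minimal sketch of the explosion and one way to fix it (dropping duplicate keys here is illustrative; in practice you would deduplicate or aggregate deliberately):

```python
import pandas as pd

# Duplicate keys on BOTH sides -> many-to-many join -> row explosion.
left = pd.DataFrame({"k": [1, 1, 2], "l": ["a", "b", "c"]})
right = pd.DataFrame({"k": [1, 1, 2], "r": ["x", "y", "z"]})

exploded = left.merge(right, on="k")
print(len(left), len(right), len(exploded))  # 3 3 5: key 1 alone yields 2 x 2 rows

# Fix: make one side unique per key first, then assert the expected cardinality.
right_unique = right.drop_duplicates("k")
fixed = left.merge(right_unique, on="k", validate="many_to_one")
assert len(fixed) == len(left)
```
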
Section 4 — Time-based Features & Leakage (FAANG System Thinking)

Task 7 — Leakage-safe feature

For each user and each event row, compute purchases_before = number of purchases strictly BEFORE that event_ts.

Constraints:

  • No Python loops over rows

  • Must be time-aware

  • Sort by user_id, event_ts

  • Create boolean is_purchase

  • Use groupby + cumsum, then shift
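
One leakage-safe sketch following the hints above (it assumes event_ts strictly increases within each user; exact timestamp ties would need a stricter time comparison):

```python
events = events.sort_values(["user_id", "event_ts"]).copy()
events["is_purchase"] = events["event"].eq("purchase")

# Running purchase count per user, including the current row...
events["cum_purchases"] = events.groupby("user_id")["is_purchase"].cumsum()

# ...shifted down one row per user, so each row sees only strictly earlier rows.
events["purchases_before"] = (
    events.groupby("user_id")["cum_purchases"].shift(1).fillna(0).astype(int)
)
# Without the shift, a purchase row would count itself, which is label leakage.
```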

Section 5 — Apply vs Vectorization

Task 8 — Remove slow apply

Create a country_is_in boolean column on users.

  • First: do it with .apply(...) (slow style)
  • Then: replace it with vectorized code
Explain: Why is `.apply` often slower in Pandas?
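
A sketch of both styles; TARGET_COUNTRIES is a hypothetical allow-list, since the column name country_is_in only implies a membership check on country:

```python
# TARGET_COUNTRIES is a hypothetical allow-list; substitute whatever set the task intends.
TARGET_COUNTRIES = {"US", "DE", "IN"}

# Slow style: one Python lambda call per row.
users["country_is_in"] = users["country"].apply(lambda c: c in TARGET_COUNTRIES)

# Vectorized replacement: a single isin call over the whole column.
users["country_is_in"] = users["country"].isin(TARGET_COUNTRIES)
```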