Pandas Lab
Goal: Build strong intuition for the Pandas transformations used in real ML pipelines (feature engineering, joins, groupby, leakage-safe logic).
Outcome: Students can write correct, efficient Pandas code, explain tradeoffs, and avoid common interview and production gotchas.
Progress — 0/8 tasks
Interview Angles
- Conditional aggregation best practices
- Counting purchases via `event == 'purchase'` vs `amount > 0`
- Why `.apply` is usually slow (one Python call per row)
FAANG Gotchas
- How to detect many-to-many join explosion early (rowcount checks, key uniqueness)
- Why `cumsum` without `shift` leaks labels on purchase rows
Section 0 — Synthetic Dataset (Interview-Friendly)
We’ll use synthetic tables to mirror typical ML feature engineering tasks.
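A minimal sketch of such tables; the column names (`user_id`, `event`, `event_ts`, `amount`, `country`) are assumptions that the exercises below will refer to:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200

# One row per user, with a categorical attribute.
users = pd.DataFrame({
    "user_id": range(1, 21),
    "country": rng.choice(["US", "DE", "IN", "BR"], size=20),
})

# One row per event; timestamps spread over January 2024.
events = pd.DataFrame({
    "user_id": rng.integers(1, 21, size=n),
    "event": rng.choice(["view", "click", "purchase"], size=n, p=[0.6, 0.3, 0.1]),
    "event_ts": pd.Timestamp("2024-01-01")
                + pd.to_timedelta(rng.integers(0, 30 * 24 * 3600, size=n), unit="s"),
})
# Only purchase rows carry revenue.
events["amount"] = np.where(events["event"] == "purchase",
                            rng.uniform(5, 100, size=n).round(2), 0.0)
```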
DataFrame Fundamentals
Create:
- `purchases`: the rows where `event == 'purchase'`
- an `event_day` column = the date only (no time component)
Hints:
- Use boolean indexing
- Use `.dt.floor('D')` or `.dt.date` (but prefer keeping `datetime64`)
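One possible solution sketch, on a tiny hand-built `events` frame (column names are the assumed ones from Section 0):

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event": ["view", "purchase", "purchase"],
    "event_ts": pd.to_datetime(["2024-01-01 09:30", "2024-01-01 10:00",
                                "2024-01-02 18:45"]),
    "amount": [0.0, 25.0, 40.0],
})

# Boolean indexing: keep only the purchase rows.
purchases = events[events["event"] == "purchase"]

# Strip the time-of-day but keep the datetime64 dtype (preferred over
# .dt.date, which downgrades the column to Python objects).
events["event_day"] = events["event_ts"].dt.floor("D")
```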
GroupBy (Core for Features)
Compute per-user features:
- `n_events`
- `n_purchases`
- `total_revenue` (sum of `amount`)
Hints:
- Use `groupby('user_id').agg(...)`
- For `n_purchases`, use conditional aggregation
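A sketch of the conditional-aggregation approach: build the boolean column first and sum it, instead of filtering inside `.apply`. Named aggregation gives each feature an explicit name:

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event": ["view", "purchase", "purchase", "view", "click"],
    "amount": [0.0, 25.0, 10.0, 0.0, 0.0],
})

user_feats = (
    events.assign(is_purchase=events["event"].eq("purchase"))
          .groupby("user_id")
          .agg(n_events=("event", "size"),        # row count per user
               n_purchases=("is_purchase", "sum"),  # conditional count
               total_revenue=("amount", "sum"))
          .reset_index()
)
```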
Add a column to `events` called `user_event_count` = the number of events for that user.
Hint:
- Use `groupby(...).transform('count')`
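A sketch of the `transform` pattern; unlike `.agg`, `transform` returns a result aligned to the original index, so it can be assigned straight back as a column:

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "event": ["view", "purchase", "view", "click", "purchase"],
})

# Every row gets its user's total event count.
events["user_event_count"] = events.groupby("user_id")["event"].transform("count")
```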
Joins & Merge (Feature Table Building)
Create `user_table` by joining `users` with `user_features` on `user_id`.
Hints:
- Use `merge(..., how='left')`
- Fill missing aggregates with 0
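A sketch, assuming `user_features` holds per-user aggregates such as `n_purchases` and `total_revenue`:

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "country": ["US", "DE", "IN"]})
user_features = pd.DataFrame({"user_id": [1, 2],
                              "n_purchases": [2, 1],
                              "total_revenue": [35.0, 40.0]})

# Left join keeps every user, even those with no feature rows.
user_table = users.merge(user_features, on="user_id", how="left")

# Users with no events get NaN from the left join; fill aggregates with 0.
user_table[["n_purchases", "total_revenue"]] = (
    user_table[["n_purchases", "total_revenue"]].fillna(0)
)
```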
Construct a tiny example where a join becomes many-to-many and explodes rows. Then fix it.
Steps:
- Create `left` with duplicate keys
- Create `right` with duplicate keys
- Merge and observe the rowcount
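One way this can play out; note that `merge`'s built-in `validate` argument guards against exactly this class of bug:

```python
import pandas as pd

# Duplicate keys on both sides: each left row matches every right row
# with the same key, so the rowcount multiplies.
left = pd.DataFrame({"key": [1, 1, 2], "l": ["a", "b", "c"]})
right = pd.DataFrame({"key": [1, 1, 2, 2], "r": ["x", "y", "z", "w"]})

exploded = left.merge(right, on="key")
# key=1: 2 * 2 = 4 rows; key=2: 1 * 2 = 2 rows.

# Detection: check key uniqueness before merging (the red flag).
assert not right["key"].is_unique

# Fix: deduplicate (or aggregate) the side that should be one-row-per-key,
# and assert the expected cardinality with validate=.
right_dedup = right.drop_duplicates("key")
fixed = left.merge(right_dedup, on="key", how="left", validate="many_to_one")
```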
Time-based Features & Leakage (FAANG System Thinking)
For each user and each event row, compute `purchases_before` = the number of purchases strictly BEFORE that row's `event_ts`.
Constraints:
- No Python loops over rows
- Must be time-aware
Hints:
- Sort by `user_id`, `event_ts`
- Create a boolean `is_purchase` column
- Use groupby + `cumsum`, then `shift`
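A leakage-safe sketch of the hints above. The `shift(1)` is what makes the count strictly "before this event": `cumsum` alone includes the current row, which leaks the label on purchase rows.

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event": ["view", "purchase", "purchase", "purchase", "view"],
    "event_ts": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03",
                                "2024-01-01", "2024-01-02"]),
})

# Time-aware: sort within each user before any cumulative operation.
events = events.sort_values(["user_id", "event_ts"]).reset_index(drop=True)
events["is_purchase"] = events["event"].eq("purchase").astype(int)

# Per-user running count of purchases, shifted so the current row is excluded.
events["purchases_before"] = (
    events.groupby("user_id")["is_purchase"]
          .transform(lambda s: s.cumsum().shift(1))
          .fillna(0)          # first event per user has no history
          .astype(int)
)
```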
Apply vs Vectorization
Create a `country_is_in` boolean column on `users`.
Steps:
- First: do it with `.apply(...)` (slow style)
- Then: replace it with vectorized code