S30 AI Lab (www.thes30.com)
#19

Feature Engineering Strategies

Medium | Evaluation & Debugging | W5 D1

Practice the feature engineering moves that win interviews and ship in production: missing data handling, categorical encoding, nonlinear transforms, and interaction features, all *without leaking labels*.

Students can:

  • build leakage-safe feature pipelines with sklearn
  • choose appropriate encoding strategies
  • add meaningful interaction features
  • validate that features help via disciplined evaluation

Interview Angles

  • When do interaction features help linear models?

FAANG Gotchas

  • Imputation can hide a strong signal: missingness itself.

Asked At

Google

Dataset & Setup

Feature Engineering Strategies: Student Lab (Titanic)

Focus: missing data, encoding, nonlinear transforms, and interactions, all leakage-safe.

Section 0: Load Titanic (Kaggle) with fallback

Expected file: data/titanic/train.csv

If the file is missing, a tiny synthetic dataset is used so the notebook still runs.
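
The loading step with a fallback might look like the sketch below. The helper names `load_titanic` and `make_synthetic_titanic` are hypothetical, and the synthetic rows (including the invented survival probabilities) only mimic the Titanic column layout so that later cells have something to run on.

```python
from pathlib import Path

import numpy as np
import pandas as pd

DATA_PATH = Path("data/titanic/train.csv")  # path stated in the lab


def make_synthetic_titanic(n_rows: int = 200, seed: int = 0) -> pd.DataFrame:
    """Hypothetical stand-in with Titanic-like columns (not real data)."""
    rng = np.random.default_rng(seed)
    sex = rng.choice(["male", "female"], size=n_rows)
    pclass = rng.choice([1, 2, 3], size=n_rows, p=[0.25, 0.25, 0.5])
    age = rng.normal(30, 12, size=n_rows).clip(1, 80)
    age[rng.random(n_rows) < 0.2] = np.nan  # simulate missing ages
    # Invented survival probabilities, only so the label is not pure noise.
    p_survive = 0.2 + 0.5 * (sex == "female") + 0.1 * (pclass == 1)
    return pd.DataFrame({
        "Survived": (rng.random(n_rows) < p_survive).astype(int),
        "Pclass": pclass,
        "Sex": sex,
        "Age": age,
        "SibSp": rng.integers(0, 4, size=n_rows),
        "Parch": rng.integers(0, 3, size=n_rows),
        "Fare": rng.lognormal(3.0, 1.0, size=n_rows),
        "Embarked": rng.choice(["S", "C", "Q"], size=n_rows),
    })


def load_titanic(path: Path = DATA_PATH) -> pd.DataFrame:
    """Read the Kaggle CSV if present, otherwise fall back to synthetic rows."""
    if path.exists():
        return pd.read_csv(path)
    return make_synthetic_titanic()


df = load_titanic()
print(df.shape)
```

Scores obtained on the synthetic fallback say nothing about the real dataset; it exists only to keep the notebook executable.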

Section 1: Baseline pipeline

Task 1.1: Train a baseline with leakage-safe preprocessing

Use:

  • Numeric: Age, Fare, SibSp, Parch
  • Categorical: Pclass, Sex, Embarked

Steps:

  • Create X/y
  • Build a ColumnTransformer (impute + scale numeric, impute + one-hot categorical)
  • Evaluate LogisticRegression with StratifiedKFold CV

Checkpoint: Why is the pipeline necessary to avoid leakage in CV?
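
One possible shape for this baseline is sketched below. The synthetic frame, the median and most-frequent imputation strategies, the 5 folds, and `max_iter=1000` are all assumptions made so the example runs on its own; they are not the lab's reference solution. Because the imputers, scaler, and encoder live inside the `Pipeline`, `cross_val_score` refits them on each training fold, so validation rows never influence the imputation values or scaling statistics.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in so the sketch runs without the Kaggle file (assumption).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "Age": rng.normal(30, 12, n).clip(1, 80),
    "Fare": rng.lognormal(3, 1, n),
    "SibSp": rng.integers(0, 4, n),
    "Parch": rng.integers(0, 3, n),
    "Pclass": rng.choice([1, 2, 3], n),
    "Sex": rng.choice(["male", "female"], n),
    "Embarked": rng.choice(["S", "C", "Q"], n),
})
df.loc[rng.random(n) < 0.2, "Age"] = np.nan
p_survive = 0.2 + 0.5 * (df["Sex"] == "female")  # invented signal
df["Survived"] = (rng.random(n) < p_survive).astype(int)

numeric_cols = ["Age", "Fare", "SibSp", "Parch"]
categorical_cols = ["Pclass", "Sex", "Embarked"]
X = df[numeric_cols + categorical_cols]
y = df["Survived"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Every preprocessing step is refit inside each fold by cross_val_score.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```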

Section 2: Missing data: indicators

Task 2.1: Add missingness indicator features

Add: Age_is_missing, Cabin_is_missing

  • Create these columns
  • Add them to the numeric pipeline
  • Re-run CV and compare

Checkpoint: Why can missingness itself be predictive?
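
A minimal sketch of the two indicator columns, using a tiny hand-made frame rather than real Titanic rows. The flags are computed row by row from the raw values, so they use no statistics from other rows and can be created before the CV split without leaking.

```python
import numpy as np
import pandas as pd

# Tiny hand-made frame for illustration only (not real Titanic rows).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": ["C85", None, None, "E46"],
})

# 1 where the raw value is absent, 0 otherwise.
df["Age_is_missing"] = df["Age"].isna().astype(int)
df["Cabin_is_missing"] = df["Cabin"].isna().astype(int)

# The flags join the numeric feature list used by the ColumnTransformer.
indicator_cols = ["Age_is_missing", "Cabin_is_missing"]
numeric_cols = ["Age", "Fare", "SibSp", "Parch"] + indicator_cols
print(df[indicator_cols])
```

An alternative for numeric columns is `SimpleImputer(add_indicator=True)`, which appends the missingness flags during imputation inside the pipeline.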

Section 3: Encoding: One-hot vs Target encoding

Task 3.1: Extract Title from Name and one-hot encode

Example titles: Mr, Mrs, Miss, Master.

  • Extract Title with a regex
  • Add it as a categorical feature
  • Re-run CV
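
The extraction step could be sketched as follows, assuming names follow the "Last, Title. First" layout. The sample strings are illustrative (the last one is made up), and collapsing uncommon titles into a "Rare" bucket is an added assumption that keeps the one-hot encoding small, not something the task requires.

```python
import pandas as pd

# Illustrative strings in "Last, Title. First" form; the last is made up.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Doe, Dr. Jane",
], name="Name")

# Capture the text between the comma and the first period.
title = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Assumption: group everything outside the four example titles as "Rare".
common_titles = {"Mr", "Mrs", "Miss", "Master"}
title = title.where(title.isin(common_titles), "Rare").rename("Title")
print(title.tolist())
```

Title can then be appended to the categorical column list so it passes through the same impute + one-hot branch as Sex and Embarked.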

Task 3.2: Naive target encoding (demonstrate leakage)

We will intentionally do something wrong: compute mean Survived per Title using the full dataset, then map it back.

  • Create Title_target_mean using full data (this is leakage)
  • Compare CV score and explain why it inflates

Checkpoint: How do you fix target encoding correctly?
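
One common fix is out-of-fold target encoding: each row is encoded with target means computed only from rows in other folds, so its own label never feeds its feature. The sketch below is one way to do that; the function name `oof_target_encode`, the smoothing constant, and the noise-only demo data (100 made-up category levels with random labels) are all assumptions chosen to make the leak visible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_encode(cat: pd.Series, y: pd.Series, n_splits: int = 5,
                      smoothing: float = 10.0, seed: int = 0) -> pd.Series:
    """Encode each row using target means from the other folds only."""
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cat):
        prior = y.iloc[fit_idx].mean()
        stats = y.iloc[fit_idx].groupby(cat.iloc[fit_idx]).agg(["sum", "count"])
        # Shrink small categories toward the fold's global mean.
        smoothed = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Categories unseen in the fitting folds fall back to the prior.
        values = cat.iloc[enc_idx].map(smoothed).fillna(prior)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded


# Noise-only demo: labels are random, so an honest encoding should carry
# almost no information about y.
rng = np.random.default_rng(0)
n = 500
category = pd.Series(rng.integers(0, 100, n).astype(str), name="Category")
y = pd.Series(rng.integers(0, 2, n), name="Survived")

naive = category.map(y.groupby(category).mean())  # leaky: uses own label
oof = oof_target_encode(category, y)

naive_corr = np.corrcoef(naive, y)[0, 1]
oof_corr = np.corrcoef(oof, y)[0, 1]
print(f"naive corr with y: {naive_corr:.2f}, out-of-fold corr: {oof_corr:.2f}")
```

Even this version should be fit inside the CV loop rather than once on the full dataset. scikit-learn 1.3 and later also ship `sklearn.preprocessing.TargetEncoder`, which can be placed in the Pipeline so that it is refit on each training fold.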

Section 4: Nonlinear transforms + binning

Task 4.1: Log-transform Fare
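
A small sketch of the transform, assuming `log1p` (log of 1 + x) is acceptable so that a fare of 0 maps to 0 instead of negative infinity. The fare values are illustrative, and the column name `Fare_log` is an assumption.

```python
import numpy as np
import pandas as pd

# Illustrative fares spanning a wide, right-skewed range.
fare = pd.Series([0.0, 7.25, 71.2833, 512.3292], name="Fare")

# log1p compresses the long right tail while preserving order.
fare_log = np.log1p(fare).rename("Fare_log")
print(fare_log.round(3).tolist())
```

The transform is stateless (it learns nothing from the data), so applying it before the CV split does not leak. Inside a pipeline, the same step can be written as `FunctionTransformer(np.log1p)` from `sklearn.preprocessing`.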

Section 5: Interaction features

Task 5.1: FamilySize and IsAlone

FamilySize = SibSp + Parch + 1
IsAlone = FamilySize == 1
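
In pandas, the two definitions above translate directly. The three rows are made up for illustration, and storing IsAlone as 0/1 rather than a boolean is a choice made here so it can go through the numeric pipeline.

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# The +1 counts the passenger themselves.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)
print(df)
```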

Task 5.2: Sex × Pclass cross

Create a categorical cross feature like Sex_Pclass.
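
A minimal sketch of the cross, using made-up rows. Once one-hot encoded, each (Sex, Pclass) combination gets its own column, which lets a linear model learn a separate weight per combination instead of only additive Sex and Pclass effects.

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Pclass": [3, 1, 3],
})

# Concatenate the two categories into a single string label.
df["Sex_Pclass"] = df["Sex"] + "_" + df["Pclass"].astype(str)
print(df["Sex_Pclass"].tolist())
```

Sex_Pclass would then be added to the categorical column list so it is one-hot encoded inside the pipeline.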
