Feature Engineering Strategies
Practice the feature engineering moves that win interviews and ship in production: missing data handling, categorical encoding, nonlinear transforms, and interaction features, *without leaking labels*.
Students can:
- build leakage-safe feature pipelines with sklearn,
- choose appropriate encoding strategies,
- add meaningful interaction features,
- validate that features help via disciplined evaluation.
Interview Angles
- When do interaction features help linear models?
FAANG Gotchas
- Imputation can hide a strong signal: missingness itself.
Feature Engineering Strategies - Student Lab (Titanic)
Focus: missing data, encoding, nonlinear transforms, and interactions, all leakage-safe.
Section 0 - Load Titanic (Kaggle) with fallback
Expected: data/titanic/train.csv
If missing, a tiny synthetic dataset is used so the notebook runs.
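A minimal sketch of the load-with-fallback step. The path comes from the line above; the helper names `make_synthetic_titanic` and `load_titanic`, and every value in the synthetic frame, are hypothetical stand-ins (random numbers shaped like the Kaggle columns, not real passenger data).

```python
from pathlib import Path

import numpy as np
import pandas as pd

DATA_PATH = Path("data/titanic/train.csv")  # expected location from the lab text


def make_synthetic_titanic(n_rows: int = 200, seed: int = 0) -> pd.DataFrame:
    """Hypothetical stand-in with Titanic-like columns; values are random."""
    rng = np.random.default_rng(seed)
    sex = rng.choice(["male", "female"], size=n_rows)
    pclass = rng.choice([1, 2, 3], size=n_rows, p=[0.25, 0.25, 0.5])
    age = rng.normal(30, 12, size=n_rows).clip(1, 80)
    age[rng.random(n_rows) < 0.2] = np.nan  # inject missing ages
    # Tie survival loosely to sex and class so models have some signal.
    p_survive = 0.2 + 0.5 * (sex == "female") + 0.1 * (pclass == 1)
    title = np.where(sex == "female", "Mrs", "Mr")
    return pd.DataFrame({
        "Survived": (rng.random(n_rows) < p_survive).astype(int),
        "Pclass": pclass,
        "Name": ["Doe, " + t + ". Sample" for t in title],
        "Sex": sex,
        "Age": age,
        "SibSp": rng.integers(0, 4, size=n_rows),
        "Parch": rng.integers(0, 3, size=n_rows),
        "Fare": rng.lognormal(3.0, 1.0, size=n_rows),
        "Cabin": np.where(rng.random(n_rows) < 0.7, None, "C85"),
        "Embarked": rng.choice(["S", "C", "Q"], size=n_rows),
    })


def load_titanic(path: Path = DATA_PATH) -> pd.DataFrame:
    """Read the Kaggle CSV if present, otherwise fall back to synthetic rows."""
    if path.exists():
        return pd.read_csv(path)
    return make_synthetic_titanic()


df = load_titanic()
print(df.shape)
```

Scores obtained on the synthetic fallback only confirm that the notebook runs; they say nothing about the real dataset.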
Section 1 - Baseline pipeline
Task 1.1: Train a baseline with leakage-safe preprocessing
Use:
- Numeric: Age, Fare, SibSp, Parch
- Categorical: Pclass, Sex, Embarked
- Create X/y
- Build ColumnTransformer (impute + scale numeric, impute + one-hot categorical)
- Evaluate LogisticRegression with StratifiedKFold CV
Checkpoint: Why is the pipeline necessary to avoid leakage in CV?
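One possible shape for Task 1.1, as a sketch rather than a reference solution. The inline DataFrame is a hypothetical stand-in so the block runs without the Kaggle file; the imputation strategies (median, most frequent), `max_iter`, the fold count, and the random seeds are assumptions, not requirements from the lab.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in data (random values, Titanic-like column names).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "Age": np.where(rng.random(n) < 0.2, np.nan, rng.normal(30, 12, n)),
    "Fare": rng.lognormal(3.0, 1.0, n),
    "SibSp": rng.integers(0, 4, n),
    "Parch": rng.integers(0, 3, n),
    "Pclass": rng.choice([1, 2, 3], n),
    "Sex": rng.choice(["male", "female"], n),
    "Embarked": rng.choice(["S", "C", "Q"], n),
})
df["Survived"] = (rng.random(n) < 0.2 + 0.5 * (df["Sex"] == "female")).astype(int)

numeric_cols = ["Age", "Fare", "SibSp", "Parch"]
categorical_cols = ["Pclass", "Sex", "Embarked"]
X = df[numeric_cols + categorical_cols]
y = df["Survived"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# cross_val_score refits the whole pipeline on each training fold, so the
# medians, scaling statistics, and category lists never see the held-out rows.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The comment above `cv` is the short answer to the checkpoint: fitting the imputer or scaler on the full table before splitting would let statistics from the validation rows shape the training features.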
Section 2 - Missing data: indicators
Task 2.1: Add missingness indicator features
Add: Age_is_missing, Cabin_is_missing
- Create these columns
- Add them to numeric pipeline
- Re-run CV and compare
Checkpoint: Why can missingness itself be predictive?
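The indicator columns from Task 2.1 could be built as below. The five-row frame is a hypothetical example, not Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for the two Titanic columns used here.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, np.nan, 4.0],
    "Cabin": ["C85", None, None, "E46", None],
})

# Row-wise flags: each value depends only on its own row, so computing them
# before the train/validation split does not leak information.
df["Age_is_missing"] = df["Age"].isna().astype(int)
df["Cabin_is_missing"] = df["Cabin"].isna().astype(int)

print(df)
```

To use them, append `Age_is_missing` and `Cabin_is_missing` to the numeric column list before re-running CV. An alternative for `Age` is `SimpleImputer(add_indicator=True)`, which generates the flag inside the pipeline.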
Section 3 - Encoding: One-hot vs Target encoding
Task 3.1: Extract Title from Name and one-hot encode
Example titles: Mr, Mrs, Miss, Master.
- Extract Title with regex
- Add it as categorical feature
- Re-run CV
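A sketch of the Title extraction for Task 3.1, assuming names follow the "Surname, Title. Given names" layout. The sample strings are illustrative, and collapsing everything outside the four example titles into `Other` is an assumption, not something the lab prescribes.

```python
import pandas as pd

# Sample strings in the "Surname, Title. Given names" layout.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture the text between the comma and the first period.
title = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Assumption: group rare titles so one-hot columns are not almost empty.
common_titles = {"Mr", "Mrs", "Miss", "Master"}
title = title.where(title.isin(common_titles), "Other")

print(title.tolist())
```

Because the regex looks at one row at a time, creating `Title` before splitting is safe; add it to the categorical column list so the one-hot encoder is still fitted per fold.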
Task 3.2: Naive target encoding (demonstrate leakage)
We will intentionally do something wrong: compute mean Survived per Title using the full dataset, then map it back.
- Create Title_target_mean using full data (this is leakage)
- Compare CV score and explain why it inflates
Checkpoint: How do you fix target encoding correctly?
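The sketch below contrasts the leaky encoding with one common fix: learn the category means inside the pipeline, so each CV fold computes them from its training rows only. To make the effect easy to see, it swaps `Title` for a hypothetical high-cardinality column `Group` and uses pure-noise labels; with only a handful of titles the inflation on the real data is likely much smaller. The class `FoldSafeTargetMean` and its `smoothing` parameter are illustrative names, not a library API.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical data: many categories, labels are coin flips, so any CV
# accuracy well above 0.5 can only come from leakage.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "Group": [f"g{i}" for i in rng.integers(0, 200, n)],
    "Survived": rng.integers(0, 2, n),
})
y = df["Survived"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Wrong: means use ALL rows, so every row's own label is baked into its feature.
df["Group_target_mean"] = df.groupby("Group")["Survived"].transform("mean")
naive = cross_val_score(
    LogisticRegression(), df[["Group_target_mean"]], y, cv=cv
).mean()


class FoldSafeTargetMean(TransformerMixin, BaseEstimator):
    """Hypothetical helper: smoothed per-category target means learned in fit()."""

    def __init__(self, smoothing=10.0):
        self.smoothing = smoothing

    def fit(self, X, y):
        cats = pd.Series(np.asarray(X).ravel())
        target = pd.Series(np.asarray(y, dtype=float))
        self.prior_ = target.mean()
        stats = target.groupby(cats).agg(["sum", "count"])
        # Shrink small categories toward the overall mean.
        self.means_ = (stats["sum"] + self.smoothing * self.prior_) / (
            stats["count"] + self.smoothing
        )
        return self

    def transform(self, X):
        cats = pd.Series(np.asarray(X).ravel())
        # Categories unseen during fit fall back to the overall mean.
        return cats.map(self.means_).fillna(self.prior_).to_numpy().reshape(-1, 1)


# Right: the encoder is a pipeline step, refitted on each training fold.
safe_model = Pipeline([
    ("encode", FoldSafeTargetMean()),
    ("clf", LogisticRegression()),
])
safe = cross_val_score(safe_model, df[["Group"]], y, cv=cv).mean()

print(f"naive (leaky): {naive:.3f}   in-pipeline: {safe:.3f}")
```

On noise labels the leaky score should land well above chance while the in-pipeline score stays near 0.5. Recent scikit-learn versions also ship `sklearn.preprocessing.TargetEncoder`, which additionally cross-fits within the training data.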
Section 4 - Nonlinear transforms + binning
Task 4.1: Log-transform Fare
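A minimal sketch of the transform, assuming `log1p` (log of 1 + x) is acceptable so that zero fares stay defined. The four fare values are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative fares, including a zero and a large outlier.
fare = pd.Series([0.0, 7.25, 71.2833, 512.3292], name="Fare")

# log1p compresses the long right tail and maps a fare of 0 to 0.
fare_log = np.log1p(fare).rename("Fare_log")

print(pd.concat([fare, fare_log], axis=1))
```

The transform has no fitted parameters, so it cannot leak. Inside a pipeline it can be wrapped as `FunctionTransformer(np.log1p)` ahead of the scaler.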
Section 5 - Interaction features
Task 5.1: FamilySize and IsAlone
FamilySize = SibSp + Parch + 1
IsAlone = FamilySize == 1
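The two definitions translate directly to pandas. The four rows are a hypothetical example.

```python
import pandas as pd

# Hypothetical rows; only the two count columns matter here.
df = pd.DataFrame({"SibSp": [1, 0, 3, 0], "Parch": [0, 0, 2, 1]})

df["FamilySize"] = df["SibSp"] + df["Parch"] + 1  # +1 counts the passenger
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

print(df)
```

Both columns can join the numeric feature list. `IsAlone` is a threshold on `FamilySize`, which gives a linear model a step it could not express from the raw counts alone.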
Task 5.2: Sex × Pclass cross
Create a categorical cross feature like Sex_Pclass.
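One way to build the cross is plain string concatenation, sketched here on a hypothetical four-row frame.

```python
import pandas as pd

# Hypothetical rows covering a few Sex/Pclass combinations.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1],
})

# One category per (Sex, Pclass) pair, e.g. "female_1".
df["Sex_Pclass"] = df["Sex"] + "_" + df["Pclass"].astype(str)

print(df["Sex_Pclass"].tolist())
```

Adding `Sex_Pclass` to the categorical column list lets one-hot encoding give each pair its own coefficient, so a linear model can capture a class effect that differs by sex instead of a single additive shift.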