When Clean Data Is Actually Dirty

Author(s): StatHarbor Analytics
About this audio

We often treat data cleaning as a neutral step.

Delete missing rows. Fill gaps with the mean. Move on.

But cleaning is not neutral. It is a modeling decision.

In this episode, we unpack the statistical consequences of deletion and simple imputation, and why what looks “clean” can fundamentally alter your estimand, distort variance, and bias inference.

We walk through:

  • The formal role of the missingness indicator
  • The difference between MCAR, MAR, and MNAR
  • Why complete-case analysis is rarely as safe as it seems
  • How mean imputation collapses variance and attenuates regression slopes
  • When multiple imputation and inverse probability weighting are appropriate
  • Why sensitivity analysis becomes essential under MNAR
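The variance collapse and slope attenuation from mean imputation are easy to see in a small simulation (a hypothetical NumPy sketch, not code from the episode; here the outcome y is the variable being mean-imputed):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)        # true slope = 2, Var(y) = 5

# MCAR: 40% of y go missing, independently of everything
p = 0.4
miss = rng.random(n) < p
y_imp = np.where(miss, y[~miss].mean(), y)   # fill gaps with the observed mean

# Variance collapses by roughly the missing fraction:
# Var(y_imp) ≈ (1 - p) * Var(y)
print(y.var(), y_imp.var())

# The OLS slope of y_imp on x is attenuated by the same factor:
# slope ≈ (1 - p) * 2 = 1.2
slope = np.cov(x, y_imp)[0, 1] / x.var()
print(slope)
```

Under MCAR the attenuation factor happens to equal the observed fraction; under MAR or MNAR the distortion is less predictable.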

If you cannot defend MCAR, deletion and mean imputation are high-risk defaults.
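A hypothetical sketch of why (assuming missingness in y that depends on an observed covariate x, i.e. MAR): the complete-case mean is biased, while inverse probability weighting with the response probabilities recovers it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)        # true population mean of y is 0

# MAR: y is more likely to be missing when x is large
p_miss = 1.0 / (1.0 + np.exp(-2.0 * x))
miss = rng.random(n) < p_miss

# Complete-case mean is biased low: mostly small-x rows survive
cc_mean = y[~miss].mean()
print(cc_mean)

# Inverse probability weighting with the (here known) response
# probabilities 1 - p_miss recovers the full-population mean
w = 1.0 / (1.0 - p_miss[~miss])
ipw_mean = np.average(y[~miss], weights=w)
print(ipw_mean)
```

In practice the response probabilities are unknown and must be modeled, which is where the episode's discussion of IPW and multiple imputation picks up.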

Cleaning is not preprocessing.

Cleaning is inference.

This episode is for data scientists, statisticians, epidemiologists, and analysts who want to bring rigor back to real-world data.

StatHarbor Analytics
Episodes
  • When Clean Data Is Actually Dirty
    Feb 16 2026

    “Cleaning” data is often treated as a harmless preprocessing step.

    Delete missing rows.

    Fill gaps with the mean.

    Move forward.

    But cleaning is not neutral.

    It is a modeling decision that can change:

    • The estimand
    • The sampling mechanism
    • The bias–variance trade-off

    In this episode, we examine the statistical dangers of deletion and simple imputation — and why naïve cleaning can quietly corrupt inference.

    6 min