
ELO Ratings Questions
About this audio
- Thesis: Using ELO for AI agent evaluation = measuring noise
- Problem: Wrong evaluators, wrong metrics, wrong assumptions
- Solution: Quantitative assessment frameworks
Chess ELO
- FIDE arbiters: 120hr training
- Binary outcome: win/loss
- Test-retest: r=0.95
- Cohen's κ=0.92
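
For readers unfamiliar with the agreement statistic quoted here, the following is a minimal sketch of how Cohen's kappa is computed for two raters who label the same items. The function name and the sample labels are illustrative assumptions, not data from the episode.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (hypothetical helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: computed from each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_o - p_e) / (1 - p_e)


# Made-up example: two raters judging four game outcomes.
a = ["win", "win", "loss", "loss"]
b = ["win", "win", "loss", "win"]
kappa = cohens_kappa(a, b)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is lower than the simple match rate (0.75 in the made-up example above).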
AI Agent ELO
- Random users: Google engineer? CS student? 10-year-old?
- Undefined dimensions: accuracy? style? speed?
- Test-retest: r=0.31 (coin flip)
- Cohen's κ=0.42
- Anchoring: 34% rating variance in first 3 seconds
- Confirmation: 78% selective attention to preferred features
- Dunning-Kruger: d=1.24 effect size
- Result: Circular preferences (A>B>C>A)
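
The circular-preference result (A>B>C>A) can be checked mechanically. The sketch below assumes pairwise votes have already been reduced to (winner, loser) pairs and uses a depth-first search to find one cycle; the function name and input shape are assumptions for illustration.

```python
def find_preference_cycle(preferences):
    """Return one preference cycle as a list (first item == last item), or None.

    `preferences` is an iterable of (winner, loser) pairs (hypothetical format).
    """
    graph = {}
    for winner, loser in preferences:
        graph.setdefault(winner, []).append(loser)
        graph.setdefault(loser, [])

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, fully explored
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GREY:
                # nxt is already on the current path, so we closed a loop.
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        color[node] = BLACK
        path.pop()
        return None

    for node in graph:
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


cycle = find_preference_cycle([("A", "B"), ("B", "C"), ("C", "A")])
```

If a cycle exists, no single ranking is consistent with all of the pairwise votes, so any rating derived from them reflects ordering noise rather than a stable quality scale.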
Objective Metrics
- McCabe complexity ≤20
- Test coverage ≥80%
- Big O notation comparison
- Self-admitted technical debt
- Reliability: r=0.91 (objective metrics) vs r=0.42 (preference votes)
- Effect size: d=2.18
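
As a rough illustration of the McCabe threshold above, the sketch below approximates cyclomatic complexity for Python source using the standard `ast` module: one plus the number of decision points. The set of counted nodes is a simplifying assumption and will not match every tool's counting rules; the function names are hypothetical.

```python
import ast

# Node types treated as decision points (a simplified, assumed rule set).
BRANCH_NODES = (ast.If, ast.For, ast.AsyncFor, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source):
    """Approximate McCabe complexity of a source string: decision points + 1."""
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra short-circuit paths.
            complexity += len(node.values) - 1
    return complexity


def within_threshold(source, limit=20):
    """True if the approximate complexity is at or below the limit."""
    return cyclomatic_complexity(source) <= limit


sample = (
    "def f(x):\n"
    "    if x > 0 and x < 10:\n"
    "        return 1\n"
    "    for i in range(x):\n"
    "        pass\n"
    "    return 0\n"
)
score = cyclomatic_complexity(sample)
```

Unlike a preference vote, this number is the same every time it is computed on the same input, which is the property the reliability figures above are pointing at.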
Dream
- World's best engineers
- Annotated metrics
- Standardized criteria
Reality
- Random internet users
- No expertise verification
- Subjective preferences
- Stop: Using preference votes as quality metrics
- Start: Automated complexity analysis
- ROI: 4.7 months to break even
- Kapoor et al. (2025): "AI agents that matter" - κ=0.42 finding
- Santos et al. (2022): Technical Debt Grading validation
- Regan & Haworth (2011): Chess arbiter reliability κ=0.92
- Chapman & Johnson (2002): 34% anchoring effect
"You can't rate chess with basketball fans"
"0.31 reliability? That's a coin flip with extra steps"
"Every preference vote is a data crime"
"The psychometrics are screaming"
Resources
- Technical Debt Grading (TDG) Framework
- PMAT (Pragmatic AI Labs MCP Agent Toolkit)
- McCabe Complexity Calculator
- Cohen's Kappa Calculator
- 🤖 Master GenAI Engineering - Build Production AI Systems
- 🦀 Learn Professional Rust - Industry-Grade Development
- 📊 AWS AI & Analytics - Scale Your ML in Cloud
- ⚡ Production GenAI on AWS - Deploy at Enterprise Scale
- 🛠️ Rust DevOps Mastery - Automate Everything
- 💼 Production ML Program - Complete MLOps & Cloud Mastery
- 🎯 Start Learning Now - Fast-Track Your ML Career
- 🏢 Trusted by Fortune 500 Teams
Learn end-to-end ML engineering from industry veterans at PAIML.COM