ELO Ratings Questions

About this audio

Key Argument
  • Thesis: Using ELO for AI agent evaluation = measuring noise
  • Problem: Wrong evaluators, wrong metrics, wrong assumptions
  • Solution: Quantitative assessment frameworks
The Comparison (00:00-02:00)

Chess ELO

  • FIDE arbiters: 120hr training
  • Unambiguous outcome: win/draw/loss
  • Test-retest: r=0.95
  • Cohen's κ=0.92

AI Agent ELO

  • Random users: Google engineer? CS student? 10-year-old?
  • Undefined dimensions: accuracy? style? speed?
  • Test-retest: r=0.31 (coin flip)
  • Cohen's κ=0.42
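
The κ values above are inter-rater agreement scores corrected for chance. As a minimal sketch of how such a number is computed, the snippet below implements Cohen's kappa for two raters in plain Python. The function name `cohens_kappa` and the sample votes are hypothetical illustrations, not data from the episode.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical votes: which of two agents (A or B) each rater preferred.
votes_a = ["A", "A", "B", "B", "A", "B", "A", "B"]
votes_b = ["A", "B", "B", "A", "A", "B", "B", "B"]
kappa = cohens_kappa(votes_a, votes_b)
print(f"kappa = {kappa:.2f}")
```

Here the raters agree on 5 of 8 items, but half of that agreement is expected by chance, so kappa lands well below the raw agreement rate.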
Cognitive Bias Cascade (02:00-03:30)
  • Anchoring: 34% rating variance in first 3 seconds
  • Confirmation: 78% selective attention to preferred features
  • Dunning-Kruger: d=1.24 effect size
  • Result: Circular preferences (A>B>C>A)
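
A circular result such as A>B>C>A cannot be represented by any single rating scale, because a scalar rating implies a strict ordering. A minimal sketch of how one might detect such cycles in pairwise vote data follows, using a depth-first search over a "beats" graph. The function name `find_preference_cycle` and the input format of (winner, loser) pairs are assumptions made for illustration.

```python
def find_preference_cycle(wins):
    """Return one cycle like ['A', 'B', 'C', 'A'] from (winner, loser) pairs, or None."""
    graph = {}
    for winner, loser in wins:
        graph.setdefault(winner, set()).add(loser)
        graph.setdefault(loser, set())

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    state = {node: WHITE for node in graph}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for nxt in sorted(graph[node]):
            if state[nxt] == GREY:
                # Back edge to a node on the current path: cycle found.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in sorted(graph):
        if state[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None

cycle = find_preference_cycle([("A", "B"), ("B", "C"), ("C", "A")])
print(cycle)
```

A consistent set of votes, for example A beats B, B beats C, and A beats C, returns `None`.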
The Quantitative Alternative (03:30-05:00)

Objective Metrics

  • McCabe complexity ≤20
  • Test coverage ≥80%
  • Big O notation comparison
  • Self-admitted technical debt
  • Reliability: r=0.91 (objective metrics) vs r=0.42 (preference votes)
  • Effect size: d=2.18
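
To make the McCabe threshold concrete, here is a rough sketch that approximates cyclomatic complexity for a Python snippet by counting decision points with the standard `ast` module. It is not the implementation used by any tool listed under Resources. The set of counted node types, the function name `cyclomatic_complexity`, and the sample code are assumptions, and the count covers a whole snippet rather than each function separately.

```python
import ast

# Node types treated as decision points in this approximation (an assumption).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.comprehension)

def cyclomatic_complexity(source):
    """Approximate McCabe complexity: 1 + number of decision points."""
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' adds two extra short-circuit paths.
            complexity += len(node.values) - 1
    return complexity

SAMPLE = '''
def classify(n):
    if n < 0 and n != -1:
        return "negative"
    for i in range(n):
        if i % 2:
            return "odd"
    return "other"
'''

score = cyclomatic_complexity(SAMPLE)
within_budget = score <= 20  # the "McCabe complexity ≤20" threshold from the list above
print(score, within_budget)
```

Unlike a preference vote, this score is deterministic: the same input always yields the same number, which is the property the reliability comparison relies on.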
Dream Scenario vs Reality (05:00-06:00)

Dream

  • World's best engineers
  • Annotated metrics
  • Standardized criteria

Reality

  • Random internet users
  • No expertise verification
  • Subjective preferences
Key Statistics

Metric                    Chess      AI Agents
Inter-rater reliability   κ=0.92     κ=0.42
Test-retest               r=0.95     r=0.31
Temporal drift            ±10 pts    ±150 pts
Hurst exponent            0.89       0.31

Takeaways
  1. Stop: Using preference votes as quality metrics
  2. Start: Automated complexity analysis
  3. ROI: 4.7 months to break even
Citations Mentioned
  • Kapoor et al. (2025): "AI agents that matter" - κ=0.42 finding
  • Santos et al. (2022): Technical Debt Grading validation
  • Regan & Haworth (2011): Chess arbiter reliability κ=0.92
  • Chapman & Johnson (2002): 34% anchoring effect
Quotable Moments

"You can't rate chess with basketball fans"

"0.31 reliability? That's a coin flip with extra steps"

"Every preference vote is a data crime"

"The psychometrics are screaming"

Resources
  • Technical Debt Grading (TDG) Framework
  • PMAT (Pragmatic AI Labs MCP Agent Toolkit)
  • McCabe Complexity Calculator
  • Cohen's Kappa Calculator
