Épisodes

  • GitHub Runner Pricing Pause, Terraform Cloud Limits, and AI in CI
    Dec 20 2025

    This week on Ship It Weekly, Brian looks at how the “platform tax” is showing up everywhere: pricing model shifts, CI dependencies, and new security boundaries thanks to AI agents.

    We start with GitHub Actions. GitHub announced a new “cloud platform” charge for self-hosted runners in private/internal repos… then hit pause after backlash. Hosted runner price reductions for 2026 are still planned. We also got the perfect timing joke: a GitHub incident the same week.

    Next up is HashiCorp. Legacy HCP Terraform (Terraform Cloud) Free is reaching end-of-life in 2026, with orgs moving to the newer Free tier capped at 500 managed resources. If you’re running real infrastructure, this is a good moment to audit what you’re actually managing and decide whether you’re cleaning up, paying, or planning a migration.

    Then we talk PromptPwnd: why stuffing untrusted PR/issue text into AI agent prompts (inside CI) can turn into a supply chain/security problem. The short version: treat AI inputs like hostile user input, keep tokens/permissions minimal, and don’t let agents “run with scissors.”

    We also cover the Home Depot report about long-lived access exposure as a reminder that secrets hygiene, blast radius, and detection still matter more than the shiny tools.

    In the lightning round: CDKTF is sunset/archived, Bitbucket is cleaning up free unused workspaces, and SourceHut is proposing pricing changes. We wrap with a human note on “platform whiplash” and why a simple watchlist beats carrying all this stuff in your head.

    Links from this episode

    GitHub Actions pricing + pause https://runs-on.com/blog/github-self-hosted-runner-fee-2026/ https://x.com/github/status/2001372894882918548 https://www.githubstatus.com/incidents/x696x0g4t85l

    HashiCorp / Terraform Cloud free plan changes https://github.com/hashicorp/terraform-cdk?tab=readme-ov-file#sunset-notice https://www.reddit.com/r/Terraform/s/slYm77wzYr

    PromptPwnd / AI agents in CI https://www.aikido.dev/blog/promptpwnd-github-actions-ai-agents

    Home Depot access exposure report https://techcrunch.com/2025/12/12/home-depot-exposed-access-to-internal-systems-for-a-year-says-researcher/

    Bitbucket cleanup https://community.atlassian.com/forums/Bitbucket-articles/Bitbucket-cleanup-of-free-unused-workspaces-what-you-need-to/ba-p/3144063

    SourceHut pricing proposal https://sourcehut.org/blog/2025-12-01-proposed-pricing-changes/

    Voir plus Voir moins
    12 min
  • IBM Buys Confluent, React2Shell, and Netflix on Aurora
    Dec 12 2025

    In this episode of Ship It Weekly, Brian powers through a cold and digs into a very “infra grown-up” week in DevOps.

    First up, IBM is buying Confluent for $11B. We talk about what that means if you’re on Confluent Cloud today, still running your own Kafka, or trying to choose between Confluent, MSK, and DIY. It’s part of a bigger pattern after IBM’s HashiCorp deal, and it has real implications for vendor concentration and “plan B” strategies.

    Then we shift to React2Shell, a 10.0 RCE in React Server Components that’s already being exploited in the wild. Even if you never touch React, if you run platforms or Kubernetes for teams using Next.js or RSC, you’re on the hook for patching windows, WAF rules, and blast-radius thinking.

    We also look at Netflix’s write-up on consolidating relational databases onto Aurora PostgreSQL, with big performance gains and cost savings. It’s a good excuse to step back and ask whether your own Postgres fleet still makes sense at the scale you’re at now.

    In the lightning round, we hit OpenTofu 1.11’s new language features, practical Terraform “tips from the trenches,” Ghostty becoming a non-profit project, and two spec-driven dev tools (Spec Kit and OpenSpec) that show what sane AI-assisted development might look like.

    For the human side, we close with “Your Brain on Incidents” and what high-stress outages actually do to people, plus a few concrete ideas for making on-call less brutal.

    If you’re on a platform team, own SLOs, or you’re the person people ping when “something is wrong with prod,” this one should give you a mix of immediate to-dos and longer-term questions for your roadmap.

    Links:

    IBM + Confluent https://www.confluent.io/blog/ibm-to-acquire-confluent/ https://newsroom.ibm.com/2025-12-08-ibm-to-acquire-confluent-to-create-smart-data-platform-for-enterprise-generative-ai

    React2Shell (CVE-2025-55182) https://react.dev/blog/2025/12/03/critical-security-vulnerability-in-react-server-components

    Netflix on Aurora PostgreSQL https://aws.amazon.com/blogs/database/netflix-consolidates-relational-database-infrastructure-on-amazon-aurora-achieving-up-to-75-improved-performance/

    Tools & tips https://opentofu.org/blog/opentofu-1-11-0/ https://rosesecurity.dev/2025/12/04/terraform-tips-and-tricks.html https://mitchellh.com/writing/ghostty-non-profit https://github.com/github/spec-kit https://github.com/Fission-AI/OpenSpec

    Human side https://uptimelabs.io/your-brain-on-incidents/

    Voir plus Voir moins
    16 min
  • AWS re:Invent for Platform Teams, GKE at 130k Nodes, and Killing Staging
    Dec 4 2025

    In this episode of Ship It Weekly, Brian looks at re:Invent through a platform/SRE lens and pulls out the updates that actually change how you design and run systems.

    We talk about regional NAT Gateways and Route 53 Global Resolver on the networking side, ECS Express Mode and EKS Capabilities as new paved roads for app teams, S3 Vectors GA and 50 TB S3 objects for AI and data lakes, Aurora PostgreSQL dynamic data masking, CodeCommit’s return to full GA, and IAM Policy Autopilot for AI-assisted IAM policies. This was recorded mid–re:Invent, so consider it a “what matters so far” pass, not a full recap.

    Outside AWS, we get into Google’s 130,000-node GKE cluster and what actually applies if you’re running normal-sized clusters, plus the “It’s time to kill staging” argument and what responsible testing in production looks like with feature flags, progressive delivery, and solid observability.

    In the lightning round, we hit Zachary Loeber’s Terraform MCP server and terraform-ingest (letting AI tools speak your real Terraform modules), Runs-On’s EC2 instance rankings so you stop picking instance types by vibes, and Airbnb’s adaptive traffic management for their key-value store. We close with Nolan Lawson’s “The fate of small open source” and what it means when your platform quietly depends on one-maintainer libraries.

    Links from this episode:

    AWS highlights:

    https://aws.amazon.com/about-aws/whats-new/2025/11/aws-nat-gateway-regional-availability

    https://aws.amazon.com/blogs/aws/introducing-amazon-route-53-global-resolver-for-secure-anycast-dns-resolution-preview

    https://aws.amazon.com/about-aws/whats-new/2025/11/announcing-amazon-ecs-express-mode

    https://aws.amazon.com/about-aws/whats-new/2025/12/amazon-s3-vectors-generally-available/

    Other topics:

    https://cloud.google.com/blog/products/containers-kubernetes/how-we-built-a-130000-node-gke-cluster

    https://thenewstack.io/its-time-to-kill-staging-the-case-for-testing-in-production/

    https://blog.zacharyloeber.com/article/terraform-custom-module-mcp-server/

    https://go.runs-on.com/instances/ranking

    https://medium.com/airbnb-engineering/from-static-rate-limiting-to-adaptive-traffic-management-in-airbnbs-key-value-store-29362764e5c2

    https://nolanlawson.com/2025/11/16/the-fate-of-small-open-source/

    Voir plus Voir moins
    22 min
  • Kubernetes Config Reality Check, EKS Control Planes, and GitHub Guardrails
    Nov 26 2025

    In this episode of Ship It Weekly, Brian digs into what’s new for people actually running infra: Kubernetes config, EKS control planes and networking, and GitHub’s latest CI/CD and Copilot updates.

    We start with Kubernetes’ new configuration good practices post and how to turn it into a checklist to clean up Helm/Kustomize and kill off “hotfix from my laptop” manifests.

    Then we hit AWS: EKS Provisioned Control Plane to size control plane capacity for big or noisy clusters, plus new network observability so you can see who’s talking to what across clusters and AZs instead of guessing from node metrics.

    On the GitHub side, Actions OIDC tokens now include a check_run_id for tighter access control, and Copilot adds instructions files and custom agents so you can encode platform and security expectations directly into reviews and workflows.

    In the lightning round, we touch on Terrascan being archived, Microsoft’s write-up of a 15.72 Tbps Aisuru DDoS attack against Azure, and AWS flat-rate CloudFront plans that bundle CDN and security into more predictable pricing.

    We close with Lorin Hochstein’s “Two thought experiments” and what it looks like to write incident reports as if an AI (and your future teammates) will rely on them to debug the next outage.

    If run Kubernetes in prod this one should give you a few concrete ideas for your roadmap.

    Links from episode

    https://kubernetes.io/blog/2025/11/25/configuration-good-practices/

    https://aws.amazon.com/about-aws/whats-new/2025/11/amazon-eks-provisioned-control-plane/

    https://aws.amazon.com/blogs/aws/monitor-network-performance-and-traffic-across-your-eks-clusters-with-container-network-observability/

    https://github.blog/changelog/2025-11-13-github-actions-oidc-token-claims-now-include-check_run_id/

    https://github.blog/ai-and-ml/unlocking-the-full-power-of-copilot-code-review-master-your-instructions-files/

    https://docs.github.com/en/copilot/how-tos/use-copilot-agents/coding-agent/create-custom-agents

    Lightning Round

    https://github.com/tenable/terrascan

    https://www.bleepingcomputer.com/news/microsoft/microsoft-aisuru-botnet-used-500-000-ips-in-15-tbps-azure-ddos-attack/

    https://aws.amazon.com/about-aws/whats-new/2025/11/aws-flat-rate-pricing-plans/

    https://sreweekly.com/sre-weekly-issue-498/ (Lorin's Article)

    Voir plus Voir moins
    17 min
  • Kubernetes Shake-ups, Platform Reality, and AI-Native SRE
    Nov 21 2025

    In this episode of Ship It Weekly, Brian digs into 3 big themes for anyone running Kubernetes or building internal platforms.

    First, Kubernetes is officially retiring Ingress NGINX and moving it into best-effort maintenance until March 2026. We talk about what that actually means if you’re still using it and how to think about choosing and rolling out a replacement ingress.

    Second, we look at how CNCF is defining platform engineering and what “platform as a product” looks like in practice, plus some hard-earned lessons from running Kubernetes in production.

    Third, we talk about AI as a first-class workload on Kubernetes. CNCF’s new Certified Kubernetes AI Conformance Program aims to standardize how AI runs on K8s, and recent writing on SRE in the age of AI looks at what reliability means when systems learn and drift.

    In the lightning round, we hit good reads on database migrations, Postgres upgrades, and a distributed priority queue on Kafka. We wrap with the human side of incidents: fixation during incident response and using incidents as landmarks for the tradeoffs you’ve been making over time.

    If you’re on a platform team, responsible for SLOs, or the person people ping when “Kubernetes is weird,” this one should give you concrete questions to take back to your roadmap and runbooks.

    Links from this episode

    https://kubernetes.io/blog/2025/11/11/ingress-nginx-retirement/

    https://www.haproxy.com/blog/ingress-nginx-is-retiring

    https://www.cncf.io/blog/2025/11/19/what-is-platform-engineering/

    https://www.cncf.io/announcements/2025/11/11/cncf-launches-certified-kubernetes-ai-conformance-program-to-standardize-ai-workloads-on-kubernetes/

    https://devops.com/sre-in-the-age-of-ai-what-reliability-looks-like-when-systems-learn/

    Lightning round

    https://www.cncf.io/blog/2025/11/18/top-5-hard-earned-lessons-from-the-experts-on-managing-kubernetes/

    https://www.tines.com/blog/zero-downtime-database-migrations-lessons-from-moving-a-live-production

    https://palark.com/blog/postgresql-upgrade-no-data-loss-downtime/

    https://klaviyo.tech/building-a-distributed-priority-queue-in-kafka-1b2d8063649e

    https://sreweekly.com/sre-weekly-issue-497/

    https://ferd.ca/ongoing-tradeoffs-and-incidents-as-landmarks.html

    Voir plus Voir moins
    16 min
  • Special: When the Cloud Has a Bad Day: Cloudflare, AWS us-east-1 & GitHub Outages
    Nov 20 2025

    In this special kickoff episode of Ship It Weekly, Brian walks through three major outages from the last few weeks and what they actually mean for DevOps, SRE, and platform teams.

    Instead of just reading status pages, we look at how each incident exposes assumptions in our own architectures and runbooks:

    Topics in this episode:

    • Cloudflare’s global outage and what happens when your CDN/WAF becomes a single point of failure

    • The AWS us-east-1 incident and why “multi-AZ in one region” isn’t a full disaster recovery strategy

    • GitHub’s Git operations / Codespaces outage and how fragile our CI/CD and GitOps flows can be

    • Practical questions to ask about your own setup: CDN bypass, cross-region readiness, backups for Git and CI

    This episode is more of a themed “special” to kick things off.

    Going forward, most episodes will follow a lighter news format: a couple of main stories from the week in DevOps/SRE/platform engineering, a quick tools and releases segment, and one culture/on-call or burnout topic. Specials like this will pop up when there’s a big incident or theme worth unpacking.

    If you’re the person people DM when production is acting weird, or you’re responsible for the platform everyone ships on, this one’s for you.

    Links from this episode

    Cloudflare outage – November 18, 2025

    https://blog.cloudflare.com/18-november-2025-outage/

    https://www.thousandeyes.com/blog/cloudflare-outage-analysis-november-18-2025

    AWS us-east-1 outage – October 2025

    https://aws.amazon.com/message/101925/

    https://www.thousandeyes.com/blog/aws-outage-analysis-october-20-2025

    GitHub outage – November 18, 2025

    https://us.githubstatus.com/incidents/f3f7sg2d1m20

    https://currently.att.yahoo.com/att/github-down-now-not-just-211700617.html

    Voir plus Voir moins
    13 min