Semantic Caches: Scaling AI with Smarter Caching (Chapter 15)
About this audio
Semantic caches are transforming how AI systems handle costly reasoning by intelligently reusing prior agent workflows to slash latency and inference costs. In this episode, we unpack Chapter 15 of Keith Bourne’s "Unlocking Data with Generative AI and RAG," exploring the architectures, trade-offs, and practical engineering of semantic caches for production AI.
In this episode:
- What semantic caches are and why they reduce AI inference latency by up to 100x
- Core techniques: vector embeddings, entity masking, and CrossEncoder verification
- Comparing semantic cache variants and fallback strategies for robust performance
- Under-the-hood implementation details using ChromaDB, sentence-transformers, and CrossEncoder
- Real-world use cases across finance, customer support, and enterprise AI assistants
- Key challenges: tuning thresholds, cache eviction, and maintaining precision in production
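
For listeners who want a concrete picture of the lookup-then-fallback flow listed above, here is a minimal sketch. It is not the book's implementation: `toy_embed`, `SemanticCache`, and `answer` are hypothetical names, the bag-of-words vector is a stand-in for a real embedding model, the linear scan is a stand-in for a vector database, and the 0.8 threshold is an arbitrary assumption. The CrossEncoder verification step is left out.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical in-memory cache keyed by query similarity, not exact text."""

    def __init__(self, threshold=0.8):  # threshold value is an assumption
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, query, response):
        self.entries.append((toy_embed(query), response))

    def get(self, query):
        """Return the best-matching response, or None if below the threshold."""
        q = toy_embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


def answer(query, cache, expensive_call):
    """Try the cache first; fall back to the expensive call on a miss."""
    cached = cache.get(query)
    if cached is not None:
        return cached, True
    result = expensive_call(query)
    cache.put(query, result)
    return result, False
```

Raising the threshold trades hit rate for precision, which is the tuning problem named in the last bullet. A production setup would likely swap the stand-ins for the tools discussed in the episode.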
Key tools and technologies mentioned:
- ChromaDB vector database
- Sentence-transformers embedding models (e.g., all-mpnet-base-v2)
- CrossEncoder models for verification
- Regex-based entity masking
- Adaptive similarity thresholding
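
As a rough illustration of regex-based entity masking from the list above, the sketch below replaces variable values with placeholders so that queries differing only in those values reduce to the same template before embedding. The patterns, placeholder names, and the `mask_entities` helper are all assumptions made for illustration; they are not taken from the book or the episode.

```python
import re

# Hypothetical patterns, applied in order; more specific ones come first
# so that dates and amounts are not swallowed by the generic number rule.
MASK_PATTERNS = [
    ("<EMAIL>", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("<DATE>", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
    ("<AMOUNT>", re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")),
    ("<NUM>", re.compile(r"\b\d+\b")),
]


def mask_entities(text):
    """Return (masked_text, found), where found lists (placeholder, value)."""
    found = []
    for placeholder, pattern in MASK_PATTERNS:
        def repl(match, placeholder=placeholder):
            # Record the original value so it can be re-inserted later.
            found.append((placeholder, match.group(0)))
            return placeholder

        text = pattern.sub(repl, text)
    return text, found


masked_a, values_a = mask_entities("Refund $1,250.00 to order 48213 on 2024-03-05")
masked_b, values_b = mask_entities("Refund $80 to order 7 on 2025-01-01")
# Both queries reduce to the same template, so they can share a cache entry.
```

Keeping the extracted values alongside the masked text lets a cached workflow be replayed with the new entities filled back in.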
Timestamps:
00:00 - Introduction and episode overview
02:30 - What are semantic caches and why now?
06:15 - Core architecture: embedding, masking, and verification
10:00 - Semantic cache variants and fallback approaches
13:30 - Implementation walkthrough using Python and ChromaDB
16:00 - Real-world applications and performance metrics
18:30 - Open problems and engineering challenges
19:30 - Final thoughts and book spotlight
Resources:
- "Unlocking Data with Generative AI and RAG" by Keith Bourne - Search for 'Keith Bourne' on Amazon and grab the 2nd edition
- Memriq AI: https://Memriq.ai