Discussion about this post

Louis-François Bouchard:

Thanks for having us, Miguel! :) Super happy to be the first to guest post, and also super happy to jump in and share our insights from Towards AI.

Nicos:

Welcome to Louis-Francois on your fabulous waves, Miguel! I skimmed through this fascinating yet dense deep dive. My intention is not to bluntly critique, but to help your readers get better clarity/intuition from the outset: why bother?

As stated numerous times in the post, CAG is a different beast: a real-world setup requires a thorough understanding of the underlying mechanisms (transformer internals, KV caching, FSDP, 3D parallelism, etc.).

Use cases warranting CAG implementation

——

- You are an enterprise that has expensive GPU capacity and wants to strike a cost-effective latency/throughput balance.

- Your typical sessions are either very long, or the same context is reused across many sessions (users).

- The context cannot change; otherwise, as with every caching mechanism, it has to be recomputed (see the sketch after this list).
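
To make the "repeated use of context" point concrete, here is a minimal sketch of the core CAG idea: pay the prefill cost for a long, static context once, keep its KV cache, and reuse it for every query. It assumes a Hugging Face causal LM ("gpt2" is only a stand-in) and greedy decoding; the model name, the example context string, and the `answer` helper are illustrative, not from the post.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is a stand-in; any causal LM with KV caching works the same way.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# 1) Pay the prefill cost for the shared, static context exactly once.
context = "Internal knowledge base: ... (long, static context) ..."
ctx_ids = tok(context, return_tensors="pt").input_ids
with torch.no_grad():
    ctx_out = model(ctx_ids, use_cache=True)
cached_kv = ctx_out.past_key_values  # this is the "cache" in CAG

# 2) Serve each query on top of the cached context; only the new tokens
#    (query + generated answer) are ever run through the model.
def answer(query: str, max_new_tokens: int = 50) -> str:
    q_ids = tok(query, return_tensors="pt").input_ids
    # Recent transformers versions mutate the cache object in place,
    # so work on a copy to keep the precomputed cache reusable.
    past = copy.deepcopy(cached_kv)
    generated = []
    with torch.no_grad():
        out = model(q_ids, past_key_values=past, use_cache=True)
        for _ in range(max_new_tokens):
            next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            generated.append(next_id)
            out = model(next_id, past_key_values=out.past_key_values, use_cache=True)
    return tok.decode(torch.cat(generated, dim=-1)[0], skip_special_tokens=True)

print(answer("What does the knowledge base say about caching?"))
```

The trade-off the list above points at: the prefill for `context` is amortized over every query, but the cache is only valid as long as `context` itself does not change.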

What are we missing?

- Evaluate, evaluate, evaluate: perfecting the prompt, taming hallucinations, and alignment all require multilayer/multistage evaluations, and they are highly dynamic. Forget "black box" CAG. Well, if you insist on open-heart surgery across all kinds of CAG hookups, that's a different pay grade; not even a post-doc might help :-)

- An enterprise has to sift through terabytes of data, swapping it in and out for analysis by the LLM. Forget about the limited context window, even with CAG machinery.

RAG, with its own demons, is here to save you. Or sort of, because more often than not it is clunky and ugly, with far too many moving parts and steps (thank you, non-determinism and never-ending evaluations).

If one is super adventurous and pragmatic (and has loads of determination), front-ending CAG with RAG would be the way to go; a rough sketch follows below. Of course, the devil is in the details. But it is not necessarily CAG vs. RAG.
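
One way to read "front-ending CAG with RAG": keep the large static corpus in the precomputed KV cache, and use retrieval only for the small, dynamic slice of context appended per query. A rough sketch under that assumption, reusing `answer` and `cached_kv` from the sketch above; `retrieve`, `dynamic_docs`, and the keyword scoring are hypothetical stand-ins for a real retriever, not anything from the post.

```python
from typing import List

# Toy stand-in for a real retriever (vector DB, BM25, ...): the dynamic
# documents live outside the cached context and are fetched per query.
dynamic_docs = [
    "Q3 sales report: revenue up 12% quarter over quarter.",
    "Incident log 2024-05-01: cache invalidation bug in the billing service.",
]

def retrieve(query: str, k: int = 2) -> List[str]:
    # Hypothetical keyword overlap scoring, purely for illustration.
    scored = sorted(
        dynamic_docs,
        key=lambda d: -sum(w in d.lower() for w in query.lower().split()),
    )
    return scored[:k]

def hybrid_answer(query: str) -> str:
    # RAG supplies the small, per-query slice of context...
    dynamic_ctx = "\n".join(retrieve(query))
    # ...while the large static corpus stays in the precomputed KV cache:
    # only the retrieved snippets + question are prefilled per request
    # (reuses `answer`, and therefore `cached_kv`, from the sketch above).
    return answer(f"{dynamic_ctx}\n\nQuestion: {query}")
```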
