Getting started with LLMOps: A practical guide
How to monitor agentic applications with Opik

So, you’ve built your agentic workflow with LangGraph. It runs fine on your laptop, the responses seem solid, latency looks good, and token usage isn’t crazy … I mean, what could possibly go wrong?
Well—actually—a lot. And I mean a lot.
In today’s article, we’re diving into LLMOps (that’s LLM Operations), which is basically a mashup of DevOps, MLOps, and a bunch of LLM-specific practices. So yes, more “Ops” than anyone asked for.
But don’t worry—I’m not here to dump a bunch of dry theory on you. We’re taking a hands-on, practical approach. Specifically, I’ll show you how LLMOps fits into PhiloAgents, my open-source course with Paul Iusztin where you’ll learn to build a video game agent simulation using MongoDB, LangGraph, Groq and Opik.
This article builds on the theory and code covered in previous ones, so be sure to check them out if you haven’t already!
Ready? Let’s go! 👇
Wait … what is LLMOps?
In short, LLMOps (Large Language Model Operations) is the set of practices, tools, and workflows used to deploy, monitor, and manage LLMs in real-world applications. Think of it as a specialized branch of MLOps—built to handle the quirks and challenges that come with working specifically with large language models.
Like most things in the GenAI space, LLMOps is still very much evolving. But from what I’ve seen so far, it helps to break it down into six key components (check out the image below for a visual overview).
Here’s what each component covers:
Model Deployment - Getting your LLM live—this includes deploying to inference servers, fine-tuning, and managing versions.
Monitoring - Track performance, catch errors, log traces, and keep response times in check. (We’ll cover this today.)
Data Management - Curate and maintain high-quality training and evaluation data. (Also part of today’s demo.)
Security - Privacy, permissions, guardrails, and staying compliant—essential for any real-world app.
Prompt Versioning - Just like code, prompts need version control. It’s often skipped, but key for stability in production.
Evaluation - Measure how your system performs—hallucinations, relevance, moderation, and more.
Today, we’re focusing on prompt versioning, monitoring, data management and evaluation with a real, hands-on example.
LLMOps with Opik
For the rest of this article, we’ll be using Opik—an open-source LLM evaluation framework packed with tools that make it easier to apply LLMOps principles to your agentic app.
Let’s begin with prompt versioning.
Prompt Versioning
As agentic applications become more common, prompt versioning is turning into a must-have. These days, you wouldn’t dream of shipping a system without versioning your code, models, or data—so why treat prompts any differently?
With Opik, keeping track of prompt changes is straightforward. Just check out this example class from the PhiloAgents source code:
All it takes is giving your prompt a name and passing it to opik.Prompt—and you’re good to go!
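For reference, here’s a minimal sketch of what such a wrapper can look like. The class name, prompt name, and template text below are illustrative assumptions rather than the exact PhiloAgents code:

```python
import opik


class Prompt:
    """Thin wrapper that registers a prompt template in Opik's Prompt Library.

    Whenever the template text changes, Opik stores a new version under the
    same name, so the full edit history stays browsable in the dashboard.
    """

    def __init__(self, name: str, prompt: str) -> None:
        self.name = name
        try:
            # Registers (or versions) the prompt in the Opik Prompt Library.
            self.__prompt = opik.Prompt(name=name, prompt=prompt)
        except Exception:
            # Fall back to the raw string if Opik is unreachable.
            self.__prompt = prompt

    @property
    def prompt(self) -> str:
        if isinstance(self.__prompt, opik.Prompt):
            return self.__prompt.prompt
        return self.__prompt


# Example: an assumed summarization prompt used by the agent.
SUMMARY_PROMPT = Prompt(
    name="summary_prompt",
    prompt="Summarize the conversation below in at most three sentences:\n{conversation}",
)
```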
Once you fire up the PhiloAgents app, head over to your Opik dashboard, and under the Prompt Library tab, you’ll see something like this:
You’ll find a list of all the prompts used in PhiloAgents, like the summarization prompt, the philosopher card prompt, and more. You can also see how many versions each prompt has (the philosopher card, for example, has 18; we iterated a lot) and when each one was last updated.
Now that we’ve got a handle on prompt versioning, let’s move on to monitoring.
Agent Monitoring
Agentic frameworks like LangGraph make it incredibly easy to build complex workflows with just a few lines of code.
The downside? A lot of what’s happening under the hood gets abstracted away, turning your agent’s behavior into a bit of a black box. That’s why it’s so important to track every single message, tool call, and feedback loop your agent goes through.
Thankfully, Opik makes this dead simple—just check out the snippet below to see how easy it is to plug in.
All you need to do is instantiate an OpikTracer and pass it as a callback when invoking your graph. From there, Opik handles the rest, automatically sending every trace you need straight to the dashboard.
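Here’s a minimal sketch of the pattern, assuming a compiled LangGraph graph called graph and a single user message as input (the graph name and message are illustrative):

```python
from opik.integrations.langchain import OpikTracer

# `graph` is assumed to be a compiled LangGraph StateGraph, e.g. graph = builder.compile().
# Passing the graph structure lets Opik render it alongside the traces.
opik_tracer = OpikTracer(graph=graph.get_graph(xray=True))

result = graph.invoke(
    {"messages": [("user", "What is the essence of the Socratic method?")]},
    config={"callbacks": [opik_tracer]},  # every node call, input, and output gets traced
)
```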
In the Traces tab of the Opik dashboard, you’ll see a full breakdown: each node called, total execution time, time per node, inputs, outputs, and even the graph structure itself.
Now it’s time to move on to the next step: evaluating our Agentic RAG application.
Evaluation (and Data Management)
The first step in evaluating a RAG application is building an evaluation dataset. To do that, we’ll follow the process outlined in the diagram below.
First, we download a document about a specific philosopher (like Plato) from sources like Wikipedia or the Stanford Encyclopedia of Philosophy.
Next, we use LangChain’s RecursiveCharacterTextSplitter to break the text into manageable chunks.
Then comes the fun part: we run a chain powered by Llama 3.3 70B (served on Groq) to generate synthetic user–assistant conversations grounded in those chunks.
Each conversation becomes a row in our evaluation dataset, which we then push to Opik for analysis.
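Here’s a rough sketch of that pipeline. The model name, chunk sizes, prompt wording, and dataset name are assumptions for illustration; the real PhiloAgents pipeline lives in the course repository:

```python
import json

import opik
from langchain_groq import ChatGroq
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 0. Load the source document (assumed to be a local dump of the Wikipedia article).
document_text = open("plato.txt").read()

# 1. Split the document into manageable chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(document_text)

# 2. Ask Llama 3.3 70B (served on Groq) to turn each chunk into a short dialogue.
llm = ChatGroq(model="llama-3.3-70b-versatile", temperature=0.7)
generation_prompt = (
    "Based only on the excerpt below, write a short user question and the "
    "assistant's answer as JSON with keys 'question' and 'answer'.\n\n{chunk}"
)

samples = []
for chunk in chunks:
    reply = llm.invoke(generation_prompt.format(chunk=chunk))
    qa = json.loads(reply.content)  # assumes the model returns valid JSON
    samples.append(
        {"input": qa["question"], "expected_output": qa["answer"], "context": chunk}
    )

# 3. Push every generated conversation as a row of the evaluation dataset.
client = opik.Opik()
dataset = client.get_or_create_dataset(name="philoagents-evaluation-dataset")
dataset.insert(samples)
```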
Now that we’ve got our dataset, it’s time to evaluate the Agentic RAG app using five key metrics:
Hallucination – Checks if the response includes made-up info, using input, output, and (if available) context.
Answer Relevance – Rates how well the response answers the question, focusing on relevance over factual accuracy.
Moderation – Scores how appropriate the response is, on a scale from 1 to 10.
Context Precision – Measures how closely the response aligns with the provided context.
Context Recall – Evaluates whether important parts of the context were actually used in the response.
You’ll find all the evaluation logic in our evaluate_agent method—see the snippet below.
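If you want to roll your own, here’s a minimal sketch of what such a method can boil down to, using Opik’s built-in metrics and evaluate helper. The dataset name, experiment name, and the call_agent helper are assumptions, not the exact PhiloAgents implementation:

```python
from opik import Opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import (
    AnswerRelevance,
    ContextPrecision,
    ContextRecall,
    Hallucination,
    Moderation,
)


def evaluation_task(sample: dict) -> dict:
    """Run one dataset row through the agent and collect what the metrics need."""
    # call_agent is a hypothetical helper wrapping graph.invoke; it is assumed to
    # return the agent's answer plus the list of retrieved context chunks.
    answer, context_chunks = call_agent(sample["input"])
    return {
        "input": sample["input"],
        "output": answer,
        "context": context_chunks,
        "expected_output": sample["expected_output"],
    }


dataset = Opik().get_dataset(name="philoagents-evaluation-dataset")

evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[
        Hallucination(),
        AnswerRelevance(),
        Moderation(),
        ContextPrecision(),
        ContextRecall(),
    ],
    experiment_name="philoagents-rag-evaluation",
)
```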
Once the evaluation is complete, you’ll get a panel showing all the metrics from your experiment. If you head to Projects → Metrics, you’ll also find a plot of the daily averages, which is super useful for tracking your app’s performance over time.
Phew … that was a lot to take in, I know.
But here’s the good news: if you prefer learning by watching, and you're up for 40+ minutes of LLMOps goodness, you’re in luck!
Also make sure to read Anca Ioana Muscalagiu’s article on Observability for RAG Agents on Decoding ML. The two resources are complementary, so check out both to get the full picture.
Have an amazing week and see you next Wednesday! 👋