Evaluation Overview

Evals give you a repeatable check of your LLM application's behavior. You replace guesswork with data.

LLM evaluation is part of the AI engineering loop:

  1. Trace -> Capture real behavior with Observability.
  2. Monitor -> Track production quality with online evaluators and Score Analytics.
  3. Build datasets -> Turn traces into reusable evaluation assets with annotations and datasets.
  4. Experiment -> Validate changes with experiments and CI/CD checks before they ship.
  5. Evaluate -> Decide what is good enough to ship.

Trace

Capture what users asked, what your system did, and where the output came from with traces and observations. Trace data enables the rest of the loop: monitoring, datasets, and regression checks. For the broader pattern, see tracing.

Monitor

Use production signals to find the traces worth reviewing. Monitoring helps you spot data drift, flag quality issues automatically, and discover examples for your evaluation set.

Build datasets

Turn raw traces into reusable evaluation assets. Start with human review, name the failure modes, then convert useful examples into datasets and score definitions. To get started, see datasets and error analysis.
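As a rough illustration of moving from reviewed traces to dataset items, the sketch below uses plain Python dictionaries rather than the Langfuse SDK. The field names (`input`, `output`, `failure_mode`), the sample data, and the helper `to_dataset_items` are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical reviewed traces. Each carries a failure-mode label assigned
# during human review; None means the reviewer found no problem.
reviewed_traces = [
    {"input": "Reset my password", "output": "Click 'Forgot password'.", "failure_mode": None},
    {"input": "Cancel my plan", "output": "I cannot help with that.", "failure_mode": "unhelpful_refusal"},
    {"input": "What is your refund policy?", "output": "Refunds take 90 days.", "failure_mode": "hallucination"},
    {"input": "Delete my account", "output": "I cannot help with that.", "failure_mode": "unhelpful_refusal"},
]


def to_dataset_items(traces):
    """Keep traces with a named failure mode and turn them into test cases."""
    return [
        {"input": t["input"], "metadata": {"failure_mode": t["failure_mode"]}}
        for t in traces
        if t["failure_mode"] is not None
    ]


dataset_items = to_dataset_items(reviewed_traces)

# Counting failure modes shows which categories are common enough to
# deserve their own score definition.
failure_counts = Counter(item["metadata"]["failure_mode"] for item in dataset_items)
```

Keeping the failure mode in the item metadata makes it possible to slice later experiment results by category.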

Experiment

Use experiments to confirm that a prompt, model, retrieval setup, agent implementation, or evaluator variant improves quality without regressions.
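The idea of "improves quality without regressions" can be sketched in plain Python, independent of any SDK. Everything here is an assumption made for illustration: the tiny dataset, the two stand-in applications, the `exact_match` evaluator, and the helper names. The point it shows is that an average score alone can hide per-item regressions, so a comparison should also look at individual items.

```python
from statistics import mean


def exact_match(output: str, expected: str) -> float:
    """Hypothetical evaluator: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_experiment(app, dataset):
    """Run the app on every dataset item and score each output."""
    return [exact_match(app(item["input"]), item["expected_output"]) for item in dataset]


def find_regressions(baseline_scores, candidate_scores):
    """Return indices of items where the candidate scores lower than the baseline."""
    return [
        i
        for i, (b, c) in enumerate(zip(baseline_scores, candidate_scores))
        if c < b
    ]


dataset = [
    {"input": "2+2", "expected_output": "4"},
    {"input": "capital of France", "expected_output": "Paris"},
    {"input": "3*3", "expected_output": "9"},
]


# Stand-ins for two versions of an LLM application.
def baseline_app(question: str) -> str:
    return {"2+2": "4", "capital of France": "paris"}.get(question, "unknown")


def candidate_app(question: str) -> str:
    return {"2+2": "four", "capital of France": "Paris", "3*3": "9"}.get(question, "unknown")


baseline_scores = run_experiment(baseline_app, dataset)
candidate_scores = run_experiment(candidate_app, dataset)
regressions = find_regressions(baseline_scores, candidate_scores)

print(f"baseline mean:  {mean(baseline_scores):.2f}")
print(f"candidate mean: {mean(candidate_scores):.2f}")
print(f"regressed item indices: {regressions}")
```

In this toy run both versions have the same mean score, yet the candidate fixes one item and breaks another, which is exactly the case a per-item comparison is meant to catch.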

Evaluate

Judge experiment outputs before shipping. Start with manual review to understand quality and failure modes, then automate dedicated evaluators where they add repeatable signal. The tradeoffs are covered in evaluation methods.

  • Use manual evaluation to build intuition and calibrate automated evaluators.
  • Use scores via API/SDK for custom evaluation pipelines, guardrails, runtime checks, user feedback, and internal review workflows.
  • Use LLM-as-a-Judge for qualities that require language understanding, such as relevance, tone, completeness, or factuality.
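One way to calibrate an automated evaluator against manual evaluation is to compare its verdicts with human labels on the same examples. The sketch below is a minimal, SDK-independent illustration. The labels are made-up sample data and `agreement_rate` is a hypothetical helper. Raw agreement is the simplest possible measure, and a chance-corrected statistic may be more appropriate when labels are imbalanced.

```python
def agreement_rate(human_labels, evaluator_labels):
    """Fraction of examples where the automated evaluator matches the human label."""
    if len(human_labels) != len(evaluator_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("need at least one labeled example")
    matches = sum(h == e for h, e in zip(human_labels, evaluator_labels))
    return matches / len(human_labels)


# Made-up sample data: True means "output is acceptable".
human = [True, True, False, True, False]
judge = [True, False, False, True, False]

rate = agreement_rate(human, judge)

# Indices where the two disagree are the examples worth reviewing first,
# either to refine the evaluator or to tighten the labeling criteria.
disagreements = [i for i, (h, e) in enumerate(zip(human, judge)) if h != e]

print(f"agreement: {rate:.0%}, disagreements at: {disagreements}")
```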

Which Langfuse feature should I use?

If you want to...                      | Use this Langfuse feature
Capture application behavior           | Observability, traces and observations
Segment traces for later review        | Tags, metadata, users, sessions, environments, releases
Review examples manually               | Annotation Queues, Scores via UI
Open Coding: capture open-ended notes  | TEXT scores, Annotation Queues
Axial Coding: derive failure modes     | Stable error categories, evaluation criteria
Create reusable test cases             | Datasets
Compare changes before shipping        | Experiments via UI, Experiments via SDK
Gate pull requests or deploys          | CI/CD experiments
Monitor production quality             | LLM-as-a-Judge, Scores via API/SDK
Analyze evaluator results              | Score Analytics, custom dashboards
