How to Evaluate RAG Systems


Summary

Retrieval-Augmented Generation (RAG) systems combine a language model with a retrieval mechanism, allowing AI to answer questions by referencing relevant documents and providing more accurate, domain-specific responses. Understanding how to evaluate RAG systems is critical for ensuring reliable results, especially as businesses rely on them for handling complex data and real-world scenarios.

  • Build domain datasets: Create evaluation datasets with real-world queries and documents that represent the challenges your RAG system will face in production.
  • Apply diverse metrics: Use both quantitative metrics, such as precision and recall, and qualitative checks like completeness and context understanding to assess how well the system retrieves and generates answers.
  • Automate quality checks: Implement automated frameworks, like LLM-as-a-Judge or nugget-based evaluations, to scale judgment of answer correctness, faithfulness, and semantic alignment across many responses.
Summarized by AI based on LinkedIn member posts
  • Matt Wood, CTIO, PwC

    New! We’ve published a new set of automated evaluations and benchmarks for RAG, a critical component of gen AI used by most successful customers today. Retrieval-Augmented Generation lets you take general-purpose foundation models (like those from Anthropic, Meta, and Mistral) and “ground” their responses in specific target areas or domains using information the models haven’t seen before (perhaps confidential or private info, or new or real-time data). This lets gen AI apps generate responses targeted to that domain, with better accuracy, context, reasoning, and depth of knowledge than the model provides off the shelf.

    In this new paper, we describe a way to evaluate task-specific RAG approaches so that they can be benchmarked and compared against real-world uses, automatically. It’s an entirely novel approach, and one we think will help customers tune and improve their AI apps much more quickly and efficiently, driving up accuracy while driving down the time it takes to build a reliable, coherent system.

    🔎 The evaluation is tailored to a particular knowledge domain or subject area. For example, the paper describes tasks related to DevOps troubleshooting, scientific research (arXiv abstracts), technical Q&A (StackExchange), and financial reporting (SEC filings).
    📝 Each task is defined by a specific corpus of documents relevant to that domain. The evaluation questions are generated from and grounded in this corpus.
    📊 The evaluation assesses the RAG system’s ability to perform specific functions within that domain, such as answering questions, solving problems, or providing relevant information based on the given corpus.
    🌎 The tasks are designed to mirror real-world scenarios and questions that a RAG system might encounter in practical applications within that domain.
    🔬 Unlike general language-model benchmarks, these task-specific evaluations focus on the RAG system’s performance in retrieving and applying information from the given corpus to answer domain-specific questions.
    ✍️ The approach allows evaluations to be created for any task that can be defined by a corpus of relevant documents, making it adaptable to a wide range of use cases and industries.

    Really interesting work from the Amazon Science team, and a new yardstick for customers choosing and tuning their RAG systems. Paper linked below.
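The corpus-grounded question generation described above can be sketched in a few lines. This is an illustrative outline, not the paper's actual procedure: `generate_eval_items`, the prompt wording, and `call_llm` (a stand-in for any prompt-to-text model client) are all assumptions.

```python
def generate_eval_items(corpus_chunks, call_llm, per_chunk=2):
    """Draft grounded (question, source) pairs from a task corpus.

    `call_llm` is any prompt -> text callable; swap in a real model
    client. The prompt wording here is an illustration only.
    """
    items = []
    for chunk in corpus_chunks:
        prompt = (f"Write {per_chunk} questions that are answerable ONLY "
                  f"from this passage, one per line:\n\n{chunk}")
        for question in call_llm(prompt).strip().splitlines():
            items.append({"question": question.strip(), "source": chunk})
    return items

# Stubbed model so the sketch runs offline:
fake_llm = lambda prompt: "What does the service do?\nWhen did it launch?"
items = generate_eval_items(["Service X launched in 2023 to do Y."], fake_llm)
```

Because each question is paired with the chunk that produced it, the same records later serve as ground-truth relevance labels for retrieval metrics.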

  • Armand Ruiz, building AI systems

    Explaining the evaluation method LLM-as-a-Judge (LLMaaJ).

    Token-based metrics like BLEU or ROUGE are still useful for structured tasks like translation or summarization. But for open-ended answers, RAG copilots, or complex enterprise prompts, they often miss the bigger picture. That’s where LLMaaJ changes the game.

    𝗪𝗵𝗮𝘁 𝗶𝘀 𝗶𝘁? You use a powerful LLM as an evaluator, not a generator. It’s given:
    - The original question
    - The generated answer
    - The retrieved context or gold answer

    𝗧𝗵𝗲𝗻 𝗶𝘁 𝗮𝘀𝘀𝗲𝘀𝘀𝗲𝘀:
    ✅ Faithfulness to the source
    ✅ Factual accuracy
    ✅ Semantic alignment, even if phrased differently

    𝗪𝗵𝘆 𝘁𝗵𝗶𝘀 𝗺𝗮𝘁𝘁𝗲𝗿𝘀: LLMaaJ captures what traditional metrics can’t. It understands paraphrasing. It flags hallucinations. It mirrors human judgment, which is critical when deploying GenAI systems in the enterprise.

    𝗖𝗼𝗺𝗺𝗼𝗻 𝗟𝗟𝗠𝗮𝗮𝗝-𝗯𝗮𝘀𝗲𝗱 𝗺𝗲𝘁𝗿𝗶𝗰𝘀:
    - Answer correctness
    - Answer faithfulness
    - Coherence, tone, and even reasoning quality

    📌 If you’re building enterprise-grade copilots or RAG workflows, LLMaaJ is how you scale QA beyond manual reviews.

    To put LLMaaJ into practice, check out EvalAssist, a new tool from IBM Research. It offers a web-based UI to streamline LLM evaluations:
    - Refine your criteria iteratively using Unitxt
    - Generate structured evaluations
    - Export as Jupyter notebooks to scale effortlessly

    A powerful way to bring LLM-as-a-Judge into your QA stack.
    - Get Started guide: https://lnkd.in/g4QP3-Ue
    - Demo site: https://lnkd.in/gUSrV65s
    - GitHub repo: https://lnkd.in/gPVEQRtv
    - Whitepapers: https://lnkd.in/gnHi6SeW
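A minimal LLM-as-a-Judge loop can be sketched as follows. The rubric wording, the JSON schema, and the `call_llm` callable are illustrative assumptions; in practice you would wire `call_llm` to your provider's client and a strong evaluator model.

```python
import json

JUDGE_PROMPT = """You are an impartial evaluator.

Question: {question}
Retrieved context: {context}
Generated answer: {answer}

Score the answer from 1 (poor) to 5 (excellent) on:
- faithfulness: every claim is supported by the retrieved context
- correctness: the answer is factually accurate
- alignment: the answer addresses the question, even if phrased differently

Reply with JSON only, e.g. {{"faithfulness": 5, "correctness": 4, "alignment": 5}}"""

def judge(question, context, answer, call_llm):
    """Grade one (question, context, answer) triple with an evaluator LLM.

    `call_llm` is any prompt -> text callable (your provider's client).
    """
    raw = call_llm(JUDGE_PROMPT.format(
        question=question, context=context, answer=answer))
    return json.loads(raw)

# Stubbed evaluator so the sketch runs without an API key:
fake_llm = lambda prompt: '{"faithfulness": 5, "correctness": 4, "alignment": 5}'
scores = judge("What is RAG?",
               "RAG pairs a retriever with a generator.",
               "RAG grounds LLM answers in retrieved documents.",
               fake_llm)
```

Production judges usually add retries for malformed JSON and average over multiple judge calls to reduce variance.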

  • Sarthak Rastogi, AI engineer

    Your RAG app is NOT going to be usable in production (especially at large enterprises) if you overlook these evaluation steps:

    - Before anything else, FIRST create a comprehensive evaluation dataset by writing queries that match real production use cases.

    - Evaluate retriever performance with non-rank metrics like Recall@k (how many relevant chunks are found in the top-k results) and Precision@k (what fraction of retrieved chunks are actually relevant). These show whether the right content is being found, regardless of order :)

    - Assess retriever ranking quality with rank-based metrics, including:
    1. MRR (position of the first relevant chunk)
    2. MAP (considers all relevant chunks and their ranks)
    3. NDCG (compares the actual ranking to the ideal ranking)
    These measure how well your relevant content is prioritized.

    - Measure generator citation performance by designing prompts that request explicit citations like [1], [2] or source sections. Calculate citation Recall@k (relevant chunks that were actually cited) and citation Precision@k (cited chunks that are actually relevant).

    - Evaluate response quality with quantitative metrics like token-level F1 score, by tokenising both the generated and ground-truth responses.

    - Apply qualitative assessment across key dimensions, including completeness (fully answers the query), relevancy (the answer matches the question), harmfulness (potential for harm through errors), and consistency (aligns with the provided chunks).

    Finally, with your learnings from the eval results, you can implement systematic optimisation in three sequential stages:
    1. pre-processing (chunking, embeddings, query rewriting)
    2. processing (retrieval algorithms, LLM selection, prompts)
    3. post-processing (safety checks, formatting)

    With the right evaluation strategies and metrics in place, you can drastically enhance the performance and reliability of RAG systems :) Link to the brilliant article by Ankit Vyas from neptune.ai on how to implement these steps: https://lnkd.in/guDnkdMT #RAG #AIAgents #GenAI
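The non-rank retrieval metrics and token-level F1 mentioned above are straightforward to compute once you have ground-truth relevance labels; a minimal sketch:

```python
from collections import Counter

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunks found in the top-k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    return len(set(retrieved[:k]) & set(relevant)) / k

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant chunk (0.0 if none)."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1.0 / rank
    return 0.0

def token_f1(generated, reference):
    """Token-level F1 between a generated and a ground-truth response."""
    gen, ref = generated.lower().split(), reference.lower().split()
    overlap = sum((Counter(gen) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(gen), overlap / len(ref)
    return 2 * p * r / (p + r)

retrieved = ["c3", "c7", "c1", "c9", "c2"]  # retriever output, best first
relevant = {"c1", "c2", "c4"}               # ground-truth relevant chunk ids

recall_at_k(retrieved, relevant, 5)     # 2/3: c1 and c2 found, c4 missed
precision_at_k(retrieved, relevant, 5)  # 2/5 of the top-5 are relevant
mrr(retrieved, relevant)                # 1/3: first relevant chunk at rank 3
```

Whitespace tokenisation is the simplest choice here; real evaluations usually use the same tokenizer as the model or a standard one like NLTK's.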

  • Kuldeep Singh Sidhu, Senior Data Scientist @ Walmart

    Evaluating Retrieval-Augmented Generation (RAG) systems has long been a challenge, given the complexity and subjectivity of long-form responses. A recent collaborative research paper from institutions including the University of Waterloo, Microsoft, and Snowflake presents a promising solution: the AutoNuggetizer framework. This approach leverages Large Language Models (LLMs) to automate the "nugget evaluation methodology," initially proposed at TREC in 2003 for assessing responses to complex questions.

    Here's a technical breakdown of how it works under the hood:

    1. Nugget Creation:
    - LLMs automatically extract "nuggets," atomic pieces of essential information, from a set of related documents.
    - Nuggets are classified as "vital" (must-have) or "okay" (nice-to-have) based on their importance to a comprehensive response.
    - An iterative prompt-based approach using GPT-4o ensures the nuggets are diverse and cover different informational facets.

    2. Nugget Assignment:
    - LLMs then automatically evaluate each system-generated response, labeling each nugget as "support," "partial support," or "no support."
    - This semantic evaluation allows the model to recognize supported facts even without direct lexical matching.

    3. Evaluation and Correlation:
    - Automated evaluation scores strongly correlated with manual evaluations, particularly at the system-run level, suggesting this methodology could scale efficiently for broad usage.
    - Interestingly, automating nugget assignment alone significantly increased alignment with manual evaluations, highlighting its potential as a cost-effective evaluation approach.

    Through rigorous validation against human annotations, the AutoNuggetizer framework demonstrates a practical balance between automation and evaluation quality, providing a scalable, accurate method to advance RAG system evaluation. The research underscores the potential of automating complex evaluations and opens avenues for future improvements in RAG systems.
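A simplified sketch of nugget-based scoring, assuming an LLM has already produced the vital/okay labels and support assignments described above. The weights (okay nuggets count half, partial support earns half credit) are illustrative, not the paper's exact scheme:

```python
def nugget_score(nuggets, assignments, partial_credit=0.5):
    """Score one response against a nugget list.

    `nuggets` maps nugget text -> "vital" | "okay"; `assignments` maps
    nugget text -> "support" | "partial_support" (anything missing counts
    as no support), as labeled by an LLM judge upstream.
    """
    credit = {"support": 1.0, "partial_support": partial_credit,
              "no_support": 0.0}
    weight = {"vital": 1.0, "okay": 0.5}  # illustrative weights
    earned = sum(weight[v] * credit[assignments.get(n, "no_support")]
                 for n, v in nuggets.items())
    possible = sum(weight[v] for v in nuggets.values())
    return earned / possible

score = nugget_score(
    {"n1": "vital", "n2": "vital", "n3": "okay"},
    {"n1": "support", "n2": "partial_support"})  # n3 not covered
# earned 1.5 of a possible 2.5
```

Averaging this score over all queries in a topic set yields a system-level number comparable across runs.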

  • Sohrab Rahimi, Director, AI/ML Lead @ Google

    Many companies have started experimenting with simple RAG systems, often as their first use case, to test the effectiveness of generative AI in extracting knowledge from unstructured data like PDFs, text files, and PowerPoint files. If you've built basic RAG architectures with tools like LlamaIndex or LangChain, you might have already encountered three key problems:

    𝟭. 𝗜𝗻𝗮𝗱𝗲𝗾𝘂𝗮𝘁𝗲 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻 𝗠𝗲𝘁𝗿𝗶𝗰𝘀: Existing metrics fail to catch subtle errors like unsupported claims or hallucinations, making it hard to accurately assess and improve system performance.

    𝟮. 𝗗𝗶𝗳𝗳𝗶𝗰𝘂𝗹𝘁𝘆 𝗛𝗮𝗻𝗱𝗹𝗶𝗻𝗴 𝗖𝗼𝗺𝗽𝗹𝗲𝘅 𝗤𝘂𝗲𝘀𝘁𝗶𝗼𝗻𝘀: Standard RAG methods often struggle to find and combine information from multiple sources effectively, leading to slower responses and less relevant results.

    𝟯. 𝗦𝘁𝗿𝘂𝗴𝗴𝗹𝗶𝗻𝗴 𝘁𝗼 𝗨𝗻𝗱𝗲𝗿𝘀𝘁𝗮𝗻𝗱 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗮𝗻𝗱 𝗖𝗼𝗻𝗻𝗲𝗰𝘁𝗶𝗼𝗻𝘀: Basic RAG approaches often miss the deeper relationships between pieces of information, resulting in incomplete or inaccurate answers that don't fully meet user needs.

    In this post I will introduce three useful papers that address these gaps:

    𝟭. 𝗥𝗔𝗚𝗖𝗵𝗲𝗰𝗸𝗲𝗿: Introduces a new framework for evaluating RAG systems with a focus on fine-grained, claim-level metrics. It proposes a comprehensive set: claim-level precision, recall, and F1 score to measure the correctness and completeness of responses; claim recall and context precision to evaluate the effectiveness of the retriever; and faithfulness, noise sensitivity, hallucination rate, self-knowledge reliance, and context utilization to diagnose the generator's performance. Consider using these metrics to identify errors, enhance accuracy, and reduce hallucinations in generated outputs.

    𝟮. 𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝘁𝗥𝗔𝗚: Uses a labeler and filter mechanism to identify and retain only the most relevant parts of retrieved information, reducing the need for repeated large language model calls. This iterative approach refines search queries efficiently, lowering latency and costs while maintaining high accuracy for complex, multi-hop questions.

    𝟯. 𝗚𝗿𝗮𝗽𝗵𝗥𝗔𝗚: By leveraging structured data from knowledge graphs, GraphRAG methods enhance the retrieval process, capturing complex relationships and dependencies between entities that traditional text-based retrieval often misses. This enables more precise, context-aware generation, making it particularly valuable in domains that require a deep understanding of interconnected data, such as scientific research, legal documentation, and complex question answering. In tasks such as query-focused summarization, for example, GraphRAG demonstrates substantial gains by leveraging graph structure to capture local and global relationships within documents.

    It's encouraging to see how quickly gaps are identified and improvements are made in the GenAI world.
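Claim-level precision and recall, in the spirit of the fine-grained metrics above, can be sketched like this. The `entails` callable is an assumption standing in for an NLI model or LLM checker; the toy version below uses exact string membership:

```python
def claim_precision_recall(answer_claims, gold_claims, entails):
    """Claim-level precision/recall over decomposed claims.

    `entails(claim, claim_set)` should return True when the claim is
    supported by the set; in practice this is an NLI model or LLM check.
    """
    correct = [c for c in answer_claims if entails(c, gold_claims)]
    covered = [g for g in gold_claims if entails(g, answer_claims)]
    precision = len(correct) / len(answer_claims) if answer_claims else 0.0
    recall = len(covered) / len(gold_claims) if gold_claims else 0.0
    return precision, recall

# Toy entailment: exact membership stands in for a real checker.
entails = lambda claim, claims: claim in claims
p, r = claim_precision_recall(
    {"Paris is in France", "Paris has 10M residents"},
    {"Paris is in France", "Paris is the capital of France"},
    entails)
# p = 0.5 (one unsupported claim), r = 0.5 (one gold claim not covered)
```

Checking answer claims against the retrieved context instead of the gold answer gives a faithfulness-style score with the same machinery.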

  • Agus Sudjianto, Co-creator of PiML and MoDeVa; SVP Risk & Technology, H2O.ai; former EVP and Head of Wells Fargo MRM

    Building Trustworthy RAG Systems for Regulated Industries

    Ensuring the reliability of Retrieval-Augmented Generation (RAG) systems in regulated industries is a challenging but critical task. High-stakes applications demand rigorous validation and robust safeguards to prevent errors that could lead to significant harm. Here’s how to approach it:

    1. Bounded Inputs and Outputs: RAG systems rely on a controlled set of documents and bounded prompts. By limiting inputs to knowledge-relevant queries, we can restrict outputs to reliable and context-grounded responses.

    2. Comprehensive Testing:
    • Topics and Query Types: Design test cases covering a wide range of topics and query complexities (e.g., factual, reasoning, multi-hop).
    • Metrics: Evaluate key factors like recall, precision, groundedness, and answer relevancy to measure performance and identify gaps.

    3. Transparent Evaluation:
    • Use embedding-based and statistical metrics for groundedness and relevancy. Avoid relying on LLMs as sole evaluators.
    • Complement automated metrics with human-calibrated assessments to ensure nuanced evaluation.

    4. Systemic Weakness Identification: Perform marginal analyses by topic and query type to pinpoint weaknesses and build resilience against rare or challenging scenarios.

    5. Conformal Prediction: Use uncertainty quantification methods like conformal prediction to detect and address high-risk scenarios systematically.

    Link to paper: https://lnkd.in/eMUyx9Wu
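The conformal-prediction idea in point 5 can be sketched with a split-conformal threshold. The choice of nonconformity score (e.g. 1 minus a groundedness score) and the calibration numbers below are illustrative assumptions:

```python
import math

def conformal_threshold(calib_scores, alpha=0.1):
    """Split-conformal threshold on a nonconformity score.

    `calib_scores` are nonconformity scores (e.g. 1 - groundedness) for a
    held-out calibration set of answers known to be acceptable. At test
    time, answers scoring above the threshold are flagged as high risk,
    with roughly (1 - alpha) coverage under exchangeability.
    """
    n = len(calib_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # conformal quantile index
    return sorted(calib_scores)[min(rank, n) - 1]

# Illustrative calibration scores from 10 vetted answers:
calib = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.33, 0.41, 0.55, 0.62]
threshold = conformal_threshold(calib, alpha=0.2)
# route any test-time answer scoring above `threshold` to human review
```

The coverage guarantee holds only if calibration and test queries are exchangeable, which is exactly why the bounded-inputs discipline in point 1 matters.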

  • How do you know your RAG system actually works, and isn’t just sounding smart? Evaluation metrics are the backbone of reliable Retrieval-Augmented Generation (RAG). They help teams move from “it looks correct” to “it is measurably correct.”

    A RAG system has two brains: a Retriever (finds relevant knowledge) and a Generator (uses it to answer). The real questions are:
    ▪️ Did we retrieve the right context?
    ▪️ Did the model stay grounded?
    ▪️ Did it hallucinate?

    Key metrics that matter:

    🔷 Retrieval quality
    • Recall@K – Did we fetch the right information?
    • Precision@K – How much of the retrieved context is actually useful?
    • Context Relevancy – How closely does the context match the question?

    🔷 Generation quality
    • Faithfulness – Are claims supported by retrieved sources (hallucination check)?
    • Groundedness – Is the response truly tied to the provided documents?
    • Answer Relevancy – Does the answer directly address the user’s question?

    In production AI systems, these metrics shift teams from intuition to evidence, reducing hallucinations, improving search quality, and enabling objective comparison across RAG architectures. Tools like Ragas, TruLens, DeepEval, LangChain, and LlamaIndex now make continuous evaluation part of the AI lifecycle, not a one-time exercise.

    Bottom line: if you’re deploying enterprise GenAI, shipping without evaluation is like launching software without tests. #GenerativeAI #RAG #AIEngineering #LLMEvaluation #AIGovernance #MLOps #EnterpriseA
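One common way to compute an embedding-based Context Relevancy score is mean cosine similarity between the question embedding and each retrieved chunk embedding. The plain-list vectors below are a stand-in for real embeddings from any model; this is a sketch, not any particular tool's formula:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def context_relevancy(question_vec, chunk_vecs):
    """Mean question-to-chunk cosine similarity over retrieved chunks."""
    return sum(cosine(question_vec, c) for c in chunk_vecs) / len(chunk_vecs)

# Toy 2-d vectors stand in for real embeddings:
score = context_relevancy([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
# one perfectly aligned chunk and one orthogonal chunk average to 0.5
```

Frameworks like Ragas compute richer variants of this metric, but the mean-similarity version is a useful first diagnostic.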

  • Jason Liu, Applied AI Consultant / Educator

    How to Systematically Improve Your RAG Applications

    After years consulting on applied AI, from recommendation systems to spam detection to generative search, I've realized that simply connecting an LLM to your data is just the first step in building an effective RAG (Retrieval-Augmented Generation) system. The real magic happens when you measure, iterate, and prevent regression. Here's what I've learned.

    Common Pitfalls to Avoid

    **Absence Bias**: Ignoring what you can't see, especially the retrieval step. Everyone focuses on prompt tweaking or model upgrades, but if you're retrieving the wrong content chunks, no LLM upgrade will fix that.

    **Intervention Bias**: The urge to do anything to feel in control, implementing every new prompt trick or fancy architecture without measuring whether it actually helps. This creates unmaintainable systems.

    A Systematic Approach

    1. **Start with Retrieval Metrics**: Measure precision and recall first. If your system can't find relevant information, everything else collapses.
    2. **Use Segmentation**: Break down your data to identify specific failure points. A 70% overall recall might hide that important queries are failing at 5%.
    3. **Implement Structured Extraction**: Parse documents properly; dates, tables, and images all need specialized handling beyond simple text chunks.
    4. **Develop Query Routing**: Create specialized indices and tools for different data types, then build a system to route queries to the right tool.
    5. **Fine-Tune Your Embeddings**: Customize embeddings for your domain using actual query-document pairs from your users.
    6. **Close the Feedback Loop**: Make it easy for users to provide feedback, and feed this data back into your training pipeline.

    The journey doesn't end after implementation. A truly effective RAG system follows a continuous improvement cycle:
    • Ship a minimal version
    • Log user interactions
    • Identify failing segments
    • Add specialized handling
    • Train better embeddings
    • Collect more feedback
    • Repeat

    For a deeper dive into these techniques, check out improvingrag.com, a free guide based on my Maven course. What challenges are you facing with your RAG applications? I'd love to hear about your experiences in the comments.
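The segmentation step above can be made concrete by breaking recall down per query segment; a minimal sketch with an assumed row format of (segment, retrieved ids, relevant ids):

```python
from collections import defaultdict

def recall_by_segment(eval_rows):
    """Break recall down by query segment to expose hidden failures.

    Each row is (segment, retrieved_ids, relevant_ids). A healthy
    aggregate number can hide a segment that is failing badly.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for segment, retrieved, relevant in eval_rows:
        hits[segment] += len(set(retrieved) & set(relevant))
        totals[segment] += len(relevant)
    return {seg: hits[seg] / totals[seg] for seg in totals}

rows = [
    ("pricing", ["a", "b"], ["a", "b"]),  # pricing queries do well
    ("pricing", ["c"], ["c"]),
    ("legal", ["x"], ["y", "z"]),         # legal queries silently fail
]
by_segment = recall_by_segment(rows)  # {"pricing": 1.0, "legal": 0.0}
```

Segments can come from query metadata, a classifier, or clustering; the point is that the 0.0 on "legal" is invisible in the pooled average.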

  • Shivani Virdi, Founder @ NeoSage

    I've spent countless hours building and evaluating AI systems. This is the 3-part evaluation roadmap I wish I had on day one.

    Evaluating an LLM system isn't one task. It's about measuring the performance of each component in the pipeline. You don't just test "the AI"; you test the retrieval, the generation, and the overall agentic workflow.

    𝗣𝗮𝗿𝘁 𝟭: 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗻𝗴 𝗥𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹 (𝗧𝗵𝗲 𝗥𝗔𝗚 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲)
    Your system is only as good as the context it retrieves.
    𝗞𝗲𝘆 𝗠𝗲𝘁𝗿𝗶𝗰𝘀:
    ↳ 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗣𝗿𝗲𝗰𝗶𝘀𝗶𝗼𝗻: How much of the retrieved context is actually relevant vs. noise?
    ↳ 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗥𝗲𝗰𝗮𝗹𝗹: Did you retrieve all the necessary information to answer the query?
    ↳ 𝗡𝗗𝗖𝗚: How high up in the retrieved list are the most relevant documents?
    𝗞𝗲𝘆 𝗥𝗲𝘀𝗼𝘂𝗿𝗰𝗲𝘀:
    ↳ 𝗙𝗿𝗮𝗺𝗲𝘄𝗼𝗿𝗸: RAGAs repo: https://lnkd.in/gAPdCRzh
    ↳ 𝗣𝗮𝗽𝗲𝗿: RAGAs paper: https://lnkd.in/gUKVe4ac

    𝗣𝗮𝗿𝘁 𝟮: 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗻𝗴 𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝗼𝗻 (𝗧𝗵𝗲 𝗟𝗟𝗠'𝘀 𝗥𝗲𝘀𝗽𝗼𝗻𝘀𝗲)
    Once you have the context, how good is the model's actual output?
    𝗞𝗲𝘆 𝗠𝗲𝘁𝗿𝗶𝗰𝘀:
    ↳ 𝗙𝗮𝗶𝘁𝗵𝗳𝘂𝗹𝗻𝗲𝘀𝘀: Does the answer stay grounded in the provided context, or does it start to hallucinate?
    ↳ 𝗥𝗲𝗹𝗲𝘃𝗮𝗻𝗰𝗲: Is the answer directly addressing the user's original prompt?
    ↳ 𝗜𝗻𝘀𝘁𝗿𝘂𝗰𝘁𝗶𝗼𝗻 𝗙𝗼𝗹𝗹𝗼𝘄𝗶𝗻𝗴: Did the model adhere to the output format you requested?
    𝗞𝗲𝘆 𝗥𝗲𝘀𝗼𝘂𝗿𝗰𝗲𝘀:
    ↳ 𝗧𝗲𝗰𝗵𝗻𝗶𝗾𝘂𝗲: LLM-as-Judge paper: https://lnkd.in/gyhaU5CC
    ↳ 𝗙𝗿𝗮𝗺𝗲𝘄𝗼𝗿𝗸𝘀: OpenAI Evals: https://lnkd.in/g9rjmfGS and LangChain Evals: https://lnkd.in/gmJt7ZBa

    𝗣𝗮𝗿𝘁 𝟯: 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗻𝗴 𝘁𝗵𝗲 𝗔𝗴𝗲𝗻𝘁 (𝗧𝗵𝗲 𝗘𝗻𝗱-𝘁𝗼-𝗘𝗻𝗱 𝗦𝘆𝘀𝘁𝗲𝗺)
    Does the system actually accomplish the task from start to finish?
    𝗞𝗲𝘆 𝗠𝗲𝘁𝗿𝗶𝗰𝘀:
    ↳ 𝗧𝗮𝘀𝗸 𝗖𝗼𝗺𝗽𝗹𝗲𝘁𝗶𝗼𝗻 𝗥𝗮𝘁𝗲: Did the agent successfully achieve its final goal? This is your north star.
    ↳ 𝗧𝗼𝗼𝗹 𝗨𝘀𝗮𝗴𝗲 𝗔𝗰𝗰𝘂𝗿𝗮𝗰𝘆: Did it call the correct tools with the correct arguments?
    ↳ 𝗖𝗼𝘀𝘁/𝗟𝗮𝘁𝗲𝗻𝗰𝘆 𝗽𝗲𝗿 𝗧𝗮𝘀𝗸: How many tokens and how much time did it take to complete the task?
    𝗞𝗲𝘆 𝗥𝗲𝘀𝗼𝘂𝗿𝗰𝗲𝘀:
    ↳ 𝗚𝗼𝗼𝗴𝗹𝗲'𝘀 𝗔𝗗𝗞 𝗗𝗼𝗰𝘀: https://lnkd.in/g2TpCWsq
    ↳ 𝗗𝗲𝗲𝗽𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴(.)𝗔𝗜 𝗔𝗴𝗲𝗻𝘁𝘀 𝗘𝘃𝗮𝗹 𝗖𝗼𝘂𝗿𝘀𝗲: https://lnkd.in/gcY8WyjV

    Stop testing your AI like a monolith. Start evaluating the components like a systems engineer. That's how you build systems that you can actually trust. Save this roadmap. What's the hardest part of your current eval pipeline?
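NDCG, mentioned under Part 1 above, can be sketched in a few lines; `relevance` maps document ids to graded relevance judgments:

```python
import math

def ndcg_at_k(retrieved, relevance, k):
    """NDCG@k: discounted gain of the actual ranking vs. the ideal one.

    `relevance` maps doc id -> graded relevance (missing ids count as 0).
    """
    def dcg(docs):
        return sum(relevance.get(d, 0) / math.log2(i + 2)
                   for i, d in enumerate(docs[:k]))
    ideal = sorted(relevance, key=relevance.get, reverse=True)
    best = dcg(ideal)
    return dcg(retrieved) / best if best else 0.0

relevance = {"d1": 3, "d2": 2, "d3": 1}              # graded judgments
ndcg_at_k(["d1", "d2", "d3"], relevance, 3)          # perfect order -> 1.0
score = ndcg_at_k(["d3", "d1", "d2"], relevance, 3)  # imperfect order < 1.0
```

The log2 discount means mistakes near the top of the list cost far more than mistakes near position k, which is what makes NDCG a ranking metric rather than a set metric.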

  • Reyhan Merekar, AI Engineer @ Accenture

    Building Retrieval-Augmented Generation (RAG) systems is one thing, but ensuring they deliver accurate, relevant, and reliable information is where the real challenge lies. Four key metrics for assessing RAG performance are:

    • Context Precision: A high precision score means fewer irrelevant results, boosting response quality.
    • Context Recall: High recall ensures no critical information is left behind.
    • Faithfulness: Helps reduce hallucinations and maintain factual correctness.
    • Answer Relevance: Ensures users get exactly what they need without unnecessary information.

    By focusing on these metrics, developers can build RAG systems that are not only powerful but also reliable and user-focused.
