In the last three months alone, over ten papers outlining novel prompting techniques were published, boosting LLMs’ performance by a substantial margin.

Two weeks ago, a groundbreaking paper from Microsoft demonstrated how a well-prompted GPT-4 outperforms Google’s Med-PaLM 2, a specialized medical model, solely through sophisticated prompting techniques.

Yet, while our X and LinkedIn feeds buzz with ‘secret prompting tips’, a definitive, research-backed guide aggregating these advanced prompting strategies is hard to come by. This gap prevents LLM developers and everyday users from harnessing these novel frameworks to enhance performance and achieve more accurate results. https://lnkd.in/g7_6eP6y

In this AI Tidbits Deep Dive, I outline six of the best recent prompting methods:

(1) EmotionPrompt - inspired by human psychology, this method uses emotional stimuli in prompts to gain performance enhancements

(2) Optimization by PROmpting (OPRO) - a DeepMind innovation that refines prompts automatically, surpassing human-crafted ones. This paper discovered the “Take a deep breath” instruction, which improved LLMs’ performance by 9%.

(3) Chain-of-Verification (CoVe) - Meta's novel four-step prompting process that drastically reduces hallucinations and improves factual accuracy

(4) System 2 Attention (S2A) - also from Meta, a prompting method that filters out irrelevant details before querying the LLM

(5) Step-Back Prompting - encouraging LLMs to abstract queries for enhanced reasoning

(6) Rephrase and Respond (RaR) - UCLA's method that lets LLMs rephrase queries for better comprehension and response accuracy

Understanding the spectrum of available prompting strategies, and how to apply them in your app, can mean the difference between a production-ready app and a nascent project with untapped potential.

Full blog post: https://lnkd.in/g7_6eP6y
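To give a sense of how lightweight several of these methods are, here is a toy Python sketch of three of them expressed as plain prompt templates. The exact stimulus and instruction wording below is illustrative, not the verbatim text from the papers:

```python
def emotion_prompt(question):
    """EmotionPrompt: append an emotional stimulus to the query."""
    return f"{question}\nThis is very important to my career."

def opro_prompt(question):
    """Prepend the OPRO-discovered 'Take a deep breath' instruction."""
    return f"Take a deep breath and work on this problem step-by-step.\n{question}"

def step_back_prompts(question):
    """Step-Back: first elicit the underlying principle, then answer using it."""
    abstraction = f"What general principle underlies this question?\n{question}"
    final = f"Using that principle, answer the original question:\n{question}"
    return abstraction, final
```

Step-Back is a two-call pattern: you send the abstraction prompt first, then feed its answer back in with the second prompt.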
How to Improve LLM Accuracy
Explore top LinkedIn content from expert professionals.
Summary
Improving large language model (LLM) accuracy means making sure these AI programs give more reliable and truthful answers. Because LLMs generate responses based on patterns in data, careful methods are needed to help them avoid mistakes or “hallucinations.”
- Ground with real data: Connect your LLM to trustworthy sources, like databases or long-term memory systems, so answers are anchored in facts and relevant context.
- Refine your prompts: Experiment with clear, tested prompts and use methods like few-shot prompting or meta-prompting to help the model better understand what you want.
- Verify and cross-check: Ask the model to support answers with evidence, cite sources, or even have multiple models review each other's responses to catch errors and inconsistencies.
Stop asking LLMs to "check for accuracy." >> Make the models work instead.

There are ways to improve the accuracy of chatbot answers. Instead of accepting the model's initial output, you can force it to reevaluate its work in meaningful ways. You can get to truth by forcing your LLM to transform, not give a wink and a nod to the answer it already generated. Have it reprocess your draft. And provide evidence.

Some sweet tactics you can try:
🔹 Rebuild: "Recreate this answer from fresh sources only. Return what changed."
🔹 Cite everything: "Attach a source and short quote after every claim."
🔹 Diff it: "Compare the rebuild to the original. List conflicts and missing pieces."
🔹 Justify: "For each bullet, add ‘Because: [evidence] >> [claim]’."
🔹 Expand: "Add 1 example, 1 edge case, 1 failure mode for each item."
🔹 Pros and cons: "Give tradeoffs for each. Note who benefits and who loses."
🔹 Disprove: "Try to falsify each point. Provide counterexamples."
🔹 Contradiction scan: "Find claims that conflict with each other."
🔹 Freshness check: "Verify dates, versions, and timelines. Flag anything stale."
🔹 Triangulate: "Give 3 independent passes, then merge them with a rationale."
🔹 Referee mode: "Score another LLM’s output with a rubric and evidence."

Try using multiple LLMs to cross-check each other.

Bottom line: don’t ask "Accurate?" Make the model work.
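The Rebuild, Cite, and Diff tactics chain together naturally. Here is a minimal sketch of that chain around a generic `llm` callable; `llm` is a stand-in for whatever chat-completion function you use, and the demo at the bottom uses a dummy lambda instead of a real model:

```python
REBUILD = "Recreate this answer from fresh sources only. Return what changed."
CITE = "Attach a source and short quote after every claim."
DIFF = "Compare the rebuild to the original. List conflicts and missing pieces."

def verify(llm, question):
    """Answer, rebuild with citations, then diff the two passes."""
    original = llm(question)
    rebuild = llm(f"{question}\n\n{REBUILD}\n{CITE}")
    diff = llm(f"Original:\n{original}\n\nRebuild:\n{rebuild}\n\n{DIFF}")
    return {"original": original, "rebuild": rebuild, "diff": diff}

# Demo with a dummy model that just echoes part of its prompt:
result = verify(lambda p: f"[model output for: {p[:24]}...]", "Who invented TCP/IP?")
```

To cross-check with multiple LLMs, pass a different `llm` callable for the diff step than for the original answer.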
-
Researchers from Meta have developed a "Self-Taught Evaluator" that can significantly improve the accuracy of large language models (LLMs) in judging the quality of AI-generated responses—without using any human-labeled data!

So how do you create a Self-Taught Evaluator without using human-labeled data?

1. Initialization
- Start with a large set of human-written user instructions (e.g., from production systems).
- Select an initial seed large language model (LLM).

2. Instruction Selection
- Categorize and filter the instructions using an LLM to select a balanced, challenging subset.
- Focus on categories like reasoning, coding, etc.

3. Response Pair Construction: for each selected instruction:
- Generate a high-quality baseline response using the seed LLM.
- Create a "noisy" version of the original instruction.
- Generate a response to the noisy instruction, which will serve as a lower-quality response.
- This creates synthetic preference pairs without human labeling.

4. Judgment Annotation
- Use the current LLM-as-a-Judge model to generate multiple reasoning traces and judgments for each example.
- Apply rejection sampling to keep only correct judgments.
- If no correct judgment is found, discard the example.

5. Model Fine-tuning
- Fine-tune the seed LLM on the collected synthetic judgments.
- This creates an improved LLM-as-a-Judge model.

6. Iterative Improvement
- Repeat steps 4-5 multiple times, using the improved model from each iteration.
- As the model improves, it should generate more correct judgments, creating a curriculum effect.

7. Evaluation
- Test the final model on benchmarks like RewardBench, MT-Bench, etc.
- Optionally, use majority voting at inference time for improved performance.

This approach allows the creation of a strong evaluator model without relying on costly human-labeled preference data, while still achieving competitive performance compared to models trained on human annotations.

What are your thoughts on self-taught AI evaluators?
How might this impact the future of AI development?
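Steps 3 and 4 above (synthetic pair construction and rejection-sampled annotation) can be sketched in a few lines. `generate`, `corrupt`, and `judge` are hypothetical stand-ins for the seed LLM, the instruction "noising" step, and the LLM-as-a-Judge call; the demo uses toy lambdas:

```python
def build_preference_pairs(instructions, generate, corrupt):
    """Step 3: baseline response vs. response to a noisy instruction."""
    pairs = []
    for x in instructions:
        good = generate(x)           # high-quality baseline response
        bad = generate(corrupt(x))   # lower-quality response, via the noisy instruction
        pairs.append((x, good, bad))
    return pairs

def annotate(pairs, judge, n_samples=4):
    """Step 4: sample judgments, keep only those preferring the baseline."""
    kept = []
    for x, good, bad in pairs:
        for _ in range(n_samples):
            verdict = judge(x, good, bad)  # "A" means it prefers `good`
            if verdict == "A":
                kept.append((x, good, bad, verdict))
                break
        # examples with no correct judgment are discarded
    return kept

# Toy demo with stand-in callables:
pairs = build_preference_pairs(
    ["Summarize X", "Explain Y"],
    generate=lambda x: f"response to: {x}",
    corrupt=lambda x: x + " (garbled)",
)
kept = annotate(pairs, judge=lambda x, good, bad: "A")
```

The kept judgments are what step 5 fine-tunes on, and the loop in step 6 repeats this with the improved judge.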
-
Are your LLM apps still hallucinating? Zep used to as well—a lot. Here’s how we worked to solve Zep's hallucinations.

We've spent a lot of cycles diving into why LLMs hallucinate and experimenting with the most effective techniques to prevent it. Some might sound familiar, but it's the combined approach that really moves the needle.

First, why do hallucinations happen? A few core reasons:
🔍 LLMs rely on statistical patterns, not true understanding.
🎲 Responses are based on probabilities, not verified facts.
🤔 No innate ability to differentiate truth from plausible fiction.
📚 Training datasets often include biases, outdated info, or errors.

Put simply: LLMs predict the next likely word—they don’t actually "understand" or verify what's accurate. When prompted beyond their knowledge, they creatively fill gaps with plausible (but incorrect) info. ⚠️ Funny if you’re casually chatting—problematic if you're building enterprise apps.

So, how do you reduce hallucinations effectively? The #1 technique: grounding the LLM in data.
- Use Retrieval-Augmented Generation (RAG) to anchor responses in verified data.
- Use long-term memory systems like Zep to ensure the model is always grounded in personalization data: user context, preferences, traits, etc.
- Fine-tune models on domain-specific datasets to improve response consistency and style, although fine-tuning alone typically doesn't add substantial new factual knowledge.
- Use explicit, clear prompting—avoid ambiguity or unnecessary complexity.
- Encourage models to self-verify conclusions when accuracy is essential.
- Structure complex tasks with chain-of-thought (CoT) prompting to improve outputs, or force "none"/unknown responses when necessary.
- Strategically tweak model parameters (e.g., temperature, top-p) to limit overly creative outputs.
- Add post-processing verification for mission-critical outputs, for example, matching to known business states.

One technique alone rarely solves hallucinations.
For maximum ROI, we've found combining RAG with a robust long-term memory solution (like ours at Zep) is the sweet spot. Systems that ground responses in factual, evolving knowledge significantly outperform. Did I miss any good techniques? What are you doing in your apps?
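The core grounding move — retrieve relevant chunks, stuff them into the prompt, and give the model an explicit "unknown" escape hatch — fits in a few lines. This sketch uses naive keyword-overlap scoring purely for illustration; a real system would use a vector index, and all names here are made up:

```python
def score(query, chunk):
    """Toy relevance score: shared lowercase words between query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def build_grounded_prompt(query, chunks, k=2):
    """Select the top-k chunks and anchor the model to them."""
    top = sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]
    context = "\n".join(f"- {c}" for c in top)
    return (
        "Answer using ONLY the context below. "
        "If the context is insufficient, reply 'unknown'.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

chunks = [
    "Paris is the capital of France.",
    "Berlin is in Germany.",
]
prompt = build_grounded_prompt("What is the capital of France?", chunks, k=1)
```

The "reply 'unknown'" instruction is the cheap version of forcing a "none" response when the retrieved context doesn't cover the question.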
-
Prompt optimization is becoming foundational for anyone building reliable AI agents.

Hardcoding prompts and hoping for the best doesn’t scale. To get consistent outputs from LLMs, prompts need to be tested, evaluated, and improved—just like any other component of your system.

This visual breakdown covers four practical techniques to help you do just that:

🔹 Few-Shot Prompting
Labeled examples embedded directly in the prompt help models generalize—especially for edge cases. It's a fast way to guide outputs without fine-tuning.

🔹 Meta Prompting
Prompt the model to improve or rewrite prompts. This self-reflective approach often leads to more robust instructions, especially in chained or agent-based setups.

🔹 Gradient Prompt Optimization
Embed prompt variants, calculate loss against expected responses, and backpropagate to refine the prompt. A data-driven way to optimize performance at scale.

🔹 Prompt Optimization Libraries
Tools like DSPy, AutoPrompt, PEFT, and PromptWizard automate parts of the loop—from bootstrapping to eval-based refinement.

Prompts should evolve alongside your agents. These techniques help you build feedback loops that scale, adapt, and close the gap between intention and output.
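All four techniques share one primitive: score candidate prompts against labeled examples and keep the winner. Here is that eval loop in miniature (the `dummy_model` and the candidate prompts are invented for the demo; a real setup would call your LLM and use a library like DSPy for the candidate-generation side):

```python
def evaluate(model, prompt, examples):
    """Fraction of (input, expected) pairs the prompted model gets right."""
    hits = sum(model(prompt + "\n" + x).strip() == y for x, y in examples)
    return hits / len(examples)

def select_best(model, candidates, examples):
    """Keep the highest-scoring candidate prompt."""
    return max(candidates, key=lambda p: evaluate(model, p, examples))

# Dummy model: only answers correctly when told to think step-by-step.
def dummy_model(prompt):
    question = prompt.splitlines()[-1]
    if "step-by-step" not in prompt:
        return "?"
    return {"2+2": "4", "3+3": "6"}.get(question, "?")

examples = [("2+2", "4"), ("3+3", "6")]
candidates = [
    "Answer tersely.",
    "Think step-by-step, then reply with just the number.",
]
best = select_best(dummy_model, candidates, examples)
```

Meta prompting plugs into the same loop by having an LLM generate the `candidates` list from the current best prompt plus its failure cases.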
-
Few Lessons from Deploying and Using LLMs in Production

Deploying LLMs can feel like hiring a hyperactive genius intern—they dazzle users while potentially draining your API budget. Here are some insights I’ve gathered:

1. “Cheap” Is a Lie You Tell Yourself: Cloud costs per call may seem low, but the overall expense of an LLM-based system can skyrocket. Fixes:
- Cache repetitive queries: users ask the same thing at least 100x/day.
- Gatekeep: use cheap classifiers (BERT) to filter “easy” requests. Let LLMs handle only the complex 10% and your current systems handle the remaining 90%.
- Quantize your models: shrink LLMs to run on cheaper hardware without massive accuracy drops.
- Asynchronously build your caches: pre-generate common responses before they’re requested, or gracefully fail the first time a query comes in and cache it for the next time.

2. Guard Against Model Hallucinations: Sometimes, models express answers with such confidence that distinguishing fact from fiction becomes challenging, even for human reviewers. Fixes:
- Use RAG: just a fancy way of saying to provide your model the knowledge it requires in the prompt itself, by querying some database based on semantic matches with the query.
- Guardrails: validate outputs using regex or cross-encoders to establish a clear decision boundary between the query and the LLM’s response.

3. The Best LLM Is Often a Discriminative Model: You don’t always need a full LLM. Consider knowledge distillation: use a large LLM to label your data and then train a smaller, discriminative model that performs similarly at a much lower cost.

4. It's Not About the Model, It's About the Data It Was Trained On: A smaller LLM might struggle with specialized domain data—that’s normal. Fine-tune your model on your specific dataset by starting with parameter-efficient methods (like LoRA or Adapters) and using synthetic data generation to bootstrap training.

5. Prompts Are the New Features: Treat prompts like features in your system. Version them, run A/B tests, and continuously refine using online experiments. Consider bandit algorithms to automatically promote the best-performing variants.

What do you think? Have I missed anything? I’d love to hear your “I survived LLM prod” stories in the comments!
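The cache + gatekeep fixes from lesson 1 compose into a tiny routing layer. A minimal sketch, where `is_easy` is a toy word-count proxy standing in for a cheap trained classifier (the post suggests BERT), and the two path strings stand in for your existing system and the LLM call:

```python
from functools import lru_cache

def is_easy(query):
    """Toy stand-in for a cheap classifier that flags 'easy' requests."""
    return len(query.split()) < 6

@lru_cache(maxsize=10_000)  # repeated queries hit the cache, not the API
def answer(query):
    if is_easy(query):
        return f"[cheap path] {query}"   # existing system / small model
    return f"[LLM path] {query}"         # expensive LLM, complex ~10% only
```

In production the cache key would need normalization (casing, whitespace, maybe embeddings) so near-duplicate queries actually collide.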
-
Can we fine-tune our LLM and retriever together to improve RAG performance? This paper proposes a technique to do exactly that!

RAG Basics: When you prompt an LLM, RAG supplies relevant documents. A separate retrieval model computes the probability of each text chunk being relevant and provides the top chunks to the LLM. The LLM generates tokens based on the chunks, prompt, and previous tokens.

In Short: LLMs aren't exposed to retrieval-augmented inputs during pretraining, limiting their ability to use retrieved text effectively. Fine-tuning the LLM and retrieval model together can improve performance without requiring extensive data processing, enabling better retrieval-augmented generation.

How it Works: Authors from Meta fine-tuned Llama 2 (65B parameters) and DRAGON+, a retriever, to create RA-DIT 65B. They fine-tuned Llama 2 on prompts with retrieved text and questions, and fine-tuned DRAGON+ to retrieve more relevant chunks. Fine-tuning was supervised for tasks like question-answering and self-supervised for text chunk completion.

Results: RA-DIT 65B achieved 49.1% accuracy on average across four question datasets, outperforming LLaMA 2 65B with DRAGON+ (45.1%) and LLaMA 2 65B alone (32.9%). With five example inputs, RA-DIT 65B reached 51.8% accuracy.

Details: RA-DIT fine-tunes Llama 2 and DRAGON+ to work together effectively, leveraging the strengths of both models to generate better output. By fine-tuning the LLM to better use retrieved knowledge and the retrieval model to select more relevant text, RA-DIT achieves improved performance without requiring extensive data processing.

RA-DIT offers an efficient way to enhance LLM performance with RAG, making it a valuable technique for developers. https://lnkd.in/gf4fGVkC
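One common way to formalize the retriever/LLM interplay from the "RAG Basics" paragraph is a retrieval-weighted mixture: the retriever assigns each chunk a relevance probability, and the output probability is averaged over chunks under those weights. This is a generic RAG formulation for illustration, not necessarily RA-DIT's exact inference objective, and the numbers are made up:

```python
def rag_likelihood(p_retrieve, p_lm_given_chunk):
    """p(y|x) = sum over chunks c of p_R(c|x) * p_LM(y|x, c)."""
    return sum(p_retrieve[c] * p_lm_given_chunk[c] for c in p_retrieve)

p_r = {"chunk_a": 0.7, "chunk_b": 0.3}   # retriever's relevance probabilities
p_lm = {"chunk_a": 0.9, "chunk_b": 0.2}  # LLM's answer probability given each chunk
likelihood = rag_likelihood(p_r, p_lm)
```

Joint fine-tuning improves both factors at once: the LLM term (use retrieved text better) and the retriever term (put more weight on relevant chunks).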
-
Despite the impressive capabilities of LLMs, developers still face challenges in getting the most out of these systems. LLMs often need a lot of fine-tuning and prompt adjustments to produce the best results. First, LLMs currently lack the ability to refine and improve their own responses autonomously, and second, they have limited research capabilities. It would be highly beneficial if LLMs could conduct their own research, equipped with a powerful search engine to access and integrate a broader range of resources.

In the past couple of weeks, several studies have taken on these challenges:

1. Recursive Introspection (RISE): RISE introduces a novel fine-tuning approach where LLMs are trained to introspect and correct their responses iteratively. By framing the process as a multi-turn Markov decision process (MDP) and employing strategies from online imitation learning and reinforcement learning, RISE has shown significant performance improvements: it enhanced LLaMa3-8B's performance by 8.2% and Mistral-7B's by 6.6% on specific reasoning tasks.

2. Self-Reasoning Framework: This framework enhances the reliability and traceability of retrieval-augmented language models (RALMs) by introducing a three-stage self-reasoning process, encompassing relevance-aware processing, evidence-aware selective processing, and trajectory analysis. Evaluations across multiple datasets demonstrated that this framework outperforms existing state-of-the-art models, achieving 83.9% accuracy on the FEVER fact verification dataset and improving the model's ability to evaluate the necessity of external knowledge augmentation.

3. Meta-Rewarding with LLM-as-a-Meta-Judge: The Meta-Rewarding approach incorporates a meta-judge role into the LLM’s self-rewarding mechanism, allowing the model to critique its judgments as well as evaluate its responses. This self-supervised approach mitigates rapid saturation in self-improvement processes, as evidenced by an 8.5% improvement in the length-controlled win rate for models like LLaMa2-7B over multiple iterations, surpassing traditional self-rewarding methods.

4. Multi-Agent Framework for Complex Queries: This framework mimics human cognitive processes by decomposing complex queries into sub-tasks using dynamic graph construction. It employs multiple agents—WebPlanner and WebSearcher—that work in parallel to retrieve and integrate information from large-scale web sources. This approach led to significant improvements in response quality compared to existing solutions like ChatGPT-Web and Perplexity.ai.

The combination of these four studies would create a highly powerful system: it would self-improve through recursive introspection, continuously refining its responses; accurately assess its performance and learn from evaluations to prevent saturation; and efficiently acquire additional information as needed through dynamic and strategic search planning.

How do you think a system with these capabilities would reshape the future?
-
🤔 What if, instead of using prompts, you could fine-tune LLMs to incorporate self-feedback and improvement mechanisms more effectively?

Self-feedback and improvement have been shown to be highly beneficial for LLMs and agents, allowing them to reflect on their behavior and reasoning and correct their mistakes as more computational resources or interactions become available. The authors note that frequently used test-time methods for self-improvement, like prompt tuning and few-shot learning, often fail to enable models to correct their mistakes in complex reasoning tasks.

⛳ The paper introduces RISE: Recursive Introspection, an approach to improve LLMs by teaching them how to introspect and improve their responses iteratively.
⛳ RISE leverages principles from online imitation learning and reinforcement learning to develop a self-improvement mechanism within LLMs. By treating each prompt as part of a multi-turn Markov decision process (MDP), RISE allows models to learn from their previous attempts and refine their answers over multiple turns, ultimately improving their problem-solving capabilities.
⛳ It models the fine-tuning process as a multi-turn Markov decision process, where the initial state is the prompt and subsequent states involve recursive improvements.
⛳ It employs a reward-weighted regression (RWR) objective to learn from both high- and low-quality rollouts, enabling models to improve over turns. The approach uses data generated by the learner itself, or by more capable models, to supervise improvements iteratively.

RISE significantly improves the performance of LLMs like LLaMa2, LLaMa3, and Mistral on math reasoning tasks, outperforming single-turn strategies with the same computational resources.

Link: https://lnkd.in/e2JDQr8M
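The key property of the RWR objective mentioned above is that every rollout contributes to the imitation loss, but with an exponential weight in its reward, so high-reward attempts dominate while low-reward ones still inform training. A minimal sketch of that weighting (the temperature `beta` and reward values are illustrative, not the paper's hyperparameters):

```python
import math

def rwr_weights(rewards, beta=1.0):
    """Normalized exponential weights: w_i proportional to exp(r_i / beta)."""
    w = [math.exp(r / beta) for r in rewards]
    total = sum(w)
    return [x / total for x in w]

# Three rollouts of differing quality: the best one gets the largest weight,
# but the worst still contributes a little.
weights = rwr_weights([1.0, 0.0, -1.0])
```

In the full method these weights multiply the per-rollout log-likelihood terms in the fine-tuning loss across turns of the MDP.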
-
LLMs are powerful—but can you trust them?

Every AI founder knows the real problem isn’t just building large language models… it’s making sure they don’t lie with confidence.

Hallucinations are the Achilles' heel of modern LLMs. They generate plausible-sounding but completely fabricated information, which is a problem when you’re deploying AI in medicine, finance, security, or any high-stakes field. And the worst part? The more advanced these models get, the better they get at making up convincing nonsense.

At Future AGI, we tackle this head-on.

1️⃣ Garbage in, garbage out. Bad training data? Not on our watch. We leverage high-quality synthetic data and annotated production data—a gold mine for training more reliable models.
2️⃣ Instead of relying on generic uncertainty estimation, we build business-specific evaluation systems to fine-tune prompt behavior in real-world applications.
3️⃣ External validation is non-negotiable. Knowledge bases, retrieval-augmented generation (RAG), and multi-modal evaluations help ground responses in reality.
4️⃣ We run scientific experimentation on prompt outputs to understand what works, what doesn’t, and how to systematically reduce hallucinations without unnecessary compute overhead.
5️⃣ We don’t just tweak prompts—we auto-tune them. Our "Optimize" feature dynamically adjusts prompts for better accuracy, while the "Experiment" feature lets users test variations across different LLMs and datasets, selecting the best-performing one.

How is your team handling hallucinations in AI? My comments are open.