Best Practices for LLM Task Design

Explore top LinkedIn content from expert professionals.

Summary

Best practices for LLM task design focus on structuring prompts and processes to guide large language models (LLMs) toward producing high-quality, reliable outputs for a variety of tasks. These methods help reduce unpredictability and improve both the accuracy and consistency of AI-generated responses across different use cases.

  • Test prompt variations: Experiment with different prompt phrasings and document which versions consistently deliver the results you want.
  • Use clear examples: Include sample inputs and desired outputs within your prompts to show the model exactly what kind of response to generate.
  • Start simple and measure: Begin with basic prompts or rule-based approaches, and only increase complexity when necessary while regularly checking accuracy and maintaining human review.
Summarized by AI based on LinkedIn member posts
  • View profile for Ross Dawson
    Ross Dawson Ross Dawson is an Influencer

    Futurist | Board advisor | Global keynote speaker | Founder: AHT Group - Informivity - Bondi Innovation | Humans + AI Leader | Bestselling author | Podcaster | LinkedIn Top Voice

    35,400 followers

    Small variations in prompts can lead to very different LLM responses. Research that measures LLM prompt sensitivity uncovers what matters, and the strategies to get the best outcomes. A new framework for prompt sensitivity, ProSA, shows that response robustness increases with factors including higher model confidence, few-shot examples, and larger model size. Some strategies you should consider given these findings: 💡 Understand Prompt Sensitivity and Test Variability: LLMs can produce different responses with minor rephrasings of the same prompt. Testing multiple prompt versions is essential, as even small wording adjustments can significantly impact the outcome. Organizations may benefit from creating a library of proven prompts, noting which styles perform best for different types of queries. 🧩 Integrate Few-Shot Examples for Consistency: Including few-shot examples (demonstrative samples within prompts) enhances the stability of responses, especially in larger models. For complex or high-priority tasks, adding a few-shot structure can reduce prompt sensitivity. Standardizing few-shot examples in key prompts across the organization helps ensure consistent output. 🧠 Match Prompt Style to Task Complexity: Different tasks benefit from different prompt strategies. Knowledge-based tasks like basic Q&A are generally less sensitive to prompt variations than complex, reasoning-heavy tasks, such as coding or creative requests. For these complex tasks, using structured, example-rich prompts can improve response reliability. 📈 Use Decoding Confidence as a Quality Check: High decoding confidence—the model’s level of certainty in its responses—indicates robustness against prompt variations. Organizations can track confidence scores to flag low-confidence responses and identify prompts that might need adjustment, enhancing the overall quality of outputs. 📜 Standardize Prompt Templates for Reliability: Simple, standardized templates reduce prompt sensitivity across users and tasks. For frequent or critical applications, well-designed, straightforward prompt templates minimize variability in responses. Organizations should consider a “best-practices” prompt set that can be shared across teams to ensure reliable outcomes. 🔄 Regularly Review and Optimize Prompts: As LLMs evolve, so may prompt performance. Routine prompt evaluations help organizations adapt to model changes and maintain high-quality, reliable responses over time. Regularly revisiting and refining key prompts ensures they stay aligned with the latest LLM behavior. Link to paper in comments.

  • View profile for Rishab Kumar

    Staff DevRel at Twilio | GitHub Star | GDE | AWS Community Builder

    22,566 followers

    I recently went through the Prompt Engineering guide by Lee Boonstra from Google, and it offers valuable, practical insights. It confirms that getting the best results from LLMs is an iterative engineering process, not just casual conversation. Here are some key takeaways I found particularly impactful: 1. 𝐈𝐭'𝐬 𝐌𝐨𝐫𝐞 𝐓𝐡𝐚𝐧 𝐉𝐮𝐬𝐭 𝐖𝐨𝐫𝐝𝐬: Effective prompting goes beyond the text input. Configuring model parameters like Temperature (for creativity vs. determinism), Top-K/Top-P (for sampling control), and Output Length is crucial for tailoring the response to your specific needs. 2. 𝐆𝐮𝐢𝐝𝐚𝐧𝐜𝐞 𝐓𝐡𝐫𝐨𝐮𝐠𝐡 𝐄𝐱𝐚𝐦𝐩𝐥𝐞𝐬: Zero-shot, One-shot, and Few-shot prompting aren't just academic terms. Providing clear examples within your prompt is one of the most powerful ways to guide the LLM on desired output format, style, and structure, especially for tasks like classification or structured data generation (e.g., JSON). 3. 𝐔𝐧𝐥𝐨𝐜𝐤𝐢𝐧𝐠 𝐑𝐞𝐚𝐬𝐨𝐧𝐢𝐧𝐠: Techniques like Chain of Thought (CoT) prompting – asking the model to 'think step-by-step' – significantly improve performance on complex tasks requiring reasoning (logic, math). Similarly, Step-back prompting (considering general principles first) enhances robustness. 4. 𝐂𝐨𝐧𝐭𝐞𝐱𝐭 𝐚𝐧𝐝 𝐑𝐨𝐥𝐞𝐬 𝐌𝐚𝐭𝐭𝐞𝐫: Explicitly defining the System's overall purpose, providing relevant Context, or assigning a specific Role (e.g., "Act as a senior software architect reviewing this code") dramatically shapes the relevance and tone of the output. 5. 𝐏𝐨𝐰𝐞𝐫𝐟𝐮𝐥 𝐟𝐨𝐫 𝐂𝐨𝐝𝐞: The guide highlights practical applications for developers, including generating code snippets, explaining complex codebases, translating between languages, and even debugging/reviewing code – potential productivity boosters. 6. 𝐁𝐞𝐬𝐭 𝐏𝐫𝐚𝐜𝐭𝐢𝐜𝐞𝐬 𝐚𝐫𝐞 𝐊𝐞𝐲: Specificity: Clearly define the desired output. Ambiguity leads to generic results. Instructions > Constraints: Focus on telling the model what to do rather than just what not to do. Iteration & Documentation: This is critical. Documenting prompt versions, configurations, and outcomes (using a structured template, like the one suggested) is essential for learning, debugging, and reproducing results. Understanding these techniques allows us to move beyond basic interactions and truly leverage the power of LLMs. What are your go-to prompt engineering techniques or best practices? Let's discuss! #PromptEngineering #AI #LLM

  • An important lesson from working with hundreds of customers on LLM deployments: there's a **big difference** in how to evaluate and fine-tune language models based on whether your task has **one right answer** or **many**. Let me explain why this matters. Tasks with one correct answer (let's call them "deterministic") include things like classification, structured extraction, and Copilot flows that produce a single action. These are cases where you can quickly check if an output is objectively correct. In contrast, "freeform" tasks have infinitely many valid outputs - think summaries, email drafts, and chatbots. Here, correctness is more subjective, with no single "right" answer. Looking at 1,000 recent datasets on OpenPipe: ~63% were freeform ~37% deterministic. Interestingly though, among the highest-volume tasks, 60% were deterministic - likely because machine-consumed outputs tend to run at higher volume. This distinction drives three key differences in implementation: 1️⃣ Deterministic tasks usually need temperature=0 for consistent, correct outputs. Freeform tasks benefit from higher temperatures (0.7-1.0) to enable creativity and variety. 2️⃣ evaluation approaches differ. Deterministic tasks can use "golden datasets" with known-correct outputs. Freeform tasks often need vibe checks, LLM-as-judge approaches, or direct user feedback. 3️⃣ fine-tuning strategies diverge. For deterministic tasks, Reinforcement Fine-Tuning (RFT) shows promise when correctness is verifiable. For freeform tasks, preference-based methods like DPO or RLHF work better for guiding style and tone. Some practical tips for deterministic tasks: - Consider smaller, specialized models for classification/extraction - Use logprobs to measure classification confidence - You can often reduce costs significantly by going small without losing accuracy For freeform tasks: - Use DPO to train on pairs of good/bad outputs - Consider RLHF to optimize for real user feedback or business metrics - Focus on measuring and improving subjective quality The key is matching your approach to your use case. Don't automatically reach for the largest, most expensive model - sometimes a smaller, more focused solution works better! Lots more details and examples in my post here: https://lnkd.in/gFWdA7kr

  • View profile for Hashim Rehman

    Co-Founder & CTO @ Entropy (YC S24)

    6,167 followers

    Most companies overcomplicate AI implementation. I see teams making the same mistakes: jumping to complex AI solutions (agents, toolchains, orchestration) when all they need is a simple prompt. This creates bloated systems, wastes time, and becomes a maintenance nightmare. While everyone's discussing Model Context Protocol, I've been exploring another MCP: the Minimum Complexity Protocol. The framework forces teams to start simple and only escalate when necessary: Level 1: Non-LLM Solution → Would a boolean, logic or rule based system solve the problem more efficiently? Level 2: Single LLM Prompt → Start with a single, straightforward prompt to a general purpose model. Experiment with different models - some are better with particular tasks. Level 3: Preprocess Data → Preprocess your inputs. Split long documents, simplify payloads. Level 4: Divide & Conquer → Break complex tasks into multiple focused prompts where each handles one specific aspect. LLMs are usually better at handling a specific task at a time. Level 5: Few Shot Prompting → Add few-shot examples within your prompt to guide the model toward better outputs. A small number of examples can greatly increase accuracy. Level 6: Prompt Chaining → Connect multiple prompts in a predetermined sequence. The output of one prompt becomes the input for the next. Level 7: Resource Injection → Implement RAG to connect your model to relevant external knowledge bases such as APIs, databases and vector stores. Level 8: Fine Tuning → Fine tune existing models on your domain specific data when other techniques are no longer effective. Level 9 (Optional): Build Your Own Model → All else fails? Develop custom models when the business case strongly justifies the investment. Level 10: Agentic Tool Selection → LLMs determine which tools or processes to execute for a given job. The tools can recursively utilise more LLMs while accessing and updating resources. Human oversight is still recommended here. Level 11: Full Agency → Allow agents to make decisions, call tools, and access resources independently. Agents self-evaluate accuracy and iteratively operate until the goal is completed. At each level, measure accuracy via evals and establish human review protocols. The secret to successful AI implementation isn't using the most advanced technique. It's using the simplest solution that delivers the highest accuracy with the least effort. What's your experience? Are you seeing teams overcomplicate their AI implementations?

  • View profile for Andrew Ng
    Andrew Ng Andrew Ng is an Influencer

    DeepLearning.AI, AI Fund and AI Aspire

    2,444,465 followers

    Last week, I described four design patterns for AI agentic workflows that I believe will drive significant progress: Reflection, Tool use, Planning and Multi-agent collaboration. Instead of having an LLM generate its final output directly, an agentic workflow prompts the LLM multiple times, giving it opportunities to build step by step to higher-quality output. Here, I'd like to discuss Reflection. It's relatively quick to implement, and I've seen it lead to surprising performance gains. You may have had the experience of prompting ChatGPT/Claude/Gemini, receiving unsatisfactory output, delivering critical feedback to help the LLM improve its response, and then getting a better response. What if you automate the step of delivering critical feedback, so the model automatically criticizes its own output and improves its response? This is the crux of Reflection. Take the task of asking an LLM to write code. We can prompt it to generate the desired code directly to carry out some task X. Then, we can prompt it to reflect on its own output, perhaps as follows: Here’s code intended for task X: [previously generated code] Check the code carefully for correctness, style, and efficiency, and give constructive criticism for how to improve it. Sometimes this causes the LLM to spot problems and come up with constructive suggestions. Next, we can prompt the LLM with context including (i) the previously generated code and (ii) the constructive feedback, and ask it to use the feedback to rewrite the code. This can lead to a better response. Repeating the criticism/rewrite process might yield further improvements. This self-reflection process allows the LLM to spot gaps and improve its output on a variety of tasks including producing code, writing text, and answering questions. And we can go beyond self-reflection by giving the LLM tools that help evaluate its output; for example, running its code through a few unit tests to check whether it generates correct results on test cases or searching the web to double-check text output. Then it can reflect on any errors it found and come up with ideas for improvement. Further, we can implement Reflection using a multi-agent framework. I've found it convenient to create two agents, one prompted to generate good outputs and the other prompted to give constructive criticism of the first agent's output. The resulting discussion between the two agents leads to improved responses. Reflection is a relatively basic type of agentic workflow, but I've been delighted by how much it improved my applications’ results. If you’re interested in learning more about reflection, I recommend: - Self-Refine: Iterative Refinement with Self-Feedback, by Madaan et al. (2023) - Reflexion: Language Agents with Verbal Reinforcement Learning, by Shinn et al. (2023) - CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing, by Gou et al. (2024) [Original text: https://lnkd.in/g4bTuWtU ]

  • View profile for Rahul Agarwal

    Staff ML Engineer | Meta, Roku, Walmart | 1:1 @ topmate.io/MLwhiz

    45,084 followers

    Few Lessons from Deploying and Using LLMs in Production Deploying LLMs can feel like hiring a hyperactive genius intern—they dazzle users while potentially draining your API budget. Here are some insights I’ve gathered: 1. “Cheap” is a Lie You Tell Yourself: Cloud costs per call may seem low, but the overall expense of an LLM-based system can skyrocket. Fixes: - Cache repetitive queries: Users ask the same thing at least 100x/day - Gatekeep: Use cheap classifiers (BERT) to filter “easy” requests. Let LLMs handle only the complex 10% and your current systems handle the remaining 90%. - Quantize your models: Shrink LLMs to run on cheaper hardware without massive accuracy drops - Asynchronously build your caches — Pre-generate common responses before they’re requested or gracefully fail the first time a query comes and cache for the next time. 2. Guard Against Model Hallucinations: Sometimes, models express answers with such confidence that distinguishing fact from fiction becomes challenging, even for human reviewers. Fixes: - Use RAG - Just a fancy way of saying to provide your model the knowledge it requires in the prompt itself by querying some database based on semantic matches with the query. - Guardrails: Validate outputs using regex or cross-encoders to establish a clear decision boundary between the query and the LLM’s response. 3. The best LLM is often a discriminative model: You don’t always need a full LLM. Consider knowledge distillation: use a large LLM to label your data and then train a smaller, discriminative model that performs similarly at a much lower cost. 4. It's not about the model, it is about the data on which it is trained: A smaller LLM might struggle with specialized domain data—that’s normal. Fine-tune your model on your specific data set by starting with parameter-efficient methods (like LoRA or Adapters) and using synthetic data generation to bootstrap training. 5. Prompts are the new Features: Prompts are the new features in your system. Version them, run A/B tests, and continuously refine using online experiments. Consider bandit algorithms to automatically promote the best-performing variants. What do you think? Have I missed anything? I’d love to hear your “I survived LLM prod” stories in the comments!

  • View profile for Aishwarya Srinivasan
    Aishwarya Srinivasan Aishwarya Srinivasan is an Influencer
    622,392 followers

    If you’re building anything with LLMs, your system architecture matters more than your prompts. Most people stop at “call the model, get the output.” But LLM-native systems need workflows, blueprints that define how multiple LLM calls interact, how routing, evaluation, memory, tools, or chaining come into play. Here’s a breakdown of 6 core LLM workflows I see in production: 🧠 LLM Augmentation Classic RAG + tools setup. The model augments its own capabilities using: → Retrieval (e.g., from vector DBs) → Tool use (e.g., calculators, APIs) → Memory (short-term or long-term context) 🔗 Prompt Chaining Workflow Sequential reasoning across steps. Each output is validated (pass/fail) → passed to the next model. Great for multi-stage tasks like reasoning, summarizing, translating, and evaluating. 🛣 LLM Routing Workflow Input routed to different models (or prompts) based on the type of task. Example: classification → Q&A → summarization all handled by different call paths. 📊 LLM Parallelization Workflow (Aggregator) Run multiple models/tasks in parallel → aggregate the outputs. Useful for ensembling or sourcing multiple perspectives. 🎼 LLM Parallelization Workflow (Synthesizer) A more orchestrated version with a control layer. Think: multi-agent systems with a conductor + synthesizer to harmonize responses. 🧪 Evaluator–Optimizer Workflow The most underrated architecture. One LLM generates. Another evaluates (pass/fail + feedback). This loop continues until quality thresholds are met. If you’re an AI engineer, don’t just build for single-shot inference. Design workflows that scale, self-correct, and adapt. 📌 Save this visual for your next project architecture review. 〰️〰️〰️ Follow me (Aishwarya Srinivasan) for more AI insight and subscribe to my Substack to find more in-depth blogs and weekly updates in AI: https://lnkd.in/dpBNr6Jg

  • View profile for Brij kishore Pandey
    Brij kishore Pandey Brij kishore Pandey is an Influencer

    AI Architect & Engineer | AI Strategist

    716,223 followers

    In the world of Generative AI, 𝗥𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹-𝗔𝘂𝗴𝗺𝗲𝗻𝘁𝗲𝗱 𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝗼𝗻 (𝗥𝗔𝗚) is a game-changer. By combining the capabilities of LLMs with domain-specific knowledge retrieval, RAG enables smarter, more relevant AI-driven solutions. But to truly leverage its potential, we must follow some essential 𝗯𝗲𝘀𝘁 𝗽𝗿𝗮𝗰𝘁𝗶𝗰𝗲𝘀: 1️⃣ 𝗦𝘁𝗮𝗿𝘁 𝘄𝗶𝘁𝗵 𝗮 𝗖𝗹𝗲𝗮𝗿 𝗨𝘀𝗲 𝗖𝗮𝘀𝗲 Define your problem statement. Whether it’s building intelligent chatbots, document summarization, or customer support systems, clarity on the goal ensures efficient implementation. 2️⃣ 𝗖𝗵𝗼𝗼𝘀𝗲 𝘁𝗵𝗲 𝗥𝗶𝗴𝗵𝘁 𝗞𝗻𝗼𝘄𝗹𝗲𝗱𝗴𝗲 𝗕𝗮𝘀𝗲 - Ensure your knowledge base is 𝗵𝗶𝗴𝗵-𝗾𝘂𝗮𝗹𝗶𝘁𝘆, 𝘀𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲𝗱, 𝗮𝗻𝗱 𝘂𝗽-𝘁𝗼-𝗱𝗮𝘁𝗲. - Use vector embeddings (e.g., pgvector in PostgreSQL) to represent your data for efficient similarity search. 3️⃣ 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗲 𝗥𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹 𝗠𝗲𝗰𝗵𝗮𝗻𝗶𝘀𝗺𝘀 - Use hybrid search techniques (semantic + keyword search) for better precision. - Tools like 𝗽𝗴𝗔𝗜, 𝗪𝗲𝗮𝘃𝗶𝗮𝘁𝗲, or 𝗣𝗶𝗻𝗲𝗰𝗼𝗻𝗲 can enhance retrieval speed and accuracy. 4️⃣ 𝗙𝗶𝗻𝗲-𝗧𝘂𝗻𝗲 𝗬𝗼𝘂𝗿 𝗟𝗟𝗠 (𝗢𝗽𝘁𝗶𝗼𝗻𝗮𝗹) - If your use case demands it, fine-tune the LLM on your domain-specific data for improved contextual understanding. 5️⃣ 𝗘𝗻𝘀𝘂𝗿𝗲 𝗦𝗰𝗮𝗹𝗮𝗯𝗶𝗹𝗶𝘁𝘆 - Architect your solution to scale. Use caching, indexing, and distributed architectures to handle growing data and user demands. 6️⃣ 𝗠𝗼𝗻𝗶𝘁𝗼𝗿 𝗮𝗻𝗱 𝗜𝘁𝗲𝗿𝗮𝘁𝗲 - Continuously monitor performance using metrics like retrieval accuracy, response time, and user satisfaction. - Incorporate feedback loops to refine your knowledge base and model performance. 7️⃣ 𝗦𝘁𝗮𝘆 𝗦𝗲𝗰𝘂𝗿𝗲 𝗮𝗻𝗱 𝗖𝗼𝗺𝗽𝗹𝗶𝗮𝗻𝘁 - Handle sensitive data responsibly with encryption and access controls. - Ensure compliance with industry standards (e.g., GDPR, HIPAA). With the right practices, you can unlock its full potential to build powerful, domain-specific AI applications. What are your top tips or challenges?

  • View profile for Greg Coquillo
    Greg Coquillo Greg Coquillo is an Influencer

    Product Leader @AWS | Startup Investor | 2X Linkedin Top Voice for AI, Data Science, Tech, and Innovation | Quantum Computing & Web 3.0 | I build software that scales AI/ML Network infrastructure

    227,215 followers

    I consider prompting techniques some of the lowest-hanging fruits one can use to achieve step-change improvement with their model performance. This isn’t to say that “typing better instructions” is that simple. As a matter of fact, it can be quite complex. Prompting has evolved into a full discipline with frameworks, reasoning methods, multimodal techniques, and role-based structures that dramatically change how models think, plan, analyse, and create. This guide that breaks down every major prompting category you need to build powerful, reliable, and structured AI workflows: 1️⃣ Core Prompting Techniques The foundational methods include few-shot, zero-shot, one-shot, style prompts. They teach the model patterns, tone, and structure. 2️⃣ Reasoning-Enhancing Techniques Approaches like Chain-of-Thought, Graph-of-Thought, ReAct, and Deliberate prompting help LLMs reason more clearly, avoid shortcuts, and solve complex tasks step-by-step. 3️⃣ Instruction & Role-Based Prompting Define the task clearly or assign the model a “role” such as planner, analyst, engineer, or teacher to get more predictable, domain-focused outputs. 4️⃣ Prompt Composition Techniques Methods like prompt chaining, meta-prompting, dynamic variables, and templates help you build multi-step, modular workflows used in real agent systems. 5️⃣ Tool-Augmented Prompting Combine prompts with vector search, retrieval (RAG), planners, executors, or agent-style instructions to turn LLMs into decision-making systems rather than passive responders. 6️⃣ Optimization & Safety Techniques Guardrails, verification prompts, bias checks, and error-correction prompts improve reliability, factual accuracy, and trustworthiness. These are essential for production systems. 7️⃣ Creativity-Enhancing Techniques Analogy prompts, divergent prompts, story prompts, and spatial diagrams unlock creative reasoning, exploration, and alternative problem-solving paths. 8️⃣ Multimodal Prompting Use images, audio, video, transcripts, diagrams, code, or mixed-media prompts (text + JSON + tables) to build richer and more intelligent multimodal workflows. Modern prompting has fully evolved to designing thinking systems. When you combine reasoning techniques, structured instructions, memory, tools, and multimodal inputs, you unlock a level of performance that avoids costly fine tuning methods. What best practices have you used when designing prompts for your LLM? #LLM

  • View profile for Pau Labarta Bajo

    Building and teaching AI that works > Maths Olympian> Father of 1.. sorry 2 kids

    70,049 followers

    LLM engineering 101 ↓ 𝗘𝘅𝗮𝗺𝗽𝗹𝗲 Imagine creating an app that extracts crypto market signals from financial news. You pass a piece of news like "FED to increase interest rates" and get a sentiment score (-1, 0, or 1) with reasoning behind it. The question is: 𝗛𝗼𝘄 𝗱𝗼 𝘆𝗼𝘂 𝗯𝘂𝗶𝗹𝗱 𝗮𝗻 𝗟𝗟𝗠 𝗮𝗽𝗽 𝘁𝗵𝗮𝘁 𝗲𝘅𝗰𝗲𝗹𝘀 𝗮𝘁 𝘁𝗵𝗶𝘀 𝘁𝗮𝘀𝗸 ❓ 𝗦𝘁𝗲𝗽𝘀 1️⃣ 𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗲 𝗮 𝗵𝗶𝗴𝗵-𝗾𝘂𝗮𝗹𝗶𝘁𝘆 𝗶𝗻𝘀𝘁𝗿𝘂𝗰𝘁𝗶𝗼𝗻 𝗱𝗮𝘁𝗮𝘀𝗲𝘁 This step is THE MOST CRITICAL - you'll spend 80% of your development time here. For our example, we need crypto news (inputs) mapped to sentiment scores (outputs). I recommend a hybrid approach: > Fetch real news from the same API you'll use in production > Use a strong LLM like Claude to map news to scores > Have human experts validate and fix the LLM-generated scores 2️⃣ 𝗕𝘂𝗶𝗹𝗱 𝗮 𝗯𝗮𝘀𝗲𝗹𝗶𝗻𝗲 𝗺𝗼𝗱𝗲𝗹 A strong general-purpose LLM with a simple prompt often gets you up and running quickly. I like using Claude for this. 3️⃣ 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗲 𝘆𝗼𝘂𝗿 𝘀��𝗹𝘂𝘁𝗶𝗼𝗻 Choose metrics that make sense for your task. For our structured output example: > Is the output correctly formatted as JSON? (Type Error) > Do sentiment scores match ground truth? (Mean Absolute Error) 4️⃣ 𝗜𝘀 𝗶𝘁 𝗴𝗼𝗼𝗱 𝗲𝗻𝗼𝘂𝗴𝗵? If yes, deploy! If not, continue to step 5. 5️⃣ 𝗘𝗿𝗿𝗼𝗿 𝗮𝗻𝗮𝗹𝘆𝘀𝗶𝘀 Manually inspect failed examples to understand why your model struggled: > Does it lack context? Consider RAG > Would external tools help? Equip your LLM with tools > Did complex reasoning steps cause failure? Try a chain of LLMs. 6️⃣ 𝗠𝗼𝗱𝗲𝗹 𝗶𝗺𝗽𝗿𝗼𝘃𝗲𝗺𝗲𝗻𝘁 Start with prompt engineering (try AdalFlow library). If that is not enough, and you want to boost performance and reduce inference costs, do supervised fine tuning. AND Remember ☝️ There's no such thing as spending too much time generating good training data! This is what makes or breaks the deal. ----- Hi there! It's Pau Labarta Bajo 👋 Every day I share free, hands-on content, on production-grade ML, to help you build real-world ML products. 𝗙𝗼𝗹𝗹𝗼𝘄 𝗺𝗲 and 𝗰𝗹𝗶𝗰𝗸 𝗼𝗻 𝘁𝗵𝗲 🔔 so you don't miss what's coming next

Explore categories