Using evolutionary programming with OpenEvolve (my open-source implementation of DeepMind's AlphaEvolve), I successfully optimized Metal kernels for transformer attention on Apple Silicon, achieving 12.5% average performance improvements with a 106% peak speedup on specific workloads.

What makes this particularly exciting:

🔬 No human expert provided GPU programming knowledge - the system autonomously discovered hardware-specific optimizations, including perfect SIMD vectorization for Apple Silicon and novel algorithmic improvements like two-pass online softmax

📊 Comprehensive evaluation across 20 diverse inference scenarios showed workload-dependent performance, with significant gains on dialogue tasks (+46.6%) and extreme-length generation (+73.9%), though some regressions on code generation (-16.5%)

⚡ The system discovered genuinely novel optimizations: 8-element vector operations that perfectly match Apple Silicon's capabilities, memory access patterns optimized for Qwen3's 40:8 grouped query attention structure, and algorithmic innovations that reduce memory bandwidth requirements

🎯 This demonstrates that evolutionary code optimization can compete with expert human engineering, automatically discovering hardware-specific optimizations that would otherwise require deep expertise in GPU architecture, Metal programming, and attention algorithms

The broader implications are significant. As hardware architectures evolve rapidly (new GPU designs, specialized AI chips), automated optimization becomes invaluable for discovering improvements that would be extremely difficult to find manually. This work establishes evolutionary programming as a viable approach to automated GPU kernel discovery, with potential applications across performance-critical computational domains.

All code, benchmarks, and evolved kernels are open source and available for the community to build upon. The technical write-up with complete methodology and results is published on Hugging Face.
The intersection of evolutionary algorithms and systems optimization is just getting started. Links in first comment 👇 #AI #MachineLearning #GPUOptimization #PerformanceEngineering #OpenSource #EvolutionaryAlgorithms #AppleSilicon #TransformerOptimization #AutomatedProgramming
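To make the "two-pass online softmax" idea above concrete, here is a minimal Python sketch of the classic online-softmax recurrence (running maximum plus rescaled running sum over streamed chunks). This is an illustration of the general technique, not the evolved Metal kernel itself:

```python
import math

def online_softmax(scores, chunk=4):
    """Two-pass online softmax sketch.

    Pass 1 streams the scores in chunks, maintaining a running maximum m
    and a running sum s of exponentials rescaled to that maximum, so the
    full exp(scores) vector never has to be materialized in one pass.
    Pass 2 normalizes each score against the final (m, s).
    """
    m, s = float("-inf"), 0.0
    for i in range(0, len(scores), chunk):
        block = scores[i:i + chunk]
        m_new = max(m, max(block))
        # Rescale the running sum to the new maximum before accumulating.
        s = s * math.exp(m - m_new) + sum(math.exp(x - m_new) for x in block)
        m = m_new
    return [math.exp(x - m) / s for x in scores]
```

The payoff on a GPU is that pass 1 needs only one streaming read of the scores with O(1) extra state per row, which is what reduces memory-bandwidth pressure in attention decoding.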
Innovations Driving Machine Learning Optimization
Explore top LinkedIn content from expert professionals.
Summary
Innovations driving machine learning optimization refer to new techniques and technologies that make artificial intelligence models smarter, faster, and more resource-efficient. These advances help AI deliver impressive results without simply relying on ever-larger models or massive computing power, making intelligent solutions more accessible and practical.
- Embrace smarter architecture: Explore new model designs that use efficient memory management and dynamic expert selection to reduce resource needs while maintaining high performance.
- Apply automated discovery: Harness evolutionary algorithms and advanced training strategies to let machines uncover hardware-specific improvements and novel techniques that humans might miss.
- Choose compact solutions: Consider smaller, well-trained models and innovative retrieval methods—like recursive query reformulation—to achieve fast, accurate results without overwhelming computational costs.
🚀 Yesterday, I shared how DeepSeek-V3 achieved impressive performance with limited resources. As promised, here is the first of three deep-dive posts, covering:
1️⃣ Architectural innovations (today's focus)
2️⃣ Training strategies & optimization
3️⃣ Post-training refinements

⚙️ Architectural Innovations

DeepSeek-V3 made significant breakthroughs in architecture to improve efficiency without compromising performance.

🔹 Multi-Head Latent Attention (MLA) – Efficient Memory Management for Attention
Traditional Transformers remember all previous words (tokens) by storing key-value pairs, which takes up a lot of memory. Multi-Head Latent Attention (MLA) reduces this by compressing the stored values using low-rank matrices - like summarizing long notes into key points while keeping the important details. It also compresses queries during training, further cutting memory usage without losing accuracy.
To simplify: imagine a library where, instead of keeping full books open, you store only short summaries that still let you find the right information quickly.

🔹 DeepSeekMoE (Mixture-of-Experts) – Smarter Expert Selection for Cost Efficiency
Unlike standard MoE models, DeepSeek-V3 introduces finer-grained experts and shared experts. Instead of every input activating the same number of experts, some experts are dynamically shared, reducing redundancy. This improves efficiency while maintaining diversity in learned representations.
To simplify: think of a consulting firm with specialists in different fields. Instead of randomly assigning experts to tasks, DeepSeek assigns only the most relevant ones, while keeping a few generalists available for shared work.

🔹 Auxiliary-Loss-Free Load Balancing – Smarter Expert Utilization
Most MoE models use auxiliary loss functions to ensure experts are equally utilized. However, these losses can degrade performance. DeepSeek-V3 replaces them with dynamic bias terms, adjusting expert selection on the fly based on workload distribution.
To simplify: imagine a manager distributing work among employees. Instead of punishing overworked employees, the system automatically shifts tasks to balance the load while keeping performance high.

🔹 Multi-Token Prediction (MTP) – Speeding Up Training & Inference
Instead of predicting just one token at a time, DeepSeek-V3 predicts multiple tokens in parallel. This provides denser training signals, leading to faster convergence. During inference, speculative decoding allows it to process sequences more efficiently, reducing latency.
To simplify: instead of typing one word at a time, imagine predicting whole phrases ahead. This speeds up both writing and understanding.

These architectural innovations contribute to #DeepSeek-V3's high performance at a fraction of the usual compute cost.

I write about #artificialintelligence | #technology | #startups | #mentoring | #leadership | #financialindependence
PS: All views are personal
Vignesh Kumar
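To put a number on the MLA memory-saving intuition above, here is a back-of-the-envelope Python sketch comparing a standard multi-head KV cache against caching a single compressed latent per token. The head counts and latent width are illustrative toy values, not DeepSeek-V3's actual configuration:

```python
def kv_cache_bytes(seq_len, n_heads, head_dim, dtype_bytes=2):
    """Standard attention cache: per token, store K and V for every head."""
    return seq_len * n_heads * head_dim * 2 * dtype_bytes

def mla_cache_bytes(seq_len, latent_dim, dtype_bytes=2):
    """MLA-style cache sketch: per token, store one compressed latent
    vector from which K/V are re-expanded on the fly (toy numbers)."""
    return seq_len * latent_dim * dtype_bytes

# Toy comparison at 4K context: 32 heads of dim 128 vs. a 512-dim latent.
full = kv_cache_bytes(4096, 32, 128)     # 64 MiB in fp16
latent = mla_cache_bytes(4096, 512)      # 4 MiB in fp16
ratio = full / latent                    # 16x smaller cache
```

Even in this toy setup the compressed cache is 16x smaller, which is exactly the "short summaries instead of full books" trade described in the post.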
-
Exciting New Research Alert: Small Language Models Are Proving Their Worth!

A groundbreaking survey from Amazon researchers reveals that Small Language Models (SLMs) with just 1-8B parameters can match or even outperform their larger counterparts. Here's what makes this fascinating:

Technical Innovations:
- SLMs like Mistral 7B implement grouped-query attention (GQA) and sliding window attention with a rolling buffer cache to achieve performance equivalent to 38B parameter models
- Phi-1, with just 1.3B parameters trained on 7B tokens, outperforms models like Codex-12B (100B tokens) and PaLM-Coder-540B through high-quality "textbook" data
- TinyLlama (1.1B) leverages Rotary Positional Embedding, RMSNorm, and SwiGLU activation functions to match larger models on key benchmarks

Architecture Breakthroughs:
- Hybrid approaches like Hymba combine transformer attention with state space models in parallel layers
- Qwen models use enhanced tokenization (152K vocabulary) with untied embeddings and FP32-precision RoPE
- Novel quantization and pruning techniques enable deployment on mobile devices

Performance Highlights:
- Gemini Nano (1.8B-3.25B parameters) shows exceptional capabilities in factual retrieval and reasoning
- Orca 13B achieves 88% of ChatGPT's performance on reasoning tasks
- Phi-4 surpasses GPT-4-mini on mathematical reasoning

The research demonstrates that with optimized architectures, high-quality training data, and innovative techniques, smaller models can deliver impressive performance while being more efficient and deployable. This is a game-changer for organizations looking to implement AI solutions with limited computational resources.

The future of AI might not necessarily be about building bigger models, but smarter ones.
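The grouped-query attention (GQA) mentioned for Mistral 7B boils down to a simple head-sharing map: many query heads read from one shared KV head, shrinking the KV cache proportionally. A minimal sketch of that mapping (Mistral 7B's published 32 query / 8 KV head split is used as the example):

```python
def gqa_head_map(n_q_heads, n_kv_heads):
    """GQA sketch: each KV head serves a contiguous group of query heads.

    Returns, for each query head index, the KV head it attends with.
    The KV cache shrinks by a factor of n_q_heads / n_kv_heads versus
    standard multi-head attention.
    """
    assert n_q_heads % n_kv_heads == 0, "query heads must split evenly"
    group = n_q_heads // n_kv_heads
    return [q // group for q in range(n_q_heads)]

# Mistral-7B-style layout: 32 query heads share 8 KV heads (4 per group).
mapping = gqa_head_map(32, 8)
```

At inference time only the 8 KV heads are cached, so the rolling-buffer cache the post mentions stores a quarter of the keys and values a full 32-head cache would.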
-
AI progress has long been dominated by raw scale - larger datasets, bigger models, and massive compute budgets. But recent breakthroughs suggest that efficiency in training, retrieval, and reasoning may now be more important than brute-force scaling.

The first shock came with DeepSeek-R1, an open-source model that demonstrated that reinforcement learning (RL) alone - without extensive supervised fine-tuning - can develop reasoning capabilities comparable to proprietary models [1]. This shift is reinforced by Qwen 2.5's architecture optimizations and Janus-Pro's multimodal advancements, proving that cheaper, faster, and more effective AI is possible without simply increasing parameter counts [2].

DeepSeek-R1 shows that RL can be a primary mechanism for improving LLM reasoning, not just an alignment tool [1]. Its initial version, DeepSeek-R1-Zero, trained purely via RL, displayed strong reasoning but suffered from readability issues. The refined DeepSeek-R1, incorporating minimal cold-start data and rejection-sampling fine-tuning, reached OpenAI-o1-1217-level performance at a fraction of the cost. This challenges the conventional pretraining-heavy paradigm.

AI architecture is also undergoing a fundamental shift. Janus-Pro, from DeepSeek-AI, introduces a decoupled approach to multimodal AI, separating image understanding from image generation [2]. Unlike previous models that forced both tasks through a shared transformer, Janus-Pro optimizes each independently, outperforming DALL-E 3 and Stable Diffusion 3 Medium in instruction-following image generation.

At a more fundamental level, ByteDance's Over-Tokenized Transformers reveal a silent inefficiency in LLM design: tokenization is a bottleneck [3]. Their research shows that expanding the input vocabulary - while keeping the output vocabulary manageable - drastically reduces training costs and improves performance. A 400M parameter model with an optimized tokenizer matched the efficiency of a 1B parameter baseline (!), proving that many LLMs are computationally bloated due to suboptimal tokenization strategies.

Beyond efficiency, AI is also becoming more structured in reasoning and retrieval. Google DeepMind's Mind Evolution introduces a genetic-algorithm-like refinement process [4], evolving multiple solution candidates in parallel and iteratively improving them. This could lead to AI systems that autonomously refine their own answers rather than relying on static generation.

Meanwhile, Microsoft's CoRAG is redefining RAG by solving the multi-hop retrieval challenge [5]. Standard RAG models retrieve once before generating a response, failing on multi-step queries. CoRAG introduces recursive retrieval, dynamically reformulating queries at each step, leading to a 10+ point improvement on multi-hop QA benchmarks.

The combined effect of these breakthroughs is a shift in how AI is trained, how it retrieves knowledge, and how it reasons in real time - everything you need to design more intelligent brains.
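The recursive-retrieval loop that makes CoRAG-style systems work on multi-hop questions is easy to sketch: retrieve, let a reformulator rewrite the query using what was found, and repeat until it signals completion. The `retrieve` and `reformulate` callables below are toy stand-ins for a real search index and an LLM call, and the knowledge-base entries are invented for illustration:

```python
def recursive_retrieve(question, retrieve, reformulate, max_hops=4):
    """Multi-hop retrieval sketch in the spirit of CoRAG (not its actual
    implementation): each hop retrieves, then reformulates the query
    based on the evidence gathered so far; None means 'enough evidence'."""
    query, evidence = question, []
    for _ in range(max_hops):
        evidence.append(retrieve(query))
        query = reformulate(question, evidence)
        if query is None:
            break
    return evidence

# Toy two-hop chain: a single-shot retriever would miss the second fact.
kb = {
    "where does the Rhine end": "the Rhine ends in the Netherlands",
    "capital of the Netherlands": "Amsterdam",
}

def retrieve(q):
    return kb.get(q, "no result")

def reformulate(question, evidence):
    # Stand-in for an LLM: after hop 1 it knows which country to ask about.
    return "capital of the Netherlands" if len(evidence) == 1 else None

evidence = recursive_retrieve("where does the Rhine end", retrieve, reformulate)
```

A single-retrieval RAG pipeline answering "what is the capital of the country where the Rhine ends?" would only ever see the first fact; the reformulation step is what bridges the hop.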
-
🚀 [New Preprint Alert! Hybrid ML & Optimization] I’m excited to share our latest research: “A Hybrid Machine Learning and Optimization Framework for Large-Scale Multi-Order Courier Assignment” led by Minh Vu 📝 Preprint link: https://lnkd.in/g47EMu_g With the explosive growth of on-demand delivery, efficiently matching thousands of couriers to rapidly incoming orders has become a high-stakes challenge. Traditional optimization methods often overlook a critical factor—courier behavior, especially whether couriers will accept the assignments offered to them. 🔍 What We Introduce: ✅ A high-performance machine learning model to predict courier acceptance probability (AUC = 0.924, Accuracy = 86.5%) ✅ A multi-objective MILP model that balances: • Order acceptance • Delivery time • Fair workload distribution 📈 Key Results (Real-World Dataset – 5,960 courier-order pairs): 📦 Actual acceptance rate improved from 73.2% → 81.5% ⚖️ Trade-off between fairness and efficiency made explicit and adjustable 🎯 Operators can now tune their strategy along the efficiency–fairness frontier 💡 Why It Matters: This hybrid ML + optimization framework turns a complex operational problem into a controllable decision-making process, empowering last-mile logistics platforms to align strategies with real-world service goals. If you’re working in last-mile logistics, dispatch systems, dynamic matching, or decision support, I’d love to hear your thoughts and discuss potential collaborations. #MachineLearning #Optimization #OnDemandDelivery #LastMileLogistics #DecisionSupport #OperationsResearch #AIInLogistics #Purdue #SoET #IndustrialEngineeringTechnology #PPI
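The core hybrid idea (feed predicted acceptance probabilities into an assignment optimizer) can be shown in miniature. The brute-force search below is a toy stand-in for the paper's multi-objective MILP, which additionally weighs delivery time and workload fairness; the probability matrix is invented for illustration:

```python
from itertools import permutations

def best_assignment(p_accept):
    """Pick the courier→order assignment maximizing expected acceptances.

    p_accept[c][o] is an ML-predicted probability that courier c accepts
    order o. Brute force is fine for a toy n; a real system would use a
    MILP or the Hungarian algorithm instead.
    """
    n = len(p_accept)
    best, best_score = None, -1.0
    for perm in permutations(range(n)):
        score = sum(p_accept[c][o] for c, o in enumerate(perm))
        if score > best_score:
            best, best_score = list(perm), score
    return best, best_score

# Toy 3x3 acceptance-probability matrix (hypothetical values).
p = [[0.9, 0.2, 0.1],
     [0.3, 0.8, 0.2],
     [0.1, 0.3, 0.7]]
assignment, expected_accepts = best_assignment(p)
```

Swapping the objective for a weighted sum of acceptance, delivery time, and a fairness term is what lets operators tune along the efficiency-fairness frontier the post describes.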
-
A couple of weeks ago, among other things, I called out DeepSeek AI's FlashMLA: a suite of efficiency solutions that improves GPU utilization for AI workloads, with increased speed.

🔸 TLDR: It's fascinating to see such quick innovations in CUDA programming right after DeepSeek, aiming to achieve substantial efficiency gains in variable-length prompt processing and small-batch inference scenarios.

🔹 Stanford researchers have now soft-launched ThunderMLA, an optimized GPU decoding mechanism designed to accelerate large language model inference by implementing a fully fused "megakernel" for attention decoding.

🔹 In other words, this megakernel consolidates multiple kernel operations into a single execution unit, reducing the overhead associated with individual kernel launches, such as setup and teardown times, while mitigating tail effects and improving memory bandwidth utilization.

🔹 By leveraging custom scheduling strategies, including static and makespan-backward schedulers, ThunderMLA optimizes task execution order and resource allocation, achieving a 20-35% speedup over FlashMLA.

🔹 Behind this performance gain we find ThunderKittens, an embedded domain-specific language (DSL) developed by the researchers that simplifies writing high-performance AI kernels for GPUs.

🔹 ThunderKittens maintains extensibility and uses fundamental objects that align with tensor cores for optimal utilization, while abstracting complex GPU programming tasks.

🔹 It provides a PyTorch-like API, making it accessible while remaining hardware-transparent for developers needing fine-grained control.

Looking forward to the technical report, as well as extensions of this Multi-Head Latent Attention speedup to other areas. I'll be glad to share it!

See more below

#genai #technology #artificialintelligence
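To build intuition for the static scheduling a megakernel enables: once all work is known up front, tasks can be packed onto workers to minimize the makespan (the finish time of the busiest worker), which is what shrinks the tail effects mentioned above. The classic longest-processing-time-first greedy below is an illustrative stand-in, not ThunderMLA's actual scheduler:

```python
import heapq

def lpt_schedule(task_costs, n_workers):
    """Longest-processing-time-first static schedule sketch.

    Sort tasks by decreasing cost and repeatedly give the next task to
    the currently least-loaded worker. Returns per-worker task lists and
    the resulting makespan (load of the busiest worker).
    """
    heap = [(0.0, w) for w in range(n_workers)]  # (current load, worker)
    heapq.heapify(heap)
    plan = [[] for _ in range(n_workers)]
    for task, cost in sorted(enumerate(task_costs), key=lambda tc: -tc[1]):
        load, w = heapq.heappop(heap)
        plan[w].append(task)
        heapq.heappush(heap, (load + cost, w))
    makespan = max(load for load, _ in heap)
    return plan, makespan

# Toy example: six attention tiles of uneven cost on two SMs.
plan, makespan = lpt_schedule([7, 5, 4, 3, 2, 1], n_workers=2)
```

With separate kernel launches, each launch also pays fixed setup/teardown overhead per task; a fused megakernel both removes that per-launch cost and gets to plan the packing globally.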