After spending the past 2 years at Cleanlab, I’m excited to share that I’m starting a new role as a Senior Machine Learning Researcher in the DataLab team at Protege.
Working alongside Curtis Northcutt, Jonas Mueller, and Anish Athalye at Cleanlab was one of the most formative experiences of my career. I’m deeply grateful for the chance to learn from them, to work on data-centric AI with such thoughtful researchers and builders, and to contribute during a period that ultimately led to Cleanlab being acquired into Handshake AI. I learned a tremendous amount about how data quality, evaluation, and trustworthiness make modern AI systems more accurate and reliable.
Throughout my time there, my conviction only grew that the next major advances in AI will come not just from better models or more compute, but from better data.
I’m excited to now be joining Bobby Samuels, Engy Ziedan, and the rest of the awesome team at Protege, where I’ll be working in the DataLab on the research and systems needed to help close the “data gap”.
At DataLab, our goal is to treat the data layer of AI with the same scientific rigor that model labs apply to algorithms. That means building a dedicated research institution for AI data: designing high-fidelity datasets and multimodal benchmarks grounded in real-world scenarios, working closely with frontier labs on their hardest data challenges, and developing standardized ways (including “FICO scores for AI data”) to measure dataset quality, contamination, and benchmark reliability.
Another important piece of this work is understanding how different kinds of data support different parts of the AI training stack. Reinforcement learning (RL) environments are a powerful form of training data that generate structured training tuples like (state, action, reward, next state) and are extremely useful for post-training optimization when the world can be simulated. But many of the highest-value domains for AI, including healthcare, enterprise workflows, and complex multimodal reasoning, cannot be faithfully simulated. Advancing models in these areas requires real-world datasets, carefully designed benchmarks, and domain-specific data for pre-training and mid-training adaptation.
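To make the idea of an RL environment as training data concrete, here is a minimal sketch of a toy simulated environment whose rollouts yield exactly those (state, action, reward, next state) tuples. All names (`GridWorld`, `collect_tuples`, the reward scheme) are illustrative assumptions, not anything specific to DataLab or Protege.

```python
import random


class GridWorld:
    """A toy simulated environment: an agent on a 1-D line of `size`
    cells earns a reward of 1.0 for reaching the rightmost cell.
    Purely illustrative, not a real training environment."""

    def __init__(self, size: int = 5):
        self.size = size
        self.state = 0

    def step(self, action: int):
        # action is -1 (move left) or +1 (move right), clipped to the grid
        next_state = min(max(self.state + action, 0), self.size - 1)
        reward = 1.0 if next_state == self.size - 1 else 0.0
        transition = (self.state, action, reward, next_state)
        self.state = next_state
        return transition


def collect_tuples(env: GridWorld, policy, n_steps: int = 10):
    """Roll out a policy and collect the structured
    (state, action, reward, next_state) tuples an RL environment
    generates as training data."""
    return [env.step(policy(env.state)) for _ in range(n_steps)]


random.seed(0)
data = collect_tuples(GridWorld(), policy=lambda s: random.choice([-1, 1]))
```

Each element of `data` is one training tuple; a post-training pipeline would consume batches of such tuples, which is only possible when the domain, unlike healthcare or enterprise workflows, can actually be simulated.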
The idea behind DataLab is simple but important: every major leap in AI capability has historically followed a breakthrough in data, from ImageNet to large-scale web corpora. As models and compute continue to advance rapidly, closing the data gap (the gap between the data that AI systems need and the data that actually exists in usable form) may be one of the most important challenges for the field.
You can read more about the vision behind DataLab here:
https://lnkd.in/e_pzVaq5
Excited for what’s ahead!