Selene Is Betting an 8-Billion-Parameter Judge Can Grade GPT-4o's Homework

The London YC startup, backed by Creandum's $5M seed, wants to be the scoring layer every AI agent runs through before shipping.

Every team building an AI agent eventually hits the same wall: how do you know the model's output is any good? Human review does not scale. Regex does not catch hallucinations. So the industry has settled on an awkward but workable answer: point one large language model at another and ask it to grade the work. That is the market Selene, the London-based startup from founders Maurice Burger and Roman Engeler, is going after.

The company's pitch is direct. Selene calls itself the world's most accurate LLM-as-a-Judge, designed to plug into popular models including GPT-4o and Claude 3.5 Sonnet [Y Combinator]. The flagship developer-facing product, Selene Mini, is an 8-billion-parameter evaluator available via API that the company says outperforms top small models, including GPT-4o mini, on average across 11 evaluation benchmarks [Atla AI]. Built as a fine-tune of Llama-3.1-8B, it is positioned as the small-model judge that punches above its weight [Galtea].

The bet

Selene's wedge is narrow on purpose. Rather than competing with foundation labs, the company is selling the scoring layer that sits between an AI app and production. Developers call the API, pass in a prompt and a model output, and get back a structured judgment: is this answer faithful, relevant, safe, and on-brand? The thesis is that judging is itself a specialized task, and that a purpose-built small model will be cheaper, faster, and more accurate than asking GPT-4o to grade itself. Selene Mini, the company reports, beats models several times its size, GPT-4o included, on RewardBench, EvalBiasBench, and Auto-J [Atla AI].
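To make that call pattern concrete, here is a minimal Python sketch of a judge request and a structured verdict. It is not Selene's API: the function names, the criteria keys, the JSON response shape, and the stub `fake_judge_model` are all assumptions invented for illustration. A real integration would replace the stub with a call to whichever judge model a team actually uses.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical criteria, mirroring the four questions named in the article.
CRITERIA = ("faithful", "relevant", "safe", "on_brand")


@dataclass
class Judgment:
    scores: dict[str, bool]  # one pass/fail flag per criterion
    critique: str            # the judge's short explanation

    @property
    def passed(self) -> bool:
        # An output passes only if every criterion passes.
        return all(self.scores.values())


def build_judge_prompt(prompt: str, output: str) -> str:
    """Wrap the original prompt and the answer in grading instructions."""
    criteria = ", ".join(CRITERIA)
    return (
        "You are grading another model's answer.\n"
        f"Original prompt:\n{prompt}\n\n"
        f"Answer to grade:\n{output}\n\n"
        f"Reply with JSON containing boolean fields {criteria} "
        "and a short string field critique."
    )


def judge(prompt: str, output: str, call_model: Callable[[str], str]) -> Judgment:
    """Send the grading prompt to a judge model and parse its JSON verdict."""
    raw = call_model(build_judge_prompt(prompt, output))
    data = json.loads(raw)
    scores = {name: bool(data[name]) for name in CRITERIA}
    return Judgment(scores=scores, critique=str(data.get("critique", "")))


def fake_judge_model(judge_prompt: str) -> str:
    # Stand-in for a real judge model; always returns the same canned verdict.
    return json.dumps({
        "faithful": True,
        "relevant": True,
        "safe": True,
        "on_brand": False,
        "critique": "Tone is too casual for a support reply.",
    })


result = judge(
    "How do I reset my password?",
    "lol just click 'Forgot password' on the login page",
    fake_judge_model,
)
print(result.passed, result.critique)
```

Passing the model call in as a function keeps the parsing and pass/fail logic independent of any particular vendor, so the same harness could in principle be pointed at a small specialist judge or a frontier model.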

That framing matters because evaluation is the unglamorous infrastructure problem nobody could ignore once agents started shipping. If you are running a customer-support bot, a coding copilot, or a RAG pipeline, you need a way to score thousands of outputs continuously. Doing that with a frontier model gets expensive fast. An 8B specialist, if the accuracy claims hold, is the kind of primitive every AI team eventually wires in.

Why it could be big

The round tells you who else thinks so. Selene closed a $5 million seed in December 2023 led by Creandum, with Y Combinator also on the cap table [Crunchbase]. Creandum is the firm behind Spotify, Klarna, and Depop, and its early-stage AI bets have been concentrated rather than scattershot. Y Combinator's involvement gives the company a distribution channel into thousands of other YC startups already shipping LLM products, the exact buyer profile for an evaluation API.

Metric                    Value
Seed funding              $5 million
Team size                 10 people
Selene Mini parameters    8 billion
Benchmarks cited          11

The competitive set is real and getting more crowded. Patronus AI, Galileo, and Braintrust are all chasing variants of the same problem, with Patronus and Galileo both better capitalized at this point. But the category itself is expanding fast enough that more than one winner is plausible. Every company integrating an LLM eventually needs an evaluation harness, and most do not want to build one in-house. Whether they buy from Selene or a competitor will come down to accuracy benchmarks, latency, and pricing, and Selene has chosen to compete on the first of those publicly and aggressively.

The team and traction

Burger and Engeler, the latter serving as CTO, run a team of roughly 10 employees with backgrounds the company describes as drawn from leading AI labs and startups [Y Combinator, LinkedIn]. The headcount is small for a seed-stage AI infrastructure company, which is consistent with the research-heavy nature of the work: training and fine-tuning an evaluation model that beats GPT-4o on targeted benchmarks is closer to a research project than a CRUD app. The decision to open-source weights for Selene 1 Mini on Hugging Face [Hugging Face / Atla AI] is the kind of move that buys credibility with the developer audience the company needs, even if it complicates monetization down the line.

The honest counterfactual

The bear case is straightforward. Frontier labs keep improving their own evaluation tooling, and OpenAI, Anthropic, and Google could each ship good-enough native judging APIs that developers use by default [Toolify]. Patronus and Galileo have raised more and are further along in their enterprise sales motion. The risk is that LLM-as-a-Judge becomes a feature rather than a category, absorbed into the platforms developers already use. The bull answer, from the cited evidence, is that specialist accuracy compounds: if Selene Mini genuinely outperforms GPT-4o on judging tasks at a fraction of the cost [Atla AI], teams running evaluation at scale will route through the cheaper, more accurate model regardless of who hosts the underlying app. The 8B form factor also makes self-hosting realistic for customers with data-residency constraints, a wedge the hyperscalers structurally cannot match.

What to watch

The next 12 months will turn on three things. First, whether Selene moves from open-weights research credibility into paid API revenue, and how quickly the customer logos start appearing in case studies. Second, whether the company raises a Series A in 2025, which at current AI infrastructure multiples would likely value the business in the $40 million to $80 million range (estimated) if traction is there. Third, whether a larger judging model follows the Mini, giving the product line a clear good-better-best structure that maps to enterprise procurement.

The bigger question for readers: as AI agents get deployed into customer-facing workflows, who owns the scorecard: the lab that built the model, or the specialist that grades it?

Read on Startuply.vc