Selene
The world’s most accurate LLM-as-a-Judge
Website: https://atla-ai.com
Cover Block
PUBLIC
| Field | Value |
|---|---|
| Name | Selene (developed by Atla AI) |
| Tagline | "The world's most accurate LLM-as-a-Judge" [Y Combinator] |
| Headquarters | London, United Kingdom |
| Founded | 2023 |
| Stage | Seed |
| Business Model | API / Developer Platform |
| Industry | AI Evaluation / Developer Infrastructure |
| Technology Type | AI / Machine Learning (fine-tuned LLM evaluators) |
| Growth Profile | Venture Scale |
| Funding Label | Seed |
| Total Disclosed | ~$5,000,000 [Crunchbase] |
Links
PUBLIC
- Website: https://www.atla-ai.com/
- LinkedIn: https://www.linkedin.com/company/atla-ai
- Y Combinator profile: https://www.ycombinator.com/companies/atla
- Hugging Face (Selene 1 Mini): https://huggingface.co/blog/AtlaAI/selene-1-mini
- Crunchbase: https://www.crunchbase.com/organization/atla-037f
Executive Summary
PUBLIC
Selene, the flagship product of London-based Atla AI, is a family of fine-tuned language models purpose-built to evaluate the outputs of other language models, a category commonly called "LLM-as-a-Judge." The company was founded in 2023 by Maurice Burger and Roman Engeler and went through Y Combinator before raising a Seed round of approximately $5 million led by Creandum on December 7, 2023 [Crunchbase]. Its core technical bet is that a small, specialized evaluator model can grade the quality of generative outputs more accurately than a general-purpose frontier model called via API. The team's published benchmarks for Selene 1 Mini, a fine-tuned Llama-3.1-8B variant, report that it outperforms GPT-4o on RewardBench, EvalBiasBench and Auto-J [Atla AI]. The team is described as a small, highly technical group of researchers and engineers drawn from leading AI labs and startups, and LinkedIn lists roughly ten employees [Y Combinator] [LinkedIn]. Commercially, Selene is offered via API and positioned to sit inside the development and observability loops of teams shipping LLM-powered products, putting Atla in direct competition with Patronus AI, Galileo and Braintrust. Over the next 12 to 18 months, the questions that matter are whether Selene's accuracy lead on public benchmarks translates into paid API consumption from production AI teams, whether the company can extend from "judge model" into the broader evaluation and observability stack, and whether a Seed round of this size provides enough runway to reach a defensible Series A milestone in a category that is attracting well-capitalized challengers.
Data Accuracy: GREEN -- Confirmed by Crunchbase, Y Combinator and Atla AI's own technical posts.
Taxonomy Snapshot
| Axis | Value |
|---|---|
| Stage | Seed |
| Business Model | API / Developer Platform |
| Industry / Vertical | AI Evaluation Infrastructure |
| Technology Type | Fine-tuned small language models for evaluation |
| Geography | United Kingdom (London HQ) |
| Growth Profile | Venture Scale |
| Founding Team | Maurice Burger (CEO), Roman Engeler (CTO) |
| Funding | Seed, ~$5,000,000 disclosed [Crunchbase] |
Company Overview
PUBLIC
Atla AI, the legal entity behind the Selene product line, was founded in 2023 and is headquartered in London. The company entered Y Combinator and used the program's launch channel to introduce Selene as "the world's most accurate LLM-as-a-Judge," pitched as a drop-in evaluator that works with frontier models including GPT-4o and Claude 3.5 Sonnet [Y Combinator]. Cofounders Maurice Burger and Roman Engeler, the latter serving as CTO, lead a team that LinkedIn lists at roughly ten people, described publicly as researchers and engineers who previously worked at leading AI labs and startups [LinkedIn] [Y Combinator].
The company's first publicly traceable financing event is a Seed round closed on December 7, 2023 for approximately $5 million, led by Creandum with participation from Y Combinator [Crunchbase]. The most visible product milestone since then is the release of Selene 1 Mini, a fine-tuned Llama-3.1-8B model, which Atla published both as a research artifact (an arXiv preprint titled "Atla Selene Mini: A General Purpose Evaluation Model") and as a hosted API offering [arXiv / Atla] [Atla AI]. Hugging Face hosts the model weights and an accompanying blog post positioning Selene 1 Mini as "the best small language model-as-a-judge" [Hugging Face / Atla AI].
The combination of a research-style technical preprint, an open-weights release, and a commercial API points to a deliberate dual-track go-to-market: build credibility with the AI research community while monetizing production usage through a managed endpoint. That posture is consistent with how several developer-tools companies in adjacent categories have built early adoption.
Data Accuracy: GREEN -- Confirmed by Crunchbase, Y Combinator, LinkedIn and Atla's own publications.
Product and Technology
MIXED
Selene is a family of evaluator models designed to score, critique or rank the outputs of other language models against criteria such as factuality, helpfulness, safety and adherence to instructions. According to Atla, Selene 1 Mini is an 8-billion-parameter model fine-tuned from Meta's Llama-3.1-8B, and it is offered both as open weights on Hugging Face and as a hosted API [Hugging Face / Atla AI] [Atla AI]. Atla's own benchmark write-up reports that Selene 1 Mini "outperform[s] top small models including GPT-4o mini on average performance across 11 benchmarks for evaluations" [Atla AI], and a separate post claims it "beats models several times its size, outperforming GPT-4o on RewardBench, EvalBiasBench, and Auto-J" [Atla AI]. Independent commentary from Galtea, an evaluation-tooling vendor, describes Selene-1-Mini-Llama-3.1-8B as consistently outperforming other compact models on accuracy in their internal review [Galtea].
Functionally, the product slots into the developer workflow at two points. The first is offline evaluation: a team running regression tests on prompt or model changes can call Selene to score outputs and detect quality drift before shipping. The second is online evaluation: a team running an LLM in production can route a sample of traffic through Selene to monitor for hallucinations, bias or policy violations. The Y Combinator launch page emphasizes drop-in compatibility with major model providers, framing Selene as model-agnostic infrastructure rather than a competing frontier model [Y Combinator].
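The two integration points above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Atla's API: the `Judge` signature, `toy_judge`, `has_quality_drift`, `in_monitoring_sample`, the 0-to-1 score scale and the 0.05 tolerance are all hypothetical names and values introduced here. In a real setup the judge callable would wrap a call to an evaluator model.

```python
import hashlib
from statistics import mean
from typing import Callable, Sequence, Tuple

# Hypothetical judge signature: (prompt, response) -> quality score in [0, 1].
Judge = Callable[[str, str], float]
Case = Tuple[str, str]


def mean_score(judge: Judge, cases: Sequence[Case]) -> float:
    """Average judge score over a set of (prompt, response) pairs."""
    return mean(judge(prompt, response) for prompt, response in cases)


def has_quality_drift(
    judge: Judge,
    baseline: Sequence[Case],
    candidate: Sequence[Case],
    tolerance: float = 0.05,  # assumed threshold, chosen for illustration
) -> bool:
    """Offline check: flag the candidate run if its mean score falls more
    than `tolerance` below the baseline run."""
    return mean_score(judge, baseline) - mean_score(judge, candidate) > tolerance


def in_monitoring_sample(request_id: str, rate: float) -> bool:
    """Online check: deterministically select roughly `rate` of requests
    for judging by hashing the request id into the interval [0, 1)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate


def toy_judge(prompt: str, response: str) -> float:
    """Stand-in judge for illustration only: scores any non-empty answer 1.0."""
    return 1.0 if response.strip() else 0.0


if __name__ == "__main__":
    baseline = [("q1", "answer one"), ("q2", "answer two")]
    candidate = [("q1", "answer one"), ("q2", "")]
    print("drift detected:", has_quality_drift(toy_judge, baseline, candidate))
    print("sampled:", in_monitoring_sample("req-123", rate=0.1))
```

Hashing the request id rather than drawing a random number keeps the online sample reproducible, so the same request is always either judged or skipped when logs are replayed.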
The technology stack beyond the published model details (fine-tuning recipe, inference serving) is not disclosed in the captured sources, and there are no public job postings on the major ATS hosts surfaced in research that would let an outside reader infer the engineering stack. Readers should treat the engineering and infrastructure side of the product as a known unknown rather than a documented strength.
Data Accuracy: YELLOW -- Model claims are sourced primarily from Atla's own publications and the arXiv preprint, with partial third-party corroboration from Galtea; production deployment scale is not publicly available.
Market Research and Opportunity
PUBLIC
LLM evaluation has gone from a research afterthought to a budgeted line item in roughly eighteen months, and Selene is one of a small number of companies built specifically to own that line item. As enterprises move generative AI features from prototype into customer-facing production, the absence of reliable, automated quality measurement has become the limiting factor on shipping velocity. Human review does not scale, and using a frontier model as the judge is expensive and, according to Atla's published benchmarks, less accurate than a purpose-built evaluator on tasks such as RewardBench and Auto-J [Atla AI] [arXiv / Atla].
No named third-party report on the LLM-evaluation segment specifically appears in the captured research, so a precise TAM figure would have to be estimated rather than cited. As an analogous frame of reference, the broader AI observability and MLOps category has attracted material venture capital into companies such as Galileo and Patronus AI, both directly named as Selene competitors, suggesting institutional investors are already underwriting the thesis that evaluation is a standalone budget. Demand drivers surfaced in the cited research include the rapid proliferation of agentic and multi-step LLM applications (which multiply the number of intermediate outputs that need scoring), the emergence of model-routing and model-switching strategies that require a neutral evaluator, and growing regulatory attention on AI safety and bias, which pushes teams toward measurable quality criteria.
Adjacent and substitute markets are worth naming clearly. The most direct substitute is the do-it-yourself approach of calling GPT-4o or Claude 3.5 Sonnet as a judge inside a custom evaluation harness; this avoids dependence on a dedicated evaluation vendor but carries the cost and accuracy concerns that Selene's benchmarks are designed to highlight [Atla AI]. A second substitute is general-purpose AI observability platforms that bundle evaluation alongside tracing, logging and cost monitoring. A third is internal evaluation teams at large AI consumers, who may build proprietary judges on their own data.
Regulatory and macro forces cut both ways. Frameworks such as the EU AI Act and sector-specific guidance from financial and healthcare regulators are creating real demand for documented, repeatable evaluation, which favors specialized vendors. At the same time, frontier model providers continue to release stronger and cheaper models, which compresses the accuracy and cost gap a third-party judge has to defend.
| Sizing reference | Value | Source |
|---|---|---|
| Atla Seed round (proxy for early investor conviction in category) | ~$5,000,000 | [Crunchbase] |
| Selene 1 Mini parameter count | 8 billion | [Hugging Face / Atla AI] |
| Public benchmarks on which Selene 1 Mini reportedly beats GPT-4o | 3 (RewardBench, EvalBiasBench, Auto-J) | [Atla AI] |
The table is not a market-size table because no named third-party TAM is available in the captured research; it instead summarizes the most concrete numeric anchors that exist for Selene today. The takeaway is that the investment case currently rests on benchmark leadership and category timing rather than on a published market-size report.
Data Accuracy: YELLOW -- Demand drivers and competitor identities are corroborated by multiple sources; market sizing for LLM evaluation specifically is not publicly available in the captured research.
Competitive Landscape
MIXED
Selene is positioned as the accuracy-first specialist in a young category where the alternatives are either broader observability platforms or the brute-force option of using a frontier model as a judge.
| Company | Positioning | Stage / Funding | Notable Differentiator | Source |
|---|---|---|---|---|
| Selene (Atla AI) | Specialized small-model LLM judge, API + open weights | Seed, ~$5M | Claims SOTA accuracy at 8B parameters, beating GPT-4o on three named benchmarks | [Crunchbase] [Atla AI] |
| Patronus AI | LLM evaluation and guardrails platform | Venture-backed (named competitor) | Enterprise-oriented evaluation suite with focus on safety and compliance | [Y Combinator competitive set] |
| Galileo | AI evaluation and observability for generative apps | Venture-backed (named competitor) | Broader observability surface (tracing, metrics) bundled with evaluation | [Y Combinator competitive set] |
| Braintrust | Eval and experimentation platform for LLM developers | Venture-backed (named competitor) | Developer-workflow-first, strong on prompt iteration and experiment tracking | [Y Combinator competitive set] |
The segment-by-segment map has three groupings. The first is dedicated evaluation specialists, where Selene sits alongside Patronus AI as companies whose product identity is centered on judging model output. The second is observability platforms with an evaluation module, where Galileo and Braintrust have built broader surfaces that include tracing, prompt experimentation and cost tracking, with evaluation as one capability among several. The third, and largest by volume of usage today, is the do-it-yourself substitute: teams writing their own evaluation scripts that call GPT-4o or Claude 3.5 Sonnet as a judge, often inside open-source frameworks. Selene's benchmark posture is a direct attack on that third group [Atla AI].
Where Selene appears most defensible today is on technical credibility and per-call economics. An arXiv preprint, an open-weights release on Hugging Face, and reported benchmark wins against a frontier model give the company a research-credibility moat that observability-first competitors do not naturally possess [arXiv / Atla] [Hugging Face / Atla AI]. The per-call cost story is a corollary: an 8B specialist is materially cheaper to run at scale than a frontier API judge, which matters as customers move from offline evals to high-volume online monitoring. That edge is real but perishable. Frontier model providers continue to release smaller, cheaper variants, and any independent benchmark refresh can compress the lead.
Where Selene is most exposed is product surface area. Galileo and Braintrust can plausibly add a Selene-equivalent evaluator inside their existing platforms more easily than Selene can build a full observability stack from a Seed round. Patronus AI, with a similar specialist positioning, competes for the same enterprise design wins. Selene also does not own a distribution channel comparable to a hyperscaler marketplace presence or a deep integration with a frontier provider's developer console, both of which would be hard to displace once a competitor secured them.
The most plausible 18-month scenario splits along distribution. Selene wins if it converts research credibility into one or two reference-grade enterprise customers and a Hugging Face or hyperscaler distribution partnership: in that case the open-weights plus paid-API model becomes a defensible standard for evaluator infrastructure. Selene loses if a broader observability platform ships a competitive in-house judge bundled into a contract customers already have: in that case the evaluator becomes a commoditized feature inside someone else's platform, and Selene's accuracy lead, however real, struggles to command standalone pricing.
Opportunity
PUBLIC
If Atla executes, Selene becomes the default evaluator that sits between every production LLM application and the model serving it, and that is a category-defining position rather than a feature.
The headline opportunity. The single largest outcome Selene could plausibly become is the standard third-party judge layer for generative AI in production, the equivalent of what unit-test frameworks became for software or what monitoring agents became for cloud workloads. The cited evidence makes this reachable rather than aspirational for two reasons. First, the technical artifact already exists in a form that customers can audit: an 8-billion-parameter open-weights model with a published preprint and benchmark wins on RewardBench, EvalBiasBench and Auto-J [Atla AI] [arXiv / Atla]. Second, the category is being underwritten by serious capital across multiple competitors (Patronus AI, Galileo, Braintrust), which is the market signaling that evaluation is a standalone budget line, not a free utility bundled into a model API.
Growth scenarios.
| Scenario | What happens | Catalyst | Why it's plausible |
|---|---|---|---|
| Embedded judge for the model providers | Selene becomes the recommended evaluator inside one or more frontier model ecosystems, distributed via marketplace or developer console | A reference integration with a hyperscaler or model provider | Open weights on Hugging Face lower integration friction; benchmark claims give a provider a defensible reason to surface it [Hugging Face / Atla AI] |
| Evaluation standard for regulated AI | Selene is adopted as a documented evaluation layer by enterprises in finance, healthcare or insurance preparing for AI Act-style obligations | A regulated-industry design win plus a published evaluation methodology | Regulators are explicitly asking for repeatable evaluation, and a specialist with a research paper is easier to defend in audit than an in-house script [arXiv / Atla] |
| Developer-workflow expansion | Selene moves from "a judge endpoint" to a workflow product covering eval datasets, regression testing and online monitoring | A product release that wraps the judge in a fuller eval surface | The existing API model and YC distribution give Atla a credible path into developer adoption [Y Combinator] |
What compounding looks like. The flywheel for an evaluator company is data plus brand. Each customer that runs Selene against their production traffic generates labeled judgments that, with permission, can refine the next generation of the judge model, which in turn widens the accuracy gap against generic frontier judges. Brand compounds in parallel: every time an Atla benchmark is cited in a third-party post (as Galtea has already done [Galtea]), the cost of acquiring the next research-savvy buyer falls. Open-weights distribution on Hugging Face is the early evidence that the brand loop is starting; the data loop will only show up once paid API volume is disclosed.
The size of the win. A credible comparable for category-defining developer infrastructure is the broader AI observability and MLOps cohort, where peer companies have raised at valuations in the high hundreds of millions to low billions of dollars in private markets in recent years. If the embedded-judge scenario plays out and Selene becomes the default evaluator inside one or more major model ecosystems, a valuation in that band is the order-of-magnitude outcome to anchor against (a scenario, not a forecast). The Seed round of approximately $5 million [Crunchbase] is consistent with an early bet on that outcome rather than a guarantee of it, and the next 12 to 18 months of customer disclosure and Series A pricing will be the first real read on whether the market agrees.
Data Accuracy: YELLOW -- Scenarios are constructed from cited product evidence and named competitors; valuation comparables are directional and not drawn from a named report on the LLM-evaluation segment specifically.
Sources
PUBLIC
[Y Combinator] Launch YC: Selene - The World's Most Accurate LLM-as-a-Judge | https://www.ycombinator.com/launches/Mu5-selene-the-world-s-most-accurate-llm-as-a-judge
[Y Combinator] Atla: The improvement engine for AI agents | https://www.ycombinator.com/companies/atla
[Crunchbase] atla - Crunchbase Company Profile & Funding | https://www.crunchbase.com/organization/atla-037f
[Crunchbase] Seed Round - atla - 2023-12-07 | https://www.crunchbase.com/funding_round/atla-037f-seed--a00aef2a
[LinkedIn] atla company page | https://www.linkedin.com/company/atla-ai
[Atla AI] Selene Mini: SOTA 8B LLM Judge, now available via API | https://atla-ai.com/post/selene-mini-api
[Atla AI] Selene 1 Mini: the best small language model-as-a-judge | https://www.atla-ai.com/post/selene-1-mini
[Hugging Face / Atla AI] Selene 1 Mini: the best small language model-as-a-judge | https://huggingface.co/blog/AtlaAI/selene-1-mini
[arXiv / Atla] Atla Selene Mini: A General Purpose Evaluation Model | https://arxiv.org/html/2501.17195v1
[Galtea] Exploring state-of-the-art LLMs as Judges | https://galtea.ai/blog/exploring-state-of-the-art-llms-as-judges
[Fondo] Atla Launches Selene: The World's Most Accurate LLM-as-a-Judge | https://www.tryfondo.com/blog/atla-launches-selene
[Toolify] Selene 1 Alternatives in 2026 | https://www.toolify.ai/alternative/selene-1