Paperzilla

AI-powered scientific paper discovery platform

Website: https://paperzilla.ai

Cover Block

PUBLIC

| Field | Value |
|---|---|
| Name | Paperzilla |
| Tagline | AI-powered scientific paper discovery platform |
| Headquarters | The Hague, Netherlands |
| Industry | Research tools / academic software |
| Technology | AI / Machine Learning (LLM-based ranking and retrieval) |
| Founding Team | Solo founder (Mark Pors) |

Executive Summary

PUBLIC

Paperzilla is an AI-driven discovery layer for scientific literature, aiming to compress a researcher's daily reading triage into a single personalized digest drawn from arXiv, medRxiv, and bioRxiv [Paperzilla, About]. The product surfaces today as a command-line tool (pz) that lets a user list projects, fetch curated research feeds, and filter papers by priority, date, or count, with output piped to JSON or exported as an Atom feed [clawskills.sh]. The company is led by founder and CEO Mark Pors, a Delft-trained engineer who was previously co-founder and CTO of WatchMouse, a website monitoring company acquired by CA Technologies in 2011 [TechCrunch, May 2010] [Stone Soup Coworking Space]. The technical approach combines hybrid keyword and semantic matching with an AI reranker that filters aggressively in favor of precision over recall [Paperzilla, About]. Funding, headcount, and revenue are not publicly disclosed, and the project's documented surface area sits primarily on the company website, GitHub, and Hugging Face, where Pors has published a labeled retrieval evaluation dataset of 250 papers with multi-annotator relevance scoring [Hugging Face]. For investors, the relevant question over the next 12 to 18 months is whether Paperzilla can convert a credible founder narrative and a working CLI into a paid, recurring product for working scientists, biotech R&D teams, or AI agent developers who need a clean research-context pipeline.

Data Accuracy: YELLOW -- Founder background corroborated by TechCrunch and LinkedIn; product claims sourced primarily to the company's own site and a single third-party skills directory.

Taxonomy Snapshot

| Axis | Value |
|---|---|
| Industry / Vertical | Academic research tools / AI developer tooling |
| Technology Type | LLM retrieval, semantic search, hybrid ranking |
| Geography | Netherlands (EU) |
| Founding Team | Solo founder, technical |

Company Overview

PUBLIC

Paperzilla is an early-stage research-discovery project headquartered in The Hague and built by Mark Pors, who describes himself on GitHub as a Python, React, and React Native developer working on Paperzilla.ai while exploring machine learning with large language models [GitHub, pors]. The company's public footprint is concentrated on its own domain (paperzilla.ai and docs.paperzilla.ai), a GitHub organization at github.com/paperzilla-ai, and a Hugging Face account that hosts a retrieval evaluation dataset [GitHub, paperzilla-ai] [Hugging Face].

The founding date is not stated on any captured source, and there is no record in the available material of an incorporated legal entity, a priced funding round, or accelerator participation. What is documented is a clear product thesis: that researchers and research-adjacent professionals are drowning in preprint volume across arXiv, medRxiv, and bioRxiv, and that an LLM-mediated digest can replace the manual scan [Paperzilla, About]. Pors has framed the work, on a recruiting database profile, as "building research context for the agentic era," suggesting the longer arc may extend beyond a human-readable digest toward a context source for autonomous research agents [RocketReach].

The milestone trail visible publicly is modest but coherent: a live marketing site explaining the ingestion and ranking approach, a documentation portal, an open-source CLI distributed through the OpenClaw skill directory [AIClawSkills] [clawskills.sh], and a labeled retrieval dataset published openly on Hugging Face that uses GPT-4o as one of multiple annotators to score paper relevance [Hugging Face]. Together these suggest a builder who is shipping iteratively in public rather than a stealth operation.

Data Accuracy: YELLOW -- Confirmed by company website, GitHub, and Hugging Face; no third-party press coverage of Paperzilla itself surfaced.

Product and Technology

MIXED

Paperzilla's product is positioned as an AI-powered academic paper analysis and discovery platform, with two visible delivery surfaces: a web property at paperzilla.ai and a command-line client called pz [Paperzilla] [PUBLIC]. According to the company's About page, the system scans sources including arXiv, medRxiv, and bioRxiv on a daily cadence, applies hybrid keyword and semantic matching, and then uses an AI reranker to filter aggressively before producing one personalized digest per user [Paperzilla, About] [PUBLIC]. The stated design choice is precision over volume: the digest is intended to be short and high-signal rather than comprehensive.
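The hybrid-then-filter pattern described above can be illustrated with a minimal sketch. This is not Paperzilla's implementation: the embedding model and reranker are not disclosed, so a bag-of-words cosine stands in for semantic similarity, and the function names, the `alpha` weight, and the `threshold` cutoff are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    tokens = (t.strip(".,;:()").lower() for t in text.split())
    return [t for t in tokens if t]


def keyword_score(query_terms, doc_tokens):
    """Fraction of distinct query terms that appear in the document."""
    if not query_terms:
        return 0.0
    doc_set = set(doc_tokens)
    return sum(1 for t in query_terms if t in doc_set) / len(query_terms)


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_rank(interest, papers, alpha=0.5, threshold=0.3):
    """Blend lexical and (stand-in) semantic scores, then filter strictly.

    A high threshold trades recall for precision, mirroring the stated
    design choice of a short, high-signal digest.
    """
    q_tokens = tokenize(interest)
    q_vec = Counter(q_tokens)
    kept = []
    for paper in papers:
        d_tokens = tokenize(paper["title"] + " " + paper["abstract"])
        lexical = keyword_score(set(q_tokens), d_tokens)
        # Assumption: bag-of-words cosine replaces a real embedding model.
        semantic = cosine(q_vec, Counter(d_tokens))
        score = alpha * lexical + (1 - alpha) * semantic
        if score >= threshold:
            kept.append((score, paper["title"]))
    return sorted(kept, reverse=True)
```

Raising `threshold` shortens the digest and lowers the chance of an irrelevant paper slipping through, at the cost of occasionally dropping a relevant one.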

The CLI surface is documented through the OpenClaw skill registry, where pz is described as a tool that lets a user list projects, fetch research feeds, filter papers by priority, date, or count, and export results either as JSON for programmatic use or as an Atom feed for traditional readers [clawskills.sh] [PUBLIC]. A separate skill listing for "paperzilla" describes the CLI as a way to search, filter, and browse high-signal academic papers [clawskills.sh] [PUBLIC]. The presence of the tool inside the OpenClaw / Claw Skills ecosystem signals that Paperzilla is being deliberately packaged as a callable capability for AI agents, not only as a human end-user product [AIClawSkills] [PUBLIC].
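To make the filter-and-export workflow concrete, the sketch below filters a list of paper records by priority, date, and count, then serializes the result as JSON or as an Atom feed. It does not call pz and does not reproduce its output schema: the record fields (`title`, `url`, `published`, `priority`) and the function names are assumptions chosen for illustration, and the Atom output is a minimal fragment that is not guaranteed to pass a feed validator.

```python
import json
import xml.etree.ElementTree as ET
from datetime import date

ATOM_NS = "http://www.w3.org/2005/Atom"


def filter_papers(papers, min_priority=None, since=None, limit=None):
    """Filter by priority and publication date, newest first, then cap the count."""
    kept = [
        p for p in papers
        if (min_priority is None or p["priority"] >= min_priority)
        and (since is None or date.fromisoformat(p["published"]) >= since)
    ]
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    kept.sort(key=lambda p: p["published"], reverse=True)
    return kept if limit is None else kept[:limit]


def to_json(papers):
    """Serialize the filtered records for programmatic consumers."""
    return json.dumps(papers, indent=2)


def to_atom(papers, feed_title):
    """Build a minimal Atom feed with one entry per paper."""
    ET.register_namespace("", ATOM_NS)
    feed = ET.Element(f"{{{ATOM_NS}}}feed")
    ET.SubElement(feed, f"{{{ATOM_NS}}}title").text = feed_title
    for p in papers:
        entry = ET.SubElement(feed, f"{{{ATOM_NS}}}entry")
        ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = p["title"]
        ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = p["url"]
        ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = p["published"] + "T00:00:00Z"
        ET.SubElement(entry, f"{{{ATOM_NS}}}link", href=p["url"])
    return ET.tostring(feed, encoding="unicode")
```

The same filtered list feeds both serializers, which is the practical point of offering two formats: JSON suits an agent or script, while Atom suits a conventional feed reader.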

On the model and evaluation side, the publicly hosted paperzilla-rag-retrieval-250 dataset on Hugging Face contains 250 papers annotated by five annotators, including GPT-4o (model id gpt-4o-2024-11-20) running on Azure, with relevance scores and reasoning fields attached to each paper [Hugging Face] [PUBLIC]. The dataset's structure is consistent with a retrieval-augmented generation evaluation harness, which suggests the team is measuring ranking quality empirically rather than relying on qualitative judgment alone (inferred from dataset schema). No public information confirms hosting infrastructure, embedding model choice, or the specific reranker architecture in production.
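One way a multi-annotator dataset of this kind could be used to measure ranking quality is sketched below: annotator scores are averaged into a consensus label, and a ranked list is scored with precision at k. The 0 to 2 relevance scale, the consensus cutoff, and the function names are assumptions for illustration; they are not taken from the dataset's actual schema or from any published Paperzilla evaluation code.

```python
from statistics import mean


def consensus_relevance(scores, relevant_at=1.0):
    """Treat a paper as relevant when the annotators' mean score reaches the cutoff.

    Assumption: scores are on a hypothetical 0-2 scale, one per annotator.
    """
    return mean(scores) >= relevant_at


def precision_at_k(ranked_ids, annotations, k):
    """Share of the top-k ranked papers that the annotators agree are relevant."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    hits = sum(1 for pid in top if consensus_relevance(annotations[pid]))
    return hits / len(top)
```

Because the product is described as favoring precision over recall, a precision-oriented metric over a short cutoff is a natural fit, although other metrics such as nDCG would serve equally well.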

Data Accuracy: YELLOW -- Product description corroborated by company About page, OpenClaw skill registry, and a public Hugging Face dataset; deeper architecture is not disclosed.

Market Research and Opportunity

PUBLIC

The market for scientific literature discovery is being reshaped right now by two simultaneous shifts: preprint volume is growing faster than any individual researcher can scan, and LLMs have made semantic ranking of long-form scientific text genuinely useful rather than merely plausible.

The primary sources Paperzilla ingests are themselves the leading indicators of the demand problem. arXiv, medRxiv, and bioRxiv are the dominant preprint servers for physics and computer science, medicine, and biology respectively, and all three have grown submission counts materially over the past decade based on their public submission statistics. The company's own framing positions the product against this firehose: a single personalized digest, filtered aggressively, in place of manual triage [Paperzilla, About]. Independent third-party sizing for the specific sub-segment of "AI-mediated research discovery tools" is not available in the captured research, so any TAM figure here would be speculative and is omitted.

The demand-side tailwinds are nevertheless visible by analogy. Adjacent and substitute markets include reference managers (Zotero, Mendeley, Paperpile), literature search engines (Google Scholar, Semantic Scholar, Consensus, Elicit, Undermind), and the workflow tools embedded inside biotech and pharma R&D platforms. Each of these adjacencies has either an established paid base or recent venture funding activity, which suggests willingness to pay for research-workflow software exists; what is changing is that LLM reranking has lowered the cost of reaching the quality bar for a credible challenger product.

The regulatory and macro backdrop is mixed. On one hand, open-access mandates from funders (NIH public access policy, Plan S in Europe) keep the underlying corpus legally accessible and machine-readable, which is structurally favorable for any ingestion-based product. On the other hand, agent-facing AI tools that summarize or rerank scientific content sit inside an evolving copyright and AI-policy debate in both the EU AI Act framework and the broader publisher licensing environment. Paperzilla's reliance on open preprint servers, rather than on paywalled journal full text, materially reduces that risk surface relative to competitors that scrape closed content.

| Sizing claim | Value | Source |
|---|---|---|
| Paperzilla retrieval evaluation dataset | 250 papers, 5 annotators | [Hugging Face] |
| Primary ingestion sources | arXiv, medRxiv, bioRxiv | [Paperzilla, About] |

The takeaway is that Paperzilla is targeting a real and growing pain point but is doing so without a publicly cited TAM figure of its own; investors should treat the opportunity as analogous to the broader research-tools and AI-developer-tooling categories rather than as a quantified market.

Data Accuracy: ORANGE -- Demand framing inferred from public preprint server trends and adjacent-market activity; no commissioned market sizing in the captured research.

Competitive Landscape

MIXED

Paperzilla sits inside a crowded but unevenly served competitive set, where incumbents own discovery breadth and a wave of LLM-native challengers are competing on answer quality and workflow fit.

| Company | Positioning | Stage / Funding | Notable Differentiator | Source |
|---|---|---|---|---|
| Paperzilla | AI-curated daily digest of preprints; CLI plus web | Pre-seed / undisclosed | Agent-callable CLI, open evaluation dataset | [Paperzilla, About] [PUBLIC], [Hugging Face] [PUBLIC] |
| papertool | Listed competitor in research scope | Undisclosed | Direct category overlap noted in research scope | structured facts [PUBLIC] |

The broader competitive map can be split into three layers. The first layer is the discovery incumbents: Google Scholar and Semantic Scholar dominate top-of-funnel academic search, are free, and benefit from years of citation-graph investment. They are not going to be displaced as the search index of record, and any challenger has to define a workflow that sits beside them rather than replaces them. The second layer is the LLM-native challenger cohort, including products that have raised venture funding around AI-assisted literature review and synthesis. These tools compete primarily on answer quality, citation grounding, and integrations with reference managers. The third layer is the embedded R&D workflow vendors inside biotech and pharma, where literature monitoring is one feature inside a larger informatics stack.

Where Paperzilla has a defensible edge today is at the intersection of two choices most competitors have not made together: a CLI-first, agent-callable interface and a deliberately narrow daily-digest output. Packaging the tool as an OpenClaw skill [AIClawSkills] makes it directly invokable by AI agents that need a research-context tool, which is a different buyer than the human end user that most literature tools target. The publicly shipped retrieval evaluation dataset [Hugging Face] is also a non-trivial credibility signal: very few competitors publish their own eval harness openly. Both edges are perishable, however. The CLI distribution channel is small today, and any well-funded competitor could ship an agent-facing API on a quarter's notice.

The most exposed flank is distribution and brand. Paperzilla has no documented funding, no documented sales motion, and a single founder; better-capitalized competitors will out-spend it on SEO, integrations with Zotero and Mendeley, and direct sales into pharma R&D. The product also does not, based on public materials, ingest closed-access journal full text, which limits its utility in clinical and pharmaceutical contexts where the canonical literature lives behind publisher paywalls.

Over an 18-month horizon, the most plausible scenarios are bifurcated. Paperzilla wins if the agentic-tooling thesis holds and a meaningful fraction of AI research agents standardize on a small number of vetted research-context skills, with pz becoming one of them; in that scenario the OpenClaw / Claw Skills positioning is the moat. Paperzilla loses if the dominant workflow remains a human researcher inside a browser, in which case Semantic Scholar plus a venture-funded LLM literature-review competitor captures the category before the CLI gains a paid base.

Opportunity

PUBLIC

If the agent-tooling thesis plays out, Paperzilla could become the default research-context skill that every AI research agent calls to get a filtered view of the day's preprint output.

The headline opportunity. Paperzilla's most ambitious plausible outcome is to become the standard machine-readable interface to recent scientific literature, sitting underneath both human-facing digest readers and autonomous AI research agents. The evidence that this is reachable rather than aspirational is concrete: the founder has already packaged the tool as an OpenClaw skill that exposes structured JSON and Atom output, the system is wired to the three preprint servers that matter most for technical research, and ranking quality is being measured against a published evaluation dataset rather than asserted [clawskills.sh] [Paperzilla, About] [Hugging Face]. The category-defining outcome is not "a better Google Scholar," which is unwinnable, but "the API every AI agent uses to know what came out yesterday," which is currently unowned.

Growth scenarios.

| Scenario | What happens | Catalyst | Why it's plausible |
|---|---|---|---|
| Agent-skill standard | pz becomes one of a small number of canonical research-context skills called by AI research agents | Adoption inside a major agent framework or skills marketplace | Tool is already published as an OpenClaw skill with structured output [AIClawSkills] |
| Researcher SaaS | Paid daily-digest subscription for working scientists and PhD students across the life sciences and ML | A reference-manager integration (Zotero, Paperpile) or a university site-license deal | Ingestion already covers arXiv, medRxiv, bioRxiv, the three sources that matter for that audience [Paperzilla, About] |
| R&D infrastructure | Licensed as an internal literature-monitoring layer for biotech and pharma R&D teams | A first paid pilot with a mid-cap biotech or a CRO | Open evaluation dataset gives procurement teams something concrete to benchmark [Hugging Face] |

What compounding looks like. The flywheel that turns one win into the next is data and evaluation, not virality. Every additional user produces feedback signal on which papers they actually read, which sharpens the reranker, which improves the digest, which compounds retention. The publicly released evaluation dataset [Hugging Face] is an early signal that the team treats ranking quality as a measurable surface rather than a marketing claim; if that discipline holds as the user base grows, the gap between Paperzilla's digest precision and a generic LLM-on-arXiv competitor should widen rather than narrow. A second compounding vector is agent distribution: each additional agent framework that adopts pz as a research skill increases the cost for a competitor to displace it.

The size of the win. A useful comparable, labelled clearly as a scenario rather than a forecast, is the academic and research workflow software category. Reference managers and literature tools have historically supported standalone businesses with mid-eight-figure to low-nine-figure revenue at maturity, and AI-native challengers in adjacent search and synthesis have raised venture rounds at valuations consistent with that trajectory. If Paperzilla executes the researcher SaaS scenario above, a category outcome in the tens of millions of ARR is the order of magnitude to underwrite (scenario, not a forecast). If the agent-skill standard scenario plays out, the relevant comparable is developer infrastructure, where standardization on a default tool inside a fast-growing category has historically produced larger outcomes still (scenario, not a forecast).

Data Accuracy: ORANGE -- Scenarios are analyst framing built on confirmed product surface and founder background; no revenue, customer, or partnership data is publicly available.

Sources

PUBLIC

  1. [Paperzilla] Paperzilla: AI-Powered Academic Paper Analysis | https://paperzilla.ai

  2. [Paperzilla] About Paperzilla | https://paperzilla.ai/about

  3. [Paperzilla] Paperzilla docs portal | https://docs.paperzilla.ai

  4. [GitHub] Paperzilla organization | https://github.com/paperzilla-ai

  5. [GitHub] pors (Mark Pors) | https://github.com/pors

  6. [Hugging Face] paperzilla/paperzilla-rag-retrieval-250 dataset | https://huggingface.co/datasets/paperzilla/paperzilla-rag-retrieval-250

  7. [AIClawSkills] Paperzilla - Web Scrapers OpenClaw Skill | https://aiclawskills.com/skills/paperzilla

  8. [clawskills.sh] pz - OpenClaw Skill | https://clawskills.sh/skills/pors-pz

  9. [clawskills.sh] paperzilla - OpenClaw Skill | https://clawskills.sh/skills/pors-paperzilla

  10. [LinkedIn] Mark Pors profile | https://www.linkedin.com/in/markpors/

  11. [RocketReach] Paperzilla company information | https://rocketreach.co/paperzilla-profile_b6a8bcd1c86737b4

  12. [TechCrunch, May 2010] WatchMouse launches GeoBrand, PPC brand abuse monitoring tool | https://techcrunch.com/2010/05/04/watchmouse-lunches-geobrand-ppc-brand-abuse-monitoring-tool/

  13. [Medium] About - Mark Pors | https://medium.com/@pors/about

  14. [Google Scholar] Mark Pors author profile | https://scholar.google.com/citations?user=HJtcbfMAAAAJ&hl=en
