24 April 2026 · OpenCódice Research

openreview-mcp: peer review as a queryable resource for LLMs

Closing the biggest gap in the academic MCP stack, with a case study on ICLR 2024 to prove it.

Tags: mcp, openreview, peer-review, research-tooling, iclr

The Model Context Protocol ecosystem now covers most of an academic researcher's daily stack. There are MCP servers for arXiv, Semantic Scholar, Hugging Face datasets, Crossref, and Google Scholar. There is no MCP server for the most valuable corpus of all: peer review.

That is the gap openreview-mcp closes, and the reason we are open-sourcing it today.

Why peer review

Most academic-search MCPs treat papers as the unit of information: title, abstract, citations, PDF. That is fine for discovery, but it ignores the densest signal that academic ML produces, which is the reasoning of expert reviewers explaining why a paper succeeds or fails.

Peer review is unique in three ways. First, it is dense. A single review packs five to eight specific critiques into a few hundred words: each one is a falsifiable claim about the paper, often anchored to a specific section, equation, or table. Second, it is consequential. The decision the reviewer reaches has real effects on careers and on the literature itself; reviewers know this and write accordingly. Third, it is hidden. Most peer review still happens behind closed venue portals, accessible only to authors and the platform.

OpenReview hosts that reasoning publicly for almost every major ML venue: ICLR, NeurIPS, COLM, TMLR, ACL ARR, and dozens of workshops. Reviews, author rebuttals, area-chair meta-reviews, and final decisions are all there, queryable through an official API. They are simply not reachable by any LLM you connect to today.

Until now.

What `openreview-mcp` does

The server exposes eleven MCP tools that map cleanly onto OpenReview entities:

Venues: list venues by year or series, compute submission and acceptance statistics for a venue.
Submissions: search by venue, query, author, or keywords; fetch a single submission with its abstract and PDF link; list everything a profile has authored.
Reviews: pull all official reviews for a submission with their ratings, confidence, soundness, presentation, contribution, summary, strengths, weaknesses, and questions.
Meta-reviews and decisions: fetch the area-chair meta-review and the accept/reject decision separately.
Rebuttals: pull author responses keyed to specific reviews.
Profiles: resolve names, affiliations, and DBLP/ORCID/Scholar handles.
Aggregate weaknesses (the signature tool): cluster recurrent reviewer critiques across a venue's rejections.

Install:

pip install "openreview-mcp[analysis]"

Wire it into Claude Code:

claude mcp add openreview -- openreview-mcp

Or Claude Desktop:

{
  "mcpServers": {
    "openreview": { "command": "openreview-mcp" }
  }
}

Public venues work out of the box. Private venues read OPENREVIEW_USERNAME and OPENREVIEW_PASSWORD from the environment. Twenty offline tests cover the parser and schema layer; the package runs on Python 3.11+ and is published under MIT on PyPI.
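For private venues, one way to supply the two credentials is the `env` map that Claude Desktop accepts on each server entry. The sketch below builds such an entry in Python. The helper name `build_server_entry` and the placeholder credentials are ours for illustration; only the command name and the two variable names come from the text above.

```python
import json


def build_server_entry(username=None, password=None):
    """Return an mcpServers entry for openreview-mcp (hypothetical helper)."""
    entry = {"command": "openreview-mcp"}
    if username and password:
        # Claude Desktop passes the "env" map to the server process.
        entry["env"] = {
            "OPENREVIEW_USERNAME": username,
            "OPENREVIEW_PASSWORD": password,
        }
    return entry


# Placeholder credentials, for illustration only.
config = {
    "mcpServers": {
        "openreview": build_server_entry("you@example.org", "change-me")
    }
}
print(json.dumps(config, indent=2))
```

Calling `build_server_entry()` with no arguments yields the public-venue entry shown earlier.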

The signature tool, in detail

openreview_aggregate_weaknesses is what differentiates the server from a thin REST wrapper. It surfaces structural patterns across thousands of reviewer critiques in a way that no per-paper query can.

The pipeline runs in seven steps:

1. Fetch all submissions in the venue with their replies in a single bulk API call.
2. Filter to submissions whose final decision contains reject.
3. Sample N submissions (default 100) with a fixed seed.
4. Extract the weaknesses field from every Official_Review reply on each sampled submission.
5. Split each paragraph into individual items on bullet/numbered delimiters. This is the step most papers about LLM-driven review analysis miss: a reviewer's "weaknesses" field typically lists five to eight distinct critiques, and clustering at the paragraph level collapses 90% of items into a single dominant cluster because of shared vocabulary. Splitting into individual items recovers the structure.
6. Vectorize items with TF-IDF (unigrams and bigrams, sublinear term-frequency scaling, L2 normalisation) and reduce to 64 latent dimensions with Truncated SVD (the LSA approach of Deerwester et al. 1990). Re-normalise to the unit sphere.
7. Cluster with KMeans (n_init=20, fixed random state). For each cluster, return the top TF-IDF terms over its members and the three exemplars closest to the centroid.
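The filtering, sampling, and splitting steps can be sketched in plain Python. This is a minimal illustration, not the package's code: the function names, the `{"decision": ...}` dictionary shape, and the exact delimiter regex are assumptions, and the default seed simply mirrors the case-study call further down.

```python
import random
import re

# Start of a bullet ("-", "*", "•") or numbered ("1.", "2)") item.
ITEM_START = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+", re.MULTILINE)


def sample_rejected(submissions, n=100, seed=7):
    """Keep submissions whose decision contains 'reject', then sample n of them."""
    rejected = [s for s in submissions if "reject" in s.get("decision", "").lower()]
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    return rng.sample(rejected, min(n, len(rejected)))


def split_items(weaknesses):
    """Split one reviewer's weaknesses field into individual critiques."""
    starts = [m.start() for m in ITEM_START.finditer(weaknesses)]
    if not starts:
        text = " ".join(weaknesses.split())
        return [text] if text else []
    # Text before the first marker (an intro sentence) becomes its own item.
    bounds = [0] + starts + [len(weaknesses)]
    items = []
    for begin, end in zip(bounds, bounds[1:]):
        chunk = ITEM_START.sub("", weaknesses[begin:end], count=1)
        chunk = " ".join(chunk.split())  # collapse wrapped lines
        if chunk:
            items.append(chunk)
    return items


review = """The paper has several issues:
1. The evaluation covers only two benchmarks.
2. Baselines are outdated.
- Figure 3 is unreadable."""
for item in split_items(review):
    print(item)
```

A regex this simple will occasionally split on a line that merely starts with a number and a period, so a production splitter would likely need more guards.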

The interesting design choice is that the tool does not return human-readable cluster labels. It returns each cluster's top terms, three nearest-to-centroid exemplars, and the contributing submission ids. The LLM that consumes the tool labels the clusters from the evidence.

This matters. A pre-baked taxonomy would freeze categories that vary across venues and years. A NeurIPS reviewer rejects papers for different reasons than an ACL reviewer. The fashionable failure modes shift over time. By returning raw clusters, we let the calling agent produce labels that are appropriate to whatever venue and slice of literature is being studied.

Does it work? A case study on ICLR 2024

The honest test for a tool like this is whether it surfaces something non-obvious about a venue you already know well. So we pointed it at ICLR 2024.

analysis.aggregate_weaknesses(
    client,
    venue_id="ICLR.cc/2024/Conference",
    sample_size=100,
    n_clusters=14,
    seed=7,
)

One hundred rejected submissions yielded 1,361 individual critiques clustered into 14 themes. Three results stood out, and each one carries an actionable lesson for authors planning their next submission.

Evaluation, not novelty, drives most rejections

The three largest clusters (197, 194, and 180 items) are all about the experimental setup: evaluation that is too narrow or biased, modeling choices that are unclear, and experiments or theory that are shallow. Together they account for 42% of all critiques. The generic "paper lacks novelty" cluster is only the fifth largest, with 104 items.

This contradicts the folklore that novelty is the primary battleground. If you are optimising your paper for reviewer satisfaction, the marginal improvement in your evaluation section is likely to weigh more than another paragraph defending novelty.
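The shares quoted above follow directly from the reported cluster sizes, and a few lines of Python reproduce them:

```python
total_items = 1361                 # individual critiques in the sample
largest_three = [197, 194, 180]    # the three evaluation-related clusters
novelty = 104                      # the "lacks novelty" cluster

share = sum(largest_three) / total_items
novelty_share = novelty / total_items
print(f"largest three: {sum(largest_three)} items, {share:.1%}")
print(f"novelty: {novelty} items, {novelty_share:.1%}")
```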

A representative critique from the largest cluster:

The evaluation is limited to two saturated benchmarks. Without out-of-distribution or harder settings, the gains over the baselines could simply reflect overfitting to the test sets the field has been optimising against for years.

The actionable takeaway: pick three diverse datasets, include at least one that breaks your method's assumptions, and report confidence intervals — not point estimates.
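For the confidence-interval part of that advice, a percentile bootstrap over per-seed scores is one simple option. The sketch below uses only the standard library. The accuracy values are made up for illustration, and the percentile bootstrap is one of several reasonable interval methods, not one the reviewers prescribe.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical accuracies from five training seeds.
accuracies = [0.712, 0.698, 0.731, 0.705, 0.719]
low, high = bootstrap_ci(accuracies)
print(f"mean={statistics.fmean(accuracies):.3f}, 95% CI=[{low:.3f}, {high:.3f}]")
```

With only a handful of seeds the interval is rough, but it still tells a reviewer more than a bare point estimate does.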

Craftsmanship still sinks papers

Typos and broken cross-references (78 items), confusing figures and captions (50 items), missing citations (71 items), and explicit critiques about writing quality (44 items) together flag roughly 70 of the 100 sampled papers. Mechanical issues that a careful proof-reading pass would catch are routine reasons for rejection.

This is the kind of finding that would be hard to admit if it weren't quantified. Every author tells themselves their paper is well-written; the data says reviewers disagree, often at line-level granularity:

Line 307, "one week The" → "one week. The". Algorithm 1 line 4: the index should be t-1 given the recurrence.

The actionable takeaway: in the last 48 hours before the deadline, do a typos-only pass. It is unglamorous but high-yield. Pair it with a "consistency" pass: same colour scheme across figures, same notation for the same object across sections.

Topic-specific failure modes

Two clusters are narrowly topical: federated and semi-supervised learning (88 items) and time series (50 items). These likely reflect both high submission volume and domain-specific failure modes. Authors submitting in these areas would do well to read the cluster exemplars and pre-rebut the standard objections — for example, federated work is repeatedly criticised for unrealistic non-i.i.d. settings.

A reviewer's critique from the comparison-to-existing-methods cluster, copied verbatim from the OpenReview record (the ellipsis marks where the excerpt is cut):

Though the authors claim that they aim to propose a unified framework, the methods considered in their paper are mainly based on AM and POMO, in other words, the auto-regressive methods. As far as I know, there are also other methods (...).

This kind of specificity is what aggregate_weaknesses surfaces as evidence. Not abstract categories: the actual language and detail that rejected papers faced.

What this enables

Four use cases come immediately to mind, and we have built lightweight prototypes for each.

Pre-submission self-review. Run the tool on your target venue. Ask your LLM which clusters your draft is most exposed to, and harden those sections before reviewers find them. This works particularly well when paired with the venue's own meta-reviews from the previous year.

Reviewer-style critique agents. Ground a harsh-reviewer agent on real reviewer language from the target venue, not on a generic rubric. The advantage is verisimilitude: the LLM no longer hallucinates plausible-sounding objections; it draws from the actual vocabulary the venue's reviewers used.

Teaching. PhD advisors usually pass down folklore about what their venue rejects. With this tool, the same advisor can show evidence: open the cluster page, pick one exemplar per category, walk the student through it. The lesson lands with much more weight than "be careful with your baselines".

Rebuttal mining. Pair get_rebuttal with get_decision to study which rebuttal patterns flipped borderline rejections into acceptances. We have not done this analysis at scale yet, but the primitives are now available; we will publish results as we run them.

Why we built it on MCP, not as a script

Three reasons.

First, composability. A pre-submission self-review agent should be able to combine openreview-mcp with academia-mcp (arXiv literature lookup), with a code-execution sandbox, and with a slot for the user's own draft. MCP makes that composition mechanical.

Second, separation of concerns. The MCP server captures the OpenReview-specific knowledge (v1 vs v2 API differences, decision schemas, content field variants — TMLR, for example, uses recommendation instead of decision and combines strengths and weaknesses in one field). Consumers of the server do not need to learn any of this.

Third, durability. MCP is becoming the integration substrate other research-tooling projects build on. Producing a peer-review tool on MCP today means that future agents — including ones we have not imagined yet — can wire it up without our involvement.
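To illustrate the second reason, here is the kind of normalisation a consumer would otherwise have to write. It is a sketch, not the server's code. It relies on API v2 wrapping each content field as `{"value": ...}` where API v1 stores the raw value; the key `strengths_and_weaknesses` for the combined TMLR field and all function names are assumptions.

```python
def field(content, key):
    """Read a content field from either API generation.

    API v2 wraps each field as {"value": ...}; API v1 stores the raw value.
    """
    raw = content.get(key)
    if isinstance(raw, dict) and "value" in raw:
        return raw["value"]
    return raw


def normalise_review(content):
    """Map venue-specific review fields onto one schema (illustrative)."""
    # Assumed key for venues that merge both sections into a single field.
    combined = field(content, "strengths_and_weaknesses")
    return {
        "strengths": field(content, "strengths") or combined,
        "weaknesses": field(content, "weaknesses") or combined,
    }


def normalise_decision(content):
    """Prefer `decision`; fall back to `recommendation`, as TMLR uses."""
    return field(content, "decision") or field(content, "recommendation")


print(normalise_decision({"decision": {"value": "Reject"}}))
print(normalise_decision({"recommendation": "Accept with minor revision"}))
```

When a venue merges strengths and weaknesses, this sketch returns the same text under both keys, which keeps downstream clustering code unchanged at the cost of some noise.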

Limitations we are honest about

The tool is not magic, and we want users to read the output with calibrated trust.

TF-IDF clusters by lexical similarity, not semantic similarity. Two critiques saying the same thing with different vocabulary may end up in separate clusters. A future opt-in pipeline based on sentence embeddings would surface finer structure at the cost of an additional dependency. We plan to expose it as an opt-in flag rather than making it the default.
Sampling is uniform across rejections. Borderline rejections (decision text "Reject (close)") carry different critique patterns than clear rejections; we treat them identically. Weighting by reviewer confidence or decision margin is on the roadmap.
Each venue exposes a different slice of its process. NeurIPS publicly exposes only submissions that reached the decision stage, inflating the visible accept rate. COLM exposes only accepted papers, making weakness clustering impossible. We flag these limitations on each venue's page and recommend that users read them before drawing conclusions.
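The first limitation is easy to demonstrate. The sketch below computes cosine similarity over raw bag-of-words counts for two made-up critiques that raise the same objection in different words. TF-IDF reweights such counts but cannot create overlap where no token is shared, so the same blind spot applies. The function name and both example sentences are ours.

```python
import math
import re
from collections import Counter


def bow_cosine(a, b):
    """Cosine similarity between raw bag-of-words vectors of two strings."""
    va = Counter(re.findall(r"[a-z]+", a.lower()))
    vb = Counter(re.findall(r"[a-z]+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Two made-up critiques with the same meaning and no shared tokens.
first = "Experiments cover only two benchmarks."
second = "Evaluation is restricted to a pair of datasets."
print(bow_cosine(first, second))  # 0.0, because no token is shared
```

A sentence-embedding model would be expected to place these two sentences close together, which is the motivation for the opt-in pipeline mentioned above.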

What is next

openreview-mcp is the first MCP server we ship from OpenCódice Research. Two more are on the runway:

A deadline tracker for academic CFPs (NeurIPS, ICLR, ACL ARR, COLM, EMNLP, CHI, etc.), so an agent can query "what venues close in the next 60 days" and get a structured answer.
A Zenodo bridge for dataset and code deposits with DOIs, so research artefacts can be pushed and pulled programmatically.

Beyond the server, we have built Venues, an open analytics layer over the data this server produces: rich, navigable pages for ICLR 2024, ICLR 2025, NeurIPS 2024 and TMLR, with a year-over-year comparison view. The server is the engine; Venues is the dashboard. Both are open and free.

The repository is github.com/OpenCodice-Research/openreview-mcp. Issues and pull requests are welcome. We are particularly interested in contributions that add sentence-embedding clustering as an alternative to TF-IDF, venue-specific normalisation for non-standard review templates (TMLR, ARR cycles), and a public dashboard refreshed weekly for the largest venues.

If you build something on top of it, tell us. We will link to it.

The full design rationale, analysis pipeline, and case study are documented in OpenCódice Technical Report OC-TR-2026-007 (Zenodo, DOI 10.5281/zenodo.19758460).