Methodology

How we build the weakness map, what data we use, and where the limits are.

Data source

All data comes from the public API of OpenReview. No private data, no login, no special access. Only what anyone with a browser can see today on the venue's public page.

For each venue we fetch the submissions with their replies (reviews, meta-review, decision) in a single bulk call. Per-paper identifiers are not surfaced here: this site aggregates, it does not link.
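As a rough illustration of what "submissions with their replies" can look like once fetched, the sketch below buckets the replies of one submission into reviews, meta-reviews and decisions, and keeps only their content. The dictionary shape (`details`, `replies`, `invitations`, `content`) and the invitation suffixes are assumptions made for this example; real invitation names vary by venue and year, and `group_replies` is a hypothetical helper, not part of any OpenReview client.

```python
from collections import defaultdict

# Assumed invitation suffixes; real venues may name these differently.
REPLY_KINDS = {
    "Official_Review": "review",
    "Meta_Review": "meta_review",
    "Decision": "decision",
}


def group_replies(submission: dict) -> dict:
    """Bucket a submission's replies by kind, keeping only their content.

    Identifiers of the submission and of each reply are deliberately
    not copied into the result.
    """
    grouped = defaultdict(list)
    for reply in submission.get("details", {}).get("replies", []):
        for invitation in reply.get("invitations", []):
            # Assumed form: "<venue>/<submission>/-/<Kind>"
            suffix = invitation.rsplit("/-/", 1)[-1]
            kind = REPLY_KINDS.get(suffix)
            if kind:
                grouped[kind].append(reply.get("content", {}))
                break
    return dict(grouped)
```

Replies that match none of the assumed suffixes (for example author comments) are simply ignored.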

How we cluster weaknesses

The `weaknesses` field of each official review typically contains five to eight distinct critiques, often as bullet points. The first thing we do is split each paragraph into individual items by its natural delimiters (dashes, numbers, bullets).
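A minimal sketch of this splitting step, assuming that an item starts on a line opening with a dash, an asterisk, a bullet, or a short number marker such as `1.`, `1)` or `(1)`. The marker set and the function name `split_weaknesses` are illustrative choices, not the project's actual rules.

```python
import re

# A new item starts with "-", "*", "•", or a one- or two-digit number marker.
ITEM_START = re.compile(r"^\s*(?:[-*•]|\(?\d{1,2}[.)])\s+")


def split_weaknesses(text: str) -> list[str]:
    """Split a weaknesses field into individual critique items."""
    items: list[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = ITEM_START.match(line)
        if match:
            # Strip the marker and open a new item.
            items.append(line[match.end():].strip())
        elif items:
            # No marker: treat as a continuation of the previous item.
            items[-1] += " " + line
        else:
            # Unmarked text before any marker becomes its own item.
            items.append(line)
    return [item for item in items if item]
```

Requiring whitespace after the marker keeps a line that opens with a decimal such as `3.5` from being mistaken for item number 3.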

Then we apply a classical text-clustering pipeline:

We vectorise items with TF-IDF (unigrams and bigrams, sublinear term-frequency scaling, L2 normalisation).
We reduce the vectors to 64 latent dimensions with Truncated SVD (Latent Semantic Analysis).
We re-normalise the reduced vectors and cluster them with KMeans (fixed seed, 20 initialisations).
For each cluster we recover the most representative TF-IDF terms and three exemplar items close to the centroid.

The cluster labels and descriptions you see are not produced by the algorithm: we write them by hand after reading the terms and exemplars. The algorithm finds the structure; we name it.

Privacy and ToS compliance

We don't show paper identifiers, lists of rejected papers, or links to individual reviews.
Exemplar quotes are surfaced anonymously (no reviewer ID, no paper ID) and truncated.
All content is based on public data. If you are an author or reviewer of a paper and want a specific exemplar removed, write to contact@opencodice.org.
This is a non-profit open-science project.

Honest limitations

The pipeline clusters by lexical similarity, not semantic similarity, because TF-IDF only sees shared words. Two critiques saying the same thing with different vocabulary may end up in separate clusters.
Each venue exposes a different slice of its process. NeurIPS only exposes submissions that reached the decision stage; COLM only exposes accepted ones. We show what exists and flag when the data is biased.
Sampling within rejected papers is uniform. An analysis weighted by decision margin or reviewer confidence could give different cluster proportions.
Some clusters are mostly noise or concentrated on a specific sub-topic (federated learning, diffusion models). We flag this when it applies.
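The lexical limitation listed first above is easy to reproduce on raw TF-IDF vectors. In the sketch below, the three critique sentences are invented for illustration; the first two make the same complaint with no word in common, while the third shares vocabulary with the first.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented example critiques, not taken from any review.
critiques = [
    "Baselines look outdated",
    "Missing comparisons against recent methods",
    "Outdated baselines weaken the evaluation",
]

vectors = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True).fit_transform(critiques)
similarity = cosine_similarity(vectors)

# Same complaint, disjoint vocabulary: the vectors are orthogonal.
paraphrase_score = similarity[0, 1]
# Shared words ("baselines", "outdated"): similarity is positive.
overlap_score = similarity[0, 2]

print(f"paraphrase: {paraphrase_score:.2f}, lexical overlap: {overlap_score:.2f}")
```

The SVD step may soften this effect when paraphrases co-occur with the same surrounding words across the corpus, but it should not be expected to remove it.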

Full documentation

The pipeline design, the architecture of the MCP server that produces the data, and a full case study on ICLR 2024 are documented in the technical report OC-TR-2026-007:

Read technical report
MCP server on GitHub