NeurIPS 2024

The largest ML venue. OpenReview only exposes submissions that reached the decision stage; what you see here is the tail of the process.

OpenReview

About the available data

OpenReview publicly exposes 4,236 NeurIPS 2024 submissions, roughly a quarter of the ~17,000 the conference received. Withdrawn papers and papers rejected at pre-screening are not in the public API, so the visible acceptance rate (95.3%) is far higher than the real one. The clustered weaknesses come from the 201 rejected papers we can see.
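
To make the inflation concrete, here is a minimal sketch using only the figures quoted on this page. The variable names are illustrative, and the second rate rests on an assumption the page does not state: that every accepted paper is public, so the visible accepted count can be divided by the approximate full submission count.

```python
# Figures quoted on this page.
visible_submissions = 4236
visible_rejected = 201
total_received = 17000  # approximate ("~17,000")

visible_accepted = visible_submissions - visible_rejected

# Acceptance rate over the public subset only.
visible_rate = visible_accepted / visible_submissions

# Assumption (not stated on the page): all accepted papers are public,
# so the same accepted count is divided by the approximate full total.
approx_overall_rate = visible_accepted / total_received

print(f"visible acceptance: {visible_rate:.1%}")              # 95.3%
print(f"approximate overall acceptance: {approx_overall_rate:.1%}")  # 23.7%
```

The second number is only as good as the ~17,000 estimate; it is here to show the size of the gap, not to report an official rate.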

4,236

Visible submissions

Total exposed on OpenReview

95.3%

Visible acceptance

Over the public subset

2,468

Critiques analysed

Across 201 rejected papers

10

Patterns identified

Rating distribution

Scores reviewers assigned in their reviews.

Committee decisions

How final decisions are distributed across visible submissions.

Accept (poster) · 86.1%
Accept (spotlight) · 7.7%
Reject · 4.7%
Accept (oral) · 1.4%

Weakness map

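Each bar is a recurring pattern across the reviews of rejected papers, shown with its share and its critique count. The width is that pattern's share of the 2,251 critiques assigned to one of the ten patterns. The per-pattern sections further down instead report shares of all 2,468 analysed critiques, so their percentages are slightly lower.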

Weak experimental comparison · 25.5% · 574
Mathematical notation and assumptions · 15.1% · 340
Hard to follow · 12.2% · 274
Critiques specific to generative models · 9.5% · 214
Details of the proposed method · 8.2% · 185
Limited experimental results · 7.9% · 178
Positioning against prior work · 6.4% · 143
Validation on real-world scenarios · 5.2% · 118
Typos and code errors · 5.2% · 116
Hard-to-read figures · 4.8% · 109

Patterns, one by one

Sorted by weight. For each pattern we show what it represents, how reviewers phrase it, and a practical takeaway you can apply before submitting your next paper.

#01

Weak experimental comparison

23.3% of total · 574 items

Reviewers find the comparison set too narrow: strong baselines are missing, ablations are missing, training details are missing.

methods · performance · experiments · comparison · authors · data · training · details

How reviewers phrase it

Unfortunately, I found the rest of the paper (beyond the core idea) lacking and having several weaknesses. Importantly, the authors mischaracterize important relevant literature and conceptual ideas.

The paper compares against three baselines published before 2022. Two recent strong baselines are absent without explanation.

Performance numbers are reported, but training details (compute, seeds, hyperparameter search) are not, which makes the comparison hard to interpret.

Practical takeaway. Before the deadline, list the comparisons you would expect as a reviewer yourself. If your paper doesn't have them, add them or explain why they are omitted.

#02

Mathematical notation and assumptions

13.8% of total · 340 items

Errors in theorems, equations with inconsistent notation, assumptions that appear in proofs without being stated. Critical for theoretical papers.

equation · theorem · eq · assumption · lines · authors · defined · used

How reviewers phrase it

I found several typos in the main theorems in section 4. For example, the stability equation in Lemma 4 should be written with respect to the output at the iteration instead of the output of the ERM.

The proof of Theorem 1 silently uses a Lipschitz assumption that was never stated. Either add it to the theorem statement or argue why it is implied.

Eq. (12) uses the same symbol for two different objects defined on different pages.

Practical takeaway. Apply the golden rule to every theorem: state all assumptions in a block first, then the conclusion. Get a math-savvy colleague to review before submission.
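
One possible layout for the "assumptions first, then conclusion" rule, as a LaTeX sketch. It assumes the `amsthm` package with a `theorem` environment declared via `\newtheorem{theorem}{Theorem}`. The labels (A1), (A2), the symbols and the bound $B$ are placeholders, not taken from any reviewed paper.

```latex
% Placeholder theorem: every assumption is named before the claim is made.
\begin{theorem}[Placeholder title]
\textbf{Assumptions.}
\begin{itemize}
  \item[(A1)] The loss $\ell(\cdot, z)$ is $L$-Lipschitz for every sample $z$.
  \item[(A2)] The step size satisfies $\eta_t \le \eta$ for all $t$.
\end{itemize}
\textbf{Conclusion.} Under (A1) and (A2), the iterate $w_T$ satisfies
\[
  \mathrm{err}(w_T) \le B(L, \eta, T, n),
\]
where $B$ stands for whatever bound the paper actually proves.
\end{theorem}
```

Referring to (A1) and (A2) by label inside the proof makes it easy for a reader to check that nothing else is used silently.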

#03

Hard to follow

11.1% of total · 274 items

The paper is hard to read. Structure does not support the argument, sentences are imprecise, parts are written at very different levels of detail.

paper · writing · understand · section · main · hard · authors · presentation

How reviewers phrase it

While the paper makes significant contributions, there are some areas that could be improved. The writing is occasionally imprecise, making it challenging to follow the arguments and understand the details.

The introduction is dense and assumes a lot of background; the experimental section by contrast over-explains. Even out the levels.

The main result is buried at the bottom of page 6. Pull it forward and signpost it earlier.

Practical takeaway. After the first draft, read only the section and subsection headings, in order: does the story land? If not, fix the structure before polishing the prose.

#04

Critiques specific to generative models

8.7% of total · 214 items

A topic-specific cluster grouping complaints about diffusion and generative-model papers: unjustified architectural choices, missing comparisons against specific diffusion models, unreported training costs.

model · models · diffusion · diffusion model · paper · performance · data · authors

How reviewers phrase it

Since we usually don't switch models based on data I am not sure why this is important. Do we really have edge devices that switch on a daily basis?

The diffusion model used as backbone is two generations behind the state of the art. A repeat with a current model would change the conclusions.

Compute cost of training is not reported; this is the single most relevant axis for comparing generative methods.

Practical takeaway. If you work on generative models, anticipate the question "Why this backbone instead of SDXL/SD3/Flux?" and answer it explicitly in the paper.

#05

Details of the proposed method

7.5% of total · 185 items

The method is described at a high level but operational details are missing: speed, compute cost, ablations of the key component.

method · proposed · proposed method · methods · training · paper · table · analysis

How reviewers phrase it

The processing speed of the proposed method is one of the limitations.

There is no ablation on the core regulariser; we don't know whether it is doing the work the authors claim.

Memory cost compared to the baseline isn't reported.

Practical takeaway. A dedicated "Implementation and cost" subsection heads off many of these complaints. Report time per iteration, memory use, and the hyperparameters the method is sensitive to.
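
As a rough illustration of how the numbers for such a subsection could be collected, here is a minimal standard-library sketch. `measure_step` and `toy_step` are hypothetical names, and `toy_step` is a stand-in for one training iteration. `tracemalloc` only sees Python-level allocations, so GPU or native-library memory would need the framework's own counters.

```python
import time
import tracemalloc


def measure_step(step_fn, n_iters=20, warmup=3):
    """Return average wall-clock time per call and peak Python memory."""
    # Warm-up calls so one-off setup cost does not skew the timing.
    for _ in range(warmup):
        step_fn()

    # Timing pass, run without memory tracing because tracing slows code down.
    start = time.perf_counter()
    for _ in range(n_iters):
        step_fn()
    seconds_per_iter = (time.perf_counter() - start) / n_iters

    # Separate memory pass: peak bytes allocated by Python during one call.
    tracemalloc.start()
    step_fn()
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {"seconds_per_iter": seconds_per_iter, "peak_mib": peak_bytes / 2**20}


def toy_step():
    """Hypothetical stand-in for one training iteration."""
    data = [i * i for i in range(50_000)]
    return sum(data)


if __name__ == "__main__":
    print(measure_step(toy_step))
```

Running the same harness on the baseline gives the side-by-side cost comparison that reviewers in this cluster ask for.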

#06

Limited experimental results

7.2% of total · 178 items

The headline table covers a narrow set of scenarios. Reviewers want to see the method under more diverse or more adversarial conditions.

results · table · experimental · experimental results · performance · analysis · paper · authors

How reviewers phrase it

The random daycare market for which the results are derived is somewhat restrictive.

All experimental settings stay within the i.i.d. regime; covariate shift would test the claims.

Why are results aggregated over only three runs? At this gap size, more seeds are needed.

Practical takeaway. An extra column with a scenario that breaks your method (and honesty about when it stops working) usually scores better than another comfortable experiment.
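
The "only three runs" complaint above can be checked numerically before submission. The sketch below computes a mean and a 95% confidence half-width from per-seed scores using Student-t critical values. The scores are made-up illustrative numbers, `mean_and_half_width` is a hypothetical helper, and the small lookup table only covers 3, 5 and 10 runs.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (n - 1).
T_CRIT_95 = {2: 4.303, 4: 2.776, 9: 2.262}


def mean_and_half_width(scores):
    """Mean and 95% confidence half-width for a small set of per-seed scores."""
    df = len(scores) - 1
    if df not in T_CRIT_95:
        raise ValueError("this sketch only tabulates 3, 5 or 10 runs")
    mean = statistics.fmean(scores)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, T_CRIT_95[df] * sem


# Hypothetical accuracies (in %) from independent seeds, for illustration only.
three_runs = [71.2, 72.9, 70.4]
ten_runs = [71.2, 72.9, 70.4, 71.8, 72.1, 70.9, 71.5, 72.4, 71.0, 71.7]

for label, runs in (("3 seeds", three_runs), ("10 seeds", ten_runs)):
    mean, half_width = mean_and_half_width(runs)
    print(f"{label}: {mean:.2f} +/- {half_width:.2f}")
```

If the interval from three seeds is wider than the gap you claim over the baseline, the reviewer's question is justified and more seeds are the cheapest answer.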

#07

Positioning against prior work

5.8% of total · 143 items

The related-work section fails to connect the paper to the right audience (safety track, specific sub-area) or fails to settle the novelty question.

work · related · related work · paper · works · section · novelty · authors

How reviewers phrase it

One of my major concerns is the audience of this work. Given that this work is submitted to the safe ML track of NeurIPS, I expect more discussion on the relevance of this framework to AI safety.

The related work section reads like a chronology, not a comparison. Group prior work and contrast with the contribution.

Novelty over [Author, 2023] is not articulated; that paper appears to solve the same problem.

Practical takeaway. If you submit to a thematic track (safety, datasets), dedicate a paragraph in related work to making the connection explicit. Don't leave it implied.

#08

Validation on real-world scenarios

4.8% of total · 118 items

The method is tested on synthetic or academic datasets. Evidence that it works where the problem matters is missing.

datasets · real · real world · world · experiments · paper · applications · scenarios

How reviewers phrase it

While the method is tested on two real-world datasets, broader evaluation across more diverse and challenging datasets could strengthen the validation.

Both datasets are well-curated benchmarks; one industrial dataset would substantially raise confidence in the claims.

The application scenarios discussed in the introduction are not represented in the experiments.

Practical takeaway. A real-data experiment, even a small one, is worth more than three on synthetic data. If that is not feasible, state the limitation precisely.

#09

Typos and code errors

4.7% of total · 116 items

Errors at line level: typos, misplaced symbols, code fragments that don't compile as written.

line · rightarrow · typo · typo line · tilde · end · a_ · algorithm

How reviewers phrase it

Line 307, `one week The` -> `one week. The`

In Algorithm 1 line 4, the index a_t should be a_{t-1} given the recurrence.

$\tilde{x}$ is used in Eq. (9) but defined only in the appendix.

Practical takeaway. A two-hour pass focused only on detail errors before the deadline saves a great deal of effort at rebuttal time. Unglamorous, but a high return for the time spent.
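
Part of that pass can be automated. As one small example, the sketch below flags the "one week The" kind of error quoted above: a lowercase word followed directly by a capitalised sentence opener with no punctuation in between. The opener list and the helper name `find_missing_periods` are assumptions for illustration. Expect false positives (for instance proper nouns beginning with "The"), so treat the hits as candidates to check by eye.

```python
import re

# Capitalised words that commonly start a sentence (illustrative, not exhaustive).
SENTENCE_OPENERS = ("The", "This", "These", "We", "In", "However")

# A lowercase word, one space, then an opener: a period is probably missing.
MISSING_PERIOD = re.compile(
    r"\b([a-z]+) (" + "|".join(SENTENCE_OPENERS) + r")\b"
)


def find_missing_periods(text):
    """Return (line_number, matched_text) pairs for suspected missing periods."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in MISSING_PERIOD.finditer(line):
            hits.append((line_no, match.group(0)))
    return hits


sample = (
    "Training lasts one week The model is then frozen.\n"
    "We evaluate on held-out data."
)
print(find_missing_periods(sample))  # [(1, 'week The')]
```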

#10

Hard-to-read figures

4.4% of total · 109 items

Inconsistent captions, irregular capitalisation, lines too thin to print, duplicated or ambiguous labels.

figure · figures · caption · hard · better · text · small · fig

How reviewers phrase it

Keep capitalisation consistent across the figure labels.

Lines in Fig. 4 are too thin; the dashed and dotted variants are indistinguishable in print.

Caption of Fig. 2 is one sentence; please describe what the reader is looking at without referring to the body text.

Practical takeaway. Print your paper in black and white and look at it from a metre away. What isn't readable at that distance won't be readable on screen at review speed.
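
A quick numerical stand-in for the black-and-white print test: convert each line colour to an approximate grey level and flag pairs that land too close together. The sketch uses the Rec. 601 luma weights. The palette, the helper names and the `min_gap` threshold of 40 grey levels are illustrative assumptions, and real printers may convert colours differently.

```python
from itertools import combinations


def luma(hex_color):
    """Approximate grey level (0 to 255) using the Rec. 601 luma weights."""
    hex_color = hex_color.lstrip("#")
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    return 0.299 * r + 0.587 * g + 0.114 * b


def confusable_in_greyscale(colors, min_gap=40):
    """Return pairs of series whose grey levels differ by less than min_gap."""
    return [
        (name_a, name_b)
        for (name_a, col_a), (name_b, col_b) in combinations(colors.items(), 2)
        if abs(luma(col_a) - luma(col_b)) < min_gap
    ]


# Hypothetical palette for three curves in one figure.
palette = {"baseline": "#1f77b4", "ours": "#d62728", "ablation": "#ffdd00"}
print(confusable_in_greyscale(palette))  # [('baseline', 'ours')]
```

Pairs that are flagged are candidates for distinct line styles or markers rather than colour alone.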

Other venues

ICLR 2024

7,404 submissions · 10 clusters


ICLR 2025

11,672 submissions · 10 clusters


TMLR

6,661 submissions · 10 clusters
