OpenReview publicly exposes 4,236 NeurIPS 2024 submissions, out of roughly 17,000 received by the full conference. Withdrawn submissions and pre-screening rejections are not in the public API, which inflates the visible acceptance rate. The clustered weaknesses come from the 201 rejected papers we can see.
4,236
Visible submissions
Total exposed on OpenReview
95.3%
Visible acceptance
Over the public subset
2,468
Critiques analysed
Across 201 rejected papers
10
Patterns identified
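The gap between the visible acceptance rate and the conference-wide one can be reproduced from the figures above. This is a minimal sketch under two assumptions: every accepted paper is among the 4,236 visible submissions, and the approximate total of 17,000 is taken at face value. The function and constant names are illustrative.

```python
def acceptance_rate(accepted: int, submitted: int) -> float:
    """Acceptance rate as a percentage, rounded to one decimal place."""
    return round(100 * accepted / submitted, 1)


VISIBLE = 4236           # submissions exposed on OpenReview
VISIBLE_REJECTED = 201   # rejected papers in the public subset
TOTAL_APPROX = 17000     # approximate full submission count (assumption: taken from the text)

# Assumption: all accepted papers are public, so accepted = visible - visible rejections.
accepted = VISIBLE - VISIBLE_REJECTED

visible_rate = acceptance_rate(accepted, VISIBLE)
overall_estimate = acceptance_rate(accepted, TOTAL_APPROX)

print(f"visible: {visible_rate}%  estimated overall: {overall_estimate}%")
```

Under these assumptions the visible rate is 95.3%, while the same accepted count over roughly 17,000 submissions gives about 23.7%.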
Rating distribution
Scores reviewers assigned in their reviews.
Committee decisions
How final decisions are distributed across visible submissions.
Accept (poster): 86.1%
Accept (spotlight): 7.7%
Reject: 4.7%
Accept (oral): 1.4%
Weakness map
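Each bar is a recurring pattern across the reviews of rejected papers. The width represents the pattern's share of the 2,251 critiques assigned to the ten patterns, so these percentages run slightly higher than the per-pattern shares listed further down, which are taken over all 2,468 analysed critiques.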
Weak experimental comparison: 25.5% (574 items)
Mathematical notation and assumptions: 15.1% (340 items)
Hard to follow: 12.2% (274 items)
Critiques specific to generative models: 9.5% (214 items)
Details of the proposed method: 8.2% (185 items)
Limited experimental results: 7.9% (178 items)
Positioning against prior work: 6.4% (143 items)
Validation on real-world scenarios: 5.2% (118 items)
Typos and code errors: 5.2% (116 items)
Hard-to-read figures: 4.8% (109 items)
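The percentages in the map and the "of total" percentages in the per-pattern sections appear to use different denominators. The sketch below recomputes both from the item counts. The reading that the map share is taken over the items assigned to the ten patterns is inferred from the numbers, not stated by the source.

```python
# Item counts per pattern, as listed in the weakness map.
counts = {
    "Weak experimental comparison": 574,
    "Mathematical notation and assumptions": 340,
    "Hard to follow": 274,
    "Critiques specific to generative models": 214,
    "Details of the proposed method": 185,
    "Limited experimental results": 178,
    "Positioning against prior work": 143,
    "Validation on real-world scenarios": 118,
    "Typos and code errors": 116,
    "Hard-to-read figures": 109,
}

TOTAL_CRITIQUES = 2468            # all critiques analysed
clustered = sum(counts.values())  # critiques that landed in one of the ten patterns


def share(n: int, denominator: int) -> float:
    """Percentage share rounded to one decimal place."""
    return round(100 * n / denominator, 1)


# Sort by weight, largest first, and show both shares side by side.
for name, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {share(n, clustered)}% of clustered, "
          f"{share(n, TOTAL_CRITIQUES)}% of all critiques")
```

The first column reproduces the map (25.5%, 15.1%, ...) and the second reproduces the per-pattern figures (23.3%, 13.8%, ...).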
Patterns, one by one
Sorted by weight. For each pattern we show what it represents, how reviewers phrase it, and a practical takeaway you can apply before submitting your next paper.
#01
Weak experimental comparison
23.3% of total (574 items)
Reviewers find the comparison set too narrow: strong baselines are missing, ablations are missing, training details are missing.
Unfortunately, I found the rest of the paper (beyond the core idea) lacking and having several weaknesses. Importantly, the authors mischaracterize important relevant literature and conceptual ideas.
The paper compares against three baselines published before 2022. Two recent strong baselines are absent without explanation.
Performance numbers are reported, but training details (compute, seeds, hyperparameter search) are not, which makes the comparison hard to interpret.
Practical takeaway. Before the deadline, list the comparisons you yourself would expect as a reviewer. If your paper doesn't have them, add them or explain why they are omitted.
#02
Mathematical notation and assumptions
13.8% of total (340 items)
Errors in theorems, equations with inconsistent notation, assumptions that appear in proofs without being stated. Critical for theoretical papers.
I found several typos in the main theorems in section 4. For example, the stability equation in Lemma 4 should be written with respect to the output at the iteration instead of the output of the ERM.
The proof of Theorem 1 silently uses a Lipschitz assumption that was never stated. Either add it to the theorem statement or argue why it is implied.
Eq. (12) uses the same symbol for two different objects defined on different pages.
Practical takeaway. Apply the golden rule to every theorem: state all assumptions in a block first, then the conclusion. Get a math-savvy colleague to review before submission.
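One way to lay out the "assumptions first, then the conclusion" rule in LaTeX. This is a sketch only: it assumes the `amsthm` and `amssymb` packages and a `theorem` environment declared with `\newtheorem`, and the statement is a toy example, not taken from any reviewed paper.

```latex
% Assumed preamble: \usepackage{amsthm,amssymb} and \newtheorem{theorem}{Theorem}
\begin{theorem}
Suppose the following hold.
\begin{itemize}
  \item[(A1)] $f : \mathbb{R}^d \to \mathbb{R}$ is $L$-Lipschitz, i.e.
        $|f(x) - f(y)| \le L \, \|x - y\|$ for all $x, y \in \mathbb{R}^d$.
  \item[(A2)] The iterate $x_t$ satisfies $\|x_t - x^\star\| \le \varepsilon$.
\end{itemize}
Then $|f(x_t) - f(x^\star)| \le L \varepsilon$.
\end{theorem}
```

Labelling the assumptions (A1), (A2) lets each proof step cite exactly which one it uses, which makes a silently used assumption easier to spot.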
#03
Hard to follow
11.1% of total (274 items)
The paper is hard to read. Structure does not support the argument, sentences are imprecise, parts are written at very different levels of detail.
While the paper makes significant contributions, there are some areas that could be improved. The writing is occasionally imprecise, making it challenging to follow the arguments and understand the details.
The introduction is dense and assumes a lot of background; the experimental section by contrast over-explains. Even out the levels.
The main result is buried at the bottom of page 6. Pull it forward and signpost it earlier.
Practical takeaway. After the first draft, read only the titles and subtitles, in order: does the story land? If not, restructure before polishing prose.
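The titles-only read can be automated for a LaTeX draft. The sketch below assumes the paper uses standard `\section`, `\subsection` and `\subsubsection` commands and that titles contain no nested braces; the `outline` helper is hypothetical.

```python
import re

# Matches \section{...}, \subsection{...}, \subsubsection{...}, starred or not.
# Limitation: titles containing nested braces (e.g. \textbf{...}) are skipped.
HEADING = re.compile(r"\\(section|subsection|subsubsection)\*?\{([^{}]*)\}")
INDENT = {"section": 0, "subsection": 1, "subsubsection": 2}


def outline(tex: str) -> list[str]:
    """Return the heading titles in document order, indented by depth."""
    return [
        "  " * INDENT[m.group(1)] + m.group(2).strip()
        for m in HEADING.finditer(tex)
    ]


sample = r"""
\section{Introduction}
Some text.
\subsection{Contributions}
\section*{Method}
"""

for line in outline(sample):
    print(line)
```

Reading the printed outline on its own is a quick test of whether the structure carries the argument.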
#04
Critiques specific to generative models
8.7% of total (214 items)
A topic-specific cluster grouping complaints about diffusion and other generative-model papers: unjustified architectural choices, missing comparisons against specific diffusion models, unreported training costs.
The processing speed of the proposed method is one of the limitations.
There is no ablation on the core regulariser; we don't know whether it is doing the work the authors claim.
Memory cost compared to the baseline isn't reported.
Practical takeaway. A dedicated "Implementation and cost" subsection pre-empts many of these complaints. Report time per iteration, memory use, and the hyperparameters the method is sensitive to.
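Numbers for such a subsection can be collected with the standard library alone. This is a sketch, assuming the training step is exposed as a zero-argument callable. `profile_step` is a hypothetical helper, and `tracemalloc` only sees Python-level allocations, so GPU memory would need the framework's own counters.

```python
import time
import tracemalloc


def profile_step(step_fn, n_iters=10, warmup=2):
    """Measure mean wall-clock time per call and peak Python heap usage."""
    for _ in range(warmup):          # warm-up calls are not timed
        step_fn()

    # Pass 1: timing, without tracemalloc overhead.
    start = time.perf_counter()
    for _ in range(n_iters):
        step_fn()
    sec_per_iter = (time.perf_counter() - start) / n_iters

    # Pass 2: memory, on a single traced call.
    tracemalloc.start()
    step_fn()
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {"sec_per_iter": sec_per_iter, "peak_mib": peak_bytes / 2**20}


# Stand-in workload; replace with one real training or sampling step.
stats = profile_step(lambda: [i * i for i in range(10_000)])
print(f"{stats['sec_per_iter'] * 1e3:.3f} ms/iter, peak {stats['peak_mib']:.2f} MiB")
```

Timing and memory are measured in separate passes because tracing allocations slows the code down and would distort the per-iteration time.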
#06
Limited experimental results
7.2% of total (178 items)
The headline table covers a narrow set of scenarios. Reviewers want to see the method under more diverse or more adversarial conditions.
The random daycare market for which the results are derived is somewhat restrictive.
All experimental settings stay within the i.i.d. regime; covariate shift would test the claims.
Why are results aggregated over only three runs? At this gap size, more seeds are needed.
Practical takeaway. An extra column with a scenario that breaks your method (and honesty about when it stops working) usually scores better than another comfortable experiment.
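The "only three runs" objection can be checked before submission by comparing the gap between methods with the seed-to-seed noise. The sketch below uses made-up per-seed scores. The helper names are hypothetical, and the ratio is a rough signal rather than a formal significance test.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, standard error of the mean) for per-seed scores."""
    return mean(scores), stdev(scores) / sqrt(len(scores))


def gap_in_standard_errors(ours, baseline):
    """Difference of means divided by the combined standard error."""
    mean_ours, se_ours = summarize(ours)
    mean_base, se_base = summarize(baseline)
    return (mean_ours - mean_base) / sqrt(se_ours**2 + se_base**2)


# Hypothetical accuracies over three seeds each (illustrative values only).
ours = [0.81, 0.79, 0.84]
baseline = [0.80, 0.78, 0.80]

m, se = summarize(ours)
print(f"ours: {m:.3f} +/- {se:.3f} (n={len(ours)})")
print(f"gap / combined SE: {gap_in_standard_errors(ours, baseline):.2f}")
```

If the ratio is small, the gap is comparable to the noise across seeds, and adding seeds is likely a better use of compute than adding another comfortable benchmark.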
#07
Positioning against prior work
5.8% of total (143 items)
The related-work section fails to connect the paper to the right audience (safety track, specific sub-area) or fails to settle the novelty question.
One of my major concerns is the audience of this work. Given that this work is submitted to the safe ML track of NeurIPS, I expect more discussion on the relevance of this framework to AI safety.
The related work section reads like a chronology, not a comparison. Group prior work and contrast with the contribution.
Novelty over [Author, 2023] is not articulated; that paper appears to solve the same problem.
Practical takeaway. If you submit to a thematic track (safety, datasets), dedicate a paragraph in related work to making the connection explicit. Don't leave it implied.
#08
Validation on real-world scenarios
4.8% of total (118 items)
The method is tested on synthetic or academic datasets. Evidence that it works where the problem matters is missing.
While the method is tested on two real-world datasets, broader evaluation across more diverse and challenging datasets could strengthen the validation.
Both datasets are well-curated benchmarks; one industrial dataset would substantially raise confidence in the claims.
The application scenarios discussed in the introduction are not represented in the experiments.
Practical takeaway. A real-data experiment, even a small one, is worth more than three with synthetic data. If that is not feasible, state the limitation precisely.
#09
Typos and code errors
4.7% of total (116 items)
Errors at line level: typos, misplaced symbols, code fragments that don't compile as written.
Key terms: line, rightarrow, typo, typo line, tilde, end, a_, algorithm
How reviewers phrase it
Line 307, `one week The` -> `one week. The`
In Algorithm 1 line 4, the index a_t should be a_{t-1} given the recurrence.
$\tilde{x}$ is used in Eq. (9) but defined only in the appendix.
Practical takeaway. A two-hour pass focused only on "detail errors" before the deadline saves a great deal of effort at rebuttal time. Unglamorous, but high payoff.
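Part of that pass can be scripted. The sketch below flags one narrow class of error, a missing period before a common sentence opener, as in the `one week The` example. The word list and the `find_missing_periods` helper are assumptions for illustration. The check is a heuristic, so it will miss cases (including sentences that break across lines) and may flag proper nouns.

```python
import re

# A lowercase word followed directly by a capitalised sentence opener,
# with no punctuation in between, e.g. "one week The".
MISSING_PERIOD = re.compile(
    r"\b([a-z]{2,}) (The|This|These|We|It|In|However)\b"
)


def find_missing_periods(text):
    """Return (line number, matched snippet) pairs for suspicious spots."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in MISSING_PERIOD.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


draft = "Training lasts one week The model is then evaluated.\nAll good here. The end."
for lineno, snippet in find_missing_periods(draft):
    print(f"line {lineno}: check '{snippet}'")
```

Each hit still needs a human decision; the script only narrows down where to look.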
#10
Hard-to-read figures
4.4% of total (109 items)
Inconsistent captions, irregular capitalisation, lines too thin to print, duplicated or ambiguous labels.
Key terms: figure, figures, caption, hard, better, text, small, fig
How reviewers phrase it
Keep capitalisation consistent across the figure labels.
Lines in Fig. 4 are too thin; the dashed and dotted variants are indistinguishable in print.
Caption of Fig. 2 is one sentence; please describe what the reader is looking at without referring to the body text.
Practical takeaway. Print your paper in black and white and look at it from a metre away. What isn't readable at that distance won't be readable on screen at review speed.
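A rough software stand-in for the black-and-white print test: convert each line colour to a grey level and flag pairs that end up too close. The sketch uses the common luma weights 0.299, 0.587 and 0.114 on 8-bit RGB values. The colours, series names and the `min_gap` threshold of 30 are illustrative assumptions, not a standard.

```python
from itertools import combinations


def luma(rgb):
    """Approximate grey level (0-255) of an 8-bit RGB colour."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def confusable_in_grayscale(colors, min_gap=30.0):
    """Return pairs of series whose grey levels differ by less than min_gap."""
    return [
        (name_a, name_b)
        for (name_a, rgb_a), (name_b, rgb_b) in combinations(colors.items(), 2)
        if abs(luma(rgb_a) - luma(rgb_b)) < min_gap
    ]


# Hypothetical plot palette: a blue, an orange and a green line.
palette = {
    "ours": (31, 119, 180),
    "baseline": (255, 127, 14),
    "ablation": (44, 160, 44),
}

for a, b in confusable_in_grayscale(palette):
    print(f"'{a}' and '{b}' look alike in greyscale; vary line style or marker")
```

With this palette, the blue and green series are flagged: clearly different on a colour screen, nearly the same grey on paper.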