
NeurIPS 2024 Wrapped 🌯
Explicit content warning
12/30/24 • 86 min
What happens when you bring over 15,000 machine learning nerds to one city? If your guess didn't include racism, sabotage and scandal, belated epiphanies, a spicy SoLaR panel, and many fantastic research papers, you wouldn't have captured my experience. In this episode we discuss the drama and takeaways from NeurIPS 2024.
Posters available at the time of episode preparation can be found on the episode webpage.
EPISODE RECORDED 2024.12.22
- (00:00) - Recording date
- (00:05) - Intro
- (00:44) - Obligatory mentions
- (01:54) - SoLaR panel
- (18:43) - Test of Time
- (24:17) - And now: science!
- (28:53) - Downsides of benchmarks
- (41:39) - Improving the science of ML
- (53:07) - Performativity
- (57:33) - NopenAI and Nanthropic
- (01:09:35) - Fun/interesting papers
- (01:13:12) - Initial takes on o3
- (01:18:12) - WorkArena
- (01:25:00) - Outro
Links
Note: many workshop papers had not yet been published to arXiv when this episode was prepared; in those cases, the OpenReview submission page is linked instead.
- NeurIPS statement on inclusivity
- CTOL Digital Solutions article - NeurIPS 2024 Sparks Controversy: MIT Professor's Remarks Ignite "Racism" Backlash Amid Chinese Researchers’ Triumphs
- (1/2) NeurIPS Best Paper - Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
- Visual Autoregressive Model report (this link now returns a 404 error)
- Don't worry, here it is on archive.is
- Reuters article - ByteDance seeks $1.1 mln damages from intern in AI breach case, report says
- CTOL Digital Solutions article - NeurIPS Award Winner Entangled in ByteDance's AI Sabotage Accusations: The Two Tales of an AI Genius
- Reddit post on Ilya's talk
- SoLaR workshop page
Referenced Sources
- Harvard Data Science Review article - Data Science at the Singularity
- Paper - Reward Reports for Reinforcement Learning
- Paper - It's Not What Machines Can Learn, It's What We Cannot Teach
- Paper - NeurIPS Reproducibility Program
- Paper - A Metric Learning Reality Check
Improving Datasets, Benchmarks, and Measurements
- Tutorial video + slides - Experimental Design and Analysis for AI Researchers (I think you need to have attended NeurIPS to access the recording, but I couldn't find a different version)
- Paper - BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices
- Paper - Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
- Paper - A Systematic Review of NeurIPS Dataset Management Practices
- Paper - The State of Data Curation at NeurIPS: An Assessment of Dataset Development Practices in the Datasets and Benchmarks Track
- Paper - Benchmark Repositories for Better Benchmarking
- Paper - Croissant: A Metadata Format for ML-Ready Datasets
- Paper - Rethinking the Evaluation of Out-of-Distribution Detection: A Sorites Paradox
- Paper - Evaluating Generative AI Systems is a Social Science Measurement Challenge
Previous Episode

OpenAI's o1 System Card, Literally Migraine Inducing
Model cards were introduced to increase transparency and understanding of LLMs, but the idea has been perverted into a marketing gimmick, as exemplified by OpenAI's o1 system card. To demonstrate the adversarial stance we believe is necessary to draw meaning from these press-releases-in-disguise, we conduct a close read of the system card. Be warned, there's a lot of muck in this one.
Note: All figures/tables discussed in the podcast can be found on the podcast website at https://kairos.fm/muckraikers/e009/
- (00:00) - Recorded 2024.12.08
- (00:54) - Actual intro
- (03:00) - System cards vs. academic papers
- (05:36) - Starting off sus
- (08:28) - o1.continued
- (12:23) - Rant #1: figure 1
- (18:27) - A diamond in the rough
- (19:41) - Hiding copyright violations
- (21:29) - Rant #2: Jacob on "hallucinations"
- (25:55) - More ranting and "hallucination" rate comparison
- (31:54) - Fairness, bias, and bad science comms
- (35:41) - System, dev, and user prompt jailbreaking
- (39:28) - Chain-of-thought and Rao-Blackwellization
- (44:43) - "Red-teaming"
- (49:00) - Apollo's bit
- (51:28) - METR's bit
- (59:51) - Pass@???
- (01:04:45) - SWE Verified
- (01:05:44) - Appendix bias metrics
- (01:10:17) - The muck and the meaning
Links
- o1 system card
- OpenAI press release collection - 12 Days of OpenAI
Additional o1 Coverage
- NIST + AISI report - US AISI and UK AISI Joint Pre-Deployment Test
- Apollo Research's paper - Frontier Models are Capable of In-context Scheming
- VentureBeat article - OpenAI launches full o1 model with image uploads and analysis, debuts ChatGPT Pro
- The Atlantic article - The GPT Era Is Already Ending
On Data Labelers
- 60 Minutes article + video - Labelers training AI say they're overworked, underpaid and exploited by big American tech companies
- Reflections article - The hidden health dangers of data labeling in AI development
- Privacy International article - Humans in the AI loop: the data labelers behind some of the most powerful LLMs' training datasets
Chain-of-Thought Papers Cited
- Paper - Measuring Faithfulness in Chain-of-Thought Reasoning
- Paper - Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
- Paper - On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models
- Paper - Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models
Other Mentioned/Relevant Sources
- Andy Jones blogpost - Rao-Blackwellization
- Paper - Training on the Test Task Confounds Evaluation and Emergence
- Paper - Best-of-N Jailbreaking
- Research landing page - SWE Bench
- Code Competition - Konwinski Prize
- Lakera game - Gandalf
- Kate Crawford's Atlas of AI
- BlueDot Impact's course - Intro to Transformative AI
Unrelated Developments
- Cruz's
Next Episode

Understanding AI World Models w/ Chris Canal
Chris Canal, co-founder of EquiStamp, joins muckrAIkers as our first ever podcast guest! In this ~3.5 hour interview, we discuss intelligence vs. competencies, the importance of test-time compute, moving goalposts, the orthogonality thesis, and much more.
A seasoned software developer, Chris founded EquiStamp in late 2023 to improve our understanding of model failure modes and capabilities. Now a key contractor for METR, EquiStamp evaluates the next generation of LLMs from frontier model developers like OpenAI and Anthropic.
EquiStamp is hiring, so if you're a software developer interested in a fully remote opportunity with flexible working hours, join the EquiStamp Discord server and message Chris directly; oh, and let him know muckrAIkers sent you!
- (00:00) - Recording date
- (00:05) - Intro
- (00:29) - Hot off the press
- (02:17) - Introducing Chris Canal
- (19:12) - World/risk models
- (35:21) - Competencies + decision making power
- (42:09) - Breaking models down
- (01:05:06) - Timelines, test time compute
- (01:19:17) - Moving goalposts
- (01:26:34) - Risk management pre-AGI
- (01:46:32) - Happy endings
- (01:55:50) - Causal chains
- (02:04:49) - Appetite for democracy
- (02:20:06) - Tech-frame based fallacies
- (02:39:56) - Bringing back real capitalism
- (02:45:23) - Orthogonality Thesis
- (03:04:31) - Why we do this
- (03:15:36) - Equistamp!
Links
- EquiStamp
- Chris's Twitter
- METR Paper - RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts
- All Trades article - Learning from History: Preventing AGI Existential Risks through Policy by Chris Canal
- Better Systems article - The Omega Protocol: Another Manhattan Project
Superintelligence & Commentary
- Wikipedia article - Superintelligence: Paths, Dangers, Strategies by Nick Bostrom
- Reflective Altruism article - Against the singularity hypothesis (Part 5: Bostrom on the singularity)
- Into AI Safety Interview - Scaling Democracy w/ Dr. Igor Krawczuk
Referenced Sources
- Book - Man-made Catastrophes and Risk Information Concealment: Case Studies of Major Disasters and Human Fallibility
- Artificial Intelligence Paper - Reward is Enough
- Wikipedia article - Capital and Ideology by Thomas Piketty
- Wikipedia article - Pantheon
LeCun on AGI
- "Won't Happen" - Time article - Meta’s AI Chief Yann LeCun on AGI, Open-Source, and AI Risk
- "But if it does, it'll be my research agenda latent state models, which I happen to research" - Meta Platforms Blogpost - I-JEPA: The first AI model based on Yann LeCun’s vision for more human-like AI
Other Sources
- Stanford CS Senior Project - Timing Attacks on Prompt Caching in Language Model APIs
- TechCrunch article - AI researcher François Chollet founds a new AI lab focused on AGI
- White House Fact Sheet - Ensuring U.S. Security and Economic Strength in the Age of Artificial Intelligence
- New York Post