Discussion about this post

User's avatar
Ryan's avatar

Great synopsis. I think it would be interesting if somebody built a framework to identify these "hard problems" (questions with consistent wrong answers and high fluctuations on variations >> high uncertainty). One could then start building an "MMLU+ / hard benchmark dataset" that could serve as a) a new standard for evaluating reasoning, b) a starting point for better understand why these problems are so difficult for LLMs, and c) a starting point for fine-tuning efforts as a short term solution.

Expand full comment
1 more comment...

No posts