2 Comments
Ryan

Great synopsis. I think it would be interesting if somebody built a framework to identify these "hard problems" (questions with consistently wrong answers and high fluctuation across variations, indicating high uncertainty). One could then start building an "MMLU+ / hard benchmark dataset" that could serve as a) a new standard for evaluating reasoning, b) a starting point for better understanding why these problems are so difficult for LLMs, and c) a starting point for fine-tuning efforts as a short-term solution.

Rabea

I completely agree. Another benchmark specifically for these types of problems, continually updated with the latest findings regarding these hard problems, would be great.

Can't wait for a collaboration!
