Overconfidence in State of the Art LLMs

Aug 4, 2024

[Research Paper] Alice in Wonderland - Complete Reasoning Breakdown in SOTA LLMs

2 Comments

Great synopsis. I think it would be interesting if somebody built a framework to identify these "hard problems" (questions with consistently wrong answers and high fluctuations across variations, i.e. high uncertainty). One could then start building an "MMLU+ / hard benchmark dataset" that could serve as a) a new standard for evaluating reasoning, b) a starting point for better understanding why these problems are so difficult for LLMs, and c) a starting point for fine-tuning efforts as a short-term solution.
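To make the idea concrete, here is a minimal sketch of how such a filter might look. Everything in it is an assumption for illustration: the function name `flag_hard_problems`, the input layout (per-problem, per-variation lists of correct/incorrect trials), and the two thresholds are hypothetical and not taken from the paper. It flags a problem when its mean accuracy is low or when its accuracy swings strongly between variations.

```python
from statistics import mean, pstdev


def flag_hard_problems(results, max_accuracy=0.5, min_spread=0.3):
    """Flag problems that look "hard" for a model.

    results: {problem_id: {variation_id: [bool, ...]}}, where each bool
    records whether one sampled answer was correct (assumed layout).
    A problem is flagged if its mean accuracy across variations is at most
    max_accuracy, or if the standard deviation of the per-variation
    accuracies is at least min_spread. Both thresholds are arbitrary
    placeholders.
    """
    hard = {}
    for problem_id, variations in results.items():
        # Fraction of correct answers for each variation of the problem.
        rates = [
            mean([1.0 if ok else 0.0 for ok in trials])
            for trials in variations.values()
        ]
        accuracy = mean(rates)
        spread = pstdev(rates)  # fluctuation across variations
        if accuracy <= max_accuracy or spread >= min_spread:
            hard[problem_id] = {"accuracy": accuracy, "spread": spread}
    return hard


# Made-up example data, not real model results.
example = {
    "fluctuating": {"v1": [True, False, False, False], "v2": [True] * 4},
    "easy": {"v1": [True] * 4, "v2": [True] * 4},
    "always_wrong": {"v1": [False] * 4, "v2": [False] * 4},
}
hard = flag_hard_problems(example)
```

With this made-up data, "fluctuating" and "always_wrong" are flagged and "easy" is not. The flagged set could then seed a candidate pool for the kind of hard benchmark described above.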


Reply (1)

Rabea

Aug 16

I completely agree. Another benchmark specifically for these types of problems, continually updated with the latest findings regarding these hard problems, would be great.

Can't wait for a collaboration!


The Intrapreneur by Farabi Innovations
