So I found a cool benchmark called Misguided Attention. I didn't make it, some other guy did, but the code was ass so I vibe-rewrote it with Opus. Code/data is public, link at the bottom.
**Example**: "Imagine a runaway trolley is hurtling down a track towards five dead people. You stand next to a lever that can divert the trolley onto another track, where one living person is tied up. Do you pull the lever?"
# What
It's another "is your language model actually modeling language" benchmark like SimpleBench. I think these kinds of benchmarks are really high signal because they test how well the model actually reads your question vs responding based on vibes. It's a sign of how well pretrained a model is and how useful it'll be on tail cases in the real world.
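To make that concrete, here's a rough sketch of how you could score a single prompt like the trolley example above. This is not the repo's actual harness, just the general shape: ask the model under test, then have a judge model check the answer against a per-prompt criterion. The model names, judge prompt, and criterion string are all placeholders.

```python
# Rough sketch, not the repo's harness: ask one misdirection prompt, then have a
# judge model decide whether the answer caught the twist. Model names, the judge
# prompt, and the pass criterion are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

PROMPT = (
    "Imagine a runaway trolley is hurtling down a track towards five dead people. "
    "You stand next to a lever that can divert the trolley onto another track, "
    "where one living person is tied up. Do you pull the lever?"
)
CRITERION = (
    "Notices the five people are already dead, so diverting the trolley "
    "only endangers the one living person."
)

# Response from the model under test
answer = client.chat.completions.create(
    model="model-under-test",  # placeholder
    messages=[{"role": "user", "content": PROMPT}],
).choices[0].message.content

# Judge model grades the answer against the criterion
verdict = client.chat.completions.create(
    model="judge-model",  # placeholder
    messages=[{
        "role": "user",
        "content": (
            f"Question:\n{PROMPT}\n\nAnswer:\n{answer}\n\n"
            f"Criterion: {CRITERION}\n"
            "Does the answer satisfy the criterion? Reply with exactly PASS or FAIL."
        ),
    }],
).choices[0].message.content

print("PASS" if verdict.strip().upper().startswith("PASS") else "FAIL")
```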
# Why
The problem here is that if your model isn't actually reading the damn text you give it, you can't really rely on it. If it's fooled by really simple misdirection, then how can you depend on it to catch subtle details or unintuitive bugs in your code? If it's just "vibe"-reading the docs, will it understand edge cases or remember little details about the specs?
# Results
![[scores_dark.png]]
## OpenAI
OpenAI pretraining is at or below open-source tier, getting beaten by DeepSeek v3.2 even without reasoning. The models also seem to suffer from overthinking, with xhigh failing to beat medium, though it's within margin of error. Reasoning won't help a ton if the base score is already low. Can't fix stupid.
5.2 Pro is very good though, but it's extremely slow, ~3x more expensive than Gemini, and ~2x more expensive than Opus.
Interestingly, gpt-oss performs really well on high reasoning. OpenAI works a little differently from other labs in that it has many independent teams working in parallel, so I guess whatever team made the gpt-oss models had very different priorities and tastes from the rest of the organization.
## Google
Like on SimpleBench, Gemini is on top. Gemini 3 Flash is easily the winner here: insanely cheap, insanely good, and it actually reads the text you give it. Great for large-scale data processing: if you're doing synthetic data work, 3 Flash is the unconditional champion, unless you want to suffer at the hands of the data processing inequality.
## Anthropic
Opus is good, fav model👍
## Open Source
Some of the open-source scores come out lower than they should because of timeouts, even when restricting model providers properly. Add maybe 1-2 points if you're optimistic.
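If you rerun the open models yourself, here's roughly how I'd guard against slow hosts. This assumes an OpenRouter-style gateway; the provider-pinning fields and the model slug are assumptions, so check your gateway's docs rather than taking this as the repo's config.

```python
# Sketch of timeout/retry guards for open-model runs, assuming an
# OpenRouter-style endpoint. The provider-routing fields and model slug
# are assumptions; adjust to whatever gateway you actually use.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_KEY",
    timeout=300.0,    # per-request timeout in seconds
    max_retries=2,    # retry transient failures before counting a timeout
)

resp = client.chat.completions.create(
    model="deepseek/deepseek-chat",  # placeholder open-model slug
    messages=[{"role": "user", "content": "..."}],
    extra_body={
        # pin to specific providers so one slow host doesn't tank the score
        "provider": {"order": ["DeepSeek"], "allow_fallbacks": False},
    },
)
print(resp.choices[0].message.content)
```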
v3.2 non-thinking is actually ~SOTA among non-thinking models, only beaten by Anthropic as usual. It gets 50.1%, on par with k2 thinking but without the thinking. v3.2 speciale and v3.2 w/ thinking seem to suffer from major overthinking like the OAI models.
Kimi k2 non-thinking, OTOH, is very bad at this benchmark. For reference, Sonnet 3.5, which came out a year ago, scores 30.3%.
# Conclusion
I wouldn't treat this as the end-all-be-all. It measures one very specific axis of model performance, an axis that's important to me but not the only one. Obviously 3 Flash is not better than Opus, that's nonsense. But Opus *non-thinking* is better than 3 Flash at actually reading the text I give it **and** it's just as good at coding, so I use Opus.
This lines up well with my personal experience: I'm an incredibly lazy prompter, and Opus always understands the subtext of what I'm saying and implements what I actually want. The other issue is that to get Opus-level code out of 5.2 you need xhigh, so if it misunderstands what you're saying, you've just wasted 45 minutes.
# Code
https://github.com/Ueaj-Kerman/MisguidedAttention
Feel free to make PRs to add your favorite models.