Vision researchers have been hard at work over the past few years getting our language models to understand images. God knows what they've been doing, because it's clearly not working. Even SOTA image models struggle on really basic problems, and feel a lot more like GPT-3 than AGI.
# Misdirected Attention
Unfortunately, it seems we language researchers [[Misguided Attention|also don't know what we're doing]], so I took inspiration from that benchmark and made a visual equivalent. I like that benchmark and SimpleBench quite a lot because they measure a model's meticulousness and attention to detail. This really matters, because these edge cases are where all the value is! I want a model that actually puts effort into interpreting what I write instead of just responding based on vibes.
# Why
The specific wording of a passage or paper is really important; even understanding a single sentence often requires thoughtful reasoning, not a single forward pass. If the model simply scans the semantic gist of the text without actually reading and deeply understanding it, then on text outside the distribution, like new research papers, it won't be able to meaningfully build on it.
In some of my blogs I put a lot of effort into the exact synonyms and words I use in order to convey my point succinctly. If models are simply pattern matching sentence structures without reading the actual text, they're going to miss the fine details that matter for research and novel idea creation.
# How
To measure this on vision models, I took many common optical illusions and perturbed them to make them anti-illusions. Some were sourced from satire accounts on Twitter, some were made manually, some were generated programmatically, and some come from the r/antimeme and r/notinteresting subreddits.
A judge, in this case Opus 4.5 non-thinking, gets the answer, the image, the question, and instructions that it is in fact a trick question. The criteria are rather vague and frankly just based on Opus's preferences. I chose Opus because its instruction following is actually reliable: I could trust it not to override my instructions based on its own (flawed) senses, and its discretion is pretty good as well.
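The judging call is roughly shaped like this. This is a minimal sketch, not the repo's actual code: the function name, prompt wording, and PASS/FAIL format are my own simplification, and the exact model string is an assumption.

```python
import base64

JUDGE_MODEL = "claude-opus-4-5"  # assumption: exact model identifier may differ


def build_judge_messages(question: str, model_answer: str, image_path: str) -> list:
    """Assemble the judge's input: the image, the question, the graded
    model's answer, and an explicit warning that the question is a trick."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    prompt = (
        "You are grading a vision model on a trick question. The image is "
        "deliberately perturbed, so do not trust your own first impression "
        "of it.\n\n"
        f"Question: {question}\n"
        f"Model's answer: {model_answer}\n\n"
        "Reply with PASS or FAIL."
    )
    # Anthropic-style message payload: one user turn containing an
    # image block followed by a text block.
    return [{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/png",
                        "data": image_b64}},
            {"type": "text", "text": prompt},
        ],
    }]
```

The payload would then be sent to the judge model with thinking disabled; the "this is a trick question" warning is what keeps the judge from grading based on its own perception of the illusion.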
![[curved_lines.png]]
Here's an example: on the left is an anti-illusion, where the lines actually are curved. On the right is the original optical illusion, where the lines look curved but aren't.
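The programmatic generation amounts to taking the base image of an illusion and adding a real version of the distortion the illusion only fakes. Here's a stdlib-only sketch of that idea for the curved-lines case (the repo's actual generator may work differently, and I've left out the background pattern that creates the illusion): a single `amplitude` parameter controls whether the lines are genuinely bowed (anti-illusion) or perfectly straight (illusion base).

```python
import math


def bowed_line_points(y, width, amplitude, n=50):
    """Sample n points along a horizontal line at height y, bowed by a
    half-sine of the given amplitude. amplitude=0 gives a straight line."""
    return [(width * i / (n - 1),
             y + amplitude * math.sin(math.pi * i / (n - 1)))
            for i in range(n)]


def curved_lines_svg(width=400, height=300, n_lines=6, amplitude=4.0):
    """Render a stack of horizontal lines as an SVG string. With
    amplitude > 0 the lines really are curved (the anti-illusion); with
    amplitude = 0 they are straight (the classic illusion's base image)."""
    paths = []
    for k in range(n_lines):
        y = height * (k + 1) / (n_lines + 1)
        pts = bowed_line_points(y, width, amplitude)
        d = "M " + " L ".join(f"{px:.1f} {py:.1f}" for px, py in pts)
        paths.append(f'<path d="{d}" stroke="black" fill="none"/>')
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{width}" height="{height}">' + "".join(paths) + "</svg>")
```

The nice property of generating pairs this way is that the question, layout, and style stay identical between the illusion and the anti-illusion, so the only thing separating a pass from a fail is whether the model actually looks.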
# Results
![[scores_dark 1.png]]
# Google
As expected, the indexers of the internet and creators of Google Images and YouTube trounce everyone else in multimodal. It's frankly not even close, especially at the price point.
It does seem like the models suffer a little from overthinking, a common failure mode. If they aren't reading the actual question in the first place, they certainly won't be reading their own thought trace meaningfully either, and will overfit on phrases in their thought stream, like when they recognize the illusion the image is based on.
# OpenAI
Second place and not bad, but it suffers heavily from overthinking. Interestingly, we saw the same ranking on Misguided Attention, where gpt5.2-medium was also their best performing model. Maybe they simply don't train the model on multimodal tasks at xhigh+ reasoning effort?
5.2pro would be a really good model if it actually worked; instead, it either takes too long and times out or fails with some other error, which frankly I'm not interested in fixing. This should be penalized, especially given that the competing solutions are significantly faster.
# Anthropic
What's fascinating about the results so far is that they mostly mirror [[Misguided Attention]], which suggests attention to these edge cases is a lab-level trait: either a lab explicitly trains for it, or it doesn't.
There's one notable exception to this rule, though, which is Anthropic. Poor Opus is blind. This is expected, as their models generally score really poorly on multimodal tasks. They should probably focus more on it now that they're branching into things like Claude Chrome, Desktop, and Cowork, all of which would benefit from better multimodal performance.
Very interestingly, though, reasoning helps a lot for Opus. This actually makes sense: in Misguided Attention it was one of the best performing models even when non-thinking, and a model uniquely good at nuance wouldn't get tripped up by its own reasoning trace.
# Open Source
Yeah, it's pretty good. GLM 4.6V seems to do really well, so for vision tasks I would recommend it if you need the speed. 5.2 non-reasoning is probably a better choice overall, though; OS models pay much less attention to detail on edge cases compared to the frontier labs, which have a lot more data.
## Moondream
One of my favorite vision models; unfortunately, it's bottlenecked by its poor instruction following.
Here's what Opus had to say about it:
![[Pasted image 20260128175921.png]]
# Code
https://github.com/Ueaj-Kerman/ScrupulousnessBench