Traditional OCR models are trained for a single task: recognizing characters. They do this through visual features (and sometimes there's an implicit, or even explicit, "language" model: see https://arxiv.org/abs/1805.09441). As such, their "hallucinations", or errors, are mostly limited to ambiguous characters, e.g. 0 vs O (that's where the implicit language model comes in). Because they're trained with a singular purpose, you would expect their confidence scores (i.e. logprobs) to be well calibrated. Also, depending on the model, you usually do text detection (get bounding boxes) followed by text recognition (read the characters), so the task is fairly local (you're only ever dealing with a small crop).
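To make the two-stage shape concrete, here's a minimal Python sketch; the detector and recognizer below are placeholder stubs standing in for whatever models you'd actually use, not any real library's API:

    from dataclasses import dataclass

    def detect_text_regions(image):
        # Placeholder detector: a real one (e.g. a DB/EAST-style model) returns word/line boxes.
        return [(0, 0, 120, 32)]

    def recognize_crop(image, box):
        # Placeholder recognizer: a real one (e.g. a CRNN+CTC-style model) decodes
        # characters and their softmax probabilities from the cropped region only.
        return list("INV0ICE"), [0.99, 0.98, 0.97, 0.55, 0.99, 0.98, 0.99]

    @dataclass
    class OcrResult:
        box: tuple        # (x, y, w, h) from the detector
        text: str         # decoded characters
        char_confs: list  # per-character probabilities

    def run_ocr(image, min_conf=0.9):
        results, flagged = [], []
        for box in detect_text_regions(image):         # stage 1: localisation
            chars, probs = recognize_crop(image, box)  # stage 2: recognition on a local crop
            results.append(OcrResult(box, "".join(chars), probs))
            flagged += [(c, p) for c, p in zip(chars, probs) if p < min_conf]
        return results, flagged

    _, flagged = run_ocr(image=None)
    print(flagged)  # [('0', 0.55)] – the 0 vs O ambiguity surfaces as a low score

The point is that the confidences fall out of a model trained on exactly one task, so thresholding them to route ambiguous characters to a human is reasonably trustworthy.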
On the other hand, these VLMs are very generic models – yes, they're trained on OCR tasks, but also on a dozen other tasks. As such, they're really good OCR models, but they tend not to be as well calibrated. We use VLMs at work (Qwen2-VL to be specific), and we don't find that it hallucinates very often, but we're not dealing with long documents. I would assume that as you deal with longer documents (or more of them at once), the context gets much larger, which increases the chances of the model getting confused and hallucinating.
I would love to hear more about your experiences running a maths circle here in the UK. My two daughters are a little young (3.5 and 0.5), but this article has inspired me to get the ball rolling.
It's been really rewarding. I definitely recommend jumping in. I started with my reception-age child (+ school friends), and have just extended it to their younger sibling (+ friends from nursery). Your 3.5 year old will have started the EYFS (Early Years Foundation Stage) syllabus at nursery if they attend (it's also what they do in the 'Reception' year at primary school, before starting the national curriculum in Year 1), so they will already be exposed to counting and comparisons. The perfect time to get started, in other words!
There's some NRICH-funded research showing that exposure to symmetry and reasoning at this level is much more predictive of future ability than numbers and counting. I think when parents try to help at the early stages, they often focus on e.g. getting their kids to count to 100, which is conceptually identical to counting to 10.
For number fluency there is the free White Rose '1 minute maths' app, which does a very nice job of gamifying subitising and the like. A lot of primary schools in London seem to have adopted the White Rose teaching resources.
https://whiteroseeducation.com/1-minute-maths
I think it's because most of the compute comes from the decoding, since you're doing it autoregressively, while you feed the input through the encoder just once to get the embedding. So really all it's saying is that the decoder, with N parameters, is the compute bottleneck; hence an encoder-decoder with N+N parameters has a compute cost of a similar order to a decoder-only model with N.
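As a back-of-the-envelope check, using the usual ~2 × params FLOPs-per-token approximation (the token counts below are made up for illustration, not taken from the paper):

    N = 1e9                  # parameters per stack
    L_in, L_out = 2000, 500  # input tokens and generated tokens (illustrative)

    # Decoder-only with N params: every token, prompt and generated, goes
    # through the full stack once each (assuming a KV cache).
    decoder_only = 2 * N * (L_in + L_out)

    # Encoder-decoder with N + N params: the encoder reads the input once,
    # then the decoder runs once per generated token.
    enc_dec = 2 * N * L_in + 2 * N * L_out

    print(decoder_only, enc_dec)  # both ~5e12 FLOPs: same order of magnitude

(This crude estimate ignores cross-attention and the quadratic attention term, which is why the two come out identical here. The intuition is that each input token passes through an N-parameter stack exactly once in both setups – the encoder in one case, the decoder's prompt processing in the other – and likewise for each generated token, so the totals line up.)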
Wow, I did not expect to see David's notation here on HN. The only problem with the notation is that it becomes so second nature that you forget it's not standard!
SS&C Blue Prism | Machine Learning Research Engineer | Remote (UK)
SS&C Blue Prism allows organizations to deliver transformational business value via our intelligent automation platform. We make products with one aim in mind - to improve experiences for people. By connecting people and digital workers you can use the right resource, every time, for the best customer and business outcomes. We supply enterprise-wide software that not only provides full control and governance, but also allows businesses to react fast to continuous change.
---
We are looking for talented and driven individuals who are passionate about developing new technology to join us as a Machine Learning Research Engineer on the AI Labs team. We are developing a new approach to Robotic Process Automation (RPA) for GUIs based on machine learning - developed completely in-house and driven by the R&D team.
The only thing keeping me from moving completely from Firefox to Chrome is Pentadactyl (http://dactyl.sourceforge.net/pentadactyl/), plus Firefox's better Greasemonkey scriptability. Firefox on Mac has always had memory-hogging issues, but I really can't live without my vim overlay (I've tried the Chrome alternatives, like Vimium, but they pale in comparison).
Given what people have said about Firefox's hackability, I don't think there will ever be a Chrome extension as good as Pentadactyl.