EarlyOom's comments

Shocking how poorly frontier models perform on simple visual tasks. Best-in-domain tool calling will become the norm.


VLM Run | Member of Technical Staff, ML Systems | Full-time | Hybrid Bay Area, CA | https://vlm.run | 150k-220k / yr + Equity

VLM Run is a first-of-its-kind API dedicated to running Vision Language Models on Documents, Images, and Video. We’re building a stack from the bottom up for ‘Visual’ applications of language models, which we believe will make up >90% of inference needs in the next 5 years.

Hybrid from Bay Area, CA

Looking for experience in any of the following:

* ML Domains: Vision Language Models, LLMs, Temporal/Video Models

* Model Training, Evaluation, and Versioning platforms: WnB, Huggingface

* Infra: Python, PyTorch, Pydantic, CUDA, torch.compile

* DevOps: GitHub CI, Docker, Conda, API Billing and Monitoring

https://vlm-run.notion.site/vlm-run-hiring-25q1


Hi, do you offer visa sponsorship / allow remote work from the EU (GMT+2)?


This is the main focus of VLM Run and typed extraction more generally. If you provide proper type constraints (e.g. with Pydantic) you can dramatically reduce the surface area for hallucination. Then there's actually fine-tuning on your dataset (we're working on this) to push accuracy beyond what you get from an unspecialized frontier model.
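For a rough illustration, this is the kind of typed schema we mean; the fields here are made up, not a real VLM Run schema:

    from datetime import date
    from pydantic import BaseModel, Field

    class LineItem(BaseModel):
        description: str
        quantity: int = Field(ge=1)        # must be a positive integer
        unit_price: float = Field(ge=0)    # no negative prices

    class Invoice(BaseModel):
        invoice_number: str
        issue_date: date                              # must parse as an ISO date
        currency: str = Field(pattern=r"^[A-Z]{3}$")  # e.g. "USD", "EUR"
        items: list[LineItem]
        total: float = Field(ge=0)

    # The JSON schema derived from the model can then be used to constrain decoding:
    print(Invoice.model_json_schema())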


Re type constraints: Not really. If one of the fields in my JSON is `name` but the model can’t read the name on the page, it will very happily make one up. Type constraints are good for making sure that your data is parseable, but they don’t do anything to fix the undetectable inaccuracy problem.

Fine-tuning does help, though.


Yes, both false positives and false negatives like the one you mentioned happen when the schema is ill-defined. Making `name` optional via `name: str | None` actually turns out to ensure that the model only fills it in when it's confident the field exists.

These are some of the nuances we had to work with during VLM fine-tuning with structured JSON.
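Concretely, the difference between the two schema variants being discussed (illustrative only):

    from pydantic import BaseModel

    class PersonStrict(BaseModel):
        # Required: the model must emit *something* for `name`, even if it
        # can't read it on the page -- which invites a confident guess.
        name: str

    class PersonLoose(BaseModel):
        # Optional: the schema gives the model an explicit "not found" path,
        # so it can return null instead of inventing a value.
        name: str | None = None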


You seem to be missing my point.


An effective way to increase accuracy is to use an ensemble of capable models that were trained independently (e.g., Gemini, GPT-4o, Qwen). If more than x% of them produce the same output, accept it; otherwise, reject it and review manually.
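Roughly, as a sketch (the `run_model` wrapper and the threshold are placeholders for whatever providers and tolerance you actually use):

    from collections import Counter

    def run_model(model_name: str, image_path: str) -> str:
        """Placeholder: call whichever API hosts `model_name` and return
        its extraction as a canonical JSON string."""
        raise NotImplementedError

    def ensemble_extract(image_path, model_names, threshold=2 / 3):
        outputs = [run_model(m, image_path) for m in model_names]
        top, votes = Counter(outputs).most_common(1)[0]
        if votes / len(outputs) >= threshold:
            return {"status": "accepted", "value": top}
        return {"status": "needs_review", "candidates": outputs}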


There’s a very low chance that three separate models will come up with the same result. There are always going to be errors, small or large. Even if you find a way around that, running the process three times on every page is going to be prohibitively expensive, especially if you want to finetune.


No, running it two or three times for every page isn't prohibitive. In fact, one of the arguments for using modern general-purpose multimodal models for historical HTR is that it is cheaper and faster than Transkribus.

What you can do, for instance, is ask one model for a transcription, then ask a second model to compare that transcription to the image and correct any errors it finds. You have a lot of budget to try things like this if the alternative is fine-tuning your own model.
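A sketch of that two-pass idea (the `ask_model` helper, model names, and prompts are placeholders):

    def ask_model(model_name: str, image_path: str, prompt: str) -> str:
        """Placeholder for a call to whichever multimodal API you use."""
        raise NotImplementedError

    def transcribe_with_review(image_path: str) -> str:
        draft = ask_model(
            "model-a", image_path,
            "Transcribe all text in this image exactly as written.",
        )
        corrected = ask_model(
            "model-b", image_path,
            "Compare this transcription to the image and correct any errors, "
            "returning only the corrected text:\n\n" + draft,
        )
        return corrected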


The odds of them getting the same result for any given patch should be very high if it is the correct result and they aren't garbage. The only times where they are not getting the same result would be the times when at least one has made a mistake. The odds of 3 different models making the same mistake should be low (unless it's something genuinely ambiguous like 0 vs O in a random alphanumeric string).

Best 2 out of 3 should be far more reliable than any model on its own. You could even weight their responses for different types of results: say model B is consistently better on serif fonts, so its vote counts for 1.5 times as much as the votes of models A and C.
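For instance, a weighted vote per patch could look something like this (weights and model names invented for illustration):

    from collections import defaultdict

    # Hypothetical weights: model B earns 1.5x on this class of input.
    WEIGHTS = {"model-a": 1.0, "model-b": 1.5, "model-c": 1.0}

    def weighted_vote(readings: dict[str, str]) -> tuple[str, float]:
        """`readings` maps model name -> its transcription of one patch.
        Returns the winning transcription and its share of total weight."""
        scores = defaultdict(float)
        for model, text in readings.items():
            scores[text] += WEIGHTS.get(model, 1.0)
        winner = max(scores, key=scores.get)
        return winner, scores[winner] / sum(scores.values())

    # Two of three agree, and the heavier model is on the winning side:
    print(weighted_vote({"model-a": "O0B7", "model-b": "00B7", "model-c": "00B7"}))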


That's not OCR.

It is an absolute miracle.

It is transmuting a picture into JSON.

I never thought this would be possible in my lifetime.

But that is different from what your interlocutor is discussing.


> I never thought this would be possible in my lifetime.

I used to work in Computer Vision and Image Processing. These days I utter this sentence on an almost daily basis. :-D


You can try out some of our schemas with Ollama if you want: https://github.com/vlm-run/vlmrun-hub (instructions in Readme)
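For example, something along these lines should work, assuming a recent Ollama with structured-output support and a locally pulled vision model (the model name, image, and schema below are placeholders rather than one of the hub schemas):

    import ollama
    from pydantic import BaseModel

    class Receipt(BaseModel):            # stand-in for a schema from the hub
        merchant: str | None = None
        total: float | None = None

    response = ollama.chat(
        model="llama3.2-vision",         # any local vision model you've pulled
        messages=[{
            "role": "user",
            "content": "Extract the receipt fields as JSON.",
            "images": ["receipt.jpg"],
        }],
        format=Receipt.model_json_schema(),   # constrain output to the schema
    )
    print(Receipt.model_validate_json(response["message"]["content"]))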


VLMs are able to take context into account when filling in fields, following either a global or a field-specific prompt. This is great for things like unlabeled axes or checking a legend for the unit to suffix after a number. You also catch lots of really simple errors with type hints (e.g. dates, addresses, country codes).
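As an illustration of field-level guidance (the fields here are hypothetical): descriptions act as per-field prompts, and the types catch trivially malformed values:

    from datetime import date
    from pydantic import BaseModel, Field

    class ChartPoint(BaseModel):
        x_label: str = Field(
            description="Axis label; if the axis is unlabeled, infer it from the legend."
        )
        value: float
        unit: str | None = Field(
            default=None,
            description="Unit taken from the legend, e.g. 'ms' or '%'.",
        )
        recorded_on: date | None = None          # must parse as a date
        country_code: str | None = Field(default=None, pattern=r"^[A-Z]{2}$")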


You can! It works with Ollama: https://github.com/vlm-run/vlmrun-hub

At the end of the day it's just schemas. You can decide for yourself if it's worth upgrading to a larger, more expensive model.


We can do bounding boxes too :) We just call it visual grounding: https://github.com/vlm-run/vlmrun-cookbook/blob/main/noteboo...
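The shape of a grounded result is roughly this (an illustrative sketch, not the exact format used in the cookbook):

    from pydantic import BaseModel, Field

    class BBox(BaseModel):
        # Normalized [0, 1] coordinates: (x1, y1) top-left, (x2, y2) bottom-right.
        x1: float = Field(ge=0, le=1)
        y1: float = Field(ge=0, le=1)
        x2: float = Field(ge=0, le=1)
        y2: float = Field(ge=0, le=1)

    class GroundedField(BaseModel):
        value: str        # the extracted text
        bbox: BBox        # where on the page it was read from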


Kind of skeptical since you also provide a “confidence” value, which has to be entirely made up.

Do you have an example that isn’t a sample driver's license? Something that is unlikely to have appeared in an LLM’s training data?


Wait what? That's pretty neat. I'm on my phone right now, so I can't really view the notebook very easily. How does this work? Are you using some kind of continual partitioning of the image and refeeding that back into the LLM to sort of pseudo-zoom in/out on the parts that contain non-cut off text until you can resolve that into rough coordinates?


We convert to a JSON schema, but it would be trivial to convert this to YAML. There are some minor differences in e.g. the tokens required to output JSON vs. YAML, which is why we've opted for our strategy.
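If you want to check the token difference yourself, a quick comparison (tokenizer and sample record are arbitrary, so treat the counts as indicative only):

    import json
    import yaml        # pip install pyyaml
    import tiktoken    # pip install tiktoken

    record = {"name": "Jane Doe", "date": "2024-11-05", "total": 42.5,
              "items": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]}

    enc = tiktoken.get_encoding("cl100k_base")
    print("json tokens:", len(enc.encode(json.dumps(record))))
    print("yaml tokens:", len(enc.encode(yaml.safe_dump(record))))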


OCR seems to be mostly solved for 'normal' text laid out according to Latin alphabet norms (left to right, normal spacing, etc.), but would love to see more adversarial examples. We've seen lots of regressions around faxed or scanned documents where the text boxes may be slightly rotated (e.g. https://www.cad-notes.com/autocad-tip-rotate-multiple-texts-...), not to mention handwriting and poorly scanned docs. Then there's contextually dependent information like X-axis labels that are implicit from a legend somewhere, so it's not clear even with the bounding boxes what the numbers refer to. This is where VLMs really shine: they can extract text and then use similar examples from the page to map it into the output values when the bounding box doesn't provide this for free.


That's a great point about the limitations of traditional OCR with rotated or poorly scanned documents. I agree that VLMs really shine when it comes to understanding context and extracting information beyond just the text itself. It's pretty cool how they can map implicit relationships, like those X-axis labels you mentioned.


Would love to chat! Reach out at scott@vlm.run.

