EarlyOom's comments

Shocking how poorly frontier models perform on simple visual tasks. Best-in-domain tool calling will become the norm.


VLM Run | Member of Technical Staff, ML Systems | Full-time | Hybrid Bay Area, CA | https://vlm.run | 150k-220k / yr + Equity

VLM Run is a first-of-its-kind API dedicated to running Vision Language Models on Documents, Images, and Video. We’re building a stack from the bottom up for ‘Visual’ applications of language models, which we believe will make up >90% of inference needs in the next 5 years.

Hybrid from Bay Area, CA

Looking for experience in any of the following:

* ML Domains: Vision Language Models, LLMs, Temporal/Video Models

* Model Training, Evaluation, and Versioning platforms: WnB, Huggingface

* Infra: Python, PyTorch, Pydantic, CUDA, torch.compile

* DevOps: GitHub CI, Docker, Conda, API Billing and Monitoring

https://vlm-run.notion.site/vlm-run-hiring-25q1


Hi, do you offer visa sponsorship / allow remote work from the EU (GMT+2)?


This is the main focus of VLM Run and typed extraction more generally. If you provide proper type constraints (e.g. with Pydantic) you can dramatically reduce the surface area for hallucination. Then there's actually fine-tuning on your dataset (we're working on this) to push accuracy beyond what you get from an unspecialized frontier model.
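For a rough illustration, this is the kind of typed schema we mean; the fields here are made up, not a real VLM Run schema:

    from datetime import date
    from pydantic import BaseModel, Field

    class LineItem(BaseModel):
        description: str
        quantity: int = Field(ge=1)        # must be a positive integer
        unit_price: float = Field(ge=0)    # no negative prices

    class Invoice(BaseModel):
        invoice_number: str
        issue_date: date                              # must parse as an ISO date
        currency: str = Field(pattern=r"^[A-Z]{3}$")  # e.g. "USD", "EUR"
        items: list[LineItem]
        total: float = Field(ge=0)

    # The JSON schema derived from the model can then be used to constrain decoding:
    print(Invoice.model_json_schema())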


Re type constraints: Not really. If one of the fields in my JSON is `name` but the model can’t read the name on the page, it will very happily make one up. Type constraints are good for making sure that your data is parseable, but they don’t do anything to fix the undetectable inaccuracy problem.

Fine-tuning does help, though.


Yes, both false positives and false negatives like the one you mentioned happen when the schema is ill-defined. Making `name` optional via `name: str | None` actually turns out to ensure that the model only fills it in when it's confident the field exists.

These are some of the nuances we had to work with during VLM fine-tuning with structured JSON.
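Concretely, the difference between the two schema variants being discussed (illustrative only):

    from pydantic import BaseModel

    class PersonStrict(BaseModel):
        # Required: the model must emit *something* for `name`, even if it
        # can't read it on the page -- which invites a confident guess.
        name: str

    class PersonLoose(BaseModel):
        # Optional: the schema gives the model an explicit "not found" path,
        # so it can return null instead of inventing a value.
        name: str | None = None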


You seem to be missing my point.


An effective way to increase accuracy is to use an ensemble of capable models that were trained independently (e.g., Gemini, GPT-4o, Qwen). If more than x% of them produce the same output, accept it; otherwise, reject it and review manually.
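Roughly, as a sketch (the `run_model` wrapper and the threshold are placeholders for whatever providers and tolerance you actually use):

    from collections import Counter

    def run_model(model_name: str, image_path: str) -> str:
        """Placeholder: call whichever API hosts `model_name` and return
        its extraction as a canonical JSON string."""
        raise NotImplementedError

    def ensemble_extract(image_path, model_names, threshold=2 / 3):
        outputs = [run_model(m, image_path) for m in model_names]
        top, votes = Counter(outputs).most_common(1)[0]
        if votes / len(outputs) >= threshold:
            return {"status": "accepted", "value": top}
        return {"status": "needs_review", "candidates": outputs}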


There’s a very low chance that three separate models will come up with the same result. There are always going to be errors, small or large. Even if you find a way around that, running the process three times on every page is going to be prohibitively expensive, especially if you want to finetune.


No, running it two or three times for every page isn't prohibitive. In fact, one of the arguments for using modern general-purpose multimodal models for historical HTR is that it is cheaper and faster than Transkribus.

What you can do, for instance, is ask one model for a transcription, then ask a second model to compare that transcription to the image and correct any errors it finds. You have a lot of budget to try things like this if the alternative is fine-tuning your own model.
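A sketch of that two-pass idea (the `ask_model` helper, model names, and prompts are placeholders):

    def ask_model(model_name: str, image_path: str, prompt: str) -> str:
        """Placeholder for a call to whichever multimodal API you use."""
        raise NotImplementedError

    def transcribe_with_review(image_path: str) -> str:
        draft = ask_model(
            "model-a", image_path,
            "Transcribe all text in this image exactly as written.",
        )
        corrected = ask_model(
            "model-b", image_path,
            "Compare this transcription to the image and correct any errors, "
            "returning only the corrected text:\n\n" + draft,
        )
        return corrected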


The odds of them getting the same result for any given patch should be very high if it is the correct result and they aren't garbage. The only times where they are not getting the same result would be the times when at least one has made a mistake. The odds of 3 different models making the same mistake should be low (unless it's something genuinely ambiguous like 0 vs O in a random alphanumeric string).

Best 2 out of 3 should be far more reliable than any model on its own. You could even weight their responses for different types of results: say model B is consistently better on serif fonts, so its vote counts for 1.5 times as much as the votes of models A and C.
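For instance, a weighted vote per patch could look something like this (weights and model names invented for illustration):

    from collections import defaultdict

    # Hypothetical weights: model B earns 1.5x on this class of input.
    WEIGHTS = {"model-a": 1.0, "model-b": 1.5, "model-c": 1.0}

    def weighted_vote(readings: dict[str, str]) -> tuple[str, float]:
        """`readings` maps model name -> its transcription of one patch.
        Returns the winning transcription and its share of total weight."""
        scores = defaultdict(float)
        for model, text in readings.items():
            scores[text] += WEIGHTS.get(model, 1.0)
        winner = max(scores, key=scores.get)
        return winner, scores[winner] / sum(scores.values())

    # Two of three agree, and the heavier model is on the winning side:
    print(weighted_vote({"model-a": "O0B7", "model-b": "00B7", "model-c": "00B7"}))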


That's not OCR.

It is an absolute miracle.

It is transmuting a picture into JSON.

I never thought this would be possible in my lifetime.

But that is different from what your interlocutor is discussing.


> I never thought this would be possible in my lifetime.

I used to work in Computer Vision and Image Processing. These days I utter this sentence on an almost daily basis. :-D


You can try out some of our schemas with Ollama if you want: https://github.com/vlm-run/vlmrun-hub (instructions in Readme)
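For example, something along these lines should work, assuming a recent Ollama with structured-output support and a locally pulled vision model (the model name, image, and schema below are placeholders rather than one of the hub schemas):

    import ollama
    from pydantic import BaseModel

    class Receipt(BaseModel):            # stand-in for a schema from the hub
        merchant: str | None = None
        total: float | None = None

    response = ollama.chat(
        model="llama3.2-vision",         # any local vision model you've pulled
        messages=[{
            "role": "user",
            "content": "Extract the receipt fields as JSON.",
            "images": ["receipt.jpg"],
        }],
        format=Receipt.model_json_schema(),   # constrain output to the schema
    )
    print(Receipt.model_validate_json(response["message"]["content"]))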


VLMs are able to take context into account when filling in fields, following either a global or a field-specific prompt. This is great for things like unlabeled axes or checking a legend for the unit to suffix after a number. You also catch lots of really simple errors with type hints (e.g. dates, addresses, country codes).
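As an illustration of field-level guidance (the fields here are hypothetical): descriptions act as per-field prompts, and the types catch trivially malformed values:

    from datetime import date
    from pydantic import BaseModel, Field

    class ChartPoint(BaseModel):
        x_label: str = Field(
            description="Axis label; if the axis is unlabeled, infer it from the legend."
        )
        value: float
        unit: str | None = Field(
            default=None,
            description="Unit taken from the legend, e.g. 'ms' or '%'.",
        )
        recorded_on: date | None = None          # must parse as a date
        country_code: str | None = Field(default=None, pattern=r"^[A-Z]{2}$")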


You can! It works with Ollama: https://github.com/vlm-run/vlmrun-hub

At the end of the day it's just schemas. You can decide for yourself if it's worth upgrading to a larger, more expensive model.


We can do bounding boxes too :) We just call it visual grounding: https://github.com/vlm-run/vlmrun-cookbook/blob/main/noteboo...
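The shape of a grounded result is roughly this (an illustrative sketch, not the exact format used in the cookbook):

    from pydantic import BaseModel, Field

    class BBox(BaseModel):
        # Normalized [0, 1] coordinates: (x1, y1) top-left, (x2, y2) bottom-right.
        x1: float = Field(ge=0, le=1)
        y1: float = Field(ge=0, le=1)
        x2: float = Field(ge=0, le=1)
        y2: float = Field(ge=0, le=1)

    class GroundedField(BaseModel):
        value: str        # the extracted text
        bbox: BBox        # where on the page it was read from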


Kind of skeptical since you also provide a “confidence” value, which has to be entirely made up.

Do you have an example that isn’t a sample driver's license? Something that is unlikely to have appeared in an LLM’s training data?


Wait what? That's pretty neat. I'm on my phone right now, so I can't really view the notebook very easily. How does this work? Are you using some kind of continual partitioning of the image and refeeding that back into the LLM to sort of pseudo-zoom in/out on the parts that contain non-cut off text until you can resolve that into rough coordinates?


We convert to a JSON schema, but it would be trivial to convert this to YAML. There are some minor differences in e.g. the tokens required to output JSON vs. YAML, which is why we've opted for our strategy.
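If you want to check the token difference yourself, a quick comparison (tokenizer and sample record are arbitrary, so treat the counts as indicative only):

    import json
    import yaml        # pip install pyyaml
    import tiktoken    # pip install tiktoken

    record = {"name": "Jane Doe", "date": "2024-11-05", "total": 42.5,
              "items": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]}

    enc = tiktoken.get_encoding("cl100k_base")
    print("json tokens:", len(enc.encode(json.dumps(record))))
    print("yaml tokens:", len(enc.encode(yaml.safe_dump(record))))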


OCR seems to be mostly solved for 'normal' text laid out according to Latin alphabet norms (left to right, normal spacing, etc.), but would love to see more adversarial examples. We've seen lots of regressions around faxed or scanned documents where the text boxes may be slightly rotated (e.g. https://www.cad-notes.com/autocad-tip-rotate-multiple-texts-...), not to mention handwriting and poorly scanned docs. Then there's contextually dependent information like X-axis labels that are implicit from a legend somewhere, so it's not clear even with the bounding boxes what the numbers refer to. This is where VLMs really shine: they can extract text and then use similar examples from the page to map it into the output values when the bounding box doesn't provide this for free.


That's a great point about the limitations of traditional OCR with rotated or poorly scanned documents. I agree that VLMs really shine when it comes to understanding context and extracting information beyond just the text itself. It's pretty cool how they can map implicit relationships, like those X-axis labels you mentioned.


Would love to chat! Reach out at scott@vlm.run.

