nutlope's comments

Thank you!


Should be up, please try again!


It let me upload a file, but didn't produce any output.


Hi all, I'm the author of llama-ocr. Thank you for sharing & for the kind comments! I built this earlier this week since I wanted a simple API to do OCR – it uses Llama 3.2 Vision (hosted on together.ai, where I work) to parse images into structured markdown. I also have it available as an npm package.

Planning to add a bunch of other features like the ability to parse PDFs, output a response in JSON, etc. If anyone has any questions, feel free to send them and I'll try to respond!


I put in a bill that has 3 identical line items and it didn't include them as 3 bullet points as usual, but generated a table with a "quantity" column that doesn't exist on the original paper.

Is this amount of larger transformation expected/desirable?

(It also means that the output is sometimes a bullet point list, sometimes a table, making further automatic processing a bit harder.)
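For anyone hitting the same list-vs-table inconsistency downstream, here is a minimal sketch of a normalizer that accepts either shape. `extractRows` is a hypothetical helper, not part of llama-ocr, and it assumes simple markdown: one item per bullet line, and pipe tables without escaped `|` characters inside cells.

```typescript
// Hypothetical helper (not part of llama-ocr): turns markdown that is either
// a bullet list or a pipe table into a uniform array of rows.
type Row = string[];

function extractRows(markdown: string): Row[] {
  const rows: Row[] = [];
  for (const raw of markdown.split("\n")) {
    const line = raw.trim();
    if (line.startsWith("|")) {
      // Table row: drop the outer pipes, then split into trimmed cells.
      const cells = line
        .replace(/^\|/, "")
        .replace(/\|$/, "")
        .split("|")
        .map((cell) => cell.trim());
      // Skip the separator row under the header, e.g. | --- | :--: |
      if (cells.every((cell) => /^:?-+:?$/.test(cell))) continue;
      rows.push(cells);
    } else {
      // Bullet line: "-", "*" or "+" followed by whitespace.
      const bullet = /^[-*+]\s+(.*)$/.exec(line);
      if (bullet) rows.push([bullet[1] ?? ""]);
    }
  }
  // Note: a table's header row is returned as the first row; callers that
  // only want data rows need to drop it themselves.
  return rows;
}
```

Bullet items come back as single-cell rows and table lines as multi-cell rows, so later steps can branch on row length instead of re-parsing the markdown.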


Here's the prompt being used, tweaking that might help: https://github.com/Nutlope/llama-ocr/blob/main/src/index.ts#...


I've had trouble with pulling scientific content out of poster PDFs, mostly because e.g. nougat falls apart with different layouts.

Have you considered that usage yet?


> Need an example image? Try ours.

Great idea, I wish more services had a similar feature.


How accurate is this?

When compared with existing OCR systems, what sorts of mistakes does it make?


Option to use a local LLM?


I made a script which does exactly the same thing but locally, using koboldcpp for inference. It downloads MiniCPM-V 2.6 with the image projector the first time you run it. If you want to use a different model you can, but you will want to edit the instruct template to match.

* https://github.com/jabberjabberjabber/LLMOCR


MiniCPM-V 2.6 is probably the best self-hosted vision model I have used so far, not just for OCR but also for image analysis. I have it set up so that my NVR (Frigate) sends a couple of images to Ollama running MiniCPM-V 2.6 whenever a driveway security camera raises a motion alert. I get a reasonably accurate description of the vehicle that pulled into the driveway, including the person who exits the vehicle and the license plate, all sent to my phone.


I love this. Can you share the source?


Hey! Have you tried out Edge Streaming yet? It uses the Edge Runtime which is a fraction of the cost of serverless functions and lets you stream responses for much longer than 10 seconds, giving you the "chatting" effect that you see on ChatGPT.

Docs: http://vercel.fyi/streaming
Example: https://vercel.com/blog/gpt-3-app-next-js-vercel-edge-functi...
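To make the streaming idea concrete, here is a rough sketch using only the web-standard `ReadableStream` and `Response` APIs. `streamWords` is a hypothetical name, the word list stands in for model output, and wiring the function into an actual Edge route is framework-specific and not shown here.

```typescript
// Hypothetical sketch: returns a Response whose body is produced chunk by
// chunk instead of being buffered in full before sending.
function streamWords(words: string[]): Response {
  const encoder = new TextEncoder();
  let index = 0;

  const body = new ReadableStream<Uint8Array>({
    // pull() is called whenever the consumer is ready for more data.
    pull(controller) {
      if (index < words.length) {
        controller.enqueue(encoder.encode(`${words[index]} `));
        index += 1;
      } else {
        controller.close();
      }
    },
  });

  return new Response(body, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
```

In a real handler the chunks would come from the upstream model's stream rather than a fixed array, which is what gives the gradual "chatting" effect on the client.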


I have not! Thanks for letting me know, I'll give it a try.


It's a conference registration site with a series of challenges, including a Wordle and a multiplayer experience with a prism built with Three.js.

