
Yup, evals can definitely be tough. We have a suite of several hundred web data extraction evals in a tool we built called Bananalyzer [1]. It's made it pretty straightforward for us to benchmark how accurately our agent generates code when it uses Tarsier-text (+ GPT-4) for perception vs. Tarsier-screenshot (+ GPT-4V/o).

Will have to look into supporting Azure OCR in Tarsier then—thanks for the tip!

[1] https://github.com/reworkd/bananalyzer
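
For a rough sense of what that comparison looks like, here's a purely illustrative sketch (not Bananalyzer's actual API; every name below is hypothetical): run the same eval suite twice, swapping only the perception mode, and compare exact-match accuracy.

    # Illustrative sketch only -- all names are hypothetical, not Bananalyzer's API.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Example:
        example_id: str
        url: str
        expected: dict  # labeled extraction for this page

    def run_suite(examples: list[Example],
                  run_agent: Callable[[Example, str], dict],
                  perception_mode: str) -> float:
        """Fraction of examples where the agent's extraction matches the label."""
        passed = sum(run_agent(ex, perception_mode) == ex.expected for ex in examples)
        return passed / len(examples)

    # Usage, with a stubbed agent for illustration; a real agent would browse ex.url.
    def fake_agent(example: Example, mode: str) -> dict:
        return example.expected

    suite = [Example("listing-1", "https://example.com", {"price": "$10"})]
    text_acc = run_suite(suite, fake_agent, "tarsier-text")          # GPT-4
    vision_acc = run_suite(suite, fake_agent, "tarsier-screenshot")  # GPT-4V / GPT-4o
    print(f"text: {text_acc:.2%}  screenshot: {vision_acc:.2%}")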



Awesome, will take a look at this. Thank you!


Neat. Do you have the Bananalyzer eval results for Tarsier published somewhere?


We're hoping to release an evals paper about Bananalyzer this summer and compare Tarsier to a variety of other perception systems in it. The hard part with evaluating a perception/context system, though, is that it's very intertwined with the agent's architecture, and that's not something we're comfortable fully open-sourcing yet. We'll have to think of interesting ways to decouple perception systems from the agent and eval them with Bananalyzer.
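
To give a concrete (and purely illustrative) sense of what that decoupling could look like: a thin perception interface the harness swaps implementations behind while the agent and scoring stay fixed. None of the names below reflect our actual internals.

    # Illustrative sketch only -- hypothetical interface, not our real code.
    from typing import Protocol

    class PerceptionSystem(Protocol):
        def perceive(self, page_html: str, screenshot_png: bytes | None) -> str:
            """Return a model-ready representation of the page."""
            ...

    class TarsierText:
        def perceive(self, page_html: str, screenshot_png: bytes | None) -> str:
            # Text path: tagged text for a text-only LLM.
            return "<tagged text representation>"

    class TarsierScreenshot:
        def perceive(self, page_html: str, screenshot_png: bytes | None) -> str:
            # Vision path: annotated screenshot for a multimodal LLM.
            return "<annotated screenshot reference>"

    def run_eval(perception: PerceptionSystem, page_html: str) -> str:
        # The agent and scoring logic stay fixed; only perception changes.
        return perception.perceive(page_html, None)

    print(run_eval(TarsierText(), "<html>...</html>"))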



