taesiri's comments | Hacker News

For over-represented concepts, like popular brands, it seems that the model “ignores” the details once it detects that the overall shapes or patterns are familiar. Opening up the vision encoders to see how these images cluster in the embedding space should give better insight.
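
A rough sketch of the kind of probe I mean, assuming an off-the-shelf CLIP encoder via HuggingFace transformers; the model name, the hypothetical image files, and the use of k-means are all my own choices for illustration:

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor
    from sklearn.cluster import KMeans

    # Embed original and counterfactual images with the same vision encoder.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    paths = ["adidas_3_stripes.png", "adidas_like_4_stripes.png",
             "dog_4_legs.png", "dog_5_legs.png"]  # hypothetical files
    images = [Image.open(p).convert("RGB") for p in paths]

    with torch.no_grad():
        inputs = processor(images=images, return_tensors="pt")
        embeds = model.get_image_features(**inputs)                # (N, 512)
        embeds = torch.nn.functional.normalize(embeds, dim=-1)

    # If a counterfactual lands in the same cluster as its "popular" original,
    # the encoder is likely collapsing the detail we care about.
    labels = KMeans(n_clusters=2, n_init="auto").fit_predict(embeds.numpy())
    print(dict(zip(paths, labels)))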


Yes, and this can probably be addressed with methods from fairness research.

I used to believe that fairness research could be ignored, that it was all rubbish, but researchers in that area at least try to do something about problems like unbalanced datasets. I'm still not sure I totally believe in it, though.
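
For what it's worth, the simplest version of that kind of fix is just reweighting the loss by inverse class frequency. A minimal PyTorch sketch with made-up counts:

    import torch
    import torch.nn as nn

    # Made-up class counts: overwhelmingly 4-legged dogs, a few 5-legged ones.
    class_counts = torch.tensor([9500.0, 500.0])
    weights = class_counts.sum() / (len(class_counts) * class_counts)

    # Mistakes on the rare class now cost ~19x more than on the common one.
    criterion = nn.CrossEntropyLoss(weight=weights)

    logits = torch.randn(8, 2)            # dummy model outputs
    targets = torch.randint(0, 2, (8,))   # dummy labels
    loss = criterion(logits, targets)
    print(weights, loss.item())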


If there aren't any five-legged dogs in your trainset, it's safer[0] to just remember that all dogs are four-legged than to actually recognize and count legs. After all, you might have a few images of dogs in your trainset that are misleading enough to look five-legged (e.g. because a dog is in front of another dog).
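
To put rough, made-up numbers on that incentive (1% genuinely five-legged dogs in the train set, an honest counter fooled by occlusion on 5% of four-legged images):

    # Made-up rates purely to illustrate the incentive.
    p_five_legged = 0.01   # true 5-legged examples in the train set
    p_miscount    = 0.05   # honest counter fooled by occlusion on 4-legged dogs

    err_always_four = p_five_legged                      # 0.0100
    err_counting    = (1 - p_five_legged) * p_miscount   # 0.0495

    # Always answering "4" has the lower error rate, so SGD has no
    # incentive to learn to actually count legs.
    print(err_always_four, err_counting)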

Overrepresentation is a different source of bias. That's what gives you, say, image generators that always draw "golden 1970s sci-fi robot" as C-3PO even when given additional instructions to draw something else.

Both of these problems are manifestations of the difference between training and deployment distributions. OK, I guess you could say that four-legged dogs are "overrepresented" in the training set, but that's because four-legged dogs are also overrepresented in reality. The deployment distribution doesn't have five-legged dogs in it. What we've done instead is concoct an adversarial distribution to force a train/deploy gap where none would otherwise exist.

Releasing the vision encoder won't help because weights are opaque. Stochastic gradient descent does not yield functional internal representations[1]; it fills the bucket of parameters with one distribution and one distribution only. We could tell if, say, the vision encoder produces identical embeddings for dogs regardless of leg count, or for other counterfactuals (a sketch of that kind of check follows after the footnotes), but not much more than that.

[0] Lower loss and possibly lower L2-norm

[1] https://arxiv.org/abs/2505.11581
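
Here is what such a counterfactual check might look like, again assuming CLIP via HuggingFace transformers and hypothetical image files rather than whatever encoder is actually behind the paper's models:

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def embed(path):
        # Returns a unit-normalized image embedding for one file.
        inputs = processor(images=Image.open(path).convert("RGB"),
                           return_tensors="pt")
        with torch.no_grad():
            feats = model.get_image_features(**inputs)
        return torch.nn.functional.normalize(feats, dim=-1)

    # Cosine similarity near 1.0 would suggest leg count is simply not encoded.
    sim = torch.nn.functional.cosine_similarity(embed("dog_4_legs.png"),
                                                embed("dog_5_legs.png"))
    print(sim.item())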


State-of-the-art Vision Language Models achieve 100% accuracy when counting in images of popular subjects (e.g. knowing that the Adidas logo has 3 stripes and a dog has 4 legs), but are only ~17% accurate when counting in counterfactual images (e.g. counting the stripes in a 4-striped Adidas-like logo or the legs of a 5-legged dog).


There's no need to repeat what is said at the top of the linked webpage.


tl;dr: We find that GenAI can satisfy 1/3 of everyday image-editing requests, while the other 2/3 are better handled by human image editors.


Abstract:

An Achilles heel of Large Language Models (LLMs) is their tendency to hallucinate non-factual statements. A response that mixes factual and non-factual statements is challenging for humans to verify and to accurately base their decisions on. To combat this problem, we propose Highlighted Chain-of-Thought Prompting (HoT), a technique for prompting LLMs to generate responses with XML tags that ground facts to those provided in the query. That is, given an input question, LLMs first re-format the question to add XML tags highlighting key facts, and then generate a response with highlights over the facts referenced from the input. Interestingly, in few-shot settings, HoT outperforms vanilla chain-of-thought prompting (CoT) on a wide range of 17 tasks, from arithmetic and reading comprehension to logical reasoning. When humans are asked to verify LLM responses, the highlights help time-limited participants recognize more accurately and efficiently when LLMs are correct. Yet, surprisingly, when LLMs are wrong, HoT tends to make users believe that an answer is correct.
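
For a concrete sense of the technique, here is a rough guess at what a HoT-style few-shot prompt could look like, based only on the abstract's description; the tag names and the worked example are mine, not the paper's:

    # Hypothetical HoT-style prompt template (tags and example are illustrative).
    HOT_PROMPT = """First re-format the question, wrapping its key facts in <fact1>, <fact2>, ... tags.
    Then answer, wrapping every fact you reference in the matching tags.

    Question: A bakery sold 24 cupcakes in the morning and 18 in the afternoon. How many in total?
    Reformatted question: A bakery sold <fact1>24 cupcakes in the morning</fact1> and <fact2>18 in the afternoon</fact2>. How many in total?
    Answer: It sold <fact1>24</fact1> + <fact2>18</fact2> = 42 cupcakes in total.

    Question: {question}
    Reformatted question:"""

    prompt = HOT_PROMPT.format(
        question="A train covers 60 km in the first hour and 45 km in the second. How far does it travel in total?")
    print(prompt)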


Abstract:

Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals. Despite this, they attain high scores on many popular visual benchmarks, with headroom rapidly eroded by an ongoing surge of model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for longer. We take this idea to its limit by introducing ZeroBench, a lightweight visual reasoning benchmark that is entirely impossible for contemporary frontier LMMs. Our benchmark consists of 100 manually curated questions and 334 less difficult subquestions. We evaluate 20 LMMs on ZeroBench, all of which score 0.0%, and rigorously analyse the errors. To encourage progress in visual understanding, we publicly release ZeroBench.


All frontier models (o1, o1-pro, QVQ, gemini-flash-thinking) score exactly 0% on the main questions of this benchmark.


This paper examines the limitations of current vision-based language models, such as GPT-4 and Sonnet 3.5, in performing low-level vision tasks. Despite their high scores on numerous multimodal benchmarks, these models often fail on very basic cases. This raises a crucial question: are we evaluating these models accurately?


Not one place, but there are some people tweeting about new papers daily (@arankomatsuzaki, @_akhaliq, @omarsar0) and others summarizing papers (@davisblalock, @rasbt). The Latent Space podcast is also great, and of course r/LocalLLaMA/ is an amazing place to share and learn.


Nice, thanks!


Coool! Would be nice to have an option to send commands to an LLM and show the results to the "user"! :D

