taesiri's comments | Hacker News

For over-represented concepts, like popular brands, it seems that the model “ignores” the details once it detects that the overall shapes or patterns are familiar. Opening up the vision encoders to see how these images cluster in the embedding space should give better insight.
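
A rough sketch of the kind of probe I mean, assuming an off-the-shelf CLIP encoder via HuggingFace transformers; the model name, the hypothetical image files, and the use of k-means are all my own choices for illustration:

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor
    from sklearn.cluster import KMeans

    # Embed original and counterfactual images with the same vision encoder.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    paths = ["adidas_3_stripes.png", "adidas_like_4_stripes.png",
             "dog_4_legs.png", "dog_5_legs.png"]  # hypothetical files
    images = [Image.open(p).convert("RGB") for p in paths]

    with torch.no_grad():
        inputs = processor(images=images, return_tensors="pt")
        embeds = model.get_image_features(**inputs)                # (N, 512)
        embeds = torch.nn.functional.normalize(embeds, dim=-1)

    # If a counterfactual lands in the same cluster as its "popular" original,
    # the encoder is likely collapsing the detail we care about.
    labels = KMeans(n_clusters=2, n_init="auto").fit_predict(embeds.numpy())
    print(dict(zip(paths, labels)))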


Yes, and this can probably be addressed with methods from fairness research.

I used to believe that fairness research could be ignored, that it was all rubbish, but researchers in that area at least try to do something about problems like unbalanced datasets. I'm still not sure I totally believe in it, though.
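
For what it's worth, the simplest version of that kind of fix is just reweighting the loss by inverse class frequency. A minimal PyTorch sketch with made-up counts:

    import torch
    import torch.nn as nn

    # Made-up class counts: overwhelmingly 4-legged dogs, a few 5-legged ones.
    class_counts = torch.tensor([9500.0, 500.0])
    weights = class_counts.sum() / (len(class_counts) * class_counts)

    # Mistakes on the rare class now cost ~19x more than on the common one.
    criterion = nn.CrossEntropyLoss(weight=weights)

    logits = torch.randn(8, 2)            # dummy model outputs
    targets = torch.randint(0, 2, (8,))   # dummy labels
    loss = criterion(logits, targets)
    print(weights, loss.item())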


If there aren't any five-legged dogs in your trainset, it's safer[0] to just remember that all dogs are four-legged than to actually recognize and count legs. After all, you might have a few images of dogs in your trainset that are misleading enough to look five-legged (e.g. because a dog is in front of another dog).
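
To put rough, made-up numbers on that incentive (1% genuinely five-legged dogs in the train set, an honest counter fooled by occlusion on 5% of four-legged images):

    # Made-up rates purely to illustrate the incentive.
    p_five_legged = 0.01   # true 5-legged examples in the train set
    p_miscount    = 0.05   # honest counter fooled by occlusion on 4-legged dogs

    err_always_four = p_five_legged                      # 0.0100
    err_counting    = (1 - p_five_legged) * p_miscount   # 0.0495

    # Always answering "4" has the lower error rate, so SGD has no
    # incentive to learn to actually count legs.
    print(err_always_four, err_counting)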

Overrepresentation is a different source of bias. That's what gives you, say, image generators that always draw "golden 1970s sci-fi robot" as C-3PO even when given additional instructions to draw something else.

Both of these problems are manifestations of the difference between training and deployment distributions. OK, I guess you could say that four-legged dogs are "overrepresented" in the training set, but that's because four-legged dogs are also overrepresented in reality. The deployment distribution doesn't have five-legged dogs in it. What we've done instead is concoct an adversarial distribution to force a train/deploy gap where none would otherwise exist.

Releasing the vision encoder won't help because weights are opaque. Stochastic gradient descent does not yield functional internal representations[1]; it fills the bucket of parameters with one distribution and one distribution only. We could tell if, say, the vision encoder produces identical embeddings for dogs regardless of leg count, or for other counterfactuals (a sketch of that kind of check follows after the footnotes), but not much more than that.

[0] Lower loss and possibly lower L2-norm

[1] https://arxiv.org/abs/2505.11581
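
Here is what such a counterfactual check might look like, again assuming CLIP via HuggingFace transformers and hypothetical image files rather than whatever encoder is actually behind the paper's models:

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def embed(path):
        # Returns a unit-normalized image embedding for one file.
        inputs = processor(images=Image.open(path).convert("RGB"),
                           return_tensors="pt")
        with torch.no_grad():
            feats = model.get_image_features(**inputs)
        return torch.nn.functional.normalize(feats, dim=-1)

    # Cosine similarity near 1.0 would suggest leg count is simply not encoded.
    sim = torch.nn.functional.cosine_similarity(embed("dog_4_legs.png"),
                                                embed("dog_5_legs.png"))
    print(sim.item())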


State-of-the-art Vision Language Models achieve 100% accuracy when counting in images of popular subjects (e.g. knowing that the Adidas logo has 3 stripes and a dog has 4 legs), but are only ~17% accurate when counting in counterfactual images (e.g. counting the stripes in a 4-striped Adidas-like logo or the legs of a 5-legged dog).


There's no need to repeat what is said at the top of the linked webpage.


tl;dr: We find that GenAI can satisfy 1/3 of everyday image-editing requests, while the other 2/3 are better handled by human image editors.


Abstract:

An Achilles heel of Large Language Models (LLMs) is their tendency to hallucinate non-factual statements. A response that mixes factual and non-factual statements is challenging for humans to verify and to accurately base their decisions on. To combat this problem, we propose Highlighted Chain-of-Thought Prompting (HoT), a technique for prompting LLMs to generate responses with XML tags that ground facts to those provided in the query. That is, given an input question, LLMs first re-format the question to add XML tags highlighting key facts, and then generate a response with highlights over the facts referenced from the input. Interestingly, in few-shot settings, HoT outperforms vanilla chain-of-thought prompting (CoT) on a wide range of 17 tasks, from arithmetic and reading comprehension to logical reasoning. When humans are asked to verify LLM responses, the highlights help time-limited participants recognize more accurately and efficiently when LLMs are correct. Yet, surprisingly, when LLMs are wrong, HoT tends to make users believe that an answer is correct.
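
For a concrete sense of the technique, here is a rough guess at what a HoT-style few-shot prompt could look like, based only on the abstract's description; the tag names and the worked example are mine, not the paper's:

    # Hypothetical HoT-style prompt template (tags and example are illustrative).
    HOT_PROMPT = """First re-format the question, wrapping its key facts in <fact1>, <fact2>, ... tags.
    Then answer, wrapping every fact you reference in the matching tags.

    Question: A bakery sold 24 cupcakes in the morning and 18 in the afternoon. How many in total?
    Reformatted question: A bakery sold <fact1>24 cupcakes in the morning</fact1> and <fact2>18 in the afternoon</fact2>. How many in total?
    Answer: It sold <fact1>24</fact1> + <fact2>18</fact2> = 42 cupcakes in total.

    Question: {question}
    Reformatted question:"""

    prompt = HOT_PROMPT.format(
        question="A train covers 60 km in the first hour and 45 km in the second. How far does it travel in total?")
    print(prompt)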


Abstract:

Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals. Despite this, they attain high scores on many popular visual benchmarks, with headroom rapidly eroded by an ongoing surge of model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for longer. We take this idea to its limit by introducing ZeroBench, a lightweight visual reasoning benchmark that is entirely impossible for contemporary frontier LMMs. Our benchmark consists of 100 manually curated questions and 334 less difficult subquestions. We evaluate 20 LMMs on ZeroBench, all of which score 0.0%, and rigorously analyse the errors. To encourage progress in visual understanding, we publicly release ZeroBench.


All frontier models (o1, o1-pro, QVQ, gemini-flash-thinking) score exactly 0% on the main questions of this benchmark.


This paper examines the limitations of current vision-based language models, such as GPT-4 and Sonnet 3.5, in performing low-level vision tasks. Despite their high scores on numerous multimodal benchmarks, these models often fail on very basic cases. This raises a crucial question: are we evaluating these models accurately?


Not one place, but there are some people tweeting about new papers daily (@arankomatsuzaki, @_akhaliq, @omarsar0) and others summarizing papers (@davisblalock, @rasbt). The Latent Space podcast is also great, and of course r/LocalLLaMA/ is an amazing place to share and learn.


Nice, thanks!


Coool! Would be nice to have an option to send commands to an LLM and show the results to the "user"! :D

