
Great explanation, thanks. I was surprised to hear that models still only work with ~100k tokens, but after giving it some thought it makes sense. There's only so many words/subword units that get used in any given language. The entropy comes from all the billions of different ways those subwords can be ordered.
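Back-of-envelope on that last point (my numbers, just to illustrate the combinatorics):

    # Even short sequences over a ~100k-token vocabulary explode combinatorially.
    vocab_size = 100_000
    sequence_length = 10
    print(f"{float(vocab_size) ** sequence_length:.1e} possible 10-token sequences")  # 1.0e+50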


Textual language is really, really amazing if you sit down and think about what it does versus the resources it consumes to do it.

It's a common pastime for programmers to claim that our textual programming languages are just terrible and need to be replaced somehow with something visual, but I think this very often comes from a place of not understanding just how amazing textual languages are. Not that they couldn't possibly be improved on in at least some domains, and there are after all some successful niches for visual languages, but if you set out to wholesale replace textual languages without understanding and appreciating the impressive nature of the competition they offer, you're setting yourself up to fail.


This also touches on the contrast between how human beings and LLMs trade compression for nuance. Human beings have enormous resources devoted to the long tail of the distribution of information, for example in lexical items. Word distributions follow Zipf's Law, so in the million-word FROWN corpus, roughly half the words occur only once. When's the last time you used the word chrysanthemum, or corpulent? But did you have any difficulty recognizing them? So while human beings have limited scale compared to machines, we do have an enormous capacity for nuanced communication and conception.
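You can see this on whatever text you have lying around; quick sketch, where corpus.txt is just a placeholder for any large plain-text file (not FROWN itself):

    from collections import Counter
    import re

    with open("corpus.txt", encoding="utf-8") as f:   # any large plain-text file
        words = re.findall(r"[a-z']+", f.read().lower())

    freqs = Counter(words)
    hapaxes = sum(1 for n in freqs.values() if n == 1)
    print(f"{len(freqs):,} word types, {hapaxes:,} occur exactly once "
          f"({hapaxes / len(freqs):.0%})")            # typically around half, per Zipf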

Whereas LLMs make the opposite trade-off. There are information-theoretic limits on the amount of information LLMs can store (roughly 3.6 bits per parameter), so they aggressively compress information and trade away nuance (https://arxiv.org/abs/2505.17117).
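Back-of-envelope on what that bound means (the model size here is my pick, not from the paper):

    params = 8e9                        # say, an 8B-parameter model
    capacity_bits = 3.6 * params        # ~2.9e10 bits under the ~3.6 bits/param estimate
    print(f"~{capacity_bits / 8 / 1e9:.1f} GB of memorized content")  # ~3.6 GB vs terabytes of training text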


There is also a tradeoff between different vocabulary sizes (how many entries exist in the token -> embedding lookup table) that inform the current shape of tokenizers and LLMs. (Below is my semi-armchair stance, but you can read more in depth here[0][1].)

If you tokenized at the character level ('a' -> embedding) then your vocabulary size would be small, but you'd need more tokens to represent most content. (And attention cost scales quadratically with context length, so more tokens per document gets expensive fast.) This would also be a bit more 'fuzzy' in terms of teaching the LLM what a specific token should 'mean'. The letter 'a' appears in a _lot_ of different words, so it's far more ambiguous for the LLM.
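A toy illustration of the two extremes (a real BPE tokenizer lands somewhere in between):

    text = "The chrysanthemum wilted in the corpulent gentleman's garden."

    char_tokens = list(text)    # character-level: ~100-symbol vocab, lots of tokens
    word_tokens = text.split()  # word-level: few tokens, but one vocab entry per word form

    print(len(char_tokens), "character tokens")   # 61
    print(len(word_tokens), "word tokens")        # 8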

On the flip side: what if you had one entry in the tokenizer's vocabulary for every word that exists? Well, it'd be far more than the ~100k entries used by popular LLMs, and that has computational costs: the softmax that computes the probability of each 'next' token has to run over every entry in the vocabulary, and the embedding and output layers grow with it (more memory + compute for every generated token, basically).
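Rough numbers for that output layer alone (d_model = 4096 is my assumption, in the ballpark of current open models):

    d_model = 4096

    for vocab_size in (50_000, 100_000, 1_000_000):
        weights = d_model * vocab_size   # entries in the unembedding matrix
        # every generated token needs a (d_model x vocab_size) matmul, then a softmax over vocab_size logits
        print(f"vocab {vocab_size:>9,}: {weights / 1e6:>6,.0f}M output-layer weights")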

Additionally, you run into a new problem: 'Rare Tokens'. Basically, if you have infinite tokens, you'll run into specific tokens that only appear a handful of times in the training data and the model is never able to fully imbue the tokens with enough meaning for them to _help_ the model during inference. (A specific example being somebody's username on the internet.)

Fun fact: these rare tokens, often called 'Glitch Tokens'[2], have been used for all sorts of shenanigans[3] as humans learn to break these models. (This is where my interest comes from, as somebody who works in AI security.)

As LLMs have improved, models have pushed towards the largest vocabulary they can get away with without hurting performance. This is about where my knowledge on the subject ends, but there have been many analyses done to try to compute the optimal vocabulary size. (See the links below)

One area that I have been spending a lot of time thinking about is what Tokenization looks like if we start trying to represent 'higher order' concepts without using human vocabulary for them. One example being: Tokenizing on LLVM bytecode (to represent code more 'densely' than UTF-8) or directly against the final layers of state in a small LLM (trying to use a small LLM to 'grok' the meaning and hoist it into a more dense, almost compressed latent space that the large LLM can understand).
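Purely a conceptual sketch of that second idea, with made-up shapes (how the big model would learn to consume these vectors is the hard, open part):

    import numpy as np

    def dense_tokens(hidden_states: np.ndarray, n_out: int = 4) -> np.ndarray:
        """hidden_states: (seq_len, d_model) last-layer states from a small local model."""
        spans = np.array_split(hidden_states, n_out)        # split the sequence into n_out spans
        return np.stack([s.mean(axis=0) for s in spans])    # one pooled vector per span

    fake_states = np.random.randn(2000, 768)   # stand-in for a 2,000-token file's states
    print(dense_tokens(fake_states).shape)     # (4, 768): four "tokens" instead of 2,000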

It would be cool if Claude Code, when it's talking to the big, non-local model, was able to make an MCP call to a model running on your laptop to say 'hey, go through all of the code and give me the general vibe of each file, then append those tokens to the conversation'. It'd be a lot fewer tokens than just directly uploading all of the code, and it _feels_ like it would be better than uploading chunks of code based on regex like it does today...

This immediately makes the model's inner state (even more) opaque to outside analysis though. It's the same reason using gRPC as the protocol for your JavaScript front-end sucks: humans can't debug it anymore without extra tooling. JSON is verbose as hell, but it's simple, and I can debug my REST API with just the network inspector. I don't need access to the underlying Protobuf files to understand what each byte in my gRPC messages means. That's a nice property to have when reviewing my ChatGPT logs too :P

Exciting times!

0: https://www.rohan-paul.com/p/tutorial-balancing-vocabulary-s...

1: https://arxiv.org/html/2407.13623v1

2: https://en.wikipedia.org/wiki/Glitch_token

3: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...


Again, super interesting, thanks!

> One area that I have been spending a lot of time thinking about is what Tokenization looks like if we start trying to represent 'higher order' concepts without using human vocabulary for them. One example being: Tokenizing on LLVM bytecode (to represent code more 'densely' than UTF-8)

I've had similar ideas in the past. High level languages that humans write are designed for humans. What does an "LLM native" programming language look like? And, to your point about protobufs vs JSON, how does a human debug it when the LLM gets stuck?

> It would be cool if Claude Code, when it's talking to the big, non-local model, was able to make an MCP call to a model running on your laptop to say 'hey, go through all of the code and give me the general vibe of each file, then append those tokens to the conversation'. It'd be a lot fewer tokens than just directly uploading all of the code, and it _feels_ like it would be better than uploading chunks of code based on regex like it does today...

That's basically the strategy for Claude's new "Skills" feature, just in a more dynamic/AI driven way. Claude will do semantic search through YAML frontmatter to determine what skill might be useful in a given context, then load that entire skill file into context to execute it. Your idea here is similar, use a small local model to summarize each file (basically dynamically generate that YAML front matter), feed those into the larger model's context, and then it can choose which file(s) it cares about based on that.
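A sketch of how that could look; summarize_with_local_model() is hypothetical, standing in for whatever small on-device model would answer the MCP call:

    from pathlib import Path

    def build_repo_digest(root: str) -> str:
        lines = []
        for path in sorted(Path(root).rglob("*.py")):
            summary = summarize_with_local_model(path.read_text())   # hypothetical: one sentence per file
            lines.append(f"{path}: {summary}")
        return "\n".join(lines)   # a few hundred tokens instead of the whole repo

    # The big model reads build_repo_digest(".") from its context, then requests
    # the full text of only the files it decides it actually needs.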


There's almost an infinity of possible chess games from barely 32 pieces and simple moves.

Already by move three or four you have billions of possible games.
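Rough game-tree arithmetic with average branching factors (approximations, not exact counts):

    branching = [20, 20, 28, 28, 30, 30, 32, 32]   # plausible legal-move counts per half-move
    lines = 1
    for ply, b in enumerate(branching, start=1):
        lines *= b
        print(f"after {ply} half-moves: ~{lines:,} lines")
    # roughly 3e8 lines by move three and 3e11 by move four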



