Llamafile: Bringing LLMs to the people, and to your own computer (future.mozilla.org)
88 points by weberer on Dec 21, 2023 | 17 comments


My favourite thing about llamafile is that it means we can now download a single 4GB file and get both a language model and the software needed to execute against that model in a single unit, which we can use on multiple different platforms.
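For anyone who hasn't tried it yet, the whole flow is roughly this (the filename and URL are just placeholders, use whichever llamafile you grab from Hugging Face):

    # download a llamafile, mark it executable, run it
    curl -LO https://example.org/mistral-7b-instruct.llamafile   # placeholder URL
    chmod +x mistral-7b-instruct.llamafile
    ./mistral-7b-instruct.llamafile   # should pop open a chat UI at http://localhost:8080

(On Windows you rename the file to end in .exe instead of running chmod.)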

Me, a few weeks ago: "Stick that file on a USB stick and stash it in a drawer as insurance against a future apocalypse. You’ll never be without a language model ever again."

There are llamafiles for Mixtral-8x7B-Instruct now too, which is currently the most capable openly licensed model: https://huggingface.co/jartine/Mixtral-8x7B-v0.1.llamafile/t... - those ones are 32GB.

I wrote some more notes on llamafile here: https://simonwillison.net/2023/Nov/29/llamafile/

To save us some time...

1. A llamafile is just a zip file, so if you already have the llamafile you don't need to download the model's GGUF separately: just unzip it to extract the GGUF (see the sketch after this list).

2. You can use llamafile against other models without having to bundle them together - the llamafile binary without a bundled model is about 5MB (also covered in the sketch below). Notes on that here: https://simonwillison.net/2023/Nov/29/llamafile/#llamafile-t...

3. If you already have a preferred way of running models on your own computer that isn't llamafile, great! This software isn't for you.
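A rough sketch of 1 and 2 together (filenames are just examples; the unzip works because a llamafile is a valid zip archive, and the version number in the binary name will drift):

    # 1. pull the GGUF back out of a llamafile you already have
    unzip mistral-7b-instruct-v0.1-Q4_K_M.llamafile '*.gguf'

    # 2. run the small model-free llamafile binary against any GGUF of your choosing
    chmod +x llamafile-server-0.4
    ./llamafile-server-0.4 -m mistral-7b-instruct-v0.1.Q4_K_M.gguf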


I thought Mixtral required 34GB of RAM to run at that quantization?

Yet RAM requirements aren't mentioned anywhere (in your links).


This is a great step forward, but I hope eventually Llamafile can also simplify using local data and documents as context.


I downloaded two files, ran `llamafile-server-0.4 -ngl 35 -m mistral-7b-instruct-v0.1.Q4_K_M.gguf` and was getting 53 tokens/second on my RTX 2070 Super 8GB. It couldn't be easier.
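If you'd rather script against it than use the web UI, the bundled llama.cpp server also answers plain HTTP; something like this should work (it's llama.cpp's /completion endpoint, so adjust the fields if your build differs):

    curl http://localhost:8080/completion \
      -H 'Content-Type: application/json' \
      -d '{"prompt": "Write a haiku about local LLMs", "n_predict": 128}'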

This is really going to make me want to try using a local model to help with coding. I never get responses that fast on ChatGPT with GPT-4.


This was recently discussed extensively:

https://news.ycombinator.com/item?id=38464057


Yup – macroexpanded:

Bash one-liners for LLMs - https://news.ycombinator.com/item?id=38629630 - Dec 2023 (102 comments)

Llamafile – The easiest way to run LLMs locally on your Mac - https://news.ycombinator.com/item?id=38522636 - Dec 2023 (17 comments)

Llamafile is the new best way to run a LLM on your own computer - https://news.ycombinator.com/item?id=38489533 - Dec 2023 (47 comments)

Llamafile lets you distribute and run LLMs with a single file - https://news.ycombinator.com/item?id=38464057 - Nov 2023 (288 comments)


Cosmopolitan is a nice hack but I'd never use it.

I really don't see the point; cross-platform software with multiple binaries is absolutely fine. I've shipped cross-platform software, and it's a one-time, trivial cost to prepare a multi-OS pipeline. The actual problems come from dependencies, and I'm sure you'll have to solve the same problems with Cosmopolitan.

Llamafile is another hack I'd never use - why limit myself to the models supported by llamafile when I can just install llama.cpp? (the web interface is also part of llama.cpp https://github.com/ggerganov/llama.cpp/pull/1998)

Moreover, people with GPUs will likely want to use something with better performance than llama.cpp (GPTQ or ExLlamaV2).


> I really don't see the point; cross-platform software with multiple binaries is absolutely fine. I've shipped cross-platform software, and it's a one-time, trivial cost to prepare a multi-OS pipeline.

That's like people who said "I could do this dropbox thing in a weekend with just rsync, ssh and ftp"

You think you can, but even if you do pay that "one-time, trivial cost" once: 1) it doesn't scale well, so you'll end up not doing it that often, and 2) most other people can't do it at all.

> people with GPUs likely want to use something with better performance

End users like performance, but first they need to be able to use something at all.

If you start with "compile this binary, then download this 4GB file, then run that command in this terminal", you've lost 99% of the potential audience.

I think you're missing the forest for the trees if you only see a technological hack instead of the paradigm shift it enables.


I'm getting >50 tokens/second with an RTX 2070 Super 8GB on Mistral 7B with llamafile. More than adequate performance.


For context, I'm running this on my basic mainstream laptop (6-core, IGP, Ryzen 5 5500U) with 16GB RAM. I'm getting ~6 tokens/second for mistral-7b-v0.1.Q4_K_M.


> That's like people who said "I could do this dropbox thing in a weekend with just rsync, ssh and ftp"

And so begins the cycle of enshittification!


Actually, if you poke around a little more closely, you'll find there's also a version of Llamafile that just ships llama.cpp, letting you supply a model of your choosing.

Also, the point of a tool like this is to make the ecosystem more accessible to everyone, not just software developers.


It's gonna be a while before sufficiently powerful hardware is generally available for LLMs.


I mean, that depends somewhat on your use case. My primary development machine -- a Framework with a recent board and 32GB of RAM -- is perfectly capable of running smaller models on CPU. It's certainly easy to find where the ceiling is (and this obviously isn't the hardware profile of an average user), but it's still more than sufficient for basic use.

Even if it does also chew through my battery...


Now all they need to do is figure out a way to run Mixtral 8x7B in 8GB of VRAM or less...


Super cool seeing Cosmopolitan used in a relatively mainstream role like this :)

(And I guess the AI stuff is nice too ;P)


This is really cool!



