Llamafile: Bringing LLMs to the people, and to your own computer (future.mozilla.org)
88 points by weberer on Dec 21, 2023 | 17 comments


My favourite thing about llamafile is that it means we can now download a single 4GB file and get both a language model and the software needed to execute against that model in a single unit, which we can use on multiple different platforms.
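For anyone who hasn't tried it yet, the whole flow is roughly this (the filename and URL are just placeholders, use whichever llamafile you grab from Hugging Face):

    # download a llamafile, mark it executable, run it
    curl -LO https://example.org/mistral-7b-instruct.llamafile   # placeholder URL
    chmod +x mistral-7b-instruct.llamafile
    ./mistral-7b-instruct.llamafile   # should pop open a chat UI at http://localhost:8080

(On Windows you rename the file to end in .exe instead of running chmod.)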

Me, a few weeks ago: "Stick that file on a USB stick and stash it in a drawer as insurance against a future apocalypse. You’ll never be without a language model ever again."

There are llamafiles for Mixtral-8x7B-Instruct now too, which is currently the most capable openly licensed model: https://huggingface.co/jartine/Mixtral-8x7B-v0.1.llamafile/t... - those ones are 32GB.

I wrote some more notes on llamafile here: https://simonwillison.net/2023/Nov/29/llamafile/

To save us some time...

1. A llamafile is just a zip file, so if you already have the llamafile you don't need to download the model's GGUF separately: just unzip it to extract the GGUF (see the sketch after this list).

2. You can use llamafile against other models without having to bundle them together - the llamafile binary without a bundled model is about 5MB (also covered in the sketch below). Notes on that here: https://simonwillison.net/2023/Nov/29/llamafile/#llamafile-t...

3. If you already have a preferred way of running models on your own computer that isn't llamafile, great! This software isn't for you.
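A rough sketch of 1 and 2 together (filenames are just examples; the unzip works because a llamafile is a valid zip archive, and the version number in the binary name will drift):

    # 1. pull the GGUF back out of a llamafile you already have
    unzip mistral-7b-instruct-v0.1-Q4_K_M.llamafile '*.gguf'

    # 2. run the small model-free llamafile binary against any GGUF of your choosing
    chmod +x llamafile-server-0.4
    ./llamafile-server-0.4 -m mistral-7b-instruct-v0.1.Q4_K_M.gguf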


I thought Mixtral required 34GB of RAM to run at that quantization?

Yet RAM requirements aren't mentioned anywhere (in your links).


This is a great step forward, but I hope eventually Llamafile can also simplify using local data and documents as context.


I downloaded two files, ran `llamafile-server-0.4 -ngl 35 -m mistral-7b-instruct-v0.1.Q4_K_M.gguf` and was getting 53 tokens/second on my RTX 2070 Super 8GB. It couldn't be easier.
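If you'd rather script against it than use the web UI, the bundled llama.cpp server also answers plain HTTP; something like this should work (it's llama.cpp's /completion endpoint, so adjust the fields if your build differs):

    curl http://localhost:8080/completion \
      -H 'Content-Type: application/json' \
      -d '{"prompt": "Write a haiku about local LLMs", "n_predict": 128}'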

This is really going to make me want to try using a local model to help with coding. I never get responses that fast on ChatGPT with GPT-4.


This was recently discussed extensively:

https://news.ycombinator.com/item?id=38464057


Yup – macroexpanded:

Bash one-liners for LLMs - https://news.ycombinator.com/item?id=38629630 - Dec 2023 (102 comments)

Llamafile – The easiest way to run LLMs locally on your Mac - https://news.ycombinator.com/item?id=38522636 - Dec 2023 (17 comments)

Llamafile is the new best way to run a LLM on your own computer - https://news.ycombinator.com/item?id=38489533 - Dec 2023 (47 comments)

Llamafile lets you distribute and run LLMs with a single file - https://news.ycombinator.com/item?id=38464057 - Nov 2023 (288 comments)


Cosmopolitan is a nice hack but I'd never use it.

I really don't see the point; cross-platform software with multiple binaries is absolutely fine. I've shipped cross-platform software, and it's a one-time, trivial cost to prepare a multi-OS pipeline. The actual problems come from dependencies, and I'm sure you'll have to solve the same problems with Cosmopolitan.

Llamafile is another hack I'd never use - why limit myself to the models supported by llamafile when I can just install llama.cpp? (the web interface is also part of llama.cpp https://github.com/ggerganov/llama.cpp/pull/1998)

Moreover, people with GPUs will likely want to use something with better performance than llama.cpp (GPTQ or ExLlamaV2).


> I really don't see the point; cross-platform software with multiple binaries is absolutely fine. I've shipped cross-platform software, and it's a one-time, trivial cost to prepare a multi-OS pipeline.

That's like people who said "I could do this dropbox thing in a weekend with just rsync, ssh and ftp"

You think you can, but even if you do pay that "one-time, trivial cost" once: 1) it doesn't scale well, so you'll end up not doing it that often, and 2) most other people can't do it at all.

> people with GPUs likely want to use something with better performance

End users like performance, but first they need to be able to use something at all.

If you start with "compile this binary, then download this 4GB file, then run that command in this terminal", you've lost 99% of the potential audience.

I think you're missing the forest for the trees if you only see a technological hack instead of the paradigm shift it enables.


I'm getting >50 tokens/second with an RTX 2070 Super 8GB on Mistral 7B with llamafile. More than adequate performance.


For context, I'm running this on my basic mainstream laptop (6-core, IGP, Ryzen 5 5500U) with 16GB RAM. I'm getting ~6 tokens/second for mistral-7b-v0.1.Q4_K_M.


> That's like people who said "I could do this dropbox thing in a weekend with just rsync, ssh and ftp"

And so begins the cycle of enshittification!


Actually, if you poke around a little more closely, you'll find there's also a version of Llamafile that just ships llama.cpp, letting you supply a model of your choosing.

Also, the point of a tool like this is to make the ecosystem more accessible to everyone, not just software developers.


It's gonna be a while before sufficiently powerful hardware is generally available for LLMs.


I mean, that depends somewhat on your use case. My primary development machine -- a Framework with a recent board and 32GB of RAM -- is perfectly capable of running smaller models on CPU. It's certainly easy to find where the ceiling is (and this obviously isn't the hardware profile of an average user), but it's still more than sufficient for basic use.

Even if it does also chew through my battery...


Now all they need to do is figure out a way to run Mixtral 8x7B in 8GB of VRAM or less...


Super cool seeing Cosmopolitan used in a relatively mainstream role like this :)

(And I guess the AI stuff is nice too ;P)


This is really cool!



