My favourite thing about llamafile is that it means we can now download a single 4GB file and get both a language model and the software needed to execute against that model in a single unit, which we can use on multiple different platforms.
Me, a few weeks ago: "Stick that file on a USB stick and stash it in a drawer as insurance against a future apocalypse. You’ll never be without a language model ever again."
I downloaded two files, ran `llamafile-server-0.4 -ngl 35 -m mistral-7b-instruct-v0.1.Q4_K_M.gguf` and was getting 53 tokens/second on my RTX 2070 Super 8GB. It couldn't be easier.
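For anyone wanting to reproduce this, here's a rough sketch of that two-file setup. The filenames are just the ones mentioned above, and the web UI port is an assumption based on llama.cpp's server defaults:

```sh
# Make the downloaded cross-platform binary executable, then point it at the GGUF.
chmod +x llamafile-server-0.4
./llamafile-server-0.4 \
  -ngl 35 \
  -m mistral-7b-instruct-v0.1.Q4_K_M.gguf
# -ngl 35 offloads 35 layers to the GPU; -m selects the external model weights.
# The bundled server then exposes a chat web UI (by default on http://localhost:8080).
```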
This is really going to make me want to try using a local model to help with coding. I never get responses that fast on ChatGPT with GPT-4.
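If you do want to wire it into a coding workflow, the server llamafile starts is llama.cpp's web server, so you can also hit it from scripts. A minimal sketch, assuming the default port of 8080 and llama.cpp's standard /completion endpoint:

```sh
# Ask the locally served model for a completion over HTTP.
curl http://localhost:8080/completion \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "Write a Python function that reverses a string.",
    "n_predict": 128
  }'
```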
I really don't see the point; cross-platform software with multiple binaries is absolutely fine. I've shipped cross-platform software, and it's a one-time trivial cost to prepare a multi-OS pipeline. The actual problems come from dependencies, and I'm sure you'll have to solve the same problems with Cosmopolitan.
Llamafile is another hack I'd never use - why limit myself to the models supported by llamafile when I can just install llama.cpp? (The web interface is also part of llama.cpp: https://github.com/ggerganov/llama.cpp/pull/1998)
Moreover, people with GPUs will likely want to use something with better performance than llama.cpp (GPTQ or ExLlamaV2).
> I really don't see the point; cross-platform software with multiple binaries is absolutely fine. I've shipped cross-platform software, and it's a one-time trivial cost to prepare a multi-OS pipeline.
That's like the people who said "I could do this Dropbox thing in a weekend with just rsync, ssh and FTP".
You think you can, but even if you do pay that "one-time trivial cost" once, 1) it doesn't scale well, so you'll end up not doing it that often, and 2) most other people can't do it at all.
> people with GPUs likely want to use something with better performance
End users like performance, but first they need to be able to use something at all.
If you start with "compile this binary, then download this 4GB file, then run that command in this terminal", you've lost 99% of the potential audience.
I think you are missing the forest for the trees if you only see a technological hack instead of the paradigm shift it enables.
For context, I'm running this on my basic mainstream laptop (6-core, IGP, Ryzen 5 5500U) with 16GB RAM. I'm getting ~6 tokens/second for mistral-7b-v0.1.Q4_K_M.
Actually, if you poke around at it a little more closely, you'll find there's also a version of Llamafile that just ships llama.cpp, enabling you to supply a model of your choosing.
Also, the point of a tool like this is to make the ecosystem more accessible to everyone, not just software developers.
I mean, that depends somewhat on your use case. My primary development machine -- a Framework with a recent board and 32GB of RAM -- is perfectly capable of running smaller models on CPU, and while it's certainly easy to find where the ceiling is (and this obviously isn't the hardware profile of an average user), it's still more than sufficient for basic use.
There are llamafiles for Mixtral-8x7B-Instruct now too, which is currently the most capable openly licensed model: https://huggingface.co/jartine/Mixtral-8x7B-v0.1.llamafile/t... - those ones are 32GB.
I wrote some more notes on llamafile here: https://simonwillison.net/2023/Nov/29/llamafile/
To save us some time...
1. A llamafile is just a zip, so if you already have the llamafile you don't need to download the model's GGUF separately: just unzip it (see the sketch after this list)
2. You can use llamafile against other models without having to bundle them together - the llamafile binary without a bundled model is about 5MB. Notes on that here: https://simonwillison.net/2023/Nov/29/llamafile/#llamafile-t...
3. If you already have a preferred way of running models on your own computer that isn't llamafile, great! This software isn't for you.
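To make points 1 and 2 concrete, here's a rough sketch; the archive and binary names are placeholders, so substitute whatever you actually downloaded:

```sh
# 1. A llamafile is a zip archive, so the bundled GGUF can be pulled straight out of it.
unzip -l mistral-7b-instruct-v0.1-Q4_K_M.llamafile          # list contents to find the .gguf
unzip mistral-7b-instruct-v0.1-Q4_K_M.llamafile '*.gguf'    # extract just the model weights

# 2. The ~5MB llamafile binary (no model bundled) can run any GGUF you point it at.
chmod +x llamafile
./llamafile -ngl 35 -m mistral-7b-instruct-v0.1.Q4_K_M.gguf
```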