A quick survey of the thread seems to indicate the 7b parameter LLaMA model does about 20 tokens per second (~4 words per second) on a base model M1 Pro, by taking advantage of Apple Silicon’s Neural Engine.
Note that the latest model iPhones ship with a Neural Engine of similar performance to latest model M-series MacBooks (both iPhone 14 Pro and M1 MacBook Pro claim 15.8 teraflops on their respective neural engines; it might be the same exact component in each chip). All iPhone 14 models sport 6GB integrated RAM; the MacBook starts at 8GB. All of the specs indicate an iPhone 14 Pro could achieve similar throughput to an M1 MacBook Pro.
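One thing worth sanity-checking is whether the 7B model even fits in that 6GB. A rough back-of-envelope, assuming the 4-bit quantization the port uses (and ignoring the KV cache and whatever the OS itself needs), suggests the weights alone do:

    // Back-of-envelope: do the 4-bit quantized 7B weights fit in an iPhone's 6 GB?
    let parameters = 7.0e9
    let bytesPerParameter = 0.5                           // 4-bit quantization
    let weightsGB = parameters * bytesPerParameter / 1e9  // ≈ 3.5 GB
    print("Quantized weights: \(weightsGB) GB")           // KV cache and OS overhead not included

Tight, but plausible.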
Some people have already had success porting Whisper to the Neural Engine, and as of 14 hours ago GGerganov (the guy who made this port of LLaMA to the Neural Engine and who made the port of Whisper to C++) posted a GitHub comment indicating he will be working on that in the next few weeks.
So. With Whisper and LLaMA on the Neural Engine both showing better than real-time performance, and Apple’s own pre-existing Siri Neural TTS, it looks like we have all the pieces needed to make a ChatGPT-level assistant operate entirely through voice and run entirely on your phone. This is absolutely extraordinary stuff!
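To make the "all the pieces" claim concrete, here's a minimal sketch of the loop such an assistant would run. transcribe and generate are hypothetical wrappers around the whisper.cpp and llama.cpp C APIs (neither function exists in those projects); the last step uses Apple's public AVSpeechSynthesizer, which runs on-device but gives you the system voices rather than Siri's own:

    import AVFoundation

    // Hypothetical wrappers around the whisper.cpp / llama.cpp C APIs --
    // stand-ins for real bindings, not functions from either project.
    func transcribe(_ audio: [Float]) -> String { return "" }   // Whisper: audio -> text
    func generate(prompt: String) -> String { return "" }       // LLaMA: prompt -> reply

    let synthesizer = AVSpeechSynthesizer()

    func answer(_ audio: [Float]) {
        let question = transcribe(audio)
        let reply = generate(prompt: question)
        let utterance = AVSpeechUtterance(string: reply)
        utterance.voice = AVSpeechSynthesisVoice(language: "en-US")  // system voice, not Siri's
        synthesizer.speak(utterance)                                 // on-device TTS
    }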
> All of the specs indicate an iPhone 14 Pro could achieve similar throughput to an M1 MacBook Pro.
Battery capacity and thermals are different and might be problematic. The phone might throttle performance earlier.
> it looks like we have all the pieces needed to make a ChatGPT-level assistant operate entirely through voice and run entirely on your phone.
As a demo, yes, but would loading the model be fast enough for Siri-like responsiveness? You also would want to run other programs alongside it.
And of course, for Apple to adopt something like this, we would have to get rid of the tendency of these models to derail conversations. Put in something somewhat sexist/racist/…, and it will reply with something a bit more sexist/racist/…
> we would have to get rid of the tendency of these models [...] reply with something a bit more sexist/racist/
If you don't want it to be racist, don't say racist things to it. Also, it'll be fairly clear where the racism came from - like a parrot and their owner.
AIs that can tweet, like MS Tay, and that remote-work chatbot, get a lot of attention when they melt down. Private AIs on your phone don't seem like they'll cause any concern with the phone-using public.
I think we'll appreciate the benefits more than we'll mind that others can make it say dirty words.
At this point in time, Siri as a voice-driven assistant has become so totally and utterly useless, it's not even worth comparing it to anything else. I wonder how a company can work on a feature like that for 10 years, and manage to make it worse with every release they put out.
At this point in time, Apple should be so embarrassed by Siri that I really think scratching the whole thing would have a net benefit.
Scratch it, and start over. And fire everyone involved with Siri :-)
Just recently Siri would go belly-up on “Turn off Living Room lightS” — it would only work if I said “light” (singular). Extremely frustrating. They fixed it, I think, but this arbitrariness and many other quirks make me think Siri is more quirk- and algorithm-based than a true AI.
Handling smart home requests is the one thing that Siri seems to do more or less without error, at least for me. I use it multiple times per day and cannot remember the last time it did not work.
Is Siri better, or does it have you well trained? My smart home stuff works best for me because I know more of the exact labels. I was literally surprised the other day that my wife included an S and it still worked.
Half the time it responds with "one moment… one moment… this is taking too long" or "I have problems connecting to the internet". But there are no internet problems whatsoever, and it connects to my Home Assistant via the local HomeKit integration, which shouldn't even need the internet.
> Some people have already had success porting Whisper to the Neural Engine, and as of 14 hours ago GGerganov (the guy who made this port of LLaMA to the Neural Engine and who made the port of Whisper to C++) posted a GitHub comment indicating he will be working on that in the next few weeks.
Oh shit, I took a closer look and you’re right. The repo was also helpfully updated with a note to this effect: “The Accelerate framework is actually currently unused since I found that for tensor shapes typical for the Decoder, there is no benefit compared to the ARM_NEON intrinsics implementation. Of course, it's possible that I simply don't know how to utilize it properly. But in any case, you can even disable it with LLAMA_NO_ACCELERATE=1 make and the performance will be the same, since no BLAS calls are invoked by the current implementation”.
No Joi in my pocket just yet :(
Because of this I re-checked my claims about the Whisper speed up from the Neural Engine and that does look legit, 6x at least. So the Neural Engine does have the chops for this workload, it just isn’t being used in this repo. It may not be LLaMA, but I sure hope someone gets an LLM running on the ANE sooner rather than later.
Our investigations indicate that it might not be possible to achieve ANE performance improvement over CPU for LLM Decoder inference with batch size of 1 [0]. Just to make it clear - I'm no expert in Core ML / ANE, so these conclusions could be totally wrong.
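For anyone who wants to reproduce that kind of comparison, Core ML lets you pin a model to specific compute units, so you can time the same decoder on CPU only versus CPU + Neural Engine. A minimal sketch, where the model path, input name, and shape are placeholders rather than anything from the linked experiment:

    import CoreML
    import Foundation

    // Placeholder model path and input name -- substitute your own compiled decoder.
    let modelURL = URL(fileURLWithPath: "Decoder.mlmodelc")

    func timeDecoder(on units: MLComputeUnits) throws -> TimeInterval {
        let config = MLModelConfiguration()
        config.computeUnits = units                       // .cpuOnly vs .cpuAndNeuralEngine
        let model = try MLModel(contentsOf: modelURL, configuration: config)

        let tokens = try MLMultiArray(shape: [1, 512], dataType: .int32)  // batch size 1 dummy input
        let input = try MLDictionaryFeatureProvider(dictionary: ["tokens": tokens])

        let start = Date()
        _ = try model.prediction(from: input)
        return Date().timeIntervalSince(start)
    }

    // Compare, e.g., try timeDecoder(on: .cpuOnly) against try timeDecoder(on: .cpuAndNeuralEngine)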
Neural Engine across the M1 and M2 series is also sadly very limited.
I bought one thinking I could exploit it for Stable Diffusion and other tasks, but found that most libraries say to use the GPU for faster generation. Not only is the Neural Engine the same on the M2 Pro (meaning I upgraded from my base-model M1 for no reason), it also doesn't scale at all, except on the M1 Ultra, where it's doubled simply because two dies are bridged together.
The Neural Engine can generate 512x512 images pretty easily, but it takes a while even compared to using the GPU on a base-model M1 Mac mini. It's kinda crazy. I'm looking into ways to improve it and take advantage of the Neural Engine in the future, but the current situation is very limited. Even Apple's official implementation and Core ML libraries seem to prefer that you run them on Metal.
According to Meta's benchmarking[0] it is comparable on many metrics. I haven't used it myself so I can't say for sure if that is the case when actually using it.
I don't understand this topic well, but given the premise that GPT-3 and ChatGPT differ only in that ChatGPT adds RLHF (Reinforcement Learning from Human Feedback), and that LLaMA 7b is comparable to GPT-3 on a number of metrics, it would follow that if we were to improve LLaMA 7b with RLHF, the 7b model would be similar to ChatGPT. Is that correct?
You're likely right that applying RLHF (+ fine-tuning with instructions) to LLaMA 7b would produce results similar to ChatGPT, but I think you're implying that that would be feasible today.
RLHF requires a large amount of human feedback data and IIRC there's no open data set for that right now.
And they've already collected over 100,000 samples, iirc ChatGPT was trained on something like 30,000 samples, so the open models should already be positioned to succeed.
> You mean train it on ChatGPT's output? That's against OpenAI's terms of service.
Oh no, someone call the internet police.
I'm sure scraping tons and tons of images and web data to train DALLE and GPT and then selling access to that data to others was also against many licenses and terms of services, but OpenAI did those anyway.
None of these AIs were created ethically. At the very least we can make sure these huge models don’t solely belong to monopolistic tech companies and democratize their power.
GPT 3.5 likely differs from the original GPT 3 by more than instruction fine-tuning. For example, it was probably retrained under Chinchilla scaling laws [1], with a lot more data and maybe a somewhat smaller parameter count.
There are many variants of GPT-3 and GPT-3.5, and based on the performance numbers in Meta’s paper, it looks like they’re comparing against the very first version of GPT-3 from 2020. [2]
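For a sense of scale, the Chinchilla rule of thumb is roughly 20 training tokens per parameter, which is far more data than the original GPT-3 run reportedly used:

    // Rough Chinchilla rule of thumb: ~20 training tokens per parameter.
    let gpt3Parameters = 175.0e9
    let chinchillaOptimalTokens = 20.0 * gpt3Parameters    // ≈ 3.5 trillion tokens
    let gpt3TrainingTokens = 300.0e9                       // reported for the original 2020 run
    print(chinchillaOptimalTokens / gpt3TrainingTokens)    // ≈ 12x more data than GPT-3 saw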
I wish we could start having open source TTS models with similar performance. So far Tortoise TTS is not there yet.
I'm not sure if Siri Neural TTS is offered to 3rd-party apps.
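The public on-device TTS route for 3rd-party apps is AVSpeechSynthesizer; as far as I know the actual Siri voices aren't exposed through it, but you can enumerate whatever voices the system does offer:

    import AVFoundation

    // List the on-device voices the public TTS API exposes to 3rd-party apps.
    for voice in AVSpeechSynthesisVoice.speechVoices() where voice.language.hasPrefix("en") {
        print(voice.name, voice.quality.rawValue)   // quality: default / enhanced / premium
    }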
Oh, it’s probably higher than four words per second, then. I assumed tokens were characters and used the standard “there are five characters in a word” rule of thumb.
It's about 4 characters per token, so just over 1 token per word. I just round to 1 token per word since the text most people generate does not use larger words, and because larger common words are still encoded as one token (e.g. HackerNews is probably one token despite being 10 characters).
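So the conversion for the 20 tokens/s figure upthread works out roughly like this (the 4-characters-per-token and 5-characters-per-word numbers are just rules of thumb):

    let tokensPerSecond = 20.0   // throughput reported upthread for the 7B model on an M1 Pro
    let charsPerToken = 4.0      // rough average for English text
    let charsPerWord = 5.0       // the usual rule of thumb (roughly includes the space)
    print(tokensPerSecond * charsPerToken / charsPerWord)   // ≈ 16 words per second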
But won't it be that, in real life, no one would want to run a voice command that consumes a lot of CPU and battery, as opposed to making a network call to a service that has this model hosted?
Agreed that this can always be improved and hardware can get more efficient and better too, but at the end of the day, would it ever be better than an API call?
I live in eastern Oregon on a property with no cell service.
I use Siri a lot, mainly to add reminders, and sometimes I try to use Siri when I'm out at the greenhouse, which is just past the edge of the mesh network. I would love for those reminders to get added - even if it burnt battery.
And more generally I would love for people writing apps to consider that phones don't always have service - as would my neighbors.
Simple thought experiment: you want to know how many tons of copper are mined in the US each year. Lowest possible latency is calculating this in your head, most likely using data you don’t have. Looking it up online is a lot, lot faster.
In some far future world maybe every transistor will include the sum total of human knowledge up to the nanosecond, but that’s a pretty far future. There are many things where running locally means a higher latency floor.
We should also consider the potential drawbacks of relying entirely on voice-operated assistants like ChatGPT. There are concerns around privacy and the use of personal data, as well as the potential for bias and inaccuracies in the responses generated by these models. It's important to strike a balance between the convenience and benefits of these technologies and the potential risks and limitations they bring. Nonetheless, the advancements being made in this field are impressive and it will be interesting to see how they develop in the future.
I think voice assistants can perform actions on phones (e.g. "open app", "message Alice", "call Bob", "turn off Bluetooth"). This couldn't do that (I think), which is an obvious drawback.
I've had difficulty obtaining useful results from the smaller (7B-sized) models. The issue lies in the content, not the speed. If you could stream the text-to-speech, the speed alone would be satisfactory.