
A quick survey of the thread seems to indicate the 7b parameter LLaMA model does about 20 tokens per second (~4 words per second) on a base model M1 Pro, by taking advantage of Apple Silicon’s Neural Engine.

Note that the latest model iPhones ship with a Neural Engine of similar performance to latest model M-series MacBooks (both iPhone 14 Pro and M1 MacBook Pro claim 15.8 teraflops on their respective neural engines; it might be the same exact component in each chip). All iPhone 14 models sport 6GB integrated RAM; the MacBook starts at 8GB. All of the specs indicate an iPhone 14 Pro could achieve similar throughput to an M1 MacBook Pro.

Some people have already had success porting Whisper to the Neural Engine, and as of 14 hours ago GGerganov (the guy who made this port of LLaMA to the Neural Engine and who made the port of Whisper to C++) posted a GitHub comment indicating he will be working on that in the next few weeks.

So. With Whisper and LLaMA on the Neural Engine both showing better than real-time performance, and Apple’s own pre-existing Siri Neural TTS, it looks like we have all the pieces needed to make a ChatGPT-level assistant operate entirely through voice and run entirely on your phone. This is absolutely extraordinary stuff!
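For anyone wondering how those pieces would bolt together, here is a minimal sketch in Python that shells out to the whisper.cpp and llama.cpp example programs. The binary locations, model paths, and flags are my assumptions from memory, and macOS's built-in "say" command stands in for Siri Neural TTS:

  import subprocess

  def transcribe(wav_path):
      # Assumption: whisper.cpp's example binary ("main"), with -nt to drop
      # timestamps, prints the transcript of a 16 kHz WAV file to stdout.
      out = subprocess.run(
          ["./whisper/main", "-m", "models/ggml-base.en.bin", "-f", wav_path, "-nt"],
          capture_output=True, text=True, check=True)
      return out.stdout.strip()

  def generate(prompt):
      # Assumption: llama.cpp's example binary ("main") echoes the prompt
      # followed by the completion to stdout.
      out = subprocess.run(
          ["./llama/main", "-m", "models/7B/ggml-model-q4_0.bin",
           "-p", prompt, "-n", "128"],
          capture_output=True, text=True, check=True)
      return out.stdout[len(prompt):].strip()

  def speak(text):
      # macOS built-in TTS as a stand-in for Siri Neural TTS.
      subprocess.run(["say", text], check=True)

  question = transcribe("question.wav")
  speak(generate(question))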



> All of the specs indicate an iPhone 14 Pro could achieve similar throughput to an M1 MacBook Pro.

Battery capacity and thermals are different and might be problematic. The phone might throttle performance earlier.

> it looks like we have all the pieces needed to make a ChatGPT-level assistant operate entirely through voice and run entirely on your phone.

As a demo, yes, but would loading the model be fast enough for Siri-like responsiveness? You also would want to run other programs alongside it.

And of course, for Apple to adopt something like this, we would have to get rid of the tendency of these models to derail conversations. Put in something somewhat sexist/racist/…, and it will reply with something a bit more sexist/racist/….

But yes, it would be a cool demo.


> we would have to get rid of the tendency of these models [...] reply with something a bit more sexist/racist/

If you don't want it to be racist, don't say racist things to it. Also, it'll be fairly clear where the racism came from - like a parrot and their owner.

AIs that can tweet, like MS Tay, and that remote-work chatbot, get a lot of attention when they melt down. Private AIs on your phone don't seem like they'll cause any concern with the phone-using public.

I think we'll appreciate the benefits more than we'll mind that others can make it say dirty words.


Siri doesn’t seem as fast or responsive as Google Assistant at times.


At this point in time, Siri as a voice-driven assistant has become so totally and utterly useless, it's not even worth comparing it to anything else. I wonder how a company can work on a feature like that for 10 years, and manage to make it worse with every release they put out.

At this point in time, Apple should be so embarrassed of Siri that I really think scratching the whole thing would have a net benefit.

Scratch it, and start over. And fire everyone involved with Siri :-)


The logistics aren't that easy, Apple's entire product line runs Siri.


Siri is sometimes busy doing laundry or Gods know what. I think the quality of Siri is much better than Google Assistant but I wonder about the lag.


Really? I find Siri can’t understand anything beyond basic instructions.

Google Assistant seems able to do more.


I'm very interested in this space. Can you share an example that illustrates the difference in "understanding" between the two?


Just recently Siri would belly-up on “Turn off Living Room lightS” — it would only work if I said “light” (singular). Extremely frustrating. They fixed it, I think, but this arbitrariness, and many others like it, makes me think Siri is more quirk- and algorithm-based than a true AI.


Handling smart home requests is the one thing that Siri seems to do more or less without error, at least for me. I use that multiple times per day, and cannot remember the last time that it did not work.


Is Siri better, or does it have you well trained? My smart home stuff works best for me because I know more of the exact labels. I was literally surprised the other day that my wife included an S and it still worked.


Mine is really really poor at it.

Half the time it responds with "one moment.. One moment.. this is taking too long" or "I have problems connecting to the internet". But there are no internet problems whatsoever, and it connects to my Home Assistant using the local HomeKit integration, which shouldn't even need the internet.


> Some people have already had success porting Whisper to the Neural Engine, and as of 14 hours ago GGerganov (the guy who made this port of LLaMA to the Neural Engine and who made the port of Whisper to C++) posted a GitHub comment indicating he will be working on that in the next few weeks.

He has already done great work here: https://github.com/ggerganov/whisper.cpp


If I may, this library runs LLaMA on CPU. There is no way to run it on the Neural Engine yet.

The optimization in this case only seems to refer to the 4bit model loading method (to be friendlier to the arm64 CPU)
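For context, the 4-bit trick is block quantization: small groups of weights share one scale, and each weight is stored as a 4-bit integer, roughly quartering the memory the CPU has to stream. A rough numpy sketch of the idea; the block size and scaling are my guess at the general shape, not the exact ggml format:

  import numpy as np

  BLOCK = 32  # weights per block; ggml uses small fixed-size blocks like this

  def quantize_q4(weights):
      # One float16 scale per block, weights rounded to integers in [-8, 7].
      w = weights.reshape(-1, BLOCK)
      scale = np.abs(w).max(axis=1, keepdims=True) / 7.0 + 1e-12
      q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
      return q, scale.astype(np.float16)

  def dequantize_q4(q, scale):
      return (q.astype(np.float32) * scale).reshape(-1)

  w = np.random.randn(1024 * 1024).astype(np.float32)
  q, scale = quantize_q4(w)
  packed_bytes = q.size // 2 + scale.nbytes      # two 4-bit values per byte
  print("compression:", round(w.nbytes / packed_bytes, 1), "x")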

GeoHot has tinygrad running LLaMA on Metal (but only the 7B model); that's the closest I've seen to taking advantage of Apple silicon.

Neural Engine implementation would be awesome


Oh shit, I took a closer look and you’re right. The repo was also helpfully updated with a note to this effect: “The Accelerate framework is actually currently unused since I found that for tensor shapes typical for the Decoder, there is no benefit compared to the ARM_NEON intrinsics implementation. Of course, it's possible that I simply don't know how to utilize it properly. But in any case, you can even disable it with LLAMA_NO_ACCELERATE=1 make and the performance will be the same, since no BLAS calls are invoked by the current implementation”.

No Joi in my pocket just yet :(

Because of this I re-checked my claims about the Whisper speed up from the Neural Engine and that does look legit, 6x at least. So the Neural Engine does have the chops for this workload, it just isn’t being used in this repo. It may not be LLaMA, but I sure hope someone gets an LLM running on the ANE sooner rather than later.
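For anyone who wants to poke at that, the usual recipe is to trace a PyTorch module and convert it with coremltools while requesting the Neural Engine. A minimal sketch below; the toy decoder block and shapes are placeholders, and whether Core ML actually schedules the ops on the ANE (and whether this particular trace converts cleanly on your coremltools/PyTorch versions) is up to the runtime:

  import torch
  import coremltools as ct

  class ToyDecoderBlock(torch.nn.Module):
      # Placeholder stand-in for one transformer decoder layer.
      def __init__(self, d_model=512, n_heads=8):
          super().__init__()
          self.attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
          self.mlp = torch.nn.Sequential(
              torch.nn.Linear(d_model, 4 * d_model),
              torch.nn.GELU(),
              torch.nn.Linear(4 * d_model, d_model))

      def forward(self, x):
          a, _ = self.attn(x, x, x, need_weights=False)
          return x + self.mlp(x + a)

  model = ToyDecoderBlock().eval()
  example = torch.randn(1, 1, 512)   # batch size 1, a single new token
  traced = torch.jit.trace(model, example)

  mlmodel = ct.convert(
      traced,
      inputs=[ct.TensorType(shape=example.shape)],
      convert_to="mlprogram",
      compute_units=ct.ComputeUnit.CPU_AND_NE,  # ask for the ANE where possible
  )
  mlmodel.save("toy_decoder.mlpackage")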


Our investigations indicate that it might not be possible to achieve an ANE performance improvement over the CPU for LLM decoder inference with a batch size of 1 [0]. Just to make it clear - I'm no expert in Core ML / ANE, so these conclusions could be totally wrong.

[0] https://github.com/ggerganov/whisper.cpp/discussions/548#dis...


Don’t sell yourself short! (And you have my apologies in advance if my excited comment above has created any extra work for you)


Neural Engine across the M1 and M2 series is also sadly very limited.

I bought one thinking I could exploit it for Stable Diffusion and other tasks, but found that most libraries say to use the GPU for faster generation. What I found is that not only is the engine the same on the M2 Pro (meaning I upgraded from my base-model M1 for no reason), but it also doesn't scale at all, except on the M1 Ultra where it's doubled simply because it's using two dies bridged together.

The Neural Engine can generate 512x512 images pretty easily, but it takes a while even compared to using the GPU on a base-model M1 Mac Mini. It's kinda crazy. I'm looking into ways to improve it and take advantage of the Neural Engine in the future, but the current situation is very limited. Even Apple's official implementation and Core ML libraries seem to prefer that you run them on Metal.
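If anyone wants to reproduce that comparison, coremltools lets you pin a converted model to a compute unit and time it. A rough harness; the model path, input name, and shape are placeholders for whatever .mlpackage you have lying around (e.g. one of the converted Stable Diffusion components):

  import time
  import numpy as np
  import coremltools as ct

  MODEL_PATH = "unet.mlpackage"   # placeholder: any converted Core ML model
  INPUT_NAME = "sample"           # placeholder: check model.get_spec() for real names

  def bench(compute_units, runs=10):
      model = ct.models.MLModel(MODEL_PATH, compute_units=compute_units)
      x = np.random.rand(1, 4, 64, 64).astype(np.float32)   # placeholder shape
      model.predict({INPUT_NAME: x})                         # warm-up / first load
      start = time.time()
      for _ in range(runs):
          model.predict({INPUT_NAME: x})
      return (time.time() - start) / runs

  for cu in (ct.ComputeUnit.CPU_ONLY,
             ct.ComputeUnit.CPU_AND_GPU,
             ct.ComputeUnit.CPU_AND_NE):
      print(cu, round(bench(cu), 3), "s per prediction")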


The 7b model specifically is not quite "ChatGPT-level" though, is it?


According to Meta's benchmarking[0] it is comparable on many metrics. I haven't used it myself so I can't say for sure if that is the case when actually using it.

[0]: https://arxiv.org/pdf/2302.13971.pdf


That's GPT3, not ChatGPT.


I don't understand this topic well, but given the premise that GPT-3 and ChatGPT differ only in that ChatGPT includes RLHF (Reinforcement Learning from Human Feedback), and that LLaMA 7b is comparable to GPT-3 on a number of metrics, it would follow that if we were to improve LLaMA 7b with RLHF, the 7b model would be similar to ChatGPT. Is that correct?


You're likely right that applying RLHF (+ fine-tuning with instructions) to LLaMA 7b would produce results similar to ChatGPT, but I think you're implying that that would be feasible today.

RLHF requires a large amount of human feedback data and IIRC there's no open data set for that right now.


There's open-assistant.io, which is doing RLHF directly in the open.


And they've already collected over 100,000 samples; IIRC ChatGPT was trained on something like 30,000 samples, so the open models should already be positioned to succeed.


There are open datasets (see the chatllama harness project and its references). You can of course also cross train it using actual ChatGPT.


Is there something I'm missing? ChatLlama doesn't reference any human feedback datasets.

> You can of course also cross train it using actual ChatGPT.

You mean train it on ChatGPT's output? That's against OpenAI's terms of service.


> You mean train it on ChatGPT's output? That's against OpenAI's terms of service.

Oh no, someone call the internet police.

I'm sure scraping tons and tons of images and web data to train DALL-E and GPT, and then selling access to that data to others, was also against many licenses and terms of service, but OpenAI did it anyway.


None of these AIs were created ethically. At the very least we can make sure these huge models don’t solely belong to monopolistic tech companies and democratize their power.


You’re missing something. Both SHP (https://huggingface.co/datasets/stanfordnlp/SHP) and OpenAssistant datasets are referenced.

And while the TOS violation might be the case, the project nevertheless has a mode to use OpenAI in the fine-tuning steps.


I’m interested in this as well. Comparatively little attention has been paid to those 7B model results, but they look quite good against 175B GPT-3.

As for ChatGPT, that is GPT-3.5 (same 175B model, but with instruction fine-tuning), plus the RLHF.


GPT-3.5 likely differs from the original GPT-3 by more than instruction fine-tuning. For example, it was probably retrained under Chinchilla scaling laws [1], with a lot more data and maybe a somewhat smaller parameter count.

There are many variants of GPT-3 and GPT-3.5, and based on the performance numbers in Meta’s paper, it looks like they’re comparing against the very first version of GPT-3 from 2020. [2]

[1] https://arxiv.org/abs/2203.15556

[2] https://arxiv.org/abs/2005.14165
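The Chinchilla result reduces to a handy rule of thumb: compute-optimal training wants roughly 20 tokens per parameter, with training compute of roughly 6 * params * tokens FLOPs. Napkin math below (the 20x and 6ND figures are the usual approximations):

  # Chinchilla rule of thumb: ~20 training tokens per parameter,
  # training compute ~ 6 * N * D FLOPs (N = params, D = tokens).
  for name, n_params in [("LLaMA 7B", 7e9), ("GPT-3 175B", 175e9)]:
      optimal_tokens = 20 * n_params
      train_flops = 6 * n_params * optimal_tokens
      print(f"{name}: ~{optimal_tokens / 1e9:.0f}B tokens, ~{train_flops:.1e} training FLOPs")

(LLaMA itself went well past that, training the 7B on roughly 1T tokens, which is part of why such a small model punches above its weight.)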


There's no overhead introduced for the 'final' model inference, is there?


None of the Meta models are RLHF tuned, as far as I know.


I wish we could start having open source TTS models with similar performance. So far Tortoise TTS is not there yet. I'm not sure if Siri Neural TTS is offered to third-party apps.


>20 tokens per second (~4 words per second)

How can there be 5 tokens per word, when LLaMA has more than half the vocabulary of GPT-2/3, which runs about 1.3 tokens per word?

I would have guessed more like 1.5 tokens per word.


Oh, it’s probably higher than four words per second, then. I assumed tokens were characters and used the standard “there are five characters in a word” rule of thumb.


It's about 4 characters per token. So just over 1 token per word. I just round to 1 token per word since text most people generate does not use larger words, and because larger common words are still encoded as one token (e.g. HackerNews is probably one token despite being 10 characters).
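Easy enough to sanity-check with GPT-2's BPE as a proxy (LLaMA uses a different SentencePiece vocabulary, so treat the number as a rough estimate):

  import tiktoken

  enc = tiktoken.get_encoding("gpt2")
  text = ("A quick survey of the thread seems to indicate the 7b parameter "
          "LLaMA model does about 20 tokens per second on a base model M1 Pro.")
  n_tokens = len(enc.encode(text))
  n_words = len(text.split())
  print(n_tokens, n_words, round(n_tokens / n_words, 2))   # roughly 1.3 tokens per word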


I typically see people claim 2-3 tokens per word.


But won't it be that, in real life, no one would want to run a voice command that consumes a lot of CPU and battery, as opposed to making a network call to a service that has this model hosted?

Agreed that this can always be improved and hardware can get more efficient and better too, but at the end of the day, would it ever be better than an API call?


I live in eastern Oregon on a property with no cell service.

I use Siri a lot, mainly to add reminders, and sometimes I try to use Siri when I'm out at the greenhouse, which is just past the edge of the mesh network. I would love for those reminders to get added - even if it burnt battery.

And more generally I would love for people writing apps to consider that phones don't always have service - as would my neighbors.


Privacy concerns are justified.

It's not just that, this can also work completely offline.


I'm looking forward to running stuff like this online. Using soulless big-tech corporate SaaS AI is just pure dystopia material.

It's even better that we are talking about a relatively low-power machine here. Maybe it can even operate offline.


You mean offline?


Ultimately, no amount of technology will ever beat the speed of light. Running locally will always have a lower latency floor.
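The floor is easy to put a number on. Quick calculation below; the distances are made-up examples, the ~0.67c figure is the usual approximation for light in fiber, and real networks add routing and serialization on top:

  C_VACUUM = 299_792            # km/s
  C_FIBER = C_VACUUM * 0.67     # light in optical fiber is roughly a third slower

  for km in (50, 500, 5000):    # example distances to a datacenter
      rtt_ms = 2 * km / C_FIBER * 1000
      print(f"{km:>5} km away: >= {rtt_ms:.1f} ms round trip before any compute")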


Theoretically yes. But in the real world, no.

Simple thought experiment: you want to know how many tons of copper are mined in the US each year. Lowest possible latency is calculating this in your head, most likely using data you don’t have. Looking it up online is a lot, lot faster.

In some far future world maybe every transistor will include the sum total of human knowledge up to the nanosecond, but that’s a pretty far future. There are many things where running locally means a higher latency floor.


It's still cheaper to run a free model on a competitive "dumb" cloud host than to buy a service only one company provides.


There are still a few people in the world who don't have always-on gigabit internet access everywhere they go.


"There is No Reason for Any Individual To Have a Computer in Their Home"


I would also expect 10x improvements over the next year due to optimizations found throughout the stack.


It's worth considering the potential drawbacks of relying entirely on voice-operated assistants like ChatGPT. There are concerns around privacy and the use of personal data, as well as the potential for bias and inaccuracies in the responses generated by these models. It's important to strike a balance between the convenience and benefits of these technologies and the potential risks and limitations they bring. Nonetheless, the advancements being made in this field are impressive and it will be interesting to see how they develop in the future.


That's very ChatGPT of you to say!


I think voice assistants can perform actions on phones (e.g. "open app, message Alice, call Bob, turn off Bluetooth"). This couldn't do that (I think), which is an obvious drawback.


4 words a second doesn't seem fast enough for a voice assistant?


It's faster than that [0]: 20 tokens/s should be approximately 15 words per second.

0: https://help.openai.com/en/articles/4936856-what-are-tokens-...


I've had difficulty obtaining useful results from the smaller (7B-sized) models. The issue lies in the content, not the speed. If you could stream the text-to-speech, the speed alone would be satisfactory.


You're right, I overestimated how fast we talk!


Some rules of thumb I use for estimating this kind of stuff

100wpm: Max typing speed

200wpm: Max speaking speed

300wpm: Max listening speed, max reading speed with subvocalisation

900wpm: Max reading speed without subvocalisation


Doing napkin math, this model should be hitting 900wpm.
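Roughly, assuming the ~1.3 tokens-per-word figure from upthread:

  tokens_per_second = 20
  tokens_per_word = 1.3          # rough figure for English prose
  wpm = tokens_per_second / tokens_per_word * 60
  print(round(wpm))              # ~923 words per minute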



