
A quick survey of the thread seems to indicate the 7b parameter LLaMA model does about 20 tokens per second (~4 words per second) on a base model M1 Pro, by taking advantage of Apple Silicon’s Neural Engine.

Note that the latest model iPhones ship with a Neural Engine of similar performance to latest model M-series MacBooks (both iPhone 14 Pro and M1 MacBook Pro claim 15.8 teraflops on their respective neural engines; it might be the same exact component in each chip). All iPhone 14 models sport 6GB integrated RAM; the MacBook starts at 8GB. All of the specs indicate an iPhone 14 Pro could achieve similar throughput to an M1 MacBook Pro.

Some people have already had success porting Whisper to the Neural Engine, and as of 14 hours ago GGerganov (the guy who made this port of LLaMA to the Neural Engine and who made the port of Whisper to C++) posted a GitHub comment indicating he will be working on that in the next few weeks.

So. With Whisper and LLaMA on the Neural Engine both showing better than real-time performance, and Apple’s own pre-existing Siri Neural TTS, it looks like we have all the pieces needed to make a ChatGPT-level assistant operate entirely through voice and run entirely on your phone. This is absolutely extraordinary stuff!
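For anyone wondering how those pieces would bolt together, here is a minimal sketch in Python that shells out to the whisper.cpp and llama.cpp example programs. The binary locations, model paths, and flags are my assumptions from memory, and macOS's built-in "say" command stands in for Siri Neural TTS:

  import subprocess

  def transcribe(wav_path):
      # Assumption: whisper.cpp's example binary ("main"), with -nt to drop
      # timestamps, prints the transcript of a 16 kHz WAV file to stdout.
      out = subprocess.run(
          ["./whisper/main", "-m", "models/ggml-base.en.bin", "-f", wav_path, "-nt"],
          capture_output=True, text=True, check=True)
      return out.stdout.strip()

  def generate(prompt):
      # Assumption: llama.cpp's example binary ("main") echoes the prompt
      # followed by the completion to stdout.
      out = subprocess.run(
          ["./llama/main", "-m", "models/7B/ggml-model-q4_0.bin",
           "-p", prompt, "-n", "128"],
          capture_output=True, text=True, check=True)
      return out.stdout[len(prompt):].strip()

  def speak(text):
      # macOS built-in TTS as a stand-in for Siri Neural TTS.
      subprocess.run(["say", text], check=True)

  question = transcribe("question.wav")
  speak(generate(question))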



> All of the specs indicate an iPhone 14 Pro could achieve similar throughput to an M1 MacBook Pro.

Battery capacity and thermals are different and might be problematic. The phone might throttle performance earlier.

> it looks like we have all the pieces needed to make a ChatGPT-level assistant operate entirely through voice and run entirely on your phone.

As a demo, yes, but would loading the model be fast enough for Siri-like responsiveness? You also would want to run other programs alongside it.

And of course, for Apple to adopt something like this, we would have to get rid of the tendency of these models to derail conversations. Put in something somewhat sexist/racist/…, and it will reply with something a bit more sexist/racist/….

But yes, it would be a cool demo.


> we would have to get rid of the tendency of these models [...] reply with something a bit more sexist/racist/

If you don't want it to be racist, don't say racist things to it. Also, it'll be fairly clear where the racism came from - like a parrot and their owner.

AIs that can tweet, like MS Tay, and that remote-work chatbot, get a lot of attention when they melt down. Private AIs on your phone don't seem like they'll cause any concern with the phone-using public.

I think we'll appreciate the benefits more than we'll mind that others can make it say dirty words.


Siri doesn’t seem as fast or responsive as Google Assistant at times.


At this point in time, Siri as a voice-driven assistant has become so totally and utterly useless, it's not even worth comparing it to anything else. I wonder how a company can work on a feature like that for 10 years, and manage to make it worse with every release they put out.

At this point in time, Apple should be so embarrassed of Siri that I really think scratching the whole thing would have a net benefit.

Scratch it, and start over. And fire everyone involved with Siri :-)


The logistics aren't that easy, Apple's entire product line runs Siri.


Siri is sometimes busy doing laundry or Gods know what. I think the quality of Siri is much better than Google Assistant but I wonder about the lag.


Really? I find Siri can’t understand anything beyond basic instructions.

Google Assistant seems able to do more.


I'm very interested in this space. Can you share an example that illustrates the difference in "understanding" between the two?


Just recently Siri would belly-up on “Turn off Living Room lightS” — it would only work if I said “light” (singular). Extremely frustrating. They fixed it, I think, but this arbitrariness, and many others like it, makes me think Siri is more quirk- and algorithm-based than a true AI.


Handling smart home requests is the one thing that Siri seems to do more or less without error, at least for me. I use that multiple times per day, and cannot remember the last time that it did not work.


Is Siri better, or does it have you well trained? My smart home stuff works best for me because I know more of the exact labels. I was literally surprised the other day that my wife included an S and it still worked.


Mine is really really poor at it.

Half the time it responds with "one moment.. One moment.. this is taking too long" or "I have problems connecting to the internet". But there are no internet problems whatsoever, and it connects to my Home Assistant using the local HomeKit integration, which shouldn't even need the internet.


> Some people have already had success porting Whisper to the Neural Engine, and as of 14 hours ago GGerganov (the guy who made this port of LLaMA to the Neural Engine and who made the port of Whisper to C++) posted a GitHub comment indicating he will be working on that in the next few weeks.

He has already done great work here: https://github.com/ggerganov/whisper.cpp


If I may, this library runs LLaMA on CPU. There is no way to run it on the Neural Engine yet.

The optimization in this case only seems to refer to the 4bit model loading method (to be friendlier to the arm64 CPU)
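For context, the 4-bit trick is block quantization: small groups of weights share one scale, and each weight is stored as a 4-bit integer, roughly quartering the memory the CPU has to stream. A rough numpy sketch of the idea; the block size and scaling are my guess at the general shape, not the exact ggml format:

  import numpy as np

  BLOCK = 32  # weights per block; ggml uses small fixed-size blocks like this

  def quantize_q4(weights):
      # One float16 scale per block, weights rounded to integers in [-8, 7].
      w = weights.reshape(-1, BLOCK)
      scale = np.abs(w).max(axis=1, keepdims=True) / 7.0 + 1e-12
      q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
      return q, scale.astype(np.float16)

  def dequantize_q4(q, scale):
      return (q.astype(np.float32) * scale).reshape(-1)

  w = np.random.randn(1024 * 1024).astype(np.float32)
  q, scale = quantize_q4(w)
  packed_bytes = q.size // 2 + scale.nbytes      # two 4-bit values per byte
  print("compression:", round(w.nbytes / packed_bytes, 1), "x")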

GeoHot has tinygrad running LLaMA on Metal (but only the 7B model); that's the closest I've seen to taking advantage of Apple silicon.

Neural Engine implementation would be awesome


Oh shit, I took a closer look and you’re right. The repo was also helpfully updated with a note to this effect: “The Accelerate framework is actually currently unused since I found that for tensor shapes typical for the Decoder, there is no benefit compared to the ARM_NEON intrinsics implementation. Of course, it's possible that I simply don't know how to utilize it properly. But in any case, you can even disable it with LLAMA_NO_ACCELERATE=1 make and the performance will be the same, since no BLAS calls are invoked by the current implementation”.

No Joi in my pocket just yet :(

Because of this I re-checked my claims about the Whisper speed up from the Neural Engine and that does look legit, 6x at least. So the Neural Engine does have the chops for this workload, it just isn’t being used in this repo. It may not be LLaMA, but I sure hope someone gets an LLM running on the ANE sooner rather than later.
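For anyone who wants to poke at that, the usual recipe is to trace a PyTorch module and convert it with coremltools while requesting the Neural Engine. A minimal sketch below; the toy decoder block and shapes are placeholders, and whether Core ML actually schedules the ops on the ANE (and whether this particular trace converts cleanly on your coremltools/PyTorch versions) is up to the runtime:

  import torch
  import coremltools as ct

  class ToyDecoderBlock(torch.nn.Module):
      # Placeholder stand-in for one transformer decoder layer.
      def __init__(self, d_model=512, n_heads=8):
          super().__init__()
          self.attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
          self.mlp = torch.nn.Sequential(
              torch.nn.Linear(d_model, 4 * d_model),
              torch.nn.GELU(),
              torch.nn.Linear(4 * d_model, d_model))

      def forward(self, x):
          a, _ = self.attn(x, x, x, need_weights=False)
          return x + self.mlp(x + a)

  model = ToyDecoderBlock().eval()
  example = torch.randn(1, 1, 512)   # batch size 1, a single new token
  traced = torch.jit.trace(model, example)

  mlmodel = ct.convert(
      traced,
      inputs=[ct.TensorType(shape=example.shape)],
      convert_to="mlprogram",
      compute_units=ct.ComputeUnit.CPU_AND_NE,  # ask for the ANE where possible
  )
  mlmodel.save("toy_decoder.mlpackage")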


Our investigations indicate that it might not be possible to achieve an ANE performance improvement over the CPU for LLM decoder inference with a batch size of 1 [0]. Just to make it clear - I'm no expert in Core ML / ANE, so these conclusions could be totally wrong.

[0] https://github.com/ggerganov/whisper.cpp/discussions/548#dis...


Don’t sell yourself short! (And you have my apologies in advance if my excited comment above has created any extra work for you)


Neural Engine across the M1 and M2 series is also sadly very limited.

I bought one thinking I could exploit it for Stable Diffusion and other tasks, but found that most libraries say to use the GPU for faster generation. What I found is that not only is the engine the same on the M2 Pro (meaning I upgraded from my base-model M1 for no reason), but it also doesn't scale at all, except on the M1 Ultra where it's doubled simply because it's using two dies bridged together.

The Neural Engine can generate 512x512 images pretty easily, but it takes a while even compared to using the GPU on a base-model M1 Mac Mini. It's kinda crazy. I'm looking into ways to improve it and take advantage of the Neural Engine in the future, but the current situation is very limited. Even Apple's official implementation and Core ML libraries seem to prefer that you run them on Metal.
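If anyone wants to reproduce that comparison, coremltools lets you pin a converted model to a compute unit and time it. A rough harness; the model path, input name, and shape are placeholders for whatever .mlpackage you have lying around (e.g. one of the converted Stable Diffusion components):

  import time
  import numpy as np
  import coremltools as ct

  MODEL_PATH = "unet.mlpackage"   # placeholder: any converted Core ML model
  INPUT_NAME = "sample"           # placeholder: check model.get_spec() for real names

  def bench(compute_units, runs=10):
      model = ct.models.MLModel(MODEL_PATH, compute_units=compute_units)
      x = np.random.rand(1, 4, 64, 64).astype(np.float32)   # placeholder shape
      model.predict({INPUT_NAME: x})                         # warm-up / first load
      start = time.time()
      for _ in range(runs):
          model.predict({INPUT_NAME: x})
      return (time.time() - start) / runs

  for cu in (ct.ComputeUnit.CPU_ONLY,
             ct.ComputeUnit.CPU_AND_GPU,
             ct.ComputeUnit.CPU_AND_NE):
      print(cu, round(bench(cu), 3), "s per prediction")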


The 7b model specifically is not quite "ChatGPT-level" though, is it?


According to Meta's benchmarking[0] it is comparable on many metrics. I haven't used it myself so I can't say for sure if that is the case when actually using it.

[0]: https://arxiv.org/pdf/2302.13971.pdf


That's GPT3, not ChatGPT.


I don't understand this topic well, but given the premise that GPT-3 and ChatGPT differ only in that ChatGPT includes RLHF (Reinforcement Learning from Human Feedback), and that LLaMA 7b is comparable to GPT-3 on a number of metrics, it would follow that if we were to improve LLaMA 7b with RLHF, the 7b model would be similar to ChatGPT. Is that correct?


You're likely right that applying RLHF (+ fine-tuning with instructions) to LLaMA 7b would produce results similar to ChatGPT, but I think you're implying that that would be feasible today.

RLHF requires a large amount of human feedback data and IIRC there's no open data set for that right now.


There's open-assistant.io, which is doing RLHF directly in the open.


And they've already collected over 100,000 samples; IIRC ChatGPT was trained on something like 30,000 samples, so the open models should already be positioned to succeed.


There are open datasets (see the chatllama harness project and its references). You can of course also cross train it using actual ChatGPT.


Is there something I'm missing? ChatLlama doesn't reference any human feedback datasets.

> You can of course also cross train it using actual ChatGPT.

You mean train it on ChatGPT's output? That's against OpenAI's terms of service.


> You mean train it on ChatGPT's output? That's against OpenAI's terms of service.

Oh no, someone call the internet police.

I'm sure scraping tons and tons of images and web data to train DALL-E and GPT, and then selling access to that data to others, was also against many licenses and terms of service, but OpenAI did it anyway.


None of these AIs were created ethically. At the very least we can make sure these huge models don’t solely belong to monopolistic tech companies and democratize their power.


You’re missing something. Both SHP (https://huggingface.co/datasets/stanfordnlp/SHP) and OpenAssistant datasets are referenced.

And while the TOS violation might be the case, the project nevertheless has a mode to use OpenAI in the fine-tuning steps.


I’m interested in this as well. Comparatively little attention has been paid to those 7B model results, but they look quite good against 175B GPT-3.

As for ChatGPT, that is GPT-3.5 (same 175B model, but with instruction fine-tuning), plus the RLHF.


GPT-3.5 likely differs from the original GPT-3 by more than instruction fine-tuning. For example, it was probably retrained under Chinchilla scaling laws [1], with a lot more data and maybe a somewhat smaller parameter count.

There are many variants of GPT-3 and GPT-3.5, and based on the performance numbers in Meta’s paper, it looks like they’re comparing against the very first version of GPT-3 from 2020. [2]

[1] https://arxiv.org/abs/2203.15556

[2] https://arxiv.org/abs/2005.14165
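The Chinchilla result reduces to a handy rule of thumb: compute-optimal training wants roughly 20 tokens per parameter, with training compute of roughly 6 * params * tokens FLOPs. Napkin math below (the 20x and 6ND figures are the usual approximations):

  # Chinchilla rule of thumb: ~20 training tokens per parameter,
  # training compute ~ 6 * N * D FLOPs (N = params, D = tokens).
  for name, n_params in [("LLaMA 7B", 7e9), ("GPT-3 175B", 175e9)]:
      optimal_tokens = 20 * n_params
      train_flops = 6 * n_params * optimal_tokens
      print(f"{name}: ~{optimal_tokens / 1e9:.0f}B tokens, ~{train_flops:.1e} training FLOPs")

(LLaMA itself went well past that, training the 7B on roughly 1T tokens, which is part of why such a small model punches above its weight.)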


There's no overhead introduced for the 'final' model inference, is there?


None of the Meta models are RLHF tuned, as far as I know.


I wish we could start having open source TTS models with similar performance. So far Tortoise TTS is not there yet. I'm not sure if Siri Neural TTS is offered to third-party apps.


>20 tokens per second (~4 words per second)

How can there be 5 tokens per word, when LLaMA has more than half the vocabulary of GPT-2/3, which runs about 1.3 tokens per word?

I would have guessed more like 1.5 tokens per word.


Oh, it’s probably higher than four words per second, then. I assumed tokens were characters and used the standard “there are five characters in a word” rule of thumb.


It's about 4 characters per token. So just over 1 token per word. I just round to 1 token per word since text most people generate does not use larger words, and because larger common words are still encoded as one token (e.g. HackerNews is probably one token despite being 10 characters).
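Easy enough to sanity-check with GPT-2's BPE as a proxy (LLaMA uses a different SentencePiece vocabulary, so treat the number as a rough estimate):

  import tiktoken

  enc = tiktoken.get_encoding("gpt2")
  text = ("A quick survey of the thread seems to indicate the 7b parameter "
          "LLaMA model does about 20 tokens per second on a base model M1 Pro.")
  n_tokens = len(enc.encode(text))
  n_words = len(text.split())
  print(n_tokens, n_words, round(n_tokens / n_words, 2))   # roughly 1.3 tokens per word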


I typically see people claim 2-3 tokens per word.


But won't it be that, in real life, no one would want to run a voice command that consumes a lot of CPU and battery, as opposed to making a network call to a service that has this model hosted?

Agreed that this can always be improved and hardware can get more efficient and better too, but at the end of the day, would it ever be better than an API call?


I live in eastern Oregon on a property with no cell service.

I use Siri a lot, mainly to add reminders, and sometimes I try to use Siri when I'm out at the greenhouse, which is just past the edge of the mesh network. I would love for those reminders to get added - even if it burnt battery.

And more generally I would love for people writing apps to consider that phones don't always have service - as would my neighbors.


Privacy concerns are justified.

It's not just that, this can also work completely offline.


I'm looking forward to running stuff like this online. Using soulless big-tech corporate SaaS AI is just pure dystopia material.

It's even better that we are talking about a relatively low-power machine here. Maybe it can even operate offline.


You mean offline?


Ultimately, no amount of technology will ever beat the speed of light. Running locally will always have a lower latency floor.
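The floor is easy to put a number on. Quick calculation below; the distances are made-up examples, the ~0.67c figure is the usual approximation for light in fiber, and real networks add routing and serialization on top:

  C_VACUUM = 299_792            # km/s
  C_FIBER = C_VACUUM * 0.67     # light in optical fiber is roughly a third slower

  for km in (50, 500, 5000):    # example distances to a datacenter
      rtt_ms = 2 * km / C_FIBER * 1000
      print(f"{km:>5} km away: >= {rtt_ms:.1f} ms round trip before any compute")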


Theoretically yes. But in the real world, no.

Simple thought experiment: you want to know how many tons of copper are mined in the US each year. Lowest possible latency is calculating this in your head, most likely using data you don’t have. Looking it up online is a lot, lot faster.

In some far future world maybe every transistor will include the sum total of human knowledge up to the nanosecond, but that’s a pretty far future. There are many things where running locally means a higher latency floor.


It's still cheaper to run a free model on a competitive "dumb" cloud host than to buy a service only one company provides.


There are still a few people in the world who don't have always-on gigabit internet access everywhere they go.


"There is No Reason for Any Individual To Have a Computer in Their Home"


I would also expect 10x improvements over the next year due to optimizations found throughout the stack.


It's worth considering the potential drawbacks of relying entirely on voice-operated assistants like ChatGPT. There are concerns around privacy and the use of personal data, as well as the potential for bias and inaccuracies in the responses generated by these models. It's important to strike a balance between the convenience and benefits of these technologies and the potential risks and limitations they bring. Nonetheless, the advancements being made in this field are impressive and it will be interesting to see how they develop in the future.


That's very ChatGPT of you to say!


I think voice assistants can perform actions on phones (e.g. "open app, message Alice, call Bob, turn off Bluetooth"). This couldn't do that (I think), which is an obvious drawback.


4 words a second doesn't seem fast enough for a voice assistant?


It's faster than that [0]: 20 tokens/s should be approximately 15 words per second.

0: https://help.openai.com/en/articles/4936856-what-are-tokens-...


I've had difficulty obtaining useful results from the smaller (7B-sized) models. The issue lies in the content, not the speed. If you could stream the text-to-speech, the speed alone would be satisfactory.


You're right, I overestimated how fast we talk!


Some rules of thumb I use for estimating this kind of stuff

100wpm: Max typing speed

200wpm: Max speaking speed

300wpm: Max listening speed, max reading speed with subvocalisation

900wpm: Max reading speed without subvocalisation


Doing napkin math, this model should be hitting 900wpm.
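Roughly, assuming the ~1.3 tokens-per-word figure from upthread:

  tokens_per_second = 20
  tokens_per_word = 1.3          # rough figure for English prose
  wpm = tokens_per_second / tokens_per_word * 60
  print(round(wpm))              # ~923 words per minute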



