A unified, efficient, open-source LLM deployment engine for both cloud-server and local use cases.
It comes with a fully OpenAI-compatible API and runs directly in Python, on iOS and Android, and in web browsers, supporting deployment of the latest large language models such as Qwen2, Phi3, and more.
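As a rough illustration of what an OpenAI-compatible API means in practice, a chat completion request against a locally served model could look like the sketch below. The server address, port, and model name are placeholders, not the project's actual defaults:

```python
# Minimal sketch of querying an OpenAI-compatible endpoint served locally.
# The base_url, api_key, and model identifier below are illustrative placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",  # assumed local server address
    api_key="not-needed-for-local-use",
)

response = client.chat.completions.create(
    model="Qwen2-7B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Hello! What can you do?"}],
)
print(response.choices[0].message.content)
```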
It will take some effort to implement the operators, but not too much (CUTLASS's grouped GEMM already supports different MNK shapes). However, the performance benefit is marginal compared to simply padding all LoRA ranks to the same rank, because these kernels are not compute-bound.
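To make the padding alternative concrete, here is a minimal sketch in plain PyTorch (not the engine's actual kernels) of padding heterogeneous LoRA factors to a common maximum rank so all adapters go through one batched matmul. The names and shapes are illustrative assumptions:

```python
# Illustrative sketch: pad LoRA A/B factors of different ranks to a common
# maximum rank so a single batched matmul applies all adapters at once.
# Shapes and helper names are assumptions, not the engine's real kernels.
import torch


def pad_lora_weights(lora_As, lora_Bs):
    """lora_As[i]: (r_i, d_in), lora_Bs[i]: (d_out, r_i), with varying r_i."""
    max_r = max(A.shape[0] for A in lora_As)
    padded_A = torch.stack([
        torch.nn.functional.pad(A, (0, 0, 0, max_r - A.shape[0]))  # pad rows
        for A in lora_As
    ])  # (n_adapters, max_r, d_in)
    padded_B = torch.stack([
        torch.nn.functional.pad(B, (0, max_r - B.shape[1]))  # pad columns
        for B in lora_Bs
    ])  # (n_adapters, d_out, max_r)
    return padded_A, padded_B


def apply_lora(x, padded_A, padded_B):
    """x: (n_adapters, seq, d_in) -> (n_adapters, seq, d_out).
    Padded rows/columns are zero, so they add nothing to the output."""
    return torch.bmm(torch.bmm(x, padded_A.transpose(1, 2)),
                     padded_B.transpose(1, 2))
```

Since the padded entries are zero, the extra work is wasted FLOPs rather than wrong results, and because these layers are memory-bound the waste costs little in wall-clock time.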
Thanks for the pointer! As far as we know, WebGPU development in Firefox is lagging a bit behind, so we used Chrome and did not develop this project on Firefox.
Yes, of course. Optimizing and building the model into a format accepted by the ONNX web runtime would get this in. On the other hand, we also need to enhance our own runtime in the future (for example, with better memory pool management).
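For context, getting a PyTorch model into a format an ONNX web runtime can load is typically a `torch.onnx.export` call along the lines of the sketch below; the model, input shape, and file name are placeholders:

```python
# Rough sketch of exporting a PyTorch model to ONNX so an ONNX web runtime
# could load it. The model, input shape, and output path are placeholders.
import torch

model = torch.nn.Linear(768, 768)  # stand-in for the real model
model.eval()
dummy_input = torch.randn(1, 768)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)
```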
Thanks for your interest! Most existing stable diffusion demos rely on a backend server to run the image generation, which means you need to host your own GPU server to support these workloads. It is hard to have such a demo run purely in the web browser, because stable diffusion usually has heavy compute and memory requirements.
Web stable diffusion puts the stable diffusion model directly in your browser, and it runs on the client GPU of the user's laptop. This means there is no queueing for a server's response. It also means more opportunities for client-server co-optimization, since the "client" and "server" are essentially the same laptop. Web stable diffusion is also friendly to personalization and privacy: since everything runs on the client side and no interaction with a server is needed, you can imagine deploying and demonstrating your own custom-style stable diffusion on the web without sharing the model with anyone else, and you can also run it with personalized model input (e.g., the text prompt in this case) without letting others know.
Thanks for your interest again! We are happy to hear your feedback on your experience and the functionality you would like us to add in the future.
What happens when someone wants to update/upgrade the model to a newer version? Can they just get a “diff” and “patch” their model, or do they have to download a whole new one?
Upgrading the model is pretty easy. We just need to build the new model locally in the same way we built the current one, which usually takes less than two minutes. If people want to deploy the new version to the web browser and share it for others to use, they just need to upload the model weights to some server (for example, we currently use a public Hugging Face repo to store the weights) and provide a link pointing to the weights. This, too, can be done with little effort.
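As a rough sketch of the sharing step, uploading the rebuilt weights with `huggingface_hub` could look like the snippet below; the repo id and local path are placeholders for your own repo and build output:

```python
# Illustrative sketch: push locally built model weights to a public
# Hugging Face repo so a web demo can fetch them by URL.
# repo_id and folder_path are placeholders, not the project's actual repo.
from huggingface_hub import HfApi

api = HfApi()
api.create_repo(repo_id="your-username/your-model-weights", exist_ok=True)
api.upload_folder(
    folder_path="dist/your-model-weights",  # assumed local build output
    repo_id="your-username/your-model-weights",
    repo_type="model",
)
```

The web page then only needs to point at the new weight URL, so existing users pick up the upgrade by downloading the new artifacts rather than patching the old ones.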