For anyone interested in SparseGPT: on May 25th, the author of the SparseGPT paper will show how to download an optimized, open-sourced LLM and run it on CPUs at GPU-class speeds using DeepSparse.
Exciting news! Our latest MLPerf™ Inference v3.0 results show a 6X improvement in just six months, bringing our CPU performance gain to 1,000X overall while cutting power consumption by 92%. Our sparsified BERT NLP model reaches 5,578 items/sec, outperforming both ONNX Runtime on CPU and an NVIDIA T4 GPU, while our streamlined ResNet-50 model leads the pack at 19,632 images/sec for image classification. The key to these results is compound sparsity as a model compression technique, which let us trim ResNet-50 from 90.8MB to 11MB and BERT-Large from 1.2GB to a mere 10MB. All of this runs on DeepSparse, our inference runtime built specifically to accelerate sparse models on x86 and ARM CPUs.
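If you want to try a sparse model yourself, here is a minimal sketch using DeepSparse's Python Pipeline API; the SparseZoo model stub below is illustrative and may not match the exact MLPerf submission model.

# pip install deepsparse
from deepsparse import Pipeline

# Illustrative SparseZoo stub for a pruned + quantized ResNet-50;
# the exact MLPerf submission model may differ.
pipeline = Pipeline.create(
    task="image_classification",
    model_path="zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none",
)

# Run inference on a local image file and print the top predictions.
predictions = pipeline(images=["sample.jpg"])
print(predictions.labels, predictions.scores)

The deepsparse.benchmark CLI accepts the same model stubs if you want to measure throughput on your own CPU.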
Happy to introduce one of the most comprehensive ChatGPT cheat sheets: a 30-page paper cataloging prompts for steering ChatGPT's text generation. The document covers not only what ChatGPT can generate but also how to get it to generate it! Here is the TOC:
You can now load multiple transformers (each model has its own sparsification recipe) on top of the DeepSparse server behind Streamlit, and it's open source. This was battle-tested on a virtual machine with just 16GB of RAM and 4 vCPUs; that's enough to hold up to 19 sparse BERT models in memory and compare their question-answering performance side by side, as in the sketch below (P.S. they are really fast on just CPUs).
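If you are serving several of these models behind deepsparse.server, a comparison loop can be as simple as the following sketch; the port is the server's default, but the routes and payload wording are hypothetical and depend on your server config.

# A minimal sketch for comparing two sparse BERT QA endpoints exposed by
# deepsparse.server. Routes below are hypothetical; match them to your config.
import requests

PAYLOAD = {
    "question": "What runtime accelerates sparse models on CPUs?",
    "context": (
        "DeepSparse is an inference runtime designed to accelerate "
        "sparse models on x86 and ARM CPUs."
    ),
}

# 5543 is deepsparse.server's default port.
for route in ("/qa/pruned90/predict", "/qa/pruned95_quant/predict"):
    resp = requests.post(f"http://localhost:5543{route}", json=PAYLOAD)
    resp.raise_for_status()
    print(route, "->", resp.json())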
Confirm your spot: https://neuralmagic.com/unlock-faster-and-more-efficient-lan...