Snapdragon 8 Gen 1's iGPU: Adreno Gets Big (chipsandcheese.com)
95 points by rbanffy on March 7, 2024 | 45 comments


For the 'CPU to GPU Copy Bandwidth' section, the more likely reason the GPU -> CPU copy is slow is that there's no reason to do it. Adreno is unified memory; you can just mmap that GPU buffer on the CPU. This is done on Android via the "gralloc" HAL, also called (A)HardwareBuffer.
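
For anyone curious what that looks like in practice, here's a minimal NDK sketch; the buffer size, format and usage flags are just illustrative and most error handling is omitted:

    #include <android/hardware_buffer.h>

    /* Sketch: share one allocation between GPU and CPU instead of copying back. */
    int read_gpu_buffer_on_cpu(void) {
        AHardwareBuffer_Desc desc = {
            .width  = 1024,
            .height = 1024,
            .layers = 1,
            .format = AHARDWAREBUFFER_FORMAT_R8G8B8A8_UNORM,
            .usage  = AHARDWAREBUFFER_USAGE_GPU_SAMPLED_IMAGE |
                      AHARDWAREBUFFER_USAGE_CPU_READ_OFTEN,
        };
        AHardwareBuffer *buf = NULL;
        if (AHardwareBuffer_allocate(&desc, &buf) != 0) return -1;

        /* "Lock" hands the CPU a pointer into the same memory the GPU uses,
           effectively an mmap; there is no copy engine in the path. */
        void *ptr = NULL;
        AHardwareBuffer_lock(buf, AHARDWAREBUFFER_USAGE_CPU_READ_OFTEN,
                             -1 /* no fence */, NULL /* whole buffer */, &ptr);
        /* ... read whatever the GPU wrote through ptr ... */
        AHardwareBuffer_unlock(buf, NULL);
        AHardwareBuffer_release(buf);
        return 0;
    }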

CPU->GPU is still valuable in that it's where texture swizzling happens to optimize the data for non-linear access, and vendors are all cagey about documenting these formats. But I don't think there's a copy engine for it at all; I think it's just CPU code. If you run a Perfetto trace you can see Adreno actually using multiple threads for this, which is likely why CPU->GPU is then so much faster than the reverse. But you almost never need non-linear output, so since vendor-specific swizzling isn't helpful in that direction, you just don't bother and use shared memory between the two.
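
To give a feel for what "swizzling" means here: a classic example of a non-linear layout is Morton (Z-order), where the x and y bits are interleaved so neighbouring pixels land at nearby addresses. Adreno's actual layouts are proprietary and not this; it's just an illustration of the idea (assumes a square power-of-two texture):

    #include <stdint.h>

    /* Spread the low 16 bits of v so they occupy the even bit positions. */
    static uint32_t part1by1(uint32_t v) {
        v &= 0x0000ffff;
        v = (v | (v << 8)) & 0x00ff00ff;
        v = (v | (v << 4)) & 0x0f0f0f0f;
        v = (v | (v << 2)) & 0x33333333;
        v = (v | (v << 1)) & 0x55555555;
        return v;
    }

    /* Morton (Z-order) index: interleave x and y bits so that pixels that
       are close in 2D are also close in memory. */
    static uint32_t morton_index(uint32_t x, uint32_t y) {
        return part1by1(x) | (part1by1(y) << 1);
    }

    /* CPU-side "upload": copy a linear RGBA image into the swizzled layout.
       Assumes width == height and both are powers of two. */
    void swizzle_rgba(const uint32_t *linear, uint32_t *swizzled,
                      uint32_t width, uint32_t height) {
        for (uint32_t y = 0; y < height; y++)
            for (uint32_t x = 0; x < width; x++)
                swizzled[morton_index(x, y)] = linear[y * width + x];
    }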


Great article, as always!

The author not only looks into the spec sheet and the presentation, but also digs into the OSS Mesa code and uses OpenGL introspection to reverse-engineer (well, not entirely by himself, but still) the architecture. For me this is one of the most detailed explanations of what a mobile GPU architecture looks like.

The comparison to the older NVIDIA GPU is also very helpful (there's roughly a six-year gap between this and the discrete NVIDIA GTX 1050). Now I wonder how it compares to other mobile GPUs like Apple's or ARM's.


Does anyone know if there is any processor (in a phone) with SVE/SVE2? SVE/SVE2 SHOULD be in all new ARMv9.0-A and later CPU cores.

I have only had a chance to test Qualcomm Snapdragon cores (Samsung Galaxy S23).


The Snapdragon 888 was probably the first one to have SVE, and Snapdragon 7/8 Gen chips all have the Qualcomm® Type-1 Hypervisor. So what's your test setup, are you able to run any Linux on it?


I am using the default Android on a Samsung Galaxy S23 with UserLAnd (https://play.google.com/store/apps/details?id=tech.ula). Any advice on how to run it?


Qualcomm disables SVE (masked at the hypervisor level) on all their silicon. If you want SVE on a phone today, your options are the Tensor G3, the Exynos 2200/2400, or MediaTek phones with ARMv9 CPUs.

Or, if you have code execution at the hypervisor exception level (including on an unfused phone), you can patch that limitation out.
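
If you just want to check from userspace whether SVE/SVE2 is being exposed at all, the Linux hwcaps are the easy way (AArch64 only; on the Snapdragons discussed here this will presumably print "no"):

    #include <stdio.h>
    #include <sys/auxv.h>
    #include <asm/hwcap.h>   /* HWCAP_SVE / HWCAP2_SVE2 on arm64 */

    int main(void) {
        unsigned long hw  = getauxval(AT_HWCAP);
        unsigned long hw2 = getauxval(AT_HWCAP2);
        printf("SVE:  %s\n", (hw  & HWCAP_SVE)   ? "yes" : "no");
        printf("SVE2: %s\n", (hw2 & HWCAP2_SVE2) ? "yes" : "no");
        return 0;
    }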


So I have this https://en.wikipedia.org/wiki/Windows_Dev_Kit_2023

And we can run ubuntu on that https://github.com/armbian/build

There isn't any hypervisor running on that, and still no SVE. Any advice?


This runs Cortex-X1C and A78C cores, which are from the generation _just_ before the one that got SVE.


Looking at the docs, it doesn't look like ARMv9-A strictly requires FEAT_SVE to be implemented.

What that means in practice is anyone's guess.


Any explanation why Qualcomm uses "tile-based rendering" while Nvidia and AMD don't?


Because binning is expensive and increases the cost of processing geometry. It works on mobile GPUs because bandwidth to main memory is very power hungry, so eating the cost of binning is worth it when your power envelope is 5 W and you just can't provide enough bandwidth to do immediate-mode rasterization.

Desktop GPUs' power and memory bandwidth budgets utterly dwarf what a mobile GPU has, so paying the compute cost of binning doesn't make much sense: they can just throw more memory bandwidth and larger caches at the problem. Geometry binning and the kind of tiling mobile GPUs use have a lot of performance cliffs, so they're generally avoided when you can afford to. The second you go off the happy path of a tiler, performance drops like a rock, as in the compute example in the linked article. Desktop-class GPUs can get most of the benefits of tiling by just letting their cache hierarchy do the 'tiling' for them to avoid trips to main memory.

Some desktop cards do some tiling work, probably to improve cache utilization, but they don't suffer from the same performance cliffs that mobile GPUs do.
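
Very roughly, the binning pass amounts to something like this (heavily simplified; real hardware works on batched, compressed geometry rather than plain per-tile index lists, but it shows where the extra per-triangle work and memory traffic come from):

    #include <stdint.h>
    #include <stdlib.h>

    #define TILE 32   /* tile size in pixels, illustrative */

    typedef struct { float x[3], y[3]; } Tri;                /* screen-space triangle */
    typedef struct { uint32_t *ids; int count, cap; } Bin;   /* triangle list for one tile */

    /* Binning pass: for every triangle, find which tiles its bounding box
       touches and append it to each of those tiles' lists. This is the extra
       work that immediate-mode GPUs skip. Assumes coordinates are already
       clipped to the screen. */
    void bin_triangles(const Tri *t, int n, Bin *bins, int tiles_x, int tiles_y) {
        for (int i = 0; i < n; i++) {
            float minx = t[i].x[0], maxx = t[i].x[0];
            float miny = t[i].y[0], maxy = t[i].y[0];
            for (int v = 1; v < 3; v++) {
                if (t[i].x[v] < minx) minx = t[i].x[v];
                if (t[i].x[v] > maxx) maxx = t[i].x[v];
                if (t[i].y[v] < miny) miny = t[i].y[v];
                if (t[i].y[v] > maxy) maxy = t[i].y[v];
            }
            int tx1 = (int)maxx / TILE, ty1 = (int)maxy / TILE;
            for (int ty = (int)miny / TILE; ty <= ty1 && ty < tiles_y; ty++) {
                for (int tx = (int)minx / TILE; tx <= tx1 && tx < tiles_x; tx++) {
                    Bin *b = &bins[ty * tiles_x + tx];
                    if (b->count == b->cap) {
                        b->cap = b->cap ? b->cap * 2 : 16;
                        b->ids = realloc(b->ids, b->cap * sizeof(uint32_t));
                    }
                    b->ids[b->count++] = (uint32_t)i;   /* triangle i touches this tile */
                }
            }
        }
    }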



If you can afford to render the entire screen, why would you tile? Tiling complicates certain kinds of shaders, tiles have to be stitched together to make the final image, and tiling redraws (different parts of) the same triangles multiple times.

On a dedicated GPU with lots of memory bandwidth, there's probably no benefit (and maybe even some penalty) to using tiling at lower resolutions (e.g. 1080p). However, 4K rendering might benefit from it and 8K rendering probably requires it.


The output resolution doesn't matter. The reason you tile is to improve locality and thus get more cache hits. For mobile GPUs the cache is literally a dedicated tile buffer, but for Nvidia (who also do this) the cache is just L2. By tiling the geometry, they spend more time in L2 and hit DRAM less often. This is a performance and power win. The actual resolution is irrelevant since the tiles are very small, even on Nvidia GPUs; 16x16 and 32x32 are common tile sizes. See https://www.techpowerup.com/231129/on-nvidias-tile-based-ren... and https://www.youtube.com/watch?v=Nc6R1hwXhL8 where this was reverse engineered before Nvidia actually talked about it.
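
To put rough numbers on it (assuming 4 bytes per pixel): a full 1920x1080 render target is about 8 MB, far too big to keep on-chip, while a 32x32 tile is only 4 KB, so a tile stays resident in cache or tile memory regardless of the output resolution.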

You might be confusing tiled rendering with various upscaling technologies or variable rate shading?


It seems like Nvidia's solution is actually different from the conventional tiling of PowerVR, though they are broadly similar and, as you say, address the main VRAM bandwidth issue by using a smaller but faster cache.

I've watched parts of that video a couple of times, but I don't fully understand it. The best I can make of it is that PowerVR tiles everything at all levels of rendering, whereas nVidia's Maxwell (and other, more modern, desktop GPUs?) tiles only up to the point of rasterization. If I understand this correctly, it means that pixel shaders on mobile operate on a limited view (their tile, plus some margin?) while pixel shaders on desktop still operate across the entire screen. I don't know if this matters that much in practice, and the distinction seems to be motivated by patents (which may have expired by now?) rather than technical necessity or benefits.

Either way, given the massive importance of cache locality, what I originally said about tiled rendering offering no benefit at 1080p is indeed wrong. I think my understanding of GPUs is more than a decade out of date at this point. These problems used to be solved with more power and wider memory buses. That seems to have stopped scaling well over 10 years ago.


It's significantly more bandwidth-efficient to do tile-based rendering, but it has more performance cliffs and requires more care from game developers to avoid hitting issues. For mobile SoCs (PowerVR, Adreno, Mali, and whatever Apple calls their PowerVR derivative) you can't just throw GDDR at it and get gobs of bandwidth, and bandwidth is also power-expensive, so the savings more than offset the performance cliffs and developer complications.
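
A schematic of where the bandwidth saving comes from, and where the cliffs are (illustrative C, not any vendor's actual pipeline; assumes framebuffer dimensions are multiples of the tile size):

    #include <stdint.h>
    #include <string.h>

    #define TILE 32

    /* Schematic per-tile loop of a tiler. The tile's pixels live in on-chip
       memory while they're being shaded; DRAM is written exactly once per
       tile, at resolve time. Anything that forces a mid-frame flush (a
       readback, an unexpected render-target switch) breaks this pattern and
       costs a full write-out and reload: one of the performance cliffs. */
    void draw_tiled(uint32_t *framebuffer, int fb_width, int fb_height) {
        uint32_t tile_buf[TILE * TILE];   /* stand-in for on-chip tile memory */

        for (int ty = 0; ty < fb_height / TILE; ty++) {
            for (int tx = 0; tx < fb_width / TILE; tx++) {
                memset(tile_buf, 0, sizeof tile_buf);   /* clears are cheap on-chip */

                /* ... rasterize and shade every triangle binned to this tile
                       into tile_buf; no main-memory traffic happens here ... */

                /* Resolve: the only DRAM write for these pixels. */
                for (int y = 0; y < TILE; y++)
                    memcpy(&framebuffer[(ty * TILE + y) * fb_width + tx * TILE],
                           &tile_buf[y * TILE], TILE * sizeof(uint32_t));
            }
        }
    }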


Both Nvidia and AMD do use tile-based rendering.


And for what it’s worth Apple and Imagination do also - Imagination being where tiled rendering really started.


Not quite, not in the same way mobile GPUs do. Also I don't think AMD ever did the transition. They announced it (draw stream binning), but don't seem to have ever shipped it?


Is Qualcomm capable of doing a flagship CPU/GPU without overheating and high power usage?


The title should say 8+, not 8 (@dang)


You're correct. But I believe they have an identical design - the only difference is that the 8+ runs a little faster and more efficiently, because it was manufactured by TSMC, whereas the 8 was manufactured by Samsung.


The 8+ runs at slightly faster clocks and has a smaller die. Somehow TSMC's 4 nm is better than Samsung's 4 nm.

But yeah everything I wrote there should be applicable to the Snapdragon 8 Gen 1 as well, just without the 10% GPU clock speed increase that Qualcomm says the 8+ gets.


Snapdragon 800 -> Snapdragon 8 Gen N is terrible marketing. It makes everything sound like a point release.


I remember having Snapdragon 800 in my LG G2. It was a beast at the time!


Most people switch phones on a cadence that makes every new one feel like a beast (my personal phone is a 2nd gen iPhone SE and my new work phone is a 13, which feels like a beast next to the other).

It's also incredibly rare to find someone who knows what chip is in their phones. I have the vague notion my personal phone has an A15 or something like that.


My headcanon is that Snapdragon 875 would've been a great chip unlike the infamously overheating 888.


I'm on a Snapdragon 870 (moto g100), and it's great. Performance and battery life are both good enough that I don't think about it very often, and I've never had any overheating issues.

I think the 870 was the highest end Snapdragon that didn't have any overheating issues for a while.


5nm Samsung process was just bad. Needs active cooling.

I've had the SM8250 (865 in my case) and it's a great chip that's faster in real-world conditions than the 888 hand warmer was. Now I have an 8 Gen 2 device and it has no overheating issues. The 8 Gen 1+ doesn't appear to have any either. I avoided the 8 Gen 1 after the bad experience with the 888.


Aah, ok, I edited my comment to just say that it was the best for a while, not still the best. Agree with you about Samsung's 5nm process.


I believe their biggest failure in branding is not putting out Raspberry Pi-like small boards for people to experiment with. Anyone with a slight interest in computers knows an M3 is a beast and an i3 is meh. Almost nobody outside Qualcomm knows why an 8 Gen N is better than an 800 and what the difference would be.


There are economic reasons behind the cheap SoCs you see. Invariably it's an SoC that was made in bulk expecting some market that never materialized.

For instance the original Pi SoC was very clearly intended for some set top box OEM that didn't pick it up for some reason.

When that happens, after a while the chip manufacturer is willing to sell them for way below cost just to get anything back from their inventory that from their perspective looks like a complete loss.

So you get an industrious cottage industry that takes those too-cheap-to-make-sense chips, cost-reduces the reference design, and ships them with the just-barely-building SDK from the chip manufacturer.

At the end of the day Qualcomm doesn't care about this market because they are pretty good about keeping their customers on the hook for their bulk orders. So they're focused on supporting current bulk customers where a $1k dev board is actually a really reasonable price.


They do in fact put those out. That said, the pricing is not at all Pi-like: the Snapdragon 8 Gen 2 board goes for north of a thousand dollars.

https://www.lantronix.com/products/snapdragon-8-gen-2-mobile...


That's only Pi-like in the sense of being an SBC. Unlike the Pi, it's meant to be an evaluation/reference platform for integrators who want to build their own board around the SoC, not something you would buy to use for its own sake.


But maybe they could make it more Pi-like, that is, price it so that it interests both integrators and hobbyists, with a bootloader that makes it easy to tinker with the OS (not necessarily the firmware).


I am sure that if the RPi Foundation can do it, Qualcomm can as well.


Broadcom is the counterpart to Qualcomm in that comparison, and those two have similar attitudes towards the hobbyist/enthusiast market - they don't care in the slightest. It took an entity outside of Broadcom which nonetheless had deep connections to them (Eben Upton was there prior to starting RPi) to broker a compromise where the Pi could happen, and even then Broadcom kept most of the documentation tied up in NDAs and the bare SoCs unavailable for sale to the general public.

The Raspberry Pi is an anomaly that's unlikely to be replicated.


And this is where they fail - the people who get their boards have almost zero interest in upstreaming whatever hardware enablement they do to make the boards sort of work for their specific use cases.

I really don't think undercutting their evaluation board business with affordable SBCs would hurt their bottom line.


The people who buy those eval boards probably aren't even allowed to upstream whatever they do with it, they'll have signed an NDA with Qualcomm to get access to the documentation.


That's a good point. Perhaps even something like the NUC form-factor would work well.


The SoC alone is $160+.


I didn't say the SBC needs to be ridiculously cheap as well, just to be competitive with others.


Intel Core i3 is to Apple M3, i5 is to M3 Pro, i7 is to M3 Max, and i9 is to M3 Ultra.

If you think an i3 is "meh", you know nothing about computers. For the vast majority of users including gamers, an i3 is overkill.


I don't know which benchmarks you are using, but if you look at something like this:

https://www.cpubenchmark.net/laptop.html

you can see there is no i3 outperforming a base M3, and while some i7s and i9s (and Ryzens) outperform even the top M3, that's only because of core count (at the expense of increased TDP and shorter battery life).


I'm talking about their tiers as merchandise: i3 is to Ryzen 3 is to M* (e.g. M1, M2, etc.). They are all the lowest tier of their respective premium class of merchandise.

And regardless of benchmarks, an i3 is overkill for any workload the average person will ask of it. You do realize an i3 has four hyperthreaded cores these days?



