Hacker News
The Evolution of the QEMU Translator (linaro.org)
80 points by peter_d_sherman on Jan 29, 2021 | 34 comments


In 2017 at ELCE in Prague I saw an interesting talk by Alexander Graf (QEMU & KVM developer at SuSE) about embedding QEMU's Tiny Code Generator (TCG) into EDK2 (the UEFI reference implementation) to run x86_64 option ROM blobs on ARM64 servers[1].

The main idea as I understood/remember it was roughly this:

UEFI offers an abstract hardware interface to the OS, but to actually talk to the hardware it also needs drivers. To solve this issue, UEFI defines a driver interface and loads driver blobs directly from a ROM on the device itself. However, if you plug an off-the-shelf PCI card into an ARM server, this doesn't work, because the driver blob is x86 code. For Linux this doesn't matter, since it has its own drivers, but e.g. GRUB doesn't, and if the PCI device in question is a graphics card, you are flying blind until the kernel is up.

So they patched their version of EDK2 to map the option ROM blobs as non-executable; when UEFI or the bootloader tries to call into the blob, a page fault handler they installed traps the call, extracts the call arguments, and runs the blob through TCG. They use the same trick to catch calls from the ROM blob back into their ARM UEFI binary: extract the arguments again, do the call, convert the return values, and jump back.

It sounds like a crazy hack, but it apparently worked well enough that they could plug an Nvidia card into an ARM server and it would display the GRUB boot splash screen and early printk messages.

[1] https://osseu17.sched.com/event/ByIv/qemu-in-uefi-alexander-...


Fun fact: UEFI has a custom bytecode language, EBC, for writing architecture-independent drivers: https://en.wikipedia.org/wiki/Unified_Extensible_Firmware_In...


OpenFirmware used Forth for this.


Which is dead.


Xorg does something similar for VESA Bios Extensions: https://cgit.freedesktop.org/xorg/xserver/tree/hw/xfree86/in...


Rob Landley was at one point working on an interesting project to combine a fork of the Tiny C Compiler (tcc) with QEMU's Tiny Code Generator (TCG) to create a compact, BSD-licensed drop-in replacement for gcc and binutils that would work on all the platforms supported by TCG:

https://landley.net/qcc/

http://landley.net/hg/qcc/file/tip/todo/todo.txt

It seems like a pretty cool idea, so I'm hopeful that eventually he'll have the chance to hack on it and get it working.


This is a really cool idea. At least it would serve as an inspirational or educational project. I would be willing to support this in some capacity. Would anyone else be interested?

Maybe set up a chat and see what happens?


It’ll be very interesting to see these efforts to make QEMU faster. Until now it seems like flexibility and compatibility were the main goals, with performance a much lower priority, but with a good optimizing JIT it might become reasonably competitive.


https://en.wikipedia.org/wiki/Loongson#Hardware-assisted_x86...

"With added improvements in QEMU from ICT, Loongson-3 achieves an average of 70% the performance of executing native binaries when running x86 binaries from nine benchmarks."

Loongson is MIPS-based and executes x86 binaries at 70% of native speed, using QEMU.

QEMU is not unperformant.


That quote literally comes from a section titled "Hardware-assisted x86 emulation", so I don't think it would necessarily be fair to attribute all the performance to QEMU. FWIW, getting results that good almost always requires hardware MMU support, which is usually the largest bottleneck in these kinds of things. Without that you'd be lucky to get within the same order of magnitude.

Furthermore, having worked with QEMU in several binary translation and emulation projects, I cannot say that it is designed with performance as its first priority: TCG's number one consideration seems to be portability and ease of supporting new architectures. If you look at TCG-generated code it's pretty "stupid"; almost no optimizations are applied at all. This isn't necessarily something I fault QEMU for; it's just clearly not a priority. Or hasn't been, I guess; but it seems like this might be changing and I'm very interested to see where it goes.


TCG is actually the 2nd-generation backend. The 1st generation was much worse, but it was incredibly elegant: the operations were just written as functions in C, as you might for an interpreter - and then memcpy’d one after the other.

So you get a compiler from one architecture to another that is as easy to write as an interpreter (but without the interpreter overhead) and is independent of the target architecture - and it just works. I remember being blown away by the elegance.


You might want to take a look at partial evaluation, which is a similarly elegant technique.

https://en.wikipedia.org/wiki/Partial_evaluation

https://labs.oracle.com/pls/apex/f?p=LABS:0::APPLICATION_PRO...


Of course, hardware helps a lot. But I doubt that going from a typical 5%-10% of native performance to 70% (a 7-14x speedup) can be attributed only to hardware assistance. MIPS is way too different from x86.


5%-10% is typical performance for a pure interpreter; a good JIT can hit 30%-50%. After that the main issue is going to be the MMU that I mentioned, and having hardware support there gets you to nearly native.


The "good JIT" available at the time was TCG. Does that count as good for you?

MIPS needs about 20 to 40 cycles for page fault handling. For a 500MHz MIPS (a Verilog/VHDL implementation synthesized to 90nm - yes, I am that old) that translates to a latency of 40ns to 80ns. [1] shows that typical x86 page fault latency is ~7.8us, orders of magnitude more. [2] (best I've found, sorry) speculates that the hardware assistance computes x86 status flags and allows 80-bit long double operations natively.

[1] https://makedist.com/posts/2016/10/10/measuring-userfaultfd-...

[2] https://news.ycombinator.com/item?id=15543718

Excuse my rant, but MIPS has just one mistake - the branch delay slot. It complicates implementation immensely. Other than that, it is a fascinating architecture. They claimed that their CPU could be programmed with regular instructions as effectively as others do with microcode, so MIPS does not need microcode - and my experience has shown that.


What if QEMU ends up faster than native? I could imagine a future where programs are built unoptimized or with `-Os` for download size, and optimized by the operating system before or during execution, with x86 or arm (or risc-v!) ending up as the default "portable executable instruction format" for a whole bunch of CPU architectures. Where the first thing the OS or even the BIOS loads is qemu...


Not gonna happen, so it's a bit irrelevant. QEMU's architecture is fairly generic and it was never built for speed in the first place. Improving the speed of emulation is more about getting better than the current baseline of "10x slower than native, or worse, and continuing to gradually get slower as we add features unless we actively work on performance". If we got to "3x slower than native" I would be really surprised.

Which isn't to say that you can't do better than 3x-slower under any circumstances, just that if you wanted performance you'd probably be better off starting from scratch with an emulator design that cared about performance and which was really clear about its use-cases -- eg "this is user-space only, not system emulation" and "this is only this very small set of host and guest architectures". QEMU does a lot of different things in one codebase, which makes it cumbersome to change anything and hard to put in optimisations or simplifications which might be valid for a specific host/target combination but not more widely.


> if you wanted performance you'd probably be better off starting from scratch with an emulator design that cared about performance and which was really clear about its use-cases

See box86:

https://github.com/ptitSeb/box86


AIUI that is mainly achieved by running the library functions natively. Given how much time you spend in library functions, I'm not surprised it has an edge over full translation of the application.


The GCC Runtime Library Exception might forbid that. That's one of the reasons why Apple and Google blew a gasket when GNU changed licenses and then poured billions into Clang. They wanted to be able to recompile your app store binaries on their backend to optimize for any hardware changes they made.

One workaround for that restriction is to use Cosmopolitan Libc: https://justine.lol/cosmopolitan/index.html It bundles compiler runtime libraries that are fast and permissively licensed. It lets you keep using GCC without facing restrictions on things like code morphing.


HP Dynamo is similar to what you suggest:

https://www.hpl.hp.com/techreports/1999/HPL-1999-78.html

Note that CPUs and compiler optimization have improved over the last 20 years and these results may not still hold.


This sounds much like the pitch behind the JVM.


While true of course, this more directly sounds like the OG pitch behind the Low-Level Virtual Machine (aka, LLVM).


Isn’t that essentially what WASM is?

It’d be interesting to see the performance of QEMU if it used WASM as an intermediate representation and then fed that into a WASM engine, so it could take advantage of all the browser’s optimizations.


I keep confusing QEMU with the extended memory manager thing I used in the '80s, which is https://en.m.wikipedia.org/wiki/QEMM


I remember that, and DESQview, which QEMM was made for and then marketed separately. DESQview was quite impressive, being able to fully run most DOS programs concurrently 3 years before MS Windows/286, Windows/386, or IBM OS/2 v1.2.


what if I have an AST 6-pack plus?


Not an expert on QEMU, but apart from being a very useful tool, it also seems very complex.

Perhaps not Linux-complex, but certainly close. So it's somewhat impressive that it has come so far when the developer docs state: QEMU does not have a high level design description document - only the source code tells the full story.

But it can run Windows 10 on Mac :) https://forums.macrumors.com/threads/success-virtualize-wind...


If you want to try the "new kid in town" x86_64 emulator that's ~200kb in size and has a simple readable easily hackable codebase, although it isn't as fast as qemu, then please give Blinkenlights a try! https://justine.lol/blinkenlights/index.html It was built for the purpose of testing Cosmopolitan Libc https://justine.lol/cosmopolitan/index.html How readable is the source code to Blinkenlights? See for yourself. Here's the ALU code: https://github.com/jart/cosmopolitan/blob/master/tool/build/...


As with Linux, a lot of the lines-of-code turns out to be device-related: for the kernel it's device drivers, and for QEMU it's models of devices. Also, just because we don't have a high level design document doesn't mean we don't have a high level design -- it's just that it's inside the heads of the developers rather than on paper :-)


It might be useful to note that this is running through virtualization, not TCG :)


Even though I bet you could do it through TCG?


Sure, but it would be much slower.


yes



