Jet fuel has no lead; it's basically kerosene. Avgas for reciprocating engines (basically only small general-aviation planes and helicopters) currently contains lead, but it is moving to be lead-free in the US.
I think this is a bit of a pedantic take. My reading of the GP is that it's distinguishing systems based on chemical batteries from systems based on gravitational batteries, rather than commenting on the reaction time of the chemical cells specifically.
Also have to take into account that that number was at ground level, where air density is highest. I'm not sure what the landing and takeoff profiles looked like in the design, but presumably the wings are that size so it can operate at a higher altitude at a reasonable speed.
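For a feel of the numbers, here's the lift equation with completely made-up aircraft parameters and standard-atmosphere densities (nothing below is specific to the actual design in question):

    /* Back-of-the-envelope only: level-flight speed from
     * L = 0.5 * rho * v^2 * S * CL  =>  v = sqrt(2 * W / (rho * S * CL)).
     * Mass, wing area and CL are placeholders, not the actual aircraft;
     * the densities are International Standard Atmosphere values. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        const double g = 9.81;
        const double mass_kg = 750.0;       /* placeholder aircraft mass    */
        const double wing_area_m2 = 20.0;   /* placeholder wing area        */
        const double cl_cruise = 0.5;       /* placeholder lift coefficient */
        const double weight_n = mass_kg * g;

        const double rho_sea_level = 1.225; /* kg/m^3, ISA sea level */
        const double rho_3000m = 0.909;     /* kg/m^3, ISA ~3000 m   */

        double v_sl = sqrt(2.0 * weight_n / (rho_sea_level * wing_area_m2 * cl_cruise));
        double v_hi = sqrt(2.0 * weight_n / (rho_3000m * wing_area_m2 * cl_cruise));

        printf("sea level: %.1f m/s   3000 m: %.1f m/s\n", v_sl, v_hi);
        return 0;
    }

Same wing, same lift coefficient: you either fly faster up high or you size the wing bigger so the speeds stay reasonable.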
Wouldn't that support the idea that most energy consumption is in decoding? If you're getting 2x, 4x, 8x etc. as much value computation per instruction and yet only a 30% increase in power, then clearly most of the power is not used by computing the values.
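A quick back-of-the-envelope using those figures (the +30% power number is the one from above, assumed rather than measured):

    /* Toy arithmetic: if an 8-wide vector instruction does 8x the work for
     * ~1.3x the power, energy per value drops to ~16% of scalar -- which is
     * hard to explain unless most per-instruction energy goes to fetch,
     * decode and scheduling overhead rather than to the ALU work itself. */
    #include <stdio.h>

    int main(void) {
        const double vector_power = 1.3;   /* assumed: +30% power when vectorised */
        const int widths[] = {2, 4, 8};

        printf("scalar: energy/value = 100%%\n");
        for (int i = 0; i < 3; i++)
            printf("%d-wide: energy/value = %.0f%% of scalar\n",
                   widths[i], 100.0 * vector_power / widths[i]);
        return 0;
    }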
No, because there's a lot the CPU does that is neither decoding nor execution. There are also caches, register renaming, branch prediction, inter-core communication for atomics, and a dozen other things.
Sorry, poor terminology on my part. I mean more broadly that most energy is used on the frontend and middle end rather than the backend, and that this is what vectorisation improves with regard to energy consumption. Register renaming and branch prediction energy consumption should improve by the same factor as decoding. Caching probably less so (depending on whether we're talking about instruction, data or combined caches).
I don't think inter-core communication is too relevant when comparing vectorised and non-vectorised code on a single core, but it definitely would be when batching across multiple cores.
It's true this is as good as resistive heating in terms of heating efficiency, but it is still far less efficient than a heat pump. Unless you are in an area with a large excess of renewable energy, it probably doesn't make sense environmentally to use an ASIC heater.
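Rough numbers, assuming a ballpark heat pump COP of 3 (it varies a lot with outdoor temperature):

    /* An ASIC/resistive heater delivers ~1 unit of heat per unit of
     * electricity (COP ~= 1); a heat pump typically delivers around 3.
     * The COP of 3 here is an assumption, not a measurement. */
    #include <stdio.h>

    int main(void) {
        const double heat_needed_kwh = 1000.0; /* e.g. a month of heating */
        const double cop_asic = 1.0;           /* same as plain resistive */
        const double cop_heat_pump = 3.0;      /* ballpark assumption     */

        printf("ASIC/resistive heater: %.0f kWh of electricity\n",
               heat_needed_kwh / cop_asic);
        printf("heat pump:             %.0f kWh of electricity\n",
               heat_needed_kwh / cop_heat_pump);
        return 0;
    }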
One might use a smaller version as a space heater, rather than for primary heating. In that context it isn't really competing with heat pumps; space heaters are usually resistive.
It looks like the same company still makes some products that leverage waste heat from computation (https://qalway.com/fr), just not ones that specifically mention crypto as the source of the computation.
For some two-way wireless protocols (like WiFi) you have to take into account the guard interval, slot times and interframe spacing, which are all fixed time values (~1-50 us). For long-distance transmissions, your speed-of-light-limited signal propagation time can exceed these values.
In terms of size, usually guard interval < slot time < interframe space. If propagation delay exceeds the guard interval AND you have a channel with lots of echo, any communication will be difficult. If propagation delay exceeds the slot time, then coordination between more than two devices will be difficult (high retries / low throughput). If propagation delay exceeds the interframe spacing, a two-way WiFi connection will not be possible, as both stations will think every frame timed out waiting for an ACK.
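A rough sketch of how the distances stack up against those timers; the guard interval, slot time and SIFS below are typical OFDM values used as assumptions here, not a spec reference:

    /* One-way propagation delay vs. common 802.11 timing constants.
     * Exact numbers differ per PHY and band. */
    #include <stdio.h>

    int main(void) {
        const double c = 299792458.0;          /* m/s                     */
        const double guard_interval_us = 0.8;  /* "normal" GI, 802.11n/ac */
        const double slot_time_us = 9.0;       /* short slot              */
        const double sifs_us = 16.0;           /* 5 GHz OFDM SIFS         */

        const double distances_km[] = {1, 5, 20, 50};
        for (int i = 0; i < 4; i++) {
            double delay_us = distances_km[i] * 1000.0 / c * 1e6;
            printf("%4.0f km: one-way delay %6.1f us (GI %s, slot %s, SIFS %s)\n",
                   distances_km[i], delay_us,
                   delay_us > guard_interval_us ? "exceeded" : "ok",
                   delay_us > slot_time_us      ? "exceeded" : "ok",
                   delay_us > sifs_us           ? "exceeded" : "ok");
        }
        return 0;
    }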
Presumably two different instances of this clock would have different tick rates due to the specific crystal used, as it's very difficult to produce two identical crystals.
An atomic clock is very consistent across devices, as it exploits the properties of an element, which is much more repeatable (you just have to have a quantity of the element).
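To put rough numbers on it, assuming a typical +/-20 ppm crystal tolerance:

    /* Two clocks that are each within +/-20 ppm of nominal (an assumed,
     * fairly typical cheap-crystal tolerance) can disagree by several
     * seconds per day in the worst case. */
    #include <stdio.h>

    int main(void) {
        const double ppm = 20.0;            /* assumed crystal tolerance */
        const double seconds_per_day = 86400.0;
        double drift = seconds_per_day * ppm / 1e6;

        printf("one clock vs. true time:   up to %.2f s/day\n", drift);
        printf("two clocks vs. each other: up to %.2f s/day (worst case)\n",
               2.0 * drift);
        return 0;
    }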
Reminds me of the Anne McCaffrey Crystal Singer series, where crystals were mined for use in their space-faring technology. Each crystal was unique and would be tuned to a particular use.
There are separate micro-op caches per core; however, they are typically shared between the hyperthreads on that core. I wonder if this could be another good reason for cloud vendors to move away from 1 vCPU = 1 hyperthread to 1 vCPU = 1 core for x86 when sharing machines (not that there weren't enough good reasons already).
One sneaky thing I've noticed them doing is slowly switching their licensing over to 1 vCPU = 1 CPU, even though you're now only getting one hyperthread instead of one core.
For Microsoft, this means that they've literally doubled their software licensing revenue relative to the hardware it is licensed to.
This kind of perverse incentive worries me a lot, because while I like the technical concepts the public cloud enables, like infrastructure-as-code, I feel like greed will eventually destroy what they've built and we'll all be back to square one.
Ask your cloud sales representative these questions next time you have coffee with them:
- What incentive do you have to make your logging formats efficient, if you charge by the gigabyte ingested?
- If your customers are forced to "scale out" to compensate for a platform inefficiency, what incentive do you have to fix the underlying issue?
- What incentive do you have to make network flows take direct paths if you charge for cross-zone traffic? Or to put it another way: Why does the load balancer team refuse to implement same-zone preference as a default?
Etc...
Once you start looking at the cloud like this, you suddenly realise why there are so many user voice feedback posts with thousands of upvotes where the vendor responds with "willnotfix" or just radio silence.
Cloud vendors probably use a hypervisor that schedules the VM time slices in a way that hyperthread siblings are only ever co-occupied by the same guest.
ARM vendors must be feeling pretty good about themselves, yeah, but if you take AMD's cores... SMT might not be a huge win in every benchmark, but you just can't keep that wide backend fed from a single hyperthread (at least I can't!).
So turning SMT off is, at the least, wasted potential for those cores, given the way they've been designed.
Probably because they have an 8-wide decoder and a massive reorder buffer, so they can actually keep the backend fed.
The problem with x86 is that decoding is hell and requires increasingly large transistor counts to parallelize, so you end up with a bottleneck there. ARM doesn't have that problem.
Variable-length, overlapping instructions have made x86 instruction decoding intractable. The obvious answer is to make it tractable; the unobvious answer is how to do that and hopefully remain backward compatible.
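A toy illustration of the serial dependency (the encoding here is made up; real x86 length decoding is far messier):

    /* With variable-length encodings, the start of instruction i+1 is only
     * known once the length of instruction i has been determined, so finding
     * boundaries is a serial chain -- unless you speculatively decode at many
     * byte offsets in parallel and throw most of that work away, which is
     * where the transistor/power cost comes from. */
    #include <stddef.h>
    #include <stdio.h>

    /* made-up toy encoding: low two bits of the first byte give length 1-4 */
    static size_t insn_length(const unsigned char *p) {
        return (size_t)(p[0] & 0x3) + 1;
    }

    int main(void) {
        unsigned char code[] = {0x02, 0xAA, 0xBB, 0x10, 0x22, 0xCC, 0xDD};
        size_t offset = 0, count = 0;
        while (offset < sizeof code) {
            size_t len = insn_length(code + offset); /* depends on previous step */
            printf("insn %zu at offset %zu, length %zu\n", count++, offset, len);
            offset += len;  /* cannot start finding the next one before this */
        }
        return 0;
    }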
The performance claims are true for all the worst reasons.
Let's say you can queue up 100 instructions. This yields the following:
1 port 100% of the time
2 ports 60% of the time
3 ports 30% of the time
4 ports 10% of the time
5 ports 2% of the time
Increasing the buffer to 200 instructions yields the following:
2 ports 80% of the time
3 ports 40% of the time
4 ports 15% of the time
5 ports 4% of the time
As in that made-up example, doubling the window you can inspect doesn't double performance. You really want those extra ports because they offer a few percentage points of IPC uptick, but the cost is too high. So you keep increasing the window size until the extra ports become viable. As an aside, AMD's Cayman switched from VLIW5 to VLIW4 because the fifth port was mostly unused. A few applications suffered from the slightly lower theoretical performance, but using that space for more VLIW4 units (along with other changes) meant that for most things the overall performance went up.
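In the same spirit, here's a toy simulation of port utilisation vs. window size. The dependency and latency numbers are invented purely to show the diminishing-returns shape, not to model any real core:

    /* Toy out-of-order model: each fake instruction depends on one random
     * earlier instruction, 10% of them are long-latency "cache misses", and
     * retirement is in order.  We count how many of the NPORTS issue ports
     * get used per cycle for two reorder-window sizes. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N         200000
    #define NPORTS    5
    #define DEP_RANGE 60
    #define MISS_LAT  30

    static int dep[N], lat[N], done[N]; /* done[i]: cycle result is ready, -1 = not issued */

    static void simulate(int window) {
        int hist[NPORTS + 1] = {0};

        for (int i = 0; i < N; i++) {
            dep[i] = i - (1 + rand() % DEP_RANGE);       /* may be < 0: no dependency */
            lat[i] = (rand() % 10 == 0) ? MISS_LAT : 1;  /* 10% "cache misses"        */
            done[i] = -1;
        }

        int head = 0, cycle = 0;
        while (head < N) {
            cycle++;
            int issued = 0;
            int limit = head + window < N ? head + window : N;
            for (int i = head; i < limit && issued < NPORTS; i++) {
                if (done[i] != -1) continue;                 /* already issued */
                int d = dep[i];
                if (d < 0 || (done[d] != -1 && done[d] <= cycle)) {
                    done[i] = cycle + lat[i];                /* issue it */
                    issued++;
                }
            }
            hist[issued]++;
            /* in-order retirement frees window slots only up to the oldest
             * instruction whose result is not ready yet */
            while (head < N && done[head] != -1 && done[head] <= cycle)
                head++;
        }

        printf("window %d: %.2f instructions/cycle\n", window, (double)N / cycle);
        for (int p = 0; p <= NPORTS; p++)
            printf("  %d port(s) busy: %4.1f%% of cycles\n", p, 100.0 * hist[p] / cycle);
    }

    int main(void) {
        srand(42);
        simulate(100);
        srand(42);       /* identical instruction stream for a fair comparison */
        simulate(200);
        return 0;
    }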
Now comes the x86 fly in the ointment -- decoder width gives rapidly diminishing returns (I believe an AMD exec mentioned 4 was the hard limit to keep power consumption under control). This limits the size of the reorder buffer that you can keep filled. Since you have a maximum instruction window size, you have a hard port limit.
So you add a second thread. Sure, it requires its own entire frontend and register set, but in exchange you get a ton more opportunities to use those other ports. There are tradeoffs with the complexity and extra units required for SMT, but that's beyond our scope.
As you can see, SMT performance is DIRECTLY related to how inefficiently the main thread uses the execution resources. In less interdependent code, SMT gains are smaller because it's easier to find uses for those extra ports from the main thread alone.
Now, let's consider the M1 and one reason why it doesn't have SMT. Going 5, 6, or even 8-wide on the decoders is trivial compared to x86. Apple's M1 (and even the upcoming Neoverse V1 and N2) have wider decode. This in turn feeds a much larger buffer, which can in turn extract more parallelism from the thread (this seems to take about as many transistors as the extra frontend stuff needed to implement SMT). Because they can keep most of their ports fed with just one thread, there's no need for the complexity of SMT.
IBM POWER does show a different side of SMT, though. They go with 8-way SMT. This isn't because they have that many ports; it's so they can hide latency in their supercomputers. It's kind of like MIMT (multiple instruction, multiple thread) in modern GPUs, but even more flexible. It helps ensure that even when several threads are waiting for data, there's still another thread that can be executing.
The memory latency hiding also works with 2-way SMT. I worked on networking software doing per-packet session lookup in large hash tables. SMT on a Sandy Bridge core in this application gave 40% better performance, which is higher than usually mentioned. So for memory-bound (as in cache-missing) applications, SMT is a boon.
The CPU in this case is a Threadripper 3970X: 32 cores, 64 SMT threads.
My experience is this: when the L3 cache is effective, the memory latency hiding via prefetch works well across SMT threads. If the hashtable load requires a chain walk, the SMT latency hiding is less effective, because the calculated prefetch location is not the actual hit. As the load increased, I couldn't get prefetching multiple slots to be as effective as prefetching a single slot.
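For anyone curious, roughly the prefetch pattern being described, with made-up names and a toy hash (not the actual code; __builtin_prefetch is GCC/Clang-specific):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define TABLE_BITS 16
    #define TABLE_SIZE (1u << TABLE_BITS)

    struct bucket { uint64_t key; uint64_t session_data; };

    static struct bucket table[TABLE_SIZE];

    /* toy hash, a stand-in for whatever the real session hash was */
    static inline uint32_t slot_of(uint64_t key) {
        return (uint32_t)((key * 0x9e3779b97f4a7c15ull) >> (64 - TABLE_BITS));
    }

    /* Batched lookup: prefetch the slot for the *next* key while working on
     * the current one, so the likely cache miss overlaps useful work.  A
     * chained/secondary probe would land on a line that was not prefetched,
     * which is where the benefit shrinks, as described above. */
    static uint64_t lookup_batch(const uint64_t *keys, size_t n) {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 1 < n)
                __builtin_prefetch(&table[slot_of(keys[i + 1])]);
            struct bucket *b = &table[slot_of(keys[i])];
            if (b->key == keys[i])
                sum += b->session_data;   /* stand-in for real per-packet work */
        }
        return sum;
    }

    int main(void) {
        uint64_t keys[1024];
        for (size_t i = 0; i < 1024; i++) {        /* populate the toy table */
            keys[i] = i * 7919 + 1;
            struct bucket *b = &table[slot_of(keys[i])];
            b->key = keys[i];
            b->session_data = i;
        }
        printf("checksum: %llu\n", (unsigned long long)lookup_batch(keys, 1024));
        return 0;
    }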
I tested this some years ago on a raytracer, and got a tad over 50% more speed when enabling HT compared to disabling it.
As you say, the ray tracer did a lot of cache missing, interspersed with a fair bit of calculation. I'm guessing this is close to the ideal workload, as far as non-synthetic benchmarks go.
4-way and 8-way SMT is about latency hiding (like MIMT in GPUs, but more flexible). It increases the probability that at least one thread has data it can be crunching.
Because the cloud is designed around people uploading binaries to your machine -- it is a basic principle of how services are allocated. When you go to AWS and spin up an EC2 instance, you don't get a machine to yourself. You get a VM running alongside many other people's VMs on some arbitrary server in one of their data centers.
Doesn't that make it even harder to do any sort of targeted attack on anything? From what I understand, these side-channel attacks depend on being able to predict the addresses you'll read from, having an idea of what you're after, and a stable environment in which enough timing information can be collected; any small change in the environment means you could start reading something completely different without even knowing it. A CPU that could be running literally who-knows-what at any time seems like it wouldn't let you collect much in the way of coherent data, and of course the VM you're doing it from could itself be moving uncontrollably across CPUs.
It will be interesting to see if the next waves of WiFi products solve this. WiFi 6 and HaLow have a new feature called 'target wake time' that will let these devices sleep (and not pollute spectrum) for longer.
WiFi 6 also brings OFDMA, which will let stations use much less of the channel at a time (instead of a 20 MHz+ chunk, they can use just 2 MHz while other stations use the rest). 2.4 GHz being stuck on old WiFi 4 (or worse) devices hasn't helped the situation.