Jet fuel has no lead; it's basically kerosene. Avgas for reciprocating engines (basically only small general-aviation planes and helicopters) currently contains lead, but it is moving to be lead-free in the US.
I think this is a bit of a pedantic take. My reading of the GP is that it's distinguishing systems based on chemical batteries from systems based on gravitational batteries, rather than commenting on the reaction time of the chemical cells specifically.
Also have to take into account that that number was at ground level, where air density is highest. I'm not sure what the landing and takeoff profiles looked like in the design, but presumably the wings are that size so it can operate at a higher altitude at a reasonable speed.
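For a feel of the numbers, here's the lift equation with completely made-up aircraft parameters and standard-atmosphere densities (nothing below is specific to the actual design in question):

    /* Back-of-the-envelope only: level-flight speed from
     * L = 0.5 * rho * v^2 * S * CL  =>  v = sqrt(2 * W / (rho * S * CL)).
     * Mass, wing area and CL are placeholders, not the actual aircraft;
     * the densities are International Standard Atmosphere values. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        const double g = 9.81;
        const double mass_kg = 750.0;       /* placeholder aircraft mass    */
        const double wing_area_m2 = 20.0;   /* placeholder wing area        */
        const double cl_cruise = 0.5;       /* placeholder lift coefficient */
        const double weight_n = mass_kg * g;

        const double rho_sea_level = 1.225; /* kg/m^3, ISA sea level */
        const double rho_3000m = 0.909;     /* kg/m^3, ISA ~3000 m   */

        double v_sl = sqrt(2.0 * weight_n / (rho_sea_level * wing_area_m2 * cl_cruise));
        double v_hi = sqrt(2.0 * weight_n / (rho_3000m * wing_area_m2 * cl_cruise));

        printf("sea level: %.1f m/s   3000 m: %.1f m/s\n", v_sl, v_hi);
        return 0;
    }

Same wing, same lift coefficient: you either fly faster up high or you size the wing bigger so the speeds stay reasonable.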
Wouldn't that support the idea that most energy consumption is in decoding? If you're getting 2x, 4x, 8x etc. as much value computation per instruction and yet only a 30% increase in power, then clearly most of the power is not used by computing the values.
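A quick back-of-the-envelope using those figures (the +30% power number is the one from above, assumed rather than measured):

    /* Toy arithmetic: if an 8-wide vector instruction does 8x the work for
     * ~1.3x the power, energy per value drops to ~16% of scalar -- which is
     * hard to explain unless most per-instruction energy goes to fetch,
     * decode and scheduling overhead rather than to the ALU work itself. */
    #include <stdio.h>

    int main(void) {
        const double vector_power = 1.3;   /* assumed: +30% power when vectorised */
        const int widths[] = {2, 4, 8};

        printf("scalar: energy/value = 100%%\n");
        for (int i = 0; i < 3; i++)
            printf("%d-wide: energy/value = %.0f%% of scalar\n",
                   widths[i], 100.0 * vector_power / widths[i]);
        return 0;
    }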
No, because there's a lot the CPU does that is neither decoding nor execution. There are also caches, register renaming, branch prediction, inter-core communication for atomics, and a dozen other things.
Sorry, poor terminology on my part. I mean more broadly that most energy is used on the frontend and middle end rather than the backend, and that this is what vectorisation improves with regard to energy consumption. Register renaming and branch prediction energy consumption should improve by the same factor as decoding. Caching probably less so (depending on whether we're talking about instruction, data or combined caches).
I don't think inter-core communication is too relevant when comparing vectorised and non-vectorised code on a single core, but it definitely would be when batching across multiple cores.
It's true this is as good as resistive heating in terms of heating efficiency, but it is still far less efficient than a heat pump. Unless you are in an area with a large excess of renewable energy, it probably doesn't make sense environmentally to use an ASIC heater.
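Rough numbers, assuming a ballpark heat pump COP of 3 (it varies a lot with outdoor temperature):

    /* An ASIC/resistive heater delivers ~1 unit of heat per unit of
     * electricity (COP ~= 1); a heat pump typically delivers around 3.
     * The COP of 3 here is an assumption, not a measurement. */
    #include <stdio.h>

    int main(void) {
        const double heat_needed_kwh = 1000.0; /* e.g. a month of heating */
        const double cop_asic = 1.0;           /* same as plain resistive */
        const double cop_heat_pump = 3.0;      /* ballpark assumption     */

        printf("ASIC/resistive heater: %.0f kWh of electricity\n",
               heat_needed_kwh / cop_asic);
        printf("heat pump:             %.0f kWh of electricity\n",
               heat_needed_kwh / cop_heat_pump);
        return 0;
    }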
One might use a smaller version as a space heater, rather than for primary heating. In that context it isn't really competing with heat pumps; space heaters are usually resistive.
It looks like the same company still makes some products that leverage waste heat from computation (https://qalway.com/fr), just not ones that specifically mention crypto as the source of the computation.
For some two-way wireless protocols (like WiFi) you have to take into account the guard interval, slot times and interframe spacing, which are all fixed time values (~1-50 us). For long-distance transmissions, your speed-of-light-limited signal propagation time can exceed these values.
In terms of size, usually guard interval < slot time < interframe space. If propagation delay exceeds the guard interval AND you have a channel with lots of echo, any communication will be difficult. If propagation delay exceeds the slot time, then coordination between more than two devices will be difficult (high retries / low throughput). If propagation delay exceeds the interframe spacing, a two-way WiFi connection will not be possible, as both stations will think every frame timed out waiting for an ACK.
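A rough sketch of how the distances stack up against those timers; the guard interval, slot time and SIFS below are typical OFDM values used as assumptions here, not a spec reference:

    /* One-way propagation delay vs. common 802.11 timing constants.
     * Exact numbers differ per PHY and band. */
    #include <stdio.h>

    int main(void) {
        const double c = 299792458.0;          /* m/s                     */
        const double guard_interval_us = 0.8;  /* "normal" GI, 802.11n/ac */
        const double slot_time_us = 9.0;       /* short slot              */
        const double sifs_us = 16.0;           /* 5 GHz OFDM SIFS         */

        const double distances_km[] = {1, 5, 20, 50};
        for (int i = 0; i < 4; i++) {
            double delay_us = distances_km[i] * 1000.0 / c * 1e6;
            printf("%4.0f km: one-way delay %6.1f us (GI %s, slot %s, SIFS %s)\n",
                   distances_km[i], delay_us,
                   delay_us > guard_interval_us ? "exceeded" : "ok",
                   delay_us > slot_time_us      ? "exceeded" : "ok",
                   delay_us > sifs_us           ? "exceeded" : "ok");
        }
        return 0;
    }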
Presumably two different instances of this clock would have different tick rates due to the specific crystal used, as it's very difficult to produce two identical crystals.
An atomic clock is very consistent across devices, as it exploits the properties of an element, which is much more repeatable (you just have to have a quantity of the element).
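To put rough numbers on it, assuming a typical +/-20 ppm crystal tolerance:

    /* Two clocks that are each within +/-20 ppm of nominal (an assumed,
     * fairly typical cheap-crystal tolerance) can disagree by several
     * seconds per day in the worst case. */
    #include <stdio.h>

    int main(void) {
        const double ppm = 20.0;            /* assumed crystal tolerance */
        const double seconds_per_day = 86400.0;
        double drift = seconds_per_day * ppm / 1e6;

        printf("one clock vs. true time:   up to %.2f s/day\n", drift);
        printf("two clocks vs. each other: up to %.2f s/day (worst case)\n",
               2.0 * drift);
        return 0;
    }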
Reminds me of the Anne McCaffrey Crystal Singer series, where crystals were mined for use in their space-faring technology. Each crystal was unique and would be tuned to a particular use.
There are separate micro-op caches per core; however, they are typically shared between the hyperthreads on that core. I wonder if this could be another good reason for cloud vendors to move away from 1 vCPU = 1 hyperthread to 1 vCPU = 1 core for x86 when sharing machines (not that there weren't enough good reasons already).
One sneaky thing I've noticed them doing is slowly switching their licensing over to 1 vCPU = 1 CPU, even though you're now only getting one hyperthread instead of one core.
For Microsoft, this means that they've literally doubled their software licensing revenue relative to the hardware it is licensed to.
This kind of perverse incentive worries me a lot, because while I like the technical concepts the public cloud enables, like infrastructure-as-code, I feel like greed will eventually destroy what they've built and we'll all be back to square one.
Ask your cloud sales representative these questions next time you have coffee with them:
- What incentive do you have to make your logging formats efficient, if you charge by the gigabyte ingested?
- If your customers are forced to "scale out" to compensate for a platform inefficiency, what incentive do you have to fix the underlying issue?
- What incentive do you have to make network flows take direct paths if you charge for cross-zone traffic? Or to put it another way: Why does the load balancer team refuse to implement same-zone preference as a default?
Etc...
Once you start looking at the cloud like this, you suddenly realise why there are so many user voice feedback posts with thousands of upvotes where the vendor responds with "willnotfix" or just radio silence.
Cloud vendors probably use a hypervisor that schedules the VM time slices in a way that hyperthread siblings are only ever co-occupied by the same guest.
ARM vendors must be feeling pretty good about themselves, yeah, but if you take AMD's cores... SMT might not be a huge win in every benchmark, but you just can't keep that wide backend fed from a single hyperthread (at least I can't!).
So turning SMT off is, at the least, wasted potential for those cores, given the way they've been designed.
Probably because they have an 8-wide decoder and a massive reorder buffer, so they can actually keep the backend fed.
The problem with x86 is that decoding is hell and requires increasingly large transistor counts to parallelize, so you end up with a bottleneck there. ARM doesn't have that problem.
Variable-length, overlapping instructions have made x86 instruction decoding intractable. The obvious answer is to make it tractable; the unobvious answer is how to do that and hopefully remain backward compatible.
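A toy illustration of the serial dependency (the encoding here is made up; real x86 length decoding is far messier):

    /* With variable-length encodings, the start of instruction i+1 is only
     * known once the length of instruction i has been determined, so finding
     * boundaries is a serial chain -- unless you speculatively decode at many
     * byte offsets in parallel and throw most of that work away, which is
     * where the transistor/power cost comes from. */
    #include <stddef.h>
    #include <stdio.h>

    /* made-up toy encoding: low two bits of the first byte give length 1-4 */
    static size_t insn_length(const unsigned char *p) {
        return (size_t)(p[0] & 0x3) + 1;
    }

    int main(void) {
        unsigned char code[] = {0x02, 0xAA, 0xBB, 0x10, 0x22, 0xCC, 0xDD};
        size_t offset = 0, count = 0;
        while (offset < sizeof code) {
            size_t len = insn_length(code + offset); /* depends on previous step */
            printf("insn %zu at offset %zu, length %zu\n", count++, offset, len);
            offset += len;  /* cannot start finding the next one before this */
        }
        return 0;
    }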
The performance claims are true for all the worst reasons.
Let's say you can queue up 100 instructions. This yields the following:
1 port 100% of the time
2 ports 60% of the time
3 ports 30% of the time
4 ports 10% of the time
5 ports 2% of the time
Increasing the buffer to 200 instructions yields the following:
2 ports 80% of the time
3 ports 40% of the time
4 ports 15% of the time
5 ports 4% of the time
As in that made-up example, doubling the window you can inspect doesn't double performance. You really want those extra ports because they offer a few percentage points of IPC uptick, but the cost is too high. So you keep increasing the window size until the extra ports become viable. As an aside, AMD's Cayman switched from VLIW5 to VLIW4 because the fifth port was mostly unused. A few applications suffered from the slightly lower theoretical performance, but using that space for more VLIW4 units (along with other changes) meant that for most things the overall performance went up.
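In the same spirit, here's a toy simulation of port utilisation vs. window size. The dependency and latency numbers are invented purely to show the diminishing-returns shape, not to model any real core:

    /* Toy out-of-order model: each fake instruction depends on one random
     * earlier instruction, 10% of them are long-latency "cache misses", and
     * retirement is in order.  We count how many of the NPORTS issue ports
     * get used per cycle for two reorder-window sizes. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N         200000
    #define NPORTS    5
    #define DEP_RANGE 60
    #define MISS_LAT  30

    static int dep[N], lat[N], done[N]; /* done[i]: cycle result is ready, -1 = not issued */

    static void simulate(int window) {
        int hist[NPORTS + 1] = {0};

        for (int i = 0; i < N; i++) {
            dep[i] = i - (1 + rand() % DEP_RANGE);       /* may be < 0: no dependency */
            lat[i] = (rand() % 10 == 0) ? MISS_LAT : 1;  /* 10% "cache misses"        */
            done[i] = -1;
        }

        int head = 0, cycle = 0;
        while (head < N) {
            cycle++;
            int issued = 0;
            int limit = head + window < N ? head + window : N;
            for (int i = head; i < limit && issued < NPORTS; i++) {
                if (done[i] != -1) continue;                 /* already issued */
                int d = dep[i];
                if (d < 0 || (done[d] != -1 && done[d] <= cycle)) {
                    done[i] = cycle + lat[i];                /* issue it */
                    issued++;
                }
            }
            hist[issued]++;
            /* in-order retirement frees window slots only up to the oldest
             * instruction whose result is not ready yet */
            while (head < N && done[head] != -1 && done[head] <= cycle)
                head++;
        }

        printf("window %d: %.2f instructions/cycle\n", window, (double)N / cycle);
        for (int p = 0; p <= NPORTS; p++)
            printf("  %d port(s) busy: %4.1f%% of cycles\n", p, 100.0 * hist[p] / cycle);
    }

    int main(void) {
        srand(42);
        simulate(100);
        srand(42);       /* identical instruction stream for a fair comparison */
        simulate(200);
        return 0;
    }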
Now comes the x86 fly in the ointment -- decoder width gives rapidly diminishing returns (I believe an AMD exec mentioned 4 was the hard limit to keep power consumption under control). This limits the size of the reorder buffer that you can keep filled. Since you have a maximum instruction window size, you have a hard port limit.
So you add a second thread. Sure, it requires its own entire frontend and register set, but in exchange you get a ton more opportunities to use those other ports. There are tradeoffs with the complexity and extra units required for SMT, but that's beyond our scope.
As you can see, SMT performance is DIRECTLY related to how inefficiently the main thread uses the execution resources. In less interdependent code, SMT gains are smaller because it's easier to find uses for those extra ports from the main thread alone.
Now, let's consider the M1 and one reason why it doesn't have SMT. Going 5, 6, or even 8-wide on the decoders is trivial compared to x86. Apple's M1 (and even the upcoming Neoverse V1 and N2) have wider decode. This in turn feeds a much larger buffer, which can in turn extract more parallelism from the thread (this seems to take about as many transistors as the extra frontend stuff needed to implement SMT). Because they can keep most of their ports fed with just one thread, there's no need for the complexity of SMT.
IBM POWER does show a different side of SMT, though. They go with 8-way SMT. This isn't because they have that many ports; it's so they can hide latency in their supercomputers. It's kind of like MIMT (multiple instruction, multiple thread) in modern GPUs, but even more flexible. It helps ensure that even when several threads are waiting for data, there's still another thread that can be executing.
The memory latency hiding also works with 2-way SMT. I worked on networking software doing per-packet session lookup in large hash tables. SMT on a Sandy Bridge core in this application gave 40% better performance, which is higher than usually mentioned. So for memory-bound (as in cache-missing) applications, SMT is a boon.
The CPU in this case is a Threadripper 3970X: 32 cores, 64 SMT threads.
My experience is this: when the L3 cache is effective, the memory latency hiding via prefetch works well across SMT threads. If the hashtable load requires a chain walk, the SMT latency hiding is less effective, because the calculated prefetch location is not the actual hit. As the load increased, I couldn't get prefetching multiple slots to be as effective as prefetching a single slot.
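For anyone curious, roughly the prefetch pattern being described, with made-up names and a toy hash (not the actual code; __builtin_prefetch is GCC/Clang-specific):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define TABLE_BITS 16
    #define TABLE_SIZE (1u << TABLE_BITS)

    struct bucket { uint64_t key; uint64_t session_data; };

    static struct bucket table[TABLE_SIZE];

    /* toy hash, a stand-in for whatever the real session hash was */
    static inline uint32_t slot_of(uint64_t key) {
        return (uint32_t)((key * 0x9e3779b97f4a7c15ull) >> (64 - TABLE_BITS));
    }

    /* Batched lookup: prefetch the slot for the *next* key while working on
     * the current one, so the likely cache miss overlaps useful work.  A
     * chained/secondary probe would land on a line that was not prefetched,
     * which is where the benefit shrinks, as described above. */
    static uint64_t lookup_batch(const uint64_t *keys, size_t n) {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 1 < n)
                __builtin_prefetch(&table[slot_of(keys[i + 1])]);
            struct bucket *b = &table[slot_of(keys[i])];
            if (b->key == keys[i])
                sum += b->session_data;   /* stand-in for real per-packet work */
        }
        return sum;
    }

    int main(void) {
        uint64_t keys[1024];
        for (size_t i = 0; i < 1024; i++) {        /* populate the toy table */
            keys[i] = i * 7919 + 1;
            struct bucket *b = &table[slot_of(keys[i])];
            b->key = keys[i];
            b->session_data = i;
        }
        printf("checksum: %llu\n", (unsigned long long)lookup_batch(keys, 1024));
        return 0;
    }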
I tested this some years ago on a raytracer, and got a tad over 50% more speed when enabling HT compared to disabling it.
As you say, the ray tracer did a lot of cache missing, interspersed with a fair bit of calculation. I'm guessing this is close to the ideal workload, as far as non-synthetic benchmarks go.
4-way and 8-way SMT is about latency hiding (like MIMT in GPUs, but more flexible). It increases the probability that at least one thread has data it can be crunching.
Because the cloud is designed around people uploading binaries to your machine -- it is a basic principle of how services are allocated. When you go to AWS and spin up an EC2 instance, you don't get a machine to yourself. You get a VM running alongside many other people's VMs on some arbitrary server in one of their data centers.
Doesn't that make it even harder to do any sort of targeted attack on anything? From what I understand, these side-channel attacks depend on being able to predict the addresses you'll read from, having an idea of what you're after, and a stable environment in which enough timing information can be collected; any small change in the environment means you could start reading something completely different without even knowing it. A CPU that could be running literally who-knows-what at any time seems like it wouldn't let you collect much in the way of coherent data, and of course the VM you're doing it from could itself be moving uncontrollably across CPUs.
It will be interesting to see if the next waves of WiFi products solve this. WiFi 6 and HaLow have a new feature called 'target wake time' that will let these devices sleep (and not pollute spectrum) for longer.
WiFi 6 also brings OFDMA, which will let stations use much less of the channel at a time (instead of a 20 MHz+ chunk, they can use just 2 MHz while other stations use the rest). 2.4 GHz being stuck on old WiFi 4 (or worse) devices hasn't helped the situation.