
Posits are actually quite reasonable, but there's a lot of either ignorance or disingenuousness in this article, which is really too bad. I wish that John would ditch the hyperbole and solicit feedback from other experts, because posits are not a bad idea, but the presentation continues to give him the trappings of a crank.

I'll unpack just the first example that jumped out at me:

> Currently, half-precision (16-bit) IEEE floats are often used for this purpose, but 8-bit posits have the potential to be 2–4× faster. An important function for neural network training is a sigmoid function, a function f(x) that is asymptotically 0 as x → −∞ and asymptotically 1 as x → ∞. A common sigmoid function is 1/(1 + e^(−x)) which is expensive to compute, easily requiring over a hundred clock cycles because of the math library call to evaluate exp(x), and because of the divide.

"have the potential to be" 2-4x faster? Sure. But until we see an implementation, 1-2x is much more likely (closer to the 1x end of the spectrum). Commodity hardware runs 32b IEEE float multiplications with 3-4 cycles latency and single-cycle throughput (or better). 16b can be made faster if designers care to. There's simply not much "faster" available for posits to inhabit (2-4x faster than 16b float would be faster than small integer arithmetic).

Evaluating a sigmoid function requires "over a hundred clock cycles" only in the most naive possible implementation sitting on top of a lousy math library. Using 32b floats on a current-generation phone or laptop with a decent math library, a naive scalar implementation of the sigmoid function has a latency of less than 50 cycles. But latency doesn't matter at all in a machine-learning context; we're interested only in throughput. On a machine with AVX-512, a single core can evaluate a sigmoid function with a throughput of about 1.25 cycles/input in full-precision 32b floating point (i.e. a relative error of ~10^-7). John's proposed posit implementation has a relative error of about 10^-1. If we target that error threshold, we can trivially go below 1 cycle/input in 32b or 16b float on a phone. So IEEE floats are at least two orders of magnitude faster than he claims; you need to go back more than 15 years for the numbers the paper tosses around to even be plausible.
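
To make "cheap at ~10^-1 error" concrete, here's a minimal sketch of the kind of approximation I mean (my own illustration, not John's posit scheme, and the function name is mine): a piecewise-linear "hard sigmoid". It's a multiply-add plus a clamp per input, no exp and no divide, so it vectorizes trivially and its worst-case error is in the same ~10^-1 ballpark.

    #include <stdio.h>
    #include <math.h>

    /* Piecewise-linear "hard sigmoid": clamp(0.25*x + 0.5, 0, 1).
       A multiply-add plus a clamp per input, no exp and no divide,
       so a compiler will happily turn a loop over this into packed
       fma/min/max.  Worst-case absolute error vs. 1/(1+exp(-x)) is
       about 0.12, i.e. the ~10^-1 regime discussed above. */
    static inline float hard_sigmoid(float x) {
        float y = 0.25f * x + 0.5f;
        if (y < 0.0f) y = 0.0f;
        if (y > 1.0f) y = 1.0f;
        return y;
    }

    int main(void) {
        float worst = 0.0f;
        for (float x = -8.0f; x <= 8.0f; x += 0.01f) {
            float err = fabsf(hard_sigmoid(x) - 1.0f / (1.0f + expf(-x)));
            if (err > worst) worst = err;
        }
        printf("max abs error vs exact sigmoid: %f\n", worst);  /* ~0.12 */
        return 0;
    }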

There are several other examples like this in the paper. I don't want to be too antagonistic, because posits are not a bad idea (actually, I think they're a pretty good format), but this paper is either ignorant of the state of the art, or more marketing than science.



I'm just going to be blunt here. John and I have decided that we need to be more marketing-savvy after he's had trouble with several rounds of pitching other floating-point formats. Posits are just an intermediate step to try to build acceptance for valids, so there's a lot of effort put into branding.

A couple of points: "2-4x faster" means 2x faster in dot-product-based SIMD and 4x faster in matrix SIMD, assuming your bottleneck is memory throughput.

The sigmoid function realistically isn't a bottleneck in general, but you gotta admit it is pretty cool to have a ~zero clock cycle approximation. (I've tried to rein in John a bit on this one)


If you're bound by memory throughput, you can't go beyond a 2x speedup: there's half as much data to move in an 8b format, and whether it's in vectors or matrices doesn't matter. I still don't see any reasonable expectation for 4x.
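
Back-of-the-envelope, under the memory-bound assumption (vector length and names are placeholders for illustration, not a benchmark):

    #include <stdio.h>

    /* Traffic for a length-n dot product when both operands are
       streamed from memory: dropping from 16-bit to 8-bit elements
       halves the bytes moved, so the bandwidth-limited speedup is
       capped at 2x no matter what the kernel is. */
    int main(void) {
        long n = 1 << 20;                 /* example vector length */
        long bytes_fp16   = 2 * n * 2;    /* two operands, 2 B each */
        long bytes_posit8 = 2 * n * 1;    /* two operands, 1 B each */
        printf("fp16:   %ld bytes\n", bytes_fp16);
        printf("posit8: %ld bytes\n", bytes_posit8);
        printf("bandwidth-bound speedup: %.1fx\n",
               (double)bytes_fp16 / bytes_posit8);
        return 0;
    }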


If your matrix contents are static during your rate-limiting step (as they are for most DL applications), your FLOPs grow as O(n^2) while the bytes you stream for the vector grow only as O(n), so the arithmetic per byte moved grows with n. For example:

[a b, c d] dot [e, f] is four multiplies for two streamed vector elements;

[a b c, d e f, g h i] dot [j, k, l] is nine multiplies for three.
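
Here's that counting argument as a sketch (matvec is a hypothetical helper written for this comment, not anything from the paper): with the n x n matrix held resident across calls and only the vector streamed, each call does n*n multiplies against n streamed inputs.

    #include <stdio.h>

    /* Naive n x n matrix-vector product.  If the matrix (the weights)
       stays resident and only x and y move, each call does n*n
       multiplies against n streamed inputs: multiplies grow as O(n^2),
       streamed traffic grows as O(n). */
    static void matvec(int n, const float *A, const float *x, float *y) {
        for (int i = 0; i < n; i++) {
            float acc = 0.0f;
            for (int j = 0; j < n; j++)
                acc += A[i * n + j] * x[j];  /* one multiply per matrix element */
            y[i] = acc;
        }
    }

    int main(void) {
        /* the 2x2 and 3x3 cases from the comment above */
        float A2[4] = {1, 2, 3, 4}, x2[2] = {5, 6}, y2[2];
        float A3[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9}, x3[3] = {1, 2, 3}, y3[3];
        matvec(2, A2, x2, y2);  /* 4 multiplies for 2 streamed inputs */
        matvec(3, A3, x3, y3);  /* 9 multiplies for 3 streamed inputs */
        printf("y2 = [%g, %g]\n", y2[0], y2[1]);
        printf("y3 = [%g, %g, %g]\n", y3[0], y3[1], y3[2]);
        return 0;
    }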



