I have a question about LEB128. I use a similar format in my code, but with two changes:
1. It's big endian.
2. The "last byte" bit is inverted (so it's 00001 instead of 11110).
I changed the last byte flag so that it is easy to count the number of values in an array - you just do this:
let mut count = 0;
for byte in bytes {
    count += byte >> 7;
}
And the reason I used big endian is because it makes decoding simpler. Encoding is more difficult but data is only ever encoded once, and is commonly decoded multiple times.
let mut value = 0;
for byte in bytes {
    value = (value << 7) | (byte & 0x7F);
    if byte & 0x80 != 0 {
        return value;
    }
}
Vs the original implementation, which basically has to do this:
let mut value = 0;
let mut shift = 0;
for byte in bytes {
    value |= (byte & 0x7F) << shift;
    if byte & 0x80 == 0 {
        return value;
    }
}
You should put all of those "has another byte" bits together, so you can look at just the first two bytes to figure out what to do, and then perform your load. In your most common case, you can perform an 8-byte MOVBE, an LZCNT/BSR, and if the number of leading 0s/1s is < 8, just shift and mask to get your value.
Have a look at how UTF-8 is encoded for a similar scheme. The space taken is exactly the same as LEB128, but your code is less branchy and needs fewer masks and shifts.
You are misunderstanding me. Values less than 128 take only one byte.
The obvious implementation fast path is:
1. Check that the current position is at least 8 bytes from the end of the buffer. Otherwise, go to the short-buffer slow path.
2. Count the number of leading 1s (or 0s) using BSR/LZCNT.
3. If the number of leading 1s (or 0s) indicates more than 8 bytes are necessary, use the multi-word slow path.
4. Shift and mask out your value.
5. Increment your position pointer by the length of your varint.
Values less than 128 will have no leading 1s (or no leading zeros, if your scheme uses leading zeroes instead). The shift value is 56 bits. Unless you have a 1-byte optimized path, your calculated mask value is (((uint64_t)-1LL) >> 57). The position pointer is incremented by one byte at the end of the decode.
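The five steps above might be sketched like this in Rust (my reading of the scheme, assuming leading-ones tagging and lengths up to 8 bytes; `decode_prefix_varint` is a hypothetical name, and the multi-word and short-buffer slow paths are left as `None`):

```rust
/// Fast-path sketch: the length tag lives in the leading bits of the first
/// byte, so a single leading-ones count gives the total length.
fn decode_prefix_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let first = *buf.first()?;
    let extra = first.leading_ones() as usize; // bytes after the first
    if extra >= 8 || buf.len() < extra + 1 {
        return None; // multi-word or short-buffer slow path
    }
    let len = extra + 1;
    // Big-endian load of up to 8 bytes (the 8-byte MOVBE).
    let mut raw = [0u8; 8];
    raw[8 - len..].copy_from_slice(&buf[..len]);
    let word = u64::from_be_bytes(raw);
    // 7 payload bits per byte: mask off the tag bits. For len == 1 this is
    // exactly the (((uint64_t)-1LL) >> 57) mask mentioned above.
    let value = word & (u64::MAX >> (64 - 7 * len));
    Some((value, len))
}
```

For a one-byte value like `[0x05]` this returns `(5, 1)`; a two-byte value `10 000001 00000000` decodes to 256.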
Now that I think about it, for the uint128_t encoding case, you actually need to inspect up to the first 3 bytes. I've only implemented this for encoding/decoding uint64_ts, in which case there's a little trick that saves one byte, but the trick only works because 64 % 7 == 1.
Maybe it's clearest if I spell out the 19 cases necessary to encode any uint128_t (for a system using BSR (leading 1s) instead of LZCNT (leading 0s)).
0 xxxxxxx -> 0 to 127
10 xxxxxx yyyyyyyy -> 128 to 2**14 - 1
110 xxxxx yyyyyyyy zzzzzzzz -> 2**14 to 2**21-1
1110 xxxx yyyyyyyy zzzzzzzz aaaaaaaa -> 2**21 to 2**28-1
11110 xxx yyyyyyyy zzzzzzzz ... 2 more bytes -> 2**28 to 2**35-1
111110 xx yyyyyyyy zzzzzzzz ... 3 more bytes -> 2**35 to 2**42-1
1111110 x yyyyyyyy zzzzzzzz ... 4 more bytes -> 2**42 to 2**49-1
11111110 yyyyyyyy zzzzzzzz ... 5 more bytes -> 2**49 to 2**56-1
11111111 0 yyyyyyy zzzzzzzz ... 6 more bytes -> 2**56 to 2**63-1
11111111 10 yyyyyy zzzzzzzz ... 7 more bytes -> 2**63 to 2**70-1
11111111 110 yyyyy zzzzzzzz ... 8 more bytes -> 2**70 to 2**77-1
11111111 1110 yyyy zzzzzzzz ... 9 more bytes -> 2**77 to 2**84-1
11111111 11110 yyy zzzzzzzz ... 10 more bytes -> 2**84 to 2**91-1
11111111 111110 yy zzzzzzzz ... 11 more bytes -> 2**91 to 2**98-1
11111111 1111110 y zzzzzzzz ... 12 more bytes -> 2**98 to 2**105-1
11111111 1111111 0 zzzzzzzz ... 13 more bytes -> 2**105 to 2**112-1
11111111 11111111 0zzzzzzz ... 14 more bytes -> 2**112 to 2**119-1
11111111 11111111 10zzzzzz ... 15 more bytes -> 2**119 to 2**126-1
11111111 11111111 110zzzzz ... 16 more bytes -> 2**126 to 2**133-1
I would guess that Rust wants to stick to one standard encoding, and LEB128 is already in WebAssembly, DWARF, and LLVM. Adding BEB128 would add more complexity. Also, since Rust is generating a bunch of output files, it's definitely possible that it encodes more often than it decodes.
You missed a "shift += 7;" in your original implementation code, though it's present at the godbolt link.
It's not clear what point 2 is supposed to specify. LEB128 sets the 0x80 bit if there's a continuation. I'm not sure what "(00001 instead of 11110)" is supposed to specify. Does BEB128 leave the top bit clear when there's a continuation? Or does it set the lowest bit when there's a continuation? The code for counting would work for LEB128, but the text around it makes it seem like it's an improvement over LEB128.
Yeah I mean the continuation bit is just inverted so 0=continue, 1=end-of-value. You can still count values efficiently with 1=continue, 0=end-of-value, but you have an extra subtraction at the end (`return number_of_bytes - number_of_bytes_with_top_bit_set;`)
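That subtraction version for standard LEB128 might look like (a sketch; the function name is mine):

```rust
// Counting values in standard LEB128 (1 = continue): count the bytes with
// the continuation bit set, then subtract them from the total byte count.
// This is the extra subtraction mentioned above.
fn count_leb128_values(bytes: &[u8]) -> usize {
    let continuations: usize = bytes.iter().map(|&b| usize::from(b >> 7)).sum();
    bytes.len() - continuations
}
```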
Good point about not adding another format. I guess they want to avoid confusion.
I think it's widely understood that a little-endian prefix varint decodes much faster and is the leading alternative to LEB128. You can implement the whole thing without a loop, so you don't have to exercise branch and loop prediction resources at all.
- Count leading zeros (or ones, depending on the first byte's tagging preference).
- Do an unaligned load, expressed as a memcpy; let the compiler's instruction scheduler peephole it into a single unaligned load instruction.
- Shift the loaded value by an amount indicated by the leading-zero count.
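One common little-endian layout puts the tag in the low (trailing) bits of the first byte rather than the leading bits; under that assumption (a sketch, not any particular library's code, and limited to lengths 1..=8), the three steps become:

```rust
// Little-endian prefix varint: the first byte's low bits hold the length
// tag (len-1 zeros, then a 1), so one trailing-zeros count gives the length.
fn decode_le_prefix(buf: &[u8]) -> Option<(u64, usize)> {
    let first = *buf.first()?;
    let len = first.trailing_zeros() as usize + 1;
    if len > 8 || buf.len() < len {
        return None; // multi-word or short-buffer slow path
    }
    // Unaligned load expressed as a copy; the compiler turns this into a
    // single unaligned load on the fast path.
    let mut raw = [0u8; 8];
    raw[..len].copy_from_slice(&buf[..len]);
    let word = u64::from_le_bytes(raw);
    Some((word >> len, len)) // drop the tag bits
}
```

A first byte of all zeros signals a length of 9 or more, which falls through to the slow path here.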
I.e. a one-byte tag followed by the value? It's definitely an option for me, but the main issue is that values under 128 now take 2 bytes instead of 1. I guess I could do something like what CBOR does and use 0-251 = literal value, 252 = u8 follows, 253 = u16 follows, 254 = u32 follows, 255 = u64 follows.
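A sketch of that CBOR-like fallback (the tag values 252..=255 are the ones proposed above; the function name and the little-endian byte order of the payload are my assumptions):

```rust
// Tagged encoding: 0..=251 stored inline as a single byte, otherwise a tag
// byte (252..=255) naming the width of the little-endian integer that follows.
fn encode_tagged(v: u64, out: &mut Vec<u8>) {
    if v <= 251 {
        out.push(v as u8);
    } else if v <= u8::MAX as u64 {
        out.push(252);
        out.push(v as u8);
    } else if v <= u16::MAX as u64 {
        out.push(253);
        out.extend_from_slice(&(v as u16).to_le_bytes());
    } else if v <= u32::MAX as u64 {
        out.push(254);
        out.extend_from_slice(&(v as u32).to_le_bytes());
    } else {
        out.push(255);
        out.extend_from_slice(&v.to_le_bytes());
    }
}
```

Small values still take one byte (e.g. 10 encodes as `[10]`), and the worst case is 9 bytes for a full u64.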
Here is the generated assembly in both cases: https://godbolt.org/z/6FbX_E. No idea if it would be faster than the new code, but surely it is faster than the old code?