I have a question about LEB128. I use a similar format in my code, but with two changes:
1. It's big endian.
2. The "last byte" bit is inverted (so it's 00001 instead of 11110).
I changed the last byte flag so that it is easy to count the number of values in an array - you just do this:
let mut count = 0;
for byte in bytes {
    count += byte >> 7;
}
And the reason I used big endian is because it makes decoding simpler. Encoding is more difficult but data is only ever encoded once, and is commonly decoded multiple times.
let mut value = 0;
for byte in bytes {
    value = (value << 7) | (byte & 0x7F);
    if byte & 0x80 != 0 {
        return value;
    }
}
Vs the original implementation, which basically has to do this:
let mut value = 0;
let mut shift = 0;
for byte in bytes {
    value |= (byte & 0x7F) << shift;
    if byte & 0x80 == 0 {
        return value;
    }
}
You should put all of those "has another byte" bits together, so you can look at just the first two bytes to figure out what to do, and then perform your load. In your most common case, you can perform an 8-byte MOVBE, an LZCNT/BSR, and if the number of leading 0s/1s is < 8, just shift and mask to get your value.
Have a look at how UTF-8 is encoded for a similar scheme. The space taken is exactly the same as LEB128, but your code is less branchy and needs fewer masks and shifts.
You are misunderstanding me. Values less than 128 take only one byte.
The obvious implementation fast path is:
1. Check that the current position is at least 8 bytes from the end of the buffer. Otherwise, go to the short-buffer slow path.
2. Count the number of leading 1s (or 0s) using BSR/LZCNT.
3. If the number of leading 1s (or 0s) indicates more than 8 bytes are necessary, use the multi-word slow path.
4. Shift and mask out your value.
5. Increment your position pointer by the length of your varint.
Values less than 128 will have no leading 1s (or no leading zeros, if your scheme uses leading zeroes instead). The shift value is 56 bits. Unless you have a 1-byte optimized path, your calculated mask value is (((uint64_t)-1LL) >> 57). The position pointer is incremented by one byte at the end of the decode.
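The five steps above might be sketched like this in Rust (my reading of the scheme, assuming leading-ones tagging and lengths up to 8 bytes; `decode_prefix_varint` is a hypothetical name, and the multi-word and short-buffer slow paths are left as `None`):

```rust
/// Fast-path sketch: the length tag lives in the leading bits of the first
/// byte, so a single leading-ones count gives the total length.
fn decode_prefix_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let first = *buf.first()?;
    let extra = first.leading_ones() as usize; // bytes after the first
    if extra >= 8 || buf.len() < extra + 1 {
        return None; // multi-word or short-buffer slow path
    }
    let len = extra + 1;
    // Big-endian load of up to 8 bytes (the 8-byte MOVBE).
    let mut raw = [0u8; 8];
    raw[8 - len..].copy_from_slice(&buf[..len]);
    let word = u64::from_be_bytes(raw);
    // 7 payload bits per byte: mask off the tag bits. For len == 1 this is
    // exactly the (((uint64_t)-1LL) >> 57) mask mentioned above.
    let value = word & (u64::MAX >> (64 - 7 * len));
    Some((value, len))
}
```

For a one-byte value like `[0x05]` this returns `(5, 1)`; a two-byte value `10 000001 00000000` decodes to 256.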
Now that I think about it, for the uint128_t encoding case, you actually need to inspect up to the first 3 bytes. I've only implemented this for encoding/decoding uint64_ts, in which case there's a little trick that saves one byte, but the trick only works because 64 % 7 == 1.
Maybe it's clearest if I spell out the 19 cases necessary to encode any uint128_t (for a system using BSR (leading 1s) instead of LZCNT (leading 0s)).
0 xxxxxxx -> 0 to 127
10 xxxxxx yyyyyyyy -> 128 to 2**14 - 1
110 xxxxx yyyyyyyy zzzzzzzz -> 2**14 to 2**21-1
1110 xxxx yyyyyyyy zzzzzzzz aaaaaaaa -> 2**21 to 2**28-1
11110 xxx yyyyyyyy zzzzzzzz ... 2 more bytes -> 2**28 to 2**35-1
111110 xx yyyyyyyy zzzzzzzz ... 3 more bytes -> 2**35 to 2**42-1
1111110 x yyyyyyyy zzzzzzzz ... 4 more bytes -> 2**42 to 2**49-1
11111110 yyyyyyyy zzzzzzzz ... 5 more bytes -> 2**49 to 2**56-1
11111111 0 yyyyyyy zzzzzzzz ... 6 more bytes -> 2**56 to 2**63-1
11111111 10 yyyyyy zzzzzzzz ... 7 more bytes -> 2**63 to 2**70-1
11111111 110 yyyyy zzzzzzzz ... 8 more bytes -> 2**70 to 2**77-1
11111111 1110 yyyy zzzzzzzz ... 9 more bytes -> 2**77 to 2**84-1
11111111 11110 yyy zzzzzzzz ... 10 more bytes -> 2**84 to 2**91-1
11111111 111110 yy zzzzzzzz ... 11 more bytes -> 2**91 to 2**98-1
11111111 1111110 y zzzzzzzz ... 12 more bytes -> 2**98 to 2**105-1
11111111 1111111 0 zzzzzzzz ... 13 more bytes -> 2**105 to 2**112-1
11111111 11111111 0zzzzzzz ... 14 more bytes -> 2**112 to 2**119-1
11111111 11111111 10zzzzzz ... 15 more bytes -> 2**119 to 2**126-1
11111111 11111111 110zzzzz ... 16 more bytes -> 2**126 to 2**133-1
I would guess that Rust wants to stick to one standard encoding, and LEB128 is already in WebAssembly, DWARF, and LLVM. Adding BEB128 would add more complexity. Also, since Rust is generating a bunch of output files, it's definitely possible that it encodes more often than it decodes.
You missed a "shift += 7;" in your original implementation code, though it's present at the godbolt link.
It's not clear what point 2 is supposed to specify. LEB128 sets the 0x80 bit if there's a continuation. I'm not sure what "(00001 instead of 11110)" is supposed to specify. Does BEB128 leave the top bit clear when there's a continuation? Or does it set the lowest bit when there's a continuation? The code for counting would work for LEB128, but the text around it makes it seem like it's an improvement over LEB128.
Yeah I mean the continuation bit is just inverted so 0=continue, 1=end-of-value. You can still count values efficiently with 1=continue, 0=end-of-value, but you have an extra subtraction at the end (`return number_of_bytes - number_of_bytes_with_top_bit_set;`)
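That subtraction version for standard LEB128 might look like (a sketch; the function name is mine):

```rust
// Counting values in standard LEB128 (1 = continue): count the bytes with
// the continuation bit set, then subtract them from the total byte count.
// This is the extra subtraction mentioned above.
fn count_leb128_values(bytes: &[u8]) -> usize {
    let continuations: usize = bytes.iter().map(|&b| usize::from(b >> 7)).sum();
    bytes.len() - continuations
}
```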
Good point about not adding another format. I guess they want to avoid confusion.
I think it's widely understood that a little-endian prefix varint decodes much faster and is the leading alternative to LEB128. You can implement the whole thing without a loop, so you don't have to exercise branch and loop prediction resources at all.
- Count leading zeros (or ones, depending on the first byte's tagging preference).
- Do an unaligned load, expressed as a memcpy; let the compiler's instruction scheduler peephole it into a single unaligned load instruction.
- Shift the loaded value by an amount indicated by the leading-zero count.
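One common little-endian layout puts the tag in the low (trailing) bits of the first byte rather than the leading bits; under that assumption (a sketch, not any particular library's code, and limited to lengths 1..=8), the three steps become:

```rust
// Little-endian prefix varint: the first byte's low bits hold the length
// tag (len-1 zeros, then a 1), so one trailing-zeros count gives the length.
fn decode_le_prefix(buf: &[u8]) -> Option<(u64, usize)> {
    let first = *buf.first()?;
    let len = first.trailing_zeros() as usize + 1;
    if len > 8 || buf.len() < len {
        return None; // multi-word or short-buffer slow path
    }
    // Unaligned load expressed as a copy; the compiler turns this into a
    // single unaligned load on the fast path.
    let mut raw = [0u8; 8];
    raw[..len].copy_from_slice(&buf[..len]);
    let word = u64::from_le_bytes(raw);
    Some((word >> len, len)) // drop the tag bits
}
```

A first byte of all zeros signals a length of 9 or more, which falls through to the slow path here.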
I.e. a one-byte tag followed by the value? It's definitely an option for me, but the main issue is that values under 128 now take 2 bytes instead of 1. I guess I could do something like what CBOR does and use 0-251 = literal value, 252 = u8 follows, 253 = u16 follows, 254 = u32 follows, 255 = u64 follows.
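A sketch of that CBOR-like fallback (the tag values 252..=255 are the ones proposed above; the function name and the little-endian byte order of the payload are my assumptions):

```rust
// Tagged encoding: 0..=251 stored inline as a single byte, otherwise a tag
// byte (252..=255) naming the width of the little-endian integer that follows.
fn encode_tagged(v: u64, out: &mut Vec<u8>) {
    if v <= 251 {
        out.push(v as u8);
    } else if v <= u8::MAX as u64 {
        out.push(252);
        out.push(v as u8);
    } else if v <= u16::MAX as u64 {
        out.push(253);
        out.extend_from_slice(&(v as u16).to_le_bytes());
    } else if v <= u32::MAX as u64 {
        out.push(254);
        out.extend_from_slice(&(v as u32).to_le_bytes());
    } else {
        out.push(255);
        out.extend_from_slice(&v.to_le_bytes());
    }
}
```

Small values still take one byte (e.g. 10 encodes as `[10]`), and the worst case is 9 bytes for a full u64.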
Here is the generated assembly in both cases: https://godbolt.org/z/6FbX_E. No idea if it would be faster than the new code, but surely it is faster than the old code?