Hashing is not encryption

triska · on Jan 9, 2022

A cryptographic hash function H can be used for encryption though, by mimicking a one-time pad: Pick a random integer k, concatenate it with a secret s, hash the result obtaining H(s•k), XOR this with the plaintext yielding the ciphertext C and send the pair (k,C). Pick another integer if you need to encrypt more data.

Decryption proceeds in the same way, mutatis mutandis. This is the basic idea behind one of the best currently used ciphers, ChaCha20: It builds on a hash function.

From a regulatory and legal perspective, this means that if you want to ban strong encryption, you must ban cryptographic hash functions.

ThePhysicist · on Jan 9, 2022

Yep we use that to e.g. construct stream ciphers for structure-preserving pseudonymization. Security proof is quite simple as well if you can assume the hash function is secure. No one would use this for encryption though as it's quite slow compared to a modern stream cipher. We use it because there's (to my knowledge) no standardized stream cipher that provides output feedback mode (OFB) out-of-the-box.

tptacek · on Jan 9, 2022

Pretty much everybody uses this for encryption; the most popular stream cipher (qua stream cipher) in the world is Salsa20/ChaCha20, which is simply a keyed hash function run in counter mode.

ignoramous · on Jan 9, 2022

> ...structure-preserving pseudonymization

Can you write about this a bit more, please (what does it mean)? Is it comparable to Tokenization (typically used to hide sensitive-date like credit card details, for example) [0], or to Homomorphic encryption [1]?

[0] https://en.wikipedia.org/wiki/Tokenization_(data_security)

[1] https://people.csail.mit.edu/vinodv/FHE/FHE-refs.html

ThePhysicist · on Jan 10, 2022

Tokenization typically means that a third party keeps an association table, though it is possible to generate tokens cryptographically. Homomorphic encryption is something different again, though you could call this structure-preserving pseudonymization homomorphic as it preserves certain structural properties of the input data. It does not provide the same security guarantees as FHE though as the structure preservation weakens the encryption. For pseudonymization this is a tradeoff that can be worthwhile though.

0xdeadb00f · on Jan 10, 2022

> (to my knowledge) no standardized stream cipher that provides output feedback mode (OFB) out-of-the-box.

I wasn't aware stream ciphers typically had different modes of operation like block ciphers did. Could you expand on this? I'm kind of interested.

ThePhysicist · on Jan 10, 2022

OFB means you feed the output of the cipher back to the input. Hence if you have two character sequences that share a prefix e.g. "abc123" and "abcxyz" the first three cipher stream bits will be identical and then deviate based on the feedback the cipher gets (assuming no IV and identical key). This is interesting for pseudonymization when you want to e.g. create cryptographically secure pseudonyms that preserve prefixes (think IP address pseudonyms that preserve subnet groupings).

freeqaz · on Jan 9, 2022

You nerd sniped me with this. At least now I could implement ChaCha from memory... so thanks for that!

It's cool to see how simple it is. It's a lot more intuitive than other ciphers are!

charlieyu1 · on Jan 9, 2022

I’m not even sure that OTP is secure by modern standards. Since OTP encrypts with XOR only, a man-in-the-middle can still modify the message by XOR a few bytes with constants like 0x01 to modify the message.

DarylZero · on Jan 9, 2022

OTP can be combined with some message authentication code, so I don't think that's a fair criticism.

mistercow · on Jan 9, 2022

Yeah, nothing is secure if you apply it incorrectly. SSH isn't secure if you tweet your private key.

Bender · on Jan 9, 2022

One mitigation against this is to specify on the target hosts where that keys are valid from. One could even be as liberal as to give the entire CIDR block of their ISP or Corporate IP CIDR block or preferably use a VPN network. The key will not be valid for anyone not in the specified networks or IP's separated by commas.

e.g. using my tinc VPN mesh

  grep keys /etc/ssh/sshd_config
  AuthorizedKeysFile          /etc/ssh/keys/%u

cat /etc/ssh/keys/root # honeypot server

  from="192.168.0.0/16,10.15.10.2" ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDDlyvd+LAveSiIIgQW3Y+aWsb8bTu70tgXQjoWMrMAbKt1/sDoE/w2qqLBSqLjJmA46zKqwgZhnO8hkZjpizjb909fNYrgyNOsRcHOvumanORUGjcumxi5fklqNZCXC1rehX4z+tnVzygtDNQcqlmyy1ElBM+SxKS+YIrOymPOU6bc5V1sildq5cVUSuEHcLggOadCj6azbiR1J46D6UYxAaKYjq2OloLZXRhvSYrmVeH+PNKiShgYBu4TfCg4UoJKZsHTaeS2akZAQTqcjy7bB8HRWGXYEMKAKA95w8MTqaWlIoMKBuOJnd8+wRXrZKnSsTQmQQZg0rggO6GC8ufT leaked_on_purpose

I would be happy to provide my private key and a VM IP if anyone would like to try this out.

xyzzyz · on Jan 9, 2022

If you think this is secure even if you leak private key, why not just disable key checking altogether and just allow anyone to log in if they come from accepted network?

Taek · on Jan 9, 2022

Defense in depth - have several independent layers of security so you have some breathing room to fix things if one of the layers is fully compromised.

Bender · on Jan 9, 2022

One could do that as well. Different use cases can mix and match any restrictions using the Match block in sshd_config or simply having a general restriction like this.

  AllowUsers root@192.168.*.*

DarylZero · on Jan 9, 2022

Does OpenSSH let you disable all authentication?

Bender · on Jan 9, 2022

Does OpenSSH let you disable all authentication?

Yes one can disable any/all forms of authentication. At that point it might be worth shutting down the SSH daemon to save a few MB of ram. If you mean permit null passwords that can be enabled too.

DarylZero · on Jan 9, 2022

Huh. Neat.

> At that point it might be worth shutting down the SSH daemon

Could be useful with ForceCommand. Sometimes you really only want to authenticate the server -- like with HTTPS.

Bender · on Jan 9, 2022

Good point. That is what I am doing:

  Match Group                     sftpusers
    PubkeyAuthentication            yes
    PasswordAuthentication          yes
    PermitEmptyPasswords            yes
    GatewayPorts                    no
    ChrootDirectory                 /data/sftphome/%u
    ForceCommand                    internal-sftp -l DEBUG1 -f AUTHPRIV -P symlink,hardlink,fsync,rmdir,remove,rename,posix-rename
    AllowTcpForwarding              no
    AllowAgentForwarding            no

DarylZero · on Jan 10, 2022

You have `PubkeyAuthentication` and `PasswordAuthentication` both set to `yes`, which is confusing to me. I thought you were saying they were disabled.

When I connect without a `-i` parameter the SSH client attempts to authenticate with all my keys, succeeding with none, then succeeds with authentication type `none`. When I specify -i `/dev/null` the logs are confusing as to what's happening:

    $ ssh -i /dev/null -v serveradmin@50.116.13.30 2>&1 | egrep -ie 'auth|attempt'
    debug1: Authenticating to 50.116.13.30:22 as 'serveradmin'
    debug1: Will attempt key: /dev/null  explicit
    debug1: Authentication succeeded (none).
    Authenticated to 50.116.13.30 ([50.116.13.30]:22).

I guess what you actually did was not disable authentication methods, but enable empty passwords, and set the account to an empty password -- and SSH does not even ask for the empty password in this case or even call it the same name as password auth in the logs. I guess!?

Bender · on Jan 10, 2022

Having both set to yes means you are able to use either. Earlier I mentioned there was the possibility to disable both of them at which point nobody would be able to log in because only keys and passwords are used on that system. That is why I jokingly said you may as well shut down the daemon to save ram. There are methods other than keys and passwords but I have not implemented any of them.

A null password is using none [1] which is a reserved term and has specific implications. In this case of passwords, none will succeed since I enabled null passwords and that account has no password set. So none is matching in the PasswordAuthentication option.

  /etc/passwd  serveradmin:x:22218:5000::/data/sftphome/serveradmin:/bin/false
  /etc/shadow  serveradmin::18955:0:99999:7:::

There is no password hash after the first colon in the shadow entry. This was accomplished with

  usermod -p '' serveradmin

Using -i pass a key file but since I have not added any of your keys to the system then any key you specify will not match and the client will proceed on to the next authentication method you have enabled and have not explicitly disabled in the client. To accept any random key would require a change to the source code of OpenSSH. One can even hard code a public key into OpenSSH in the source code which I have done when configuring OpenSSH as a recovery daemon that does not read or write to the disk.

I hope that helps this make sense. It is understandable that this is confusing because what I am doing is the opposite of what people are instructed to do. Null passwords are considered a security risk unless you do it intentionally for some unorthodox implementation reason which I have done. I have a cron job that adds accounts within some character restrictions that bots try to brute force. I then monitor the bots behavior to see what current bots are attempting over SSH. I do the same for http/https. Both are what is referred to as a low-interaction honeypot.

Here is a bot with IP and port redacted that just tried to use my server as a DNS proxy:

  Connection from 8.37.x.x port x on 50.116.13.30 port 22
  Accepted password for admin from 8.37.x.x port x ssh2
  Changed root directory to "/data/sftphome/admin" [postauth]
  refused local port forward: originator 127.0.0.1 port x,  target 1.1.1.1 port 53 [postauth]
  Received disconnect from 8.37.x.x port x:11: Disconnected on user's request. [postauth]

[1] - https://datatracker.ietf.org/doc/html/rfc4252#page-7

danlugo92 · on Jan 9, 2022

But my private key just laying there unencrypted in `~/.ssh/id_rsa` is safe for some reason??

I've never understood this.

jamespwilliams · on Jan 9, 2022

Smartcards e.g Yubikeys help with this. You store the key on the smartcard, then use ssh-agent to have the smartcard perform whatever cryptographic operations are necessary. On Yubikeys at least, the keys can never leave the smartcard, by design.

freeqaz · on Jan 9, 2022

You should encrypt your private key with a password. Then you can use ssh-agent to hold a copy of that decrypted key in-memory so that you don't have to type your password in every time you 'git push'

nerdponx · on Jan 9, 2022

Is this actually an extra layer of security? Is it harder to get read access to another user's file with 0600 permissions, or to read memory from another user's process? In my limited understanding they seem equivalent (at least on Linux).

joombaga · on Jan 9, 2022

They're approximately equivalent on Linux. Adding a password helps if you accidentally tweet or email your private key, leave it on a flash drive, leave your computer's disk decrypted, etc.

staticassertion · on Jan 9, 2022

It's primarily targeting offline attacks ie: I've stolen your laptop or I am accessing it when you haven't decrypted it.

But there are also niche situations where an attacker has read permissions for files but not memory. It's atypical, but in cases where you have 'confused deputies' it happens - for example, if I have a service that attempts to safely broker reads to files but that service has a logic bug where a client can trick it into reading sensitive files.

The classic example of this is a path traversal in a web service.

Also, while there's no boundary, attackers are a lot happier reading files than memory. They're only human.

In general I find that, when you can, you should just encrypt your data. It can protect you in some surprising ways.

auxym · on Jan 9, 2022

Encrypting the file, or better whole disk encryption, prevents someone stealing the laptop and yanking the drive to get access to everything.

csunbird · on Jan 9, 2022

No, you are supposed to use a passphrase to protect it.

m-p-3 · on Jan 9, 2022

You should always protect your private key(s) with a password.

tptacek · on Jan 9, 2022

Encryption and message authentication are separate problems; the combination of both is called "authenticated encryption". Almost all AE/AEAD schemes ultimately encrypt by simply XOR'ing a keystream against plaintext.

zarzavat · on Jan 9, 2022

"OTP" should not be taken literally here, it's merely an analogy.

Yes, true one-time pad encryption is not secure by modern standards, as you mention there's no authentication, you need a perfect random number source, and key management is horrendous.

However, the concept of encryption by XORing a stream and the plaintext is very much the modern way of doing things, and "OTP" is an analogy for that.

akerl_ · on Jan 9, 2022

“OTP” isn’t an analogy for the concept of XORing a stream. One-time-pad is a specific way of handling key generation.

zarzavat · on Jan 10, 2022

People use it that way.

akerl_ · on Jan 10, 2022

Who? OTP is a one time pad. You’re the first person I’ve ever seen make the claim that it’s an analogy for the concept of xor.

SavantIdiot · on Jan 9, 2022

I believe that's why ChaCha20 is used with Poly1305 to provide authentication, which is why you see it referred to as ChaChaPoly.

sangel · on Jan 9, 2022

I think this requires assuming H is a random oracle, no?

Suppose H(s||k) is a collision-resistant hash function. Let's build another CRHF from H as follows: H'(s || k) = H(s || k) || last-bit-of-k.

Now let's instantiate the cryptosystem you proposed with H'. Suppose m is message and that it is of length equal to the length of the bitstring produced by H'.

We have C = H'(s||k) XOR m.

The ciphertext is thus: (k, C).

What can an attacker do? Well, the attacker can XOR the last bit of C with the last bit of k and get the last bit of m. This is already enough to violate semantic security.

matthewdgreen · on Jan 10, 2022

You're correct that collision resistance is not sufficient for the above construction to be secure, but you don't need to assume H() is a random oracle. You could instead model H(k||s) as a pseudorandom function with k as the key. And of course, if you don't trust existing functions to be directly pseudorandom, then a PRF can be built from one-way functions: so pre-image resistance is sufficient. (The remaining question is how to get there from CR alone.)

staticassertion · on Jan 9, 2022

I don't think they're proposing their implementation is sufficient. ChaCha has a nonce for a reason.

ris · on Jan 9, 2022

The more classic example is probably Feistel ciphers.

Normal_gaussian · on Jan 9, 2022

Which along with other block ciphers are the surprisingly hard to search solution to the problem of providing a bounded, iterable, id that doesnt leak its position of iteration.

(ie, provide an n character id without duplicates and using the entire space. Useful for ids for image hosting etc.)

SAI_Peregrinus · on Jan 9, 2022

Or just CTR mode of a hash function (which is closer to what ChaCha20 does internally, it runs a permutation in CTR mode). H(constant, key, nonce, ctr) gets a keystream block, XOR each keystream block with the corresponding plaintext (to encrypt) or ciphertext (to decrypt) block. The constant is important in ChaCha's case, it keeps the attacker-controlled portion of the block strictly less than half the total block, prevents an all-zero input, and has some asymmetry to help diffusion overall.

axiosgunnar · on Jan 9, 2022

Interesting post but stuff like

> From a regulatory and legal perspective, this means that if you want to ban strong encryption, you must ban cryptographic hash functions.

aka „you can't ban math! ha!“ is just a nerd dream.

The government would simply ban Whatsapp etc from implementing E2E encryption, something that is completely feasible, and 99% of the population‘s communications would be unencrypted.

sammyq · on Jan 10, 2022

Why is k an integer, could it be any arbitrary string? Also does k need have some special properties such as primality?

nathias · on Jan 9, 2022

is this unbreakable like OTP?

triska · on Jan 9, 2022

No, because the key stream is not random: It depends functionally on the integer which is sent in plain text, and on the shared secret.

For example, if you pick the same integer twice with the same secret then the XOR of the two ciphertexts is the XOR of the two plaintexts, thus losing confidentiality.

Karliss · on Jan 9, 2022

Neither is OTP secure if you reuse the key. One of main properties of OTP is that an encrypted message could be decrypted to any message of same size. In case of block cyphers and message longer than one block only a fraction of all messages can be obtained. So if you had enough compute resource(and it wasn't more than atoms in universe) for brute force or weakness was found in cyphers you could find which of the potential decrypted messages makes sense and is more likely.

smitty1e · on Jan 9, 2022

> Neither is OTP secure if you reuse the key.

I thought the OT meant "One Time". Reusing the key would take the key from singular to plural usage, no?

Karliss · on Jan 9, 2022

Yes, but I was explaining that no reuse requirement can be equally important for encryption schemes that don't have it included in the name. Claiming that one of them is worse than other because of number reuse even though doing so is equally wrong for both is not a good example.

pdpi · on Jan 9, 2022

Yup, but it’s more useful to talk about that reuse as a bad implementation of one-time pads than it is to call it something other than OTP.

In the classic sense of a small paper pad used by spies, reuse was fairly common because of the logistical problem of supplying agents with fresh pads.

Closi · on Jan 9, 2022

Absolutely - OTP with key reuse is not OTP!

runeks · on Jan 9, 2022

Would it be secure to encrypt a sequence of blocks by having the next integer be H(k) — such that the random integer for block number i is H applied to the original random integer (k) i times? Thus needing only an initial random integer from which all subsequent random integers are derived.

eternalban · on Jan 9, 2022

The general idea is to create a PRNG using the cryptographic hash. Since these hashes generally produce far more bits than needed for scalars (64 bits), you can mix the extra bits for the next cycle of the PRNG and emit (say) 64 bits per cycle. The sequence of the PRNG output is the 'pad'.

aliceryhl · on Jan 9, 2022

No.

With the hash-based algorithm you can go through every possible secret key and starting integer and check whether the result looks like English text (or whatever the contents are). If your English-text check is sufficiently good (how good it has to be decreases with the length of the text), then you're only going to have a very small number of matches, and one of those will be the plain text.

The above attack doesn't work on OTP because the above attack would simply yield a list of all English strings of the given length, and there would be no way to tell them apart.

Obviously the above attack is infeasible because there are too many secrets to check them all in a reasonable amount of time, but to be unbreakable in the same way OTP is, you have to be robust against even attacks that take an infinite amount of time.

akerl_ · on Jan 9, 2022

I’m not sure what you mean by “the above attack would simply yield a list of all English strings of a given length”.

To attack classic OTP, you’d brute force the keyspace. Since we’re XORing, whatever we encode the key as, it’s fundamentally being used in binary to XOR between plaintext and ciphertext. So your key is a binary blob the same length as the ciphertext. You keep trying and looking for what seems to be viable plaintext. You never get “all English strings of the given length”, because you’re not brute forcing the output, you’re brute forcing the key, which isn’t necessarily English.

For the hash-based approach, you’re bruteforcing the secret used for the hash function. That keyspace can be any length, unbounded on either end by the size of the message. Likewise, you’re trying H(sk) by guessing s and k, and then XORing the message and trying to determine if it looks like what you’d expect.

Your lower bound for the hash method is if the message length is shorter than the hash functions output, in which case you can cheat and literally just treat it like classic OTP, by bruteforcing possible values for H(sk) rather than guessing k and running the computation. But beyond that it’s the same dance (and if the hash function is slow, you can always fall back to brute forcing any length message the same way you would for classic OTP).

In both cases, assuming a large key size puts you past computational sanity for brute forcing, but neither the hash construction nor OTP is secure against “infinite” time.

eperdew · on Jan 9, 2022

> To attack classic OTP, you’d brute force the keyspace. Since we’re XORing, whatever we encode the key as, it’s fundamentally being used in binary to XOR between plaintext and ciphertext. So your key is a binary blob the same length as the ciphertext. You keep trying and looking for what seems to be viable plaintext. You never get “all English strings of the given length”, because you’re not brute forcing the output, you’re brute forcing the key, which isn’t necessarily English.

If your key for OTP is uniform random bits without any additional encoding, and of the same length as the plaintext, then won't you enumerate all possible messages of the given length? E.g., "abcdef" and "123456" are indistinguishable when encrypted without knowing the key because there exist keys that map each string to the same ciphertext.

aliceryhl · on Jan 9, 2022

> You never get “all English strings of the given length”, because you’re not brute forcing the output, you’re brute forcing the key, which isn’t necessarily English.

You misunderstand me. If you go through all possible keys and try to decrypt with each key, then for every possible input (of the same length), there will be one of the keys that yield that input. Hence, by going through all keys, you will hit all possible inputs, which includes all strings written in English.

> For the hash-based approach, you’re bruteforcing the secret used for the hash function. That keyspace can be any length, unbounded on either end by the size of the message.

My assumption is that the secret for the hash function input is of fixed length and shorter than the plaintext. This is not an unreasonable assumption — pretty much all existing hash functions satisfy that assumption unless the plaintext is really short.

> In both cases, assuming a large key size puts you past computational sanity for brute forcing, but neither the hash construction nor OTP is secure against “infinite” time.

These algorithms are certainly not computationally feasible, but OTP really is secure against unbounded computation in a way that the hash-based OTP is not. Knowing the ciphertext from an OTP gives literally no information about the original plaintext besides its length.

On the other hand, knowing the ciphertext of the hash-based algorithm must yield some information about the plaintext. To see why, consider an example:

1. The hash function takes as input a secret/integer of a combined 1024 bits. 2. The plaintext and ciphertext is 1000000 bits long.

Here, if you try all possible inputs to the hash function and try to decrypt the ciphertext using each one, then you're going to get a list with 2^1024 possible plaintexts. However, there are 2^1000000 possibilities for what the plaintext could be, so there must be some plaintexts missing from the list. Any plaintext that is missing from the list cannot possibly be the original plaintext.

akerl_ · on Jan 9, 2022

What is an example of a hash function with a fixed length constraint on the input?

aliceryhl · on Jan 10, 2022

Ah, yeah, I have mixed up things here a bit. However, the overall points still apply. When people use encryption, they generally don't use secrets anywhere near as long as the plaintext.

soVeryTired · on Jan 9, 2022

I guess the other difference between encryption and hashing is that a hash function need not be one-to-one. Many inputs can result in the same hash, though hopefully it's hard to find collisions.

So a hash function is allowed to destroy information, whereas it's pretty important that an encryption algorithm doesn't!

avianlyric · on Jan 9, 2022

Yeah I always like to think of hashing as a functions that maps an infinitely large namespace (all possible sequences of data) in to bounded namespace (usually 128-bits).

Find that makes it obvious why hash collisions are a thing and also why hashes are useful. A limited namespace is much easier to work with, assuming you don’t need the actual data.

esamueljohnson · on Jan 9, 2022

Well, a hash function cannot be one-to-one because of the pigeonhole principle.

didericis · on Jan 9, 2022

Theoretically couldn’t there be a hashing algorithm that’s one to one if it always spits out a hash as long or longer than the input message?

I’ve never actually walked through the math behind hashing algorithms, but I’m assuming collisions come from truncation. I’m guessing you’re usually not able to know exactly where two inputs that collide for the first n bits end up diverging, so the only way to ensure most hash functions are one to one is if the outputs have infinite length. But, if you had outputs of infinite length for different inputs, eventually they’d have to diverge. Idk if that’s true of all hashing functions/maybe there’s a way to know after what point outputs for different inputs have to diverge for some.

oconnor663 · on Jan 9, 2022

Most cryptographic hash functions in practice mix their input block-by-block into some "state" that's of a fixed size. This lets you implement them with a small, constant memory footprint, which is important.

For older designs like MD5, SHA-1, and SHA-256, the final hash is literally that state, just serialized into bytes and returned to the caller. (This is what makes "length extension attacks" possible on these hashes, which is why we need constructions like HMAC.) For newer designs like SHA-3 and the BLAKE family, the output is some function of the state, which prevents length extension attacks. This also makes it easy for these functions to offer "extendable output" features, i.e. as many output bytes as you like. (SHA-3 isn't standardized with this feature, but the very closely related SHAKE functions will gladly give you outputs of any length.)

However, one important thing to realize about these functions is that extended outputs do not increase security. This is counterintuitive, because we're used to distinctions like SHA-256 vs SHA-512, with the larger output providing more security in some sense. That's true, but it requires SHA-512 to keep a larger state in addition to producing a larger output. SHAKE128 and BLAKE3 always use the same state size, regardless of how many output bytes you ask for, and if you produce a collision in that state, all the output bytes will collide too.

Another commenter mentioned perfect hash functions, and my understanding of those is that they typically require the input set to be of some fixed size. If the input set is "any possible string", which it pretty much is for cryptographic hashes, I think trying to design a perfect hash function starts to get weird? At the very least, the state you need to keep will be proportional to the longest message you want to hash.

garaetjjte · on Jan 10, 2022

>This is what makes "length extension attacks" possible on these hashes, which is why we need constructions like HMAC

Or just truncate the hash so you don't get full state, like SHA-512/256.

oconnor663 · on Jan 10, 2022

Yes, SHA-512/256 is often a good option, despite its tragically confusing name. But (maybe going off on my own tangent here) it's got a few important shortcomings:

1. It's less widely supported than SHA-256 or SHA-512. For example, it's present in OpenSSL but not in Python's hashlib. A reasonable workaround here is to just truncate SHA-512 yourself, which gives you a functionally similar but not-output-compatible hash. But no one loves monkeying around with standards like this.

2. SHA-512 doesn't have any hardware support that I'm aware of. SHA-256 has dedicated hardware acceleration on lots of ARM chips and recent x86 chips, which is very nice. But SHA-256 doesn't have enough margin in the digest size to truncate it. (SHA-224 is a thing, but it dumps more collision resistance than we're really comfortable with, in exchange for being only slightly resistant to length extensions.)

3. If your goal is to construct an XOF, SHA-512/256 is kind of perversely inefficient. You end up throwing away half your output bytes to prevent length extensions, but then running the hash (or at least the compression function) again to get more output bytes. On the output performance side, this is leaving a free factor of two on the table.

didericis · on Jan 9, 2022

This is very helpful, thank you. This whole thread is making me realize I should read up more on hash differences/implementations.

oconnor663 · on Jan 9, 2022

(shameless plug) If you want to start by doing your own implementation of SHA-256, you can take a look at one of my assignments :) https://github.com/oconnor663/applied_crypto_2021_fall/tree/...

mistercow · on Jan 9, 2022

Not so theoretically. Perfect hash functions have exactly that property, although I've never heard of a perfect cryptographic hash function. That concept seems inherently contradictory.

3np · on Jan 9, 2022

Arguably per-definition a hash function takes arbitrary-size input and produces fixed-length output - change either of those and it's no longer a hash function. Don't and the pigeonhole principle guarantees infinite theoretical collisions.

AlexSW · on Jan 9, 2022

The output length/size of a hash function is fixed, whereas it takes an arbitrary-length/size input.

didericis · on Jan 9, 2022

I know it's normally fixed, but I did a quick google and saw a stack overflow answer saying there are some algorithms that allow for variable length outputs: https://crypto.stackexchange.com/a/3564

That doesn’t necessarily mean you can figure out what length output for a given input is needed to make it one to one. Not sure you could avoid collisions even if the length of the output was infinite, but I’m assuming different inputs have to have outputs that diverge at some point.

SAI_Peregrinus · on Jan 9, 2022

The variable output length functions are called eXtensible Output Functions (XOFs). An XOF isn't a cryptographic hash function, though they can be very similar (and can have an identical internal function doing the work).

We use different words for Cryptographic Hashes, Password Hashes, XOFs, MACs, and (non-cryptographic) Hashes because they do different things and have different security properties. Misusing the terminology makes reasoning about what is meant difficult.

dchest · on Jan 9, 2022

They all have a fixed-length internal state and will have internal state collisions regardless of the output size. (Basically, they "consume" input into the state and then "expand" the resulting state into the output.)

Suggested reading: Sponge Functions - https://keccak.team/files/SpongeFunctions.pdf

"Informally speaking, a random oracle* maps a variable-length input message to an infinite output string. It is completely random, i.e., the produced bits are uniformly and independently distributed. The only constraint is that identical input messages produce identical outputs. A hash function produces only a fixed number of output bits, say, n bits. So, a hash function should behave as a random oracle whose output is truncated to n bits."

* https://en.wikipedia.org/wiki/Random_oracle

7steps2much · on Jan 9, 2022

You are correct, at least in theory.

Assume you have two inputs, A and B.

A may hash to: A38uT75kjGz B may hash to: A38uHso629t

So yes, if you were to cut these off after the A38u then you would no longer be able to say for sure if you hashed A or B to arrive at your hash.

Of course in practice this usually isn't a problem as long as you have "a long enough" output.

bitkrieg · on Jan 9, 2022

Your example made me wonder, is there a known instance from a common hash algorithm where the input results in exactly the same string representation of the output hash? Eg. "AE485D" hashes to "AE485D". Is this even mathematically possible?

morelisp · on Jan 9, 2022

The mathematical term for this is a "fixed point", where f(x) == x.

Assuming a perfectly random uniform distribution, the usual desirable property of a cryptographic hash - the probability of a hash function not having a fixed point (that is, hashing at least one x to itself) is (1-1/n)**n, where n is the number of possible outputs. As n approaches infinity - which it does pretty rapidly in this case, since we're talking about 2**32 to 2**512 in practice - this approaches 1/e, or about 37%.

So, not only is it possible, but most "good" hash functions (63% of them) will have them.

kevinventullo · on Jan 9, 2022

Java’s built-in hash function for integers is the identity function.

charcircuit · on Jan 9, 2022

The modulus function has this property.

charcircuit · on Jan 9, 2022

Those are XOFs (extendable output functions), not hash functions.

tux3 · on Jan 9, 2022

It doesn't have to be, in principle. There are hash functions that take fixed size input, and output no smaller (or arbitrarily long) hashes.

Look at the hash construction in stream ciphers, for example. The keystream is very long, but the key is short.

Or look at a perfect hash function, as used for hash tables.

13415 · on Jan 9, 2022

Hashing is also not secure hashing, cryptographic hash functions are a small subset of all hash functions. I wish the author would make that clear. I've seen many abuses of cryptographic hash functions where ordinary (though perhaps specialized) hash functions would be more suitable.

less_less · on Jan 9, 2022

Yes, but! Many libraries have migrated to cryptographic hash functions even for non-cryptographic use cases such as hash tables. They choose a random key. This mitigates problems where the algorithm performs much worse on pathologically bad, perhaps adversarially chosen, data sets.

TillE · on Jan 9, 2022

A library should absolutely not be using a cryptographic hash for something like a hash table unless they have very particular requirements. You don't want to force a significant performance penalty on all your users when there are good general-purpose hashes like XXH3 out there.

barsonme · on Jan 9, 2022

Counterpoint: SipHash.

Also, in general I disagree. For most people, safe defaults like hash tables with safe hashes and CSPRNG for random numbers are fast enough. And they have the important property of keeping people from shootings themselves in the feet.

People who have more stringent perf requirements will know and shouldn’t have a problem choosing a different implementation.

marginalia_nu · on Jan 9, 2022

If you are worried about adversarial data, it's probably better to choose a more suitable data structure instead (or change the way you deal with hash collisions: linear probing is probably not a good idea if the data is sketchy). Hash tables are only performant if the hash function is fast, and cryptographic hash functions are anything but.

tyingq · on Jan 9, 2022

Checksums might be a better example. People often use cryptographic hashes for check summing in non-security related scenarios. And, depends on the implementation, but MD5 or SHA-1 is often the same or better performance than a CRC checksum. That shouldn't be the case, but it often is.

fabian2k · on Jan 9, 2022

If duplicate hashes would cause an issue, you might as well just use a cryptographic hash function. For a non-specialist it's quite hard to evaluate the space of non-cryptographic hash functions. There's so much more documentation and information available for cryptographic hashes compared to everything else.

And you need to be sure that the properties you're trading off are really something you don't need. Using a cryptographic hash is much simpler, has fewer ways to going wrong and usually isn't that slow anyway.

ttyprintk · on Jan 9, 2022

To your first point: it’s not a bulletproof arrangement, either. PbKDF-hmac-sha1 can produce duplicate hashes if the input is larger than the hash block size. The special case is: if the input is bigger than block size, do one round of sha-1 before proceeding.

I agree with your second point, developers reach for asymmetric signatures when hmac would do, and reach for hmac when a long fast non-cryptographic algorithm would do. I _think_ its an over-abundance of caution rather than a misunderstanding of the characteristics of hash functions.

SAI_Peregrinus · on Jan 9, 2022

PbKDF-hmac-sha1 isn't a Cryptographic Hash Function. It's a Password Hashing Function. They have different security properties, and must not be confused. Sadly they're confusingly named.

ttyprintk · on Jan 9, 2022

Yeah, and it’s not pedantic. I hope a blog post nowadays points newcomers to an example decision tree for use-cases. Though we should hold this blog post to its own standards; it’s not intended to be best-practices.

chrismorgan · on Jan 9, 2022

> Hashing is also not secure hashing

And to work with the analogy of the article: grinding a cow into a hamburger is hashing but not secure hashing, because if you have the right tools you can inspect the hamburger and determine some properties about the particular cow that it came from, like if it had mad cow disease. I dunno exactly what a securely hashed burger would be, but I don’t think it’d taste very good.

wildzzz · on Jan 9, 2022

You'd turn the beef into totally unrecognizable meat paste that could come from any number of species. Pulverize the meat, wash it with solvents, and cook until there is nothing left of the original protein structures or flavor (i.e charcoal). Just unique enough in physical composition to say this lump of burnt paste is different from that lump of burnt paste but not enough information to say anything else about it or what it originally may have been.

relaunched · on Jan 9, 2022

You'd be surprised how many people could benefit from this type of information.

The only thing is add is that the definition of encryption is incomplete. It seems to focus on symmetric encryption. Asymmetric encryption doesn't require the initial encryption key to get back to the original message. Rather, it uses a key pair - one public and one private.

makach · on Jan 9, 2022

I have to admit that I was incredibly provoked by the title, then I read the article and thought "this is fine". #clickbaited

mirekrusin · on Jan 9, 2022

I read about hashing, encryption and encoding and somehow at the end I want to be more vegetarian.

thejackgoode · on Jan 9, 2022

Makes sense. Once you’re the vegetarianest, you have nothing to hide anymore

TorKlingberg · on Jan 9, 2022

It's worth noting that the terminology has changed a bit over time. I.e. the old Unix C function 'crypt' actually does hashing.

ttyprintk · on Jan 9, 2022

No, it uses an encryption scheme on the password to derive the field in /etc/passwd. It’s one-way (decrypt is not implemented) but it is encryption based on a rotor-design US Army encryption machine. Edit: Navy, not Army

TacticalCoder · on Jan 9, 2022

It would be nice if TFA added at least once sentence about symmetric encryption vs asymmetric encryption.

bgro · on Jan 9, 2022

I don't really understand the confusion around these terms in terms of day-to-day actual work. Is this really a problem? This seems like a trivial question to me so much so that I would (and have) screwed up this question in an interview which I'll discuss here.

In my pedantic technical opinion (technical as in literal, not technical-interview), these are all subsets of encryption. Encryption to me is anything that scrambles the data to non-literal-plain-text in a way where you need a key to read it. These are just encryption, but the password is always just the word "password", or for a specific example, the source text.

In my continued opinion, can't hashes be "found out" in theory if you had unlimited computing time + unlimited attempts at brute force hashing every string?

Encoding is just encrypting the text into a non-literal-plain-text format by using (an extremely weak, known password) to translate into another (computer-readable) language. I don't really have anything to add from the source to this one.

Why is my distinction about the definition of encryption important to me?

In my opinion, we shouldn't limit our mind to ONLY knowing encryption as a method containing some math formula someone came up with to scramble you data based on an input password. There are a magnitude of ways to encrypt your actions in a more broad sense.

For example, what if you identify yourself by handing in a series of paintings to somebody (an authenticator) who physically determines your entry? He can determine if you pass by having knowledge that the order of the paintings and the artist's initials correspond to their position in the alphabet to decrypt your ID number. (Some other tricks could be used to prevent random turn in or duplicates, such as only using a specific style of art, but I'm skipping that for this example.) Is that not an encryption method that accepts a user input and encrypts it with a black box formula to output some code?

3np · on Jan 9, 2022

In the context of cryptography, encryption has a specific meaning. Your intuitive non-standard definition may be interesting and useful, but its aking to bringing up perceptive hashing like what Apple's been introducing for CPAM on iCloud. At this point it becomes an overloaded term and the meaning depends on context. Words are more useful and efficient if we have a common understanding to stand on.

> I don't really understand the confusion around these terms in terms of day-to-day actual work. Is this really a problem?

It absolutely is. I've seen software that, instead of salt-hashing passwords in the DB, will encrypt them with an global RSA public key (not only less secure and way less efficient, you now also have an effective undocumented max-length of passwords).

Or, way more common, utilizing base64-encoding as "encryption".

If understanding of the differences was more widespread, at least these systems may have been less terrible.

bgro · on Jan 9, 2022

I agree with everything you said.

My point boils down to it seems like there is ambiguity between the technical definition vs the actual practice and security requirements we've currently decided on as acceptable.

Somebody who uses base64 to "encrypt" into a database clearly did their job wrong.

A test question that says something like "True/False, encryption can be used to alter the original string into a different string" is true because it doesn't go into the details about the security that we all (should) know needs to be there. When we ask the question kind of backwards from the ambiguous meaning like this, I think we can get a different definition and end up with silly things that technically meet the definition requirement such as base64.

Anyway, my take doesn't really matter. It's more of venting how I always get stuck on easy questions in software dev interviews and end up losing out to somebody who uses base64 in prod to attempt implementing "encryption" to the database.

ignoramous · on Jan 9, 2022

> Encryption to me is anything that scrambles the data to non-literal-plain-text in a way where you need a key to read it.

You can't re-read what's hashed, though.

pgCKIN · on Jan 9, 2022

Funnily enough, hambuger in french is "steak haché" (hashed steak).

denton-scratch · on Jan 9, 2022

Not exactly; steak haché is minced (or finely-chopped) steak, as in Steak Tartare for example. You might use steak haché to make a hamburger.

OJFord · on Jan 9, 2022

On a menu, it's a burger. Most likely not in a bun/McDonalds-style though (l'hamburger would be).

denton-scratch · on Jan 9, 2022

OK, I stand corrected.

ravenstine · on Jan 9, 2022

Referring to hashing as encryption is like calling a fingerprint a lock.

throwaway984393 · on Jan 9, 2022

Hashing is basically very, very lossy compression, while encryption is more like a jigsaw puzzle with a billion billion pieces and you hide the map of the original picture.

gitgud · on Jan 9, 2022

> "Even if you know the algorithm and any secret keys involved, there is no way to un-hash a string. It’s an entirely destructive operation."

Hashing is similar to compressing a massive photo into a small thumbnail, it makes it easier and quicker to browse through photos, but you cannot recover the detailed resolution from the thumbnail.

ericalexander0 · on Jan 9, 2022

>no way to un-hash

Wrong. Possible with a rainbow table.

tialaramex · on Jan 9, 2022

The Rainbow Table is merely a further refinement of a neat trick to improve the performance of a trivial time/space trade on an attack. Ultimately what the Rainbow Table is doing is exactly equivalent to remembering all the inputs you tried and what they hashed to, except it uses less memory/disk than the naive approach and a bit more CPU.

Knowing this you can see that it is only practical as an attack if the set of inputs you want to try is so small that you can realistically try all of them and keep the results somewhere.

Rainbow Tables got famous because Microsoft's incredibly bad LANMAN hashing scheme only has small inputs (7 bytes, the algorithm runs twice on passwords up to 14 bytes), so you actually can try literally all of them, but at the time a terabyte hard disk was very expensive, Rainbow Tables meant you needed much less disk space to store the resulting data and attack this lousy scheme (but somewhat more CPU to calculate the table).

charcircuit · on Jan 9, 2022

No, you are misunderstanding. There is no inverse function to a hashing algorithm. There is an infinite number of possible inputs for any given hash.

charcircuit · on Jan 10, 2022

algorithm -> function

0xdeadb00f · on Jan 10, 2022

A rainbow table must have the plaintext/hash tuple inside of it. This does not reverse the hash, but confirms that some plaintext hashes to some output.

yonixw · on Jan 9, 2022

If we think about it in good faith, there are some cases where hash is used for PROTECTION. Like in a password storing of hash(salt+password). So while technically correct, in the day to day language we do use hash as an encryption alternative sometimes.

less_less · on Jan 9, 2022

The article isn't just technically correct. Hashes and encryption are both parts of cryptographic systems, but their use cases are different. For example, implementing a password database as encrypt(salt, password) would be no good at all: the salt is also stored in the database, so an attacker who steals that database could easily recover everyone's plaintext passwords.

Password databases should use a specialized, intentionally slow hash function, with a salt and preferably also with a secret key. That is, they should use something like argon2(secret, salt, password) or HMAC(secret,argon2(salt,password)) -- the latter so that you can keep the HMAC secret on an HSM.

This is a common interview question for a reason. A candidate who thinks hashes and encryption are typically alternatives isn't ready to do secure system design.

yonixw · on Jan 9, 2022

I agree, just pointed out where the confusion might come from..

mypastself · on Jan 9, 2022

Yeah, the definitions and analogies in the linked article are correct, but I doubt they’d be entirely clear to the uninitiated since they explain the processes but not applications for all three concepts. We get the “how”s but not all of the “why”s.

I suspect it’s partly for the reason you’ve mentioned: the line between uses can sometimes get blurry. It’s a decent article that could do with some additional clarifications.

KSPAtlas · on Jan 9, 2022

An encrypted password is worse than a hashed one, since you only need it to be one way

feldrim · on Jan 9, 2022

Calling it as an alternative might be an exaggeration. Both are used in security or as a protection like you mentioned. Yet, encryption is a way of storing data for a future[0] use in a secure manner while using the hash, you destroy the data and keep only the footprint.

[0] Future can be nanoseconds later in any data-in-transit unlike data-at-rest, so I picked future to address the ambiguity of time measurement in different contexta.