From the guess_bytes() docstring in ftfy's __init__: "This is not a magic bullet. If the bytes are coming from some MySQL database with the 'character set' set to ISO Elbonian, this won't figure it out. Perhaps more relevantly, this currently doesn't try East Asian encodings."

The world is a very large place, and there are many codepages in use besides Latin-1 and "ISO Elbonian". Central European countries use Latin-2 (Windows-1250), and Cyrillic-script languages use Windows-1251. Since these are all single-byte codepages, decoding almost never fails, so a try: convert() / except: try_another_codepage() loop cannot tell them apart (see the sketch below); they must be distinguished statistically. Detection for LTR/RTL languages and East Asian encodings is even worse. https://en.wikipedia.org/wiki/Code_page https://en.wikipedia.org/wiki/Windows_code_pages
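
To make that concrete, here is a minimal Python sketch (the Czech sample string is mine, not from the thread): the same bytes decode without an exception under Latin-1, cp1250, and cp1251, so exception handling alone can never pick the right one.

    data = "Pěkný den".encode("cp1250")  # Czech sample text as cp1250 bytes

    # Every byte is assigned a character in each of these single-byte
    # codepages, so no decode ever raises.
    for codec in ("latin-1", "cp1250", "cp1251"):
        try:
            print(codec, "->", data.decode(codec))
        except UnicodeDecodeError:
            print(codec, "-> decode failed")

    # latin-1 -> Pìkný den   (wrong, but no error)
    # cp1250  -> Pěkný den   (correct)
    # cp1251  -> Pмknэ den   (wrong, but no error)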

Another Python Unicode conversion module, which is slightly less US/English-centric: https://github.com/buriy/python-readability



I would like to emphasize that "guess_bytes" is not what ftfy is about. It's there for convenience, and because the command-line version can't work without it. But the main "fix_text" function isn't about guessing the encodings of bytes.
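
For context, guess_bytes takes raw bytes and returns its best guess as a (text, encoding) pair. A minimal sketch (the sample bytes are illustrative, not from the thread):

    import ftfy

    # guess_bytes is the convenience layer: bytes in, a decoded
    # string and the encoding it settled on out.
    text, encoding = ftfy.guess_bytes(b"caf\xc3\xa9")
    print(text)      # café
    print(encoding)  # utf-8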

Not all text arrives in the form of unmarked bytes. HTTP gives you bytes marked with an encoding. JSON straight up gives you Unicode. Once you have that Unicode, you might notice problems with it like mojibake, and ftfy is designed for fixing that.
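
The distinction matters in practice: fix_text operates on str, not bytes. The example below is the well-known one from ftfy's own documentation, where UTF-8 text was mis-decoded as cp1252 somewhere upstream:

    import ftfy

    # The input is already a Unicode string, just a damaged one:
    # U+2714 (✔) whose UTF-8 bytes were read as cp1252.
    print(ftfy.fix_text("âœ” No problems"))
    # ✔ No problems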

Like you say, encoding detection has to be done statistically. That's a great goal for another project (man, I wish chardet could do it), but once statistics get involved, it would be completely impossible to keep the false positive rate as low as ftfy's.
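
For comparison, chardet is the usual statistical detector in Python: it returns a guess plus a confidence score rather than a definitive answer, which is exactly where false positives creep in. A sketch, with a sample string of my own choosing:

    import chardet

    sample = "Привет, мир".encode("cp1251")  # Russian text as cp1251 bytes
    print(chardet.detect(sample))
    # something like: {'encoding': 'windows-1251', 'confidence': 0.87,
    #                  'language': 'Russian'}
    # It's a scored guess; short inputs like this one can easily be
    # misclassified as a related Cyrillic codepage instead.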



