From the guess_bytes() docstring in ftfy's __init__: "This is not a magic bullet. If the bytes are coming from some MySQL database with the 'character set' set to ISO Elbonian, this won't figure it out. Perhaps more relevantly, this currently doesn't try East Asian encodings."

The world is a very large place, and there are many codepages in use besides Latin-1 and "ISO Elbonian". Central European countries use Latin-2 (Windows-1250), and Cyrillic-script languages use Windows-1251. Since these are all single-byte codepages, decoding almost never fails, so a try: convert() / except: try_another_codepage() loop cannot tell them apart (see the sketch below); they must be distinguished statistically. Detection for LTR/RTL languages and East Asian encodings is even worse. https://en.wikipedia.org/wiki/Code_page https://en.wikipedia.org/wiki/Windows_code_pages
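
To make that concrete, here is a minimal Python sketch (the Czech sample string is mine, not from the thread): the same bytes decode without an exception under Latin-1, cp1250, and cp1251, so exception handling alone can never pick the right one.

    data = "Pěkný den".encode("cp1250")  # Czech sample text as cp1250 bytes

    # Every byte is assigned a character in each of these single-byte
    # codepages, so no decode ever raises.
    for codec in ("latin-1", "cp1250", "cp1251"):
        try:
            print(codec, "->", data.decode(codec))
        except UnicodeDecodeError:
            print(codec, "-> decode failed")

    # latin-1 -> Pìkný den   (wrong, but no error)
    # cp1250  -> Pěkný den   (correct)
    # cp1251  -> Pмknэ den   (wrong, but no error)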

Another Python Unicode conversion module, which is slightly less US/English-centric: https://github.com/buriy/python-readability



I would like to emphasize that "guess_bytes" is not what ftfy is about. It's there for convenience, and because the command-line version can't work without it. But the main "fix_text" function isn't about guessing the encodings of bytes.
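
For context, guess_bytes takes raw bytes and returns its best guess as a (text, encoding) pair. A minimal sketch (the sample bytes are illustrative, not from the thread):

    import ftfy

    # guess_bytes is the convenience layer: bytes in, a decoded
    # string and the encoding it settled on out.
    text, encoding = ftfy.guess_bytes(b"caf\xc3\xa9")
    print(text)      # café
    print(encoding)  # utf-8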

Not all text arrives in the form of unmarked bytes. HTTP gives you bytes marked with an encoding. JSON straight up gives you Unicode. Once you have that Unicode, you might notice problems with it like mojibake, and ftfy is designed for fixing that.
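
The distinction matters in practice: fix_text operates on str, not bytes. The example below is the well-known one from ftfy's own documentation, where UTF-8 text was mis-decoded as cp1252 somewhere upstream:

    import ftfy

    # The input is already a Unicode string, just a damaged one:
    # U+2714 (✔) whose UTF-8 bytes were read as cp1252.
    print(ftfy.fix_text("âœ” No problems"))
    # ✔ No problems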

Like you say, encoding detection has to be done statistically. That's a great goal for another project (man, I wish chardet could do it), but once statistics get involved, it would be completely impossible to keep the false positive rate as low as ftfy's.
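
For comparison, chardet is the usual statistical detector in Python: it returns a guess plus a confidence score rather than a definitive answer, which is exactly where false positives creep in. A sketch, with a sample string of my own choosing:

    import chardet

    sample = "Привет, мир".encode("cp1251")  # Russian text as cp1251 bytes
    print(chardet.detect(sample))
    # something like: {'encoding': 'windows-1251', 'confidence': 0.87,
    #                  'language': 'Russian'}
    # It's a scored guess; short inputs like this one can easily be
    # misclassified as a related Cyrillic codepage instead.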



