You're being unnecessarily argumentative. The web is filled with badly-encoded/re-encoded/mixed-encoded text. One of the worst offenders is Microsoft Outlook, which by default sends emails in the local 8-bit Windows codepage instead of UTF-8. Pass that through several mail gateways, display those messages on some PHP forum, and you get yourself a mess that no single Python codec can decode successfully. chardet is useless in that case; there is no one 'valid' encoding per se.
This library is taking the only possible approach, which is to segment the text and try to convert each segment into its most probable Unicode representation. It seems to cover a larger number of encoding mix-ups than other libraries do, which is great!
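For anyone who hasn't watched this happen up close, here's roughly the kind of damage I mean, plus ftfy's documented top-level fix_text() run over it (the specific codec round-trip below is just one plausible path, not a claim about any particular gateway):

    import ftfy

    text = "café"
    for _ in range(2):   # encoded as UTF-8, then misread as Windows-1252, twice over
        text = text.encode("utf-8").decode("windows-1252")

    print(text)                  # 'cafÃƒÂ©' -- no single codec round-trip undoes this
    print(ftfy.fix_text(text))   # should come back as 'café'; ftfy peels the layers off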
Comment to rspeer: in fix_text_segment(), I would limit the number of recursive passes on the text to 5-10. Right now it's using 'while True', which might take a very long time to converge on corrupted/binary data.
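Roughly this shape, I mean (just a sketch against a hypothetical fix_one_pass, not ftfy's real internals):

    # Sketch only: fix_one_pass stands in for whatever a single pass of
    # fix_text_segment currently does; it is not part of ftfy's API.
    def fix_with_cap(text, fix_one_pass, max_passes=10):
        for _ in range(max_passes):
            fixed = fix_one_pass(text)
            if fixed == text:    # fixed point reached; further passes are no-ops
                return fixed
            text = fixed
        return text              # cap hit: give up instead of grinding on junk data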
So, actually, it seemed to make sense at first to limit it to, say, 2 or 3 passes. But then I read about Spotify's username exploit [1]. That made it pretty clear to me that any Unicode-transforming function should be idempotent whenever possible, so that you never end up with inconsistent answers about whether strings are equivalent.
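To make that concrete with a toy example (none of this is ftfy's real code; fix_once just peels one layer of UTF-8-read-as-Latin-1 off a string): running to a fixed point is idempotent by construction, while a hard two-pass cap is not, and that is the same shape of problem as the Spotify exploit.

    def fix_once(text):
        """Peel off one layer of UTF-8 that was misread as Latin-1, if possible."""
        try:
            return text.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            return text          # can't peel another layer; leave the text alone

    def fix_capped(text, max_passes=2):
        """The suggested bounded version: stop after a couple of passes."""
        for _ in range(max_passes):
            fixed = fix_once(text)
            if fixed == text:
                return fixed
            text = fixed
        return text

    def fix_to_fixed_point(text):
        """The current behavior: keep going until nothing changes."""
        while True:
            fixed = fix_once(text)
            if fixed == text:
                return fixed
            text = fixed

    damaged = "é"
    for _ in range(4):           # four layers of the same mis-decoding
        damaged = damaged.encode("utf-8").decode("latin-1")

    once = fix_capped(damaged)   # stops with two layers still left
    twice = fix_capped(once)     # a second call repairs it further
    assert once != twice         # so the capped version is not idempotent

    full = fix_to_fixed_point(damaged)
    assert fix_to_fixed_point(full) == full   # while the fixed-point version is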
I have also seen text that was encoded six times in UTF-8 (and decoded five times in Windows-1252). ftfy had to leave that one as is, though; it couldn't be decoded successfully because it was truncated.
> This library is taking the only possible approach
It is taking the only possible approach if we assume that it must use one and only one algorithm for all uses. Otherwise, it seems to me that a lot of careful tuning and configuration would be needed for this library to make the best possible guesses for a specific user's situation and data.
> might take a very long time to converge
A limit there might be appropriate; otherwise it could invite a "billion laughs"-style attack.
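For scale, though (plain Python, nothing ftfy-specific): each layer of that kind of damage roughly doubles the non-ASCII text, so an input that genuinely needs n passes is already on the order of 2**n characters long.

    # Each round of "encode as UTF-8, misread as Latin-1" doubles a character
    # like 'é', so crafting depth-n mojibake costs about 2**n characters of input.
    text = "é"
    for depth in range(1, 21):
        text = text.encode("utf-8").decode("latin-1")
        print(depth, len(text))   # 2, 4, 8, ..., 1048576 at depth 20

So a small cap is cheap insurance either way; the unbounded loop is mostly a risk if some pass can keep changing the text without ever converging.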