
If a string like "&amp;" is supplied to ftfy, it will probably guess or detect that this is HTML-encoded. Therefore, ftfy detects encodings. If the guess is wrong, the data has now been corrupted.

Your protestations that corruption will not occur in “serious” data (what is that, anyway?), and your blaming of “bizarre emoticons”, are exactly symptomatic of what I’m talking about – you are blaming every wrong guess this library makes on uses of non-ASCII. This is being an ASCII neanderthal. The problems of ignoring encodings are real, and should not be blamed on users of “bizarre emoticons and stuff”.



fix_entities is a parameter. It's not guessing, you told it.

You could vaguely criticize the fix_entities='auto' setting as a guess, except it's a guess that's only wrong if you manage to provide it an HTML document with zero tags in it.
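For concreteness, here is roughly how the three settings of that parameter behave (a sketch against the fix_text signature being discussed here, not the full documentation):

    import ftfy

    # Explicit opt-in: the entity is decoded.
    ftfy.fix_text('&amp;', fix_entities=True)    # -> '&'

    # Explicit opt-out: the entity is left alone.
    ftfy.fix_text('&amp;', fix_entities=False)   # -> '&amp;'

    # The 'auto' default only decodes entities when the text contains
    # no HTML tags, so a real HTML document keeps its escaping.
    ftfy.fix_text('&amp;')                       # -> '&'
    ftfy.fix_text('<b>&amp;</b>')                # -> '<b>&amp;</b>'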

An example of a false positive is "├┤a┼┐a┼┐a┼┐a┼┐a". That is what I mean by non-serious text. False positives will always exist, and you should appreciate that I'm testing on millions of examples to find out what they are. Your suggested machine learning approach would never get to 99.999984% precision.

Using non-ASCII is totally fine, and this library would have no purpose in an all-ASCII world.


So in other words, this library is only for “serious” text. You know, I’ve been accused of hating fun, but I think you beat me, having written a library which is incompatible with it.


I couldn't accuse you of hating fun, but I could accuse you of hating thorough documentation. Would you prefer I hadn't told you about the one case in six million where the library fails? And that failure is really an amazing coincidence when you look into it; change any one of those line-drawing characters and it'll be fine.

Keep in mind that this is a library that finds emoji that has been damaged by "serious" software written by, let's call them "Basic Multilingual Plane neanderthals", and puts it back. There's your fun.


You brought up the fact that the library was only for (your word) “serious” text. You chose the specific example of “non-serious” text. You can hardly fault me, then, for accusing the library of being incompatible with non-seriousness.


You're being unnecessarily argumentative. The web is filled with badly-encoded/re-encoded/mixed-encoded text. One of the worst offenders is Microsoft Outlook, which by default sends email in the local 8-bit Windows codepage instead of UTF-8. Pass that through several mail gateways, display those messages on some PHP forum, and you get yourself a mess that no single Python codec can decode successfully. chardet is useless in that case - there is no 'valid' encoding per se.

This library is taking the only possible approach, which is to segment the text and try to convert each segment into its most probable unicode representation. It seems to cover a larger number of encoding mixups compared to other libraries, that's great!
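To make the problem concrete, here is a minimal sketch (plain Python, made-up example string) of a single round of that kind of mangling, and why only pattern-level guessing can reverse it:

    # One round of the classic mix-up: UTF-8 bytes mis-read as Windows-1252.
    original = 'Zürich'
    mangled = original.encode('utf-8').decode('windows-1252')
    print(mangled)     # ZÃ¼rich

    # No single codec maps 'ZÃ¼rich' straight back to 'Zürich'; you have to
    # recognise the pattern and undo the specific encode/decode pair.
    recovered = mangled.encode('windows-1252').decode('utf-8')
    print(recovered)   # Zürich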

Comment to rspeer: in fix_text_segment(), I would limit the number of recursive passes on the text to 5-10. Right now it's using 'while True', which might take a very long time to converge on corrupted/binary data.
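Something like this hypothetical loop structure is what I mean (not ftfy's actual code; one_pass_fix is a stand-in for a single round of fixes):

    import html

    MAX_PASSES = 5

    def one_pass_fix(text):
        # Stand-in for one round of fixes; here it just unescapes HTML
        # entities once, purely for illustration.
        return html.unescape(text)

    def fix_segment_capped(text):
        # Stop at a fixed point OR after MAX_PASSES, instead of 'while True'.
        for _ in range(MAX_PASSES):
            fixed = one_pass_fix(text)
            if fixed == text:   # converged; another pass would change nothing
                return fixed
            text = fixed
        return text             # cap reached on pathological input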


So, actually, it seemed to make sense at first to limit it to, say, 2 or 3 passes. But then I read about Spotify's username exploit [1]. That made it pretty clear to me that any Unicode-transforming function should be idempotent whenever possible, so that you never end up with inconsistent answers about whether strings are equivalent.

I have also seen text that was encoded six times in UTF-8 (and decoded five times in Windows-1252). ftfy had to leave that one as is, though; it couldn't decode it successfully because it was truncated.
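For the curious, layered mojibake like that is easy to reconstruct (a toy example, not the actual data I saw), and it also shows why the fixing loop wants to run until the text stops changing:

    # Each round re-encodes the text as UTF-8 and mis-decodes it as Windows-1252.
    text = 'López'
    for _ in range(2):
        text = text.encode('utf-8').decode('windows-1252')
    print(text)        # LÃƒÂ³pez  (two layers deep)

    # Unwinding means reversing the mix-up repeatedly until nothing changes.
    while True:
        try:
            undone = text.encode('windows-1252').decode('utf-8')
        except UnicodeError:
            break              # no further layer to peel off
        if undone == text:
            break              # fixed point: already clean
        text = undone
    print(text)        # López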

[1] http://labs.spotify.com/2013/06/18/creative-usernames/


> This library is taking the only possible approach

It is taking the only possible approach if we assume that it must use one and only one algorithm for all uses. Otherwise, it seems to me that a lot of careful tuning and configuration would be needed in order for this library to make the best guesses it possibly can make for a specific user’s situation and data.

> might take a very long time to converge

A limit there might be appropriate – otherwise there might exist a “billion laughs” style attack.


> you told it.

Another commenter claims that this worked:

> ftfy.fix_text('L&oacute;pez')

There is no “fix_entities = True” there. Indeed, if the library required such a parameter, what would be the point? If you already know the encoding, the library has no reason to exist. Therefore, the whole point of the library is to guess the encoding.

> you should appreciate that I'm testing on millions of examples

Examples taken, I would assume, either from your personal use cases, the use cases of your customers, or some sort of general grab-bag of mis-encoded text you could find. I would assume that this one-size-fits-all ad-hoc rule set would be wrong for many users in their specialized use cases, and would bite them when they least expect it.


Examples are taken from Twitter's live stream.

You're not even reading the documentation, you're just searching for reasons to call me an "ASCII neanderthal" over a library that an ASCII neanderthal would have no use for.

And I fail to see how the default settings being able to fix 'L&oacute;pez' is anything but a resounding success.


Would that not bias your rules towards decoding only the errors which are made by all existing Twitter clients (of which I understand there are relatively few)?

Anyway, now we’re just arguing in circles. I said that the library would guess that “&amp;” was HTML-encoded. You said “fix_entities is a parameter. It's not guessing, you told it.” I said that the example given has no such parameter. You then turned around and said that it was a success that it guessed correctly, but my point was that it was, indeed, guessing, and might therefore guess wrong.

I don’t really want to call you an ASCII neanderthal, and I’m sorry I did; it’s just that your library saves the actual ASCII neanderthals from having to bother with evolving. This is my main complaint about this library. It will probably be used by them, reflexively, to decode everything, even when the encoding is known, and will therefore introduce (admittedly relatively small amounts of) data corruption (and these things have a tendency to crop up when you least want them). Whereas if your library were not used, users would have to think about which decoding to use, use it, and not introduce data corruption.

I also have some misgivings about a one-size-fits-all approach to guessing encodings – I suspect it would never really work in quite the painless way most of your users imagine. To address this, I advocated a user-customizable training approach, which would, for each user, be the best possible one for their use case. It would also have the beneficial side effect of forcing users to actually think about their data, what encodings it is likely to have, and in what circumstances, thus making them evolve into Homo Unicodus. ☺ Of course, I could be wrong about this, but my principal worry about this library, as stated above, remains.



