Vuze, a very commonly used torrent client, has to take encoded torrent titles and decode them. Sometimes the decoding fails, because of corruption or a missing encoding declaration.
Its approach is then to attempt decoding with each candidate encoding, ignoring errors, and show the results to the user, who decides (detects) which decoding worked and which didn't.
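For reference, a minimal sketch of that kind of approach (the candidate list and function name here are illustrative, not Vuze's actual code):

```python
# Decode the same bytes with every candidate encoding, ignoring errors,
# and let a human pick whichever result looks right.
CANDIDATES = ["utf-8", "cp1252", "shift_jis", "gb18030", "iso-8859-1"]

def candidate_decodings(raw: bytes):
    results = {}
    for encoding in CANDIDATES:
        try:
            results[encoding] = raw.decode(encoding, errors="replace")
        except LookupError:
            # Skip encodings this Python build doesn't provide.
            continue
    return results
```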
Can ftfy replace this functionality? Is it doing the same encoding detection that is currently done by humans?
That sounds like a use case where the manual intervention is pretty important. That's encoding detection, where sometimes you have unmarked bytes. I would not, in fact, recommend ftfy there, and the README warns you against using it in that case.
There's some auto-detection you can do -- for example, you can distinguish UTF-8 from byte-order-marked UTF-16 with 100% accuracy, by design. You could also try chardet if you're okay with some amount of errors. Maybe show the detected encoding first.
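A rough sketch of that kind of auto-detection, assuming chardet is installed (the function name is mine, not part of ftfy or chardet):

```python
import chardet  # third-party: pip install chardet

def guess_encoding(raw: bytes) -> str:
    """Best-effort guess: BOM first, then strict UTF-8, then chardet."""
    # A UTF-16 byte order mark is unambiguous by design.
    if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    # Text that decodes as strict UTF-8 is very unlikely to be anything else.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Fall back to statistical detection; this can be wrong, so show the
    # guess to the user rather than trusting it silently.
    guess = chardet.detect(raw)
    return guess["encoding"] or "latin-1"
```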
Cases where ftfy is useful:
* Web scraping -- sometimes you get data that decodes in the encoding it claims to be in, but isn't quite right
* Handling data that has, at one point, been imported and exported in Microsoft Office, without every user consistently picking exactly the right format from like 20 inaccurately-named options
* Handling data that was stored by half-assed services written in, say, PHP, and not tested outside of ASCII
* Reading CESU-8 (the non-standard encoding that Java and MySQL call "UTF-8", for backward compatibility) in Python without breaking it even more. (This isn't automatic.)
* Handling data that's combined from multiple sources, in mixed encodings
* All the other situations in which mojibake arises in mostly-readable text, of which there seems to be no end.
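To make the distinction concrete: ftfy operates on text that has already been decoded, just decoded wrongly somewhere along the way. A minimal example using its fix_text entry point (the sample string is typical UTF-8-read-as-Windows-1252 mojibake):

```python
import ftfy

# The UTF-8 bytes for "✔" were decoded as Windows-1252 somewhere upstream.
print(ftfy.fix_text("âœ” No problems"))  # -> "✔ No problems"
```

It doesn't take raw, unmarked bytes and guess their encoding; that's the detection problem discussed above.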
1. Because the library is so simple for a large group of naïve users to apply, it will likely be prone to over- and misuse. Since it relies on guessing to repair text, and a guess may by definition be wrong, this will lead to unnecessary data corruption in situations where the library was not needed in the first place.
2. The library takes a one-size-fits-all approach to guessing encodings and languages. That has historically proven to be a poor idea: different users in different situations work with different data and encodings, and no single algorithm fits every situation equally well. I suggested that a more tunable, customizable approach would be the best one can do when the encoding is genuinely unknown. (That minor extra complexity would also discourage use in unwarranted situations, which would address the first point above.)
You have, as far as I can see, not yet responded substantively to the first point, and on the second point you have only asserted that your non-customizable algorithm is superior to any other automatically derived algorithm.
Yet, I’m the one who deserves a Plonk? I think not.
Your questions could be answered, but not by me. Plonk.