A very common problem in web scraping: a forum site, for example, might contain a mix of MacRoman, Windows code pages and various European code pages in a single page (yes, even in 2014!). Seems like a more advanced version of the UnicodeDammit module ( http://www.crummy.com/software/BeautifulSoup/bs4/doc/#unicod... ).
Note: this module, like UnicodeDammit, is very US/English-centric, and is practically useless for worldwide web scraping. For non-English pages, it is necessary to statistically estimate the code page and language of each page segment, and then try to normalize each segment to Unicode.
Oh cool. I've used BeautifulSoup and UnicodeDammit, but I didn't know the "detwingle" function was in there. I'll take a look at whether there's anything I can learn from its heuristics.
You should perhaps take a look at the "sloppy-windows-1252" codec in ftfy, and it may help "detwingle" handle some messier cases. (For example, Python will say 0x81 isn't a valid byte in Windows-1252. It's technically right. But there it is anyway.)
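For anyone curious, a tiny sketch of the difference (this assumes that importing ftfy.bad_codecs is enough to register the sloppy codecs, as the ftfy docs describe):

    import ftfy.bad_codecs  # assumed to register 'sloppy-windows-1252' and friends on import

    raw = b'\x93quoted\x94 text with a stray \x81 byte'

    # Python's strict windows-1252 codec rejects 0x81, which is undefined there:
    try:
        raw.decode('windows-1252')
    except UnicodeDecodeError as err:
        print('strict windows-1252 refused: {}'.format(err))

    # The sloppy variant decodes the undefined bytes as the corresponding
    # C1 control characters instead of failing, so the rest of the text survives:
    print(raw.decode('sloppy-windows-1252'))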
feedparser and the often-paired-with chardet library also approach this problem. The feedparser library has a series of fallbacks it uses to attempt to figure out the character encoding of a feed (finally bailing out and assuming windows-1252 if nothing else works). The chardet library is also quite good at guessing the intended encoding of a chunk of text, and will report its best guesses and confidence in them.
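For reference, chardet's whole interface is a single detect() call that returns a best guess plus a confidence score (the sample and the shown output are only illustrative; results vary by chardet version):

    import chardet

    sample = u'Příliš žluťoučký kůň úpěl ďábelské ódy'.encode('windows-1250')
    print(chardet.detect(sample))
    # e.g. {'encoding': 'ISO-8859-2', 'confidence': 0.85}
    # For single-byte encodings the guess is often plausible-looking but wrong,
    # as the next comment points out.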
When you give chardet text in a 1-byte encoding, it sometimes ends up >99% confident that it's in ISO-8859-2.
Empirically, it's not in ISO-8859-2.
I think the problem here is that chardet is built on the assumption that "encoding detection is language detection" (from its docs). This assumption is necessary, and basically correct, when distinguishing Japanese encodings from Chinese encodings. It's even pretty much taken as a given that you can't have Japanese and Chinese text in the same document without contortions that most developers are unwilling to go through.
But European languages and encodings are much more intermixed than that. One document may contain multiple European languages, and these languages may be written outside of their traditional encoding.
I wouldn't know how to fix the European languages without damaging chardet's clear success at distinguishing East Asian encodings.
The way I understand it from their examples, it's Latin-script-centric rather than English-centric, no? Could you give an example where it fails for a language written in Latin script? If not, then I'd hardly call it English-centric and practically useless worldwide.
Luminoso cofounder here (we make ftfy, among other things). Our use case is fairly specific: a customer uploads text documents, often as a spreadsheet originally exported from someone else's tool, but doesn't think hard about the encodings involved, so in order to serve them well we have to fix whatever happened. Accordingly, we put the most effort into solving problems that happen for our English-centric, US-centric customer base; they're the most important problems for us (though Russian has also gotten some love, as you can see from the commit history). On top of that, making Asian language exports from other tools work correctly usually requires enough encoding-awareness to mitigate a bunch of the problem[1], so we see those problems less frequently.
That said, if you have examples where ftfy fails in any language, please submit them! We want this tool to work well, because anything that we can't fix will cause us to have egg on our faces with a customer someday...
[1] No peeking: how many format options in Excel's "Save As" dialog, excluding Excel formats, produce a document from which Unicode can be recreated accurately?
from __init__.guess_bytes(): "This is not a magic bullet. If the bytes are coming from some MySQL database with the "character set" set to ISO Elbonian, this won't figure it out. Perhaps more relevantly, this currently doesn't try East Asian encodings."
The world is a very large place; there are many code pages in use besides Latin-1 and "ISO Elbonian". All Central European countries use Latin-2 (Windows-1250) or a Cyrillic code page (Windows-1251). Since they are all single-byte code pages, they cannot be detected with a try: decode() / except: try_another_codepage() fallback and must be distinguished statistically. Detecting LTR/RTL languages and Asian encodings is even harder.
https://en.wikipedia.org/wiki/Code_page
https://en.wikipedia.org/wiki/Windows_code_pages
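To make that concrete, here's a stdlib-only illustration (the Polish sample word is arbitrary): trying codecs in sequence only ever rejects multi-byte candidates, because any byte sequence "decodes" in any single-byte code page, just into the wrong characters.

    raw = u'żółć'.encode('windows-1250')   # Central European, single-byte

    for candidate in ('utf-8', 'windows-1250', 'latin-1', 'windows-1251'):
        try:
            print('{:>12} -> {}'.format(candidate, raw.decode(candidate)))
        except UnicodeDecodeError:
            print('{:>12} -> decode error'.format(candidate))
    # Only utf-8 actually fails; every single-byte code page "succeeds",
    # so the right one has to be chosen statistically, not by elimination.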
I would like to emphasize that "guess_bytes" is not what ftfy is about. It's there for convenience, and because the command-line version can't work without it. But the main "fix_text" function isn't about guessing the encodings of bytes.
Not all text arrives in the form of unmarked bytes. HTTP gives you bytes marked with an encoding. JSON straight up gives you Unicode. Once you have that Unicode, you might notice problems with it like mojibake, and ftfy is designed for fixing that.
Like you say, encoding detection has to be done statistically. That's a great goal for another project (man, I wish chardet could do it), but once statistics get in there, it would be completely impossible to get a false positive rate as low as ftfy has.
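A concrete example of that distinction, along the lines of the ones in the ftfy documentation (the input is already a Unicode string, just one that was decoded with the wrong code page somewhere upstream):

    import ftfy

    # UTF-8 bytes that were decoded as Windows-1252 before reaching us:
    print(ftfy.fix_text(u'This â€” should be an em dash'))
    # -> This — should be an em dash

    # Text that is already fine passes through untouched:
    print(ftfy.fix_text(u'déjà vu'))
    # -> déjà vu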
Is that really necessary? Web browsers do no statistical analysis, and yet users seem to have no problem with them. Okay, they do have menus that let you override the detected encoding, but those are used relatively infrequently. The most common issue I've seen with web scrapers is that they don't detect the character encoding the same way web browsers do, which is what web authors expect.
Had a dataset where this was the case... the old devs used Windows, and I'm not sure what the DB encoding was set to when they did the imports, etc. I've been putting off fixing it because it's just a PITA to deal with.
But I built a sanitizer in a couple hours with this lib, and it seems to work pretty well.
The only unexpected thing is that it converts the ordinal indicator º to o in addresses. Luckily there are only a handful I need to fix.
This is an effect of the default Unicode normalization, which is NFKC. That normalization is lossy for things like the ordinal indicator and the trademark symbol; if you'd like to keep the ordinal indicator unchanged, use NFC normalization:
>>> print ftfy.fix_text(u'ordinal indicator º to o in addresses.')
ordinal indicator o to o in addresses.
>>> print ftfy.fix_text(u'ordinal indicator º to o in addresses.', normalization='NFC')
ordinal indicator º to o in addresses.
I can’t help but think that this merely gives people the excuse they need for not understanding this “Things-that-are-not-ASCII” problem. Using this library is a desperate attempt to have a just-fix-it function, but it can never cover all cases, and will inevitably corrupt data. To use this library is to remain an ASCII neanderthal, ignorant of the modern world and of the differences between text, bytes and encodings.
Let me explain in some detail why this library is not a good thing:
In an ideal world, you would know what encoding bytes are in and could therefore decode them explicitly using the known correct encoding, and this library would be redundant.
If instead, as is often the case in the real world, the encoding is unknown, there exists the question of how to resolve the numerous ambiguities which result. A library such as this has to guess which encoding to use in each specific instance, and the choices it ideally should make are extremely dependent on the circumstances and even the immediate context. As it is, the library is hard-coded with some specific algorithms to choose some encodings over others, and if those assumptions do not match your use case exactly, the library will corrupt your data.
A much better solution would perhaps involve machine learning, with the library trained to deduce the probable encodings from a large set of example data from each user’s individual use case. Even that will occasionally be wrong, but at least it would be the best we could do with unknown encodings without resorting to manual processing.
However, a one-size-fits-all “solution” such as this is merely giving people a further excuse to keep not caring about encodings, to pretend that encodings can be “detected”, and that there exists such a thing as “plain text”.
I think you're criticizing the wrong library. ftfy isn't about encoding detection. Should I take guess_bytes out of the documentation to stop giving that impression?
It's the library you use when the data you get has already been decoded incorrectly. The user of ftfy cares about encodings, but gets data from sources that don't.
And in no practical sense does it corrupt your data. I don't know where you got that idea from. It leaves good data alone.
I will not say that false positives are nonexistent, but they are vanishingly rare -- see http://ftfy.readthedocs.org/en/latest/#accuracy -- and they don't occur in "serious" data, they occur when people are screwing around with bizarre emoticons and stuff.
Well, teddyh might have a point here, nonetheless: By now, I understand that ftfy is about fixing mixed up encodings between UTF8, latin-1, CP437, CP125[12] and MacRoman (only). But by claiming you are fixing "Unicode" in general as the first thing on the GitHub page, you might be misleading first-time visitors. Maybe you should try to place the "warning" about the encodings your library does handle right at the start somewhere? And make it clear that "moji-un-baking" is the library's central and main use-case, not just an "interesting thing" it can do. Despite being quite aware of Unicode and string encoding, I had exactly the same thoughts as teddyh as I read the first few paragraphs ("Oh, now we will see those encoding illiterates converting all those beautiful bytes in some highly informative character encoding to all-too-boring-ASCII.")
Which leads me to my other concern: why do you use NFKC compatibility as the default normalization? Given that you are a text mining company, you of all people should know you lose valuable information - particularly about numbers and super- and subscript characters - with this normalization strategy. Doing NFKC on things like articles, books, patents, etc. would lead to potentially disastrous results (e.g., NFKC "decomposes" the string 'O\u2082\u00B9' to 'O21' instead of 'O_2^1' - "oxygen, reference 1"). In general, I think NFC is what Python and many other libraries do, while NFKC should only be used when you know what you are doing (and why you need it). Maybe it is useful for some strange, geeky tweets, but I would argue that that's the corner case, not the default.
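The superscript/subscript point is easy to reproduce with nothing but the standard library:

    import unicodedata

    s = u'O\u2082\u00b9'    # 'O' + subscript two + superscript one, i.e. O₂¹
    print(unicodedata.normalize('NFKC', s))   # O21  -- the sub/superscript distinction is gone
    print(unicodedata.normalize('NFC', s))    # O₂¹ -- unchanged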
I wonder if I could change the default to NFC in the next version without breaking people's expectations. It is a safer default.
When it comes to text analytics, the underlying tagger and stuff won't know what O21 is any more than it knows what O_2^1 is anyway. And NFKC is useful for mixed Latin and Japanese text, which I wouldn't entirely dismiss as strange and geeky. But it's true that the default could be more conservative.
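For the mixed Latin-and-Japanese case, NFKC is what folds fullwidth compatibility characters (common in Japanese text) back to their ASCII forms, e.g.:

    import unicodedata

    fullwidth = u'ＦＴＦＹ　２０１４'   # fullwidth Latin letters and digits, ideographic space
    print(unicodedata.normalize('NFKC', fullwidth))   # 'FTFY 2014'
    print(unicodedata.normalize('NFC', fullwidth))    # unchanged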
> It's the library you use when the data you get has already been decoded incorrectly.
Wait, does this not make the usability of this library very limited? How would I know that something has been decoded incorrectly? If I’m already handling this manually, what is the point of having a library?
Vuze, a very commonly used torrent client, has to take encoded torrent titles and decode them. Sometimes the decoding fails because of corruption or a missing encoding declaration.
Their approach is then to attempt each encoding they support, ignoring errors, and show the results to the user so the user can decide (detect) which decoding worked and which didn't.
Can ftfy replace this functionality? Is it doing the very encoding detection which is currently done by humans?
That sounds like a use case where the manual intervention is pretty important. That's encoding detection, where sometimes you have unmarked bytes. I would not, in fact, recommend ftfy there, and the README warns you against using it in that case.
There's some auto-detection you can do -- for example, you can distinguish UTF-8 from byte-order-marked UTF-16 with 100% accuracy, by design. You could also try chardet if you're okay with some amount of errors. Maybe show the detected encoding first.
Cases where ftfy is useful:
* Web scraping -- sometimes you get data that decodes in the encoding it claims to be in, but isn't quite right
* Handling data that has, at one point, been imported and exported in Microsoft Office, without every user consistently picking exactly the right format from like 20 inaccurately-named options
* Handling data that was stored by half-assed services written in, say, PHP, and not tested outside of ASCII
* Reading CESU-8 (the non-standard encoding that Java and MySQL call "UTF-8", for backward compatibility) in Python without breaking it even more. (This isn't automatic.)
* Handling data that's combined from multiple sources, in mixed encodings
* All the other situations in which mojibake arises in mostly-readable text, which there seem to be no end of.
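(Coming back to the Vuze question above: a minimal sketch of that kind of candidate-based detection, combining the BOM check with a chardet guess. The function name and candidate list are just illustrative.)

    import codecs
    import chardet

    def candidate_decodings(data):
        """Offer decodings of unmarked bytes, best guesses first, for a human to pick from."""
        # A UTF-16 byte order mark is unambiguous by design, so check it first.
        if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
            return [('utf-16', data.decode('utf-16'))]
        candidates = []
        guess = chardet.detect(data).get('encoding')
        for enc in [guess, 'utf-8', 'windows-1252']:
            if not enc:
                continue
            try:
                candidates.append((enc, data.decode(enc)))
            except (UnicodeDecodeError, LookupError):
                pass   # this candidate doesn't decode at all; skip it
        return candidates

    for enc, text in candidate_decodings(u'żółta łódź podwodna'.encode('windows-1250')):
        print(u'{:>12}: {}'.format(enc, text))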
1. Due to its simplicity for a large group of naïve users, the library will likely be prone to over- and misuse. Since the library uses guessing as its method of decoding, and by definition a guess may be wrong, this will lead to some unnecessary data corruption in situations where use of this library (and the resulting data corruption) was not actually needed.
2. The library uses a one-size-fits-all model in the area of guessing encodings and language. This has historically proven to be less than a good idea, since different users in different situations use different data and encodings, and your library’s algorithm will not fit all situations equally well. I suggested that a more tunable and customizable approach would indeed be the best one could do in the cases where the encoding is actually not known. (This minor complexity in use of the library would also have the benefit of discouraging overuse in unwarranted situations, thus also resolving the first point, above.)
You have, as far as I can see, not yet responded substantively to the first point, and for the second point you have only asserted that your user-uncustomizable algorithm is superior to any other possible automatically derived algorithm.
Yet, I’m the one who deserves a Plonk? I think not.
If a string like "&amp;" is supplied to ftfy, it will probably guess or detect that this is HTML-encoded. Therefore, ftfy detects encodings. If the guess is wrong, the data has now been corrupted.
Your protestations about how corruptions will not occur in “serious” data (what is that, anyway?), and blaming “bizarre emoticons”, are exactly symptomatic of what I’m talking about – you are blaming every wrong guess this library makes on uses of non-ASCII. This is being an ASCII neanderthal. The problems of ignoring encodings are real, and should not be blamed on users of “bizarre emoticons and stuff”.
fix_entities is a parameter. It's not guessing, you told it.
You could vaguely criticize the fix_entities='auto' setting as a guess, except it's a guess that's only wrong if you manage to provide it an HTML document with zero tags in it.
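Roughly how that behaves, going by the ftfy docs (exact behavior may vary by version):

    import ftfy

    print(ftfy.fix_text(u'&lt;3'))          # no tags anywhere, so the entity is unescaped: <3
    print(ftfy.fix_text(u'<b>&lt;3</b>'))   # looks like real HTML, so it is left alone: <b>&lt;3</b>
    print(ftfy.fix_text(u'&lt;3', fix_entities=False))   # explicitly disabled: &lt;3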
An example of a false positive is "├┤a┼┐a┼┐a┼┐a┼┐a". That is what I mean by non-serious text. False positives will always exist, and you should appreciate that I'm testing on millions of examples to find out what they are. Your suggested machine learning approach would never get to 99.999984% precision.
Using non-ASCII is totally fine, and this library would have no purpose in an all-ASCII world.
So in other words, this library is only for “serious” text. You know, I’ve been accused of hating fun, but I think you beat me, having written a library which is incompatible with it.
I couldn't accuse you of hating fun, but I could accuse you of hating thorough documentation. Would you prefer I hadn't told you about the one case in six million where the library fails? And that failure is really an amazing coincidence when you look into it; change any one of those line-drawing characters and it'll be fine.
Keep in mind that this is a library that finds emoji that has been damaged by "serious" software written by, let's call them "Basic Multilingual Plane neanderthals", and puts it back. There's your fun.
You brought up the fact that the library was only for (your word) “serious” text. You chose the specific example of “non-serious” text. You can hardly fault me, then, for accusing the library of being incompatible with non-seriousness.
You're being unnecessarily argumentative. The web is filled with badly-encoded/re-encoded/mixed-encoded text. One of the worst offenders is Microsoft Outlook, which by default sends emails using the local 8-bit Windows code page instead of UTF-8. Pass that through several mail gateways, display those messages on some PHP forum, and you get yourself a bad mess that cannot be decoded successfully by any Python codec. chardet is useless in that case - there is no 'valid' encoding per se.
This library is taking the only possible approach, which is to segment the text and try to convert each segment into its most probable Unicode representation. It seems to cover a larger number of encoding mix-ups than other libraries, which is great!
Comment to rspeer: in fix_text_segment(), I would limit the number of recursive passes on the text to 5-10. Right now it's using 'while True', which might take a very long time to converge on corrupted/binary data.
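Something like this, presumably; a hypothetical wrapper around the public fix_text just to sketch the bounded-convergence idea (the actual suggestion concerns the internal fix_text_segment loop):

    import ftfy

    def fix_text_bounded(text, max_passes=5):
        """Re-run ftfy until the text stops changing, but give up after max_passes
        so corrupted or binary-ish input can't keep the loop busy for long."""
        for _ in range(max_passes):
            fixed = ftfy.fix_text(text)
            if fixed == text:
                break
            text = fixed
        return text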
So, actually, it seemed to make sense at first to limit it to, say, 2 or 3 passes. But then I read about Spotify's username exploit [1]. That made it pretty clear to me that any Unicode-transforming function should be idempotent whenever possible, so that you never end up with inconsistent answers about whether strings are equivalent.
I have also seen text that was encoded six times in UTF-8 (and decoded five times in Windows-1252). Although ftfy had to leave that one as-is; it couldn't be successfully decoded because it was truncated.
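For the curious, here's what that kind of repeated mis-decoding looks like with nothing but the standard library (each pass re-encodes the text as UTF-8 and mis-decodes it as Windows-1252 again):

    text = u'é'
    for _ in range(3):
        text = text.encode('utf-8').decode('windows-1252')
        print(text)
    # Ã©
    # ÃƒÂ©
    # ÃƒÆ’Ã‚Â©
    # Each pass doubles the length of the damage.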
> This library is taking the only possible approach
It is taking the only possible approach if we assume that it must use one and only one algorithm for all uses. Otherwise, it seems to me that a lot of careful tuning and configuration would be needed in order for this library to make the best guesses it possibly can make for a specific user’s situation and data.
> might take a very long time to converge
A limit there might be appropriate – otherwise there might exist a “billion laughs” style attack.
There is no “fix_entities = True” there. Indeed, if the library required such a parameter, what would be the point? If you already know the encoding, the library has no reason to exist. Therefore, the whole point of the library is to guess the encoding.
> you should appreciate that I'm testing on millions of examples
Examples taken, I would assume, either from your personal use cases, the use cases of your customers, or some sort of general grab-bag of mis-encoded text you could find. I would assume that this one-size-fits-all ad-hoc rule set would be wrong for many users in their specialized use cases, and would bite them when they least expect it.
You're not even reading the documentation, you're just searching for reasons to call me an "ASCII neanderthal" over a library that an ASCII neanderthal would have no use for.
And I fail to see how the default settings being able to fix 'López' is anything but a resounding success.
Would that not bias your rules towards decoding only the errors which are made by all existing Twitter clients (of which I understand there are relatively few)?
Anyway, now we’re just arguing in circles. I said that the library would guess that “&” was HTML encoded. You said “fix_entities is a parameter. It's not guessing, you told it.” I said that the example given has no such parameter. You then turned around and said that it was a success that it guessed correctly, but my point was that it was, indeed, guessing, and might, therefore, guess wrong.
I don’t want to call you an ASCII neanderthal, really, and I’m sorry I did; it’s just that your library saves the actual ASCII neanderthals from having to bother with evolving. This is my main complaint about this library. It will probably be used by them, reflexively, to decode everything, even when the encoding is known, and therefore introduce (admittedly relatively small amounts of) data corruption (but these things have a tendency to crop up when you least want them to). Whereas if your library were not used, users would have to think about which decoding to use, use it, and not introduce data corruption.
Also, I have some misgivings about a one-size-fits-all solution for guessing encodings – I suspect it would never really work in quite the painless way most of your users imagine. To solve this, I advocated a user-customizable training approach, which would, for each user, be the best possible one for their use case. It would also have the beneficial side effect of forcing users to actually think about their data and what encodings it was likely to have and in what circumstances, thus making them evolve into Homo Unicodus. ☺ Of course, I could be wrong about this, but my principal worry about this library, as stated above, remains.
I think the least we can give in return to companies that open-source useful stuff like this is look the other way when they self-promote a little bit.
I clicked on the careers link and would seriously consider working for a company with such an enlightened attitude to open source, so chill! (I'm in the wrong country, in fact, but nevertheless...)