How to convert unicode accented characters to pure ascii without accents?

Question

I'm trying to download some content from a dictionary site (for example http://dictionary.reference.com/browse/apple?s=t). The problem I'm having is that the original paragraph has all those squiggly lines, reversed letters, and such, so when I read the local files I end up with those funny escape characters like \x85, \xa7, \x8d, etc. My question is: is there any way I can convert all those escape characters into their respective characters? For example, if there is a Unicode à, how do I convert that into a standard a?

Accepted Answer

"How do I convert all those escape characters into their respective characters? Like, if there is a Unicode à, how do I convert that into a standard a?"

Assume you have loaded your Unicode text into a variable called my_unicode. Normalizing à into a is this simple:

    import unicodedata
    output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')

Explicit example:

    >>> myfoo = u'àà'
    >>> myfoo
    u'\xe0\xe0'
    >>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
    'aa'
    >>>

How it works

unicodedata.normalize('NFD', "insert-unicode-text-here") performs a Canonical Decomposition (NFD) of the Unicode text, which splits each accented character into its base letter followed by its combining marks (à becomes a plus U+0300, the combining grave accent). Then we use str.encode('ascii', 'ignore') to transform the NFD-mapped characters into ASCII. The 'ignore' error handler silently drops every character that cannot be encoded, which here means the combining marks.
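The interactive session in the answer uses Python 2 syntax (the u'' prefix, and encode returning a plain string). As a minimal sketch of the same NFD-then-encode technique under Python 3, where encode returns bytes and needs a final decode, something like the following should work. The helper name strip_accents is hypothetical and not part of the original answer.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with accents removed, keeping only ASCII characters.

    strip_accents is a hypothetical helper name used for illustration.
    """
    # NFD splits each accented letter into a base letter plus combining
    # mark(s), e.g. U+00E0 (a with grave) becomes U+0061 followed by U+0300.
    decomposed = unicodedata.normalize("NFD", text)
    # Encoding with errors="ignore" drops every non-ASCII code point,
    # including the combining marks. Python 3 returns bytes here,
    # so decode back to str.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents("àà"))            # aa
print(strip_accents("crème brûlée"))  # creme brulee
print(repr(strip_accents("ø")))       # ''
```

The last call shows a limitation worth keeping in mind: characters such as ø have no canonical decomposition, so they are not reduced to a base letter. They are simply dropped by the 'ignore' handler.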


Answer