Tag: unicode

How to completely sanitize a string of illegal characters in Python?

My program has a feature that lets the user upload a CSV file, which the program reads and uses as input. One user reports that his input raises an error. The error is caused by an illegal character that is encoded incorrectly. The character is below: Sometimes it
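
A minimal sketch of one common approach, assuming the uploaded file is meant to be UTF-8 and arrives as raw bytes: decode with an error handler so that invalid byte sequences are replaced rather than raising. The function name `sanitize_bytes` is hypothetical and not taken from the question.

```python
def sanitize_bytes(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw bytes, replacing undecodable sequences with U+FFFD.

    Assumption: the intended encoding is UTF-8 unless stated otherwise.
    """
    # errors="replace" substitutes the replacement character for bad bytes;
    # errors="ignore" would drop them silently instead.
    return raw.decode(encoding, errors="replace")
```

Replacing rather than ignoring keeps a visible marker where data was damaged, which can make it easier to report the problem back to the user.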

Truncating unicode so it fits a maximum size when encoded for wire transfer

json python truncate unicode

Given a Unicode string and these requirements: the string will be encoded into some byte-sequence format (e.g. UTF-8 or JSON unicode escape), and the encoded string has a maximum length. For example, the iPhone push service requires JSON encoding with a maximum total packet size of 256 bytes. What is the best way to truncate the string so that it re-encodes to
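
For the plain UTF-8 case, the idea can be sketched as follows: encode, cut the bytes at the limit, then decode while discarding any incomplete multi-byte sequence left at the cut. This is a sketch under stated assumptions, not a complete answer: the name `truncate_utf8` is hypothetical, JSON escaping is not handled, and a cut may still separate a base character from a following combining mark.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Slicing bytes may cut through a multi-byte character; errors="ignore"
    # drops that incomplete trailing sequence instead of raising.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

For a JSON payload, the length that matters is that of the serialized packet, so the limit passed here would need to account for the surrounding structure and any escapes.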

What is the default content-type/charset?

encoding html python unicode

According to this answer (urllib2 read to Unicode), I have to get the content-type in order to convert to Unicode. However, some websites don’t have a “charset”. For example, the [‘content-type’] for this page is “text/html”. I can’t convert it to Unicode. Is there a default “encoding” (English, of course), so that if nothing is found, I can just use that?
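
A small sketch of the fallback logic, written for Python 3 with the standard library's `email.message.Message` to parse the header value. The helper name `charset_from_content_type` and the choice of `iso-8859-1` as the default are assumptions: ISO-8859-1 was the historical HTTP/1.1 default for text types, but many pages without a declared charset are in fact UTF-8, so the default should be chosen per application.

```python
from email.message import Message


def charset_from_content_type(content_type: str, default: str = "iso-8859-1") -> str:
    """Extract the charset parameter from a Content-Type value, or fall back."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default
```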

How to get string objects instead of Unicode from JSON

json python python-2.x serialization unicode

I’m using Python 2 to parse JSON from ASCII encoded text files. When loading these files with either json or simplejson, all my string values are cast to Unicode objects instead of string objects. The problem is, I have to use the data with some libraries that only accept string objects. I can’t change the libraries nor update them. Is

What is the best way to remove accents (normalize) in a Python unicode string?

diacritics python python-2.x python-3.x unicode

I have a Unicode string in Python, and I would like to remove all the accents (diacritics). I found on the web an elegant way to do this (in Java): convert the Unicode string to its long normalized form (with a separate character for letters and diacritics), then remove all the characters whose Unicode type is “diacritic”. Do I need to
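
The two steps described in the question can be sketched in Python 3 with the standard `unicodedata` module, assuming that "long normalized form" means NFD and that "diacritic" corresponds to the Unicode category `Mn` (nonspacing mark). The function name `strip_accents` is hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFD), then drop nonspacing combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
```

Letters that have no canonical decomposition, such as "ø", pass through unchanged, so this is not a general transliteration routine.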

Character reading from file in Python

ascii encoding python unicode

In a text file, there is a string “I don‘t like this”. However, when I read it into a string, it becomes “I don\xe2\x80\x98t like this”. I understand that \u2018 is the unicode representation of “‘”. I use command to do the reading. Now, is it possible to read the string in such a way that when it is read
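
The bytes E2 80 98 are the UTF-8 encoding of U+2018, which suggests the file was read as raw bytes rather than decoded. A minimal Python 3 sketch, assuming the file is UTF-8 encoded (the helper name `read_text` is hypothetical):

```python
def read_text(path: str, encoding: str = "utf-8") -> str:
    """Read a whole file as text, decoding it with the given encoding."""
    # Passing encoding= makes open() decode bytes to str while reading,
    # so b"\xe2\x80\x98" comes back as the single character U+2018.
    with open(path, encoding=encoding) as f:
        return f.read()
```

In Python 2, `io.open` accepts the same `encoding` argument and returns `unicode` objects.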

Python, Unicode, and the Windows console

python unicode

When I try to print a Unicode string in a Windows console, I get an error: UnicodeEncodeError: ‘charmap’ codec can’t encode character …. I assume this is because the Windows console does not accept Unicode-only characters. What’s the best way around this? Is there any way I can make Python automatically print a ? instead of failing in this
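
One way to get a "?" instead of an exception is to round-trip the text through the console's encoding with the `replace` error handler before printing. This is a sketch under assumptions: the helper name `to_console_safe` is hypothetical, and it falls back to ASCII when the output stream does not report an encoding.

```python
import sys


def to_console_safe(text, encoding=None):
    """Replace characters the target encoding cannot represent with '?'."""
    # Use the console's reported encoding unless one is given explicitly.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # encode(..., errors="replace") emits b"?" for unencodable characters;
    # decoding again yields a str that is safe to print.
    return text.encode(encoding, errors="replace").decode(encoding)
```

Usage would look like `print(to_console_safe(some_text))`. Setting the `PYTHONIOENCODING` environment variable to something like `cp437:replace` is another option that applies the same handler to all output.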