Given a Unicode string and these requirements:
- The string must be encoded into some byte-sequence format (e.g. UTF-8 or JSON Unicode escapes)
- The encoded string has a maximum length
For example, the iPhone push service requires JSON encoding with a maximum total packet size of 256 bytes.
What is the best way to truncate the string so that it re-encodes to valid Unicode and displays reasonably correctly?
(Human language comprehension is not necessary: the truncated version can look odd, e.g. for an orphaned combining character or a Thai vowel, as long as the software doesn’t crash when handling the data.)
See Also:
- Related Java question: How do I truncate a java string to fit in a given number of bytes, once UTF-8 encoded?
- Related Javascript question: Using JavaScript to truncate text to a certain size
Answer
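    def unicode_truncate(s, length, encoding='utf-8'):
        # Cut the encoded bytes at the limit; this may split a character.
        encoded = s.encode(encoding)[:length]
        # 'ignore' drops the incomplete trailing bytes instead of raising.
        return encoded.decode(encoding, 'ignore')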
Here is an example with a Unicode string in which each character takes 2 bytes in UTF-8. Cutting at 5 bytes splits the third character, and decoding would have raised a UnicodeDecodeError if the split code point weren’t ignored:
    >>> unicode_truncate(u'абвгд', 5)
    u'\u0430\u0431'
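The question also mentions JSON Unicode escapes, which a codec-based slice does not cover, because a cut could land in the middle of a `\uXXXX` sequence. A minimal sketch of one way to handle that case follows. The helper name `json_truncate` is hypothetical, and it assumes the byte limit applies to the quoted JSON string literal alone (not the whole packet) and that the literal is produced by `json.dumps` with its default `ensure_ascii=True`:

```python
import json


def json_truncate(s, max_bytes):
    """Longest prefix of s whose JSON string literal fits in max_bytes.

    Hypothetical helper, not part of the original answer.
    """
    used = 2  # the opening and closing double quotes
    kept = []
    for ch in s:
        # json.dumps escapes one character at a time, so the cost of each
        # character is its own literal minus the two surrounding quotes.
        # The output is pure ASCII, so len() equals the byte count.
        cost = len(json.dumps(ch)) - 2
        if used + cost > max_bytes:
            break
        used += cost
        kept.append(ch)
    return ''.join(kept)
```

For instance, `json_truncate(u'абвгд', 20)` keeps three characters: each one costs six bytes as a `\uXXXX` escape, and the quotes take the remaining two. Because the loop only ever stops between characters, no escape sequence is left half-written.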