Unicode problems have been among the harder issues I deal with: external libraries and hardware such as label printers sometimes don’t support Unicode and throw nasty errors or, worse, fail with mysterious silent bugs.
I’ve continually found better ways to deal with these strings. Here’s my journey:
Quick, dirty, and destructive list comprehension
One solution I used while in the shell was just to keep the characters whose “ord(char)” is below 128.
This method was destructive, but it was acceptable to me given the situation.
unicode_string = u'Österreich'
dirty_fix = ''.join([x for x in unicode_string if ord(x) < 128])
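A minimal sketch of the same filter wrapped in a function, assuming Python 3 (where every string is already Unicode and the u prefix is optional). The name strip_non_ascii is made up for illustration, and the non-ASCII characters are written as escapes so the example does not depend on how the source file is encoded.

```python
def strip_non_ascii(text):
    """Drop every character whose code point is 128 or above."""
    return ''.join(ch for ch in text if ord(ch) < 128)


print(strip_non_ascii('\u00d6sterreich'))       # sterreich
print(strip_non_ascii('na\u00efve caf\u00e9'))  # nave caf
```

The second call shows why this is destructive: the accented letters vanish entirely instead of falling back to their base letters.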
Built in string method encode
Next up I learned about the “encode” method on a string. It encodes a string to a given encoding, but the important part is the second argument, “errors”, which accepts values such as “ignore” (drop characters that can’t be encoded) or “replace” (substitute “?” for them).
unicode_string = u'Österreich'
unicode_string.encode('ASCII', 'ignore')  # out: 'sterreich'
unicode_string.encode('ASCII', 'replace')  # out: '?sterreich'
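For comparison, here is a rough sketch of how several standard error handlers behave, assuming Python 3, where encode returns bytes rather than a str. The handlers “xmlcharrefreplace” and “backslashreplace” are not mentioned above; they are included only to show that lossless alternatives exist.

```python
text = '\u00d6sterreich'  # 'Österreich'

# Collect the result of each error handler for side-by-side comparison.
results = {
    handler: text.encode('ascii', handler)
    for handler in ('ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace')
}
for handler, encoded in results.items():
    print(handler, encoded)

# The default handler, 'strict', raises instead of degrading.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print('strict raised:', strict_failed)
```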
Graceful degradation with the Python standard library module unicodedata
The best solution I’ve found thus far is the standard library module “unicodedata”, which allows accented Latin characters to degrade gracefully into ASCII.
The module contains a function “normalize”, which the documentation describes as follows:
Return the normal form ‘form’ for the Unicode string ‘unistr’. Valid values for ‘form’ are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.
The normal form KD (NFKD) will apply the compatibility decomposition, i.e. replace all compatibility characters with their equivalents. The normal form KC (NFKC) first applies the compatibility decomposition, followed by the canonical composition.
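The short version: NFD and NFKD both decompose each character into a base character followed by any combining marks. NFD applies only the canonical decomposition, while NFKD also replaces compatibility characters with their equivalents.
An accented character therefore often yields a plain ASCII letter, such as “Ö” –> “O” plus a combining diaeresis, and encoding to ASCII with “ignore” then drops the combining mark.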
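To make the decomposition visible, here is a small sketch, assuming Python 3, that inspects what normalize actually produces. The ligature example is my own addition, chosen only to show where NFD and NFKD differ.

```python
import unicodedata

o_umlaut = '\u00d6'  # 'Ö' as a single precomposed code point
decomposed = unicodedata.normalize('NFD', o_umlaut)

# Two code points: the base letter and a combining mark.
names = [unicodedata.name(ch) for ch in decomposed]
print(names)  # ['LATIN CAPITAL LETTER O', 'COMBINING DIAERESIS']

# NFD leaves compatibility characters alone; NFKD replaces them too.
ligature = '\ufb01'  # LATIN SMALL LIGATURE FI
nfd_ligature = unicodedata.normalize('NFD', ligature)
nfkd_ligature = unicodedata.normalize('NFKD', ligature)
print(nfd_ligature == ligature, nfkd_ligature)  # True fi
```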
import unicodedata
unicode_string = u'Österreich'
unicodedata.normalize('NFKD', unicode_string).encode('ASCII', 'ignore')  # out: 'Osterreich'
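A possible reusable form of that one-liner, assuming Python 3. The helper name to_ascii is hypothetical, and the final decode is only there so the result is a str rather than bytes.

```python
import unicodedata


def to_ascii(text):
    """Best-effort ASCII rendering: decompose, then drop what is not ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(to_ascii('\u00d6sterreich'))  # Osterreich
print(to_ascii('Stra\u00dfe'))      # Strae
```

Note the second result: letters such as “ß” have no decomposition in Unicode, so they are still dropped outright. The degradation is graceful only for characters that decompose into an ASCII base letter plus combining marks.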
This is great for us, as only about 0.01% of our data has these Unicode characters and human readability is all that matters.