Python Unicode Graceful Degradation to ASCII


Unicode has been one of the harder problems for me to deal with. External libraries and hardware such as label printers sometimes don’t support it, and they respond with nasty errors or, worse, mysterious silent bugs.

I’ve continually found better ways to deal with these strings. Here’s my journey:

Quick, dirty, and destructive list comprehension

One solution I used while in the shell was to keep only the characters whose “ord(char)” is below 128.

This method was destructive, but it was acceptable to me given the situation.

unicode_string = u'Österreich'
dirty_fix = ''.join([x for x in unicode_string if ord(x) < 128])

Built-in string method encode

Next up I learned about the “encode” method on a string. It converts the string to a given encoding, but the important part is the second argument, “errors”, which you can set to “ignore” or “replace”.

unicode_string = u'Österreich'
unicode_string.encode('ASCII', 'ignore')
# out: 'sterreich'

unicode_string.encode('ASCII', 'replace')
# out: '?sterreich'
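
Here is a minimal sketch of the same idea, assuming Python 3, where “encode” returns bytes rather than a plain string. It also tries a third handler, “backslashreplace”, which is not used above but keeps the lost character visible as an escape sequence.

```python
unicode_string = 'Österreich'

# 'ignore' silently drops anything ASCII cannot represent.
ignored = unicode_string.encode('ascii', 'ignore')

# 'replace' substitutes a question mark for each such character.
replaced = unicode_string.encode('ascii', 'replace')

# 'backslashreplace' keeps the information as an escape sequence.
escaped = unicode_string.encode('ascii', 'backslashreplace')

# In Python 3 the results are bytes; decode to get a str again.
print(ignored.decode('ascii'))   # sterreich
print(replaced.decode('ascii'))  # ?sterreich
print(escaped.decode('ascii'))   # \xd6sterreich
```

All three handlers are lossy in the sense that the original letter is gone from the readable text, so which one fits depends on whether a missing character or a visible placeholder is less harmful downstream.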

Graceful degradation with Python standard library unicodedata

The best solution I’ve found so far is the standard library module “unicodedata”, which allows Latin Unicode characters to degrade gracefully into ASCII.

The module contains a function “normalize”, which is described as follows:

Return the normal form “form” for the Unicode string “unistr”. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.


The normal form KD (NFKD) will apply the compatibility decomposition, i.e. replace all compatibility characters with their equivalents. The normal form KC (NFKC) first applies the compatibility decomposition, followed by the canonical composition.


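The short version: if you use NFD or NFKD, the function decomposes each Unicode character. NFD (“Normal Form D”) applies the canonical decomposition; NFKD does the same and additionally replaces compatibility characters with their equivalents.

An accented character such as “Ö” decomposes into the ASCII letter “O” followed by a combining diaeresis. Encoding to ASCII with “ignore” then drops the combining mark and leaves the “O”.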
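
To see what the decomposition actually produces, here is a small sketch, assuming Python 3. The ligature example is my own addition, chosen only to illustrate the difference between NFD and NFKD.

```python
import unicodedata

# NFD splits the precomposed character into a base letter
# plus a combining mark.
decomposed = unicodedata.normalize('NFD', 'Ö')
names = [unicodedata.name(ch) for ch in decomposed]
print(len(decomposed), names)
# 2 ['LATIN CAPITAL LETTER O', 'COMBINING DIAERESIS']

# A compatibility character: LATIN SMALL LIGATURE FI (U+FB01).
ligature = '\ufb01'
nfd = unicodedata.normalize('NFD', ligature)    # left unchanged
nfkd = unicodedata.normalize('NFKD', ligature)  # becomes 'fi'
print(nfd == ligature, nfkd)
# True fi
```

This is why NFKD is the more useful form when the goal is ASCII: it catches both accented letters and compatibility characters such as ligatures.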

import unicodedata

unicode_string = u'Österreich'
unicodedata.normalize('NFKD', unicode_string).encode('ASCII', 'ignore')
# out: 'Osterreich'
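
For repeated use, the two steps can be wrapped in a small function. This is only a sketch, assuming Python 3; the name “to_ascii” is hypothetical and not part of any library.

```python
import unicodedata


def to_ascii(text):
    """Hypothetical helper: best-effort ASCII version of text."""
    # Split accented characters into base letter + combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop whatever still is not ASCII, then return a str, not bytes.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(to_ascii('Österreich'))  # Osterreich
print(to_ascii('Straße'))      # Strae
```

Note the second result: characters that have no decomposition, such as “ß”, are still dropped silently, so the degradation is graceful only for letters that are a base letter plus accents.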

This is great for us, since only about .01% of our data contains these Unicode characters and human readability is all that matters.

