Encoding#
Besides numbers and punctuation, data files often contain text whose characters range from the familiar a-z through accented characters such as ä, é, ù to mathematical symbols like ∞ and modern emoji 🐍. Most of the time this text is understood as a human-readable string. But when the string must be saved in a binary format in some file, a decision must be made on how to map each character to a binary representation. This operation is known as encoding. The inverse operation, converting a sequence of raw bytes (which may in principle represent any binary data) into human-readable text, is called decoding.
Common text encodings#
Many different encodings exist, depending on the characters used by a language or even on the computer’s operating system. And just by looking at the binary data in a text file, it is non-trivial to tell which encoding was used. Moreover, most encodings are not compatible with each other, hence the common problem of decoding errors or mis-mapped characters when a file is opened with the wrong encoding.
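To see why this matters, the same byte sequence can decode to entirely different text under different encodings. A small sketch, using the utf-8 bytes of “ä” (encodings introduced in the following sections):

```python
# The two bytes that utf-8 uses to represent "ä".
raw = b"\xc3\xa4"

# Decoded with the correct encoding, the original character comes back.
print(raw.decode("utf-8"))   # ä

# Decoded as latin1, each byte maps to its own character instead.
print(raw.decode("latin1"))  # Ã¤
```

Nothing in the bytes themselves says which of the two decodings is the intended one.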
ASCII#
ASCII (American Standard Code for Information Interchange) dates back to the 1960s and maps each of its 128 characters (including non-printable and control characters) to an integer represented by 7 bits.
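The 7-bit range can be checked directly in Python: every ASCII character has a code point below 128, and characters outside the set cannot be encoded at all.

```python
# ASCII code points fit into 7 bits (0-127).
print(ord("A"))                  # 65
print("Hello!".encode("ascii"))  # b'Hello!'

# Characters outside the 128-character set raise a UnicodeEncodeError.
try:
    "ä".encode("ascii")
except UnicodeEncodeError as exc:
    print(exc)
```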
Latin1#
Latin1, or more precisely ISO/IEC 8859-1, is an 8-bit extension of ASCII that adds the accented characters typical of Latin-script languages. The Windows operating system used the closely related Windows-1252 encoding, which deviates in a few code points.
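One of those deviations can be shown directly: Windows-1252 assigns printable characters, such as the euro sign, to byte values that ISO 8859-1 reserves for control characters.

```python
# The euro sign has a byte value in Windows-1252 (cp1252) ...
print("€".encode("cp1252"))  # b'\x80'

# ... but no representation at all in latin1, since its code point
# U+20AC lies outside the 8-bit range.
try:
    "€".encode("latin1")
except UnicodeEncodeError as exc:
    print(exc)
```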
Unicode and UTF#
Already in the 1980s people realized that a much larger set of characters must be supported to accommodate the world’s languages and needs. These efforts led to the birth of the Unicode Standard, which currently defines roughly 150,000 characters and can support over a million in total. Strictly speaking, Unicode is not itself an encoding, but the Unicode Standard defines, for instance, the utf-8 and utf-16 encodings. Notably, the 8-bit utf-8 is backward compatible with ASCII, while utf-16 and utf-32 are not. Unicode has become the de-facto standard, and when in doubt you should use utf-8 for everything.
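The ASCII compatibility of utf-8 can be verified directly: pure-ASCII text encodes to exactly the same bytes, while utf-16 uses 2-byte code units and prepends a byte-order mark (BOM).

```python
text = "abc"

# utf-8 encodes ASCII characters to the very same single bytes as ASCII.
assert text.encode("utf-8") == text.encode("ascii")

# utf-16 is not ASCII compatible: 8 bytes here,
# a 2-byte BOM plus 2 bytes per character.
encoded = text.encode("utf-16")
print(len(encoded))  # 8
```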
Encoding in Python#
The default encoding in Python 3 is utf-8. This applies to the source code files themselves as well as to strings, which are sequences of Unicode characters by default. The str.encode() and bytes.decode() methods convert between strings and bytes.
See also
The Python documentation contains a section Encodings and Unicode with many details.
Different encodings of the umlaut “ä” yield different bytestrings.
>>> "ä".encode("utf8")
b'\xc3\xa4'
>>> "ä".encode()  # shorthand for the above
b'\xc3\xa4'
>>> "ä".encode("latin1")  # different encoding
b'\xe4'
Conversely, the decoding of the bytestring requires knowledge of the correct encoding.
>>> b"\xc3\xa4".decode()
'ä'
>>> b"\xe4".decode("latin1")
'ä'
Otherwise an incorrect string gets returned, known as mojibake,
>>> "ä".encode("utf8").decode("latin1")
'Ã¤'
>>> "Ã¤".encode("latin1").decode("utf8")  # reverting the wrong decoding
'ä'
or the decoding fails completely
>>> "ä".encode("latin1").decode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 0: unexpected end of data
In the worst case, you can decide to handle the error by replacing the failing characters: they become “?” when encoding to ASCII, or the official Unicode replacement character “�” (U+FFFD) when decoding.
>>> "ä".encode("latin1").decode("utf8", errors="replace")
'�'
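The errors argument accepts further handlers beyond "replace"; a brief sketch of the common ones:

```python
bad = "ä".encode("latin1")  # b'\xe4' -- not valid utf-8

# "replace": substitute the Unicode replacement character U+FFFD.
print(bad.decode("utf8", errors="replace"))           # �

# "ignore": silently drop the undecodable bytes.
print(bad.decode("utf8", errors="ignore"))            # '' (empty string)

# "backslashreplace": keep the raw byte values as escape sequences.
print(bad.decode("utf8", errors="backslashreplace"))  # \xe4

# On the encoding side, "replace" falls back to "?" for ASCII.
print("ä".encode("ascii", errors="replace"))          # b'?'
```

Silently dropping or replacing bytes loses information, so these handlers are best reserved for display purposes rather than for round-tripping data.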