Notes, links, etc. about Unicode, including UTF-8 and UTF-16.
Byte Order Mark (BOM)
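For systems that use UTF-16 (Java, Windows), the two bytes of each code unit can be stored in either order, depending on whether the writer used big-endian or little-endian (e.g. Intel x86) byte order. The BOM is the code point U+FEFF written at the start of the file: if the file starts with FE FF it is big-endian; if it starts with FF FE it is little-endian (the usual case for files written on Windows/x86).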
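A minimal sketch of BOM sniffing in Python, using the BOM constants from the standard `codecs` module. The helper name `sniff_utf16_byte_order` is made up for illustration, and it only handles the two UTF-16 BOMs.

```python
import codecs

# The BOM is U+FEFF serialized in the writer's byte order.
assert codecs.BOM_UTF16_BE == b'\xfe\xff'
assert codecs.BOM_UTF16_LE == b'\xff\xfe'


def sniff_utf16_byte_order(data: bytes) -> str:
    """Return 'big', 'little', or 'unknown' based on the first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return 'big'
    if data.startswith(codecs.BOM_UTF16_LE):
        return 'little'
    return 'unknown'


little = codecs.BOM_UTF16_LE + 'A'.encode('utf-16-le')  # FF FE 41 00
big = codecs.BOM_UTF16_BE + 'A'.encode('utf-16-be')     # FE FF 00 41

print(sniff_utf16_byte_order(little))
print(sniff_utf16_byte_order(big))
# The generic 'utf-16' codec reads the BOM and strips it from the result.
print(little.decode('utf-16'), big.decode('utf-16'))
```

Without a BOM (e.g. `'A'.encode('utf-16-be')` on its own) the helper returns 'unknown' and the byte order has to come from somewhere else.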
Glyphs/Unicode Code Points
Glyph: used in these notes for the abstract idea of a character or symbol. 'A' is different from 'a', etc. (Strictly, Unicode calls the abstract thing a character and uses "glyph" for its rendered shape.) Best not to think of "character": it's specifically not the C concept (char), and it is not the human-language concept either, as what a reader perceives as one character may require more than 1 Unicode Code Point (diacritics, skin-tone emoji, etc).
Unicode assigns a number (nothing to do with encoding!) to each glyph, called a Unicode Code Point.
Syntax for 'A', Latin capital letter A:
- Code point = U+0041
- Unicode escape in ASCII = \u0041
E.g. in Python 3:
>>> b'\u0041'
b'\\u0041'
>>> b'\u0041'.decode('unicode-escape')
'A'
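A short sketch of moving between text and code point numbers in Python 3, and of one visible symbol taking more than one code point. The helper name `code_points` is made up for illustration; `ord`, `chr` and `unicodedata.normalize` are standard.

```python
import unicodedata


def code_points(text: str) -> list:
    """Return the U+XXXX notation for every code point in `text`."""
    return [f'U+{ord(ch):04X}' for ch in text]


print(code_points('A'))        # ['U+0041']
print(chr(0x41), '\u0041')     # both are 'A'

# 'e' followed by COMBINING ACUTE ACCENT: one visible symbol, two code points.
decomposed = 'e\u0301'
print(code_points(decomposed))  # ['U+0065', 'U+0301']

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize('NFC', decomposed)
print(code_points(composed))    # ['U+00E9']

# Thumbs-up plus a skin-tone modifier: again two code points.
thumbs = '\U0001F44D\U0001F3FD'
print(code_points(thumbs))      # ['U+1F44D', 'U+1F3FD']
```

Note that `len()` on a Python `str` counts code points, not visible symbols, so `len(decomposed)` is 2.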
Mojibake
Japan has 3 or 4 legacy encodings in common use (e.g. Shift_JIS, EUC-JP, ISO-2022-JP), all mostly incompatible with each other. Hence the Japanese word "mojibake" for garbled text: characters decoded with the wrong encoding happen often in Japan.
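A small sketch of how mojibake arises, assuming Python's built-in `shift_jis`, `euc_jp` and `latin-1` codecs. The sample strings are arbitrary; the point is only that decoding with a different encoding than the one used to encode produces garbage rather than an error you can rely on.

```python
text = '文字化け'  # "mojibake" written in Japanese

# Encode with one Japanese encoding, decode with another.
data = text.encode('shift_jis')
garbled = data.decode('euc_jp', errors='replace')
print(garbled)                   # not the original text
print(data.decode('shift_jis'))  # the matching codec round-trips correctly

# The same effect with UTF-8 bytes misread as Latin-1.
latin_garbled = 'é'.encode('utf-8').decode('latin-1')
print(latin_garbled)             # Ã©
```

`errors='replace'` is used so that byte sequences that are invalid in the wrong encoding become U+FFFD instead of raising `UnicodeDecodeError`.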
UTF-8
Details on what the binary representation of UTF-8 looks like.
def PrintUnicodeBytes(char: str, encoding: str = 'utf-8') -> None:
    """Print each encoded byte of `char` as an 8-digit binary literal."""
    # Named `encoded` rather than `bytes` to avoid shadowing the built-in type.
    encoded = [f'0b{b:08b}' for b in char.encode(encoding)]
    print(f'{char} = {encoded}')

PrintUnicodeBytes('a')
# a = ['0b01100001']
PrintUnicodeBytes('á')
# á = ['0b11000011', '0b10100001']
PrintUnicodeBytes('🍨')
# 🍨 = ['0b11110000', '0b10011111', '0b10001101', '0b10101000']
For a glyph that can be represented as a single byte (i.e. ASCII), the leading bit is always 0 (0b0xxxxxxx).
For a glyph that needs multiple bytes, the number of leading 1 bits in the start byte gives the total number of bytes in the sequence (so one fewer continuation bytes follow). For example, for a glyph that needs two bytes, the topmost bits will be 110, e.g. 0b110xxxxx. The original UTF-8 design allowed start bytes up to 0b1111110x (6 bytes in total), but since RFC 3629 the longest valid sequence is 4 bytes (start byte 0b11110xxx), which is enough to reach U+10FFFF.
All following continuation bytes have the topmost bits set to 10, e.g. 0b10xxxxxx.
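The bit patterns above can be turned into a tiny classifier. This is a sketch only: the name `utf8_sequence_length` is made up, it follows the 4-byte limit of RFC 3629, and it does not reject every invalid start byte (e.g. the overlong-encoding leads 0xC0 and 0xC1).

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the total byte length of a UTF-8 sequence from its start byte."""
    if (lead & 0b10000000) == 0b00000000:   # 0xxxxxxx: ASCII, single byte
        return 1
    if (lead & 0b11100000) == 0b11000000:   # 110xxxxx: 2 bytes
        return 2
    if (lead & 0b11110000) == 0b11100000:   # 1110xxxx: 3 bytes
        return 3
    if (lead & 0b11111000) == 0b11110000:   # 11110xxx: 4 bytes
        return 4
    # 10xxxxxx is a continuation byte; anything else is not valid UTF-8.
    raise ValueError(f'not a UTF-8 start byte: 0b{lead:08b}')


for ch in ['a', 'á', '€', '🍨']:
    encoded = ch.encode('utf-8')
    print(ch, utf8_sequence_length(encoded[0]), len(encoded))
```

For each sample the length predicted from the first byte matches the actual encoded length.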
This makes it easy to seek forward and backward in a UTF-8 string: checking the topmost bits of any byte tells you whether it starts a glyph or continues one, so you can always find where the representation of a glyph started.
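A sketch of seeking backward from an arbitrary byte offset to the start of the code point that contains it. The name `char_start` is made up for illustration, and the input is assumed to be well-formed UTF-8.

```python
def char_start(data: bytes, index: int) -> int:
    """Return the index of the start byte of the code point covering `index`."""
    # Continuation bytes look like 10xxxxxx; step left until we leave them.
    while index > 0 and (data[index] & 0b11000000) == 0b10000000:
        index -= 1
    return index


data = 'aá🍨'.encode('utf-8')
# Byte layout: 0 -> 'a', 1-2 -> 'á', 3-6 -> '🍨'
print(char_start(data, 6))           # 3
print(char_start(data, 2))           # 1
print(data[char_start(data, 5):].decode('utf-8'))  # 🍨
```

At most three steps are ever needed, since a valid sequence has at most three continuation bytes.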