Notes, links, etc about Unicode, including UTF-8, UTF-16, etc.

Byte Order Mark (BOM)

For systems that use UTF-16 (Java, Windows), bytes can be read in a different order depending on whether the system is big-endian or little-endian (Intel x86). If a file starts with FF FE (most common) then it is little-endian; if it starts with FE FF then it is big-endian.
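
A rough sketch of BOM sniffing, using the BOM constants from Python's standard codecs module. The helper name detect_utf16_byte_order is made up for this illustration, and it ignores other encodings (a UTF-32 little-endian BOM also begins with FF FE):

```python
import codecs

def detect_utf16_byte_order(data: bytes) -> str:
    """Guess UTF-16 byte order from a leading BOM (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF16_LE):  # b'\xff\xfe'
        return 'little-endian'
    if data.startswith(codecs.BOM_UTF16_BE):  # b'\xfe\xff'
        return 'big-endian'
    return 'unknown (no BOM)'

# The BOM is the code point U+FEFF written in the file's own byte order.
little = codecs.BOM_UTF16_LE + 'A'.encode('utf-16-le')
big = codecs.BOM_UTF16_BE + 'A'.encode('utf-16-be')
print(detect_utf16_byte_order(little), little.hex())
print(detect_utf16_byte_order(big), big.hex())
```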

Glyphs/Unicode Code Points

Glyph: an abstract concept representing a character or symbol. 'A' is different from 'a', etc. Best not to think of it as a "character": it is specifically not the C concept of a char, and it is not the language concept either, as some human languages require more than one Unicode Code Point to express what a reader sees as a single character (diacritics, skin-tone emoji, etc.).

Unicode assigns a number (nothing to do with encoding!) to each glyph, called a Unicode Code Point.
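
A small sketch of both points, assuming Python 3 and only the standard library. It shows that a code point is a plain integer, and that one visible character can be made of more than one code point:

```python
import unicodedata

# A code point is just an integer; ord() and chr() convert to and from it.
print(hex(ord('A')))
print(chr(0x41))

# One visible character may take more than one code point.
precomposed = '\u00e1'   # LATIN SMALL LETTER A WITH ACUTE, one code point
combining = 'a\u0301'    # 'a' followed by COMBINING ACUTE ACCENT
print(len(precomposed), len(combining))

# NFC normalization composes the pair into the single code point.
print(unicodedata.normalize('NFC', combining) == precomposed)
```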

Syntax for 'A', Latin capital letter A: U+0041

E.g. in Python 3:

>>> b'\u0041'
b'\\u0041'
>>> b'\u0041'.decode('unicode-escape')
'A'

Mojibake

Japan has historically used 3 or 4 encodings (e.g. Shift JIS, EUC-JP, ISO-2022-JP), all mostly incompatible with each other. Hence the Japanese word "mojibake" for garbled characters, which happen often there.
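
As an illustration (a sketch, not taken from any particular real-world incident), mojibake can be reproduced by writing text in one Japanese encoding and reading it back with another. The sample string is an assumption chosen for the demo:

```python
text = '文字化け'  # "mojibake" written in Japanese

# Write the text as Shift JIS, then read it back as if it were EUC-JP.
raw = text.encode('shift_jis')
garbled = raw.decode('euc_jp', errors='replace')
print(garbled)

# Decoding with the encoding that was actually used recovers the text.
recovered = raw.decode('shift_jis')
print(recovered == text)
```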

UTF-8

Details on what the binary representation of UTF-8 looks like.

def PrintUnicodeBytes(char: str, encoding: str = 'utf-8') -> None:
    # Collect each encoded byte as an 8-digit binary literal.
    # Named "bits" rather than "bytes" to avoid shadowing the builtin.
    bits = []
    for b in char.encode(encoding):
        bits.append(f'0b{b:08b}')
    print(f'{char} = {bits}')

PrintUnicodeBytes('a')
# prints: a = ['0b01100001']
PrintUnicodeBytes('á')
# prints: á = ['0b11000011', '0b10100001']
PrintUnicodeBytes('🍨')
# prints: 🍨 = ['0b11110000', '0b10011111', '0b10001101', '0b10101000']

For a glyph that can be represented as a single byte (i.e. ASCII), the leading bit is always 0 (0b0xxxxxxx).

For a glyph that needs multiple bytes, the number of leading 1 bits in the start byte gives the total number of bytes in the sequence (the start byte plus its continuation bytes). For example, for a glyph that needs two bytes, the topmost bits will be 110, e.g. 0b110xxxxx. The original design allowed sequences of up to 6 bytes (start byte 0b1111110x, followed by 5 continuation bytes), but RFC 3629 restricts UTF-8 to at most 4 bytes (start byte 0b11110xxx).

All following continuation bytes have the topmost bits set to 10, e.g. 0b10xxxxxx.
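
The bit patterns above can be sketched as a small classifier. The function name utf8_sequence_length is made up for this example, and it only handles the 1- to 4-byte forms used by modern UTF-8:

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the total byte length implied by a UTF-8 start byte.

    Hypothetical helper; covers only the 1- to 4-byte forms.
    """
    if first_byte & 0b10000000 == 0b00000000:  # 0b0xxxxxxx (ASCII)
        return 1
    if first_byte & 0b11100000 == 0b11000000:  # 0b110xxxxx
        return 2
    if first_byte & 0b11110000 == 0b11100000:  # 0b1110xxxx
        return 3
    if first_byte & 0b11111000 == 0b11110000:  # 0b11110xxx
        return 4
    # Continuation bytes (0b10xxxxxx) and anything else end up here.
    raise ValueError(f'not a valid start byte: 0b{first_byte:08b}')

for char in ('a', 'á', '🍨'):
    encoded = char.encode('utf-8')
    print(char, utf8_sequence_length(encoded[0]), len(encoded))
```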

This makes it easy to seek forward and backward in a string of glyphs: from any byte offset, you only need to check the topmost bits of each byte to find where the representation of a glyph started.
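
A minimal sketch of seeking backward, assuming the input is valid UTF-8. The helper name char_start is hypothetical:

```python
def char_start(data: bytes, index: int) -> int:
    """Step back from index to the start byte of the character containing it."""
    # Continuation bytes look like 0b10xxxxxx; skip over them.
    while index > 0 and data[index] & 0b11000000 == 0b10000000:
        index -= 1
    return index

data = 'aá🍨'.encode('utf-8')  # 1 + 2 + 4 bytes
print([char_start(data, i) for i in range(len(data))])
```

Every byte offset maps back to the offset of its start byte, without having to decode the string from the beginning.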

SamatsWiki: Unicode (last edited 2021-04-09 23:39:00 by SamatJain)