Notes, links, etc. about Unicode, including UTF-8, UTF-16, etc.

 * [[https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/|The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)]]
 * [[http://doc.cat-v.org/bell_labs/utf-8_history|The history of UTF-8 as told by Rob Pike]]
 * [[https://www.youtube.com/watch?v=MijmeoH9LT4|Characters, Symbols, and the Unicode Miracle]] [YouTube]

== Byte Order Mark (BOM) ==

For systems that use UTF-16 (Java, Windows), the two bytes of each code unit can be stored in either order, depending on whether the system is big-endian or little-endian (Intel x86). A file can begin with a Byte Order Mark to signal which: if the file starts with FE FF, it is big-endian; if it starts with FF FE, it is little-endian.

== Glyphs/Unicode Code Points ==

Glyph: an abstract concept representing a character or symbol. 'A' is different than 'a', etc. Best not to think of "character": it's specifically not the C concept, and it's not the human-language concept either, since some human languages require more than one Unicode Code Point to express a single perceived character (diacritics, skin-tone emoji, etc.).

Unicode assigns a number (nothing to do with encoding!) to each glyph, called a Unicode Code Point. Syntax for 'A', Latin capital letter A:

 * Code point = U+0041
 * Unicode escape in ASCII = \u0041

E.g. in Python 3 (note that \u is not an escape sequence inside a bytes literal, so the first expression is six literal bytes):

{{{#!highlight python numbers=off
>>> b'\u0041'
b'\\u0041'
>>> b'\u0041'.decode('unicode-escape')
'A'
}}}

== Mojibake ==

Japan has 3 or 4 encodings, all mostly incompatible with each other. Hence the word "mojibake": garbled characters happen often in Japan.

== UTF-8 ==

Details on what the binary representation of UTF-8 looks like.
{{{#!highlight python
def PrintUnicodeBytes(char: str, encoding: str = 'utf-8') -> None:
    # Show each byte of the encoded character in binary.
    byte_strs = [f'0b{b:>08b}' for b in char.encode(encoding)]
    print(f'{char} = {byte_strs}')

PrintUnicodeBytes('a')   # a = ['0b01100001']
PrintUnicodeBytes('á')   # á = ['0b11000011', '0b10100001']
PrintUnicodeBytes('🍨')  # 🍨 = ['0b11110000', '0b10011111', '0b10001101', '0b10101000']
}}}

For a glyph that can be represented as a single byte (i.e. ASCII), the leading bit is always 0 (0b0xxxxxxx).

For a glyph that needs multiple bytes, the number of leading 1 bits in the start byte equals the total number of bytes in the sequence. For example, for a glyph that needs two bytes, the topmost bits of the start byte are 110, e.g. 0b110xxxxx (two leading 1s = two bytes: the start byte plus one continuation byte). The original UTF-8 design allowed start bytes up to 0b1111110x (five continuation bytes, six bytes total), though UTF-8 as standardized today (RFC 3629) limits a sequence to four bytes. All continuation bytes have their topmost bits set to 10, e.g. 0b10xxxxxx. This makes it easy to seek forward and backward in a string of glyphs: you only need to check the topmost bits of each byte to know where the representation of a glyph starts.
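A minimal sketch of that seeking property (the function name here is illustrative, not from any library): starting from an arbitrary byte index, skip backward over 0b10xxxxxx continuation bytes until you reach a byte whose top two bits are not 10, which is where the glyph's encoding begins.

{{{#!highlight python
def codepoint_start(data: bytes, i: int) -> int:
    # Walk backward while data[i] is a continuation byte,
    # i.e. its top two bits are 10 (0b10xxxxxx).
    while (data[i] & 0b11000000) == 0b10000000:
        i -= 1
    return i

data = 'a🍨b'.encode('utf-8')  # 1 byte + 4 bytes (emoji) + 1 byte
# Byte 3 falls inside the emoji; its encoding starts at index 1.
print(codepoint_start(data, 3))  # 1
}}}

Note this assumes well-formed UTF-8; on garbage input (e.g. a stray continuation byte at the start of the buffer) it would walk off the front of the data.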