Notes, links, etc. about Unicode, including UTF-8, UTF-16, etc.

 * [[https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/|The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)]]
 * [[http://doc.cat-v.org/bell_labs/utf-8_history|The history of UTF-8 as told by Rob Pike]]
 * [[https://www.youtube.com/watch?v=MijmeoH9LT4|Characters, Symbols, and the Unicode Miracle]] [YouTube]

== Byte Order Mark (BOM) ==

For systems that use UTF-16 (Java, Windows), the two bytes of each code unit can be stored in either order, depending on whether the system is big-endian or little-endian (Intel x86). A file can begin with a Byte Order Mark to signal which: if the file starts with FE FF, it is big-endian; if it starts with FF FE, it is little-endian.

== Glyphs/Unicode Code Points ==

Glyph: an abstract concept representing a character or symbol. 'A' is different than 'a', etc. Best not to think of "character": it's specifically not the C concept, and it's not the human-language concept either, since some human languages require more than one Unicode Code Point to express a single perceived character (diacritics, skin-tone emoji, etc.).

Unicode assigns a number (nothing to do with encoding!) to each glyph, called a Unicode Code Point. Syntax for 'A', Latin capital letter A:

 * Code point = U+0041
 * Unicode escape in ASCII = \u0041

E.g. in Python 3 (note that \u is not an escape sequence inside a bytes literal, so the first expression is six literal bytes):

{{{#!highlight python numbers=off
>>> b'\u0041'
b'\\u0041'
>>> b'\u0041'.decode('unicode-escape')
'A'
}}}

== Mojibake ==

Japan has 3 or 4 encodings, all mostly incompatible with each other. Hence the word "mojibake": garbled characters happen often in Japan.

== UTF-8 ==

Details on what the binary representation of UTF-8 looks like.
{{{#!highlight python
def PrintUnicodeBytes(char: str, encoding: str = 'utf-8') -> None:
    # Show each byte of the encoded character in binary.
    byte_strs = [f'0b{b:>08b}' for b in char.encode(encoding)]
    print(f'{char} = {byte_strs}')

PrintUnicodeBytes('a')   # a = ['0b01100001']
PrintUnicodeBytes('á')   # á = ['0b11000011', '0b10100001']
PrintUnicodeBytes('🍨')  # 🍨 = ['0b11110000', '0b10011111', '0b10001101', '0b10101000']
}}}

For a glyph that can be represented as a single byte (i.e. ASCII), the leading bit is always 0 (0b0xxxxxxx).

For a glyph that needs multiple bytes, the number of leading 1 bits in the start byte equals the total number of bytes in the sequence. For example, for a glyph that needs two bytes, the topmost bits of the start byte are 110, e.g. 0b110xxxxx (two leading 1s = two bytes: the start byte plus one continuation byte). The original UTF-8 design allowed start bytes up to 0b1111110x (five continuation bytes, six bytes total), though UTF-8 as standardized today (RFC 3629) limits a sequence to four bytes. All continuation bytes have their topmost bits set to 10, e.g. 0b10xxxxxx. This makes it easy to seek forward and backward in a string of glyphs: you only need to check the topmost bits of each byte to know where the representation of a glyph starts.
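A minimal sketch of that seeking property (the function name here is illustrative, not from any library): starting from an arbitrary byte index, skip backward over 0b10xxxxxx continuation bytes until you reach a byte whose top two bits are not 10, which is where the glyph's encoding begins.

{{{#!highlight python
def codepoint_start(data: bytes, i: int) -> int:
    # Walk backward while data[i] is a continuation byte,
    # i.e. its top two bits are 10 (0b10xxxxxx).
    while (data[i] & 0b11000000) == 0b10000000:
        i -= 1
    return i

data = 'a🍨b'.encode('utf-8')  # 1 byte + 4 bytes (emoji) + 1 byte
# Byte 3 falls inside the emoji; its encoding starts at index 1.
print(codepoint_start(data, 3))  # 1
}}}

Note this assumes well-formed UTF-8; on garbage input (e.g. a stray continuation byte at the start of the buffer) it would walk off the front of the data.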