a few years ago i started wondering: is there a better way to represent text in bits? ascii can't represent codepoints higher than 127. utf-16 is great for high-codepoint languages, but bloats the size of ascii text. utf-8 is great because it's ascii-compatible, but bloats the size of high-codepoint text.

so i came up with an idea i call utf-n: get rid of the concept of bytes and replace it with pairs of bits. a pair of bits has 4 possible values: 01, 10, 11, and 00. if we use the first three (01, 10, and 11) as the digits of a number in ternary (base 3), and reserve the last one (00) as a separator, then we can encode codepoints in a way that lets us:

* represent any codepoint without artificial limits,
* use only the bits we need without padding, and
* greatly simplify the encoding/decoding mechanism.

it's also neat because it can easily represent negative numbers by shifting to balanced ternary, and it has some of the elegance of stop codons and other facets of DNA.

i built this page to test utf-n. change the text in this box to see how each encoding performs. you'll find that while rarely the best, utf-n usually avoids the worst-case scenario.

any feedback appreciated: @jedschmidt on twitter
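here's a minimal python sketch of the scheme described above, to make the bit-pair idea concrete. the description doesn't say which pair stands for which ternary digit, what order the digits go in, or whether 00 goes between codepoints or after each one, so this sketch assumes 01 = 0, 10 = 1, 11 = 2, most significant digit first, and a 00 after every codepoint. the function names are made up for illustration, and bits are shown as a string of '0'/'1' characters rather than packed.

```python
# assumed digit mapping (not specified in the original description)
DIGIT_TO_PAIR = {0: "01", 1: "10", 2: "11"}
PAIR_TO_DIGIT = {pair: digit for digit, pair in DIGIT_TO_PAIR.items()}
SEPARATOR = "00"


def encode_codepoint(cp: int) -> str:
    """return the ternary digits of one codepoint as bit pairs, most significant first."""
    if cp < 0:
        raise ValueError("this sketch only handles non-negative codepoints")
    digits = []
    while True:
        cp, digit = divmod(cp, 3)
        digits.append(digit)
        if cp == 0:
            break
    return "".join(DIGIT_TO_PAIR[d] for d in reversed(digits))


def utfn_encode(text: str) -> str:
    """encode text as a bit string, writing a 00 separator after each codepoint."""
    return "".join(encode_codepoint(ord(ch)) + SEPARATOR for ch in text)


def utfn_decode(bits: str) -> str:
    """decode a bit string produced by utfn_encode back into text."""
    if len(bits) % 2:
        raise ValueError("bit string must contain whole pairs")
    chars = []
    value = 0
    have_digits = False
    for i in range(0, len(bits), 2):
        pair = bits[i:i + 2]
        if pair == SEPARATOR:
            if not have_digits:
                raise ValueError("separator with no digits before it")
            chars.append(chr(value))
            value, have_digits = 0, False
        else:
            # shift the running value one ternary place and add the new digit
            value = value * 3 + PAIR_TO_DIGIT[pair]
            have_digits = True
    if have_digits:
        raise ValueError("missing final separator")
    return "".join(chars)
```

under these assumptions, "a" (codepoint 97, which is 10121 in ternary) becomes 10 01 10 11 10 00: five digit pairs plus the separator, 12 bits in total.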
encoding  bit size  byte size  bytes/char  perf
utf-16    0         0          0           --
utf-8     0         0          0           --
utf-n     0         0          0           --
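the size columns above can be approximated offline with a short python sketch. this is not the page's own code: it assumes utf-n spends two bits per ternary digit plus two bits for one 00 separator per codepoint, that byte size is the bit size rounded up to whole bytes, and that bytes/char is byte size divided by the number of codepoints. utf-16 is measured without a byte order mark. the perf column is left out because its definition isn't given. all function names are hypothetical.

```python
import math


def ternary_digits(cp: int) -> int:
    """number of base-3 digits needed for a non-negative codepoint (at least one)."""
    count = 1
    while cp >= 3:
        cp //= 3
        count += 1
    return count


def utfn_bits(text: str) -> int:
    """assumed utf-n cost: 2 bits per ternary digit plus a 2-bit separator per codepoint."""
    return sum(2 * ternary_digits(ord(ch)) + 2 for ch in text)


def size_report(text: str) -> dict:
    """return bit size, byte size and bytes/char for each encoding."""
    bit_sizes = {
        "utf-16": 8 * len(text.encode("utf-16-le")),  # little-endian, no byte order mark
        "utf-8": 8 * len(text.encode("utf-8")),
        "utf-n": utfn_bits(text),
    }
    chars = len(text)
    report = {}
    for name, bits in bit_sizes.items():
        byte_size = math.ceil(bits / 8)  # assumption: round up to whole bytes
        report[name] = {
            "bits": bits,
            "bytes": byte_size,
            "bytes_per_char": byte_size / chars if chars else 0.0,
        }
    return report


if __name__ == "__main__":
    for sample in ("a", "世"):
        print(sample, size_report(sample))
```

with these assumptions, "a" costs 16 bits in utf-16, 8 in utf-8 and 12 in utf-n, while "世" (U+4E16) costs 16, 24 and 22 bits respectively, so in both cases utf-n lands between the other two.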