
2023-05-31

Subjectively Relevant Unicode Subset 64K

There's a limit of 65535 glyphs per OpenType font, and Unicode has 150249 graphic characters in version 15.1, of which 98682 are Han (you need 2 fonts for that alone), which leaves 51567 for everything else. However, Indic scripts may require many ligatures, and (U)CSUR is tacitly endorsed as a method to use IP-encumbered scripts.
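A quick sanity check of that budget, as a minimal Python sketch. The character counts are simply the figures quoted above, and the one-glyph-per-character assumption is mine and optimistic (ligatures, variants and .notdef all eat into the limit).

```python
import math

GLYPH_LIMIT = 65535      # maximum glyphs per OpenType font
GRAPHIC_15_1 = 150249    # graphic characters in Unicode 15.1 (figure from the text)
HAN = 98682              # Han share of the above (figure from the text)

# Assumption: exactly one glyph per character.
non_han = GRAPHIC_15_1 - HAN
han_fonts = math.ceil(HAN / GLYPH_LIMIT)
headroom = GLYPH_LIMIT - non_han   # glyph slots left for ligatures, PUA, etc.

print(f"non-Han characters: {non_han}")
print(f"fonts needed for Han alone: {han_fonts}")
print(f"headroom in a single non-Han font: {headroom}")
```

The headroom is what Indic ligatures and the PUA scripts have to share, and it shrinks with every new Unicode version.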

Alternatively, one can pretend that Unicode 3.0 from 1999, with 49168 graphic characters all in the BMP, is the last version, or that Plane 2 never existed until 9.0 from 2016, when the annoyingly large Tangut was added to Plane 1, but that gets outdated pretty quickly. If you ignore Tangut, Hieroglyphs, Cuneiform, and Bamum Supplement, you get basically Unifont-EX, which plateaus at 15.1 for the BMP and 11.0 for the SMP, but with some glyph deduplication could also fit the 16.0 Symbols for Legacy Computing and its Supplement.

Future Unicode versions will have more than 64K non-Han characters, so it is better to choose interesting scripts and blocks now.

Hangul is best served by an advanced OpenType syllable-composing system: there are far more syllables than the 11172 precomposed ones, 125*95*137=1626875, or 128*99*141=1786752 with the North Korean extensions. Precomposing that many would need 25 or 28 fonts and doesn't even fit into Unicode, which is limited to 10:FFFFh, but would just fit into at-most-4-byte UTF-8, which ends at 1F:FFFFh. Time to define usage of FDD0..FDEF as UTF-16 super-surrogate 8-plets to reach the full 32 bits of UCS-4.
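The Hangul numbers can be checked the same way. This is only a sketch: the 125/95/137 and 128/99/141 jamo counts are taken from the paragraph above as given, the 19/21/28 split and the U+AC00 formula for the precomposed block come from the Unicode Standard, and the function names are made up for illustration.

```python
import math

GLYPH_LIMIT = 65535
UNICODE_SIZE = 0x10FFFF + 1      # code points 0..10FFFF
UTF8_4BYTE_SIZE = 0x1FFFFF + 1   # 21 payload bits in a 4-byte sequence

def syllable_count(leads, vowels, trails):
    """Number of lead/vowel/trail combinations."""
    return leads * vowels * trails

def precomposed(lead, vowel, trail):
    """Modern precomposed syllable from 0-based jamo indices (trail 0 = none)."""
    return chr(0xAC00 + (lead * 21 + vowel) * 28 + trail)

def fonts_needed(glyphs):
    return math.ceil(glyphs / GLYPH_LIMIT)

modern = syllable_count(19, 21, 28)     # the precomposed block
full = syllable_count(125, 95, 137)     # counts from the text
north = syllable_count(128, 99, 141)    # with North Korean extensions

for label, n in (("modern", modern), ("full", full), ("north", north)):
    print(f"{label:>6}: {n:>7} syllables, {fonts_needed(n)} font(s), "
          f"fits Unicode: {n <= UNICODE_SIZE}, "
          f"fits 4-byte UTF-8: {n <= UTF8_4BYTE_SIZE}")

print(precomposed(0, 0, 0), precomposed(18, 20, 27))  # first and last syllable
```

A composing font only needs glyphs for the jamo (plus positional variants), which is why it scales where precomposition doesn't.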

PUA assignment is based on Fairfax and Constructium. The E000~F8FF range should be considered mostly (inter)nationalized according to UCSUR, as it contains the much-needed tlhIngan pI'qaD and Tengwar. The Trekkies and Tolkienists are a stronger user base than medievalists and linguists. Most of MUFI, CYFI and SIL have been incorporated, and the leftovers are mostly ligatures, variations, stylistic sets, or precomposed forms. There is the SMuFL PUA agreement, but most of that is getting into Unicode too, and I'm a tracker, pianoroll, and ASCII tab guy anyway. Also, Nerd Fonts have finally fixed their icons overflowing into Arabic Presentation Forms by moving them to the astral PUA, so they no longer mess with Quran text from tanzil.net; however, Powerline conflicts with Tengwar (not Cirth though).


Begin   End     Name                                        Size  Stot  

000000  0033FF  Lower BMP                                   3400  3400  

004DC0  004DFF  Yijing Hexagrams                            0040  3440  

00A4D0  00ABFF  Middle BMP                                  0730  3B70  

00D7B0  00D7FF  Hangul Jamo Extended-B                      0050  3BC0  

00E000  00EDFF  Lower UCSUR                                 0E00  49C0  

00EF00  00EFFF  Hex Byte Pictures                           0100  4AC0  

00F000  00F1FF  Kamakawi                                    0200  4CC0  

00F200  00F27F  Box Drawing Ext, Fill Patterns, Shade Quads 0080  4D40  

00F400  00F43F  C1 Control Pictures                         0040  4D80  

00F4C0  00F4EF  Ath                                         0030  4DB0  

00F500  00F54F  Kodo Symbols                                0050  4E00  

00F550  00F55F  Mathematical Symbols Appendix               0010  4E10  

00F560  00F56F  Camp Duodecimal Numerals                    0010  4E20  

00F580  00F58F  Geomantic Figures                           0010  4E30  

00F590  00F5FF  C64-OS and Commander X16 Symbols            0070  4EA0  

00F600  00F7FF  Adobe: LGC Compatibility Forms              0200  50A0  

00F800  00F83F  Apple: Hoefler Ornaments                    0040  50E0  

00F880  00F89F  Adobe: Thai Compatibility Forms             0020  5100  

00F8A0  00F8FF  UCSUR: Aiha and Klingon                     0060  5160  

00FB00  00FFFF  Upper BMP                                   0500  5600  

 

010000  012FFF  Lower SMP, Cuneiform                        3000  8600

013000  015AFF  Egyptian, Anatolian, and Mayan Hieroglyphs  2B00  B170

016000  0160FF  Cirth and Tengwar (no Mandombe)             0100  B270

016140  0161FF  Sarati, other Tolkien scripts, and Moon     00C0  B330

016200  0167FF  Blissymbols                                 0600  B900

016EF0  016EFF  Bopomofo Ext-A, Kanbun Ext-A, IdeoSym&Punc  0060  B960

01A760  01A77F  Rejang Extended                             0020  B980

01AFD0  01AFFF  Kana Extended-C and B                       0030  B9B0

01B000  01B16F  Kana Supplement, Kana Ext-A, Small Kana Ext 0170  BB20

01BA00  01BCFF  Indus, Shorthands (RIP Rongorongo)          0300  BE20

01CC00  01CEFF  Symbols for Legacy Computing Supplement     0300  C120

01D100  01D24F  Musical Symbols, Ancient Greek Music Not.   0150  C270

01D2C0  01D2FF  Kaktovik and Mayan Numerals                 0040  C2B0

01D300  01D37F  Tai Xuan Jing Symbols, Counting Rod Nums    0080  C240

01D400  01D7FF  Mathematical Alphanumeric Symbols           0400  C640

01D800  01DAAF  Sutton SignWriting                          02B0  C7F0

01DF00  01E08F  Latin Ext-G, Glagolitic Sup, Cyrillic Ext-D 0190  C980

01E7E0  01E7FF  Buginese Sup, Lontara B-B, Ethiopic Ext-B   0090  CA10

01E900  01E95F  Adlam                                       0060  CA70

01EC00  01FFFF  Upper SMP                                   1400  DE70

 

0F0000  0F1C3F  Upper UCSUR                                 1C40  FBB0

0FF030  0FF0DF  Domino Tiles Extended, Powerline Symbols    00B0  FC60

0FE000  0FE07F  Tengwar Presentation Forms                  0080  FCE0

0FE680  0FE6DF  Ewellic Presentation Forms                  0060  FD40

0FF380  0FF3FF  Tahano Veno and Aliphbeph                   0080  FDC0

0FF400  0FF51F  Voynich                                     0120  FEE0

0FF700  0FF7FF  7 Segment Display Patterns                  0100  FFE0

0FF900  0FFEFF  Sitelen Pona Presentation Forms-A,B         0300  02E0

0FFF00  0FFFFF  Symbols for Legacy Computing Appendix       0100  03E0


I am 3E0h=992 codepoints over 65536 in blocks, but some blocks are intentionally oversized for potential expansion (Cuneiform, Hieroglyphs), some are only proposals (Indus, Blissymbols), there are gaps in the allocated space (1300 codepoints in the BMP alone), and some characters look the same and can share a glyph. However, there are the Indic scripts, which would have to make do with only viramas instead of ligatures.
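For anyone who wants to redo the bookkeeping, here is a minimal sketch of how the Size and Stot columns can be derived from Begin and End. Only the first four table rows are included as sample data, and the helper name is hypothetical. Stot is printed modulo 10000h, the way the last two table rows show it after passing 64K.

```python
# Sample data: the first four rows of the table (Begin, End, Name).
ROWS = [
    (0x000000, 0x0033FF, "Lower BMP"),
    (0x004DC0, 0x004DFF, "Yijing Hexagrams"),
    (0x00A4D0, 0x00ABFF, "Middle BMP"),
    (0x00D7B0, 0x00D7FF, "Hangul Jamo Extended-B"),
]

def running_totals(rows):
    """Return (name, size, running total) per row; ranges are inclusive."""
    total = 0
    out = []
    for begin, end, name in rows:
        if end < begin:
            raise ValueError(f"{name}: End {end:06X} is below Begin {begin:06X}")
        size = end - begin + 1
        total += size
        out.append((name, size, total))
    return out

for name, size, total in running_totals(ROWS):
    # Stot keeps only the low 16 bits, as in the table.
    print(f"{name:<24} {size:04X}  {total & 0xFFFF:04X}")

# The final Stot of 03E0 has wrapped once, so the real total is 103E0h.
overshoot = 0x103E0 - 0x10000
print(f"over budget by {overshoot:X}h = {overshoot} codepoints")
```

Feeding it the full table is a cheap way to catch a mistyped range or a subtotal that drifted after an edit.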

