
2025-04-18

Unicode subset diagram

To be used as a font tech tree and a set of milestones during design. Inspired by unicodesubsets.miraheze.org (Fandom edition at unicodesubsets.fandom.com), which seems to be made by the same person as https://typedesign.replit.app/; that would explain the odd subsets not mentioned elsewhere. He's very OCD about terminal semigraphics and is making pan-Unicode bitmap fonts (6x12, 6x13, 6x14, 7x15, 8x16, 16x16, 9x18, 18x18, 11x24) mostly by himself, all because of a beef with Unifoundry's Unifont and Kreative Korp's Fairfax. Inb4: also misc-fixed, Terminus, Spleen, and int10h MxPlus, which are a bit behind on character count.

My personal typographical beef with established practice, which the Type Design fonts address, is the Czechoslovak caron on ď ť ľ Ľ, which is confusable with an apostrophe. This not only makes Neogetmanic orthography look ugly, it's also an optical security risk, because ' encloses strings in some programming and scripting languages. The letters are also confusable with d́ t́ ĺ Ĺ, which have an acute on top. The proper caron, consistent with other letters like ȟ and ǩ, can be forced with U+2060 Word Joiner, which is now recommended instead of U+FEFF Zero Width No-Break Space. See: d⁠̌ t⁠̌ l⁠̌ L⁠̌. Sometimes the combining mark lands on top of the next character instead, but that's either a font design skill issue or a nasty hack for emulating zero-advance characters from typewriters. Iosevka handles it just fine. Now if only Type Design maintained an AUR package.
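Here is a minimal Python sketch of why the Word Joiner trick survives normalization, using only the standard unicodedata module. The helper name forced_caron is made up for illustration. Without the joiner, NFC folds the base letter and U+030C into the precomposed letter with the apostrophe-like caron; with U+2060 in between, the sequence stays as three code points and the font has to draw a real combining caron.

```python
import unicodedata

CARON = "\u030C"        # COMBINING CARON
WORD_JOINER = "\u2060"  # WORD JOINER


def forced_caron(base: str) -> str:
    """Hypothetical helper: base letter + Word Joiner + combining caron."""
    return base + WORD_JOINER + CARON


for base in "dtlL":
    # Plain sequence: NFC composes it into the precomposed letter.
    plain = unicodedata.normalize("NFC", base + CARON)
    # Joined sequence: the Word Joiner is a starter, so nothing composes.
    forced = unicodedata.normalize("NFC", forced_caron(base))
    print(f"U+{ord(plain):04X} {plain}  vs  {forced} ({len(forced)} code points)")
```

Whether the result actually looks right still depends on the font placing the mark on the correct side of the joiner.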

Anything less than WGL4 isn't worthy of clogging the font list, unless it's a bitmap font for þe olde VGA text mode, and anything less than DIN 91379 isn't Germany-compliant. The font selection dialog already takes a while to open. All precomposed characters should also be present, so that nobody has to hack around with NFD. There is a mess in the MESes, but MES-4 seems like a nice target. GF_All is less outdated, though. AGL4, AGL5, and DIN 91379 list combining sequences as well, but a good OpenType font should handle those with accent attachment points (GPOS mark positioning) instead of GSUB entries.
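A quick way to tell the two cases apart is to ask the normalizer: if NFC folds a base-plus-mark sequence into one code point, a precomposed character exists and belongs in the font; if not, the sequence can only be rendered through mark attachment. The sketch below assumes nothing beyond Python's unicodedata. The sample sequences are my own picks for illustration, not quoted from the AGL or DIN 91379 lists.

```python
import unicodedata


def has_precomposed(sequence: str) -> bool:
    """True if NFC folds the whole sequence into a single code point."""
    return len(unicodedata.normalize("NFC", sequence)) == 1


# Illustrative samples only (not taken from any subset definition).
samples = {
    "e + acute": "e\u0301",         # composes to U+00E9
    "o + double acute": "o\u030B",  # composes to U+0151
    "a + double acute": "a\u030B",  # no precomposed form in Unicode
}

for label, sequence in samples.items():
    kind = "precomposed" if has_precomposed(sequence) else "sequence only"
    print(f"{label}: {kind}")
```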

Shift-JIS isn't technically an ASCII superset, as it uses the national variant allowances of ISO 646 to replace \ with ¥ and ~ with ‾ (overline). Some fonts compromise by raising the tilde. Neither is it a Unicode BMP subset, because it contains some SIP Hanzi. CNS 11643 is where it gets crazy with Han; Unicode is still trying to catch up. Around 20K Hanzi are missing, but the national bodies have agreed not to propose more than 1K each per IRG working set.


                                      ASCII
                                        95                     ArmSCII
  ┌───────┬─────────────────────────────┼────────────────────→ East Asian MBCSs
  ↓       ↓                             ↓                      ISCII+PASCII
VISCII  CP437                       ISOLatin1
 229     255                           191
  │       │      ┌──────────┬───────────┼────────────────────────────────────────────────┐
  │       │      ↓          ↓           ↓                                                │
  │       │    MES-1    DOSLatin1   WinLatin1                                            │
  │       │     335        255         218                                               │
  │       └──────┼──────────┴────────┘     ├──────────────┬──────────┬─────────┐         │
  │              ↓                         ↓              ↓          ↓         ↓         │
  │             WGL4                     VSECS           W1G    GFLatinCore   AGL1       │
  │             657                       361            603        319       229        │
  │              └──────────┐   ┌──────────┴─────┐        │          │         │         │
  │                         ↓   ↓                ↓        │          ↓         ↓         │
  │                        Subset1             SECS       │        GFAll      AGL2       │
  │                          678                708       │        2420       250        │
  └───────┬─────────────────┤   └───────┐        ├────────┤          │         │         │
          ↓                 ↓           ↓        ↓        │          ↓         ↓         │
    LPTT-1+Subset1   *   KRA-1.0      MES-2      *        │          #        AGL3       │
         2005        │     999         1064               │                   331        │
          ├──────────┘      ├───────┐   ├────────┬────────┤          ┌─────────┤         │
          ↓                 ↓       │   │        ↓        ↓          ↓         ↓         │
       LPTT-1.1          KRA-1.1    │   │     MES-3A   Subset2    DIN91379    AGL4       │
         2201              1098     │   │      2467      1193     781+149    617+2       │
          │                 │       │   │        │        ├────┐     │         │         │
          │                 ↓       ↓   ↓        │        │    ↓     │         ↓         │
          │              KRA-1.2    MES-3B       │        │  MEAS-1  │        AGL5       │
          │                1351      2821        │        │   1999   │      1307+439     │
          │                 │         ├──────────┴────────┼────┘     │         │         │
          │                 ↓         ↓                   │          │         ↓         │
          │               KRA-2     MES-4                 │          │       AGLFN       │
          │                2774      2887                 │          │      586+9999     │
          │                 └────┬────┘          ┌────────┤          ├─────────┤         │
          │                      ↓               ↓        ↓          ↓         │         │
          │                      %            MEAS-1  Subset2.5+     %         │         │
          │                                    1999      2050                  │         │
          │  ┌───────────────────────────────────┴────────┼──────────┐         │         │
          ↓  ↓                                            ↓          ↓         │         │
       LPTT-1.3                                        Subset3    Proper1      │         │
         3501                                            2823       2004       │         │
          │                             ┌─────────────────┤          │         │         │
          ↓                             │                 ↓          ↓         │         ↓
       LPTT-1.4                  %      │              Subset3+   Proper2      │      BHFEU-1
         3996                    │      │                3309       2311       │        3674
          │                      └──────┤                 │                    │         │
          │                             │                 ↓                    │         ↓
          │                             │              Subset4                 │      LGCVTK
          │                             │                3929                  │        3701
          │                             │                 │                    │         │
          │                             ↓                 ↓                    │         │
          │                          UniLGC+           Subset5                 │         │
          │                          ~16384             15327                  │         │

[Rest of the diagram: the branches continue down to UniNonHan (~60000), SRUS-64K (65536), and UniBMP (57086).]
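The small counts at the top of the diagram can be reproduced by counting the non-control characters of an 8-bit code page. This is a sketch under the assumption that the nodes ASCII, ISOLatin1, and WinLatin1 correspond to Python's "ascii", "latin-1", and "cp1252" codecs; that mapping is my guess, and the function name is made up for illustration.

```python
import unicodedata


def printable_count(encoding: str) -> int:
    """Count single-byte values that decode to a non-control character."""
    count = 0
    for byte in range(256):
        try:
            char = bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            continue  # byte is unassigned in this code page
        if unicodedata.category(char) != "Cc":  # skip C0/C1 controls and DEL
            count += 1
    return count


for name in ("ascii", "latin-1", "cp1252"):
    print(name, printable_count(name))
```

The same approach does not reproduce the 255 given for CP437, because Python decodes its low bytes as control characters rather than as the semigraphic glyphs.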

 

Some subsets that didn't fit. Dependent extensions are omitted.

* B0016 - 199 character Spanish 8-bit charset with semigraphics

* CP5555 - very obscure 252 character Czech 8-bit charset with semigraphics

* CP7252 "Boot" - 252 character HyperLatin charset

* ZD18 - 213 character DOS-like code page for Greek and Lithuanian

* IBM EBCDIC and DOS codepages, as well as their hacks - horrible 8-bit mess

* ISO Latin alphabet soup - too confusing

* KOI-8 variants - merge mostly into WGL4

* Trusted555 - a 555 character WGL4-lite of sorts

* Subset1.5 - 1053 character extension of Subset1 (not quite like Subset2)

* Superset1 - 1440 character superset of Subset2 (not quite like Subset2.5)

* Subset2.5 - 1978 character extension of Subset2 (not quite like Subset3)


I have my subsets too:

* UniLGC+ is a combination of all Latin, Greek/Coptic, Cyrillic, Armenian, Georgian, Runic, Cherokee, Glagolitic, Lisu, Phonetic, Punctuation, and Symbol characters in Unicode. Optionally, it also covers other dextrograde Phoenician descendants. It is to be accompanied by LINCUA.

* UniNonHan is all of Unicode except the CJK Unified and Compatibility Ideographs, of which two fonts' worth are already encoded, and the Han precursors. Currently, the non-Han characters appear to fit within the 65536 glyph limit. Once they no longer do, drop Tangut, Khitan, Jurchen, Nüshu, and anything else that looks suspiciously close to Han; these are to be considered optional. Following that, drop other large scripts you don't feel like drawing.

* UniBMP is the Basic Multilingual Plane only, which is also the Windows charmap.exe limit. The filling of the last 1500 free slots before the plane is sealed up for eternal compatibility is tracked on a separate page. Thanks mainly to the PUA and surrogates, there are around 8500 code points left for glyphs of supplemental blocks of BMP scripts in the astral planes, which are optional.

* SRUS-64K means Subjectively Relevant Unicode Subset of 64K characters, and has a separate page listing the ranges, including not yet encoded and Fairfax/UCSUR Private Use Area ones. It doesn't contain Han, but contains the radicals, strokes, and enclosings.
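The UniBMP figure of 57086 and the roughly 8500 spare slots are consistent with a simple subtraction. The breakdown below is my assumption about how the number comes about: all 65536 BMP code points, minus the surrogates, the BMP Private Use Area, and the two noncharacters U+FFFE and U+FFFF. The 32 noncharacters at U+FDD0..U+FDEF are not subtracted in this sketch.

```python
BMP_CODE_POINTS = 0x10000          # U+0000..U+FFFF
SURROGATES = 0xDFFF - 0xD800 + 1   # 2048 surrogate code points
PRIVATE_USE = 0xF8FF - 0xE000 + 1  # 6400 BMP Private Use Area code points
TRAILING_NONCHARACTERS = 2         # U+FFFE and U+FFFF (assumed exclusion)

usable = BMP_CODE_POINTS - SURROGATES - PRIVATE_USE - TRAILING_NONCHARACTERS
spare_slots = BMP_CODE_POINTS - usable

print(usable)       # 57086
print(spare_slots)  # 8450
```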

Eventually, all fonts should reach the 65536 glyph limit, which amounts to roughly 80000 characters with clever glyph reuses (for sans serif fonts). Another way to deal with the glyph limit is to pretend it's Y2K and Unicode 3.0 is the last version, before the humongous CJK Extension B arrived in 3.1. But this approach misses out on many interesting symbols added in the last 25 years.
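Glyph reuse simply means that several code points point at the same outline. A toy sketch, with a hypothetical cmap whose glyph names are invented for illustration (a real font would do this in its cmap table):

```python
# Hypothetical code point -> glyph name mapping; the names are made up.
cmap = {
    0x0041: "A",      # LATIN CAPITAL LETTER A
    0x0391: "A",      # GREEK CAPITAL LETTER ALPHA
    0x0410: "A",      # CYRILLIC CAPITAL LETTER A
    0x0042: "B",      # LATIN CAPITAL LETTER B
    0x0392: "B",      # GREEK CAPITAL LETTER BETA
    0x0412: "B",      # CYRILLIC CAPITAL LETTER VE
    0x0393: "Gamma",  # GREEK CAPITAL LETTER GAMMA
    0x0413: "Gamma",  # CYRILLIC CAPITAL LETTER GHE
}

characters = len(cmap)             # code points covered
glyphs = len(set(cmap.values()))   # distinct outlines actually drawn

print(f"{characters} characters from {glyphs} glyphs")
```

Only the glyph count is bounded by 65536, which is why the character count can run ahead of it.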

I'm especially rooting for a 12x24 and a 16x32 pan-Unicode bitmap font. Terminus and Spleen have these sizes, and Unifont plans to have a 16/32x32 edition for some of the more annoying astral plane blocks. However, 12x24 is more urgent, because an 8x16 font can be scaled to 16x32 using hq2x, to 24x48 using hq3x, and to 32x64 using hq4x. The Linux framebuffer console now supports fonts up to 64x128, in case someone got a 16K display, but still only 512 glyphs. That is barely enough for WGL4 with some ugly reuses (145); in the interest of *NIX nationalism, however, W1G (91 reuses) or SECS (196 reuses) would be better. EurLatGr omits Cyrillic, which Eastern Europeans are more familiar with than they would maybe like. Это передложение с этим фонтом не можно прочитать. ("This sentence cannot be read with this font.") But LatArCyrHeb contains glyphs for sinistrograde scripts, which pose a bidi layout issue. Text laid out by the Unicode Bidirectional Algorithm is annoying to edit.
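The reuse counts are just each subset's size from the diagram minus the 512 glyph slots of the console font. A short sketch, with variable names of my own choosing:

```python
CONSOLE_GLYPH_LIMIT = 512  # glyph slots in a Linux console font

# Character counts as given in the diagram.
subset_sizes = {"WGL4": 657, "W1G": 603, "SECS": 708}

# Every character beyond the limit has to share a glyph with another one.
reuses = {name: size - CONSOLE_GLYPH_LIMIT for name, size in subset_sizes.items()}

for name, count in reuses.items():
    print(f"{name}: {count} reuses")
```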