With the release of Unicode 14, which allocated Arabic Extended-B, there is now only a single unallocated 16-codepoint block left in the original Basic Multilingual Plane, between Kangxi Radicals and Ideographic Description Characters, seemingly intended for more IDCs. There are, however, some gaps in existing blocks. This article lists them all, speculates about their possible purpose, and suggests which characters to fill them with. As per the Unicode Stability Policy, code point assignments are immutable, so once the last code point is filled, the BMP will be forever sealed with all its imperfections, and as a matter of compatibility, will not go away. By no means is this an official proposal, but agreement among font vendors is sufficient for a de facto extension. It has been said on the mailing list that the BMP is not a penny box that needs every slot filled, but there is still a considerable amount of legacy software limited to pure 16 bits (even Windows 11 charmap.exe), so the BMP gaps are prime real estate. The average corporate Windows user doesn't know about BabelMap (RIP Andrew West, now only Unicode 17.0β), and the EnableHexNumpad registry key is off by default. The SMP content can't be crammed into the BMP PUA anymore.
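To make the 16-bit limitation concrete, here is a minimal Python sketch. The helper name utf16_units is made up for illustration. It counts how many UTF-16 code units a character needs: anything in the BMP fits in one unit, while anything above U+FFFF needs a surrogate pair, which is exactly what software limited to pure 16 bits tends to choke on.

```python
def utf16_units(ch: str) -> int:
    """Return the number of 16-bit code units needed to encode ch in UTF-16."""
    # utf-16-le avoids the byte order mark, so every 2 bytes are one code unit.
    return len(ch.encode("utf-16-le")) // 2


# GREEK SMALL LETTER FINAL SIGMA, BITCOIN SIGN, and an SMP emoji for contrast.
for ch in ("\u03c2", "\u20bf", "\U0001F600"):
    print(f"U+{ord(ch):04X}: {utf16_units(ch)} code unit(s)")
```

A character that lands in a BMP gap would behave like the first two; one that overflows to the SMP behaves like the third.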
Unicode 14 added around 130 characters to the BMP, and while this list was being worked on, Unicode 15 added 2 BMP characters:
0CF3 ೳ KANNADA SIGN ANUSVARA ABOVE RIGHT
0ECE ໎ LAO YAMAKKAN
Unicode 15.1 added 5 BMP characters:
2FFC IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM RIGHT
2FFD IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM LOWER RIGHT
2FFE IDEOGRAPHIC DESCRIPTION CHARACTER HORIZONTAL REFLECTION
2FFF IDEOGRAPHIC DESCRIPTION CHARACTER ROTATION
31EF IDEOGRAPHIC DESCRIPTION CHARACTER SUBTRACTION
Unicode 16 added another 17 BMP characters:
0897 ARABIC PEPET
1B4E BALINESE INVERTED CARIK SIKI
1B4F BALINESE INVERTED CARIK PAREREN
1B7F BALINESE PANTI BAWAK
1C89 CYRILLIC CAPITAL LETTER TJE
1C8A CYRILLIC SMALL LETTER TJE
2427 SYMBOL FOR DELETE SQUARE CHECKER BOARD FORM
2428 SYMBOL FOR DELETE RECTANGULAR CHECKER BOARD FORM
2429 SYMBOL FOR DELETE MEDIUM SHADE FORM
31E4 CJK STROKE HXG
31E5 CJK STROKE SZP
A7CB LATIN CAPITAL LETTER RAMS HORN
A7CC LATIN CAPITAL LETTER S WITH DIAGONAL STROKE
A7CD LATIN SMALL LETTER S WITH DIAGONAL STROKE
A7DA LATIN CAPITAL LETTER LAMBDA
A7DB LATIN SMALL LETTER LAMBDA
A7DC LATIN CAPITAL LETTER LAMBDA WITH STROKE
Unicode 17 added these 62 BMP characters:
088F ARABIC LETTER NOON WITH RING ABOVE
0C5C TELUGU ARCHAIC SHRII
0CDC KANNADA ARCHAIC SHRII
1ACF COMBINING DOUBLE CARON
1AD0 COMBINING VERTICAL-LINE-ACUTE
1AD1 COMBINING GRAVE-VERTICAL-LINE
1AD2 COMBINING VERTICAL-LINE-GRAVE
1AD3 COMBINING ACUTE-VERTICAL-LINE
1AD4 COMBINING VERTICAL-LINE-MACRON
1AD5 COMBINING MACRON-VERTICAL-LINE
1AD6 COMBINING VERTICAL-LINE-ACUTE-GRAVE
1AD7 COMBINING VERTICAL-LINE-GRAVE-ACUTE
1AD8 COMBINING MACRON-ACUTE-GRAVE
1AD9 COMBINING SHARP SIGN
1ADA COMBINING FLAT SIGN
1ADB COMBINING DOWN TACK ABOVE
1ADC COMBINING DIAERESIS WITH RAISED LEFT DOT
1ADD COMBINING DOT-AND-RING BELOW
1AE0 COMBINING LEFT TACK ABOVE
1AE1 COMBINING RIGHT TACK ABOVE
1AE2 COMBINING MINUS SIGN ABOVE
1AE3 COMBINING INVERTED BRIDGE ABOVE
1AE4 COMBINING SQUARE ABOVE
1AE5 COMBINING SEAGULL ABOVE
1AE6 COMBINING DOUBLE ARCH BELOW
1AE7 COMBINING DOUBLE ARCH ABOVE
1AE8 COMBINING EQUALS SIGN ABOVE
1AE9 COMBINING LEFT ANGLE CENTRED ABOVE
1AEA COMBINING UPWARDS ARROW ABOVE
1AEB COMBINING DOUBLE RIGHTWARDS ARROW ABOVE
20C1 SAUDI RIYAL SIGN
2B96 EQUALS SIGN WITH INFINITY ABOVE
A7CE LATIN CAPITAL LETTER PHARYNGEAL VOICED FRICATIVE
A7D2 LATIN CAPITAL LETTER DOUBLE THORN
A7D4 LATIN CAPITAL LETTER DOUBLE WYNN
A7F1 MODIFIER LETTER CAPITAL S
FBC3 ARABIC LIGATURE JALLA WA-ALAA
FBC4 ARABIC LIGATURE DAAMAT BARAKAATUHUM
FBC5 ARABIC LIGATURE RAHMATU ALLAAHI TAAALAA ALAYH
FBC6 ARABIC LIGATURE RAHMATU ALLAAHI ALAYHIM
FBC7 ARABIC LIGATURE RAHMATU ALLAAHI ALAYHIMAA
FBC8 ARABIC LIGATURE RAHIMAHUM ALLAAHU TAAALAA
FBC9 ARABIC LIGATURE RAHIMAHUMAA ALLAAH
FBCA ARABIC LIGATURE RAHIMAHUMAA ALLAAHU TAAALAA
FBCB ARABIC LIGATURE RADI ALLAAHU TAAALAA ANHUM
FBCC ARABIC LIGATURE HAFIZAHU ALLAAH
FBCD ARABIC LIGATURE HAFIZAHU ALLAAHU TAAALAA
FBCE ARABIC LIGATURE HAFIZAHUM ALLAAHU TAAALAA
FBCF ARABIC LIGATURE HAFIZAHUMAA ALLAAHU TAAALAA
FBD0 ARABIC LIGATURE SALLALLAAHU TAAALAA ALAYHI WA-SALLAM
FBD1 ARABIC LIGATURE AJJAL ALLAAHU FARAJAHU ASH-SHAREEF
FBD2 ARABIC LIGATURE ALAYHI AR-RAHMAH
FD90 ARABIC LIGATURE RAHMATU ALLAAHI ALAYH
FD91 ARABIC LIGATURE RAHMATU ALLAAHI ALAYHAA
FDC8 ARABIC LIGATURE RAHIMAHU ALLAAH TAAALAA
FDC9 ARABIC LIGATURE RADI ALLAAHU TAAALAA ANH
FDCA ARABIC LIGATURE RADI ALLAAHU TAAALAA ANHAA
FDCB ARABIC LIGATURE RADI ALLAAHU TAAALAA ANHUMAA
FDCC ARABIC LIGATURE SALLALLAHU ALAYHI WA-ALAA AALIHEE WA-SALLAM
FDCD ARABIC LIGATURE AJJAL ALLAAHU TAAALAA FARAJAHU ASH-SHAREEF
Unicode 18 will add these 31 BMP characters:
0558 MODIFIER LETTER ARMENIAN SMALL EH
058B MODIFIER LETTER ARMENIAN SMALL INI
058C MODIFIER LETTER ARMENIAN SMALL YI
05C8 HEBREW POINT SHEVA NA MUDGASH
05C9 HEBREW POINT DAGESH HAZAQ MUDGASH
0984 BENGALI SIGN COMBINING ANUSVARA ABOVE
09FF BENGALI LETTER SANSKRIT BA
0B53 ORIYA SIGN DOT ABOVE
0B54 ORIYA SIGN DOUBLE DOT ABOVE
1ADE COMBINING GRAVE-DOT
1ADF COMBINING DOT-ACUTE
1AEC COMBINING CARON-ACUTE
1AED COMBINING VERTICAL-LINE-DOUBLE-ACUTE
1AEE COMBINING DOUBLE GRAVE ACCENT BELOW
1AEF COMBINING DOUBLE ACUTE ACCENT BELOW
1AF0 COMBINING DOUBLE COMMA ABOVE
208F MODIFIER LETTER HIGH AND LOW VERTICAL LINE
209D LATIN SUBSCRIPT SMALL W
209E LATIN SUBSCRIPT SMALL Y
20C2 RUFIYAA SIGN
20C3 UAE DIRHAM SIGN
20C4 OMANI RIAL SIGN
2E60 WIGGLY EXCLAMATION MARK
2E61 INVERTED WIGGLY EXCLAMATION MARK
2E62 LEFT PARENTHESIS WITH MIDDLE RING
2E63 RIGHT PARENTHESIS WITH MIDDLE RING
A7DD LATIN CAPITAL LETTER CLOSED OMEGA
A7E2 LATIN CAPITAL LETTER R WITH LONG LEG
AB6C LATIN CAPITAL LETTER SCRIPT R
AB6D LATIN CAPITAL LETTER SCRIPT R WITH RING
And maybe Unicode 19 will add these 9 BMP characters:
0C70 TELUGU SIGN SPACING CANDRABINDU
1879 MONGOLIAN LETTER ALTERNATE UE
1AF1 COMBINING GRAVE-ACUTE-MACRON
1AF2 COMBINING INVERTED LAZY S ABOVE
1AF3 COMBINING COMMA ABOVE AND ACUTE
1C8B CYRILLIC SMALL LETTER YERU WITH CONNECTING BAR
20C5 BELARUSSIAN RUBLE SYMBOL
20CF RUBLE SIGN WITH DOUBLE VERTICAL STEM
AB6E LATIN CAPITAL LETTER U WITH LEFT HOOK
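The gap lists that follow can be roughly cross-checked against whatever Unicode database ships with your Python. This is only a sketch: the function name unassigned is my own, and the bundled database usually lags behind the newest Unicode version, so code points assigned in recent releases may still show up as gaps.

```python
import unicodedata


def unassigned(first: int, last: int) -> list[int]:
    """Code points in [first, last] that the local database reports as Cn."""
    # Note: Cn also covers noncharacters such as U+FDD0..U+FDEF and U+FFFE..U+FFFF.
    return [cp for cp in range(first, last + 1)
            if unicodedata.category(chr(cp)) == "Cn"]


print("Unicode data version:", unicodedata.unidata_version)
# Example: the Greek and Coptic block, U+0370..U+03FF.
print(" ".join(f"{cp:04X}" for cp in unassigned(0x0370, 0x03FF)))
```

Running it over 0x0000..0xFFFF gives a raw list of candidates; the block-by-block counts in this article are still done by hand.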
Greek (9):
0378 GREEK CAPITAL LETTER ANTISIGMA
0379 GREEK SMALL LETTER ANTISIGMA (also final, also fit 3legged tau)
0380 GREEK CAPITAL LETTER IOTA WITH DIALYTIKA AND TONOS
0381 GREEK CAPITAL LETTER UPSILON WITH DIALYTIKA AND TONOS
0382 GREEK SMALL LETTER LAMDA WITH TONOS
0383 GREEK SMALL LETTER RHO WITH TONOS
038B GREEK CAPITAL LETTER LAMDA WITH TONOS
038D GREEK CAPITAL LETTER RHO WITH TONOS
03A2 GREEK CAPITAL LETTER FINAL SIGMA
A gap was left at the beginning of this block so that the script could begin at a multiple of 80h (128). Then there were, and still are, gaps in the capital letters with tonos. Over time some obscure characters and punctuation were added. There is a blank spot between Ρ and Σ, where the small letters have ς. Since ß got its capital ẞ for all-caps signage despite originally being a ligature of small letters (ſʒ or ſs), and thus in capital form has always been written SS, the ς deserves its capital too.
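The current state of the sigma and sharp s case pairs can be seen with plain Python string methods. This is a small sketch relying only on the standard case mappings; nothing here is specific to this proposal.

```python
import unicodedata

# Both sigma forms share a single capital; the slot between RHO and SIGMA is empty.
print("σ".upper(), "ς".upper())          # both map to U+03A3
print(unicodedata.category("\u03a2"))    # Cn means unassigned in the local database

# Sharp s: lowercasing the capital works, but uppercasing the small letter
# still goes through the traditional SS expansion.
print("ẞ".lower(), "ß".upper())
```

So a round trip through upper() and lower() loses the final form of sigma, which is the asymmetry a capital final sigma at U+03A2 would remove.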
Armenian (2):
0530 ARMENIAN CAPITAL LETTER TURNED AYB
0557 ARMENIAN CAPITAL LIGATURE ECH YIWN
These 3 gaps in capitals have their lowercase analogues assigned, so naturally they belong to their uppercase variant. Nagorno Karabakh AKA Artsakh is gone, so no new currency. Of these 3 gaps, 0558 has been assigned some dialectological letter instead of a capital variant.
Hebrew (22):
0590 HEBREW SPACE
05CA BABYLONIAN POINT TSERE / HEBREW PUNCTUATION ALTERNATE PASEQ
05CB BABYLONIAN POINT HIRIQ
05CC BABYLONIAN POINT HOLAM
05CD BABYLONIAN POINT QUBUTS
05CE BABYLONIAN POINT HITFA
05CF BABYLONIAN POINT SEGOL
05EB Lost letter or ligature
05EC Lost letter or ligature
05ED Lost letter or ligature
05EE Lost letter or ligature
05F5 PALESTINIAN POINT PATAH / HEBREW PUNCTUATION ELONGATED GERESH / HEBREW POINT VARIKA [1.0.1]
05F6 PALESTINIAN POINT QAMATS
05F7 PALESTINIAN POINT TSERE / HEBREW LETTER CONNECTED QOF
05F8 PALESTINIAN POINT HIRIQ
05F9 PALESTINIAN POINT HOLAM
05FA PALESTINIAN POINT SEGOL
05FB PALESTINIAN POINT QUBUTS
05FC BABYLONIAN POINT DOTTED PATAH
05FD BABYLONIAN POINT DOTTED QAMATS
05FE BABYLONIAN POINT DOTTED QUBUTS
05FF HEBREW LETTER WIDENER
Alternate vocalization marks are proposed, overflowing to SMP. Especially: PALESTINIAN POINT RAFE, PALESTINIAN POINT DAGESH, BABYLONIAN POINT DIGSHA, BABYLONIAN POINT MAPIQ, BABYLONIAN AND PALESTINIAN POINT SIN MARK, BABYLONIAN POINT QIFYA, BABYLONIAN AND PALESTINIAN POINT SHIN MARK. This block is unusually sparsely assigned for being located here.
Syriac (3):
070E Punctuation
074B Diacritic or extra letter
074C Diacritic or extra letter
Thaana (14):
07B2 Extra letter or punctuation
07B3 Extra letter or punctuation
07B4 Vowel sign (Å)
07B5 Vowel sign (ÅÅ)
07B6 Vowel sign (Ä)
07B7 Vowel sign (ÄÄ)
07B8 Vowel sign (Y)
07B9 Vowel sign (YY)
07BA Vowel sign (Ö)
07BB Vowel sign (ÖÖ)
07BC Vowel sign (Ü)
07BD Vowel sign (ÜÜ)
07BE Virama
07BF THAANA SPACE
Clearly a hacked abjad, however the letter shapes used to be 2 sets of numbers. Sort of like old ciphers, such as 1312 for ACAB or 88 for HH or 241913 for BDSM. Might need some more vowel signs. Not every language is happy with just 5 short and 5 long vowels. Also the existing vowel sign character names show that long English vowels are broken.
N'Ko (2):
07FB NKO SHORT LAJANYALAN / Some punctuation or tone mark
07FC Another punctuation or tone mark
Possibly new currency sign for RTL writing (just banana republic things), punctuation, letters, or tone marks for a newly-adopting language. If none, then at least duodecimal digits. Some phonetic extensions were proposed to the SMP.
Samaritan (3):
082E Some vowel sign or modifier
082F Another vowel sign or modifier
083F SAMARITAN SPACE
Were it not used for religious purposes like Glagolitic, it would end up in the SMP along with Phoenician.
Mandaic (3):
085C Some mark
085D Another mark
085F MANDAIC SPACE
Syriac Supplement (5):
086B Some extension letter
086C Some extension letter
086D Some extension letter
086E Some extension letter
086F SYRIAC SPACE
Christian Sogdian is said to be unified with Syriac. In this block, there are some Malayalam extension letters. So if any other script is to be unified with Syriac, there are 3 free spaces in the original block and 5 more in this supplement.
Arabic Extended B (5):
0892 Mid-level Hamzah [nonapproved]
0893 Sindhi heh
0894 Kurdish heh
0895 ARABIC DAMMA BELOW
0896 ARABIC VOWEL SIGN SMALL V BELOW
The last few free codepoints in BMP Arabic blocks are likely for modern use case omissions, the SMP Arabic Extended C seems to be the default place to go. Not sure why Wolio went with Vs instead of damma below for O (called low waw) and sukun below for E.
There has been a proposal complaining about the letters connecting between words when not using spaces, and not connecting to a final presentation form, trying to invent a more explicit form of the final form called closing form. This looks like a job for Zero Width Non-Joiner or a Zero Width Space.
Bengali (30):
Gurmukhi (48):
Gujarati (37):
Oriya (35):
Tamil (56):
Telugu (26):
Kannada (36):
Malayalam (10):
CANDRA E, CANDRA O, VOWEL SIGN CANDRA E, VOWEL SIGN CANDRA O
0D50 MALAYALAM CONSONANT SIGN CHILLU [nonapproved] / KETTI AEDA-PILLA
DIGA AEDA-PILLA, KETTI IS-PILLA, DIGA IS-PILLA, KETTI PAA-PILLA / UTADA, ANUTADA, GRAVE, ACUTE
Sinhala (37):
Indic scripts have gaps where full fat Devanagari has that character, a leftover from ISCII, which was published just as Unicode was being proposed. The inverse isn't always true as the gaps were filled independently. Each script also has a 16-character space for its very own symbols. Tamil has been denied a reencoding request to atomic syllables, so this form is doomed to remain in the PUA at E100h to E32Fh, which collides with some (U)CSUR scripts. Despite having the most gaps, there is a Tamil Supplement block in SMP with 51 out of 64 codepoints full. What were they thinking at the UTC I know not.
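The parallel layout inherited from ISCII is easy to see by looking at the same offset in every block. The sketch below assumes the nine ISCII-derived blocks start at the bases listed in INDIC_BASES (Sinhala is left out since it does not follow that layout); the helper name row is made up.

```python
import unicodedata

INDIC_BASES = {
    "Devanagari": 0x0900, "Bengali": 0x0980, "Gurmukhi": 0x0A00,
    "Gujarati": 0x0A80, "Oriya": 0x0B00, "Tamil": 0x0B80,
    "Telugu": 0x0C00, "Kannada": 0x0C80, "Malayalam": 0x0D00,
}


def row(offset: int) -> dict[str, str]:
    """Name of the character at the same offset in every block ('' if none)."""
    return {script: unicodedata.name(chr(base + offset), "")
            for script, base in INDIC_BASES.items()}


for offset in (0x15, 0x16):   # KA and KHA in the ISCII-derived order
    for script, name in row(offset).items():
        print(f"{offset:02X} {script:<10} {name or '(gap)'}")
```

Offset 15h is LETTER KA everywhere, while offset 16h (KHA) is a gap in Tamil, which has no such letter. That pattern repeated across the consonant rows is where most of the Tamil gaps come from.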
Thai (9+32):
0E00 THAI MORPHOLOGICAL BOUNDARY
0E3B THAI VOWEL SIGN MAI KON
0E3C THAI SEMIVOWEL SIGN LO
0E3D THAI SEMIVOWEL SIGN NYO
0E3E Some vowel sign or punctuation or currency sign
0E5C THAI HO NO
0E5D THAI HO MO
0E5E THAI KHMU GO or duodecimal digit
0E5F THAI KHMU NYO or duodecimal digit
0E60..0E6F
0E70 THAI PHONETIC ORDER VOWEL SIGN SARA E [1.0.1]
0E71 THAI PHONETIC ORDER VOWEL SIGN SARA AE [1.0.1]
0E72 THAI PHONETIC ORDER VOWEL SIGN SARA O [1.0.1]
0E73 THAI PHONETIC ORDER VOWEL SIGN SARA MAI MUAN [1.0.1]
0E74 THAI PHONETIC ORDER VOWEL SIGN SARA MAI MALAI [1.0.1]
0E75..0E7F
Nothing has been added since Unicode 1.0.0, and there are no non-approval notices either. 2 blocks of 16 wasted, but the 2nd one used to have phonetic order vowel signs. There seem to be some concerns about line breaking, which requires morphological analysis as Thai doesn't use spaces. Idon'tunderstandhowcananyonekeepwritinglikethis.
Lao (13+32):
0E80 LAO MORPHOLOGICAL BOUNDARY
0E83 LAO LETTER KHO KHUAT
0E85 LAO LETTER KHO KHON
0E8B LAO LETTER SO SO
0EA4 LAO LETTER RU
0EA6 LAO LETTER LU
0EBE Some vowel sign or punctuation or currency sign
0EBF Currency sign
0EC5 LAO LAKKHANGYAO
0EC7 LAO MAITAIKHU
0ECF LAO FONGMAN
0EDA LAO ANGKHANKHU
0EDB LAO KHOMUT
0EE0..0EFF
0EF0 LAO PHONETIC ORDER VOWEL SIGN E [1.0.1]
0EF1 LAO PHONETIC ORDER VOWEL SIGN EI [1.0.1]
0EF2 LAO PHONETIC ORDER VOWEL SIGN O [1.0.1]
0EF3 LAO PHONETIC ORDER VOWEL SIGN AY [1.0.1]
0EF4 LAO PHONETIC ORDER VOWEL SIGN AI [1.0.1]
0EF5..0EFF
Some letters for Sanskrit and Pali were added, and later Khmu Go and Khmu Nyo. Gaps seem to line up with Thai. 0E3E or 0EBF would have been a good place for Bitcoin sign, considering people were misusing ฿. 2 blocks of 16 wasted again, but the 2nd one used to have phonetic order vowel signs too.
Tibetan (13+32):
0F48 A companion to Nya (Lya?)
0F6D Some letter (Dya?)
0F6E Another letter or punctuation (Tya?)
0F6F Yet another letter or punctuation (subjoined Tya?)
0F70 Some vowel sign (schwa?)
0F98 Subjoined companion to Nya (Lya?)
0FBD Some subjoined letter (Dya?)
0FCD ying-yeng-yang-yong-yung (pentagrammical)
0FDB RIGHT-FACING SVASTI SIGN ROTATED 45 DEGREES
0FDC LEFT-FACING SVASTI SIGN ROTATED 45 DEGREES
0FDD RIGHT-FACING SVASTI SIGN WITH DOTS ROTATED 45 DEGREES
0FDE LEFT-FACING SVASTI SIGN WITH DOTS ROTATED 45 DEGREES
0FDF TIBETAN SPACE (or 6-yingyang)
0FE0..0FFF
Was reencoded in Unicode 2.0. Has swastikas, or Hakenkreuzer if you are miseducated about their historical origins. Nazis on 4chan would definitely make use of one rotated 45 degrees, maybe even in a combining circle. For Easterners, these continue to be symbols of peace. Nushu script in SMP contains 𛋇 and 𛈄. The Tibetan block also contains the BDSM symbol ࿋. Typically it's mirrored, but one has to count with looking at oneself in the mirror, and also some cameras mirror the image. 2 blocks of 16 wasted yet again, to the total of 96 codepoints, enough for a new script, or 3 Philippinic scripts, 1 per each wastage.
Georgian (8):
10C6 GEORGIAN CAPITAL LETTER FI
10C8 GEORGIAN CAPITAL LETTER ELIFI
10C9 GEORGIAN CAPITAL LETTER TURNED GAN
10CA GEORGIAN CAPITAL LETTER AIN
10CB Some text separator / GEORGIAN LETTER U-BRJGU [nonapproved]
10CC MODIFIER GEORGIAN CAPITAL LETTER NAR
10CE GEORGIAN CAPITAL LETTER HARD SIGN
10CF GEORGIAN CAPITAL LETTER LABIAL SIGN
Some capital letters are missing. There is also a 3rd and 4th case somewhere else. Actually seems more like 2 scripts with 2 cases each. Considering case is not a typical feature in the global scale, Georgian is just mental.
Ethiopic (26):
135B ETHIOPIC SYLLABLE -YA / Some combining mark
135C ETHIOPIC SYLLABLE -YA / Some combining mark
137D ETHIOPIC NUMBER MILLION
137E ETHIOPIC NUMBER HUNDRED MILLION
137F ETHIOPIC NUMBER TEN MILLIARD
Mostly syllabic gaps. A-U-I-AA-EE-E-O-OA/WA(A). Could have used the space in the supplement blocks instead of spreading Ethiopic across planes.
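The 8-slot vowel order can be read straight off the character names. A small sketch, with syllable_row being a made-up helper that takes the last word of each name in a row of 8:

```python
import unicodedata


def syllable_row(first: int) -> list[str]:
    """Last word of each name in an 8-slot Ethiopic row ('-' marks a gap)."""
    out = []
    for cp in range(first, first + 8):
        name = unicodedata.name(chr(cp), "")
        out.append(name.rsplit(" ", 1)[-1] if name else "-")
    return out


print(syllable_row(0x1200))   # a full row
print(syllable_row(0x1248))   # a labialized row with gaps
```

Full rows fill all 8 slots, while the labialized rows only use some of them, so the gaps fall at predictable offsets within each row.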
Ethiopic Supplement (6):
139A Tone mark
139B Tone mark
139C Tone mark
139D Tone mark
139E Tone mark or punctuation
139F ETHIOPIC SPACE
Cherokee (4):
13F6 Some letter
13F7 Some punctuation
13FE Some small letter
13FF CHEROKEE SPACE
Apparently V is syllabic. Looks like a hybrid between Lisu and Deseret. Small letters are in Cherokee Supplement, but that block is full.
Ogham (3):
169D OGHAM ARROW MARK
169E OGHAM REVERSED ARROW MARK
169F OGHAM OUTER SPACE
If the lines look like >-----<, then why not add <----->?
Runic (7):
16F9 Some letter
16FA Another letter
16FB Yet another letter
16FC Some punctuation
16FD Another punctuation
16FE SCHUTZSTAFFEL SIGN
16FF RUNIC SPACE
We need to capitalize on 4chan Nazis who will decorate their general threads with Nazi insignia, so the ᛋᛋ sign goes well with more Swastikas in Tibetan. May also throw in symbols of other "forbidden" "unconstitutional" organizations in Germany. How long before a single letter (not just 2) is declared a "hate symbol" and removed? Wait, that happened on Twitch already with D. And the German ban extended to ᛟ as well. And also some people were considering banning Z and V due to Ruᛋᛋia doing Mongol shit in Ukraine, painting A,Z,O,V,X on their (mostly former) vehicles. Azov logo is Ƶ, by the way, which some pointed to me looks like 3/4-swastika. As we say in Czechia, forbidden fruit tastes the best. Banilingus people are either kinky or don't understand brat logic.
Tagalog (9):
1716..171E
Also called Baybayin.
Hanunóo (9):
1737 HANUNOO SIGN VIRAMA
1738..173F
Buhid (12):
1754 BUHID SIGN VIRAMA
1755 BUHID SIGN PAMUDPOD
1756..175F
Tagbanwa (14):
176D TAGBANWA LETTER RA
1771 TAGBANWA LETTER HA
1774 TAGBANWA SIGN VIRAMA
1775 TAGBANWA SIGN PAMUDPOD
1776..177F
Philippinic scripts work the same as Indic scripts, but each script now requires only 2 16-codepoint blocks, with a lot of room to spare for extension. Only 18 phonemes is abysmally few; English has 40. They may be good scripts for toki pona, but would need 2 more vowels.
Khmer (14):
17DE KHMER SIGN CANDRABINDU
17DF KHMER SIGN LAAK [nonapproved] (originally for U+17DD)
17EA..17EF KHMER HEXADECIMAL DIGIT TEN..FIFTEEN
17FA Some symbol
17FB Some symbol
17FC Some symbol
17FD Some symbol
17FE Some symbol
17FF KHMER SPACE
We need more hexadecimal digit systems in this age of binary computers. Nyström might have been trolling the decimal committee, but had the intuition that base-16 will be very useful. Khmer is said to have the most letters, so why not add some digits to the set of boring 10?
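For illustration only, here is how such digits would be parsed. Python already reads the ten encoded Khmer digits as decimal; the function parse_khmer_hex and the values 10 to 15 for U+17EA..U+17EF are purely hypothetical, since those code points are unassigned.

```python
KHMER_ZERO = 0x17E0

# Python already understands the ten encoded Khmer digits.
print(int("\u17e1\u17e2\u17e3"))   # Khmer digits one, two, three


def parse_khmer_hex(text: str) -> int:
    """Parse U+17E0..U+17E9 plus the HYPOTHETICAL digits U+17EA..U+17EF as base 16."""
    value = 0
    for ch in text:
        digit = ord(ch) - KHMER_ZERO
        if not 0 <= digit <= 15:
            raise ValueError(f"not a Khmer (hexa)decimal digit: U+{ord(ch):04X}")
        value = value * 16 + digit
    return value


# Two copies of the hypothetical digit fifteen, i.e. FF under this assumption.
print(parse_khmer_hex("\u17ef\u17ef"))
```

Because the six extra digits would directly follow the existing ten, the value is a plain offset from U+17E0, with none of the gap between 9 and A that ASCII hex has.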
Mongolian (17):
181A..181F Birga variations or hexadecimal digits
187A..187F Extra letters
18AB..18AE Extra letters
18AF MONGOLIAN SPACE
Free Variation Selector 4 was needed to get rid of a nasty M$ hack with ᠀ and a Zero Width Non-joiner. Some fonts may place the premade glyphs behind the digits.
Unified Canadian Aboriginal Syllabics Extended (10):
18F6 Something that looks like O
18F7 Something that looks like I (wide spacing)
18F8 Something that resembles 6 more closely
18F9 Something that resembles 9 more closely
18FA Something that resembles ð more closely
18FB Something that resembles e more closely
18FC open box up
18FD open box right
18FE open box down
18FF open box left
Here I need the missing shapes for my Anti-Dyslexic Pigpen-Moon Style font. Some (И/N/Z/S) have been added to the UCAS Extended Additional block in the SMP. I can't find certain shape rotations without any dots, they must be hidden somewhere. Also combining dot would be helpful. There should probably be a Pigpen block in the SMP or UCSUR.
Limbu (12):
191F LIMBU SPACE
192C..192F
193C..193F
1941..1943
Tai Le (13):
196E
196F
1975..197E
197F TAI LE SPACE
New Tai Lue (13):
19AC..19AF
19CA..19CE
19CF NEW TAI LUE SPACE
19DB..19DD
Buginese (2):
1A1C BUGINESE VOWEL SIGN similar to AE in appearance
1A1D BUGINESE VIRAMA
Tai Tham (17):
1A5F Some consonant sign
1A7D Some sign
1A7E Some cryptogrammic sign
1A8A..1A8F TAI THAM HORA HEXADECIMAL DIGIT TEN..FIFTEEN
1A9A..1A9F TAI THAM THAM HEXADECIMAL DIGIT TEN..FIFTEEN
1AAE Some punctuation sign
1AAF TAI THAM SPACE
There are 2 sets of digits. They could be combined for base 20 as is, or for base 32 after hexadecimal extension.
Combining Diacritical Marks Extended (12):
1AF4 COMBINING COLON ABOVE
1AF5 COMBINING COLON BELOW
1AF6 COMBINING HOOK-ACUTE
1AF7 COMBINING LATIN SMALL LETTER TURNED A
1AF8 COMBINING LATIN SMALL LETTER H WITH HOOK
1AF9 COMBINING LATIN SMALL LETTER DOTLESS I
1AFA COMBINING LATIN SMALL LETTER J
1AFB COMBINING LATIN SMALL LETTER ENG
1AFC COMBINING ...
1AFD COMBINING LATIN SMALL LETTER TURNED R
1AFE COMBINING LATIN SMALL LETTER Y
1AFF COMBINING LATIN SMALL LETTER EZH
Seems to have overflowed into the SMP, into the Combining Diacritical Marks Extended-A block next to Latin Extended-F. Lately, each new proposal regarding phonetic orthographies seems to propose different combining characters for these BMP codepoints, so it can get very confusing. The compound tone diacritics no longer seem to be valid candidates for this block.
Balinese (1):
1B4D Some letter
Batak (8):
1BF4 Consonant sign
1BF5 Consonant sign
1BF6 Consonant sign
1BF7 Consonant sign
1BF8 Punctuation symbol
1BF9 Punctuation symbol
1BFA Punctuation symbol
1BFB Punctuation symbol
Lepcha (6):
1C38 Extra sign
1C39 Extra sign
1C3A Extra punctuation
1C4A LEPCHA DUODECIMAL DIGIT TEN
1C4B LEPCHA DUODECIMAL DIGIT ELEVEN
1C4C Extra letter
Cyrillic Extended C (4):
1C8C CYRILLIC CAPITAL LETTER RJE / RZHE
1C8D CYRILLIC SMALL LETTER RJE / RZHE
1C8E CYRILLIC CAPITAL LETTER DOTTED BYELORUSSIAN-UKRAINIAN I
1C8F CYRILLIC SMALL LETTER DOTLESS BYELORUSSIAN-UKRAINIAN I
Modifier letters and subscripts are being put in the SMP. Some obscure cyrillization orthography may be found that would need some extra letters. There aren't any precomposed acute-accented vowels, though I see these in dictionaries all the time. They probably won't be accepted, and since you can basically put the acute above any letter that happens to be stressed, they would need too many code points for this small gap. Letter Tje is being added to complete the ďťňľ set. We in Western Slavonia also have ř, for which it would be nice to have a Cyrillic letter so I don't have to use Ҏ or Ԗ.
Georgian Extended (2):
1CBB
1CBC
Mtavruli, the uppercase for the lowercase.
Sundanese Supplement (8):
1CC8 Punctuation or symbol
1CC9 Punctuation or symbol
1CCA Punctuation or symbol
1CCB Punctuation or symbol
1CCC Punctuation or symbol
1CCD Punctuation or symbol
1CCE Punctuation or symbol
1CCF SUNDANESE SPACE
Vedic Extensions (5):
1CFB
1CFC
1CFD
1CFE
1CFF
Greek Extended (23):
1F16 GREEK SMALL LETTER EPSILON WITH PSILI AND PERISPOMENI
1F17 GREEK SMALL LETTER EPSILON WITH DASIA AND PERISPOMENI
1F1E GREEK CAPITAL LETTER EPSILON WITH PSILI AND PERISPOMENI
1F1F GREEK CAPITAL LETTER EPSILON WITH DASIA AND PERISPOMENI
1F46 GREEK SMALL LETTER OMICRON WITH PSILI AND PERISPOMENI
1F47 GREEK SMALL LETTER OMICRON WITH DASIA AND PERISPOMENI
1F4E GREEK CAPITAL LETTER OMICRON WITH PSILI AND PERISPOMENI
1F4F GREEK CAPITAL LETTER OMICRON WITH DASIA AND PERISPOMENI
1F58 GREEK CAPITAL LETTER UPSILON WITH PSILI
1F5A GREEK CAPITAL LETTER UPSILON WITH PSILI AND VARIA
1F5C GREEK CAPITAL LETTER UPSILON WITH PSILI AND OXIA
1F5E GREEK CAPITAL LETTER UPSILON WITH PSILI AND PERISPOMENI
1F7E GREEK SMALL LETTER RHO WITH VARIA (or Lambda, Mu, Nu?)
1F7F GREEK SMALL LETTER RHO WITH OXIA
1FB5 GREEK SMALL LETTER ALPHA WITH OXIA AND PERISPOMENI AND YPOGEGRAMMENI
1FC5 GREEK SMALL LETTER ETA WITH OXIA AND PERISPOMENI AND YPOGEGRAMMENI
1FD4 GREEK SMALL LETTER LAMBDA WITH PSILI (or Mu or Nu?)
1FD5 GREEK SMALL LETTER LAMBDA WITH DASIA
1FDC GREEK CAPITAL LETTER LAMBDA WITH DASIA
1FF0 GREEK SMALL LETTER OMEGA WITH VRACHY
1FF1 GREEK SMALL LETTER OMEGA WITH MACRON
1FF5 GREEK SMALL LETTER OMEGA WITH OXIA AND PERISPOMENI AND YPOGEGRAMMENI
1FFF Some polytonic mark
Clearly arranged to a table with some slots missing. No idea if it makes sense since it looks like as if Alexander the Great came all the way into Vietnam, but whatever, will make for more fun with text effects.
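The missing table slots show up when normalizing combining sequences. A minimal sketch, assuming psili (U+0313) plus perispomeni (U+0342) as the accent pair; the helper name compose is made up. Eta has a precomposed slot for this pair, while epsilon hits the gap at 1F16.

```python
import unicodedata

PSILI, PERISPOMENI = "\u0313", "\u0342"


def compose(base: str) -> str:
    """NFC-normalize base + psili + perispomeni."""
    return unicodedata.normalize("NFC", base + PSILI + PERISPOMENI)


for base in "ηε":
    result = compose(base)
    print(base, [f"U+{ord(c):04X}" for c in result])
```

Eta collapses to the single code point U+1F26, while epsilon only gets as far as U+1F10 followed by a leftover combining perispomeni. A filled gap could not change this for existing text anyway, since normalization is frozen by the stability policy.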
North Korean KPS 9566 has a complete Greek lowercase subscript and superscript set, in addition to Latin, presumably for plaintext math.
General Punctuation (1):
2065 HORIZONTAL FRACTION LINE (for KPS 9566 1/2 1/3 1/4 2/3 1/4 3/4)
Can't paste this. This block is very "spacy". U+2065 could also be ARTIFICIAL INTELLIGENCE GENERATED TEXT INDICATOR, to assist with filtering AI slop from training data.
Superscripts and subscripts (2):
2072 SUPERSCRIPT SOLIDUS
2073 SUPERSCRIPT GREEK SMALL LETTER PI
Superscripts are called modifier letters and there are many of them except Q. They are very useful for indicating whisper or small print, despite not being meant for that. Subscripts are needed for plaintext junior high school math, despite arguments that in PhD math, a subscript can hold any expression and is therefore a markup and people should use LaTeX $_{z}$, HTML <sub>z</sub>, ECMA-48 \e[74mz\e[75m, or C1 \x8Bz\x8C, or get a proper text editor that can do half line feed up and down like old typewriters. The Amish and 1970s teletype cavemen are onto something here.
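For the plaintext route, a translation table is all it takes. The sketch below covers digits only, with made-up helper names. It also shows why the gaps sit at 2072 and 2073: superscript 1, 2 and 3 were already in Latin-1 (00B9, 00B2, 00B3), and 2071 went to superscript i.

```python
# Superscript digits are split between Latin-1 and the 2070 block.
SUPERSCRIPT = str.maketrans(
    "0123456789",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
)
# Subscript digits are contiguous at U+2080..U+2089.
SUBSCRIPT = str.maketrans("0123456789", "".join(chr(0x2080 + d) for d in range(10)))


def to_superscript(text: str) -> str:
    """Replace ASCII digits with superscript digits, leaving everything else alone."""
    return text.translate(SUPERSCRIPT)


def to_subscript(text: str) -> str:
    """Replace ASCII digits with subscript digits, leaving everything else alone."""
    return text.translate(SUBSCRIPT)


print("x" + to_superscript("10"), "H" + to_subscript("2") + "O")
```

Letters are where it falls apart, since only part of the alphabet exists in either form, which is the whole complaint above.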
Currency symbols (9):
20C6 Social Credit token
20C7 EU emission allowance certificate token
20C8 HUNGARIAN FORINT SYMBOL
20C9 POLISH ZLOTY SYMBOL
20CA DOGECOIN
20CB WOWNERO
20CC ZCASH SYMBOL (also a logo of a Czech Punk band Zputnik)
20CD ETHEREUM SYMBOL
20CE SATOSHI SYMBOL
With governments inventing new fiat currencies and their fancy symbols, this block won't be the last one to be filled completely. Especially with inflation getting out of bounds. Bank account is nothing more than a custodial fiat shitcoin wallet. Also I need a symbol for Monero, something like M̶ or M=, if the font co-operates. Coming up at 1DF4A as along with for milli-Monero.
There was a proposal for an Ethereum symbol; apparently Ξ, ≡, or 𐤎 isn't used anymore, apparently due to Satoshi now using Ξ̩̍. The authors made the mistake of either keeping the proposed symbol as a brand asset on their website, as logos are forbidden in Unicode, or not straightening the bend, so that it looks like ⬧ with a gap or an inverted version of ⟠. Script Ad-Hoc stopped it before formal nonapproval, so possibly either the Ethereum Foundation will disclaim the trademark or the proposed symbol will be simplified further. However, if I can draw it in 8x16 as 20CE:000008081C1C3E3E7F3E1C2A141C0808, it's not trademark-worthy.
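The 8x16 string above can be previewed without a font editor. This sketch assumes it follows the layout used by GNU Unifont .hex files: code point, colon, then 32 hex digits, one byte per row, most significant bit on the left. The function name render_hex_glyph is made up.

```python
def render_hex_glyph(line: str) -> list[str]:
    """Turn a 'CODEPOINT:HEX' line for an 8x16 glyph into 16 text rows."""
    _, bitmap = line.strip().split(":")
    if len(bitmap) != 32:
        raise ValueError("expected 32 hex digits for an 8x16 glyph")
    rows = []
    for i in range(0, 32, 2):
        byte = int(bitmap[i:i + 2], 16)
        # Bit 7 is the leftmost pixel of the row.
        rows.append("".join("#" if byte & (0x80 >> bit) else "."
                            for bit in range(8)))
    return rows


print("\n".join(render_hex_glyph("20CE:000008081C1C3E3E7F3E1C2A141C0808")))
```

The output is 16 rows of 8 characters, with # for set pixels, which is enough to judge whether the shape is simple enough to argue it is not a logo.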
Combining Diacritical Marks For Symbols (15):
20F1
20F2
20F3
20F4
20F5
20F6
20F7
20F8 COMBINING SHOGI PIECE TURNED 0 DEGREES
20F9 COMBINING SHOGI PIECE TURNED 90 DEGREES
20FA COMBINING SHOGI PIECE TURNED 180 DEGREES
20FB COMBINING SHOGI PIECE TURNED 270 DEGREES
20FC COMBINING PROMOTED SHOGI PIECE TURNED 0 DEGREES
20FD COMBINING PROMOTED SHOGI PIECE TURNED 90 DEGREES
20FE COMBINING PROMOTED SHOGI PIECE TURNED 180 DEGREES
20FF COMBINING PROMOTED SHOGI PIECE TURNED 270 DEGREES
I demand combining frames for Shogi pieces for my Tai and Taikyoku Shogi projects. It would probably work via ZWJs (arbitrary number of characters) or as a binary ligator since most Shogi pieces have 2 characters on them. For rotation and promotion it may be better to use CSS, since there are variants for more than 2 players and with really whacky grids. Assuming 30 degree increments, 24 code points would be needed, which is something more suited for the SMP. Also TIP is missing some peculiar kanji, or maybe just the Wikipedia pages have not been updated yet.
Number Forms (4):
218C TONAL DIGIT NINE NI
218D TONAL DIGIT ELEVEN HU
218E TONAL DIGIT TWELVE VY
218F TONAL DIGIT FIFTEEN FY
Definitely Tonal digits, 2 of them are unifiable with the duodecimal ones. Musical clefs can go into the SMP to the Musical Symbols block.
Would appreciate if there was some space for base4^2, which needs 16 distinct and 4 unifiable (o ı ɔ c) codepoints, so that Roman numerals no longer need to be misappropriated in the font.
Control Pictures (6+16):
242A some legacy control picture
242B some legacy control picture
242C some legacy control picture
242D some legacy control picture
242E some legacy control picture
242F some legacy control picture
2430 SYMBOL FOR PADDING CHARACTER
2431 SYMBOL FOR HIGH OCTET PRESET
2432 SYMBOL FOR BREAK PERMITTED HERE
2433 SYMBOL FOR NO BREAK HERE
2434 SYMBOL FOR INDEX
2435 SYMBOL FOR NEXT LINE
2436 SYMBOL FOR START OF SELECTED AREA
2437 SYMBOL FOR END OF SELECTED AREA
2438 SYMBOL FOR CHARACTER TABULATION SET (Horizontal Tabulation Set)
2439 SYMBOL FOR CHARACTER TABULATION WITH JUSTIFICATION (Horiz. TwJ)
243A SYMBOL FOR LINE TABULATION SET (Vertical Tabulation Set)
243B SYMBOL FOR PARTIAL LINE FORWARD (Partial Line Down)
243C SYMBOL FOR PARTIAL LINE BACKWARD (Partial Line Up)
243D SYMBOL FOR REVERSE LINE FEED (Reverse Index)
243E SYMBOL FOR SINGLE SHIFT TWO
243F SYMBOL FOR SINGLE SHIFT THREE
Needs control pictures for Teletext and C1 Controls, but there's not enough space (25, need 64), so they might have to go behind Symbols For Legacy Computing into a new Control Pictures Supplement block. There are also bibliographical C1 codes (next 32), and 17 modifications of C0, so a 128 code point block is needed.
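What exists today is a straight offset mapping for C0, space and DEL, as the sketch below shows (show_controls is a made-up name). C1 controls pass through untouched because there is nothing to map them to, which is the gap this section complains about.

```python
def show_controls(text: str) -> str:
    """Replace C0 controls, space and DEL with their Control Pictures."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0x20:
            out.append(chr(0x2400 + cp))   # U+2400..U+2420, same offset as the control
        elif cp == 0x7F:
            out.append("\u2421")           # SYMBOL FOR DELETE
        else:
            out.append(ch)                 # C1 (U+0080..U+009F) has no pictures
    return "".join(out)


print(show_controls("tab\there\r\n"))
```

If C1 pictures were ever encoded in one contiguous run, the same kind of offset arithmetic would cover them with one more branch.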
Optical Character Recognition (5+16):
244B
244C
244D
244E
244F
2450 SYMBOL FOR DEVICE CONTROL STRING
2451 SYMBOL FOR PRIVATE USE ONE
2452 SYMBOL FOR PRIVATE USE TWO
2453 SYMBOL FOR SET TRANSMIT STATE
2454 SYMBOL FOR CANCEL CHARACTER
2455 SYMBOL FOR MESSAGE WAITING
2456 SYMBOL FOR START OF GUARDED AREA
2457 SYMBOL FOR END OF GUARDED AREA
2458 SYMBOL FOR START OF STRING
2459 SYMBOL FOR CONTROL
245A SYMBOL FOR SINGLE CHARACTER INTRODUCER
245B SYMBOL FOR CONTROL SEQUENCE INTRODUCER
245C SYMBOL FOR STRING TERMINATOR
245D SYMBOL FOR OPERATING SYSTEM COMMAND
245E SYMBOL FOR PRIVACY MESSAGE
245F SYMBOL FOR APPLICATION PROGRAM COMMAND
Could be repurposed with the remaining space in Control Pictures for encoding C1 controls if Teletext ones are deemed inferior. However, heterodox C0 and bibliographical C1 variants still need another 17 and 28 code points, that is a block of 64.
There is a proposal to put all the Unicode dashed box character symbols into SSP for easier reference, which would make these symbols-for obsolete.
Miscellaneous Symbols and Arrows (2):
2B74 EXTERNAL LINK SIGN [nonapproved]
2B75 UP DOWN TRIANGLE HEADED ARROW TO BARS
There are like 1000 different arrows in the BMP and SMP, yet a single sign for an external link, used not just on the very popular website Wikipedia, was met with a nonapproval notice. That's why there's a bunch of UI symbols in Font Awesome.
In North Korean KPS 9566 there are 2 mutually mirrored diagonally striped triangles resembling an Adidas knock-off, possibly to be found somewhere amongst Symbols for Legacy Computing; if not, then complete this block. The Triangleception and circled upwards manicule can be combined. The scissors just happen to point in a different direction, see directional emoji. The hammer-brush-sickle should be an emoji ZWJ sequence. The RedStarOS 3.0 E0 to EB prefixed extensions seem to be more suited for the SMP, and EC to FE prefixed ones seem to be OS specific.
Coptic (5):
2CF4 Some obscure capital letter
2CF5 Some obscure small letter
2CF6 Another obscure capital letter
2CF7 Another obscure small letter
2CF8 Punctuation
Disunified from Greek. One could have thought it was just a font, but in its early stages it contained way more characters derived from Demotic.
Georgian Supplement (8):
2D26 GEORGIAN SMALL LETTER FI
2D28 GEORGIAN SMALL LETTER ELIFI
2D29 GEORGIAN SMALL LETTER TURNED GAN
2D2A GEORGIAN SMALL LETTER AIN
2D2B Some text separator / GEORGIAN LETTER U-BRJGU [nonapproved]
2D2C MODIFIER GEORGIAN SMALL LETTER NAR
2D2E GEORGIAN SMALL LETTER HARD SIGN
2D2F GEORGIAN SMALL LETTER LABIAL SIGN
This is the 3rd case, and also there's Mtavruli. 2D1Bh ⴛ (small Jil) looks like ch in certain fonts.
Tifinagh (21):
2D68..2D6E
2D71..2D7E
Despite stemming from Phoenician, it doesn't resemble it at all, much like these Indic scripts.
Ethiopic Extended (17):
2D97
2D98..2D9F Some syllable group
2DA7
2DAF
2DB7
2DBF
2DC7
2DCF
2DD7
2DDF
Supplemental Punctuation (30):
2E5E QUESTION COMMA
2E5F EXCLAMATION COMMA
2E64
2E65 Alcanter de Brahm’s Irony Point
2E66 HEART-SHAPED DOUBLE EXCLAMATION MARK (Love point) [Bazin]
2E67 EXCLAMATION MARK WITH CROSSBAR (Certitude/conviction point)
2E68 EXCLAMATION MARK WITH CAP (Authority point)
2E69 EXCLAMATION MARK WITH TIE OVERLAY (Irony point)
2E6A DOUBLE EXCLAMATION MARK CONVERGING INTO SINGLE DOT (Acclamation)
2E6B ZIGZAG EXCLAMATION MARK (Doubt point)
2E6C SNARK
2E6D IRONIETEKEN
2E6E SINGLE QUASIQUOTATION MARK
2E6F DOUBLE QUASIQUOTATION MARK
2E70 LEFT SINGLE QUASIQUOTATION MARK
2E71 RIGHT SINGLE QUASIQUOTATION MARK
2E72 LEFT DOUBLE QUASIQUOTATION MARK
2E73 RIGHT DOUBLE QUASIQUOTATION MARK
2E74 SINCERIOD [CollegeHumor]
2E75 LEFT SARCASTISES
2E76 RIGHT SARCASTISES
2E77 HEMI-DEMI-SEMI-COLON
2E78 ANDORPERSAND
2E79 LEFT MOCKWOTATION MARK
2E7A RIGHT MOCKWOTATION MARK
2E7B SUPERELLIPSIS
2E7C JÈ-MARK
2E7D EL-REY
2E7E REVERSED INVERTED QUESTION MARK
2E7F HALF HYPHEN
No shortage of space here. Could use some CollegeHumor and Bazin punctuation marks. Bazin ones were actually proposed once, but UTC didn't find the evidence sufficient. Fairfax contains even more. SarcMark™ is trademarked and Morgan Freemark wouldn't pass even as an Emoji for depicting a specific person. Now mostly emoji are used instead of punctuation.
CJK Radicals Supplement (13):
2E9A CJK RADICAL C-SIMPLIFIED RAP
2EF4 2nd stage simplified radicals
2EF5
2EF6
2EF7
2EF8
2EF9
2EFA
2EFB
2EFC
2EFD
2EFE
2EFF
Here can go the components needed for proper representation with IDS.
Kangxi Radicals (10):
2FD6
2FD7
2FD8
2FD9
2FDA
2FDB
2FDC
2FDD
2FDE
2FDF
Spill-over area for IDCs, strokes, components, maybe even Japanese era names.
*Ideographic Description Characters Extended (16):
2FE0 IDEOGRAPHIC DESCRIPTION CHARACTER HORIZONTAL CONJOINER
2FE1 IDEOGRAPHIC DESCRIPTION CHARACTER VERTICAL CONJOINER
2FE2 IDEOGRAPHIC DESCRIPTION CHARACTER SURROUNDING CONJOINER
2FE3 IDEOGRAPHIC DESCRIPTION CHARACTER HORIZONTAL DOUBLE MULTIPLIER
2FE4 IDEOGRAPHIC DESCRIPTION CHARACTER VERTICAL DOUBLE MULTIPLIER
2FE5 IDEOGRAPHIC DESCRIPTION CHARACTER TRIANGLE MULTIPLIER
2FE6 IDEOGRAPHIC DESCRIPTION CHARACTER HORIZONTAL TRIPLING MULTIPLIER
2FE7 IDEOGRAPHIC DESCRIPTION CHARACTER VERTICAL TRIPLING MULTIPLIER
2FE8 IDEOGRAPHIC DESCRIPTION CHARACTER SQUARE MULTIPLIER
2FE9 IDEOGRAPHIC DESCRIPTION CHARACTER HORIZONTAL QUADRUPLE MULTIPLIER
2FEA IDEOGRAPHIC DESCRIPTION CHARACTER VERTICAL QUADRUPLE MULTIPLIER
2FEB IDEOGRAPHIC DESCRIPTION CHARACTER LEFT SPAN DELIMITER
2FEC IDEOGRAPHIC DESCRIPTION CHARACTER RIGHT SPAN DELIMITER
2FED IDEOGRAPHIC DESCRIPTION CHARACTER STROKE VARIATION
2FEE IDEOGRAPHIC DESCRIPTION CHARACTER MULTIPLY
2FEF IDEOGRAPHIC DESCRIPTION CHARACTER VERTICAL FLIP
This is the last unallocated block. After that and a few fill-ins, BMP will be sealed for the rest of encoding history due to backwards compatibility, thus Unicode will have completed its original purpose of a 16-bit charset. After filling the 16 astral planes, attention should be turned back to the original ISO 10646 32-bit UCS-4.
The 13 CDP Big5 IDCs have been proposed since 2002 at this exact place.
Multiplication takes an IDC and a component, and writes that component into each field specified by IDC, recursively. Better than these CDP multipliers.
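A rough sketch of how such a multiplication operator might expand into a plain IDS, written in Python. It assumes the operator sits at 2FEE as suggested in the list above (an unassigned code point, so purely hypothetical), that it is followed by one IDC and one operand, and that only the twelve IDCs at 2FF0..2FFB are handled. The function names are mine and this is just one possible reading of "recursively".

```python
MULTIPLY = "\u2FEE"  # hypothetical: the code point suggested above, unassigned

# Number of operands taken by each of the original IDCs at 2FF0..2FFB.
ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
ARITY["\u2FF2"] = ARITY["\u2FF3"] = 3  # the two three-part IDCs


def _expand_at(ids: str, i: int) -> tuple[str, int]:
    """Expand the IDS subtree starting at ids[i]; return (text, next index)."""
    if i >= len(ids):
        raise ValueError("truncated sequence")
    ch = ids[i]
    if ch == MULTIPLY:
        if i + 1 >= len(ids) or ids[i + 1] not in ARITY:
            raise ValueError("MULTIPLY must be followed by an IDC")
        idc = ids[i + 1]
        operand, j = _expand_at(ids, i + 2)
        # Write the same operand into every field of the IDC.
        return idc + operand * ARITY[idc], j
    if ch in ARITY:
        parts, j = [], i + 1
        for _ in range(ARITY[ch]):
            part, j = _expand_at(ids, j)
            parts.append(part)
        return ch + "".join(parts), j
    return ch, i + 1  # a plain component


def expand(ids: str) -> str:
    """Rewrite an IDS containing MULTIPLY into one using only plain IDCs."""
    text, end = _expand_at(ids, 0)
    if end != len(ids):
        raise ValueError("trailing characters after the sequence")
    return text


print(expand("\u2FEE\u2FF0木"))              # two trees side by side
print(expand("\u2FEE\u2FF1\u2FEE\u2FF0火"))  # nested: four fires
```

Because the operand is expanded before it is copied, a multiplier inside a multiplier works without any special casing, which is roughly what dedicated double, triple and quadruple multipliers would otherwise have to enumerate one by one.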
Stroke subtraction, now just subtraction, has been moved to 31EF at the end of strokes.
Hiragana (3):
3040 KANA SPACE
3097 COMBINING HIRAGANA-KATAKANA OVERLINE
3098 COMBINING HIRAGANA-KATAKANA HALF OF VOICED SOUND MARK
Small letters are now going into the Small Kana Extension block in SMP, Gojūon fillers in Kana Extended Additional (HIRAGANA LETTER ARCHAIC YI is still missing, reportedly same as a hentaigana letter derived from the same manyougana character), and precomposed letters with (han)dakuten are refused. So I came up with a single dot from the dakuten (ejective consonants maybe) and remembered some Taiwanese overlined extensions. Also some white space may come in handy with all these variable-width fonts.
Bopomofo (5):
3100 BOPOMOFO SPACE
3101 BOPOMOFO TONE-1
3102 BOPOMOFO TONE-2
3103 BOPOMOFO TONE-3
3104 BOPOMOFO TONE-4
For some reason a Bopomofo Extended-A block is being added to the SMP. These 5 codepoints are reserved probably for something very special.
Hangul Compatibility Jamo (2):
3130 HANGUL JAMO SPACE
318F
CJK Strokes (9):
31E6
31E7
31E8
31E9
31EA
31EB
31EC
31ED
31EE
IDCs are creeping into here instead of taking the last free block. Hangul strokes were proposed here once, but Hangul is really a post-Brahmic abugida.
Enclosed CJK Letters and Months (1):
321F SQUARE ERA NAME [classified]
After recent Japanese emperor resignation, Unicode 12.1 promptly added a square ㋿ era. Weirdly enough, Heisei matches with the Velvet revolution (1989), and Reiwa with Anno Covidiam (2019). Since there needs to be space for more era names, Enclosed Ideographic Supplement in SMP seemingly not being the place, this 321Fh codepoint is the only free one in this range.
*** Ocean of CJKV Ideographs, briefly interrupted by Yijing hexagrams ***
Hangul precomposed syllables originally occupied the space of CJK Extension A from 3400 to 3D2D in 1.0.1 and then to 4DFF in 1.1.
Yi Syllables (3):
A48D YI ITERATION MARK
A48E YI PLACEHOLDER
A48F YI SPACE
This is just the Modern Yi, there's also Classical Yi which consists of 88591 ideographs according to an enormous proposal from probably some mainland China governmental organization. There is a clear similarity with Clerical script. It looks like the UTC just gave up since there is no way to efficiently check this collection for unifiable characters, having introduced some mistakes in the A6D7h (42711) character CJK ExtB and realizing the TARFU during work on ExtC.
Yi Radicals (9):
A4C7
A4C8
A4C9
A4CA
A4CB
A4CC
A4CD
A4CE
A4CF
Vai (4+16):
A62C
A62D
A62E
A62F VAI SPACE
A630..A63F
1 block of 16 wasted. That's a nice space for some hexadecimal digit set.
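Gaps like this one can be checked against whatever Unicode data ships with your Python, using the standard unicodedata module. This is only a sketch: the helper name is mine, and the result reflects the database version of the local interpreter, not the provisional assignments discussed in this article.

```python
import unicodedata


def unassigned(first: int, last: int) -> list[int]:
    """Code points in first..last that the local database leaves unassigned (Cn)."""
    return [cp for cp in range(first, last + 1)
            if unicodedata.category(chr(cp)) == "Cn"]


vai_gaps = unassigned(0xA500, 0xA63F)   # the whole Vai block
print(unicodedata.unidata_version)
print(len(vai_gaps), f"{vai_gaps[0]:04X}..{vai_gaps[-1]:04X}")
```

With the Unicode data in current Python releases this should report 20 free code points from A62C to A63F, matching the 4+16 above.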
Bamum (8):
A6F8 Some punctuation
A6F9 Some punctuation
A6FA Some punctuation
A6FB Some punctuation
A6FC Some punctuation
A6FD Some punctuation
A6FE Some punctuation
A6FF BAMUM SPACE
Latin Extended D (18):
A7DE LATIN CAPITAL LETTER THETA
A7DF LATIN SMALL LETTER THETA
A7E0 REVERSED CAPITAL G
A7E1 REVERSED SMALL G
A7E3 REVERSED CAPITAL J
A7E4 REVERSED SMALL J
A7E5 REVERSED CAPITAL Z
A7E6 REVERSED SMALL Z
A7E7 REVERSED CAPITAL B
A7E8 REVERSED CAPITAL P
A7E9 REVERSED CAPITAL Q
A7EA REVERSED SMALL A
A7EB REVERSED SMALL F
A7EC REVERSED SMALL L
A7ED REVERSED SMALL N
A7EE REVERSED SMALL R
A7EF REVERSED SMALL T
A7F0 LATIN LETTER SMALL CAPITAL X
This section is probably reserved for medievalists. Some missing capitals. The missing pieces for fancy Unicode Latin text can go to the SMP, but since they would be used quite frequently in plain text chats and comments for emphasis, more than those medieval variants, it makes sense to squeeze them in BMP. Lisu or Fraser script provides many turned capital letters and reversed capital letters that are symmetrical along a horizontal axis. There are 4 more codepoints in Latin Extended E and 3 in Superscripts and Subscripts.
To Latin Extended-G in SMP: SUBSCRIPT LETTER SMALL B, SUBSCRIPT LETTER SMALL C, SUBSCRIPT LETTER SMALL D, SUBSCRIPT LETTER SMALL F, SUBSCRIPT LETTER SMALL G, SUBSCRIPT LETTER SMALL Q, TURNED CAPITAL Q, TURNED SMALL J
Maybe could fit the rest of letters for Unifon, but not small caps.
Syloti Nagri (3):
A82D Punctuation
A82E Punctuation
A82F SYLOTI NAGRI SPACE
Common Indic Number Forms (6):
A83A
A83B
A83C
A83D
A83E
A83F
Phags-Pa (8):
A878 PHAGS-PA MARK TRIPLE SHAD
A879
A87A
A87B
A87C
A87D
A87E
A87F PHAGS-PA SPACE
This is said to be the inspiration behind Hangul, whilst not being a direct predecessor nor an evolution, just like Aramaic isn't a direct predecessor of Brahmi. Considerable eclecticism and originality went into Brahmi and Hangul. The further from original Phoenician, the more featural the writing systems get in order to disambiguate shapes for different sounds, however the shapes have degenerated and merged, so more disambiguation is required. The most dramatic change seems to come with transitioning from stone or wood carving to writing on paper, a Chinese invention, or leaves. On the other hand, the invention of printing press has set the letter shapes in stone (pun not intended).
Saurashtra (14):
A8C6
A8C7
A8C8
A8C9
A8CA
A8CB
A8CC
A8CD
A8DA..A8DF SAURASHTRA HEXADECIMAL DIGIT TEN..FIFTEEN
Again some hexadecimal digits; get creative if you went to the trouble of having different digit shapes.
Rejang (11):
A954
A955
A956
A957
A958
A959
A95A
A95B
A95C
A95D
A95E
Hangul Jamo Extended A (3):
A97D HANGUL CHOSEONG NORTH KOREAN LIEU / YEORINRIEUL
A97E HANGUL CHOSEONG NORTH KOREAN NNIEUL / NIONRIEUL
A97F HANGUL CHOSEONG NORTH KOREAN WIEUP
Apparently, North Korea tried some extensions, while South Korea tried the colonial alphabet, before returning to common 1946 version.
Javanese (5):
A9CE
A9DA
A9DB
A9DC
A9DD
Myanmar Extended B (1):
A9FF MYANMAR SPACE
There's also an Extended-C block in SMP adding various forms of decimal digits. Like, who the hell actually uses them in >currentYear? The latinized forms are basically universal and digitizing old records into them instead would simplify processing considerably.
Cham (13):
AA37
AA38
AA39
AA3A
AA3B
AA3C
AA3D
AA3E
AA3F
AA4E
AA4F CHAM SPACE
AA5A CHAM DUODECIMAL DIGIT TEN
AA5B CHAM DUODECIMAL DIGIT ELEVEN
Tai Viet (24):
AAC3..AACF
AAD0..AADA
Meetei Mayek Extensions (9):
AAF7
AAF8
AAF9
AAFA
AAFB
AAFC
AAFD
AAFE
AAFF MEETEI MAYEK SPACE
Ethiopic Extended A (16):
AB00
AB07
AB08
AB0F
AB10
AB17
AB18..AB1F Extra syllable group
AB27
AB2F
Latin Extended E (1):
AB6F LATIN SUBSCRIPT LETTER SMALL U WITH LEFT HOOK
Seems to be an IPA block, but new IPA characters seem to go to ExtG in SMP now. Latin Theta is a subject of heated discussion, however it doesn't look like a certain style of Greek Theta encoded separately as ϑ. Also there is both Z and Ƶ, g and ɡ, and many more, yet Cyrillic cursive forms are ignored.
Meetei Mayek (8):
ABEE Some sign
ABEF Another sign
ABFA..ABFF MEETEI MAYEK HEXADECIMAL DIGIT TEN..FIFTEEN
Space for hexadecimal found again.
*** Ocean of Hangul Syllables ***
Some 12 leftover spaces at D7A4..D7AF, maybe for North Korean stylized dictator names from KPS 9566:
Kim Il Song (Kim Ir Sen)
Kim Chong Il
Kim Chong Un
Kim ...
Hangul Jamo Extended B (8):
D7C7 HANGUL JUNGSEONG NORTH KOREAN YI / YEORINI
D7C8 HANGUL JONGSEONG NORTH KOREAN LIEU / YEORINRIEUL
D7C9 HANGUL JONGSEONG NORTH KOREAN NNIEUL / NIONRIEUL
D7CA HANGUL JONGSEONG NORTH KOREAN WIEUP
D7FC HANGUL JONGSEONG NORTH KOREAN RIEUL-YEORINGHIEUH
D7FD HANGUL JONGSEONG NORTH KOREAN YESIEUNG-KIEUK
D7FE HANGUL CHOSEONG NORTH KOREAN RIEUL-YEORINGHIEUH
D7FF HANGUL CHOSEONG NORTH KOREAN YESIEUNG-KIEUK
The North Korean extensions fit almost too perfectly. Han-gul might be quite a misnomer as Koreans are not Han, with North Koreans wanting to use Chosongul or "Korean Character" instead, but anyway. There are many characters that could have been named better in Unicode.
*** Surrogates and Private Use ***
Should be (inter)nationalized according to UCSUR. LINCUA and SIL leftovers are either precomposed characters, font variants, or can be represented with a higher-level protocol like HTML, LaTeX, or ECMA-48.
Private Use used to be from E800 to FDFF in 1.0.0, then it was moved a bit earlier to E000 to F7FF in 1.0.1 to make way for CJK compatibility ideographs, and seemingly gone from 1.1, before reappearing extended to F8FF in 2.0, being immediately preceded by surrogates from D800 to DFFF.
CJK Compatibility Ideographs (8+32):
FA6E
FA6F
FADA..FADF
FAE0..FAFF
12 ideographs in this block are actually regular ideographs. May be used for urgently needed ones or components for IDS. Requested disunifications could also go there. Formerly used by Nerd Fonts in its entirety as a PUA spillover. Since version 3, Nerd Fonts use astral PUA, which isn't supported by Windows charmap.
Alphabetic Presentation Forms (22):
FB07 LATIN SMALL LIGATURE CT
FB08 LATIN SMALL LIGATURE FJ
FB09 LATIN SMALL LIGATURE FFJ
FB0A LATIN SMALL LIGATURE LONG S EZH
FB0B LATIN SMALL LIGATURE LONG S H
FB0C LATIN SMALL LIGATURE LONG S I
FB0D LATIN SMALL LIGATURE LONG S L
FB0E LATIN SMALL LIGATURE LONG S LONG S
FB0F LATIN SMALL LIGATURE T H
FB10 LATIN CAPITAL LETTER CH
FB11 LATIN SMALL LETTER CH
FB12 LATIN CAPITAL LETTER C WITH SMALL LETTER H
FB18 ARABIC LETTER SEEN ISOLATED FORM WITH TAIL
FB19 ARABIC LETTER SHEEN ISOLATED FORM WITH TAIL
FB1A ARABIC LETTER SAD ISOLATED FORM WITH TAIL
FB1B ARABIC LETTER DAD ISOLATED FORM WITH TAIL
FB1C RIGHT HALF ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
FB37 RIGHT HALF ARABIC LIGATURE LAM WITH ALEF FINAL FORM
FB3D LEFT HALF ARABIC LIGATURE LAM WITH ALEF
FB3F LEFT HALF ARABIC LIGATURE LAM WITH ALEF WITH HAMZA ABOVE
FB42 LEFT HALF ARABIC LIGATURE LAM WITH ALEF WITH HAMZA BELOW
FB45 LEFT HALF ARABIC LIGATURE LAM WITH ALEF WITH MADDA ABOVE
Hebrew wide letters (use Letter Widener instead): HEBREW LETTER WIDE BET, HEBREW LETTER WIDE GIMEL, HEBREW LETTER WIDE VAV, HEBREW LETTER WIDE HET, HEBREW LETTER WIDE TET, HEBREW LETTER WIDE MEM, HEBREW LETTER WIDE NUN, HEBREW LETTER WIDE SAMEKH, HEBREW LETTER WIDE AYN, HEBREW LETTER WIDE PE, HEBREW LETTER WIDE TSADI, HEBREW LETTER WIDE QOF, HEBREW LETTER WIDE SHIN, HEBREW WIDE LIGATURE ALEF LAMED
Precomposed Ch is needed for Czech crosswords and word searches, it sticks out like a sore thumb. Initial Teaching Alphabet has the lowercase, which is to be encoded in the SMP Latin ExtG. Also Ch was in a variant of KOI8-CS. However, a similar zł proposal from the Polish National Bank itself with relevant historical references still got rejected, as well as the Hungarian Ft one (could also be in CJK compatibility as feet square, there's ㏎ after all). Meanwhile, ₧ from CP437 still lingers around, as well as ₯.
Uighur presentation forms were probably rejected in part due to complaints from Mainland China.
The Arabic letters and the other 3 in Arabic Presentation Forms-B block are from a proposal for Win9x Arabic codepages compatibility, which seem to be a bespoke proprietary solution, like WinCyrillic, that was only popular because of market dominance. A slightly different encoding model was chosen for Unicode 1.0.0 back in 1991, even before Win3.1, that replaces the Win9x one in WinNT. Win10 has also finally begun defaulting to UTF-8 instead of "ANSI". A better place to fit these compatibility characters would be alongside the wide Hebrew letters to not fragment the writing direction zones.
*** 32 noncharacters, maybe mechanism for more UCS-4 planes ***
Once the 17 planes run out, original ISO 10646 UCS-4 would want its revenge. Could work like this: once FDDx for uppermost 4 bits to make it self-synchronizing, and then 7 times FDEx for lower 28 bits, with overlong encodings forbidden. That makes it 16 octets for a character above 10:FFFFh, hopefully a valid reason for switching to UTF-8, originally capable of 31 bits in 6 octets.
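A minimal Python sketch of the escape scheme described above. It assumes one FDDx unit carries the uppermost 4 bits and seven FDEx units carry the remaining 28 bits, most significant nibble first. The nibble order and the function names are my assumptions, nothing here is standardized.

```python
LEAD_BASE = 0xFDD0    # FDD0..FDDF: lead unit, carries the uppermost 4 bits
TRAIL_BASE = 0xFDE0   # FDE0..FDEF: trail units, 4 bits each
MAX_UNICODE = 0x10FFFF


def encode_ucs4(value: int) -> list[int]:
    """Encode a 32-bit value above 10FFFFh as 8 noncharacter code units."""
    if not MAX_UNICODE < value <= 0xFFFFFFFF:
        # Values that fit in the 17 planes would be overlong, hence forbidden.
        raise ValueError("value must be in 110000h..FFFFFFFFh")
    units = [LEAD_BASE | (value >> 28)]
    for shift in range(24, -1, -4):           # 7 trail nibbles, high to low
        units.append(TRAIL_BASE | ((value >> shift) & 0xF))
    return units


def decode_ucs4(units: list[int]) -> int:
    """Inverse of encode_ucs4; rejects malformed and overlong sequences."""
    if len(units) != 8 or not LEAD_BASE <= units[0] <= LEAD_BASE + 0xF:
        raise ValueError("expected one FDDx lead unit followed by 7 units")
    value = units[0] & 0xF
    for unit in units[1:]:
        if not TRAIL_BASE <= unit <= TRAIL_BASE + 0xF:
            raise ValueError("trail units must be FDE0..FDEF")
        value = (value << 4) | (unit & 0xF)
    if value <= MAX_UNICODE:
        raise ValueError("overlong encoding")
    return value


units = encode_ucs4(0x12345678)
print(" ".join(f"{u:04X}" for u in units))  # FDD1 FDE2 FDE3 FDE4 FDE5 FDE6 FDE7 FDE8
print(len(units) * 2, "octets in UTF-16")   # 16 octets in UTF-16
```

Since the lead and trail units come from disjoint ranges, a decoder dropped into the middle of a stream can always tell where the next sequence starts, which is the self-synchronizing property mentioned above.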
Vertical Forms (6):
FE1A PRESENTATION FORM FOR VERTICAL LESSER THAN
FE1B PRESENTATION FORM FOR VERTICAL GREATER THAN
FE1C PRESENTATION FORM FOR VERTICAL LESSER OR EQUAL THAN
FE1D PRESENTATION FORM FOR VERTICAL GREATER OR EQUAL THAN
FE1E PRESENTATION FORM FOR VERTICAL EQUALS
FE1F PRESENTATION FORM FOR VERTICAL NOT EQUALS
Also let's not forget about arrows, which should be rotated adequately.
Small Form Variants (6):
FE53 SMALL APOSTROPHE
FE67 SMALL SOLIDUS
FE6C SMALL QUOTATION MARK
FE6D SMALL LOW LINE / SMALL CIRCUMFLEX ACCENT
FE6E SMALL VERTICAL LINE / SMALL GRAVE ACCENT
FE6F SMALL TILDE
Not the same as subscripts. The names of some ASCII symbols are weird. Who the hell uses "solidus" instead of slash, or "low line" instead of underscore?
Arabic Presentation Forms B (3):
FE75 ARABIC LIGATURE SHADDA WITH KASRATAN MEDIAL FORM
FEFD ARABIC LIGATURE SHADDA WITH FATHATAN ISOLATED FORM
FEFE ARABIC LIGATURE SHADDA WITH FATHATAN MEDIAL FORM
Halfwidth and Fullwidth Forms (15):
FF00 FULLWIDTH SPACE (not necessarily same as Ideographic Space)
FFBF HALFWIDTH HANGUL LETTER NORTH KOREAN LIEU / YEORINRIEUL
FFC0 HALFWIDTH HANGUL LETTER NORTH KOREAN NNIEUL / NIONRIEUL
FFC1 HALFWIDTH HANGUL LETTER NORTH KOREAN WIEUP
FFC8 HALFWIDTH HANGUL LETTER NORTH KOREAN RIEUL-YEORINGHIEUH
FFC9 HALFWIDTH HANGUL LETTER NORTH KOREAN YESIEUNG-KIEUK
FFD0 HALFWIDTH SPACE
FFD1 HALFWIDTH HANGUL LETTER NORTH KOREAN YI / YEORINI
FFD8 HALFWIDTH HANGUL LETTER ARAE-A
FFD9 HALFWIDTH HANGUL LETTER ARAE-A YI
FFDD SQUARE WITH SPECKLES FILL [nonapproved]
FFDE FULLWIDTH PERIOD AND RIGHT PARENTHESIS (KPS 9566)
FFDF FULLWIDTH PERIOD AND RIGHT DOUBLE ANGULAR BRACKET (KPS 9566)
FFE7 IDEOGRAPHIC BLACK SQUARE [nonapproved] (IDEOGRAPHIC FULL BLOCK)
FFEF IDEOGRAPHIC WHITE SQUARE [nonapproved] (4-bit little endian BOM shenanigans)
Specials (9):
FFF0 * FORCE MIRRORING DOWNWARD BOUSTROPHEDON (Attic Greek)
FFF1 * FORCE MIRRORING UPWARD BOUSTROPHEDON
FFF2 * FORCE ROTATING DOWNWARD BOUSTROPHEDON
FFF3 * FORCE ROTATING UPWARD BOUSTROPHEDON (Rongorongo)
FFF4 * TEXT TURN LEFT (like Befunge)
FFF5 * TEXT TURN RIGHT
FFF6 * TEXT TRAMPOLINE (jump across a cell)
FFF7 * INHIBIT BOUSTROPHEDON
FFF8 * CHARACTER OVERSTRIKE
I can't paste these here, despite being perfectly regular Unicode characters. The noncharacter FFFE is Wrong Byte Order Mark and FFFF is EOF in 16-bit character type. In UTF-8, EOF is FF, and FE means it's maybe UTF-16, UTF-32, or some legacy 8-bit encoding. Also possible is that FE indicates a non-standard 36-bit codepoint and FF a 42-bit codepoint, to allow for even bigger character sets and font banks. However, a ROM for a 16-bit 16x16 monochrome charset is already 2 MB (34 MB for all 17 Unicode planes), and for a 32-bit one it would be 128 GB.
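The ROM figures above are easy to reproduce. A quick Python check, assuming one bit per pixel with no padding and reading MB and GB as binary units (MiB and GiB), which is my interpretation:

```python
GLYPH_BYTES = 16 * 16 // 8   # one 16x16 monochrome glyph = 32 bytes
MIB = 2**20
GIB = 2**30


def rom_bytes(code_points: int) -> int:
    """Size of a flat glyph ROM with one bitmap per code point."""
    return code_points * GLYPH_BYTES


print(rom_bytes(2**16) // MIB, "MiB for one 16-bit plane")       # 2
print(rom_bytes(17 * 2**16) // MIB, "MiB for all 17 planes")     # 34
print(rom_bytes(2**32) // GIB, "GiB for a full 32-bit charset")  # 128
```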
Total remaining space in BMP (18.0+provisional apr 2026): 9+2+22+3+14+2+3+3+5+5+30+48+37+35+56+26+36+10+37+41+45+45+8+26+6+4+3+7+9+9+12+14+14+17+10+12+13+13+2+17+12+1+8+6+4+2+8+5+23+1+2+9+15+4+22+21+2+5+8+21+17+30+13+10+16+3+5+2+9+1+3+9+20+8+18+3+6+8+14+11+3+5+1+13+24+9+16+1+8+12+8+40+22+6+6+3+15+9 = 1296
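For anyone who wants to recheck the addition, here is a small Python snippet with the terms copied verbatim from the line above. The variable names are mine.

```python
# The per-block gap counts, copied verbatim from the total above.
TERMS = (
    "9+2+22+3+14+2+3+3+5+5+"
    "30+48+37+35+56+26+36+10+37+41+"
    "45+45+8+26+6+4+3+7+9+9+"
    "12+14+14+17+10+12+13+13+2+17+"
    "12+1+8+6+4+2+8+5+23+1+"
    "2+9+15+4+22+21+2+5+8+21+"
    "17+30+13+10+16+3+5+2+9+1+"
    "3+9+20+8+18+3+6+8+14+11+"
    "3+5+1+13+24+9+16+1+8+12+"
    "8+40+22+6+6+3+15+9"
)

gaps = [int(term) for term in TERMS.split("+")]
total = sum(gaps)
print(len(gaps), "entries,", total, "free code points")  # 98 entries, 1296 free code points
print(f"{total / 0x10000:.1%} of the BMP")
```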
That's all the space left in the BMP, the original scope of Unicode. Still quite plenty, but fragmented and bound to specific scripts.
The SMP is almost fully laid out in the Roadmap. The largest gaps are:
300h (768) characters between Egyptian and Mayan hieroglyphs,
400h (1024) between Khitan Large Script and Pau Cin Hau Syllabary,
300h (768) between Linear Elamite and Oromo (Rongorongo gone),
270h (624) between Jianzi and Latin ExtG,
290h (656) between Adlam and Persian Siyaq Numbers.
In SIP there are 3 free ranges of 20h (32), 9A0h (2464) and 5DEh (1502). Could be used for hybrid ideographs, innovations, and non-radical components.
The TIP is 1/5 full in Unicode 17, which with annual ~5000 sinogram repertoire means about 10 years before filling up, pushing the long-planned old Hanzi scripts out of this plane. Also there was a Classical Yi proposal of 88613 characters, which needs 2 dedicated planes (65534+23079), which is not extraordinary given the Han characters span 3 planes already.
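The back-of-the-envelope numbers in the paragraph above can be reproduced like this. It is a sketch using the paragraph's own round figures (1/5 full, about 5000 ideographs per year), not actual UTC statistics.

```python
import math

PLANE_SIZE = 0x10000
USABLE_PER_PLANE = PLANE_SIZE - 2   # minus the two noncharacters xxFFFE, xxFFFF

# Classical Yi: figure from the paragraph above.
classical_yi = 88613
planes_needed = math.ceil(classical_yi / USABLE_PER_PLANE)
spill_into_second = classical_yi - USABLE_PER_PLANE

# TIP: "1/5 full" and "~5000 per year" are the paragraph's round numbers.
tip_free = PLANE_SIZE * 4 // 5
years_left = tip_free / 5000

print(planes_needed, spill_into_second)   # 2 23079
print(round(years_left, 1))               # 10.5
```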
The SSP is still mostly empty. There are some requests to add obscure control codes and dashed boxes there.
I can see the planes turning out like this:
0 - Basic Multilingual Plane
1 - Supplementary Multilingual Plane
2 - Supplementary Ideographic Plane - Han + hybrid + components
3 - Tertiary Ideographic Plane - Han + Sawndip + Old Han (Seal, Bronze, Bone)
4 - Archaic Ideographic Plane - Old Han (Warring States)
5 - Quaternary Ideographic Plane - Obscure Han place and personal names
6 - Quinary Ideographic Plane - Classical Yi
7 - Senary Ideographic Plane - Classical Yi
8 - Tertiary Multilingual Plane - unproposed historical scripts as per SEI
9 - Quaternary Multilingual Plane - neographies used for living languages
A - Intellectual Property Encumbered Plane - paid advertising, conscripts
B - Supplementary Pictographic Plane - more Emoji, flags finally
C - Supplementary Pseudographics Plane - 4x4 (+65280)
D - Tertiary Pseudographics Plane - 2x8 (+65280), 1x6 (64)
E - Supplementary Special-Purpose Plane - boxes, super-surrog8, 3x5 (32766)
F - Supplementary Private Use Plane A - UCSUR, rejected scripts
10 - Supplementary Private Use Plane B - direct glyph address, font tech stuff
beyond - alien scripts
Still, according to the SEI, there are many scripts not yet even proposed (taken from old website, probably outdated):
Ariyaka, Avoiuli, Aztec Pictograms, Badaga, Bimanese, Borama (Gadabuursi), Bowen (Lao Baiwen), Bronze script*, Byblos*, Chak, Chola*, Cretan Hieroglyphs, Demotic (Egyptian)**, Epi-Olmec, Fula-1 (Fula Dita), Fula-2 (Ba), Gabelsberger Shorthand, Gbékoun, Gregg Shorthand, HamNoSys, Hausa-3 (Tafi), Hieratic (Egyptian)**, Iban, Isibheqe Sohlamvu, Kadamba, Kaddare, Kaida*, Karani, Khom, Kurux Banna, Kushan, Kwekor, Lontara Bilang Bilang*, Marchung*, Ma-sa-ba (Bambara), Micmac Hieroglyphs, Minangkabau, Mixtec, Nasu, Nisu, Numidian*, Nwagu Aneke Igbo, Old Minahasa, Olmec, Otomaung, Pabuchi, Pahawh Hmong First Stage, Pungchung*, Punic***, Ranjana (Landzya)*, Satavahana, Shankha (Shell script), Shavian Quikscript*, Stokoe (Stokoe Notation), Tai Pao, Teotihuacan, Veso Bey, Visible Speech****, Yi - Chuxiong, Yi - Sani, Zapotec, Zhuang Square, Archaic Miao Square Script, Asho Chin, Cacaxtla, Duota, Fakkham, Izapa, Jing (Zinan), Kaminaljuyu, Khamyang, Lik Hto Ngouk, Old Khmer, Old Mon, Pale Palaung, Rakhawunna, Rencong, Savara, Tajin, Takalik Abaj, Thai Nithet, Thai Noi (Lao Buhan), Thai Yo(r), Tula
(no star) (69)
* Has tentative allocations in the roadmap (10)
** Said to be unified with Hieroglyphs or Meroitic (2)
*** Said to be unified with Phoenician (1)
**** Already encoded in UCSUR (1)
People were making scripts up like crazy even then, only it was before copyright, so these conscripts could actually be used widely enough. Once Tolkien's copyright expires in 2044 (18 years left as of 2026) if not extended somehow, Cirth, Tengwar, and Sarati are likely coming, with more on the way, courtesy of the LOTR fandom. Klingon pI'qaD won't get into public domain before 2065, which would make it one of final additions. (Inter)Nationalizing the PUA is more likely. I doubt anyone cares about Mexico's life + 100 years nonsense, as if vampires were real.