Sections

2021-12-18

Future of my conlangs, modlangs, relexes, spelling reforms etc.


Summary ASCII diagram:

                           ,--------> Budhót'n -------> Gudhotn -------,
                           |                                           |
Palaeogetmanic ------------+-------------------------------------------+---> Neogetmanic
                           |                                           |
                           |          ,---> CRCreole -----> Hashtalk   |
                           v          |                                |
Onomatopoeic creole ---> GeSeL -------+---> Semitic Japanese ----------+
           |               |          |                                |
           |               |          +---> Slavitic -----> OSloJ -----+
           |               v          |                                |
           '-------> OnomaSemitic     +---> Getmarabic ---> GeSeL 2 ---+
                           |          |                                |
Inglish                    |          +---> AcroSem -------------------'
                           v          |
Satemic                  Blobl        '--------------> GeSeL 3 (Vocalized Acronym Creole)

Toki Pakala



Summary local use ISO 639-3 codes:

qbh - Budhót'n

qcc - CRCtalk

qga - G3S3L 2: Arabic Boogaloo

qge - G3S3L 1: Semitic Japanese (Getman's Generic Genuinely Silly Stupid Semitic Language) and its dialects

qgh - Gudhotn

qgp - G3S3L 3: Semperanto

qgn - Palaeogetmanic and Neogetmanic

qht - HashTalk

qig - Inglish

qlb - Blobl

qoa - G3S3L 3: Vocalized Acronym Creole

qoj - OSloJ (Obecni Slovanski Jazik)

qon - Onomatopoeic Creole

qst - Satemic

qtk - toki pona pakala pi toki nipon (old toki pona code)



Palaeo- and Neogetmanic / Staro- a novogetmanština [qgn]

Originally started as a bunch of modifications to Czech. It went for a highly German vibe with some Japanese. Puns were a large part of the vocabulary. No longer actively developed, though there exist some undocumented corona additions, like "nosoplenařit", to wear and make people wear a face cloth.

Instead, there are plans for Neogetmanic, which in addition to connecting all the Palaeogetmanic components is more Ponaszymu-Japanese with some Arabic, and is to incorporate the vocabulary assembled for GeSeL and a spinoff of Palaeogetmanic called Budhót'n. Basically something like English, which is really 3 languages in a trench coat - Old Norse via Anglo-Saxon, Latin via French, and Gaelic. Only in the case of Neogetmanic, these languages are all from different families, ensuring variety. This should help with writing poetry, as there will be more options for rhyming words, and the result will be mostly intelligible to fellow Western Slavic Japanologists or Arabists, like Jabberwocky.


Onomatopoeic Creole [qon]

Just a dead experiment. It was dependent on the parent language's interjections. There are some onomatopoeic roots in GeSeL, and a more sophisticated conlang with onomatopoeic cluster primitives is planned. Don't ever talk to your babies like that; their brains can handle all the natural language complexity and find patterns in it. Also, play classical or jazz to them instead of this Baby Shark nonsense.

Successor languages GeSeL, its dialect Onomasemitic, and Blobl incorporate onomatopoeic primitives.


GeSeL / G3S3L - Getman's Generic Genuinely Silly Stupid Semitic Language [qge]

Gone are the high school days when I had the time for putting in the content for GeSeL, which was mostly copied over from Semitic roots and Japanese dictionaries. Instead, its unique structure has served as a useful topic for a bunch of semester works and it now has a database architecture and a web UI called GeSeLator. It's one deployment to a better-advertised server away from letting others add the content for me, like a public questionnaire. I can filter out the junk.

There's a fundamental flaw in GeSeL, and it's the same as for Esperanto or other overly regular languages. There are only grammatical rhymes, which makes it unsuitable for writing poetry. Maybe one of the letter order mods could fix it.

The alphabet is really good, so I'll use variations of it in later pure conlangs, and also extend the mapping table to cover most scripts that are in Unicode. Those which cannot amass 27 letters (21+1 consonants, 4+1 vowels) will be hacked with accents or will have to use digraphs.

GeSeL has many root packages, giving way to many dialects. Each dialect will eventually be associated with a different grammar, becoming a descendant language.

S -> Getmarabic, later to become GeSeL 2
J -> Semitic Japanese, continuing to japanify
O -> Onomasemitic, a precursor to Blobl
A/Q/I -> AcroSem, becoming Vocalized Acronym Creole [qoa] with vowels in roots and more English-derived short words
H -> CRCreole, an attempt to keep Hashtalk within only 8 consonants
V -> Slavitic, a transitional language to OSloJ


Hashtalk [qht]

Basically vaporware, waiting for some hash-grinding scripts. It's supposed to correct the poetry flaw of GeSeL by rhyming completely random words. It can serve as a password hash dictionary and may be useful in cybersecurity. The only problem is that the resulting words would be too long, even for the long-broken MD5. CRCtalk [qcc] could maybe be friendlier and would fit in the GeSeL root system, but no one really cares about checksums.
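To get a feel for the word length problem, here is a minimal sketch, not the planned hash-grinding script. It spells a digest with an 8-consonant inventory (3 bits per consonant), as CRCreole would allow. The consonant set `ptkmnslr` and the function names are made up for illustration; only `hashlib.md5` and `zlib.crc32` are real library calls.

```python
import hashlib
import zlib

# Hypothetical 8-consonant inventory: each consonant carries 3 bits.
CONSONANTS = "ptkmnslr"

def to_consonants(value: int, bits: int) -> str:
    """Spell the lowest `bits` bits of `value` as consonants, most significant group first."""
    groups = -(-bits // 3)  # ceiling division: number of 3-bit groups needed
    letters = []
    for i in range(groups):
        shift = 3 * (groups - 1 - i)
        letters.append(CONSONANTS[(value >> shift) & 0b111])
    return "".join(letters)

def md5_word(text: str) -> str:
    """Consonant skeleton of the 128-bit MD5 digest of `text`."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return to_consonants(int.from_bytes(digest, "big"), 128)

def crc_word(text: str) -> str:
    """Consonant skeleton of the 32-bit CRC-32 checksum of `text`."""
    return to_consonants(zlib.crc32(text.encode("utf-8")), 32)

if __name__ == "__main__":
    for word in ("password", "hunter2"):
        print(word, len(md5_word(word)), md5_word(word))
        print(word, len(crc_word(word)), crc_word(word))
```

Under these assumptions an MD5 word needs 43 consonants before any vowels are added, while a CRC-32 word needs 11, which is closer to something pronounceable.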

Essentially, Hashtalk is a reimplementation of Ptydepe from Havel's Memorandum. GeSeL is supposed to be Chorukor. The alphabet will be the same as GeSeL. The grammar could be stolen from toki pona, with the words now being very specific and arbitrarily numerous.


Gudhotn [qgh]

A streamlined dialect and a spelling reform of Budhót'n [qbh], which is mostly a relex of Western Slavic inspired by Palaeogetmanic. It will be a subset of Neogetmanic with a Budhót'n grammar and share vocabulary with OSloJ. It won't use the GeSeL alphabet, but will share its own alphabet with Neogetmanic.

A Á B C Č D Ď Ʒ Ǯ E É F G H Ȟ I Í J K L Ĺ M N Ň O Ó P Q R Ŕ Ř S Š T Ť U Ú V W Y Ý Z Ž

Optionally Ǎ Ě Ǐ Ǒ Ǔ Y̌ Ą Ę Į Ǫ Ų Y̨ Â Ê Î Ô Û Ŷ and any combination. Maybe Ů for legacy etymology and Å Ä A̋ Ö Ő Ü Ű for loanword support and expressiveness. Acutes over sonorants and funny caron consonants may come up too in interjections.

Also the Latin reabjadification still holds. Fonts and keyboard drivers need to get their shit together and support accent stacking properly as Unicode won't be adding any precomposed letters. For some reason the accents come in writing order, not typing order.
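The precomposed letter situation can be checked with Python's standard `unicodedata` module. This is just a small sketch; the helper name `compose` is mine, and the results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

def compose(text: str) -> str:
    """Return the NFC form, which uses precomposed letters wherever Unicode has them."""
    return unicodedata.normalize("NFC", text)

# A + combining caron has a precomposed form (U+01CD), so NFC yields one code point.
a_caron = compose("A\u030C")
# Y + combining caron has no precomposed form, so it stays as two code points.
y_caron = compose("Y\u030C")

# Stacked accents: canonical ordering sorts marks by combining class, so an
# ogonek (class 202) typed after a caron (class 230) is moved in front of it.
reordered = unicodedata.normalize("NFD", "a\u030C\u0328")

if __name__ == "__main__":
    print(len(a_caron), len(y_caron))
    print([f"{ord(c):04X}" for c in reordered])
```

Marks of the same combining class (for example two accents above) keep the order in which they were entered, so rendering them correctly is up to the font.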


OSloJ ("donkeys") - Obecný Slovanský Jazyk (Generic Slavic Language) [qoj]

A more sophisticated dialect and mod of GeSeL, with a grammar vaguely resembling a Slavic one. It will have access to the same vocabulary, with a new V package with Slavic words. It will differ from Gudhotn and Neogetmanic in that it will use a GeSeL alphabet (with minor modifications) and have a considerably simpler and more mathematical grammar. It's targeted as an auxlang for a broader range of Slavs with odd interests than the basically private Western-Slav-centric Gudhotn and the incomprehensibly random artlang Neogetmanic. Like puns, Slav memes will play a great role in OSloJ, so it can be guaranteed that "čeburek" and "kompot" will mean what you expect them to mean.

C is now only TS, J is now as in IPA or Esperanto, Q W X are CH SH ZH, being roughly similar to Cyrillic Ч Ш Ж (Č Š Ž recommended if charset allows), Y is yeru or schwa. Palatalization is assimilation.
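The Q W X convention can be sketched as a simple translation table. The table and the function name are illustrative assumptions, not part of GeSeLator, and the sample strings are arbitrary.

```python
# Assumed mapping from the ASCII spelling to the recommended caron letters.
ASCII_TO_CARON = str.maketrans({
    "Q": "Č", "W": "Š", "X": "Ž",
    "q": "č", "w": "š", "x": "ž",
})

def to_caron(text: str) -> str:
    """Replace Q, W, X with Č, Š, Ž for charsets that allow it."""
    return text.translate(ASCII_TO_CARON)

if __name__ == "__main__":
    print(to_caron("qaj"))
    print(to_caron("QWX qwx"))
```

The reverse direction is the same table inverted, since the mapping is one to one.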

Grammar moves away from transfixes, every legacy root is now vocalized *a*a* or *y*y*. Short words are replaced with Slavic ones. A backwards compatible V root pack will be provided too. Origin table gets a new Osloj column for OSloJ transcriptions.

Slavitic, OSloJ, and Gudhotn form a pons asinorum between GeSeLania and Budhót'nska, without going into Neogetmanic. They are mostly the product of trying to interconnect databases. Satemic broadens the zone of OSloJ. This may have some lore implications.


GeSeL 2: Arabic Boogaloo [qga]

After acquainting myself with actual Arabic grammar, I'll stop stealing from Japanese and remake the grammar to allow writing of profound anasheed, finally fixing the poetry problem for good, and realizing the true vision of Getmarabic.

Alphabet is to remain the same as GeSeL. If I can't find sufficient regularity in Arabic morphology, Root table gets a new Class column.


Satemic - Sanskrit-inspired satem-type Indo-European creole [qst]

In descendant Indo-European languages, there are words that haven't really changed at all since Proto-Indo-European. That's why Latvian and Lithuanian are highly acclaimed by linguists, since they have preserved certain words that most other languages have replaced. There is also a significant Roma population which speaks Romani, which is related to Hindi which builds on Sanskrit which is also satem-type. Roma have nothing to do with Romans nor Romanians (whose language is centum-type), that's why we call Roma Gypsy or Zigeuner, but that's racist somehow. Slavic languages are also satem-type, which explains the odd familiarity of some Sanskrit words. So I want to take these words and modernize them according to the language they best survived in, while throwing out the fusional grammatical cruft (isolating languages are the best for learning, symbolic may be more fun) and filling the gaps with some internationally recognizable words. I suspect the Indoeuropean ablaut, that one bit of nonconcatenative morphology, may be a leftover from Nostratic superfamily or sprachbund, which also encompasses transfixive Afro-Asiatic and hyperconcatenative Altaic languages, including Japanese and Korean. And that's why I've compared Japanese to Moravian with how "ei" and "ou" are pronounced.

The centum-satemic split is said to be overcome, as there are some centum languages in satem land and vice versa, and some Indoeuropean languages are neither, and the consecutive sound changes have led to some crossing into the territory of other type, not to mention centum loanwords in satem languages undergoing their own changes. If you are very political, the art direction of Satemic can be explained as the Russo-Indian relations dating back to Soviet times when USA supported Pakistan, and the national food is curry potatoes with onion (see the recipe for KGBB). Also the Baltics, Afghanistan, and Sikkim are occupied, the Eastern Bloc is active, so Satemia is a continuous "Russkij Mir" territory, and there's Roma infestation. Quite a stretch for a zonal constructed language, but Indoeuropean words are mostly cognate.


Inglish [qig]

So far only a spelling reform for more convenient speech synthesis exists. I'm not sure if I am truly bilingual, so I don't have any plans to alter the grammar beyond meme speak. Furthermore, English is already a perfectly working auxlang, and in the few cases it's not, Russian, German, French, Spanish or Chinese usually work. I may maintain a list of some silly puns. Also I have tried making an emoji regraphemization of Shavian and Deseret.

Th Inglish vawl sistm is browkn es e rizalt of th Greyt Vawl Shift, wich komplikeyts jenereyting korekt sawnds bay simpli konketeneyting th weyvform primitivs ekording tu th lettrs in th tekst. This sistm srrteynli hes e bayes tuwords Chek eksnt. If yuu tray tu peyst this in Guugl Trensleyt to riid it, it will hev trabl with silebik konsonents.


Blobl - A completely original conlang for once [qlb]

A lot of complaints can be made about copying existing vocabulary and grammar, as well as using mainly Latin script. So this conlang will bring onomatopoeic vocabulary primitives giving way to coded classification, old GeSeL grammar hacks, 2nd order modalized predicate logic, base 16 number system, a featural script, nonconcatenative nonarabic morphology, and unusual phonemes, mostly from the unused spaces of Neogetmanic alphabet. The name "Blobl" is provisional. Bl as in blah, twice for plural, o for good.


Toki pona pakala pi toki Nipon [qtk]

Toki Pona, but with Japanese grammar and phonology, as well as some Finnish agglutination, maybe extra unneeded words because why not. Possibly a learning aid to practice agglutinative grammars. Taso. Ponamutemute.



2021-11-02

Sealing up the Basic Multilingual Plane - List and Countdown

With the release of Unicode 14, allocating Arabic Extended-B, there is now only a single unallocated 16-codepoint block between Kangxi Radicals and Ideographic Description Characters in the original Basic Multilingual Plane, seemingly intended for more IDCs. There are however some gaps in existing blocks. This article will list them all, speculate about their possible purpose, and suggest which characters to fill these gaps with. As per the Unicode Stability Policy, code point assignments are immutable, so once the last code point is filled, the BMP will be forever sealed with all its imperfections, and as a matter of compatibility, will not go away. By no means is this an official proposal, but font vendors agreeing is sufficient for a de facto extension. It has been said on the mailing list that the BMP is not a penny box to have every slot filled, but there is still a considerable amount of legacy software limited to pure 16 bits (like even Windows 11 charmap.exe), so the BMP gaps are prime real estate. The average corporate Windows user doesn't know about BabelMap (RIP Andrew West, now only Unicode 17.0β), and the EnableHexNumpad registry key is off by default. The SMP content can't be crammed into the BMP PUA anymore.
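Why the 16-bit limit matters can be shown in a few lines. This is a sketch with helper names of my own choosing: a BMP character is one UTF-16 code unit, while anything in the SMP needs a surrogate pair, which pure 16-bit software can't treat as a single character.

```python
def is_bmp(code_point: int) -> bool:
    """True if the code point lies in the Basic Multilingual Plane (U+0000..U+FFFF)."""
    return 0x0000 <= code_point <= 0xFFFF

def utf16_units(char: str) -> int:
    """Number of 16-bit code units needed to store a single character in UTF-16."""
    return len(char.encode("utf-16-be")) // 2

if __name__ == "__main__":
    for cp in (0x0CF3, 0x2FFF, 0x1B2C7):
        print(f"U+{cp:04X}", is_bmp(cp), utf16_units(chr(cp)))
```

The first two sample code points are from the BMP additions listed in this article, the third is a Nushu character from the SMP.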


Unicode 14 added around 130 characters to the BMP, and while this list was being worked on, Unicode 15 added 2 BMP characters:

0CF3    ೳ   KANNADA SIGN ANUSVARA ABOVE RIGHT

0ECE    ໎   LAO YAMAKKAN


Unicode 15.1 added 5 BMP characters:

2FFC   ⿼   IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM RIGHT

2FFD   ⿽   IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM LOWER RIGHT

2FFE   ⿾   IDEOGRAPHIC DESCRIPTION CHARACTER HORIZONTAL REFLECTION

2FFF   ⿿   IDEOGRAPHIC DESCRIPTION CHARACTER ROTATION

31EF   ㇯   IDEOGRAPHIC DESCRIPTION CHARACTER SUBTRACTION


Unicode 16 added another 17 BMP characters:

0897   ࢗ   ARABIC PEPET

1B4E   ᭎   BALINESE INVERTED CARIK SIKI

1B4F   ᭏   BALINESE INVERTED CARIK PAREREN

1B7F   ᭿   BALINESE PANTI BAWAK

1C89   Ᲊ   CYRILLIC CAPITAL LETTER TJE

1C8A   ᲊ   CYRILLIC SMALL LETTER TJE

2427   ␧   SYMBOL FOR DELETE SQUARE CHECKER BOARD FORM

2428   ␨   SYMBOL FOR DELETE RECTANGULAR CHECKER BOARD FORM

2429   ␩   SYMBOL FOR DELETE MEDIUM SHADE FORM

31E4   ㇤   CJK STROKE HXG

31E5   ㇥   CJK STROKE SZP

A7CB   Ɤ   LATIN CAPITAL LETTER RAMS HORN

A7CC   Ꟍ   LATIN CAPITAL LETTER S WITH DIAGONAL STROKE

A7CD   ꟍ   LATIN SMALL LETTER S WITH DIAGONAL STROKE

A7DA   Ꟛ   LATIN CAPITAL LETTER LAMBDA

A7DB   ꟛ   LATIN SMALL LETTER LAMBDA

A7DC   Ƛ   LATIN CAPITAL LETTER LAMBDA WITH STROKE


Unicode 17 added these 62 BMP characters:

088F   ࢏   ARABIC LETTER NOON WITH RING ABOVE

0C5C   ౜   TELUGU ARCHAIC SHRII

0CDC   ೜   KANNADA ARCHAIC SHRII

1ACF   ᫏   COMBINING DOUBLE CARON

1AD0   ᫐   COMBINING VERTICAL-LINE-ACUTE

1AD1   ᫑   COMBINING GRAVE-VERTICAL-LINE

1AD2   ᫒   COMBINING VERTICAL-LINE-GRAVE

1AD3   ᫓   COMBINING ACUTE-VERTICAL-LINE

1AD4   ᫔   COMBINING VERTICAL-LINE-MACRON

1AD5   ᫕   COMBINING MACRON-VERTICAL-LINE

1AD6   ᫖   COMBINING VERTICAL-LINE-ACUTE-GRAVE

1AD7   ᫗   COMBINING VERTICAL-LINE-GRAVE-ACUTE

1AD8   ᫘   COMBINING MACRON-ACUTE-GRAVE

1AD9   ᫙   COMBINING SHARP SIGN

1ADA   ᫚   COMBINING FLAT SIGN

1ADB   ᫛   COMBINING DOWN TACK ABOVE

1ADC   ᫜   COMBINING DIAERESIS WITH RAISED LEFT DOT

1ADD   ᫝   COMBINING DOT-AND-RING BELOW

1AE0   ᫠   COMBINING LEFT TACK ABOVE

1AE1   ᫡   COMBINING RIGHT TACK ABOVE

1AE2   ᫢   COMBINING MINUS SIGN ABOVE

1AE3   ᫣   COMBINING INVERTED BRIDGE ABOVE

1AE4   ᫤   COMBINING SQUARE ABOVE

1AE5   ᫥   COMBINING SEAGULL ABOVE

1AE6   ᫦   COMBINING DOUBLE ARCH BELOW

1AE7   ᫧   COMBINING DOUBLE ARCH ABOVE

1AE8   ᫨   COMBINING EQUALS SIGN ABOVE

1AE9   ᫩   COMBINING LEFT ANGLE CENTRED ABOVE

1AEA   ᫪   COMBINING UPWARDS ARROW ABOVE

1AEB   ᫫   COMBINING DOUBLE RIGHTWARDS ARROW ABOVE

20C1   ⃁   SAUDI RIYAL SIGN

2B96   ⮖   EQUALS SIGN WITH INFINITY ABOVE

A7CE   ꟎   LATIN CAPITAL LETTER PHARYNGEAL VOICED FRICATIVE

A7CF   ꟏   LATIN SMALL LETTER PHARYNGEAL VOICED FRICATIVE

A7D2   ꟒   LATIN CAPITAL LETTER DOUBLE THORN

A7D4   ꟔   LATIN CAPITAL LETTER DOUBLE WYNN

A7F1   ꟱   MODIFIER LETTER CAPITAL S

FBC3   ﯃   ARABIC LIGATURE JALLA WA-ALAA

FBC4   ﯄   ARABIC LIGATURE DAAMAT BARAKAATUHUM

FBC5   ﯅   ARABIC LIGATURE RAHMATU ALLAAHI TAAALAA ALAYH

FBC6   ﯆   ARABIC LIGATURE RAHMATU ALLAAHI ALAYHIM

FBC7   ﯇   ARABIC LIGATURE RAHMATU ALLAAHI ALAYHIMAA

FBC8   ﯈   ARABIC LIGATURE RAHIMAHUM ALLAAHU TAAALAA

FBC9   ﯉   ARABIC LIGATURE RAHIMAHUMAA ALLAAH

FBCA   ﯊   ARABIC LIGATURE RAHIMAHUMAA ALLAAHU TAAALAA

FBCB   ﯋   ARABIC LIGATURE RADI ALLAAHU TAAALAA ANHUM

FBCC   ﯌   ARABIC LIGATURE HAFIZAHU ALLAAH

FBCD   ﯍   ARABIC LIGATURE HAFIZAHU ALLAAHU TAAALAA

FBCE   ﯎   ARABIC LIGATURE HAFIZAHUM ALLAAHU TAAALAA

FBCF   ﯏   ARABIC LIGATURE HAFIZAHUMAA ALLAAHU TAAALAA

FBD0   ﯐   ARABIC LIGATURE SALLALLAAHU TAAALAA ALAYHI WA-SALLAM

FBD1   ﯑   ARABIC LIGATURE AJJAL ALLAAHU FARAJAHU ASH-SHAREEF

FBD2   ﯒   ARABIC LIGATURE ALAYHI AR-RAHMAH

FD90   ﶐   ARABIC LIGATURE RAHMATU ALLAAHI ALAYH

FD91   ﶑   ARABIC LIGATURE RAHMATU ALLAAHI ALAYHAA

FDC8   ﷈   ARABIC LIGATURE RAHIMAHU ALLAAH TAAALAA

FDC9   ﷉   ARABIC LIGATURE RADI ALLAAHU TAAALAA ANH

FDCA   ﷊   ARABIC LIGATURE RADI ALLAAHU TAAALAA ANHAA

FDCB   ﷋   ARABIC LIGATURE RADI ALLAAHU TAAALAA ANHUMAA

FDCC   ﷌   ARABIC LIGATURE SALLALLAHU ALAYHI WA-ALAA AALIHEE WA-SALLAM

FDCD   ﷍   ARABIC LIGATURE AJJAL ALLAAHU TAAALAA FARAJAHU ASH-SHAREEF

FDCE   ﷎   ARABIC LIGATURE KARRAMA ALLAAHU WAJHAH


Unicode 18 will add these 31 BMP characters:

0558   ՘   MODIFIER LETTER ARMENIAN SMALL EH

058B   ֋   MODIFIER LETTER ARMENIAN SMALL INI

058C   ֌   MODIFIER LETTER ARMENIAN SMALL YI

05C8   ׈   HEBREW POINT SHEVA NA MUDGASH

05C9   ׉   HEBREW POINT DAGESH HAZAQ MUDGASH 

0984   ঄   BENGALI SIGN COMBINING ANUSVARA ABOVE

09FF   ৿   BENGALI LETTER SANSKRIT BA

0B53   ୓   ORIYA SIGN DOT ABOVE

0B54   ୔   ORIYA SIGN DOUBLE DOT ABOVE

1ADE   ᫞   COMBINING GRAVE-DOT

1ADF   ᫟   COMBINING DOT-ACUTE

1AEC   ᫬   COMBINING CARON-ACUTE

1AED   ᫭   COMBINING VERTICAL-LINE-DOUBLE-ACUTE

1AEE   ᫮   COMBINING DOUBLE GRAVE ACCENT BELOW

1AEF   ᫯   COMBINING DOUBLE ACUTE ACCENT BELOW

1AF0   ᫰   COMBINING DOUBLE COMMA ABOVE

208F   ₏   MODIFIER LETTER HIGH AND LOW VERTICAL LINE

209D   ₝   LATIN SUBSCRIPT SMALL W

209E   ₞   LATIN SUBSCRIPT SMALL Y

209F   ₟   LATIN SUBSCRIPT SMALL Z

20C2   ⃂   RUFIYAA SIGN

20C3   ⃃   UAE DIRHAM SIGN

20C4   ⃄   OMANI RIAL SIGN 

2E60   ⹠   WIGGLY EXCLAMATION MARK

2E61   ⹡   INVERTED WIGGLY EXCLAMATION MARK

2E62   ⹢   LEFT PARENTHESIS WITH MIDDLE RING

2E63   ⹣   RIGHT PARENTHESIS WITH MIDDLE RING

A7DD   ꟝   LATIN CAPITAL LETTER CLOSED OMEGA

A7E2   ꟢   LATIN CAPITAL LETTER R WITH LONG LEG

AB6C   ꭬   LATIN CAPITAL LETTER SCRIPT R

AB6D   ꭭   LATIN CAPITAL LETTER SCRIPT R WITH RING

 

And maybe Unicode 19 will add these 9 BMP characters:

0C70   ౰   TELUGU SIGN SPACING CANDRABINDU

1879   ᡹   MONGOLIAN LETTER ALTERNATE UE

1AF1   ᫱   COMBINING GRAVE-ACUTE-MACRON

1AF2   ᫲   COMBINING INVERTED LAZY S ABOVE

1AF3   ᫳   COMBINING COMMA ABOVE AND ACUTE

1C8B   ᲋   CYRILLIC SMALL LETTER YERU WITH CONNECTING BAR

20C5   ⃅   BELARUSSIAN RUBLE SYMBOL

20CF   ⃏   RUBLE SIGN WITH DOUBLE VERTICAL STEM

AB6E   ꭮   LATIN CAPITAL LETTER U WITH LEFT HOOK



Greek (9):

0378   ͸   GREEK CAPITAL LETTER ANTISIGMA

0379   ͹   GREEK SMALL LETTER ANTISIGMA (also final, also fit 3legged tau)

0380   ΀   GREEK CAPITAL LETTER IOTA WITH DIALYTIKA AND TONOS

0381   ΁   GREEK CAPITAL LETTER UPSILON WITH DIALYTIKA AND TONOS

0382   ΂   GREEK SMALL LETTER LAMDA WITH TONOS

0383   ΃   GREEK SMALL LETTER RHO WITH TONOS

038B   ΋   GREEK CAPITAL LETTER LAMDA WITH TONOS

038D   ΍   GREEK CAPITAL LETTER RHO WITH TONOS

03A2   ΢   GREEK CAPITAL LETTER FINAL SIGMA


There's been a gap left at the beginning of this block so that the script could begin at a multiple of 80h or 128. Then there were and still are gaps in the capital letters with tonos. Over time, some obscure characters and punctuation were added. There is a blank spot between Ρ and Σ, where the small letters have ς. Since ß got its capital ẞ for all-caps signage despite being originally a ligature of small letters (ſʒ or ſs), and thus having always been written SS in capital form, the ς deserves its capital too.
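The gaps in a block can be enumerated with the standard `unicodedata` module, where unassigned code points have the general category `Cn`. A minimal sketch follows; the function name is mine, and the output reflects whatever Unicode version the Python build bundles (see `unicodedata.unidata_version`), so it can lag behind the lists in this article.

```python
import unicodedata

def unassigned(first: int, last: int) -> list[int]:
    """Code points in first..last (inclusive) that the bundled Unicode data leaves unassigned."""
    return [cp for cp in range(first, last + 1)
            if unicodedata.category(chr(cp)) == "Cn"]

# Greek and Coptic block, U+0370..U+03FF.
greek_gaps = unassigned(0x0370, 0x03FF)

if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    print(" ".join(f"{cp:04X}" for cp in greek_gaps))
```

The same call with other block boundaries gives the raw gap lists for the sections below.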


Armenian (2):

0530   ԰   ARMENIAN CAPITAL LETTER TURNED AYB

0557   ՗   ARMENIAN CAPITAL LIGATURE ECH YIWN


These 3 gaps in capitals have their lowercase analogues assigned, so naturally they belong to the uppercase variants. Nagorno-Karabakh AKA Artsakh is gone, so no new currency. Of these 3 gaps, 0558 has been assigned some dialectological letter instead of a capital variant.


Hebrew (22):

0590   ֐   HEBREW SPACE

05CA   ׊   BABYLONIAN POINT TSERE / HEBREW PUNCTUATION ALTERNATE PASEQ

05CB   ׋   BABYLONIAN POINT HIRIQ

05CC   ׌   BABYLONIAN POINT HOLAM

05CD   ׍   BABYLONIAN POINT QUBUTS

05CE   ׎   BABYLONIAN POINT HITFA

05CF   ׏   BABYLONIAN POINT SEGOL

05EB   ׫   Lost letter or ligature

05EC   ׬   Lost letter or ligature

05ED   ׭   Lost letter or ligature

05EE   ׮   Lost letter or ligature

05F5   ׵   PALESTINIAN POINT PATAH / HEBREW PUNCTUATION ELONGATED GERESH / HEBREW POINT VARIKA [1.0.1]

05F6   ׶   PALESTINIAN POINT QAMATS

05F7   ׷   PALESTINIAN POINT TSERE / HEBREW LETTER CONNECTED QOF

05F8   ׸   PALESTINIAN POINT HIRIQ

05F9   ׹   PALESTINIAN POINT HOLAM

05FA   ׺   PALESTINIAN POINT SEGOL

05FB   ׻   PALESTINIAN POINT QUBUTS

05FC   ׼   BABYLONIAN POINT DOTTED PATAH

05FD   ׽   BABYLONIAN POINT DOTTED QAMATS

05FE   ׾   BABYLONIAN POINT DOTTED QUBUTS

05FF   ׿   HEBREW LETTER WIDENER


Alternate vocalization marks are proposed, overflowing to SMP. Especially: PALESTINIAN POINT RAFE, PALESTINIAN POINT DAGESH, BABYLONIAN POINT DIGSHA, BABYLONIAN POINT MAPIQ, BABYLONIAN AND PALESTINIAN POINT SIN MARK, BABYLONIAN POINT QIFYA, BABYLONIAN AND PALESTINIAN POINT SHIN MARK. This block is unusually sparsely assigned for being located here.


Syriac (3):

070E   ܎   Punctuation

074B   ݋   Diacritic or extra letter

074C   ݌   Diacritic or extra letter


Thaana (14):

07B2   ޲   Extra letter or punctuation

07B3   ޳   Extra letter or punctuation

07B4   ޴   Vowel sign (Å)

07B5   ޵   Vowel sign (ÅÅ)

07B6   ޶   Vowel sign (Ä)

07B7   ޷   Vowel sign (ÄÄ)

07B8   ޸   Vowel sign (Y)

07B9   ޹   Vowel sign (YY)

07BA   ޺   Vowel sign (Ö)

07BB   ޻   Vowel sign (ÖÖ)

07BC   ޼   Vowel sign (Ü)

07BD   ޽   Vowel sign (ÜÜ)

07BE   ޾   Virama

07BF   ޿   THAANA SPACE


Clearly a hacked abjad, however the letter shapes used to be 2 sets of numbers. Sort of like old ciphers, such as 1312 for ACAB or 88 for HH or 241913 for BDSM. Might need some more vowel signs. Not every language is happy with just 5 short and 5 long vowels. Also the existing vowel sign character names show that long English vowels are broken.


N'Ko (2):

07FB   ߻   NKO SHORT LAJANYALAN / Some punctuation or tone mark

07FC   ߼   Another punctuation or tone mark


Possibly a new currency sign for RTL writing (just banana republic things), punctuation, letters, or tone marks for a newly-adopting language. If none, then at least duodecimal digits. Some phonetic extensions were proposed for the SMP.


Samaritan (3):

082E   ࠮   Some vowel sign or modifier

082F   ࠯   Another vowel sign or modifier

083F   ࠿   SAMARITAN SPACE


Were it not used for religious purposes like Glagolitic, it would end up in the SMP along with Phoenician.


Mandaic (3):

085C   ࡜   Some mark

085D   ࡝   Another mark

085F   ࡟   MANDAIC SPACE


Syriac Supplement (5):

086B   ࡫   Some extension letter

086C   ࡬   Some extension letter

086D   ࡭   Some extension letter

086E   ࡮   Some extension letter

086F   ࡯   SYRIAC SPACE


Christian Sogdian is said to be unified with Syriac. In this block, there are some Malayalam extension letters. So if any other script is to be unified with Syriac, there are 3 free spaces in the original block and 5 more in this supplement.


Arabic Extended B (5):

0892   ࢒   Mid-level Hamzah [nonapproved]

0893   ࢓   Sindhi heh 

0894   ࢔   Kurdish heh

0895   ࢕   ARABIC DAMMA BELOW

0896   ࢖   ARABIC VOWEL SIGN SMALL V BELOW


The last few free codepoints in BMP Arabic blocks are likely for modern use case omissions, the SMP Arabic Extended C seems to be the default place to go. Not sure why Wolio went with Vs instead of damma below for O (called low waw) and sukun below for E.

There has been a proposal complaining about the letters connecting between words when not using spaces, and not connecting to a final presentation form, trying to invent a more explicit form of the final form called closing form. This looks like a job for Zero Width Non-Joiner or a Zero Width Space. 


Bengali (30):

঍঎঑঒঩঱঳঴঵঺঻৅৆৉৊৏৐৑৒৓৔৕৖৘৙৚৛৞৤৥

 

Gurmukhi (48):

਀਄਋਌਍਎਑਒਩਱਴਷਺਻਽੃੄੅੆੉੊੎੏੐੒੓੔੕੖੗੘੝੟੠੡੢੣੤੥੷੸੹੺੻੼੽੾੿

 

Gujarati (37):

઀઄઎઒઩઱઴઺઻૆૊૎૏૑૒૓૔૕૖૗૘૙૚૛૜૝૞૟૤૥૲૳૴૵૶૷૸

 

Oriya (35):

଀଄଍଎଑଒଩଱଴଺଻୅୆୉୊୎୏୐୑୒୘୙୚୛୞୤୥୸୹୺୻୼୽୾୿

 

Tamil (56):

஀஁஄஋஌஍஑஖஗஘஛஝஠஡஢஥஦஧஫஬஭஺஻஼஽௃௄௅௉௎௏௑௒௓௔௕௖௘௙௚௛௜௝௞௟௠௡௢௣௤௥௻௼௽௾௿


Telugu (26):

఍఑఩఺఻౅౉౎౏౐౑౒౓౔౗౛౞౟౤౥౱౲౳౴౵౶

 

Kannada (36):

಍಑಩಴಺಻೅೉೎೏೐೑೒೓೔೗೘೙೚೛೟೤೥೰೴೵೶೷೸೹೺೻೼೽೾೿

 

Malayalam (10):

഍഑൅൉   CANDRA E, CANDRA O, VOWEL SIGN CANDRA E, VOWEL SIGN CANDRA O

0D50   ൐   MALAYALAM CONSONANT SIGN CHILLU [nonapproved] / KETTI AEDA-PILLA

൑൒൓൤൥   DIGA AEDA-PILLA, KETTI IS-PILLA, DIGA IS-PILLA, KETTI PAA-PILLA / UTADA, ANUTADA, GRAVE, ACUTE

 

Sinhala (37):

඀඄඗඘඙඲඼඾඿෇෈෉෋෌෍෎෕෗෠෡෢෣෤෥෰෱෵෶෷෸෹෺෻෼෽෾෿


Indic scripts have gaps where full fat Devanagari has that character, a leftover from ISCII, which was published just as Unicode was being proposed. The inverse isn't always true as the gaps were filled independently. Each script also has a 16-character space for its very own symbols. Tamil has been denied a reencoding request to atomic syllables, so this form is doomed to remain in the PUA at E100h to E32Fh, which collides with some (U)CSUR scripts. Despite having the most gaps, there is a Tamil Supplement block in SMP with 51 out of 64 codepoints full. What were they thinking at the UTC I know not.
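The ISCII alignment can be sketched by comparing each block with Devanagari at the same offset. The function names are mine, and the result depends on the Unicode version bundled with Python.

```python
import unicodedata

DEVANAGARI_BASE = 0x0900  # the other Indic blocks follow in steps of 0x80

def is_assigned(code_point: int) -> bool:
    """True unless the bundled Unicode data marks the code point as unassigned (Cn)."""
    return unicodedata.category(chr(code_point)) != "Cn"

def iscii_style_gaps(block_base: int, size: int = 0x80) -> list[int]:
    """Unassigned code points whose Devanagari counterpart at the same offset is assigned."""
    return [block_base + offset for offset in range(size)
            if is_assigned(DEVANAGARI_BASE + offset)
            and not is_assigned(block_base + offset)]

if __name__ == "__main__":
    # Same offset, same letter: KA sits at offset 0x15 in both blocks.
    print(unicodedata.name(chr(0x0915)), "/", unicodedata.name(chr(0x0995)))
    print(" ".join(f"{cp:04X}" for cp in iscii_style_gaps(0x0980)))  # Bengali
```

Passing 0x0A00, 0x0A80, 0x0B00 and so on as `block_base` gives the analogous lists for Gurmukhi, Gujarati, Oriya and the rest.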


Thai (9+32):

0E00   ฀   THAI MORPHOLOGICAL BOUNDARY

0E3B   ฻   THAI VOWEL SIGN MAI KON

0E3C   ฼   THAI SEMIVOWEL SIGN LO

0E3D   ฽   THAI SEMIVOWEL SIGN NYO

0E3E   ฾   Some vowel sign or punctuation or currency sign

0E5C   ๜   THAI HO NO

0E5D   ๝   THAI HO MO

0E5E   ๞   THAI KHMU GO or duodecimal digit

0E5F   ๟   THAI KHMU NYO or duodecimal digit

0E60..0E6F   ๠ ๡ ๢ ๣ ๤ ๥ ๦ ๧ ๨ ๩ ๪ ๫ ๬ ๭ ๮ ๯

0E70   ๰   THAI PHONETIC ORDER VOWEL SIGN SARA E [1.0.1]

0E71   ๱   THAI PHONETIC ORDER VOWEL SIGN SARA AE [1.0.1]

0E72   ๲   THAI PHONETIC ORDER VOWEL SIGN SARA O [1.0.1]

0E73   ๳   THAI PHONETIC ORDER VOWEL SIGN SARA MAI MUAN [1.0.1]

0E74   ๴   THAI PHONETIC ORDER VOWEL SIGN SARA MAI MALAI [1.0.1]

0E75..0E7F   ๵ ๶ ๷ ๸ ๹ ๺ ๻ ๼ ๽ ๾ ๿


Nothing has been added since Unicode 1.0.0 and there are no non-approval notices either. 2 blocks of 16 wasted, but the 2nd one used to have phonetic order vowel signs. There seem to be some concerns about line breaking, which requires morphological analysis as Thai doesn't use spaces. Idon'tunderstandhowanyonecankeepwritinglikethis.


Lao (13+32):

0E80   ຀   LAO MORPHOLOGICAL BOUNDARY

0E83   ຃   LAO LETTER KHO KHUAT

0E85   ຅   LAO LETTER KHO KHON

0E8B   ຋   LAO LETTER SO SO

0EA4   ຤   LAO LETTER RU

0EA6   ຦   LAO LETTER LU

0EBE   ຾   Some vowel sign or punctuation or currency sign

0EBF   ຿   Currency sign

0EC5   ໅   LAO LAKKHANGYAO

0EC7   ໇   LAO MAITAIKHU

0ECF   ໏   LAO FONGMAN

0EDA   ໚   LAO ANGKHANKHU

0EDB   ໛   LAO KHOMUT

0EE0..0EEF   ໠ ໡ ໢ ໣ ໤ ໥ ໦ ໧ ໨ ໩ ໪ ໫ ໬ ໭ ໮ ໯

0EF0   ໰   LAO PHONETIC ORDER VOWEL SIGN E [1.0.1]

0EF1   ໱   LAO PHONETIC ORDER VOWEL SIGN EI [1.0.1]

0EF2   ໲   LAO PHONETIC ORDER VOWEL SIGN O [1.0.1]

0EF3   ໳   LAO PHONETIC ORDER VOWEL SIGN AY [1.0.1]

0EF4   ໴   LAO PHONETIC ORDER VOWEL SIGN AI [1.0.1]

0EF5..0EFF   ໵ ໶ ໷ ໸ ໹ ໺ ໻ ໼ ໽ ໾ ໿


Some letters for Sanskrit and Pali were added, and later Khmu Go and Khmu Nyo. Gaps seem to line up with Thai. 0E3E or 0EBF would have been a good place for the Bitcoin sign, considering people were misusing ฿. 2 blocks of 16 wasted again, but the 2nd one used to have phonetic order vowel signs too.


Tibetan (13+32):

0F48   ཈   A companion to Nya (Lya?)

0F6D   ཭   Some letter (Dya?)

0F6E   ཮   Another letter or punctuation (Tya?)

0F6F   ཯   Yet another letter or punctuation (subjoined Tya?)

0F70   ཰   Some vowel sign (schwa?)

0F98   ྘   Subjoined companion to Nya (Lya?)

0FBD   ྽   Some subjoined letter (Dya?)

0FCD   ࿍   ying-yeng-yang-yong-yung (pentagrammical)

0FDB   ࿛   RIGHT-FACING SVASTI SIGN ROTATED 45 DEGREES

0FDC   ࿜   LEFT-FACING SVASTI SIGN ROTATED 45 DEGREES

0FDD   ࿝   RIGHT-FACING SVASTI SIGN WITH DOTS ROTATED 45 DEGREES

0FDE   ࿞   LEFT-FACING SVASTI SIGN WITH DOTS ROTATED 45 DEGREES

0FDF   ࿟   TIBETAN SPACE (or 6-yingyang)

0FE0..0FFF   ࿠࿡࿢࿣࿤࿥࿦࿧࿨࿩࿪࿫࿬࿭࿮࿯࿰࿱࿲࿳࿴࿵࿶࿷࿸࿹࿺࿻࿼࿽࿾࿿


Was reencoded in Unicode 2.0. Has swastikas, or Hakenkreuze if you are miseducated about their historical origins. Nazis on 4chan would definitely make use of one rotated 45 degrees, maybe even in a combining circle. For Easterners, these continue to be symbols of peace. Nushu script in SMP contains 𛋇 and 𛈄. The Tibetan block also contains the BDSM symbol ࿋. Typically it's mirrored, but one has to reckon with looking at oneself in the mirror, and also some cameras mirror the image. 2 blocks of 16 wasted yet again, for a total of 96 codepoints, enough for a new script, or 3 Philippinic scripts, 1 per each wastage.


Georgian (8):

10C6   ჆   GEORGIAN CAPITAL LETTER FI

10C8   ჈   GEORGIAN CAPITAL LETTER ELIFI

10C9   ჉   GEORGIAN CAPITAL LETTER TURNED GAN

10CA   ჊   GEORGIAN CAPITAL LETTER AIN

10CB   ჋   Some text separator / GEORGIAN LETTER U-BRJGU [nonapproved]

10CC   ჌   MODIFIER GEORGIAN CAPITAL LETTER NAR

10CE   ჎   GEORGIAN CAPITAL LETTER HARD SIGN

10CF   ჏   GEORGIAN CAPITAL LETTER LABIAL SIGN


Some capital letters are missing. There is also a 3rd and 4th case somewhere else. Actually it seems more like 2 scripts with 2 cases each. Considering case is not a typical feature on a global scale, Georgian is just mental.


Ethiopic (26):

቉ ቎ ቏ ቗ ቙ ቞ ቟ ኉ ኎ ኏ ኱ ኶ ኷ ኿ ዁ ዆ ዇ ዗ ጑ ጖ ጗

135B   ፛   ETHIOPIC SYLLABLE -YA / Some combining mark

135C   ፜   ETHIOPIC SYLLABLE -YA / Some combining mark

137D   ፽   ETHIOPIC NUMBER MILLION

137E   ፾   ETHIOPIC NUMBER HUNDRED MILLION

137F   ፿   ETHIOPIC NUMBER TEN MILLIARD


Mostly syllabic gaps. A-U-I-AA-EE-E-O-OA/WA(A). Could have used the space in the supplement blocks instead of spreading Ethiopic across planes.


Ethiopic Supplement (6):

139A   ᎚   Tone mark

139B   ᎛   Tone mark

139C   ᎜   Tone mark

139D   ᎝   Tone mark

139E   ᎞   Tone mark or punctuation

139F   ᎟   ETHIOPIC SPACE


Cherokee (4):

13F6   ᏶   Some letter

13F7   ᏷   Some punctuation

13FE   ᏾   Some small letter

13FF   ᏿   CHEROKEE SPACE


Apparently V is syllabic. Looks like a hybrid between Lisu and Deseret. Small letters are in Cherokee Supplement, but that block is full.


Ogham (3):

169D   ᚝   OGHAM ARROW MARK

169E   ᚞   OGHAM REVERSED ARROW MARK

169F   ᚟   OGHAM OUTER SPACE


If the lines look like >-----<, then why not add <----->?


Runic (7):

16F9   ᛹   Some letter

16FA   ᛺   Another letter

16FB   ᛻   Yet another letter

16FC   ᛼   Some punctuation

16FD   ᛽   Another punctuation

16FE   ᛾   SCHUTZSTAFFEL SIGN

16FF   ᛿   RUNIC SPACE


We need to capitalize on 4chan Nazis who will decorate their general threads with Nazi insignia, so the ᛋᛋ sign goes well with more Swastikas in Tibetan. May also throw in symbols of other "forbidden" "unconstitutional" organizations in Germany. How long before a single letter (not just 2) is declared a "hate symbol" and removed? Wait, that happened on Twitch already with D. And the German ban extended to ᛟ as well. And also some people were considering banning Z and V due to Ruᛋᛋia doing Mongol shit in Ukraine, painting A,Z,O,V,X on their (mostly former) vehicles. Azov logo is Ƶ, by the way, which some pointed to me looks like 3/4-swastika. As we say in Czechia, forbidden fruit tastes the best. Banilingus people are either kinky or don't understand brat logic.


Tagalog (9):

1716..171E   ᜖ ᜗ ᜘ ᜙ ᜚ ᜛ ᜜ ᜝ ᜞


Also called Baybayin.


Hanunóo (9):

1737   ᜷   HANUNOO SIGN VIRAMA

1738..173F   ᜸ ᜹ ᜺ ᜻ ᜼ ᜽ ᜾ ᜿


Buhid (12):

1754   ᝔   BUHID SIGN VIRAMA

1755   ᝕   BUHID SIGN PAMUDPOD

1756..175F   ᝖ ᝗ ᝘ ᝙ ᝚ ᝛ ᝜ ᝝ ᝞ ᝟


Tagbanwa (14):

176D   ᝭   TAGBANWA LETTER RA

1771   ᝱   TAGBANWA LETTER HA

1774   ᝴   TAGBANWA SIGN VIRAMA

1775   ᝵   TAGBANWA SIGN PAMUDPOD

1776..177F  ᝶ ᝷ ᝸ ᝹ ᝺ ᝻ ᝼ ᝽ ᝾ ᝿


Philippine scripts work the same way as Indic scripts, but each script now requires only two 16-codepoint blocks, with a lot of room to spare for extension. Only 18 phonemes is abysmally few; English has about 40. Maybe good scripts for toki pona, but they need 2 more vowels.


Khmer (14):

17DE   ៞   KHMER SIGN CANDRABINDU

17DF   ៟   KHMER SIGN LAAK [nonapproved] (originally for U+17DD)

17EA..17EF   ៪ ៫ ៬ ៭ ៮ ៯   KHMER HEXADECIMAL DIGIT TEN..FIFTEEN

17FA   ៺   Some symbol

17FB   ៻   Some symbol

17FC   ៼   Some symbol

17FD   ៽   Some symbol

17FE   ៾   Some symbol

17FF   ៿   KHMER SPACE


We need more hexadecimal digit systems in this age of binary computers. Nyström might have been trolling the decimal committee, but had the intuition that base-16 would be very useful. Khmer is said to have the most letters, so why not add some digits to the set of boring 10?


Mongolian (17):

181A..181F   ᠚ ᠛ ᠜ ᠝ ᠞ ᠟   Birga variations or hexadecimal digits

187A..187F   ᡺ ᡻ ᡼ ᡽ ᡾ ᡿   Extra letters

18AB..18AE   ᢫ ᢬ ᢭ ᢮   Extra letters

18AF   ᢯   MONGOLIAN SPACE


Free Variation Selector 4 was needed to get rid of a nasty M$ hack with ᠀ and a Zero Width Non-joiner. Some fonts may place the premade glyphs behind the digits.


Unified Canadian Aboriginal Syllabics Extended (10):

18F6   ᣶   Something that looks like O

18F7   ᣷   Something that looks like I (wide spacing)

18F8   ᣸   Something that resembles 6 more closely

18F9   ᣹   Something that resembles 9 more closely

18FA   ᣺   Something that resembles ð more closely

18FB   ᣻   Something that resembles e more closely

18FC   ᣼   open box up

18FD   ᣽   open box right

18FE   ᣾   open box down

18FF   ᣿   open box left


Here I need the missing shapes for my Anti-Dyslexic Pigpen-Moon Style font. Some (И/N/Z/S) have been added to the UCAS Extended Additional block in the SMP. I can't find certain shape rotations without any dots, they must be hidden somewhere. Also combining dot would be helpful. There should probably be a Pigpen block in the SMP or UCSUR.


Limbu (12):

191F   ᤟   LIMBU SPACE

192C..192F   ᤬ ᤭ ᤮ ᤯

193C..193F   ᤼ ᤽ ᤾ ᤿

1941..1943   ᥁ ᥂ ᥃


Tai Le (13):

196E   ᥮

196F   ᥯

1975..197E   ᥵ ᥶ ᥷ ᥸ ᥹ ᥺ ᥻ ᥼ ᥽ ᥾

197F   ᥿   TAI LE SPACE


New Tai Lue (13):

19AC..19AF   ᦬ ᦭ ᦮ ᦯

19CA..19CE   ᧊ ᧋ ᧌ ᧍ ᧎

19CF   ᧏   NEW TAI LUE SPACE

19DB..19DD   ᧛ ᧜ ᧝


Buginese (2):

1A1C   ᨜   BUGINESE VOWEL SIGN similar to AE in appearance

1A1D   ᨝   BUGINESE VIRAMA


Tai Tham (17):

1A5F   ᩟   Some consonant sign

1A7D   ᩽   Some sign

1A7E   ᩾   Some cryptogrammic sign

1A8A..1A8F   ᪊ ᪋ ᪌ ᪍ ᪎ ᪏   TAI THAM HORA HEXADECIMAL DIGIT TEN..FIFTEEN

1A9A..1A9F   ᪚ ᪛ ᪜ ᪝ ᪞ ᪟   TAI THAM THAM HEXADECIMAL DIGIT TEN..FIFTEEN

1AAE   ᪮   Some punctuation sign

1AAF   ᪯   TAI THAM SPACE


There are 2 sets of digits. They could be combined for base 20 as is, or for base 32 after hexadecimal extension.
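
A minimal Python sketch of the base-20 idea. It assumes, purely for illustration, that the Hora digits (U+1A80..U+1A89) stand for values 0 to 9 and the Tham digits (U+1A90..U+1A99) for values 10 to 19; nobody has standardized such a mapping, and the function names are made up.

```python
# Hypothetical base-20 notation built from the two existing Tai Tham digit sets.
# Assumption: Hora digits (U+1A80..U+1A89) carry values 0-9,
# Tham digits (U+1A90..U+1A99) carry values 10-19.
HORA = [chr(0x1A80 + i) for i in range(10)]
THAM = [chr(0x1A90 + i) for i in range(10)]
DIGITS = HORA + THAM  # list index = digit value


def to_base20(n: int) -> str:
    """Render a non-negative integer with the 20 combined digits."""
    if n < 0:
        raise ValueError("only non-negative integers are handled")
    if n == 0:
        return DIGITS[0]
    out = []
    while n:
        n, r = divmod(n, 20)
        out.append(DIGITS[r])
    return "".join(reversed(out))


def from_base20(s: str) -> int:
    """Inverse of to_base20."""
    value = 0
    for ch in s:
        value = value * 20 + DIGITS.index(ch)
    return value
```

With six more code points per set (the hexadecimal extension above), the same table would simply grow to 32 entries.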


Combining Diacritical Marks Extended (12):

1AF4   ᫴   COMBINING COLON ABOVE

1AF5   ᫵   COMBINING COLON BELOW

1AF6   ᫶   COMBINING HOOK-ACUTE

1AF7   ᫷   COMBINING LATIN SMALL LETTER TURNED A

1AF8   ᫸   COMBINING LATIN SMALL LETTER H WITH HOOK

1AF9   ᫹   COMBINING LATIN SMALL LETTER DOTLESS I

1AFA   ᫺   COMBINING LATIN SMALL LETTER J

1AFB   ᫻   COMBINING LATIN SMALL LETTER ENG

1AFC   ᫼   COMBINING ...

1AFD   ᫽   COMBINING LATIN SMALL LETTER TURNED R

1AFE   ᫾   COMBINING LATIN SMALL LETTER Y

1AFF   ᫿   COMBINING LATIN SMALL LETTER EZH


Seems to have overflowed into the SMP, in the Combining Diacritical Marks Extended-A block next to Latin Extended-F. As of late, each new proposal regarding phonetic orthographies seems to propose different combining characters for these BMP codepoints, so it can get very confusing. The compound tone diacritics no longer seem to be valid for this block.


Balinese (1):

1B4D   ᭍   Some letter


Batak (8):

1BF4   ᯴   Consonant sign

1BF5   ᯵   Consonant sign

1BF6   ᯶   Consonant sign

1BF7   ᯷   Consonant sign

1BF8   ᯸   Punctuation symbol

1BF9   ᯹   Punctuation symbol

1BFA   ᯺   Punctuation symbol

1BFB   ᯻   Punctuation symbol


Lepcha (6):

1C38   ᰸   Extra sign

1C39   ᰹   Extra sign

1C3A   ᰺   Extra punctuation

1C4A   ᱊   LEPCHA DUODECIMAL DIGIT TEN

1C4B   ᱋   LEPCHA DUODECIMAL DIGIT ELEVEN

1C4C   ᱌   Extra letter


Cyrillic Extended C (4):

1C8C   ᲌   CYRILLIC CAPITAL LETTER RJE / RZHE

1C8D   ᲍   CYRILLIC SMALL LETTER RJE / RZHE

1C8E   ᲎   CYRILLIC CAPITAL LETTER DOTTED BYELORUSSIAN-UKRAINIAN I

1C8F   ᲏   CYRILLIC SMALL LETTER DOTLESS BYELORUSSIAN-UKRAINIAN I


Modifier letters and subscripts are being put in the SMP. Some obscure cyrillization orthography may be found that would need some extra letters. There aren't any precomposed acute-accented vowels, though I see these in dictionaries all the time. They probably won't be accepted: you can basically put the acute above any letter that happens to be stressed, so that would need too many code points for this small gap. Letter Tje is being added to complete the ďťňľ set. We in Western Slavonia also have ř, for which it would be nice to have a Cyrillic letter so I don't have to use Ҏ or Ԗ.


Georgian Extended (2):

1CBB   ᲻

1CBC   ᲼


Mtavruli, the uppercase for lowercase.


Sundanese Supplement (8):

1CC8   ᳈   Punctuation or symbol

1CC9   ᳉   Punctuation or symbol

1CCA   ᳊   Punctuation or symbol

1CCB   ᳋   Punctuation or symbol

1CCC   ᳌   Punctuation or symbol

1CCD   ᳍   Punctuation or symbol

1CCE   ᳎   Punctuation or symbol

1CCF   ᳏   SUNDANESE SPACE

 

Vedic Extensions (5):

1CFB   ᳻

1CFC   ᳼

1CFD   ᳽

1CFE   ᳾

1CFF   ᳿


Greek Extended (23):

1F16   ἖   GREEK SMALL LETTER EPSILON WITH PSILI AND PERISPOMENI

1F17   ἗   GREEK SMALL LETTER EPSILON WITH DASIA AND PERISPOMENI

1F1E   ἞   GREEK CAPITAL LETTER EPSILON WITH PSILI AND PERISPOMENI

1F1F   ἟   GREEK CAPITAL LETTER EPSILON WITH DASIA AND PERISPOMENI

1F46   ὆   GREEK SMALL LETTER OMICRON WITH PSILI AND PERISPOMENI

1F47   ὇   GREEK SMALL LETTER OMICRON WITH DASIA AND PERISPOMENI

1F4E   ὎   GREEK CAPITAL LETTER OMICRON WITH PSILI AND PERISPOMENI

1F4F   ὏   GREEK CAPITAL LETTER OMICRON WITH DASIA AND PERISPOMENI

1F58   ὘   GREEK CAPITAL LETTER UPSILON WITH PSILI

1F5A   ὚   GREEK CAPITAL LETTER UPSILON WITH PSILI AND VARIA

1F5C   ὜   GREEK CAPITAL LETTER UPSILON WITH PSILI AND OXIA

1F5E   ὞   GREEK CAPITAL LETTER UPSILON WITH PSILI AND PERISPOMENI

1F7E   ὾   GREEK SMALL LETTER RHO WITH VARIA (or Lambda, Mu, Nu?)

1F7F   ὿   GREEK SMALL LETTER RHO WITH OXIA

1FB5   ᾵   GREEK SMALL LETTER ALPHA WITH OXIA AND PERISPOMENI AND YPOGEGRAMMENI

1FC5   ῅   GREEK SMALL LETTER ETA WITH OXIA AND PERISPOMENI AND YPOGEGRAMMENI

1FD4   ῔   GREEK SMALL LETTER LAMBDA WITH PSILI (or Mu or Nu?)

1FD5   ῕   GREEK SMALL LETTER LAMBDA WITH DASIA

1FDC   ῜   GREEK CAPITAL LETTER LAMBDA WITH DASIA

1FF0   ῰   GREEK SMALL LETTER OMEGA WITH VRACHY

1FF1   ῱   GREEK SMALL LETTER OMEGA WITH MACRON

1FF5   ῵   GREEK SMALL LETTER OMEGA WITH OXIA AND PERISPOMENI AND YPOGEGRAMMENI

1FFF   ῿   Some polytonic mark


Clearly arranged in a table with some slots missing. No idea if it makes sense, since it looks as if Alexander the Great came all the way into Vietnam, but whatever, will make for more fun with text effects.
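
The missing slots can be listed mechanically. A small Python sketch, assuming the Unicode database bundled with the interpreter is recent enough; the helper name `unassigned` is made up for illustration.

```python
import unicodedata


def unassigned(first: int, last: int) -> list[int]:
    """Code points in [first, last] whose general category is Cn (unassigned)."""
    return [cp for cp in range(first, last + 1)
            if unicodedata.category(chr(cp)) == "Cn"]


# Greek Extended spans U+1F00..U+1FFF.
greek_gaps = unassigned(0x1F00, 0x1FFF)
print(len(greek_gaps), " ".join(f"{cp:04X}" for cp in greek_gaps))
```

Note that Cn also covers noncharacters, so the same helper would over-count in a range such as FDD0..FDEF; Greek Extended has none, so the count should agree with the number in the heading.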

North Korean KPS 9566 has a complete Greek lowercase subscript and superscript set, in addition to Latin, presumably for plaintext math.


General Punctuation (1):

⁥2065       HORIZONTAL FRACTION LINE (for KPS 9566 1/2 1/3 1/4 2/3 1/4 3/4)


Can't paste this. This block is very "spacy". U+2065 could also be ARTIFICIAL INTELLIGENCE GENERATED TEXT INDICATOR, to assist with filtering AI slop from training data.


Superscripts and subscripts (2):

2072   ⁲   SUPERSCRIPT SOLIDUS

2073   ⁳   SUPERSCRIPT GREEK SMALL LETTER PI


Superscripts are called modifier letters and there are many of them except Q. They are very useful for indicating whisper or small print, despite not being meant for that. Subscripts are needed for plaintext junior high school math, despite arguments that in PhD math, a subscript can hold any expression and is therefore a markup and people should use LaTeX $_{z}$, HTML <sub>z</sub>, ECMA-48 \e[74mz\e[75m, or C1 \x8Bz\x8C, or get a proper text editor that can do half line feed up and down like old typewriters. The Amish and 1970s teletype cavemen are onto something here.
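
A rough Python sketch of the trade-off: plain-text subscripts work only where Unicode has a character for them (here just the digits U+2080..U+2089), and everything else falls back to markup. The function name and the choice of HTML as the fallback are assumptions made for the example.

```python
# Subscript digits live at U+2080..U+2089.
SUB_DIGITS = {str(d): chr(0x2080 + d) for d in range(10)}


def subscript(text: str) -> str:
    """Use Unicode subscript digits if every character has one, else HTML <sub>."""
    if all(ch in SUB_DIGITS for ch in text):
        return "".join(SUB_DIGITS[ch] for ch in text)
    return f"<sub>{text}</sub>"


print("x" + subscript("12"))  # plain text is enough here
print("x" + subscript("z"))   # no subscript z in this table, so markup it is
```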


Currency symbols (9):

20C6   ⃆   Social Credit token

20C7   ⃇   EU emission allowance certificate token

20C8   ⃈   HUNGARIAN FORINT SYMBOL

20C9   ⃉   POLISH ZLOTY SYMBOL

20CA   ⃊   DOGECOIN

20CB   ⃋   WOWNERO

20CC   ⃌   ZCASH SYMBOL (also a logo of a Czech Punk band Zputnik)

20CD   ⃍   ETHEREUM SYMBOL

20CE   ⃎   SATOSHI SYMBOL


With governments inventing new fiat currencies and their fancy symbols, this block won't be the last one to be filled completely. Especially with inflation getting out of bounds. Bank account is nothing more than a custodial fiat shitcoin wallet. Also I need a symbol for Monero, something like M̶ or M=, if the font co-operates. Coming up at 1DF4A as 𝽊 along with 𝽋 for milli-Monero.

There was a proposal for an Ethereum symbol; apparently Ξ, ≡, or 𐤎 isn't used anymore, apparently due to Satoshi now using Ξ̩̍. The authors made the mistake of either keeping the proposed symbol as a brand asset on their website, as logos are forbidden in Unicode, or not straightening the bend, so that it looks like ⬧ with a gap or an inverted version of ⟠. Script Ad-Hoc stopped it before formal nonapproval, so possibly either the Ethereum Foundation will disclaim the trademark or the proposed symbol will be simplified further. However, if I can draw it in 8x16 as 20CE:000008081C1C3E3E7F3E1C2A141C0808, it's not trademark-worthy.
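
To look at that 8x16 drawing without a font editor, here is a short Python sketch. It assumes the string follows the GNU Unifont .hex convention (code point, colon, then one byte per row with the most significant bit as the leftmost pixel); the function name is made up.

```python
def render_hex_glyph(line: str, on: str = "#", off: str = ".") -> list[str]:
    """Decode one 'CODEPOINT:BITMAP' line of an 8x16 glyph into 16 text rows."""
    _codepoint, bitmap = line.split(":")
    if len(bitmap) != 32:
        raise ValueError("expected 32 hex digits for an 8x16 glyph")
    rows = []
    for i in range(0, 32, 2):
        byte = int(bitmap[i:i + 2], 16)  # one byte = one 8-pixel row
        # Walk the bits from most significant (leftmost pixel) to least.
        rows.append("".join(on if byte & (0x80 >> bit) else off
                            for bit in range(8)))
    return rows


for row in render_hex_glyph("20CE:000008081C1C3E3E7F3E1C2A141C0808"):
    print(row)
```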


Combining Diacritical Marks For Symbols (15):

20F1   ⃱

20F2   ⃲

20F3   ⃳

20F4   ⃴

20F5   ⃵

20F6   ⃶   

20F7   ⃷   

20F8   ⃸   COMBINING SHOGI PIECE TURNED 0 DEGREES

20F9   ⃹   COMBINING SHOGI PIECE TURNED 90 DEGREES

20FA   ⃺   COMBINING SHOGI PIECE TURNED 180 DEGREES

20FB   ⃻   COMBINING SHOGI PIECE TURNED 270 DEGREES

20FC   ⃼   COMBINING PROMOTED SHOGI PIECE TURNED 0 DEGREES

20FD   ⃽   COMBINING PROMOTED SHOGI PIECE TURNED 90 DEGREES

20FE   ⃾   COMBINING PROMOTED SHOGI PIECE TURNED 180 DEGREES

20FF   ⃿   COMBINING PROMOTED SHOGI PIECE TURNED 270 DEGREES


I demand combining frames for Shogi pieces for my Tai and Taikyoku Shogi projects. It would probably work via ZWJs (arbitrary number of characters) or as a binary ligator since most Shogi pieces have 2 characters on them. For rotation and promotion it may be better to use CSS, since there are variants for more than 2 players and with really whacky grids. Assuming 30 degree increments, 24 code points would be needed, which is something more suited for the SMP. Also TIP is missing some peculiar kanji, or maybe just the Wikipedia pages have not been updated yet.


Number Forms (4):

218C   ↌   TONAL DIGIT NINE NI

218D   ↍   TONAL DIGIT ELEVEN HU

218E   ↎   TONAL DIGIT TWELVE VY

218F   ↏   TONAL DIGIT FIFTEEN FY


Definitely Tonal digits, 2 of them are unifiable with the duodecimal ones. Musical clefs can go into the SMP to the Musical Symbols block.

Would appreciate if there was some space for base4^2, which needs 16 distinct and 4 unifiable (o ı ɔ c) codepoints, so that Roman numerals no longer need to be misappropriated in the font.


Control Pictures (6+16):

242A   ␪   some legacy control picture

242B   ␫   some legacy control picture

242C   ␬   some legacy control picture

242D   ␭   some legacy control picture

242E   ␮   some legacy control picture

242F   ␯   some legacy control picture

2430   ␰   SYMBOL FOR PADDING CHARACTER

2431   ␱   SYMBOL FOR HIGH OCTET PRESET

2432   ␲   SYMBOL FOR BREAK PERMITTED HERE

2433   ␳   SYMBOL FOR NO BREAK HERE

2434   ␴   SYMBOL FOR INDEX

2435   ␵   SYMBOL FOR NEXT LINE

2436   ␶   SYMBOL FOR START OF SELECTED AREA

2437   ␷   SYMBOL FOR END OF SELECTED AREA

2438   ␸   SYMBOL FOR CHARACTER TABULATION SET (Horizontal Tabulation Set)

2439   ␹   SYMBOL FOR CHARACTER TABULATION WITH JUSTIFICATION (Horiz. TwJ)

243A   ␺   SYMBOL FOR LINE TABULATION SET (Vertical Tabulation Set)

243B   ␻   SYMBOL FOR PARTIAL LINE FORWARD (Partial Line Down)

243C   ␼   SYMBOL FOR PARTIAL LINE BACKWARD (Partial Line Up)

243D   ␽   SYMBOL FOR REVERSE LINE FEED (Reverse Index)

243E   ␾   SYMBOL FOR SINGLE SHIFT TWO

243F   ␿   SYMBOL FOR SINGLE SHIFT THREE


Needs control pictures for Teletext and C1 Controls, but there's not enough space (25, need 64), so they might have to go behind Symbols For Legacy Computing into a new Control Pictures Supplement block. There are also bibliographical C1 codes (next 32), and 17 modifications of C0, so a 128 code point block is needed.


Optical Character Recognition (5+16):

244B   ⑋   

244C   ⑌  

244D   ⑍  

244E   ⑎  

244F   ⑏  

2450   ⑐   SYMBOL FOR DEVICE CONTROL STRING

2451   ⑑   SYMBOL FOR PRIVATE USE ONE

2452   ⑒   SYMBOL FOR PRIVATE USE TWO

2453   ⑓   SYMBOL FOR SET TRANSMIT STATE

2454   ⑔   SYMBOL FOR CANCEL CHARACTER

2455   ⑕   SYMBOL FOR MESSAGE WAITING

2456   ⑖   SYMBOL FOR START OF GUARDED AREA

2457   ⑗   SYMBOL FOR END OF GUARDED AREA

2458   ⑘   SYMBOL FOR START OF STRING

2459   ⑙   SYMBOL FOR CONTROL

245A   ⑚   SYMBOL FOR SINGLE CHARACTER INTRODUCER

245B   ⑛   SYMBOL FOR CONTROL SEQUENCE INTRODUCER

245C   ⑜   SYMBOL FOR STRING TERMINATOR

245D   ⑝   SYMBOL FOR OPERATING SYSTEM COMMAND

245E   ⑞   SYMBOL FOR PRIVACY MESSAGE

245F   ⑟   SYMBOL FOR APPLICATION PROGRAM COMMAND

 

Could be repurposed with the remaining space in Control Pictures for encoding C1 controls if Teletext ones are deemed inferior. However, heterodox C0 and bibliographical C1 variants still need another 17 and 28 code points, that is a block of 64.

There is a proposal to put all the Unicode dashed box character symbols into SSP for easier reference, which would make these symbols-for obsolete.


Miscellaneous Symbols and Arrows (2):

2B74   ⭴   EXTERNAL LINK SIGN [nonapproved]

2B75   ⭵   UP DOWN TRIANGLE HEADED ARROW TO BARS


There's like 1000 different arrows in BMP and SMP, yet a single sign for external link, used not just on the very popular website Wikipedia, was met with a nonapproval notice. That's why there's a bunch of UI symbols in Font Awesome.

In North Korean KPS 9566 there are 2 mutually mirrored diagonally striped triangles resembling an Adidas knock-off, possibly to be found somewhere amongst Symbols for Legacy Computing; if not, then complete this block. The Triangleception and circled upwards manicule can be combined. The scissors just happen to point in a different direction, see directional emoji. The hammer-brush-sickle should be an emoji ZWJ sequence. The RedStarOS 3.0 E0 to EB prefixed extensions seem to be more suited for SMP, and EC to FE prefixed seem to be OS specific.


Coptic (5):

2CF4   ⳴   Some obscure capital letter

2CF5   ⳵   Some obscure small letter

2CF6   ⳶   Another obscure capital letter

2CF7   ⳷   Another obscure small letter

2CF8   ⳸   Punctuation


Disunified from Greek. One could have thought it was just a font, but in its early stages it contained way more characters derived from Demotic.


Georgian Supplement (8):

2D26   ⴦   GEORGIAN SMALL LETTER FI

2D28   ⴨   GEORGIAN SMALL LETTER ELIFI

2D29   ⴩   GEORGIAN SMALL LETTER TURNED GAN

2D2A   ⴪   GEORGIAN SMALL LETTER AIN

2D2B   ⴫   Some text separator / GEORGIAN LETTER U-BRJGU [nonapproved]

2D2C   ⴬   MODIFIER GEORGIAN SMALL LETTER NAR

2D2E   ⴮   GEORGIAN SMALL LETTER HARD SIGN

2D2F   ⴯   GEORGIAN SMALL LETTER LABIAL SIGN


This is the 3rd case, and also there's Mtavruli. 2D1Bh ⴛ (small Jil) looks like ch in certain fonts.


Tifinagh (21):

2D68..2D6E   ⵨ ⵩ ⵪ ⵫ ⵬ ⵭ ⵮

2D71..2D7E   ⵱ ⵲ ⵳ ⵴ ⵵ ⵶ ⵷ ⵸ ⵹ ⵺ ⵻ ⵼ ⵽ ⵾


Despite stemming from Phoenician, it doesn't resemble it at all, much like these Indic scripts.


Ethiopic Extended (17):

2D97   ⶗

2D98..2D9F   ⶘ ⶙ ⶚ ⶛ ⶜ ⶝ ⶞ ⶟   Some syllable group

2DA7   ⶧

2DAF   ⶯

2DB7   ⶷

2DBF   ⶿

2DC7   ⷇

2DCF   ⷏

2DD7   ⷗

2DDF   ⷟


Supplemental Punctuation (30):

2E5E   ⹞   QUESTION COMMA

2E5F   ⹟   EXCLAMATION COMMA

2E64   ⹤   

2E65   ⹥   Alcanter de Brahm’s Irony Point

2E66   ⹦   HEART-SHAPED DOUBLE EXCLAMATION MARK (Love point) [Bazin]

2E67   ⹧   EXCLAMATION MARK WITH CROSSBAR (Certitude/conviction point)

2E68   ⹨   EXCLAMATION MARK WITH CAP (Authority point)

2E69   ⹩   EXCLAMATION MARK WITH TIE OVERLAY (Irony point)

2E6A   ⹪   DOUBLE EXCLAMATION MARK CONVERGING INTO SINGLE DOT (Acclamation)

2E6B   ⹫   ZIGZAG EXCLAMATION MARK (Doubt point)

2E6C   ⹬   SNARK

2E6D   ⹭   IRONIETEKEN

2E6E   ⹮   SINGLE QUASIQUOTATION MARK

2E6F   ⹯   DOUBLE QUASIQUOTATION MARK

2E70   ⹰   LEFT SINGLE QUASIQUOTATION MARK

2E71   ⹱   RIGHT SINGLE QUASIQUOTATION MARK

2E72   ⹲   LEFT DOUBLE QUASIQUOTATION MARK

2E73   ⹳   RIGHT DOUBLE QUASIQUOTATION MARK

2E74   ⹴   SINCERIOD [CollegeHumor]

2E75   ⹵   LEFT SARCASTISES

2E76   ⹶   RIGHT SARCASTISES

2E77   ⹷   HEMI-DEMI-SEMI-COLON

2E78   ⹸   ANDORPERSAND

2E79   ⹹   LEFT MOCKWOTATION MARK

2E7A   ⹺   RIGHT MOCKWOTATION MARK

2E7B   ⹻   SUPERELLIPSIS

2E7C   ⹼   JÈ-MARK

2E7D   ⹽   EL-REY

2E7E   ⹾   REVERSED INVERTED QUESTION MARK

2E7F   ⹿   HALF HYPHEN


No shortage of space here. Could use some CollegeHumor and Bazin punctuation marks. Bazin ones were actually proposed once, but UTC didn't find the evidence sufficient. Fairfax contains even more. SarcMark™ is trademarked and Morgan Freemark wouldn't pass even as an Emoji for depicting a specific person. Now mostly emoji are used instead of punctuation.


CJK Radicals Supplement (13):

2E9A   ⺚   CJK RADICAL C-SIMPLIFIED RAP

2EF4   ⻴   2nd stage simplified radicals

2EF5   ⻵

2EF6   ⻶

2EF7   ⻷

2EF8   ⻸

2EF9   ⻹

2EFA   ⻺

2EFB   ⻻

2EFC   ⻼

2EFD   ⻽

2EFE   ⻾

2EFF   ⻿

 

Here can go the components needed for proper representation with IDS.


Kangxi Radicals (10):

2FD6   ⿖

2FD7   ⿗

2FD8   ⿘

2FD9   ⿙

2FDA   ⿚

2FDB   ⿛

2FDC   ⿜

2FDD   ⿝

2FDE   ⿞

2FDF   ⿟


Spill-over area for IDCs, strokes, components, maybe even Japanese era names.


*Ideographic Description Characters Extended (16):

2FE0   ⿠   IDEOGRAPHIC DESCRIPTION CHARACTER HORIZONTAL CONJOINER

2FE1   ⿡   IDEOGRAPHIC DESCRIPTION CHARACTER VERTICAL CONJOINER

2FE2   ⿢   IDEOGRAPHIC DESCRIPTION CHARACTER SURROUNDING CONJOINER

2FE3   ⿣   IDEOGRAPHIC DESCRIPTION CHARACTER HORIZONTAL DOUBLE MULTIPLIER

2FE4   ⿤   IDEOGRAPHIC DESCRIPTION CHARACTER VERTICAL DOUBLE MULTIPLIER

2FE5   ⿥   IDEOGRAPHIC DESCRIPTION CHARACTER TRIANGLE MULTIPLIER

2FE6   ⿦   IDEOGRAPHIC DESCRIPTION CHARACTER HORIZONTAL TRIPLING MULTIPLIER

2FE7   ⿧   IDEOGRAPHIC DESCRIPTION CHARACTER VERTICAL TRIPLING MULTIPLIER

2FE8   ⿨   IDEOGRAPHIC DESCRIPTION CHARACTER SQUARE MULTIPLIER

2FE9   ⿩   IDEOGRAPHIC DESCRIPTION CHARACTER HORIZONTAL QUADRUPLE MULTIPLIER

2FEA   ⿪   IDEOGRAPHIC DESCRIPTION CHARACTER VERTICAL QUADRUPLE MULTIPLIER

2FEB   ⿫   IDEOGRAPHIC DESCRIPTION CHARACTER LEFT SPAN DELIMITER

2FEC   ⿬   IDEOGRAPHIC DESCRIPTION CHARACTER RIGHT SPAN DELIMITER

2FED   ⿭   IDEOGRAPHIC DESCRIPTION CHARACTER STROKE VARIATION

2FEE   ⿮   IDEOGRAPHIC DESCRIPTION CHARACTER MULTIPLY

2FEF   ⿯   IDEOGRAPHIC DESCRIPTION CHARACTER VERTICAL FLIP


This is the last unallocated block. After that and a few fill-ins, BMP will be sealed for the rest of encoding history due to backwards compatibility, thus Unicode will have completed its original purpose of a 16-bit charset. After filling the 16 astral planes, attention should be turned back to the original ISO 10646 32-bit UCS-4.

The 13 CDP Big5 IDCs have been proposed since 2002 at this exact place.

Multiplication takes an IDC and a component, and writes that component into each field specified by IDC, recursively. Better than these CDP multipliers.
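
One possible reading of this, as a Python sketch: fill every slot of an IDC with the same component, and get the recursion by passing an already multiplied sequence back in as the component. Only the twelve original IDCs (U+2FF0..U+2FFB) are covered, and the function name is hypothetical; the proposed MULTIPLY character itself is not modelled.

```python
# Number of component slots taken by each original IDC (U+2FF0..U+2FFB).
ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
ARITY[chr(0x2FF2)] = 3  # left to middle and right
ARITY[chr(0x2FF3)] = 3  # above to middle and below


def multiply(idc: str, component: str) -> str:
    """Fill every slot of `idc` with the same component (itself possibly an IDS)."""
    return idc + component * ARITY[idc]


tree = "\u6728"                        # the component being repeated
pair = multiply(chr(0x2FF0), tree)     # two side by side
square = multiply(chr(0x2FF1), pair)   # that pair stacked on itself: four in a square
print(pair, square)
```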

Stroke subtraction, now just subtraction, has been moved to 31EF at the end of strokes.


Hiragana (3):

3040   ぀   KANA SPACE

3097   ゗   COMBINING HIRAGANA-KATAKANA OVERLINE

3098   ゘   COMBINING HIRAGANA-KATAKANA HALF OF VOICED SOUND MARK


Small letters are now going into Small Kana Extended block in SMP, Gojúon fillers in Kana Extended Additional (HIRAGANA LETTER ARCHAIC YI is still missing, reportedly same as a hentaigana letter derived from the same manyougana character), and precomposed letters with (han)dakuten are refused. So I came up with a single dot from the dakuten (ejective consonants maybe) and remembered some Taiwanese overlined extensions. Also some white space may come in handy with all these variable-width fonts.


Bopomofo (5):

3100   ㄀   BOPOMOFO SPACE

3101   ㄁   BOPOMOFO TONE-1

3102   ㄂   BOPOMOFO TONE-2

3103   ㄃   BOPOMOFO TONE-3

3104   ㄄   BOPOMOFO TONE-4


For some reason a Bopomofo Extended-A block is being added to the SMP. These 5 codepoints are reserved probably for something very special.


Hangul Compatibility Jamo (2):

3130   ㄰   HANGUL JAMO SPACE

318F   ㆏  


CJK Strokes (9):

31E6   ㇦

31E7   ㇧

31E8   ㇨

31E9   ㇩

31EA   ㇪

31EB   ㇫

31EC   ㇬

31ED   ㇭

31EE   ㇮


IDCs are creeping into here instead of taking the last free block. Hangul strokes were proposed here once, but Hangul is really a post-Brahmic abugida.


Enclosed CJK Letters and Months (1):

321F   ㈟   SQUARE ERA NAME [classified]


After recent Japanese emperor resignation, Unicode 12.1 promptly added a square ㋿ era. Weirdly enough, Heisei matches with the Velvet revolution (1989), and Reiwa with Anno Covidiam (2019). Since there needs to be space for more era names, Enclosed Ideographic Supplement in SMP seemingly not being the place, this 321Fh codepoint is the only free one in this range.


*** Ocean of CJKV Ideographs, briefly interrupted by Yijing hexagrams ***


Hangul precomposed syllables originally occupied the space of CJK Extension A from 3400 to 3D2D in 1.0.1 and then to 4DFF in 1.1.


Yi Syllables (3):

A48D   ꒍   YI ITERATION MARK

A48E   ꒎   YI PLACEHOLDER

A48F   ꒏   YI SPACE


This is just the Modern Yi; there's also Classical Yi, which consists of 88591 ideographs according to an enormous proposal from probably some mainland China governmental organization. There is a clear similarity with Clerical script. It looks like the UTC just gave up since there is no way to efficiently check this collection for unifiable characters, having introduced some mistakes in A6D6h (42710) character CJK ExtB and realizing the TARFU during work on ExtC.


Yi Radicals (9):

A4C7   ꓇

A4C8   ꓈

A4C9   ꓉

A4CA   ꓊

A4CB   ꓋

A4CC   ꓌

A4CD   ꓍

A4CE   ꓎

A4CF   ꓏


Vai (4+16):

A62C   ꘬

A62D   ꘭

A62E   ꘮

A62F   ꘯   VAI SPACE

A630..A63F   ꘰ ꘱ ꘲ ꘳ ꘴ ꘵ ꘶ ꘷ ꘸ ꘹ ꘺ ꘻ ꘼ ꘽ ꘾ ꘿


1 block of 16 wasted. That's a nice space for some hexadecimal digit set.


Bamum (8):

A6F8   ꛸   Some punctuation

A6F9   ꛹   Some punctuation

A6FA   ꛺   Some punctuation

A6FB   ꛻   Some punctuation

A6FC   ꛼   Some punctuation

A6FD   ꛽   Some punctuation

A6FE   ꛾   Some punctuation

A6FF   ꛿   BAMUM SPACE


Latin Extended D (18):

A7DE   ꟞   LATIN CAPITAL LETTER THETA

A7DF   ꟟   LATIN SMALL LETTER THETA

A7E0   ꟠   REVERSED CAPITAL G

A7E1   ꟡   REVERSED SMALL G

A7E3   ꟣   REVERSED CAPITAL J

A7E4   ꟤   REVERSED SMALL J

A7E5   ꟥   REVERSED CAPITAL Z

A7E6   ꟦   REVERSED SMALL Z

A7E7   ꟧   REVERSED CAPITAL B

A7E8   ꟨   REVERSED CAPITAL P

A7E9   ꟩   REVERSED CAPITAL Q

A7EA   ꟪   REVERSED SMALL A

A7EB   ꟫   REVERSED SMALL F

A7EC   ꟬   REVERSED SMALL L

A7ED   ꟭   REVERSED SMALL N

A7EE   ꟮   REVERSED SMALL R

A7EF   ꟯   REVERSED SMALL T

A7F0   ꟰   LATIN LETTER SMALL CAPITAL X


This section is probably reserved for medievalists. Some missing capitals. The missing pieces for fancy Unicode Latin text can go to the SMP, but since they would be used quite frequently in plain text chats and comments for emphasis, more than those medieval variants, it makes sense to squeeze them in BMP. Lisu or Fraser script provides many turned capital letters and reversed capital letters that are symmetrical along a horizontal axis. There are 4 more codepoints in Latin Extended E and 3 in Superscripts and Subscripts.

To Latin Extended-G in SMP: SUBSCRIPT LETTER SMALL B, SUBSCRIPT LETTER SMALL C, SUBSCRIPT LETTER SMALL D, SUBSCRIPT LETTER SMALL F, SUBSCRIPT LETTER SMALL G, SUBSCRIPT LETTER SMALL Q, TURNED CAPITAL Q, TURNED SMALL J

Maybe could fit the rest of letters for Unifon, but not small caps.


Siloti Nagri (3):

A82D   ꠭   Punctuation

A82E   ꠮   Punctuation

A82F   ꠯   SILOTI NAGRI SPACE


Common Indic Number Forms (6):

A83A   ꠺

A83B   ꠻

A83C   ꠼

A83D   ꠽

A83E   ꠾

A83F   ꠿


Phags-Pa (8):

A878   ꡸   PHAGS-PA MARK TRIPLE SHAD

A879   ꡹

A87A   ꡺

A87B   ꡻

A87C   ꡼

A87D   ꡽

A87E   ꡾

A87F   ꡿   PHAGS-PA SPACE


This is said to be the inspiration behind Hangul, whilst not being a direct predecessor nor an evolution, just like Aramaic isn't a direct predecessor of Brahmi. Considerable eclecticism and originality went into Brahmi and Hangul. The further from original Phoenician, the more featural the writing systems get in order to disambiguate shapes for different sounds, however the shapes have degenerated and merged, so more disambiguation is required. The most dramatic change seems to come with transitioning from stone or wood carving to writing on paper, a Chinese invention, or leaves. On the other hand, the invention of printing press has set the letter shapes in stone (pun not intended).


Saurashtra (14):

A8C6   ꣆

A8C7   ꣇

A8C8   ꣈

A8C9   ꣉

A8CA   ꣊

A8CB   ꣋

A8CC   ꣌

A8CD   ꣍

A8DA..A8DF   ꣚ ꣛ ꣜ ꣝ ꣞ ꣟   SAURASHTRA HEXADECIMAL DIGIT TEN..FIFTEEN


Again some hexadecimal digits, get creative if you got through the trouble of having different digit shapes.


Rejang (11):

A954   ꥔

A955   ꥕

A956   ꥖

A957   ꥗

A958   ꥘

A959   ꥙

A95A   ꥚

A95B   ꥛

A95C   ꥜

A95D   ꥝

A95E   ꥞


Hangul Jamo Extended A (3):

A97D   ꥽   HANGUL CHOSEONG NORTH KOREAN LIEU / YEORINRIEUL

A97E   ꥾   HANGUL CHOSEONG NORTH KOREAN NNIEUL / NIONRIEUL

A97F   ꥿   HANGUL CHOSEONG NORTH KOREAN WIEUP


Apparently, North Korea tried some extensions, while South Korea tried the colonial alphabet, before returning to common 1946 version.


Javanese (5):

A9CE   ꧎

A9DA   ꧚

A9DB   ꧛

A9DC   ꧜

A9DD   ꧝


Myanmar Extended B (1):

A9FF   ꧿   MYANMAR SPACE


There's an Extension C in SMP adding various forms of decimal digits. Like who the hell actually uses them in >currentYear? The latinized forms are basically universal and digitizing old records into them instead would simplify processing considerably.


Cham (13):

AA37   ꨷

AA38   ꨸

AA39   ꨹

AA3A   ꨺

AA3B   ꨻

AA3C   ꨼

AA3D   ꨽

AA3E   ꨾

AA3F   ꨿

AA4E   ꩎

AA4F   ꩏   CHAM SPACE

AA5A   ꩚   CHAM DUODECIMAL DIGIT TEN

AA5B   ꩛   CHAM DUODECIMAL DIGIT ELEVEN


Tai Viet (24):

AAC3..AACF   ꫃ ꫄ ꫅ ꫆ ꫇ ꫈ ꫉ ꫊ ꫋ ꫌ ꫍ ꫎ ꫏

AAD0..AADA   ꫐ ꫑ ꫒ ꫓ ꫔ ꫕ ꫖ ꫗ ꫘ ꫙ ꫚


Meetei Mayek Extensions (9):

AAF7   ꫷

AAF8   ꫸

AAF9   ꫹

AAFA   ꫺

AAFB   ꫻

AAFC   ꫼

AAFD   ꫽

AAFE   ꫾

AAFF   ꫿   MEETEI MAYEK SPACE


Ethiopic Extended A (16):

AB00   ꬀

AB07   ꬇

AB08   ꬈

AB0F   ꬏

AB10   ꬐

AB17   ꬗

AB18..AB1F   ꬘ ꬙ ꬚ ꬛ ꬜ ꬝ ꬞ ꬟   Extra syllable group

AB27   ꬧

AB2F   ꬯


Latin Extended E (1):

AB6F   ꭯   LATIN SUBSCRIPT LETTER SMALL U WITH LEFT HOOK


Seems to be an IPA block, but new IPA characters seem to go to ExtG in SMP now. Latin Theta is a subject of heated discussion, however it doesn't look like a certain style of Greek Theta encoded separately as ϑ. Also there is both Z and Ƶ, g and ɡ, and many more, yet Cyrillic cursive forms are ignored.


Meetei Mayek (8):

ABEE   ꯮   Some sign

ABEF   ꯯   Another sign

ABFA..ABFF   ꯺ ꯻ ꯼ ꯽ ꯾ ꯿   MEETEI MAYEK HEXADECIMAL DIGIT TEN..FIFTEEN


Space for hexadecimal found again.


*** Ocean of Hangul Syllables ***

Some 12 leftover spaces at D7A4..D7AF, maybe for North Korean stylized dictator names from KPS 9566:

힤 힥 힦   Kim Il Song (Kim Ir Sen)

힧 힨 힩   Kim Chong Il

힪 힫 힬   Kim Chong Un

힭 힮 힯   Kim ...


Hangul Jamo Extended B (8):

D7C7   ퟇   HANGUL JUNGSEONG NORTH KOREAN YI / YEORINI

D7C8   ퟈   HANGUL JONGSEONG NORTH KOREAN LIEU / YEORINRIEUL

D7C9   ퟉   HANGUL JONGSEONG NORTH KOREAN NNIEUL / NIONRIEUL

D7CA   ퟊   HANGUL JONGSEONG NORTH KOREAN WIEUP

D7FC   ퟼   HANGUL JONGSEONG NORTH KOREAN RIEUL-YEORINGHIEUH

D7FD   ퟽   HANGUL JONGSEONG NORTH KOREAN YESIEUNG-KIEUK

D7FE   ퟾   HANGUL CHOSEONG NORTH KOREAN RIEUL-YEORINGHIEUH

D7FF   ퟿   HANGUL CHOSEONG NORTH KOREAN YESIEUNG-KIEUK


The North Korean extensions fit almost too perfectly. Han-gul might be quite a misnomer as Koreans are not Han, with North Koreans wanting to use Chosongul or "Korean Character" instead, but anyway. There are many characters that could have been named better in Unicode.


*** Surrogates and Private Use ***

Should be (inter)nationalized according to UCSUR. LINCUA and SIL leftovers are either precomposed characters, font variants, or can be represented with a higher-level protocol like HTML, LaTeX, or ECMA-48.

Private Use used to be from E800 to FDFF in 1.0.0, then it was moved a bit earlier to E000 to F7FF in 1.0.1 to make way for CJK compatibility ideographs, and seemingly gone from 1.1, before reappearing extended to F8FF in 2.0, being immediately preceded by surrogates from D800 to DFFF.


CJK Compatibility Ideographs (8+32):

FA6E   﩮

FA6F   﩯

FADA..FADF   﫚 﫛 﫜 﫝 﫞 﫟

FAE0..FAFF   﫠﫡﫢﫣﫤﫥﫦﫧﫨﫩﫪﫫﫬﫭﫮﫯﫰﫱﫲﫳﫴﫵﫶﫷﫸﫹﫺﫻﫼﫽﫾﫿


12 ideographs in this block are actually regular ideographs. May be used for urgently needed ones or components for IDS. Requested disunifications could also go there. Formerly used by Nerd Fonts in its entirety as a PUA spillover. Since version 3, Nerd Fonts use astral PUA, which isn't supported by Windows charmap. 


Alphabetic Presentation Forms (22):

FB07   ﬇   LATIN SMALL LIGATURE CT

FB08   ﬈   LATIN SMALL LIGATURE FJ

FB09   ﬉   LATIN SMALL LIGATURE FFJ

FB0A   ﬊   LATIN SMALL LIGATURE LONG S EZH

FB0B   ﬋   LATIN SMALL LIGATURE LONG S H

FB0C   ﬌   LATIN SMALL LIGATURE LONG S I

FB0D   ﬍   LATIN SMALL LIGATURE LONG S L

FB0E   ﬎   LATIN SMALL LIGATURE LONG S LONG S

FB0F   ﬏   LATIN SMALL LIGATURE T H

FB10   ﬐   LATIN CAPITAL LETTER CH

FB11   ﬑   LATIN SMALL LETTER CH

FB12   ﬒   LATIN CAPITAL LETTER C WITH SMALL LETTER H

FB18   ﬘   ARABIC LETTER SEEN ISOLATED FORM WITH TAIL

FB19   ﬙   ARABIC LETTER SHEEN ISOLATED FORM WITH TAIL

FB1A   ﬚   ARABIC LETTER SAD ISOLATED FORM WITH TAIL

FB1B   ﬛   ARABIC LETTER DAD ISOLATED FORM WITH TAIL

FB1C   ﬜   RIGHT HALF ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

FB37   ﬷  RIGHT HALF ARABIC LIGATURE LAM WITH ALEF FINAL FORM

FB3D   ﬽  LEFT HALF ARABIC LIGATURE LAM WITH ALEF

FB3F   ﬿  LEFT HALF ARABIC LIGATURE LAM WITH ALEF WITH HAMZA ABOVE

FB42   ﭂  LEFT HALF ARABIC LIGATURE LAM WITH ALEF WITH HAMZA BELOW

FB45   ﭅  LEFT HALF ARABIC LIGATURE LAM WITH ALEF WITH MADDA ABOVE


Hebrew wide letters (use Letter Widener instead): HEBREW LETTER WIDE BET, HEBREW LETTER WIDE GIMEL, HEBREW LETTER WIDE VAV, HEBREW LETTER WIDE HET, HEBREW LETTER WIDE TET, HEBREW LETTER WIDE MEM, HEBREW LETTER WIDE NUN, HEBREW LETTER WIDE SAMEKH, HEBREW LETTER WIDE AYN, HEBREW LETTER WIDE PE, HEBREW LETTER WIDE TSADI, HEBREW LETTER WIDE QOF, HEBREW LETTER WIDE SHIN, HEBREW WIDE LIGATURE ALEF LAMED

Precomposed Ch is needed for Czech crosswords and word searches, it sticks out like a sore thumb. Initial Teaching Alphabet has the lowercase, which is to be encoded in the SMP Latin ExtG. Also Ch was in a variant of KOI8-CS. However, a similar zł proposal from the Polish National Bank itself with relevant historical references still got rejected, as well as the Hungarian Ft one (could also be in CJK compatibility as feet square, there's ㏎ after all). Meanwhile, ₧ from CP437 still lingers around, as well as ₯.

Uighur presentation forms were probably rejected in part due to complaints from Mainland China.

The Arabic letters and the other 3 in the Arabic Presentation Forms-B block are from a proposal for compatibility with the Win9x Arabic codepages, which seem to be a bespoke proprietary solution, like WinCyrillic, that was only popular because of market dominance. A slightly different encoding model was chosen for Unicode 1.0.0 back in 1991, even before Win3.1, and it replaces the Win9x one in WinNT. Win10 has also finally begun defaulting to UTF-8 instead of "ANSI". A better place for these compatibility characters would be alongside the wide Hebrew letters, so as not to fragment the writing direction zones.


*** 32 noncharacters, maybe mechanism for more UCS-4 planes ***


Once the 17 planes run out, the original ISO 10646 UCS-4 would want its revenge. It could work like this: one FDDx unit for the uppermost 4 bits, to make it self-synchronizing, and then seven FDEx units for the lower 28 bits, with overlong encodings forbidden. That makes it 16 octets in UTF-16 for a character above 10:FFFFh, hopefully a valid reason for switching to UTF-8, originally capable of 31 bits in 6 octets.
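A minimal Python sketch of the escape scheme above, assuming each FDDx/FDEx noncharacter is stored as one 16-bit unit. The names encode_ucs4 and decode_ucs4 are made up for illustration; nothing like this exists in any standard.

```python
# Hypothetical escape scheme, not part of Unicode or ISO 10646.
LEAD_BASE = 0xFDD0      # FDD0..FDDF carry the uppermost 4 bits
TRAIL_BASE = 0xFDE0     # FDE0..FDEF carry 4 of the lower 28 bits each
MAX_UNICODE = 0x10FFFF  # anything at or below this must be encoded normally


def encode_ucs4(value):
    """Return eight 16-bit units for a value above the 17 planes."""
    if not (MAX_UNICODE < value <= 0xFFFFFFFF):
        raise ValueError("only 110000h..FFFFFFFFh may be escaped")
    units = [LEAD_BASE | (value >> 28)]
    for shift in range(24, -1, -4):  # seven nibbles, high to low
        units.append(TRAIL_BASE | ((value >> shift) & 0xF))
    return units


def decode_ucs4(units):
    """Inverse of encode_ucs4; rejects malformed and overlong sequences."""
    if len(units) != 8 or (units[0] & 0xFFF0) != LEAD_BASE:
        raise ValueError("sequence must be one FDDx unit plus seven FDEx units")
    value = units[0] & 0xF
    for unit in units[1:]:
        if (unit & 0xFFF0) != TRAIL_BASE:
            raise ValueError("expected an FDEx unit")
        value = (value << 4) | (unit & 0xF)
    if value <= MAX_UNICODE:
        raise ValueError("overlong encoding")
    return value


units = encode_ucs4(0x110000)  # first value past plane 10h
print(" ".join(f"{u:04X}" for u in units))
print(len(units) * 2, "octets as 16-bit units")
```

Because the lead unit and the trailing units come from disjoint ranges, a decoder dropped into the middle of a stream can always find the next FDDx and resync from there.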


Vertical Forms (6):

FE1A   ︚   PRESENTATION FORM FOR VERTICAL LESSER THAN

FE1B   ︛   PRESENTATION FORM FOR VERTICAL GREATER THAN

FE1C   ︜   PRESENTATION FORM FOR VERTICAL LESSER OR EQUAL THAN

FE1D   ︝   PRESENTATION FORM FOR VERTICAL GREATER OR EQUAL THAN

FE1E   ︞   PRESENTATION FORM FOR VERTICAL EQUALS

FE1F   ︟   PRESENTATION FORM FOR VERTICAL NOT EQUALS


Also let's not forget about arrows, which should be rotated accordingly.


Small Form Variants (6):

FE53   ﹓   SMALL APOSTROPHE

FE67   ﹧   SMALL SOLIDUS

FE6C   ﹬   SMALL QUOTATION MARK

FE6D   ﹭   SMALL LOW LINE / SMALL CIRCUMFLEX ACCENT

FE6E   ﹮   SMALL VERTICAL LINE / SMALL GRAVE ACCENT

FE6F   ﹯   SMALL TILDE


Not the same as subscripts. The names of some ASCII symbols are weird. Who the hell uses "solidus" instead of slash, or "low line" instead of underscore?


Arabic Presentation Forms B (3):

FE75   ﹵   ARABIC LIGATURE SHADDA WITH KASRATAN MEDIAL FORM

FEFD   ﻽   ARABIC LIGATURE SHADDA WITH FATHATAN ISOLATED FORM

FEFE   ﻾   ARABIC LIGATURE SHADDA WITH FATHATAN MEDIAL FORM


Halfwidth and Fullwidth Forms (15):

FF00   ＀   FULLWIDTH SPACE (not necessarily same as Ideographic Space)

FFBF   ﾿   HALFWIDTH HANGUL LETTER NORTH KOREAN LIEU / YEORINRIEUL

FFC0   ￀   HALFWIDTH HANGUL LETTER NORTH KOREAN NNIEUL / NIONRIEUL

FFC1   ￁   HALFWIDTH HANGUL LETTER NORTH KOREAN WIEUP

FFC8   ￈   HALFWIDTH HANGUL LETTER NORTH KOREAN RIEUL-YEORINGHIEUH

FFC9   ￉   HALFWIDTH HANGUL LETTER NORTH KOREAN YESIEUNG-KIEUK

FFD0   ￐   HALFWIDTH SPACE

FFD1   ￑   HALFWIDTH HANGUL LETTER NORTH KOREAN YI / YEORINI

FFD8   ￘   HALFWIDTH HANGUL LETTER ARAE-A

FFD9   ￙   HALFWIDTH HANGUL LETTER ARAE-A YI

FFDD   ￝   SQUARE WITH SPECKLES FILL [nonapproved]

FFDE   ￞   FULLWIDTH PERIOD AND RIGHT PARENTHESIS (KPS 9566)

FFDF   ￟   FULLWIDTH PERIOD AND RIGHT DOUBLE ANGULAR BRACKET (KPS 9566)

FFE7   ￧   IDEOGRAPHIC BLACK SQUARE [nonapproved] (IDEOGRAPHIC FULL BLOCK)

FFEF   ￯   IDEOGRAPHIC WHITE SQUARE [nonapproved] (4-bit little endian BOM shenanigans)


Specials (9):

FFF0   *   FORCE MIRRORING DOWNWARD BOUSTROPHEDON (Attic Greek)

FFF1   *   FORCE MIRRORING UPWARD BOUSTROPHEDON

FFF2   *   FORCE ROTATING DOWNWARD BOUSTROPHEDON

FFF3   *   FORCE ROTATING UPWARD BOUSTROPHEDON (Rongorongo)

FFF4   *   TEXT TURN LEFT (like Befunge)

FFF5   *   TEXT TURN RIGHT

FFF6   *   TEXT TRAMPOLINE (jump across a cell)

FFF7   *   INHIBIT BOUSTROPHEDON

FFF8   *   CHARACTER OVERSTRIKE


I can't paste these here, despite them being perfectly regular Unicode characters. The noncharacter FFFE is the Wrong Byte Order Mark, and FFFF is EOF in a 16-bit character type. In UTF-8, EOF is FF, and FE means the data is maybe UTF-16, UTF-32, or some legacy 8-bit encoding. Alternatively, FE could indicate a non-standard 36-bit codepoint and FF a 42-bit codepoint, to allow for even bigger character sets and font banks. However, a ROM for a 16-bit 16x16 monochrome charset is already 2 MB (34 MB for all 17 Unicode planes), and for a 32-bit one it would be 128 GB.
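The numbers above can be re-derived with a few lines of Python. This is only a sketch: it assumes MB and GB are meant as binary units, and that the FE and FF lead octets would extend the old UTF-8 pattern (one lead octet plus 6-bit continuation octets). The helper names rom_bytes and utf8_payload_bits are made up.

```python
GLYPH_BYTES = 16 * 16 // 8  # 16x16 pixels at 1 bit each = 32 bytes per glyph
MIB, GIB = 2**20, 2**30


def rom_bytes(codepoints):
    """Size of an uncompressed bitmap font ROM for that many codepoints."""
    return codepoints * GLYPH_BYTES


def utf8_payload_bits(total_octets):
    """Payload bits of an old-style UTF-8 sequence of 2..8 octets.

    Assumption: FE (7 octets) and FF (8 octets) leave no payload bits
    in the lead octet, so only the continuation octets count.
    """
    lead_bits = max(0, 7 - total_octets)
    return lead_bits + 6 * (total_octets - 1)


print(rom_bytes(2**16) // MIB, "MiB for one 16-bit charset")
print(rom_bytes(17 * 2**16) // MIB, "MiB for all 17 planes")
print(rom_bytes(2**32) // GIB, "GiB for a full 32-bit charset")
for octets in (6, 7, 8):
    print(octets, "octets carry", utf8_payload_bits(octets), "bits")
```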


Total remaining space in BMP (18.0+provisional apr 2026): 9+2+22+3+14+2+3+3+5+5+30+48+37+35+56+26+36+10+37+41+45+45+8+26+6+4+3+7+9+9+12+14+14+17+10+12+13+13+2+17+12+1+8+6+4+2+8+5+23+1+2+9+15+4+22+21+2+5+8+21+17+30+13+10+16+3+5+2+9+1+3+9+20+8+18+3+6+8+14+11+3+5+1+13+24+9+16+1+8+12+8+40+22+6+6+3+15+9 = 1296
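For anyone who wants to re-check the tally, here is a small Python sketch that parses the same list of per-block counts and adds them up. The variable names are arbitrary, and the string is just the sum above pasted in.

```python
# Per-block counts of free BMP code points, copied from the tally above.
TALLY = (
    "9+2+22+3+14+2+3+3+5+5+30+48+37+35+56+26+36+10+37+41+45+45+8+26+6+4+3+7"
    "+9+9+12+14+14+17+10+12+13+13+2+17+12+1+8+6+4+2+8+5+23+1+2+9+15+4+22+21"
    "+2+5+8+21+17+30+13+10+16+3+5+2+9+1+3+9+20+8+18+3+6+8+14+11+3+5+1+13+24"
    "+9+16+1+8+12+8+40+22+6+6+3+15+9"
)

terms = [int(part) for part in TALLY.split("+")]
total = sum(terms)

print(len(terms), "blocks with free space")
print(total, "free code points in total")
print("last five blocks:", terms[-5:])
```

The last five terms line up with the counts in the headings of the final sections (Vertical Forms, Small Form Variants, Arabic Presentation Forms B, Halfwidth and Fullwidth Forms, Specials).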


That's all the space left in the BMP, the original scope of Unicode. Still quite plenty, but fragmented and bound to specific scripts.

The SMP is almost fully laid out in the Roadmap. The largest gaps are:

300h (768) characters between Egyptian and Mayan hieroglyphs,
400h (1024) between Khitan Large Script and Pau Cin Hau Syllabary,
300h (768) between Linear Elamite and Oromo (Rongorongo gone),
270h (624) between Jianzi and Latin ExtG,
290h (656) between Adlam and Persian Siyaq Numbers.

In SIP there are 3 free ranges of 20h (32), 9A0h (2464) and 5DEh (1502). Could be used for hybrid ideographs, innovations, and non-radical components.
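A quick Python sketch to double-check the hex sizes quoted above and see how much room those gaps add up to. The labels are shorthand for the ranges listed, and the totals only cover the gaps named here, not every free code point in those planes.

```python
# Gap sizes as quoted above, written in hex so they match the text.
SMP_GAPS = {
    "Egyptian .. Mayan hieroglyphs": 0x300,
    "Khitan Large Script .. Pau Cin Hau Syllabary": 0x400,
    "Linear Elamite .. Oromo": 0x300,
    "Jianzi .. Latin ExtG": 0x270,
    "Adlam .. Persian Siyaq Numbers": 0x290,
}
SIP_GAPS = [0x20, 0x9A0, 0x5DE]

for label, size in SMP_GAPS.items():
    print(f"{size:>4X}h = {size:>4}  {label}")

smp_total = sum(SMP_GAPS.values())
sip_total = sum(SIP_GAPS)
print("largest SMP gaps together:", smp_total)
print("SIP free ranges together:", sip_total)
```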

The TIP is 1/5 full in Unicode 17, which, at roughly 5000 new sinograms a year, means about 10 years before it fills up, pushing the long-planned old Hanzi scripts out of this plane. There was also a Classical Yi proposal of 88613 characters, which needs 2 dedicated planes (65534+23079); that's not extraordinary, given that the Han characters span 3 planes already.
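The back-of-the-envelope math behind those two estimates, as a Python sketch. It assumes a plane has 65536 code points, of which the last two are noncharacters, and takes the "1/5 full" and "~5000 per year" figures at face value. The function names are invented for this example.

```python
PLANE_SIZE = 0x10000         # 65536 code points per plane
USABLE = PLANE_SIZE - 2      # xxFFFE and xxFFFF are noncharacters


def years_until_full(used_fifths, per_year):
    """Years left if used_fifths/5 of a plane is taken and per_year are added annually."""
    free = PLANE_SIZE * (5 - used_fifths) // 5
    return free / per_year


def planes_needed(characters):
    """Return (number of planes, characters spilling into the last plane)."""
    full, rest = divmod(characters, USABLE)
    return (full + 1, rest) if rest else (full, 0)


print(round(years_until_full(1, 5000), 1), "years until the TIP is full")
print(planes_needed(88613), "planes and remainder for Classical Yi")
```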

The SSP is still mostly empty. There are some requests to add obscure control codes and dashed boxes there.

I can see the planes turning out like this:

0 - Basic Multilingual Plane
1 - Supplementary Multilingual Plane
2 - Supplementary Ideographic Plane - Han + hybrid + components
3 - Tertiary Ideographic Plane - Han + Sawndip + Old Han (Seal, Bronze, Bone)
4 - Archaic Ideographic Plane - Old Han (Warring States)
5 - Quaternary Ideographic Plane - Obscure Han place and personal names
6 - Quinary Ideographic Plane - Classical Yi
7 - Senary Ideographic Plane - Classical Yi
8 - Tertiary Multilingual Plane - unproposed historical scripts as per SEI
9 - Quaternary Multilingual Plane - neographies used for living languages
A - Intellectual Property Encumbered Plane - paid advertising, conscripts
B - Supplementary Pictographic Plane - more Emoji, flags finally
C - Supplementary Pseudographics Plane - 4x4 (+65280)
D - Tertiary Pseudographics Plane - 2x8 (+65280), 1x6 (64)
E - Supplementary Special-Purpose Plane - boxes, super-surrog8, 3x5 (32766)
F - Supplementary Private Use Plane A - UCSUR, rejected scripts
10 - Supplementary Private Use Plane B - direct glyph address, font tech stuff
beyond - alien scripts

Still, according to the SEI, there are many scripts not yet even proposed (taken from the old website, probably outdated):

Ariyaka, Avoiuli, Aztec Pictograms, Badaga, Bimanese, Borama (Gadabuursi), Bowen (Lao Baiwen), Bronze script*, Byblos*, Chak, Chola*, Cretan Hieroglyphs, Demotic (Egyptian)**, Epi-Olmec, Fula-1 (Fula Dita), Fula-2 (Ba), Gabelsberger Shorthand, Gbékoun, Gregg Shorthand, HamNoSys, Hausa-3 (Tafi), Hieratic (Egyptian)**, Iban, Isibheqe Sohlamvu, Kadamba, Kaddare, Kaida*, Karani, Khom, Kurux Banna, Kushan, Kwekor, Lontara Bilang Bilang*, Marchung*, Ma-sa-ba (Bambara), Micmac Hieroglyphs, Minangkabau, Mixtec, Nasu, Nisu, Numidian*, Nwagu Aneke Igbo, Old Minahasa, Olmec, Otomaung, Pabuchi, Pahawh Hmong First Stage, Pungchung*, Punic***, Ranjana (Landzya)*, Satavahana, Shankha (Shell script), Shavian Quikscript*, Stokoe (Stokoe Notation), Tai Pao, Teotihuacan, Veso Bey, Visible Speech****, Yi - Chuxiong, Yi - Sani, Zapotec, Zhuang Square, Archaic Miao Square Script, Asho Chin, Cacaxtla, Duota, Fakkham, Izapa, Jing (Zinan), Kaminaljuyu, Khamyang, Lik Hto Ngouk, Old Khmer, Old Mon, Pale Palaung, Rakhawunna, Rencong, Savara, Tajin, Takalik Abaj, Thai Nithet, Thai Noi (Lao Buhan), Thai Yo(r), Tula

(no star) (70)

* Has tentative allocations in the roadmap (10)

** Said to be unified with Hieroglyphs or Meroitic (2)

*** Said to be unified with Phoenician (1)

**** Already encoded in UCSUR (1)

People were making scripts up like crazy even back then, only it was before copyright, so these conscripts could actually get widely used. Once Tolkien's copyright expires in 2044 (18 years left as of 2026), if not extended somehow, Cirth, Tengwar, and Sarati are likely coming, with more on the way, courtesy of the LOTR fandom. Klingon pIqaD won't get into the public domain before 2065, which would make it one of the final additions. (Inter)Nationalizing the PUA is more likely. I doubt anyone cares about Mexico's life + 100 years nonsense, as if vampires were real.