UTF-12

Published: 2010-06-16
Last changed: 2010-09-19

License

This document is copyrighted but may be used for any purpose, without any conditions. In particular, you may redistribute it, create new documents based on it or develop software using the ideas expressed here at no charge and without asking for permission. Mentioning the original author’s name and providing a link to this web page is recommended but not required.

The author hopes that this document will be useful but gives no warranty of any kind and will accept no liability. The entire risk is with you. In case of doubt, do not read this document any further.

Introduction

UTF-12 is a proposed system for representing Unicode characters with a stream of 12-bit units.

The primary use for UTF-12 would be in combination with the Base64 scheme, because a 12-bit unit is very conveniently represented with exactly two Base64 symbols.

In its binary form, UTF-12 could be useful on a computer whose smallest addressable memory unit is 12 bits. Minicomputers with 12-bit machine words did exist in the 1960s (one example is the PDP-8).

In the following discussion, 12-bit units will be called “slabs”. Slab values are unsigned.

The proposed encoding system possesses the following good properties:

Each Unicode character is represented with either a single 12-bit slab, or with a sequence of two slabs.
Given a slab value, one can unambiguously determine whether the slab is single, leading (in a two-slab sequence) or trailing.
The trivial string sorting on slab values gives the same result as sorting on the proper Unicode values.

Description

Depending on their values, slabs are distributed to UTF-12 types as follows:

Slab type	From (hex)	To (hex)	Binary representation (the x bits may have any value)
single	000	7BF	0yyyyyxxxxxx, where at least one of the y is 0
leading	7C0	BFF	011111xxxxxx or 10xxxxxxxxxx
trailing	C00	FFF	11xxxxxxxxxx
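The table can be turned into a small classifier. This is only a sketch: the function name `classify_slab` is an assumption made for illustration and is not part of the proposal.

```python
def classify_slab(slab: int) -> str:
    """Return 'single', 'leading' or 'trailing' for a 12-bit slab value."""
    if not 0 <= slab <= 0xFFF:
        raise ValueError(f"not a 12-bit value: {slab:#x}")
    if slab < 0x7C0:
        return "single"      # 000 to 7BF
    if slab < 0xC00:
        return "leading"     # 7C0 to BFF
    return "trailing"        # C00 to FFF


for value in (0x000, 0x7BF, 0x7C0, 0xBFF, 0xC00, 0xFFF):
    print(f"{value:03X} {classify_slab(value)}")
```

Two comparisons are enough because the three types occupy contiguous, non-overlapping ranges.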

Unicode characters from 0000 to 07BF are encoded with a single slab of the same value. In particular, Latin, Greek, Cyrillic, Armenian, Hebrew and Arabic letters are all encoded with one slab per character.

Unicode characters from 07C0 to 10FFFF are encoded with a sequence of two slabs in the following manner:

The 21-bit Unicode character is split into the left 11 bits (value A) and the right 10 bits (value B).
Adding 0x7C0 to the value A gives the first slab.
Prefixing the value B with the two bits ‘11’ (= bitwise OR with 0xC00) gives the second slab.

This may also be depicted as follows (in binary notation):

Unicode character in the range 0000 to 07BF:
the value itself → single slab

Unicode character in the range 07C0 to 10FFFF:
aaaaaaaaaaabbbbbbbbbb → leading slab (aaaaaaaaaaa + 011111000000), trailing slab (11bbbbbbbbbb)

Values above 10FFFF cannot be represented in UTF-12.
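The mapping described in this section, together with the sort-order property claimed in the introduction, can be checked with a short sketch. The helper names `encode_char` and `encode` and the sample strings are assumptions chosen for illustration; no validity checks are made here.

```python
def encode_char(cp: int) -> list[int]:
    """Encode one code point as one or two slabs (no validity checks)."""
    if cp < 0x7C0:
        return [cp]
    a, b = cp >> 10, cp & 0x3FF        # left 11 bits, right 10 bits
    return [a + 0x7C0, b | 0xC00]      # leading slab, trailing slab


def encode(text: str) -> list[int]:
    """Encode a whole string as a flat list of slabs."""
    return [slab for ch in text for slab in encode_char(ord(ch))]


words = ["zebra", "\u0436\u0443\u043a", "\u07c0", "\u4e2d",
         "\U0001F600", "a\uffff", "a\U00010000"]
by_code_point = sorted(words)          # Python compares str by code point
by_slab = sorted(words, key=encode)    # lists compare element by element
print(by_code_point == by_slab)
```

For this sample the two orderings agree. A handful of strings is of course not a proof, only a sanity check.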

Examples

Unicode character	UTF-12 binary	UTF-12 hex	UTF-12 octal	UTF-12 Base64
U+0000	000000000000	000	0000	AA
U+07BF	011110111111	7BF	3677	e/
U+07C0	011111000001 111111000000	7C1 FC0	3701 7700	fB/A
U+0800	011111000010 110000000000	7C2 C00	3702 6000	fCwA
U+FEFF	011111111111 111011111111	7FF EFF	3777 7377	f/7/
U+FFFF	011111111111 111111111111	7FF FFF	3777 7777	f///
U+10000	100000000000 110000000000	800 C00	4000 6000	gAwA
U+10FFFF	101111111111 111111111111	BFF FFF	5777 7777	v///

Efficiency

Comparing with UTF-8 and UTF-16:

Unicode range from	Unicode range to	UTF-8 bits per character	UTF-12 bits per character	UTF-16 bits per character
0000	007F	8	12	16
0080	07BF	16	12	16
07C0	07FF	16	24	16
0800	FFFF	24	24	16
10000	10FFFF	32	24	32

So, UTF-12 is the most efficient way of representing Unicode characters in the range 0080 to 07BF (this range includes Greek, Cyrillic, Armenian, Hebrew and Arabic letters), as well as in the range 10000 to 10FFFF (characters outside the Basic Multilingual Plane).

UTF-12 is inferior to UTF-8 only in the ranges 0000—007F (plain ASCII) and 07C0—07FF (NKo).
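To see what the table means for real text, the per-character costs can be summed over a string. The sketch below simply restates the table; the function names `bits_per_char` and `total_bits` and the sample strings are assumptions for illustration.

```python
def bits_per_char(cp: int) -> dict[str, int]:
    """Bits needed for one code point in each encoding, as in the table."""
    utf8 = 8 if cp < 0x80 else 16 if cp < 0x800 else 24 if cp < 0x10000 else 32
    utf12 = 12 if cp < 0x7C0 else 24
    utf16 = 16 if cp < 0x10000 else 32
    return {"UTF-8": utf8, "UTF-12": utf12, "UTF-16": utf16}


def total_bits(text: str) -> dict[str, int]:
    """Sum the per-character costs over a whole string."""
    totals = {"UTF-8": 0, "UTF-12": 0, "UTF-16": 0}
    for ch in text:
        for name, bits in bits_per_char(ord(ch)).items():
            totals[name] += bits
    return totals


print(total_bits("Привет"))   # six Cyrillic letters
print(total_bits("hello"))    # five ASCII letters
```

The Cyrillic word costs 72 bits in UTF-12 against 96 in both UTF-8 and UTF-16, while the ASCII word costs 60 bits against 40 in UTF-8.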

Invalid slab sequences

Invalid sequences in UTF-12 are generally of the same nature as in UTF-8.

A UTF-12 sequence is malformed when a trailing slab occurs after another trailing slab or after a single slab, or when a leading or single slab occurs where a trailing slab is expected. A UTF-12 decoder should be prepared for this.

Another type of invalid sequence is the artificially lengthened sequence (overlong form). One may encode Unicode characters from 0000 to 07BF not with a single slab but with two: for instance, a null character (0000) may be encoded not only as 000, but also as 7C0+C00.

A further problem is posed by the surrogate code points (the range D800 to DFFF). They are used in pairs in UTF-16, but are unnecessary in UTF-12.

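In UTF-12, overlong forms and surrogate characters are forbidden. This means that a UTF-12 stream should not contain slabs with the values 7C0, 7F6 or 7F7. The leading slab 7C1 is valid only when followed by a trailing slab from FC0 to FFF; with a lower trailing slab it would be an overlong form of a character from 0400 to 07BF. Where security is important, it is better not to accept any malformed strings.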
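The forbidden slab values can be derived mechanically from the encoding rule. The following sketch enumerates the leading slabs that overlong and surrogate two-slab forms would use and removes those that legitimate characters also need; the name `leading_slab` is an assumption for illustration.

```python
def leading_slab(cp: int) -> int:
    """Leading slab that the two-slab form of a code point would use."""
    return (cp >> 10) + 0x7C0


# Leading slabs of overlong forms (characters that fit in a single slab).
overlong = {leading_slab(cp) for cp in range(0x0000, 0x07C0)}
# Leading slabs of the surrogate range D800 to DFFF.
surrogate = {leading_slab(cp) for cp in range(0xD800, 0xE000)}
# Leading slabs used by valid two-slab characters.
legitimate = {leading_slab(cp) for cp in range(0x07C0, 0x110000)
              if not 0xD800 <= cp <= 0xDFFF}

always_invalid = (overlong | surrogate) - legitimate
print(sorted(f"{s:03X}" for s in always_invalid))        # ['7C0', '7F6', '7F7']
print(sorted(f"{s:03X}" for s in overlong & legitimate)) # ['7C1']
```

The second line of output shows that 7C1 is shared between overlong forms and valid characters, so a decoder cannot reject it on the slab value alone and has to check the decoded character instead.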

In the procedures below, the phrase “Raise an error” may be interpreted in different ways, e. g.:

Abort processing (“hard mode”), or
Write a replacement character (usually FFFD) to the output stream and return to the cycle head to continue (“soft mode”).

Encoding procedure

Let the input stream be of 21-bit Unicode characters, and the output stream be of 12-bit slabs. Start the cycle:

Read the next Unicode character from the input stream.
Check that the input character belongs to the ranges 0000—D7FF or E000—10FFFF.

If not, raise an error.

If the input character is less than 0x7C0, then write a slab of the same value to the output stream.
If the input character is greater than or equal to 0x7C0, then:

Shift the original Unicode value 10 bits to the right; add 0x7C0 to the result; write the result to the output stream.
Apply bitwise AND with 0x3FF to the original Unicode value; apply bitwise OR with 0xC00 to the result; write the result to the output stream.

Repeat the cycle.
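The encoding procedure can be sketched as a generator. This is a minimal illustration in "hard mode" (an invalid input raises an exception); the name `utf12_encode` and the choice of `ValueError` are assumptions, not part of the proposal.

```python
from typing import Iterable, Iterator


def utf12_encode(code_points: Iterable[int]) -> Iterator[int]:
    """Yield 12-bit slabs for a stream of Unicode code points."""
    for cp in code_points:
        # Only 0000..D7FF and E000..10FFFF may be encoded.
        if not (0x0000 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF):
            raise ValueError(f"cannot encode U+{cp:04X}")
        if cp < 0x7C0:
            yield cp                        # single slab
        else:
            yield (cp >> 10) + 0x7C0        # leading slab
            yield (cp & 0x3FF) | 0xC00      # trailing slab


sample = "\u0000\u07bf\u07c0\ufeff\U0010ffff"
slabs = list(utf12_encode(map(ord, sample)))
print(" ".join(f"{s:03X}" for s in slabs))  # 000 7BF 7C1 FC0 7FF EFF BFF FFF
```

The printed slabs agree with the corresponding rows of the examples table.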

Decoding procedure

Let the input stream be of 12-bit slabs, and the output stream be of 21-bit Unicode characters. Start the cycle:

Read the next slab from the input stream.
Check that the input slab value is less than 0xC00.

If not, raise an error.

If the input slab value is less than 0x7C0, then write a Unicode character of the same value to the output stream.
If the input slab value is greater than or equal to 0x7C0, then:

Subtract 0x7C0 from the slab value; shift the result 10 bits to the left; store the result in the register A.
Read the next slab from the input stream.
Check that the value of this slab is greater than or equal to 0xC00.

If not, raise an error.

Apply bitwise AND with 0x3FF to this slab; combine the result with the value in the register A using bitwise OR.
Check that the resulting Unicode character belongs to the ranges 07C0 to D7FF or E000 to 10FFFF.

If not, raise an error.

Write the Unicode character to the output stream.

Repeat the cycle.
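The decoding procedure, including both interpretations of “Raise an error”, can be sketched as follows. The name `utf12_decode`, the `soft` flag and the use of `ValueError` are assumptions for illustration. Treating an input that ends right after a leading slab as an error is also an assumption, since the procedure does not cover that case.

```python
from typing import Iterable, Iterator

REPLACEMENT = 0xFFFD


def utf12_decode(slabs: Iterable[int], soft: bool = False) -> Iterator[int]:
    """Yield code points; soft=True emits U+FFFD instead of raising."""
    it = iter(slabs)
    for slab in it:
        try:
            if slab >= 0xC00:
                raise ValueError(f"unexpected trailing slab {slab:03X}")
            if slab < 0x7C0:
                yield slab                       # single slab
                continue
            a = (slab - 0x7C0) << 10             # register A
            trail = next(it, None)
            if trail is None:                    # assumption: truncation is an error
                raise ValueError("input ends after a leading slab")
            if trail < 0xC00:
                # As in the procedure, the offending slab is consumed.
                raise ValueError(f"expected trailing slab, got {trail:03X}")
            cp = a | (trail & 0x3FF)
            # Rejects overlong forms and surrogates.
            if not (0x07C0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF):
                raise ValueError(f"invalid character U+{cp:04X}")
            yield cp
        except ValueError:
            if not soft:
                raise                            # hard mode
            yield REPLACEMENT                    # soft mode


print([f"U+{cp:04X}" for cp in utf12_decode([0x7C1, 0xFC0, 0x7FF, 0xEFF])])
print([f"U+{cp:04X}" for cp in utf12_decode([0x7C0, 0xC00, 0x041], soft=True)])
```

The first call decodes two valid two-slab characters; the second replaces the overlong form 7C0+C00 with U+FFFD and then continues with the next slab.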

Representing UTF-12 in standard computers

It is possible that someone will want to implement UTF-12 (in binary form) on a common computer with 8-bit bytes (e. g. for storing Greek or Russian texts more efficiently). For this case, the following system is proposed:

Two 12-bit slabs are stored in three 8-bit bytes.
Bytes and slabs are considered big-endian, i. e. the most significant bit comes first.
If a superfluous half-byte is left at the end, it must be set to zero.

Suppose we have a small text consisting of three Latvian letters ģ (Unicode value 0123): ģģģ. UTF-12 stored in 8-bit bytes will look like this:

Slabs (hex)	123	123	123
Bytes (hex)	12	31	23	12	30

The same method may be used to transmit UTF-12 over protocols designed to transmit 8-bit bytes.
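The packing rules above can be sketched in a few lines. The function name `pack_slabs` is an assumption for illustration; the input is taken to be a list of valid 12-bit values.

```python
def pack_slabs(slabs: list[int]) -> bytes:
    """Pack 12-bit slabs big-endian into bytes; a leftover half-byte is zero."""
    out = bytearray()
    for i in range(0, len(slabs), 2):
        first = slabs[i]
        if i + 1 < len(slabs):
            second = slabs[i + 1]
            # Two slabs (24 bits) fill exactly three bytes.
            out += bytes([first >> 4,
                          (first & 0xF) << 4 | second >> 8,
                          second & 0xFF])
        else:
            # Odd slab at the end: two bytes, low half-byte set to zero.
            out += bytes([first >> 4, (first & 0xF) << 4])
    return bytes(out)


print(pack_slabs([0x123, 0x123, 0x123]).hex(" "))  # 12 31 23 12 30
```

The output reproduces the five bytes of the ģģģ example, with the final half-byte padded with zero.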

In an environment that permits only 7-bit data or restricts the use of controls and spaces (as with MIME), one may use UTF-12 additionally encoded with Base64. UTF-12 fits Base64 perfectly, because each slab is represented with exactly two Base64 characters, so each Unicode character is represented with two or four Base64 characters. The above letter sequence ģģģ encoded in UTF-12 over Base64 becomes EjEjEj.

Base64 should encode slabs directly, without prior conversion to bytes.
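Encoding slabs directly means splitting each 12-bit slab into two 6-bit halves and looking each half up in the standard Base64 alphabet. The sketch below shows both directions; the function names `slabs_to_base64` and `base64_to_slabs` are assumptions for illustration.

```python
import string

# Standard Base64 alphabet: A-Z, a-z, 0-9, '+', '/'.
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def slabs_to_base64(slabs: list[int]) -> str:
    """Write each slab as two Base64 symbols (high 6 bits, then low 6 bits)."""
    return "".join(B64[s >> 6] + B64[s & 0x3F] for s in slabs)


def base64_to_slabs(text: str) -> list[int]:
    """Read pairs of Base64 symbols back into 12-bit slabs."""
    if len(text) % 2:
        raise ValueError("odd number of Base64 symbols")
    return [B64.index(text[i]) << 6 | B64.index(text[i + 1])
            for i in range(0, len(text), 2)]


print(slabs_to_base64([0x123, 0x123, 0x123]))            # EjEjEj
print([f"{s:03X}" for s in base64_to_slabs("f/7/")])     # ['7FF', 'EFF']
```

Going through bytes first would give a different result for an odd number of slabs: the five packed bytes of ģģģ encode to EjEjEjA= in ordinary Base64, with an extra symbol and padding that the direct form avoids.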