UTF-12


© 2010 by Andrey V. Lukyanov // e-mail: land@long.yar.ru


Published: 2010-06-16
Last changed: 2010-09-19



License

This document is copyrighted but may be used for any purpose, without any conditions. In particular, you may redistribute it, create new documents based on it or develop software using the ideas expressed here at no charge and without asking for permission. Mentioning the original author’s name and providing a link to this web page is recommended but not required.

The author hopes that this document will be useful but gives no warranty of any kind and will accept no liability. The entire risk is with you. In case of doubt, do not read this document any further.

Introduction

UTF-12 is a proposed system for representing Unicode characters with a stream of 12-bit units.

The primary use for UTF-12 would be in combination with the Base64 scheme, because a 12-bit unit is very conveniently represented with exactly two Base64 symbols.

In its binary form, UTF-12 could be useful on a computer whose smallest addressable memory unit is 12 bits wide. In the 1960s there were indeed minicomputers with 12-bit machine words (one example is the PDP-8).

In the following discussion, 12-bit units will be called “slabs”. Slab values are unsigned.

The proposed encoding system possesses the following good properties:

Description

Depending on their values, slabs are distributed to UTF-12 types as follows:

Slab type   Value (hex.)   Binary representation
            from   to      (the x bits may have any value)
single      000    7BF     0yyyyyxxxxxx, where at least one of the y is 0
leading     7C0    BFF     011111xxxxxx or 10xxxxxxxxxx
trailing    C00    FFF     11xxxxxxxxxx
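
The table above amounts to two range comparisons. As a minimal sketch (the function name `slab_type` is an illustrative choice, not part of the proposal), a slab could be classified like this:

```python
def slab_type(slab: int) -> str:
    """Classify a 12-bit slab according to the value ranges in the table."""
    if not 0 <= slab <= 0xFFF:
        raise ValueError("a slab is an unsigned 12-bit value")
    if slab <= 0x7BF:
        return "single"    # 000..7BF
    if slab <= 0xBFF:
        return "leading"   # 7C0..BFF
    return "trailing"      # C00..FFF


for value in (0x000, 0x7BF, 0x7C0, 0xBFF, 0xC00, 0xFFF):
    print(f"{value:03X} {slab_type(value)}")
```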

Unicode characters from 0000 to 07BF are encoded with a single slab of the same value. In particular, Latin, Greek, Cyrillic, Armenian, Hebrew and Arabic letters are all encoded with one slab per character.

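Unicode characters from 07C0 to 10FFFF are encoded with a sequence of two slabs: the character value is treated as a 21-bit number, the leading slab is its upper 11 bits plus 7C0, and the trailing slab is its lower 10 bits plus C00.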

This may also be depicted as follows (in binary notation):

Unicode character from 0000 to 07BF:

  character (21 bits)   0000000000aaaaaaaaaaa
  single slab           0aaaaaaaaaaa

Unicode character from 07C0 to 10FFFF:

  character (21 bits)    aaaaaaaaaaa bbbbbbbbbb
  added to the a bits  + 11111000000
  sum (12 bits)         cccccccccccc
  leading slab          cccccccccccc
  trailing slab         11bbbbbbbbbb

Values above 10FFFF cannot be represented in UTF-12.
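
The two-slab arithmetic can be checked against the examples in the next section with a few lines of Python. This is only a sketch of the bit manipulation described above; the helper name `split_two_slabs` is hypothetical.

```python
from typing import Tuple


def split_two_slabs(code_point: int) -> Tuple[int, int]:
    """Return (leading, trailing) slabs for a character from 07C0 to 10FFFF."""
    if not 0x7C0 <= code_point <= 0x10FFFF:
        raise ValueError("not in the two-slab range")
    high = code_point >> 10      # upper 11 bits (the "a" bits)
    low = code_point & 0x3FF     # lower 10 bits (the "b" bits)
    return 0x7C0 + high, 0xC00 + low


for cp in (0x07C0, 0xFEFF, 0x10FFFF):
    leading, trailing = split_two_slabs(cp)
    print(f"U+{cp:04X} -> {leading:03X} {trailing:03X}")
# U+07C0 -> 7C1 FC0
# U+FEFF -> 7FF EFF
# U+10FFFF -> BFF FFF
```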

Examples

Unicode     UTF-12
character   binary                      hex.      octal       Base64
U+0000      000000000000                000       0000        AA
U+07BF      011110111111                7BF       3677        e/
U+07C0      011111000001 111111000000   7C1 FC0   3701 7700   fB/A
U+0800      011111000010 110000000000   7C2 C00   3702 6000   fCwA
U+FEFF      011111111111 111011111111   7FF EFF   3777 7377   f/7/
U+FFFF      011111111111 111111111111   7FF FFF   3777 7777   f///
U+10000     100000000000 110000000000   800 C00   4000 6000   gAwA
U+10FFFF    101111111111 111111111111   BFF FFF   5777 7777   v///

Efficiency

Comparing with UTF-8 and UTF-16:

Unicode range     Bits per character
from     to       UTF-8   UTF-12   UTF-16
0000     007F     8       12       16
0080     07BF     16      12       16
07C0     07FF     16      24       16
0800     FFFF     24      24       16
10000    10FFFF   32      24       32

So, UTF-12 is the most efficient way of representing Unicode characters in the range 0080—07BF (this range includes Greek, Cyrillic, Armenian, Hebrew and Arabic letters), as well as in the range 10000—10FFFF (various exotic characters).

UTF-12 is inferior to UTF-8 only in the ranges 0000—007F (plain ASCII) and 07C0—07FF (NKo).
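
To see how these figures play out on real text, the sketch below counts bits for a string in all three encodings. The UTF-8 and UTF-16 sizes come from Python's built-in codecs; the UTF-12 size is computed from the one-slab/two-slab rule, since no UTF-12 codec exists. The function names are illustrative.

```python
def utf12_bits(text: str) -> int:
    """Bits needed in UTF-12: one slab up to 07BF, two slabs above."""
    return sum(12 if ord(ch) <= 0x7BF else 24 for ch in text)


def compare(text: str) -> dict:
    """Size of the same text, in bits, in the three encodings."""
    return {
        "UTF-8": 8 * len(text.encode("utf-8")),
        "UTF-12": utf12_bits(text),
        "UTF-16": 8 * len(text.encode("utf-16-be")),  # no byte order mark
    }


print(compare("Привет"))  # six Cyrillic letters
print(compare("hello"))   # five ASCII letters
```

For the Cyrillic word this gives 96, 72 and 96 bits; for the ASCII word 40, 60 and 80 bits.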

Invalid slab sequences

Invalid sequences in UTF-12 are generally of the same nature as in UTF-8.

Malformed UTF-12 sequences are produced when a trailing slab occurs after another trailing slab or after a single slab, or when a leading or a single slab occurs where a trailing slab is expected. A UTF-12 decoder should be prepared for this.

Another type of invalid sequence is the artificially lengthened sequence (overlong form). Unicode characters from 0000 to 07BF could be encoded not with a single slab but with two: for instance, a null character (0000) could be encoded not only as 000, but also as 7C0+C00.

Another problem is with surrogate pair characters (symbols from the range D800—DFFF). They are used in UTF-16, but are unnecessary in UTF-12.

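In UTF-12, overlong forms and surrogate characters are forbidden. This means that a UTF-12 stream should not contain slabs with the values 7C0, 7F6 and 7F7. The leading slab 7C1 should also not be followed by a trailing slab below FC0, because such a pair would be an overlong form of a character from 0400 to 07BF. Where security is important, it is better not to accept any malformed strings.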

In the procedures below, the phrase “Raise an error” may be interpreted in different ways, e. g.:

Encoding procedure

Let the input stream be of 21-bit Unicode characters, and the output stream be of 12-bit slabs. Start the cycle:
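
One possible reading of this procedure, reconstructed from the description and examples above rather than taken from the author's own step list, is sketched below. It assumes that "Raise an error" means raising a Python `ValueError`; the name `encode_utf12` is illustrative.

```python
from typing import Iterable, Iterator


def encode_utf12(code_points: Iterable[int]) -> Iterator[int]:
    """Turn a stream of Unicode code points into a stream of 12-bit slabs."""
    for cp in code_points:
        if not 0 <= cp <= 0x10FFFF:
            raise ValueError(f"not representable in UTF-12: {cp:#x}")
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"surrogates are forbidden: U+{cp:04X}")
        if cp <= 0x7BF:
            yield cp                      # single slab, same value
        else:
            yield 0x7C0 + (cp >> 10)      # leading slab
            yield 0xC00 + (cp & 0x3FF)    # trailing slab


slabs = list(encode_utf12(map(ord, "ģ\u0800")))
print([f"{s:03X}" for s in slabs])  # ['123', '7C2', 'C00']
```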

Decoding procedure

Let the input stream be of 12-bit slabs, and the output stream be of 21-bit Unicode characters. Start the cycle:
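
A decoder consistent with the rules in the preceding sections might look like the sketch below. It is reconstructed from the slab ranges and the list of invalid sequences, not from the author's own step list. It takes the strict interpretation and raises `ValueError` on every malformed, overlong or surrogate sequence; the name `decode_utf12` is illustrative.

```python
from typing import Iterable, Iterator


def decode_utf12(slabs: Iterable[int]) -> Iterator[int]:
    """Turn a stream of 12-bit slabs into a stream of Unicode code points."""
    pending = None  # leading slab still waiting for its trailing slab
    for slab in slabs:
        if not 0 <= slab <= 0xFFF:
            raise ValueError("a slab is an unsigned 12-bit value")
        if pending is None:
            if slab <= 0x7BF:
                yield slab                # single slab
            elif slab <= 0xBFF:
                if slab in (0x7C0, 0x7F6, 0x7F7):
                    raise ValueError(f"forbidden leading slab {slab:03X}")
                pending = slab
            else:
                raise ValueError("unexpected trailing slab")
        else:
            if slab < 0xC00:
                raise ValueError("trailing slab expected")
            cp = ((pending - 0x7C0) << 10) | (slab - 0xC00)
            if cp < 0x7C0:
                raise ValueError("overlong form")
            pending = None
            yield cp
    if pending is not None:
        raise ValueError("stream ends after a leading slab")


print([f"U+{cp:04X}" for cp in decode_utf12([0x123, 0x7C2, 0xC00])])
# ['U+0123', 'U+0800']
```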

Representing UTF-12 in standard computers

It is possible that someone will want to implement UTF-12 (in binary form) on a common computer with 8-bit bytes (e. g. for storing Greek or Russian texts more efficiently). For this case, the following system is proposed:

Suppose we have a small text consisting of three Latvian letters ģ (Unicode value 0123): ģģģ. UTF-12 stored in 8-bit bytes will look like this:

Slabs (hex.)   123 123 123
Bytes (hex.)   12 31 23 12 30

The same method may be used to transmit UTF-12 over protocols designed to transmit 8-bit bytes.
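
The byte layout in the example can be reproduced with the sketch below. It assumes what the example suggests: slabs are written most significant hex digit first, two hex digits per byte, and an unused final half-byte is filled with 0. That a reader ignores the leftover half-byte is a further assumption, and both function names are hypothetical.

```python
from typing import Iterable, List


def pack_slabs(slabs: Iterable[int]) -> bytes:
    """Pack 12-bit slabs into 8-bit bytes, padding a final half-byte with 0."""
    nibbles: List[int] = []
    for slab in slabs:
        if not 0 <= slab <= 0xFFF:
            raise ValueError("a slab is an unsigned 12-bit value")
        nibbles += [slab >> 8, (slab >> 4) & 0xF, slab & 0xF]
    if len(nibbles) % 2:
        nibbles.append(0)  # assumed zero padding for an odd number of slabs
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))


def unpack_slabs(data: bytes) -> List[int]:
    """Inverse of pack_slabs; a leftover half-byte at the end is ignored."""
    nibbles: List[int] = []
    for byte in data:
        nibbles += [byte >> 4, byte & 0xF]
    return [
        (nibbles[i] << 8) | (nibbles[i + 1] << 4) | nibbles[i + 2]
        for i in range(0, len(nibbles) - 2, 3)
    ]


packed = pack_slabs([0x123, 0x123, 0x123])
print(packed.hex())  # 1231231230
```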

In an environment that permits only 7-bit data or restricts the use of control characters and spaces (as with MIME), one may use UTF-12 additionally encoded with Base64. UTF-12 fits Base64 perfectly, because each slab is represented with exactly two Base64 characters, so each Unicode character is represented with two or four Base64 characters. The above letter sequence ģģģ encoded in UTF-12 over Base64 becomes EjEjEj.

Base64 should encode slabs directly, without prior conversion to bytes.
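
Encoding slabs directly means splitting each 12-bit value into two 6-bit halves and looking each half up in the standard Base64 alphabet. The sketch below does that without going through bytes, so no "=" padding ever arises; the function names are illustrative.

```python
import string
from typing import Iterable, List

# Standard Base64 alphabet: A-Z, a-z, 0-9, "+", "/".
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def slabs_to_base64(slabs: Iterable[int]) -> str:
    """Write each 12-bit slab as exactly two Base64 symbols."""
    return "".join(B64[s >> 6] + B64[s & 0x3F] for s in slabs)


def base64_to_slabs(text: str) -> List[int]:
    """Read pairs of Base64 symbols back into 12-bit slabs."""
    if len(text) % 2:
        raise ValueError("an even number of Base64 symbols is required")
    values = [B64.index(ch) for ch in text]  # ValueError on a foreign symbol
    return [(hi << 6) | lo for hi, lo in zip(values[0::2], values[1::2])]


print(slabs_to_base64([0x123, 0x123, 0x123]))  # EjEjEj
print(slabs_to_base64([0x7C1, 0xFC0]))         # fB/A
```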