Null
Or
Next
Encoding

none is a universal variable-length character encoding
and code-page standard.

The goal is for none to become the only text encoding needed and used,
making both programming and working with texts less frustrating.

Text encoding finally has to become a none question.

Encoding-Scheme

none encoding is lightning simple and regular.

  1. Each character is identified by a single code, its character-code.
  2. Any character-code is encoded as a sequence of one or more bytes.
  3. If the MSB of a byte is 1 the character-code continues.
  4. If the MSB of a byte is 0 it is the last byte of a character-code.
  5. The bits and bytes are in Big-Endian order.

The different character-codes can be encoded with one or more bytes:

	Byte 1    Byte 2    Byte 3    Byte 4    Code Space
	0xxxxxxx                                2^7  = 128
	1xxxxxxx  0xxxxxxx                      2^14 = 16,384
	1xxxxxxx  1xxxxxxx  0xxxxxxx            2^21 = 2,097,152
	1xxxxxxx  1xxxxxxx  1xxxxxxx  0xxxxxxx  2^28 = 268,435,456
	...
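The boundary rule alone already lets a scanner find character borders. A minimal sketch in Python (the helper name is illustrative, not part of the project):

```python
def char_length(data, i=0):
    """Length in bytes of the character starting at offset i.
    Per rules 3 and 4: MSB 1 continues the code, MSB 0 terminates it."""
    n = 1
    while data[i] & 0x80:
        i += 1
        n += 1
    return n
```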

The qualities that emerge from this encoding scheme are described in the following sections.

Code-Page

none encoding and decoding from and to character-codes is as straightforward as it can get. All character-codes are allocated on the code page in such a way that the encoded sequence of bits of a character’s bytes interpreted as one unsigned integer is the character code (code point).

Example:

	        Byte 1    Byte 2
	Bytes   11100000  01100101
	uint32  00000000  00000000  11100000  01100101  = 57445

The code for a 2-byte character can, for example, be extracted from the character's bytes into a 32-bit integer simply by loading them into a register; vice versa, a code can be split into its encoded bytes by removing the leading zero bytes.

All codes (numeric values) that cannot be used in such a way aren’t allocated.
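Under this code-page arrangement, encoding and decoding reduce to plain integer-to-bytes conversion. A minimal Python sketch (function names are illustrative, not from any specification):

```python
def decode_char(data, i=0):
    """Decode one character code starting at offset i.
    A byte with MSB 1 continues the code; a byte with MSB 0 ends it.
    The code is the whole byte sequence read as one big-endian integer."""
    j = i
    while data[j] & 0x80:
        j += 1
    return int.from_bytes(data[i:j + 1], "big"), j + 1

def encode_char(code):
    """Encode a character code as its minimal big-endian byte sequence."""
    n = max(1, (code.bit_length() + 7) // 8)
    b = code.to_bytes(n, "big")
    # allocated codes have MSB 1 on every byte except the last one
    if any(not x & 0x80 for x in b[:-1]) or b[-1] & 0x80:
        raise ValueError("not an allocated character code")
    return b
```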

(Table of usable and unusable regions not reproduced here.)

The above table shows the distribution of usable and unusable regions within the first 2 bytes, each region having a code space of 128 characters. The first region is assigned to the ASCII character set.

This comb-like pattern recurs 128 times within the 3-byte range, in the second half of its code space. It again recurs 128 times within the 4-byte range, and so forth. The ASCII region, though, is an artefact of single-byte characters and is special to the first comb; it does not reoccur in later combs.

In addition, no codes containing the byte 1000 0000 are assigned. These are reserved for an alternative fixed-length encoding used only in memory.
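The two allocation constraints stated so far can be checked mechanically. A sketch of such a test, under the assumptions above (the comb pattern excludes further codes that this simple check does not know about):

```python
def is_allocatable(code):
    """True if a numeric value satisfies the stated constraints:
    all bytes but the last have MSB 1, the last byte has MSB 0,
    and no byte equals the reserved pad byte 1000 0000."""
    n = max(1, (code.bit_length() + 7) // 8)
    b = code.to_bytes(n, "big")
    return (all(x & 0x80 for x in b[:-1])
            and not b[-1] & 0x80
            and 0x80 not in b)
```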

Properties

The assignment of characters to codes is, moreover, not arbitrary. With ASCII as the basis, every other character is assigned a code such that a handful of crucial properties emerge that make the arithmetic of common text-processing tasks simpler and more efficient.

Definiteness: There is one and only one sequence of bytes to represent a particular character. Characters encode meaning, never presentation aspects: 1 character = 1 code = 1 sequence of bytes.

Analogousness: If two sequences of bytes are equal they always represent the same sequence of characters: the bytes of a sequence of characters = the sequence of bytes of those characters.

Reducibility: Reduction to Roman/ASCII is done by dropping all but a character's last byte. This implies that characters not based on Roman letters or digits never have a final byte that would indicate such a relation.

Composability: Composition of letters and diacritics is done by adding bytes or codes. It follows that characters that are not independent (and as such not members of an alphabet) use the NUL final byte.
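The Reducibility property, for instance, admits a one-pass reduction over raw bytes, since only final bytes have a clear MSB. A sketch under the assumptions above, not project code:

```python
def reduce_to_final_bytes(data):
    """Keep only each character's last byte (the byte with MSB 0),
    dropping all continuation bytes in a single pass."""
    return bytes(b for b in data if not b & 0x80)
```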

Computations

Given the encoding scheme, the code-page arrangement, and the properties of character assignment, common text-processing tasks have a low computational complexity and are straightforward to implement.

Character…

Character sequence (of length n; encoded as bytes)…
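As one example of such a computation, counting the characters of an encoded sequence is a single O(n) pass that counts final bytes (hypothetical helper, assuming the byte layout described above):

```python
def char_count(data):
    """Number of characters = number of bytes with MSB 0,
    since every character ends in exactly one such byte."""
    return sum(1 for b in data if not b & 0x80)
```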

Extensions

Encoding is a balancing act between compact representation (which benefits from a variable-length encoding) and computational efficiency (which benefits from a fixed-length encoding).

While the encoding scheme is a variable-length encoding made for compact representation, it can be extended to different fixed-length forms without contradicting the properties or introducing multiple distinct encodings.

To extend the variable-length encoding to a fixed width of 2, 3 or 4 bytes per character, characters encoded with fewer than the chosen number of bytes are padded with the reserved byte 1000 0000.

	variable                                  0110 0000
	fixed-2                        1000 0000  0110 0000
	fixed-3             1000 0000  1000 0000  0110 0000
	fixed-4  1000 0000  1000 0000  1000 0000  0110 0000

A fixed length of less than 4 bytes means that characters encoded in 3 or 4 bytes cannot be represented. This is acceptable, as most scripts are represented in 1 or 2 bytes per character.

As the byte 1000 0000 never occurs as part of an actual code-point value, an extended form is easily recognisable and can be stripped away or added at the borders of a system, a program, or a module therein.
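Converting between the variable form and a fixed-width form can then be sketched as padding with, respectively stripping, the reserved byte (illustrative Python; names and error handling are assumptions):

```python
PAD = 0x80  # reserved byte 1000 0000, never part of a real code

def to_fixed(data, width):
    """Left-pad every character to the given width in bytes."""
    out, cur = bytearray(), bytearray()
    for b in data:
        cur.append(b)
        if not b & 0x80:  # MSB 0: last byte of the character
            if len(cur) > width:
                raise ValueError("character wider than fixed width")
            out += bytes([PAD] * (width - len(cur))) + cur
            cur.clear()
    return bytes(out)

def to_variable(data):
    """Strip the pad bytes to recover the variable-length form."""
    return bytes(b for b in data if b != PAD)
```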

Motivation

History

Encoding is a necessity to give bits and bytes the meaning of characters and text. Historically, a variety of encodings coexisted on different systems for different scripts. Each language space had its own predominant encoding, which often also differed between the diverse systems.

When text content started to cross system borders more frequently due to the growing interconnection of computer systems, their users were faced with a new kind of incompatibility, as programs could not handle the bits and bytes of alien text correctly.

Interconnected programs and systems that shared text also needed to share a commonly understood encoding. But a universal encoding comes with new problems: instead of one script, it now had to encode the characters of all the world's scripts and a wide variety of other symbols while still being efficient.

The Unicode Dilemma

Today, Unicode in its UTF-8 or UTF-16 encoding is used for more and more text documents. The idea of a universal encoding is widely established and much appreciated among programmers.

The vision of a common encoding, and as a consequence the disappearance of encoding as a source of complexity and failure, seems within one's grasp.

This is an illusion, though. On closer inspection, Unicode reveals critical flaws. Modifier characters break the equivalence between a code point and a visible character, whereby character count and indexed access are either incorrect or inefficient. Normalisation forms grotesquely exhibit how the same character can be encoded differently; as a consequence, a banal equality check is far from trivial or efficient. Presentation forms, surrogate pairs and byte order marks are likewise unfortunate.

Paradoxically, Unicode includes multiple encodings, so programs still have to make a best effort to guess the text encoding right - exactly what a universal encoding should have corrected. To make matters worse, the popular UTF-8 and UTF-16 encoding schemes do not render the occurrence of malformed byte sequences impossible. Constantly validating text IO, however, is unreasonably inefficient, wherefore it is often dropped, which in turn is incorrect. Considering all that makes the unpleasant cascades of conditional constructs needed to encode or decode UTF-8 almost appear like a triviality.

Regrettably, Unicode is not one text encoding. It does not end encoding incompatibilities and constant recoding but rather gives birth to complex code trapped in a dilemma: choose between efficient and correct.

The Future

It’s text, we all know how to handle that.

Contribution

It requires great effort and knowledge of the world's scripts to arrange the code-page in such a way that the properties described above arise. The project needs the help of language experts to take this step.

From the numbers I know of, it should be possible. There are roughly 200 scripts worldwide, most of them not logographic. I made a rough estimate of about 40 characters per script on average:

	200 x 40 = 8000

That is roughly half of the code space available using 2 bytes, which is the goal for these scripts. Logographic scripts are encoded using 3 or 4 bytes. There is plenty of space available - the main question here is: what is a good way to arrange them on the code-page so that typical text-processing tasks are simple to implement?

Maybe I missed something? Maybe you have ideas of more properties that are important or could be added?

If you want to contribute in any way please contact me or use the project’s issue system.