none is a universal variable-length character encoding and code-page standard. The goal is for none to become the only text encoding needed and used, making both programming and working with texts less frustrating. Text encoding finally has to become a none question.
none encoding is lightning simple and regular. The most significant bit of every byte signals whether the character-code continues:

- 1 : the character-code continues.
- 0 : it is the last byte of a character-code.

The different character-codes can be encoded with one or more bytes:
Byte 1 | Byte 2 | Byte 3 | Byte 4 | Code Space |
---|---|---|---|---|
0xxxxxxx | | | | 2⁷ = 128 |
1xxxxxxx | 0xxxxxxx | | | 2¹⁴ = 16,384 |
1xxxxxxx | 1xxxxxxx | 0xxxxxxx | | 2²¹ = 2,097,152 |
1xxxxxxx | 1xxxxxxx | 1xxxxxxx | 0xxxxxxx | 2²⁸ = 268,435,456 |
... | | | | |
The qualities of this encoding scheme are:

- ASCII compatible (single byte with 0 MSB)
- 2¹⁴ = 16,384 possible 2-byte characters (that is 8 times UTF-8)

none encoding and decoding from and to character-codes is as straightforward as it can get. All character-codes are allocated on the code page in such a way that the encoded sequence of bits of a character's bytes, interpreted as one unsigned integer, is the character code (code point).
Example | | | Byte 1 | Byte 2 | Code |
---|---|---|---|---|---|
Bytes | | | 11100000 | 01100101 | |
uint32 | 00000000 | 00000000 | 11100000 | 01100101 | = 57445 |
The code for a 2-byte character can, for example, be extracted from the character's bytes to a 32-bit integer by loading them into a register; vice versa, a code can be split into encoded bytes by simply removing leading zero bytes.
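Given that rule, encoding and decoding can be sketched in a few lines. This is my own illustration of the scheme, not code from the project, and the names `decode_char` and `encode_char` are assumptions:

```python
def decode_char(data: bytes) -> int:
    """Read one character from the front of `data`: bytes with the
    most significant bit set continue the character, a byte with
    MSB 0 ends it.  The code point is all of the character's bytes
    interpreted as one big-endian unsigned integer."""
    code = 0
    for b in data:
        code = (code << 8) | b      # append the whole byte
        if b & 0x80 == 0:           # MSB 0: last byte of the character
            return code
    raise ValueError("truncated character")


def encode_char(code: int) -> bytes:
    """Split a code back into bytes by dropping leading zero bytes."""
    length = max(1, (code.bit_length() + 7) // 8)
    return code.to_bytes(length, "big")


# The 2-byte example from the table above: 11100000 01100101 -> 57445
assert decode_char(bytes([0b11100000, 0b01100101])) == 57445
assert encode_char(57445) == bytes([0b11100000, 0b01100101])
```

Loading the character's bytes into a register, as described above, amounts to the same accumulation without the loop.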
All codes (numeric values) that cannot be used in such a way aren't allocated. The table above shows the distribution of usable and unusable regions within the first 2 bytes, each region having a code space of 128 characters.
The first region is assigned to the ASCII character set.
This comb-like pattern reoccurs 128 times within the 3-byte range in the second half of its code space. This again reoccurs 128 times within the 4-byte range, and so forth.
The ASCII region, though, is an artefact of single-byte characters and special to the first comb. It does not reoccur in later combs.
In addition, no codes with a 1000 0000 byte are assigned. These are reserved for an alternative fixed-length encoding used only in memory.
The assignment of characters to codes is moreover not arbitrary. With ASCII as the basis, every other character is assigned a code in such a way that a handful of crucial properties emerge that make the arithmetic of common text-processing tasks simpler and more efficient.
- Definiteness: There is one and only one sequence of bytes to represent a particular character. Characters encode meaning, never presentation aspects: 1 character = 1 code = 1 sequence of bytes.
- Analogousness: If two sequences of bytes are equal, they always represent the same sequence of characters: the bytes of a sequence of characters = the sequence of bytes of those characters.
- Reducibility: Reduction to Roman/ASCII is done by dropping all but a character's last byte. This implies that characters not based on Roman letters or digits never have a final byte that would indicate such a relation.
- Composability: Composition of letters and diacritics is done by adding bytes or codes. It follows that characters that are not independent (and as such also not members of an alphabet) use the NUL final byte.
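As a sketch of the Reducibility property: since every character's final byte has a 0 MSB and, for Roman-based characters, is meant to be the ASCII base letter, reduction falls out of a simple filter. The byte values below are hypothetical, chosen only to illustrate the rule; the actual code page is not yet arranged:

```python
def reduce_to_ascii(encoded: bytes) -> bytes:
    """Reduce a none-encoded string to Roman/ASCII by keeping only
    each character's last byte (the byte with MSB 0)."""
    return bytes(b for b in encoded if b & 0x80 == 0)


# Hypothetical example: suppose an accented 'e' were encoded as the two
# bytes 1000 0011 0110 0101, i.e. with ASCII 'e' as its final byte.
accented_e = bytes([0b10000011, ord("e")])
text = accented_e + b"x"
# reduce_to_ascii(text) yields b"ex"
```

Plain ASCII text passes through such a reduction unchanged, since all its bytes already have a 0 MSB.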
Given the encoding scheme, the code-page arrangement and the properties of character assignment, common text-processing tasks have a low computational complexity and are straightforward to implement.
Character…
Character sequence (of length n; encoded as bytes)…
Encoding is a balancing act between compact representation (which benefits from a variable-length encoding) and computational efficiency (which benefits from a fixed-length encoding).
While the encoding scheme is a variable-length encoding made for compact representation, it can be extended to different fixed-length forms without contradicting the properties or introducing multiple distinct encodings.
To extend the variable-length encoding to a fixed width of 2, 3 or 4 bytes per character, characters encoded with fewer bytes than the chosen length are filled with the reserved byte 1000 0000:

```
variable                                 0110 0000
fixed-2                        1000 0000 0110 0000
fixed-3              1000 0000 1000 0000 0110 0000
fixed-4    1000 0000 1000 0000 1000 0000 0110 0000
```
A fixed length of less than 4 bytes means that characters encoded in 3 or 4 bytes cannot be represented. This is tolerable in practice, as most scripts are represented in 1 or 2 bytes per character.
As the byte 1000 0000 never occurs as part of an actual code-point value, an extended form is easily recognisable and can be stripped away or added at the borders of a system, a program or a module therein.
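The conversion between the variable form and a fixed-width form can be sketched as follows, relying on the stated guarantee that 1000 0000 never appears in a real character; the helper names are mine:

```python
PAD = b"\x80"  # the reserved byte 1000 0000


def to_fixed(char: bytes, width: int) -> bytes:
    """Left-fill one variable-length character with the reserved
    byte until it reaches the chosen fixed width."""
    if len(char) > width:
        raise ValueError("character needs more bytes than the fixed width")
    return PAD * (width - len(char)) + char


def to_variable(char: bytes) -> bytes:
    """Strip the reserved padding bytes to recover the variable form."""
    return char.lstrip(PAD)


# The fixed-4 example from the text: 0110 0000 padded to four bytes.
assert to_fixed(b"\x60", 4) == b"\x80\x80\x80\x60"
assert to_variable(b"\x80\x80\x80\x60") == b"\x60"
```

Because padding only ever adds the one reserved byte value, both directions are lossless and need no lookup tables.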
Encoding is a necessity to give bits and bytes the meaning of characters and text. Historically, a variety of encodings coexisted on different systems for different scripts. Each language space had its own predominant encoding, which often also differed between systems.
When text content started to cross system borders more frequently due to the growing interconnection of computer systems, their users were faced with a new kind of incompatibility, as programs could not handle the bits and bytes of alien text correctly.
Interconnected programs and systems that shared text also needed to share a commonly understood encoding. But a universal encoding comes with new problems: instead of one script, it now had to encode characters of all the world's scripts and a wide variety of other symbols while still being efficient.
Today Unicode, as UTF-8 or UTF-16, is used for more and more text documents. The idea of a universal encoding is widely established and much appreciated among programmers.
The vision of a common encoding and as a consequence thereof the disappearance of encoding as a source of complexity and failure seems to be within one’s grasp.
This is an illusion, though. On closer inspection, Unicode reveals critical flaws.
- Modifier characters corrupt the equivalence between a code point and a visible character, whereby character count and indexed access are either incorrect or inefficient.
- Normalisation forms grotesquely exhibit how the same character can be encoded differently; as a consequence, a banal equality check is far from trivial or efficient.
- Presentation forms, surrogate pairs and byte order marks are likewise unfortunate.
Paradoxically, Unicode includes multiple encodings, so programs still have to make a best effort to guess a text's encoding correctly: exactly what a universal encoding should have corrected.
To make matters worse, the popular UTF-8 and UTF-16 encoding schemata do not render the occurrence of malformed byte sequences impossible. Constantly validating text IO, however, is unreasonably inefficient, wherefore it is often dropped, which in turn is incorrect.
Considering all that, the unpleasant cascades of conditional constructs needed to encode or decode UTF-8 almost appear like a triviality.
Regrettably, Unicode is not one text encoding; it does not end encoding incompatibilities and constant recoding, but rather gives birth to complex code trapped in a dilemma between efficient and correct.
It’s text, we all know how to handle that.
It requires great effort and knowledge about the world's scripts to arrange the code page in such a way that the properties described above arise. The project needs the help of language experts to take this step.
From the numbers I know of, it should be possible. There are roughly 200 scripts worldwide, and most of them are not logographic. I made a rough estimate of about 40 characters per script on average:

200 × 40 = 8000

That is roughly half of the code space available using 2 bytes, which is the goal for these scripts. Logographic scripts are encoded using 3 or 4 bytes. There is plenty of space available. The main question here is: what is a good way to arrange them on the code page so that typical text-processing tasks are simple to implement?
Maybe I missed something? Maybe you have ideas for further properties that are important or could be added?
If you want to contribute in any way please contact me or use the project’s issue system.