Version 0 (prerelease)
This document describes the Concise Binary Encoding (CBE) format, and how codecs for this format must behave.
- The text variant Concise Text Encoding (CTE) is described in cte-specification.md.
- The logical structure of Concise Encoding is described in ce-structure.md.
- Concise Binary Encoding
- Version
- This Document
- Contents
- Terms and Conventions
- What is Concise Binary Encoding?
- Document Structure
- Version Header
- Object Encoding
- Numeric Types
- Temporal Types
- Array and String Types
- Container Types
- Other Types
- Pseudo-Objects
- Invisible Objects
- Structural Objects
- Empty Document
- Smallest Possible Size
- Alignment
- Version History
- License
The following bolded, capitalized terms have specific meanings in this document:
Term | Meaning |
---|---|
MUST (NOT) | If this directive is not adhered to, the document or implementation is invalid. |
SHOULD (NOT) | Every effort should be made to follow this directive, but the document/implementation is still valid if not followed. |
MAY (NOT) | It is up to the implementation to decide whether to do something or not. |
CAN | Refers to a possibility which MUST be accommodated by the implementation. |
CANNOT | Refers to a situation which MUST NOT be allowed by the implementation. |
OPTIONAL(LY) | The implementation MUST support both the existence and the absence of the specified item. |
OPTION(S) | Configuration option(s) that implementations MUST provide. |
Data descriptions and samples will generally be represented as follows:
- Character sequences are enclosed within backticks: `this is a character sequence`
- Byte sequences are represented as a series of two-digit hex values, enclosed within backticks and square brackets: [`f1 33 91`]
- Data placeholders are put `(between parentheses)`
- Some explanations will include excerpts from cbe.dogma (in Dogma notation).
Concise Binary Encoding (CBE) is the binary variant of Concise Encoding: a general purpose, human and machine friendly, compact representation of semi-structured hierarchical data.
The binary format aims for compactness and machine processing efficiency while maintaining 1:1 compatibility with the text format (which aims to present data in a human friendly way).
A CBE document is composed of the following parts:
- A version header
- A series of optional intangible objects
- A top-level data object
document = version_header
& (padding | record_type)*
& (data_object ! local_reference)
;
The version header is composed of the byte [81], followed by an unsigned LEB128 encoded version number representing which version of this specification the document adheres to.
version_header = u8(0x81) & version;
version = uleb128(uint(~,~));
uleb128(v: bits): bits = """https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128""";
Example:
[81 01] = CBE version 1
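As an illustration of the header layout described above, here is a minimal Python sketch of unsigned LEB128 encoding and decoding, plus a version header reader. The function names (`uleb128_encode`, `uleb128_decode`, `read_version_header`) are hypothetical and not part of any reference implementation.

```python
def uleb128_encode(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative values")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def uleb128_decode(data: bytes, offset: int = 0) -> tuple:
    """Return (value, next_offset) for the LEB128 field starting at offset."""
    value = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, offset
        shift += 7


def read_version_header(document: bytes) -> tuple:
    """Return (version, offset_of_first_object) for a CBE document."""
    if not document or document[0] != 0x81:
        raise ValueError("not a CBE document: first byte is not 0x81")
    return uleb128_decode(document, 1)
```

For the example above, `read_version_header(bytes([0x81, 0x01]))` yields version 1, with the next object starting at offset 2.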
Object encoding in CBE follows these principles:
- All objects are composed of a 1 or 2-byte type code and a possible payload.
- All objects end on an 8-bit boundary.
- All variable-length discrete types begin with length fields.
- All containers and arrays can be built incrementally (you don't need to know their final size before you start encoding their contents).
- The most commonly used types and values encode into the smallest space while still remaining zero-copy friendly in most places on little endian systems.
All objects encoded in CBE begin with a type code, followed by a possible payload (depending on the type).
Note: Integers from -100 to 100 can be encoded directly into the type code with no payload.
Hex | Type | Payload |
---|---|---|
00 | Integer value 0 | |
01 | Integer value 1 | |
... | ... | |
64 | Integer value 100 | |
65 | UID | (128 bits of data, big endian) |
66 | Positive Integer | (byte count) (little endian bytes) |
67 | Negative Integer | (byte count) (little endian bytes) |
68 | Positive Integer (8 bit) | (8-bit unsigned integer) |
69 | Negative Integer (8 bit) | (8-bit unsigned integer) |
6a | Positive Integer (16 bit) | (16-bit unsigned integer, little endian) |
6b | Negative Integer (16 bit) | (16-bit unsigned integer, little endian) |
6c | Positive Integer (32 bit) | (32-bit unsigned integer, little endian) |
6d | Negative Integer (32 bit) | (32-bit unsigned integer, little endian) |
6e | Positive Integer (64 bit) | (64-bit unsigned integer, little endian) |
6f | Negative Integer (64 bit) | (64-bit unsigned integer, little endian) |
70 | Binary Float (16 bit) | (16-bit bfloat16, little endian) |
71 | Binary Float (32 bit) | (32-bit ieee754 binary float, little endian) |
72 | Binary Float (64 bit) | (64-bit ieee754 binary float, little endian) |
73 | RESERVED | |
74 | RESERVED | |
75 | RESERVED | |
76 | Decimal Float | (compact float) |
77 | Local Reference | (identifier) |
78 | Boolean False | |
79 | Boolean True | |
7a | Date | (compact date) |
7b | Time | (compact time) |
7c | Timestamp | (compact timestamp) |
7d | Null | |
7e | RESERVED | |
7f | Plane 7f | (See Plane 7f) |
80 | String: 0 bytes | |
81 | String: 1 byte | (1 byte of UTF-8 data) |
82 | String: 2 bytes | (2 bytes of UTF-8 data) |
83 | String: 3 bytes | (3 bytes of UTF-8 data) |
84 | String: 4 bytes | (4 bytes of UTF-8 data) |
85 | String: 5 bytes | (5 bytes of UTF-8 data) |
86 | String: 6 bytes | (6 bytes of UTF-8 data) |
87 | String: 7 bytes | (7 bytes of UTF-8 data) |
88 | String: 8 bytes | (8 bytes of UTF-8 data) |
89 | String: 9 bytes | (9 bytes of UTF-8 data) |
8a | String: 10 bytes | (10 bytes of UTF-8 data) |
8b | String: 11 bytes | (11 bytes of UTF-8 data) |
8c | String: 12 bytes | (12 bytes of UTF-8 data) |
8d | String: 13 bytes | (13 bytes of UTF-8 data) |
8e | String: 14 bytes | (14 bytes of UTF-8 data) |
8f | String: 15 bytes | (15 bytes of UTF-8 data) |
90 | String | (chunk header) (UTF-8 data) ... |
91 | Resource Identifier | (chunk header) (UTF-8 data) ... |
92 | Custom Type | (type code) (chunk header) (data) ... |
93 | Array: Unsigned Int8 | (chunk header) (8-bit elements) ... |
94 | Array: Bit | (chunk header) (1-bit elements) ... |
95 | Padding | |
96 | Record | (identifier) (value) ... (end container) |
97 | Edge | (source) (description) (destination) (end container) |
98 | Node | (value) (child node) ... (end container) |
99 | Map | (key, value) ... (end container) |
9a | List | (value) ... (end container) |
9b | End of Container | |
9c | Integer value -100 | |
... | ... | |
fe | Integer value -2 | |
ff | Integer value -1 |
Bulkier or less common types are encoded into a secondary type plane, which adds a second byte to the type code. Most of the array types encoded in this plane also include a short form encoding, where the length is encoded into the type field.
Types from plane 7f begin with the type code prefix [7f], followed by their type code in that plane. For example, the type for unsigned 16-bit array with 8 elements is [7f 28], and the type for media is [7f f3].
Hex | Type | Elems | Payload |
---|---|---|---|
00 | Array: UID | 0 | (128-bit big endian element) x0 |
... | ... | ... | ... |
0f | Array: UID | 15 | (128-bit big endian element) x15 |
10 | Array: Signed Int8 | 0 | (8-bit element) x0 |
... | ... | ... | ... |
1f | Array: Signed Int8 | 15 | (8-bit element) x15 |
20 | Array: Unsigned Int16 | 0 | (16-bit little endian element) x0 |
... | ... | ... | ... |
2f | Array: Unsigned Int16 | 15 | (16-bit little endian element) x15 |
30 | Array: Signed Int16 | 0 | (16-bit little endian element) x0 |
... | ... | ... | ... |
3f | Array: Signed Int16 | 15 | (16-bit little endian element) x15 |
40 | Array: Unsigned Int32 | 0 | (32-bit little endian element) x0 |
... | ... | ... | ... |
4f | Array: Unsigned Int32 | 15 | (32-bit little endian element) x15 |
50 | Array: Signed Int32 | 0 | (32-bit little endian element) x0 |
... | ... | ... | ... |
5f | Array: Signed Int32 | 15 | (32-bit little endian element) x15 |
60 | Array: Unsigned Int64 | 0 | (64-bit little endian element) x0 |
... | ... | ... | ... |
6f | Array: Unsigned Int64 | 15 | (64-bit little endian element) x15 |
70 | Array: Signed Int64 | 0 | (64-bit little endian element) x0 |
... | ... | ... | ... |
7f | Array: Signed Int64 | 15 | (64-bit little endian element) x15 |
80 | Array: BFloat16 | 0 | (16-bit little endian element) x0 |
... | ... | ... | ... |
8f | Array: BFloat16 | 15 | (16-bit little endian element) x15 |
90 | Array: Binary Float32 | 0 | (32-bit little endian element) x0 |
... | ... | ... | ... |
9f | Array: Binary Float32 | 15 | (32-bit little endian element) x15 |
a0 | Array: Binary Float64 | 0 | (64-bit little endian element) x0 |
... | ... | ... | ... |
af | Array: Binary Float64 | 15 | (64-bit little endian element) x15 |
... | RESERVED | ||
e0 | Array: UID | ∞ | (chunk header) (128-bit B-E elements) ... |
e1 | Array: Signed Int8 | ∞ | (chunk header) (8-bit elements) ... |
e2 | Array: Unsigned Int16 | ∞ | (chunk header) (16-bit L-E elements) ... |
e3 | Array: Signed Int16 | ∞ | (chunk header) (16-bit L-E elements) ... |
e4 | Array: Unsigned Int32 | ∞ | (chunk header) (32-bit L-E elements) ... |
e5 | Array: Signed Int32 | ∞ | (chunk header) (32-bit L-E elements) ... |
e6 | Array: Unsigned Int64 | ∞ | (chunk header) (64-bit L-E elements) ... |
e7 | Array: Signed Int64 | ∞ | (chunk header) (64-bit L-E elements) ... |
e8 | Array: BFloat16 | ∞ | (chunk header) (16-bit L-E elements) ... |
e9 | Array: Binary Float32 | ∞ | (chunk header) (32-bit L-E elements) ... |
ea | Array: Binary Float64 | ∞ | (chunk header) (64-bit L-E elements) ... |
... | RESERVED | ||
f0 | Marker | 1 | (byte length) (UTF-8 data) |
f1 | Record Type | ∞ | (ID) (key) ... (end container) |
f2 | Remote Reference | 1 | (chunk header) (UTF-8 data) ... |
f3 | Media | ∞ | (byte length) (UTF-8 data) (chunk header) (bytes) ... |
... | RESERVED |
Represents true or false.
false = u8(0x78);
true = u8(0x79);
Examples:
[78] = false
[79] = true
CBE encoders MUST by default output integer values in the smallest type they'll fit into:
Values | Best Fit Type |
---|---|
± 0 - 100 | small integer |
± 0x65 - 0xff | 8-bit integer |
± 0x100 - 0xffff | 16-bit integer |
± 0x10000 - 0xffffffff | 32-bit integer |
± 0x100000000 - 0xffffffffffff | variable width integer |
± 0x1000000000000 - 0xffffffffffffffff | 64-bit integer |
± 0x10000000000000000 - ∞ | variable width integer |
Integers are encoded in three possible ways:
Values from -100 to +100 ("small int") are encoded into the type code itself, and can be read directly as 8-bit signed two's complement integers.
int_small = s8(-100~100);
Fixed width integers are stored as their absolute values in widths of 8, 16, 32, and 64 bits (in little endian byte order). The type code holds the sign of the integer.
int_8_positive = u8(0x68) & u8(~);
int_8_negative = u8(0x69) & u8(~);
int_16_positive = u8(0x6a) & u16(~);
int_16_negative = u8(0x6b) & u16(~);
int_32_positive = u8(0x6c) & u32(~);
int_32_negative = u8(0x6d) & u32(~);
int_64_positive = u8(0x6e) & u64(~);
int_64_negative = u8(0x6f) & u64(~);
Note: Because the sign is encoded into the type code, it's possible to encode the value 0 with a negative sign. -0 is not representable as an integer, and MUST be passed on to the application as a floating point type.
Variable width integers are encoded as a block of little endian ordered bytes, prefixed with a length header. The length header is encoded as an unsigned LEB128, denoting how many bytes of integer data follow. The sign is encoded in the type code.
int_vlength_positive = u8(0x66) & var(pref,length_prefix) & ordered(uint(pref.length*8, ~));
int_vlength_negative = u8(0x67) & var(pref,length_prefix) & ordered(uint(pref.length*8, ~));
length_prefix = uleb(var(length, 1~));
uleb(v) = uleb128(uint(0,v));
uleb128(v: bits): bits = """https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128""";
Examples:
[60] = 96
[00] = 0
[ca] = -54
[68 7f] = 127
[68 ff] = 255
[69 ff] = -255
[6c 80 96 98 00] = 10000000
[67 0f ff ee dd cc bb aa 99 88 77 66 55 44 33 22 11] = -0x112233445566778899aabbccddeeff
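The three integer encodings can be combined into a single "best fit" encoder. The following Python sketch follows the best fit table in this section; `encode_int` and `uleb128_encode` are hypothetical names, and the width selection is one reading of that table rather than a reference implementation.

```python
def uleb128_encode(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


# Positive type codes for fixed widths; the negative code is always +1.
FIXED_WIDTH_TYPES = {1: 0x68, 2: 0x6A, 4: 0x6C, 8: 0x6E}


def encode_int(value: int) -> bytes:
    """Encode an integer using the smallest CBE integer type."""
    if -100 <= value <= 100:
        # Small int: the type code is the value as 8-bit two's complement.
        return bytes([value & 0xFF])

    magnitude = abs(value)
    negative = 1 if value < 0 else 0
    size = (magnitude.bit_length() + 7) // 8  # bytes needed for the magnitude

    if size <= 2:
        width = size          # 8-bit or 16-bit
    elif size <= 4:
        width = 4             # 32-bit
    elif size in (7, 8):
        width = 8             # 64-bit
    else:
        width = None          # 5-6 bytes or more than 8 bytes: variable width

    if width is None:
        return (bytes([0x66 + negative]) + uleb128_encode(size)
                + magnitude.to_bytes(size, "little"))
    return (bytes([FIXED_WIDTH_TYPES[width] + negative])
            + magnitude.to_bytes(width, "little"))
```

For instance, `encode_int(-255)` produces `[69 ff]` and `encode_int(10000000)` produces `[6c 80 96 98 00]`, matching the examples above.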
Decimal floating point values are stored in Compact Float format.
decimal_float = u8(0x76) & compact_float(~);
compact_float(v: number): bits = """https://github.com/kstenerud/compact-float/blob/master/compact-float-specification.md""";
Examples:
[76 07 4b] = -7.5
[76 ac 02 d0 9e 38] = 9.21424e+80
Binary floating point values are stored in 32 or 64-bit ieee754 binary floating point format, or in 16-bit bfloat format, in little endian byte order.
binary_float_16 = u8(0x70) & f16(~);
binary_float_32 = u8(0x71) & f32(~);
binary_float_64 = u8(0x72) & f64(~);
bfloat(v: number): bits = """https://en.wikipedia.org/wiki/Bfloat16_floating-point_format""";
Examples:
[70 af 44] = 0x1.5ep+10
[71 00 e2 af 44] = 0x1.5fc4p+10
[72 00 10 b4 3a 99 8f 32 46] = 0x1.28f993ab41p+100
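A decoder can read all three binary float widths with Python's standard `struct` module. This is only a sketch: `decode_binary_float` is a hypothetical helper, and the bfloat16 branch relies on bfloat16 being the upper 16 bits of an IEEE 754 32-bit float.

```python
import struct


def decode_binary_float(data: bytes) -> float:
    """Decode a CBE binary float object (type code followed by payload)."""
    type_code, payload = data[0], data[1:]
    if type_code == 0x70:
        # bfloat16 holds the top 16 bits of a float32, so in little endian
        # order the two missing low-order bytes go in front.
        return struct.unpack("<f", b"\x00\x00" + payload[:2])[0]
    if type_code == 0x71:
        return struct.unpack("<f", payload[:4])[0]
    if type_code == 0x72:
        return struct.unpack("<d", payload[:8])[0]
    raise ValueError("not a binary float type code: 0x%02x" % type_code)
```

The results can be compared against the hexadecimal float notation used in the examples via `float.fromhex`, for example `float.fromhex("0x1.5ep+10")`.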
A unique identifier, stored according to rfc4122 binary format.
Note: This is the only data type in CBE that is stored in big endian byte order (as required by rfc4122).
uid = u8(0x65) & uint(128, ~);
Example:
[65 12 3e 45 67 e8 9b 12 d3 a4 56 42 66 55 44 00 00] = UID 123e4567-e89b-12d3-a456-426655440000
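Python's standard `uuid` module already exposes the RFC 4122 big endian byte layout through `UUID.bytes`, so a UID codec can be sketched in a few lines. `encode_uid` and `decode_uid` are hypothetical names used only for illustration.

```python
import uuid


def encode_uid(value: uuid.UUID) -> bytes:
    """Encode a UID: type code 0x65 followed by 16 big endian bytes."""
    return b"\x65" + value.bytes


def decode_uid(data: bytes) -> uuid.UUID:
    """Decode a UID object that starts at the beginning of data."""
    if len(data) < 17 or data[0] != 0x65:
        raise ValueError("not a complete UID object")
    return uuid.UUID(bytes=data[1:17])
```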
Temporal types are stored in compact time format.
Note: compact time zero values are not allowed in CBE.
Dates are stored in compact date format.
date = u8(0x7a) & compact_date;
compact_date: bits = """https://github.com/kstenerud/compact-time/blob/master/compact-time-specification.md#compact-date""";
Example:
[7a 56 cd 00] = Oct 22, 2051
Time values are stored in compact time format.
time = u8(0x7b) & compact_time;
compact_time: bits = """https://github.com/kstenerud/compact-time/blob/master/compact-time-specification.md#compact-time""";
Example:
[7b f7 58 74 fc f6 a7 fd 10 45 2f 42 65 72 6c 69 6e] = 13:15:59.529435422/E/Berlin
Timestamps are stored in compact timestamp format.
timestamp = u8(0x7c) & compact_timestamp;
compact_timestamp: bits = """https://github.com/kstenerud/compact-time/blob/master/compact-time-specification.md#compact-timestamp""";
Example:
[7c 81 ac a0 b5 03 8f 1a ef d1] = Oct 26, 1985 1:22:16 at location 33.99, -117.93
An array is a contiguous sequence of identically sized elements, stored in length delimited chunks. The array type determines the size of each element and how the data is to be interpreted. String types are implemented as arrays with an element size of 1 byte.
Array elements have a fixed type and size, determined by the array type. Length fields in array chunks represent the number of elements, so for example a uint32 array chunk of length 3 contains 12 bytes of array data (3 elements x 4 bytes per element), and a bit array chunk of length 10 would contain 2 bytes of array data (10 elements, 8 elements per byte, zero-padded to an 8-bit boundary).
All array types have a chunked form, and many also have a short form.
Examples:
- [82 61 62] = string (short form, length 2) with elements 'a', 'b'
- [93 04 01 02] = unsigned 8-bit array (chunked form, length 2 - see chunk header) with elements 1, 2
- [7f 22 01 00 02 00] = unsigned 16-bit array (plane 7f, short form, length 2) with elements 1, 2
Short form arrays have their length encoded in the lower 4 bits of the type code itself in order to save space when encoding arrays with lengths from 0 to 15 elements.
Note: Not all array types have a short form.
array_short_uid = u8(0x7f) & u4(0) & array_short(uid(~));
array_short_s8 = u8(0x7f) & u4(1) & array_short(s8(~));
array_short_u16 = u8(0x7f) & u4(2) & array_short(u16(~));
array_short_s16 = u8(0x7f) & u4(3) & array_short(s16(~));
array_short_u32 = u8(0x7f) & u4(4) & array_short(u32(~));
...
array_short(type) = u4(var(count, ~)) & type{count};
CBE encoders MUST use the short form whenever it is possible to do so, unless explicitly configured to do otherwise.
Examples:
- [83 61 62 63] = the string "abc" (short form - length is part of the type code)
- [90 06 61 62 63] = the string "abc" (chunked form - length is a separate field)
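The choice between the two forms can be sketched for strings as follows: use the short form for 0 to 15 bytes of UTF-8, otherwise emit a single chunk with the continuation bit cleared. `encode_string` and `uleb128_encode` are hypothetical names, and always writing one chunk is a simplification.

```python
def uleb128_encode(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def encode_string(text: str) -> bytes:
    """Encode a string, preferring the short form when it fits."""
    data = text.encode("utf-8")
    if len(data) <= 15:
        # Short form: the byte length lives in the low 4 bits of the type code.
        return bytes([0x80 | len(data)]) + data
    # Chunked form: one chunk, header = (length << 1) | continuation (0).
    return b"\x90" + uleb128_encode(len(data) << 1) + data
```

Note that the length is counted in UTF-8 bytes, not in characters.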
In chunked form, array data is represented as a series of chunks of data, each with its own header containing the number of elements in the chunk and a continuation bit that tells if more chunks follow the current one.
array_chunked_uid = plane7f(0xe0) & array_chunk(uid(~));
array_chunked_s8 = plane7f(0xe1) & array_chunk(s8(~));
array_chunked_u16 = plane7f(0xe2) & array_chunk(u16(~));
array_chunked_s16 = plane7f(0xe3) & array_chunk(s16(~));
array_chunked_u32 = plane7f(0xe4) & array_chunk(u32(~));
...
array_chunk(type) = var(header, array_chunk_header)
& type{header.count}
& [header.continuation = 1: array_chunk(type);]
;
array_chunk_header = uleb128(uany(var(count, ~)) & u1(var(continuation, ~)));
An array CAN contain any number of chunks, and the chunks don't have to be the same length. The most common use case would be to represent the entire array as a single chunk, but there might be cases where you need multiple chunks, such as when the array length is not known at the time when encoding has started (for example if it's being built progressively).
Example:
    chunk header                                 chunk header
    |                                            |
[93 1d 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 08 01 02 03 04]
 |     |                                            |
 type  data                                         data
In this example (unsigned 8-bit int array), the first chunk of the array has 14 elements and has a continuation bit of 1 (chunk header [1d]). The second chunk has 4 elements and a continuation bit of 0 (chunk header [08]). The total length of the array is thus 18 elements, split across two chunks.
All array chunks are preceded by an unsigned LEB128 encoded header containing the chunk length and a continuation bit (in the low bit of the fully decoded header). Chunk processing continues until the end of a chunk with a continuation bit of 0.
array_chunk_header = uleb128(uany(var(count, ~)) & u1(var(continuation, ~)));
Examples:
- [03] = Chunk length 1 with the continuation bit set
- [80 02] = Chunk length 128 with the continuation bit cleared
- [00] = Chunk length 0 with the continuation bit cleared (terminates any array)
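To make the header layout concrete, the following Python sketch splits a decoded header into its element count and continuation bit, then reassembles a chunked array of 8-bit elements. The helper names are hypothetical, and only 1-byte elements are handled.

```python
def uleb128_decode(data: bytes, offset: int = 0) -> tuple:
    """Return (value, next_offset) for the LEB128 field starting at offset."""
    value = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, offset
        shift += 7


def read_chunk_header(data: bytes, offset: int = 0) -> tuple:
    """Return (element_count, continuation, next_offset)."""
    header, offset = uleb128_decode(data, offset)
    # The low bit is the continuation flag; the remaining bits are the count.
    return header >> 1, bool(header & 1), offset


def read_u8_array(data: bytes, offset: int = 0) -> tuple:
    """Read all chunks of an 8-bit element array; offset points at the
    first chunk header. Returns (elements, next_offset)."""
    elements = bytearray()
    while True:
        count, more, offset = read_chunk_header(data, offset)
        chunk = data[offset:offset + count]
        if len(chunk) != count:
            raise ValueError("truncated array chunk")
        elements += chunk
        offset += count
        if not more:
            return bytes(elements), offset
```

Applied to the two-chunk example earlier in this section (starting at offset 1, just past the `93` type code), `read_u8_array` returns all 18 elements as one byte string.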
Bit array chunks with continuation=1 MUST have a length that is a multiple of 8 so that subsequent chunk data will begin on an 8-bit boundary. Only the final chunk (continuation=0) of a bit array CAN be of arbitrary size (the last chunk of array data will be zero-padded to an 8-bit boundary).
To ensure compatibility across all platforms, array chunks for string data (UTF-8) MUST always end on a codepoint boundary (do not split multi-byte codepoints between chunks).
The chunk header [00] indicates a chunk of length 0 with continuation 0, effectively terminating any array. It's no coincidence that 0x00 also acts as the NUL terminator for C-style strings. An encoder could use this feature to artificially NUL-terminate strings in order to create immutable-friendly zero-copy documents that support C-style string implementations.
[90 20 m i s u n d e r s t a n d i n g ...]
vs
[90 21 m i s u n d e r s t a n d i n g 00 ...]
This technique will only work for the general string type (0x90), not for the short string types 0x80 - 0x8f (which have no chunk headers).
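An encoder using this technique might look like the sketch below: the data goes into a single chunk with the continuation bit set, followed by an empty terminating chunk whose header doubles as the NUL byte. `encode_c_string` is a hypothetical name, and the sketch assumes the text itself contains no NUL characters.

```python
def uleb128_encode(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def encode_c_string(text: str) -> bytes:
    """Encode a general string (type 0x90) so that its data is followed
    by a 0x00 byte inside the document itself."""
    data = text.encode("utf-8")
    header = (len(data) << 1) | 1  # continuation bit set: another chunk follows
    # The trailing 0x00 is a zero-length final chunk and also a NUL terminator.
    return b"\x90" + uleb128_encode(header) + data + b"\x00"
```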
Note: If the source buffer in your decoder is mutable, you could achieve C-style zero-copy without requiring the above technique, using a scheme whereby you pre-cache the type code of the next value, overwrite that type code's memory location in the buffer with 0 (effectively "terminating" the string), and then process the next value using the pre-cached type code:
...                            // buffer = [84 t e s t 6a 10 a0 ...]
case string (length 4):        // 0x84 = string (length 4)
    cachedType = buffer[5]     // 0x6a (16-bit positive int type)
    buffer[5] = 0              // buffer = [84 t e s t 00 10 a0 ...]
    notifyString(buffer+1)     // [t e s t 00 ...] = NUL-terminated string "test"
    next(cachedType, buffer+6) // 0x6a, [10 a0 ...] = 16-bit positive int value 40976
The following types are encoded using array encoding:
Array Type | Element Size (bits) | Byte Order | Type Codes |
---|---|---|---|
Unsigned int | 8 | - | 93 |
Unsigned int | 16 | Little Endian | 7f:20-2f, 7f:e2 |
Unsigned int | 32 | Little Endian | 7f:40-4f, 7f:e4 |
Unsigned int | 64 | Little Endian | 7f:60-6f, 7f:e6 |
2's complement signed int | 8 | - | 7f:10-1f, 7f:e1 |
2's complement signed int | 16 | Little Endian | 7f:30-3f, 7f:e3 |
2's complement signed int | 32 | Little Endian | 7f:50-5f, 7f:e5 |
2's complement signed int | 64 | Little Endian | 7f:70-7f, 7f:e7 |
Bfloat16 | 16 | Little Endian | 7f:80-8f, 7f:e8 |
IEEE754 binary float | 32 | Little Endian | 7f:90-9f, 7f:e9 |
IEEE754 binary float | 64 | Little Endian | 7f:a0-af, 7f:ea |
RFC4122 UUID | 128 | Big Endian | 7f:00-0f, 7f:e0 |
String | 8 | - | 80-8f, 90 |
Resource ID | 8 | - | 91 |
Bit | 1 | Little Endian | 94 |
Media | 8 | - | 7f:f3 |
Custom Type | 8 | - | 92 |
See cbe.dogma for complete array encoding descriptions.
Strings are encoded as UTF-8.
The chunked string encoding form is:
string_chunked = u8(0x90) & string_chunk;
string_chunk = var(header, array_chunk_header)
& sized(header.count*8, char_string*)
& [header.continuation = 1: string_chunk;]
;
array_chunk_header = uleb128(uany(var(count, ~)) & u1(var(continuation, ~)));
char_string = unicode(C,L,M,N,P,S,Z);
Strings also have a short form length encoding using types 0x80-0x8f:
string_short = u4(8) & u4(var(count, ~)) & sized(count*8, char_string*);
Examples:
[8b 4d 61 69 6e 20 53 74 72 65 65 74] = Main Street
[8d 52 c3 b6 64 65 6c 73 74 72 61 c3 9f 65] = Rödelstraße
[90 2a e8 a6 9a e7 8e 8b e5 b1 b1 e3 80 80 e6 97 a5 e6 b3 b0 e5 af ba] = 覚王山 日泰寺
Resource identifiers are encoded similarly to a long-form string, but with type [91].
resource_id = u8(0x91) & string_chunk;
Example:
[91 aa 01 68 74 74 70 73 3a 2f 2f 6a 6f 68 6e 2e 64 6f 65 40 77 77 77
2e 65 78 61 6d 70 6c 65 2e 63 6f 6d 3a 31 32 33 2f 66 6f 72 75 6d 2f
71 75 65 73 74 69 6f 6e 73 2f 3f 74 61 67 3d 6e 65 74 77 6f 72 6b 69
6e 67 26 6f 72 64 65 72 3d 6e 65 77 65 73 74 23 74 6f 70]
= https://john.doe@www.example.com:123/forum/questions/?tag=networking&order=newest#top
Bit array elements are stored in little endian bit order (the first element is stored in the least significant bit of the first byte of the encoded array). Array chunks MUST have a length such that length % 8 == 0 (failure to do so is an error condition), except for the last chunk which can have any length. Unused trailing (upper) bits in the last chunk MUST be cleared to 0 by an encoder, and MUST be discarded by a decoder.
array_bit = u8(0x94) & array_bit_chunk* array_bit_chunk_last;
array_bit_chunk = uleb128(uany(var(count, mult_8(~))) & u1(1))
& reversed(1, u1(~){count})
;
array_bit_chunk_last = uleb128(uany(var(count, ~)) & u1(0))
& aligned(8, reversed(1, u1(~){count}), u1(0)*)
;
mult_8(v) = [v%8 = 0: v;];
For example, the bit array {0,0,1,1,1,0,0,0,0,1,0,1,1,1,1} would encode to [1c 7a] with a length of 15. The encoded value can be directly read on little endian architectures into the multi-byte unsigned integer value 0b111101000011100 (0x7a1c), such that the least significant bit of the unsigned integer representation is the first element of the array.
Example:
[94 16 76 06] = bit array {0,1,1,0,1,1,1,0,0,1,1}
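The bit packing described above can be sketched in Python as follows. `encode_bit_array` is a hypothetical name; the sketch writes the whole array as one final chunk, so the multiple-of-8 rule for continued chunks never comes into play.

```python
def uleb128_encode(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def encode_bit_array(bits: list) -> bytes:
    """Encode a sequence of 0/1 values as a single-chunk bit array."""
    packed = bytearray((len(bits) + 7) // 8)  # unused upper bits stay 0
    for index, bit in enumerate(bits):
        if bit:
            # Element 0 goes into the least significant bit of byte 0.
            packed[index // 8] |= 1 << (index % 8)
    header = len(bits) << 1  # length counts bits; continuation bit is 0
    return b"\x94" + uleb128_encode(header) + bytes(packed)
```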
A media object has type [7f f3] and is composed of a length-prefixed media type, followed by a byte array containing the media data.
media = plane7f(0xf3)
& var(pref,length_prefix)
& sized(pref.length*8, media_type)
& array_chunk(u8(~))
;
media_type = media_type_word & '/' & media_type_word;
media_type_word = char_media_first & char_media*;
char_media_first = 'a'~'z' | 'A'~'Z';
char_media = ('!' ~ '~')! ( '(' | ')' | '<' | '>'
| '@' | ',' | ';' | ':'
| '\\' | '"' | '/' | '['
| ']' | '?' | '='
)
;
Example:
 *1 *2 *3 *4                                              *5
[7f f3 10 61 70 70 6c 69 63 61 74 69 6f 6e 2f 78 2d 73 68 38
*6
23 21 2f 62 69 6e 2f 73 68 0a 0a 65 63 68 6f 20 68 65 6c 6c 6f 20 77 6f 72 6c 64 0a]
Points of interest:
Point | Description |
---|---|
*1 | Primary type: 0x7f = Plane 7f |
*2 | Plane 7f subtype: 0xf3 = Media |
*3 | Media Type length: 0x10 = 16 bytes |
*4 | String Data: application/x-sh |
*5 | Chunk Header: 0x38 = length 28, no continuation |
*6 | Media bytes |
The media in this example is the shell script (media type "application/x-sh"):
#!/bin/sh
echo hello world
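The example can be reproduced with a short Python sketch. `encode_media` is a hypothetical name; the sketch emits the media bytes as a single chunk and does not validate the media type against the grammar given earlier in this section.

```python
def uleb128_encode(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def encode_media(media_type: str, content: bytes) -> bytes:
    """Encode a media object: plane 7f code f3, a length-prefixed media
    type, then the content as a chunked byte array (one chunk here)."""
    type_bytes = media_type.encode("ascii")
    if not type_bytes:
        raise ValueError("media type must not be empty")
    return (b"\x7f\xf3"
            + uleb128_encode(len(type_bytes)) + type_bytes
            + uleb128_encode(len(content) << 1) + content)
```

Calling `encode_media("application/x-sh", b"#!/bin/sh\n\necho hello world\n")` yields the byte sequence shown in the example.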
Custom type values have type code [92], followed by a custom type code (encoded as an unsigned LEB128), followed by a byte array containing the custom data.
custom_type = u8(0x92) & custom_type_code & array_chunk(u8(~));
custom_type_code = uleb(~);
Note: Custom data in text form MUST be converted to binary form before being encoded into CBE, as CBE does not support the text form.
Example: a fictional "complex number" custom type with real and imaginary components represented using float32, assigned to custom type code 1
{
real: float32 = 2.94 (40 3c 28 f6)
imaginary: float32 = 3.0 (40 40 00 00)
}
Encoded as a custom type (note: the multi-byte values are encoded in little endian byte order):
 *1 *2 *3 *4          *5
[92 01 10 f6 28 3c 40 00 00 40 40]
Points of interest:
Point | Description |
---|---|
*1 | Primary type: 0x92 = Custom |
*2 | Custom type: 0x01 |
*3 | Chunk Header: 0x10 = length 8, no continuation |
*4 | Custom data (the first 4 bytes contain the "real" portion) |
*5 | The "imaginary" portion of our fictional complex number type |
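The fictional complex number example can be reproduced with Python's `struct` module. `encode_custom` is a hypothetical helper, and the payload layout (two little endian float32 values) is simply the convention chosen for this example, not something CBE mandates.

```python
import struct


def uleb128_encode(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def encode_custom(type_code: int, payload: bytes) -> bytes:
    """Encode a custom type: 0x92, the custom type code, then the payload
    as a chunked byte array (written here as a single chunk)."""
    return (b"\x92" + uleb128_encode(type_code)
            + uleb128_encode(len(payload) << 1) + payload)


# Payload for the fictional complex number: real part, then imaginary part.
complex_payload = struct.pack("<ff", 2.94, 3.0)
encoded_complex = encode_custom(1, complex_payload)
```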
A list has type code [9a], followed by a series of zero or more objects, and is terminated with [9b] (end of container).
list = u8(0x9a) & data_object* & end_container;
end_container = u8(0x9b);
Example:
[9a 01 6a 88 13 9b] = A list containing integers (1, 5000)
A map has type code [99], followed by a series of zero or more key-value pairs, and is terminated with [9b] (end of container).
map = u8(0x99) & key_value* & end_container;
key_value = keyable_object & data_object;
end_container = u8(0x9b);
Example:
[99 81 61 01 81 62 02 9b] = A map containing the key-value pairs ("a" = 1) ("b" = 2)
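Because containers are terminated rather than length-prefixed, a decoder reads them recursively until it meets the end of container code. The Python sketch below handles only a small subset of types (small integers, 16-bit positive integers, short strings, lists and maps); `decode_object` and the `END` sentinel are hypothetical names chosen for illustration.

```python
END = object()  # sentinel returned when the 0x9b end-of-container code is read


def decode_object(data: bytes, offset: int = 0) -> tuple:
    """Decode one object and return (value, next_offset)."""
    code = data[offset]
    offset += 1
    if code <= 0x64:                      # small int 0..100
        return code, offset
    if code >= 0x9C:                      # small int -100..-1
        return code - 0x100, offset
    if code == 0x6A:                      # positive 16-bit int, little endian
        return int.from_bytes(data[offset:offset + 2], "little"), offset + 2
    if 0x80 <= code <= 0x8F:              # short string, length in low 4 bits
        length = code & 0x0F
        return data[offset:offset + length].decode("utf-8"), offset + length
    if code == 0x9B:
        return END, offset
    if code == 0x9A:                      # list: values until end of container
        items = []
        while True:
            item, offset = decode_object(data, offset)
            if item is END:
                return items, offset
            items.append(item)
    if code == 0x99:                      # map: key-value pairs until end
        result = {}
        while True:
            key, offset = decode_object(data, offset)
            if key is END:
                return result, offset
            value, offset = decode_object(data, offset)
            if value is END:
                raise ValueError("map key without a value")
            result[key] = value
    raise NotImplementedError("type code 0x%02x not handled in this sketch" % code)
```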
A record has type code [96], followed by an identifier, followed by a series of values matching the order that the keys are defined in the associated record type, and is terminated with [9b] (end of container).
record = u8(0x96) & identifier & data_object* & end_container;
end_container = u8(0x9b);
Example:
A record built from the record type identified by "a" (which must be defined at the top of the document), with the first key's associated value set to 5:
[96 01 61 05 9b]
An edge has type code [97], followed by a source object, then a description object, then a destination object, and is terminated with [9b] (end of container).
edge = u8(0x97) & non_null_object & data_object & non_null_object & end_container;
end_container = u8(0x9b);
Example:
[97
91 24 68 74 74 70 3a 2f 2f 73 2e 67 6f 76 2f 68 6f 6d 65 72
91 22 68 74 74 70 3a 2f 2f 65 2e 6f 72 67 2f 77 69 66 65
91 24 68 74 74 70 3a 2f 2f 73 2e 67 6f 76 2f 6d 61 72 67 65
9b]
= the relationship graph: @(@"http://s.gov/homer" @"http://e.org/wife" @"http://s.gov/marge")
A node has type code [98], followed by a value object and zero or more child nodes, and is terminated with [9b] (end of container).
node = u8(0x98) & data_object & (node | data_object)* & end_container;
end_container = u8(0x9b);
Example:
[98 01 98 03 98 05 9b 98 04 9b 9b 98 02 9b 9b]
= the binary tree:
    1
   / \
  2   3
     / \
    4   5
null = u8(0x7d);
This type is reserved for future expansion of the format, and MUST NOT be used. If a decoder encounters a reserved type code, it is an error condition.
A local reference has type code [77], followed by a marker identifier.
local_reference = u8(0x77) & identifier;
Examples:
[77 01 61] = reference to the object marked with ID "a"
A remote reference is encoded in the same manner as a resource identifier, except with a different type code ([7f f2]).
remote_reference = plane7f(0xf2) & string_chunk;
Examples:
[7f f2 24 63 6f 6d 6d 6f 6e 2e 63 65 23 6c 65 67 61 6c 65 73 65]
= reference to relative file "common.ce", ID "legalese" (common.ce#legalese)
[7f f2 4e 68 74 74 70 73 3a 2f 2f 65 78 61 6d 70 6c 65 2e 6f 72 67 2f 63 69 74 69 65 73 2f 66 72 61 6e 63 65 23 70 61 72 69 73]
= remote reference to https://example.org/cities/france#paris, where "paris" is the local marker ID in that document
Comments are not supported in CBE. An encoder MUST skip all comments when converting CTE to CBE.
Padding is encoded as type [95]. Repeat as many times as needed.
padding = u8(0x95);
Example:
[95 95 95 6c 00 00 00 8f] = 0x8f000000, padded such that the 32-bit integer begins on a 4-byte boundary.
A record type has type code [7f f1], followed by an identifier, followed by a series of keys, and is terminated with [9b] (end of container).
record_type = plane7f(0xf1) & identifier & keyable_object* & end_container;
end_container = u8(0x9b);
Example:
A record type named "a", containing the key "b":
[7f f1 01 61 81 62 9b]
A marker has type code [7f f0], followed by a marker identifier, and then the marked object.
marked_object(type) = plane7f(0xf0) & identifier & type;
Example:
[7f f0 01 61
99
8a 73 6f 6d 65 5f 76 61 6c 75 65
90 22 72 65 70 65 61 74 20 74 68 69 73 20 76 61 6c 75 65
9b]
= the map {"some_value" = "repeat this value"}, tagged with the ID "a".
Identifiers begin with an unsigned LEB128 length field (min length 1 byte), followed by that many bytes of UTF-8 data.
The length field CANNOT be 0.
identifier = var(pref,length_prefix) & sized(pref.length*8, char_identifier*);
length_prefix = uleb(var(length, 1~));
char_identifier = unicode(Cf,L,M,N) | '_' | '.' | '-';
Note: Identifiers are not standalone objects; they are always part of another object.
Examples:
[07 73 6f 6d 65 5f 69 64] = some_id
[0f e7 99 bb e9 8c b2 e6 b8 88 e3 81 bf ef bc 95] = 登録済み5
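A minimal Python sketch of identifier encoding and decoding follows. The names `encode_identifier` and `decode_identifier` are hypothetical, and for brevity the sketch enforces only the non-zero length rule, not the permitted character set.

```python
def uleb128_encode(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def encode_identifier(name: str) -> bytes:
    """Encode an identifier: LEB128 byte length, then the UTF-8 data."""
    data = name.encode("utf-8")
    if not data:
        raise ValueError("identifier length cannot be 0")
    return uleb128_encode(len(data)) + data


def decode_identifier(data: bytes, offset: int = 0) -> tuple:
    """Return (name, next_offset) for the identifier starting at offset."""
    length = 0
    shift = 0
    while True:  # inline unsigned LEB128 decode of the length field
        byte = data[offset]
        offset += 1
        length |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    if length == 0:
        raise ValueError("identifier length cannot be 0")
    return data[offset:offset + length].decode("utf-8"), offset + length
```

Since identifiers are always part of another object, a local reference to the marker "a" would be assembled as `b"\x77" + encode_identifier("a")`.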
An empty document in CBE is signified by using null as the top-level object:
[81 01 7d]
Preservation of the original numeric data type information is not considered important by default. Encoders SHOULD use the smallest encoding that stores a value without data loss.
Specialized applications MAY wish to preserve more numeric type information to distinguish floats from integers, or even to distinguish between data sizes. This is allowed, as it will make no difference to a generic decoder (although it will bloat the document).
Applications might require data to be aligned in some cases for optimal decoding performance. For example, some processors might not be able to read unaligned multi-byte data types without special (costly) intervention. An encoder could in theory be tuned to insert padding when encoding certain data, trading document size for encoding/decoding efficiency:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
95 | 95 | 95 | 6c | 00 | 00 | 00 | 8f |
Alignment tuning is usually only useful when the target decoding environment is known prior to encoding (it's mostly an optimization for closed systems).
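One way an encoder could insert such padding is sketched below in Python. `pad_for_alignment` is a hypothetical helper; it assumes alignment is measured from the start of the output buffer, and that the buffer itself will be loaded at a suitably aligned address.

```python
def pad_for_alignment(document: bytearray, alignment: int,
                      type_code_size: int = 1) -> None:
    """Append padding (0x95) until the payload that follows the next
    object's type code would start on a multiple of `alignment`."""
    while (len(document) + type_code_size) % alignment:
        document.append(0x95)


# Usage: align the payload of a 32-bit positive integer (type 0x6c)
# to a 4-byte boundary.
buffer = bytearray()
pad_for_alignment(buffer, 4)
buffer += b"\x6c" + (0x8F000000).to_bytes(4, "little")
```

Starting from an empty buffer, this produces three padding bytes followed by the integer, as in the table above.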
Date | Version |
---|---|
July 22, 2018 | Draft |
TBD | Version 1 |
Copyright (c) 2018-2023 Karl Stenerud. All rights reserved.
Distributed under the Creative Commons Attribution License (license deed).