Binary Data Format
DAPHNE defines its own binary representation for the serialization of in-memory data objects (matrices/frames). This representation is intended to be used by default whenever we need to transfer or persistently store these in-memory objects, e.g., for
- the data transfer in the distributed runtime
- a custom binary file format
- the eviction of in-memory data to secondary storage
Disclaimer: The current specification is a first draft and will likely be refined as we proceed. At the moment, we focus on the case of a single block per data object.
Endianess: For now, we assume little endian.
Images: In the images below, all addresses and sizes are specified in bytes ([B]
).
Binary Representation of a Whole Data Object
The binary representation of a data object (matrix/frame) starts with a header containing general and data type-specific information. The data object is partitioned into rectangular blocks (in the extreme case, this can mean a single block). All blocks are represented individually (see binary representation of a single block below) and stored along with their position in the data object.
Header
The header consists of the following information:
- DAPHNE binary format version number (
1
for now) (uint8) - data type
dt
(uint8) - number of rows
#r
(uint64) - number of columns
#c
(uint64)
We currently support the following data types:
code | data type |
---|---|
0 |
(reserved) |
1 |
DenseMatrix |
2 |
CSRMatrix |
3 |
Frame |
We currently support the following value types:
code | C++ value type |
---|---|
0 | (reserved) |
1 | uint8_t |
2 | uint16_t |
3 | uint32_t |
4 | uint64_t |
5 | int8_t |
6 | int16_t |
7 | int32_t |
8 | int64_t |
9 | float |
10 | double |
Depending on the data type, there are more information in the header:
For DenseMatrix
and CSRMatrix
:
- value type
vt
(uint8)
addr[B] 0 0 1 1 2 9 10 17 18 18
+---+----+----+-----+-----+
| 1 | dt | #r | #c | vt |
+---+----+----+-----+-----+
size[B] 1 1 8 8 1
For Frame
:
- value type
vt
(uint8), for each column - length of the label
len
(uint16) and labellbl
(character string), for each column
addr[B] 0 0 1 1 2 9 10 17 18 18+#c-1 18+#c *
+---+----+----+-----+-------+ +----------+--------+--------+ +-----------+-----------+
| 1 | dt | #r | #c | vt[0] | ... | vt[#c-1] | len[0] | lbl[0] | ... | len[#c-1] | lbl[#c-1] |
+---+----+----+-----+-------+ +----------+--------+--------+ +-----------+-----------+
size[B] 1 1 8 8 1 1 2 len[0] 2 len[#c-1]
Body
The body consists of a sequence of:
- a pair of
- row index
rx
(uint64) - column index
cx
(uint64) - a binary block representation
For the special case of a single block, this looks as follows:
addr[B] 0 7 8 15 16 *
+---+----+----------+
| 0 | 0 | block[0] |
+---+----+----------+
8 8 *
size[B]
Binary Representation of a Single Block
A single data block is a rectangular partition of a data object. In the extreme case, a single block can span the entire data object in both dimensions (one block per data object).
General block header
- number of rows
#r
(uint32) - number of columns
#c
(uint32) - block type
bt
(uint8) - block type-specific information (see below)
addr[B] 0 3 4 7 8 8 9 *
+----+----+----+--------------------------+
| #r | #c | bt | block type-specific info |
+----+----+----+--------------------------+
size[B] 4 4 1 *
Block types
We define different block types to allow for a space-efficient representation depending on the data.
When serializing a data object, the block types are not required to match the in-memory representation (e.g., the blocks of a DenseMatrix
could use the sparse binary representation).
Moreover, different blocks may be represented as different block types (e.g., some blocks might use the dense binary representation and others the sparse one).
We currently support the following block types:
code | block type |
---|---|
0 |
empty |
1 |
dense |
2 |
sparse (CSR) |
3 |
ultra-sparse (COO) |
Most block types store their value type as part of the block type-specific information.
Note that the value type used for the binary representation is not required to match the value type of the in-memory object (e.g., DenseMatrix<uint64_t>
may be represented as a dense block with value type uint8_t
, if the value range permits).
Furthermore, each block may be represented using its individual value type.
Empty block
This block type is used to represent blocks that contain only zeros of the respective value type very space-efficiently.
Block type-specific information: none
Dense block
Block type-specific information:
- value type
vt
(uint8) - values
v
in row-major (value typevt
)
Below, S
denotes the size (in bytes) of a single value of type vt
.
addr[B] 0 3 4 7 8 8 9 9 10 10+#r*#c*S
+----+----+---+----+---------+---------+ +---------------+
| #r | #c | 1 | vt | v[0, 0] | v[0, 1] | ... | v[#r-1, #c-1] |
+----+----+---+----+---------+---------+ +---------------+
size[B] 4 4 1 1 S S S
Sparse block (compressed sparse row, CSR)
Block type-specific information:
- value type
vt
(uint8) - number of non-zeros in the block
#nzb
(uint64) - for each row
- number of non-zeros in the row
#nzr
(uint32) - for each non-zero in the row
- column index
cx
(uint32) - value
v
(value typevt
)
- column index
Note that both a row and the entire block might contain no non-zeros.
Below, S
denotes the size (in bytes) of a single value of type vt
.
18 + 4*#r +
addr[B] 0 3 4 7 8 8 9 9 10 17 18 #nzb*(4+S)
+----+----+---+----+------+--------+ +--------+ +-----------+
| #r | #c | 2 | vt | #nzb | row[0] | ... | row[i] | ... | row[#r-1] |
+----+----+---+----+------+--------+ +--------+ +-----------+
size[B] 4 4 1 1 8 4+#nzr[i]*(4+S)
______________________________________|_______________________
/ \
+---------+----------+ +----------+ +------------------+
| #nzr[i] | nz[i, 0] | ... | nz[i, j] | ... | nz[i, #nzr[i]-1] |
+---------+----------+ +----------+ +------------------+
4 4+S 4+S 4+S
______________|____________
/ \
+----------+----------------+
| cx[i, j] | v[i, cx[i, j]] |
+----------+----------------+
4 S
Ultra-sparse block (coordinate, COO)
Ultra-sparse blocks contain almost no non-zeros, so we want to keep the overhead of the meta data low. Thus, we distinguish blocks with a single column (where we don't need to store the column index) and blocks with more than one column.
Blocks with a single column
Block type-specific information:
- value type
vt
(uint8) - number of non-zeros in the block
#nzb
(uint32) - for each non-zero
- row index
rx
(uint32) - value
v
(value typevt
)
Below, S
denotes the size (in bytes) of a single value of type vt
.
addr[B] 0 3 4 7 8 8 9 9 10 13 14 14+#nzb*(4+S)
+----+----+---+----+------+-------+ +-------+ +------------+
| #r | #c | 3 | vt | #nzb | nz[0] | ... | nz[i] | ... | nz[#nzb-1] |
+----+----+---+----+------+-------+ +-------+ +------------+
size[B] 4 4 1 1 4 4+S 4+S 4+S
__________|__________
/ \
+-------+-------------+
| rx[i] | v[rx[i], 0] |
+-------+-------------+
4 S
Blocks with more than one column
Block type-specific information:
- value type
vt
(uint8) - number of non-zeros in the block
#nzb
(uint32) - for each non-zero
- row index
rx
(uint32) - column index
cx
(uint32) - value
v
(value typevt
)
Below, S
denotes the size (in bytes) of a single value of type vt
.
addr[B] 0 3 4 7 8 8 9 9 10 13 14 14+#nzb*(8+S)
+----+----+---+----+------+-------+ +-------+ +------------+
| #r | #c | 3 | vt | #nzb | nz[0] | ... | nz[i] | ... | nz[#nzb-1] |
+----+----+---+----+------+-------+ +-------+ +------------+
size[B] 4 4 1 1 4 8+S 8+S 8+S
________________|________________
/ \
+-------+-------+-----------------+
| rx[i] | cx[i] | v[rx[i], cx[i]] |
+-------+-------+-----------------+
4 4 S