Data Properties: Representation, Inference, and Exploitation

DAPHNE data objects (matrices/frames) can be described by various data properties (synonym: data characteristics), such as the data type and value type, the shape, the sparsity, the number of distinct values, the value distribution, etc.

Information about the data properties can be exploited for optimizing runtime and memory consumption in several ways, e.g., through:

Simplification rewrites (e.g., eliminate transpose on a symmetric matrix, eliminate row-wise sum on a single-column matrix, etc.)
Physical data/value type selection (e.g., sparse representations like CSR for sparse data, compressed representations for small integers, etc.)
Physical operator/algorithm selection (e.g., special algorithms on sparse data, special algorithms on sorted data, etc.)

This document explains how DAPHNE handles interesting data properties and justifies why it is done this way. Furthermore, it explains how to extend DAPHNE with additional data properties.

Currently Supported Data Properties

DAPHNE supports an individual set of data properties for each data type as described in the list below. We view the data and value type as a special kind of data property. The fact that DAPHNE supports a certain data property does not mean that the data property will be known for every SSA value; some data properties may not be inferred because (a) they simply cannot be known in a certain situation, or (b) there are gaps due to the current state of the implementation.

Matrix
- value type: the common type of all matrix elements
- shape (number of rows, number of columns): the logical size of the matrix
- sparsity: the number of non-zero elements divided by the total number of elements
- physical representation: the way the matrix is stored in memory (e.g., dense or sparse)
Frame
- value types: the array of types for each column
- shape (number of rows, number of columns): the logical size of the frame
- column labels: the array of textual labels for each column (labels must be unique for each frame)
Column
- value type: the common type of all column elements
- shape (number of rows): the logical size of the column
Scalar
- (none so far)

Representation of Data Properties

Data properties are represented in the DaphneIR and, thus, available at compile-time. The rationale for a compile-time representation (as opposed to a pure run-time representation) is twofold:

Some ways of inferring data properties require a global view on the program. This global view is only available at compile-time. For instance, the result of t(X) @ X is a symmetric matrix for any matrix X, but that cannot easily be derived from looking at the individual ops.
Some ways of exploiting data properties for optimizations need to take place at compile-time, because they apply global optimizations to the IR (e.g., simplification rewrites, operator ordering, code generation, and operator fusion).

Data properties are stored as parameters of custom MLIR types (see src/ir/daphneir/DaphneTypes.td) (e.g., Matrix/mlir::daphne::MatrixType and Frame/mlir::daphne::FrameType).

There would have been some alternative approaches to using MLIR type parameters. Those alternatives would not necessarily be impossible to use for the purpose of representing data properties. However, we decided against them. We considered the following alternatives:

MLIR attributes on MLIR operations. We decided against this approach, because attributes attached to ops typically denote the data properties of the op's result(s). We want to be able to represent data properties of any SSA value (mlir::Value), but not all SSA values are produced by an MLIR operation (some are block arguments). An operation can have multiple results, would lead to duplication of attributes.
MLIR analysis passes. TODO Explain why not.

Nevertheless, the chosen approach of representing data properties as MLIR type parameters also has some downsides:

Data properties technically become a part of the MLIR type. Hence, we cannot use MLIR's type inference interface unless we make type inference always infer all data properties (which might be costly). However, this point is not too bad, since we have our own type inference interface.
As a consequence, type comparisons can have, at first glance, unexpected results (e.g., a matrix of value type f64 with sparsity 0.5 is not the same MLIR type as matrix of value type f64 with sparsity 0.1). As a workaround, we introduced the withSameValueType() utility function on our custom MLIR types. This functions sets all data properties to unknown except the value type(s). After invoking this utility function, type comparisons will only consider the data/value type, so the comparison result should be as expected.

Each data property can have an unknown value. After DaphneDSL parsing, most (almost all) data properties are unknown in the initial DaphneIR representation. Through data property inference, some/many data properties can become known at compile-time, but not necessarily all. At any point in time during compilation, a data property may not be known (yet). By the end of the DAPHNE compiler (more precisely, by RewriteToCallKernelOpPass) the data/value type of all SSA values must be known (otherwise DAPHNE cannot select the right runtime kernel for the operation), except for the column value types of frames, which may remain unknown at compile-time (because the column value types of frames are interpreted dynamically at run-time).

For each data property, there is some special value that represents unknown:

Data/value type: DAPHNE's custom Unknown type (mlir::daphne::UnknownType)
Non-negative numeric data properties (e.g., the number of rows, the number of columns, sparsity, etc.): typically -1

Complex object-valued data properties (e.g., the array of column labels of a frame): typically nullptr

Currently, information about data properties is primarily available at compile-time. The information is not passed to runtime kernels yet, but we plan to make the information available to the runtime soon.

Nevertheless, some fundamental data properties are also available at run-time, because they are part of the fundamental meta data of the runtime data structures:

Matrix
- value type (usually as a template parameter of a kernel)
- shape (number of rows, number of columns)
Frame
- column value types
- column labels

These data properties may not be known at compile-time (except for the value type of a matrix), but they are always known at run-time.

Data Property Inference

At compile-time, DAPHNE tries to find out the data properties of intermediate results in the IR. This process is called property inference (synonyms: property propagation, property estimation). The basic idea is to infer the data properties of an op's results from the data properties of the op's arguments (and, perhaps, from compile-time constant argument values). Currently, DAPHNE does this inference only at the level of each individual op. In the future, it could also infer data properties for patterns/DAGs of ops (e.g., the result of t(X) @ X is a symmetric matrix for any matrix X, which can hardly be inferred based on the individual ops t and @).

The compiler pass responsible for data property inference is the InferencePass (src/compiler/inference/InferencePass.cpp). Besides that, SelectMatrixRepresentationsPass (src/compiler/inference/SelectMatrixRepresentationsPass.cpp) also plays a role as it selects the physical matrix representation (dense/sparse) based on the estimated sparsity, but maybe we will factor it into InferencePass in the future.

Procedure for data property inference. The InferencePass walks the IR op by op and tries to infer all data properties for all results of the current op before moving on to the next op.

The rationale for this "for each op: for each prop: infer"-procedure rather than a "for each prop: for each op: infer"-procedure is that a data property X of an op's result does not always depend on the data property X of the op's argument(s), but sometimes on another data property Y of the argument(s). For instance, the number of rows of the result of FilterRowOp depends on the sparsity of the input bit vector. Thus, before inferring any data property of the result of an op, we should have as much information on all data properties of the argument(s) of the op as possible.

Given a certain op, the InferencePass considers the supported data properties one by one. If the value of a data property is unknown for any result of the op, the inference logic for this data property on this op is invoked. The rationale for not re-inferring already known data properties is efficiency.

The op-specific inference of data properties is handled by custom MLIR interfaces and traits. By default, there is one custom MLIR interface per supported data property. The functions defined by the interface become part of the op the interface is attached to. The interface returns the inferred value of the data property (usually some kind of list, e.g., a std::vector, because an op can have multiple results).

The rationale for not making the interface set the data property on the results of the op are as follows: (1) The logic for setting the data property value would need to be redundantly implemented by every single interface implementation. Even though this logic is usually just a one-liner, there is a potential for bugs. Moreover, one would hard-code the way data properties are represented (e.g., type parameters vs. attributes) in every single interface implementation. (2) We want a single point of insertion into the IR, because data properties could also be inferred by traits (see below).

The source files involved in the inference interface of a data property ${prop} are the following (look at the files of some existing data property for the details):

src/ir/daphneir/DaphneInfer${prop}OpInterface.td
- A TableGen-based declaration of the custom MLIR interface.
- Typically the interface defines a single function infer${prop}().
- In justified cases, there can be multiple interfaces. For instance, the interface for shape inference comes with separate interfaces for the inference of the number of rows and columns (because sometimes, just one of the two dimensions can be expressed by a trait, see below).
src/ir/daphneir/DaphneInfer${prop}OpInterface.h
- Defines the custom MLIR traits for this data property (C++ part).
- Include the TableGen'erated header file ir/daphneir/DaphneInfer${prop}OpInterface.h.inc
- Declares the function tryInfer${prop}(mlir::Operation *op). (Type inference also has a function setInferredTypes(), because we want to set the data/value type also in the parser and compiler passes.)
src/ir/daphneir/DaphneInfer${prop}OpInterface.cpp
- Op-specific implementations of the inference interface, i.e., functions ${op}::infer${prop}() (for an op ${op}).
- Trait implementations.
- Definition of tryInfer${prop}(mlir::Operation *op). This function first checks if op has the respective inference interface; if so, it calls infer${prop}() on the op. Otherwise: it checks if the op has exactly one result and a related inference trait; if so, it applies the trait. Otherwise: it sets the data property to unknown for all results.

In fact, many ops have a quite similar implementation of an inference interface. Thus, we use custom MLIR traits to express common inference behavior. Traits can refer to the entire data property or an aspect of it. For instance, type inference has some traits that express how to infer the complete data and value type, and some traits for either the data or the value type (for better reuse of traits).

The source files involved in the inference traits of a data property ${prop} are the following (again, see concrete examples in the code for more details):

src/ir/daphneir/DaphneInfer{$prop}Traits.td
- TableGen-based definition of custom (parametric) MLIR traits; essentially just references to the C++ part in src/ir/daphneir/DaphneInfer${prop}OpInterface.h.

Property inference and control flow constructs.

TODO

Relation to constant folding. In the DAPHNE compilation chain, data property inference is executed multiple times and interleaved with the MLIR CanonicalizationPass.

The rationale for this repetition and interleaving is that canonicalization includes constant folding, i.e., calculating the result of an op at compile-time if all inputs are compile-time constants. On the one hand, the result value of some ops depends on a data property of an argument (e.g., for NumRowsOp, the result value is the argument's number of rows). On the other hand, a result data property of some ops depends on an argument value (e.g., for FillOp, the result shape is given by argument values). Thus, data property inference and constant folding depend on each other.

Viewing the data properties inferred by the DAPHNE compiler is possible through DAPHNE's explain functionality. When invoking DAPHNE with the argument --explain property_inference, the IR after data property inference is printed. The data properties of an SSA value are displayed as a part of its type, whereby the concrete syntax is determined by mlir::daphne::DaphneDialect::printType() (see src/ir/daphneir/DaphneDialect.cpp).

Command-line arguments related to data property inference.

--select-matrix-repr Makes DAPHNE use the estimated sparsities of intermediate results for selecting a suitable physical representation of a matrix (especially DenseMatrix vs. CSRMatrix) and, hence, the respective kernels, e.g., those for CSRMatrix. Currently, DAPHNE will only use sparse representations and kernels if this argument is provided, because sparsity support is still experimental in some respects. In the future, sparsity support will be on by default.
--explain property_inference Makes DAPHNE print the IR after data property inference (see above).

Relation to file meta data.

TODO

Data Property Exploitation

TODO

Extending DAPHNE's Support for Data Properties

DAPHNE's support for data properties can be extended by modifying the source code straightforwardly. This section provides step-by-step instructions for typical tasks, summarizing what needs to be done in which files. We focus on giving high-level explanations and pointers; for the details, the existing code (which contains several examples of data properties) can be used as a blueprint. Furthermore, the MLIR documentation provides more detailed information on how to do various things in MLIR.

Adding a New Data Property

New data properties can be added to DAPHNE's abstract data types (matrix, frame, etc.). Before adding a new data property called ${prop}, think about (a) which DAPHNE data type you want to add the data property to (e.g., matrix or frame), and (b) of what C++ type the new data property shall be (e.g., ssize_t, double, mlir::Type, some custom C++ enum class, or some custom C++ class). If the new data property shall have a small set of discrete values, add a new C++ enum class in namespace mlir::daphne in src/ir/Daphne.h (or use one of the existing enum classes there) and use this enum class as the C++ type of the new data property. In the following, we assume the new data property is of C++ type ${pt} and we want to add it to the DAPHNE data type ${dt}. The following steps are required (commit da35f1eb197c5e993167d6c2862d98a513b40852 is an example for adding a new data property):

Add a new parameter ${prop} to the custom MLIR type ${dt} in src/ir/daphneir/DaphneTypes.td.
Adapt mlir::daphne::detail::${dt}TypeStorage (see src/ir/daphneir/DaphneDialect.cpp) by adding the new data property. In particular, update the constructor, the member functions operator==(), hashKey(), and construct(), and add a member variable for the new data property.
Add the default value (usually the unknown value) of the new data property to the custom builder in the builders section of the custom MLIR type ${dt}.
Update the custom methods in the extraClassDeclaration of the custom MLIR type ${dt}:
- Add a method with${prop}(${pt} ${prop}) that gets a value for the new data property and returns an instance of the MLIR type that sets the value of the new data property to the given value and retains the current value for all other data properties of ${dt}.
- Add a call to get${prop}() to all other with*(...) methods such that they retain the current value of the new data property ${prop}.
- Take the new data property into account in isSpecializationOf(...).
Update the method mlir::daphne::${dt}Type::verify(...) (see src/ir/daphneir/DaphneDialect.cpp) by adding a validity check for the value of the new data property ${prop}. This step is necessary only if the C++ type ${pt} contains any values that are not valid for the data property ${prop} or if any values might conflict with the values of other data properties.
Make sure the new data property gets represented in the textual IR, such that it is displayed, e.g., when running bin/daphne --explain property_inference, (optional step, but highly recommended). To that end, take the data property into account in mlir::daphne::DaphneDialect::printType() and mlir::daphne::DaphneDialect::parseType() (see src/ir/daphneir/DaphneDialect.cpp).
Create the inference interface for ${prop}.
- Add the files src/ir/daphneir/DaphneInfer${prop}OpInterface.td, src/ir/daphneir/DaphneInfer${prop}OpInterface.h, and src/ir/daphneir/DaphneInfer${prop}OpInterface.cpp. See the description above for more details on the meaning of these files. The existing files of other data properties can be used as blueprints (up to this point, you don't need any interface implementations for specific ops and you don't need any inference traits).
- Update src/ir/daphneir/CMakeLists.txt: Add a line add_mlir_interface(DaphneInfer${prop}OpInterface) and add the .cpp file.
- Add the include #include <ir/daphneir/DaphneInfer${prop}OpInterface.h> to src/ir/daphneir/Daphne.h.
- Add the line include "ir/daphneir/DaphneInfer${prop}OpInterface.td" to the top of src/ir/daphneir/DaphneOps.td for future use.
Make the InferencePass consider the new data property (see src/compiler/inference/InferencePass.cpp):
- Update daphne::InferenceConfig::InferenceConfig. Update struct InferenceConfig as well as the line std::unique_ptr<Pass> createInferencePass(...) in src/ir/daphneir/Passes.h accordingly. Update the lines pm.addPass(daphne::createInferencePass(...)); in src/compiler/lowering/SpecializeGenericFunctionsPass.cpp accordingly.
- Update the function getTypeWithCommonInfo().
- Update the lambda walkOp: In the then-branch of if (!isScfOp) (if the current op is not a control flow operation), there is a sequence of if-statements, one for each supported data properties. Add an if-statement for the new data property ${prop} akin to the existing ones here. In that context, you will most likely want to add a new helper function returnsUnknown${prop}() for the new data property akin to the existing ones.

Making a New Op Support Inference for an Existing Data Property

To enable the op ${op} to infer the data property ${prop} for its result(s), we can either add the respective inference interface or an existing related inference trait to the TableGen record of the op ${op} in src/ir/daphneir/DaphneOps.td.

Adding an Existing Inference Interface to the Op

Add the item DeclareOpInterfaceMethods<Infer${prop}OpInterface> to the definition of ${op} in src/ir/daphneir/DaphneOps.td (see other ops for examples).
Make sure that src/ir/daphneir/DaphneOps.td has an include include "ir/daphneir/DaphneInfer${prop}OpInterface.td".
Add the C++ implementation of the inference interface in src/ir/daphneir/DaphneInfer${prop}OpInterface.cpp.
- Remember that any data property of any argument could have an unknown value.
- Take inspiration from the existing inference interface implementations (of this or another data property).

Commit 4b120162ae26a2af1906f51fed48661926c87a48 is an example.

Adding an Existing Inference Trait to the Op

To add the inference trait ${trait}, add the item ${trait} to the definition of ${op} in src/ir/daphneir/DaphneOps.td (see other ops for examples).
Make sure that src/ir/daphneir/DaphneOps.td has an include include "ir/daphneir/DaphneInfer${prop}Traits.td".

Adding a New Inference Trait for an Existing Data Property

It is not uncommon that multiple ops share a quite similar way of inferring the value of a data property ${prop} of their result(s). Rather than writing essentially the same inference interface implementation multiple times for these ops, we can optionally add a custom MLIR trait that expresses how the inference works. That way, we can make the source code simpler and easier to maintain. The following steps are required:

Create a new trait.
- Add the C++ part of the trait in src/ir/daphneir/DaphneInfer${prop}OpInterface.h. That is, something like:
```
// Example: Always infer ${prop} as 123 (non-parametric trait).
template <class ConcreteOp> class ${prop}123 : public TraitBase<ConcreteOp, ${prop}123> {};

// Example: Retain the property value from the i-th argument (parametric trait).
namespace mlir::OpTrait {
template <size_t i> struct ${prop}FromIthArg {
    template <class ConcreteOp> class Impl : public TraitBase<ConcreteOp, Impl> {};
};
} // namespace mlir::OpTrait
```
  - Create the file src/ir/daphneir/DaphneInfer${prop}Traits.td, if it doesn't exist yet. Furthermore, add an include of this .td file in src/ir/daphneir/DaphneOps.td such that the inference traits for data property ${prop} can be attached to ops. - Write the TableGen part of the trait in src/ir/daphneir/DaphneInfer${prop}Traits.td. That is, something like:
```
// For non-parametric traits.
def ${prop}123 : NativeOpTrait<"${prop}123">;

// For parametric traits.
class ${prop}FromIthArg<int i> : ParamNativeOpTrait<"${prop}FromIthArg", !cast<string>(i)>;
def ${prop}FromArg : ${prop}FromIthArg<0>; // short form for a concrete value of i
```
  1. Implement the effect the trait should have during data property inference. We need to make the DAPHNE compiler know how to infer data property ${prop} for ops that have the newly added trait. To that end, we need to:
- Update the function tryInfer${prop}(mlir::Operation *op) (see src/ir/daphneir/DaphneInfer${prop}OpInterface.cpp) such that it checks for the presence of the new trait and reacts accordingly. You can check for a non-parametric trait by, e.g., op->hasTrait<${prop}123>(). For parametric traits, you can only check for the trait including a concrete value of the parameter, e.g., op->hasTrait<${prop}FromIthArg<0>>. To check for a range of values of a trait parameter, we frequently use a custom utility tryParamTraitUntil; see the code for details.