# Multidimensional analysis 🔗

In the Crunch system, any analysis is also referred to as a “cube”. Cubes are the mechanical means of representing analyses to and from the Crunch system; you can think of them as spreadsheets that might have other than two dimensions. A cube consists of two primary parts: “dimensions” which supply the cube axes, and “measures” which populate the cells. Although both the request and response include dimensions and measures, it is important to distinguish between them. The request supplies expressions for each, while the response has data (and metadata) for each. The request declares what variables to use and what to do with them, while the response includes and describes the results. See Object Reference:Cube for complete details.

## Dimensions 🔗

Each dimension of an analysis can be simply one variable, a function over it, a traversal of its subvariables (for array variables), or even a combination of multiple variables (e.g. A + B). Any expression you can use in a “select” command can be used as a dimension. The big difference is that the system will consider the distinct values rather than all values of the result. Variables which are already “categorical” or “enumerated” will simply use their “categories” or “elements” as the extent. Other variables form their extents from their distinct values.

For example, if “3ffd45” is a categorical variable with three categories (one of which is “No Data”: -1), then the following dimension expressions:

```{
"dimensions": [
{"variable": "datasets/ab8832/variables/3ffd45/"},
{"function": "+", "args": [{"variable": "datasets/ab8832/variables/2098f1/"}, {"value": 5}]}
]
}
```

…would form a result cube with two dimensions: one using the categories of variable “3ffd45”, and one using the distinct values of (variable “2098f1” + 5). If variable “2098f1” has the distinct values [5, 15, 25, 35], then we would obtain a cube with the following extents:

1 2 -1
5
15
25
35

Each dimension used in a cube query needs to be reduced to distinct values. For categorical or enumerated variables, we only need to refer to the variable, and the system will automatically use the “categories” or “elements” metadata to determine the distinct values. For other types, the default is to scan the variable’s data to find the unique values present and use those. Often, however, we want a more sophisticated approach: numeric variables, for example, are usually more useful when binned into a handful of ranges, like “0 to 10, 10 to 20, …90 to 100” rather than 100 distinct points (or many more when dealing with non-integers). The available dimensioning functions vary from type to type; the most common are:

• categorical: {“variable”: url}
• text: {“variable”: url}
• numeric: Group the distinct values into a smaller number of bins via:
• {“function”: “bin”, “args”: [{“variable”: url}]}
• datetime: Roll up seconds into hours, days into months, or any other grouping via:
• {“function”: “rollup”, “args”: [{“variable”: url}, {“value”: variable.rollup_resolution}]}
• categorical_array:
• One dimension for the subvariables: {“each”: url}
• One dimension for the categories: {“variable”: url}
• multiple response:
• One dimension for the subvariables: {“each”: url}
• One dimension for the selected-ness, which means transforming the array from a set of arbitrary categories to a standard “selected” set of categories (1, 0, -1) via:
• {“function”: “selections”, “args”: [{“variable”: url}]}

## Measures 🔗

A set of named functions to populate each cell of the cube. You can request multiple functions over the same dimensions (such as “cube_mean” and “cube_stddev”) or more commonly just one (like “cube_count”). For example:

```{"measures": {"count": {"function": "cube_count", "args": []}}}
```

or:

```{"measures": {
"mean": {"function": "cube_mean", "args": [{"variable": "datasets/1/variables/3"}]},
"stddev": {"function": "cube_stddev", "args": [{"variable": "datasets/1/variables/3/"}]}
}}
```

When applied to the dimensions we defined above, this second example might fill the table thusly for the “mean” measure:

mean 1 2 -1
5 4.3 12.3 8.1
15 13.1 0.0 9.2
25 72.4 4.2 55.5
35 8.9 9.1 0.4

…and produce a similar one for the “stddev” measure. You can think of multiple measures as producing “overlays” over the same dimensions. However, the actual output format (in JSON) is more compact in that the dimensions are not repeated; see Object Reference:Cube output for details.

ZCL expressions are composable. If you need, for example, to find the mean of a categorical variable’s “numeric_value” attributes, cast the variable to the “numeric” type class before including it as the cube argument:

```{"measures": {
"mean": {
"function": "cube_mean",
"args": [{
"function": "cast",
"args": [
{"variable": "datasets/1/variables/3"},
{"class": "numeric"}
]
}]
}
}}
```

## Comparisons 🔗

Occasionally, it is useful to compare analyses from different sources. A common example is to define “benchmarks” for a given analysis, so that you can quickly compare an analysis to an established target. These are, in effect, one analysis laid over another in such a way that at least one of their dimensions lines up (and typically, using the same measures). These are also therefore defined in terms of cubes: one set which defines the base analyses, and another which defines the overlay.

For example, if we have an analysis over two categorical variables “88dd88” and “ee4455”:

```{
"dimensions": [
{"variable": "../variables/88dd88/"},
{"variable": "../variables/ee4455/"}
],
"measures": {"count": {"function": "cube_count", "args": []}}
}
```

then we might obtain a cube with the following output:

1 2 -1
1 15 12 9
2 72 8 3
3 23 4 17

Let’s say we then want to overlay a comparison showing benchmarks for 88dd88 as follows:

1 2 -1 benchmarks
1 15 12 9 20
2 72 8 3 70
3 23 4 17 10

Our first pass at this might be to generate the benchmark targets in some other system, and hand-enter them into Crunch. To accomplish this, we need to define a comparison. First, we need to define the “bases”: the cube(s) to which our comparison applies, which in our case is just the above cube:

```{
"name": "My benchmark",
"bases": [{
"dimensions": [{"variable": "88dd88"}],
"measures": {"count": {"function": "cube_count", "args": []}}
}]
}
```

Notice, however, that we’ve left out the second dimension. This means that this comparison will be available for any analysis where “88dd88” is the row dimension. The base cube here is a sort of “supercube”: a superset of the cubes to which we might apply the comparison. We include the measure to indicate that this comparison should apply to a “cube_count” (frequency count) involving variable “88dd88”.

Then, we need to define target data. We are supplying these in a hand-generated way, so our measure is simply a static column instead of a function:

```{
"overlay": {
"dimensions": [{"variable": "88dd88"}],
"measures": {
"count": {
"column": [20, 70, 10],
"type": {"function": "typeof", "args": [{"variable": "88dd88"}]}
}
}
}
}
```

Note that our overlay has to have a dimension, too. In this case, we simply re-use variable “88dd88” as the dimension. This ensures that our target data is interpreted with the same category metadata as our base analysis.

We POST the above to datasets/{id}/comparisons/ and can obtain the overlay output at datasets/{id}/comparisons/{comparison_id}/cube/. See the Comparisons endpoint reference for details.

## Multitables 🔗

```GET datasets/{id}/multitables/ HTTP/1.1

200 OK
{
"element": "shoji:catalog",
"index": {
"1/": {"name": "Major demographics"},
"2/": {"name": "Political tendencies"}
}
}

POST datasets/{id}/multitables/ HTTP/1.1

{
"element": "shoji:entity",
"body": {
"name": "Geographical indicators",
"template": [
{
"query": [
{
"variable": "../variables/de85b32/"
}
]
},
{
"query": [
{
"variable": "../variables/398620f/"
}
]
},
{
"query": [
{
"function": "bin",
"args": [
{
"variable": "../variables/398620f/"
}
]
}
]
}
],
"is_public": false
}
}

201 Created
Location: datasets/{id}/multitables/3/
```

Analyses as described above are truly multidimensional; when you add another variable, the resulting cube obtains another dimension. Sometimes, however, you want to compare analyses side by side, typically looking at several (even all) variables against a common set of conditioning variables. For example, you might nominate “Gender”, “Age”, and “Race” as the conditioning variables and cross every other variable with those, in order to quickly discover common correlations.

Multi-table definitions mainly provide a `template` member that clients can use to construct a valid query with the variable(s) of interest.

Crunch provides a separate catalog where you can define and manage these common sets of variables. Like most catalogs, you can GET it to see which multitables are defined.

### Template query 🔗

A multitable is a set of queries that form groups of ‘columns’ for different later chosen ‘row’ variables. It is defined by a name and a template. At minimum the template must contain a `query` fragment: this will be later inserted after some function of a row variable to form the dimension <analyzing-dimensions> of a result. Each template dimension can currently only be a function of one variable.

```GET datasets/{id}/multitable/3/ HTTP/1.1

{
"element": "shoji:entity",
"body": {
"name": "Geographical indicators",
"template": [
{
"query": [
{
"variable": "../variables/de85b32/"
}
]
},
{
"query": [
{
"variable": "../variables/398620f/"
}
]
},
{
"query": [
{
"function": "bin",
"args": [
{
"variable": "../variables/398620f/"
}
]
}
]
}
]
}
}
```

Each multi-table template may be a list of variable references and other information used to construct the dimension and transform its output.

## Transforming analyses for presentation 🔗

The `transform` member of an analysis specification (or multitable definition) is a declarative definition of what the dimension should look like after computation. The cube result dimension itself will always be derived from the `query` part of the request (`{variable: \$variableId})`, `{function: f, args: [\$variableId, …]}`, &c., after which clients should do what is necessary to arrive at the transformed result — changing element names, orders, etc.

For historical reasons, the JSON schema for transforms on a multitable definition differ somewhat from the schema used for slide analyses. The following schema description applies to transforms appearing on an analysis. The (legacy) schema used for a multitable template appears in the following section.

### Slide-analysis transforms schema 🔗

The transforms affecting a slide in a deck are on the analysis entity corresponding to the slide.

Unlike a multitable-template transform, transforms on a slide-analysis can apply to both the rows and columns of the data matrix underlying a visualization.

Example:

A saved analysis can include a “transforms”: object. This example makes a univariate table of a multiple response variable, applies an explicit reordering, relabels one row, and hides another, and specifies the color of one bar when displayed as a chart.

```{
"query": {
"dimensions": [
{"variable": "../variables/398620f/"},
{
"function": "selections",
"args": [{"variable": "../variables/398620f/"}]
}
],
"measures": {
"count": {"function": "cube_count", "args": []}
}
},
"transforms": {
"rows_dimension": {
"elements": {
"0003": {
"fill": "#AB12EF",
"name": "Toastie"
},
"0002": {
"hide": true
}
},
"order": {
"type": "explicit",
"element_ids": ["0003", "0001", "0002", "0005", "0004"],
}
}
}
}
```

### Schema - General principles 🔗

The following principles underlie the design of this schema for analysis transforms.

transforms cascade. Transforms can be defined at multiple levels. For example, default transforms can be defined on a variable and apply to all analyses deriving from that variable. The analysis-transforms described here inherit from the default transforms. When a transform specified at the default level is also specified at the analysis level, the analysis-transform overrides the inherited value. Any default-transforms not specified at the analysis level apply without change.

Not all inherited values arise from other transforms. For example, element name and element ordering are inherited from the variable definition itself.

strictly-sparse. The transforms schema is strictly sparse, meaning only those items that must be specified should be specified. This is critical to the proper cascading behavior of transforms.

In particular, adding an element transform for all elements in the dimension, setting `"name":` to the existing element name in each case, will produce misbehavior. While this would produce a visualization that looked right initially, it defines transforms which the user did not ask for and almost certainly did not intend. In this case, if the underlying element name were later changed on the variable definition, that change would appear to “not take” on this analysis because the new name is being transformed back to the old name.

A consequence of the strictly-sparse ethic is that everything is optional. Every value present specifies a transform and no values need be present. Container objects that are empty should not appear.

In charts, rows are series and columns are categories. Because transforms describe the underlying data matrix, we use the names “rows” and “columns” to give the dimensions an intuitive interpretation that corresponds to a tabular visualization. The mapping of rows and columns to chart items has some added complication though, depending on the dimensionality of the analysis.

In a bivariate chart, each row maps to a series (set of same-color bars) and each column maps to a chart category (cluster of bars). In a univariate chart, the overall chart maps to the rows-dimension and each row element represents a chart category.

### `transforms` object 🔗

The analysis entity can contain an optional `"transforms":` object. If omitted, no transforms are specified on the analysis.

The `"transforms":` object can contain a `"rows_dimension":` item and a `"columns_dimension":` item. Both are optional. A `"columns_dimension":` object can only appear on a bivariate analysis (a univariate analysis has no columns dimension).

### `transforms/rows_dimension` object 🔗

The `"rows_dimension":` item in the transforms object is optional. If omitted, no transforms are specified on the rows dimension.

The `"rows_dimension":` object can contain a `"name":` item, a `"description":` item, an `"elements":` object and an `"order":` object. All are optional.

• The `"name":` item contains a string display-name that will override the dimension name (generally the variable name) wherever that appears in the analysis or visualization.
• The `"description":` item contains a string display-name that will override the dimension description (generally the variable description) wherever that appears in the analysis or visualization.
• The `"elements":` object contains an element-transforms object for each rows-dimension element having one or more transforms. Each element-transforms object is keyed by its element-id, converted to a string when numeric. The form of an element-transforms object is described in a following section. The elements container object should be omitted when empty.
• The `"order":` object, if present, specifies the sequence in which rows should appear. The form of the order object is described in a following section.
```"transforms": {
"rows_dimension": {
"elements": {
"3": {...},
"2": {...},
},
"order": {
...
}
}
```

### `transforms/columns_dimension` object 🔗

The `"columns_dimension":` item in the transforms object is optional. If omitted, no transforms are specified on the columns dimension.

A `"columns_dimension":` object is only meaningful for a bivariate analysis (a univariate analysis has no columns-dimension). It should not appear on a univariate analysis.

The `"columns_dimension":` object can contain a `"name":` item, a `"description":` item, an `"elements":` object and an `"order":` object. All are optional.

• The `"name":` item contains a string display-name that will override the dimension name (generally the variable name) wherever that appears in the analysis or visualization.
• The `"description":` item contains a string display-name that will override the dimension description (generally the variable description) wherever that appears in the analysis or visualization.
• The `"elements":` object contains an element-transforms object for each columns-dimension element having one or more transforms. Each element-transforms object is keyed by its element-id, converted to string when numeric. The form of the element-transforms object is the same as that appearing in the rows-dimension object and is described in a following section. The elements container object should be omitted when empty.
• The `"order":` object, if present, specifies the sequence in which columns should appear. The form of the order object is the same as that appearing in the rows-dimension object and is described in a following section.
```"transforms": {
"columns_dimension": {
"elements": {
"1": {...},
"4": {...},
},
"order": {
...
}
}
```

### Element-transforms objects 🔗

Certain transforms can be applied to an individual element, including rename and hide. These transforms are specified in an element-transforms object, contained in the `"elements:"` collection for a row or column dimension. Each element-transforms object is keyed by its element-id (category-id on certain variable types). The id-value is converted to string when numeric as JSON only allows string keys.

An element-transforms object can contain a `"fill":` item (sometimes), a `"hide":` item and a `"name":` item. All are optional.

• The `"fill":` item is an RGB color value like `"#ffd0ac"` and only applies to visualizations that use color, such as a chart; it is ignored by a table visualization. The color value must begin with a hash character (“#”) followed by six hexadecimal digits (no 3-digit color codes). Case is insignificant.

Note that the fill item can appear only in a rows-dimension element. Fill color behavior for a columns dimension element is not defined.

The behavior of a fill transform varies by dimensionality. In a univariate chart, it specifies the color of the single bar (sector in a donut chart) associated with the row-element. In a bivariate chart it specifies the color of all bars in the chart series associated with the row-element.

• A `"hide":` item takes the value `true` or `false`. Omitted implies “inherit”. By default, all non-missing (aka. valid) elements in a dimension appear in the visualization. The element-hide transform allows appearance of an element to be suppressed. This translates to an “invisible” row or column in a table, or a category or series that does not appear in a chart.

Hiding of this sort does not change the calculation of the underlying data matrix in any way. All values are the same, including summary values and totals. Only the display of the values associated with the element in the visualization is suppressed.

In general, a hide item should only appear when its value is `true`. A value of `false` overrides any setting cascaded from a higher level, such as a default transform, and would restore visibility to an element hidden there. The hide-transform should be thought of as a three-valued setting (show, hide, or inherit-setting). Make sure you’re reflecting the user’s intent when setting this value, in particular when choosing between `false` and omitting the item entirely.

• A `"name":` item changes the name displayed on the analysis without changing the underlying element name. Do not add a name item where the user has not explicitly specified a name change, in particular, do not write the element-name into this item for each element simply because it is more convenient than distinguishing actual name-overrides. Doing so would shadow element-name changes made in the variable definition.

```"elements": {
"1": {
"fill": "#ffd0ac",
"hide": true,
"name": "T-Mobility"
}
}
```

### Element-order objects 🔗

By default, dimension elements appear in cube-result document order (where they are represented as a sequence). An element-order transform applies a different ordering to those elements, for this analysis only. The new order is specified as a list of element-ids, possibly partial.

This transform must be tolerant of both element additions and removals. Since the transform is stored separately from the variable definition, there is no assurance that a once exhaustive sequence of element-ids will fully match the elements in a later cube-result. The following processing rules provide this accommodation:

• Any element ids in the order sequence that do not correspond to an element in the cube-result are ignored.

• Any elements in the cube result that are not present in the order-sequence appear at the end, in cube-result document order.

Note that this rule provides support for “partial” ordering, such as placing a given brand first, followed by the remaining brands in document order, without needing to specify the remaining brands.

Currently, the only sort type is “explicit”. Others, such as sort by value are expected to follow.

```"order": {
"type": "explicit",
"element_ids": [42, 7, 3]
}
```

## Multitable Transforms Schema 🔗

A `transform` can contain `elements` or `categories`, which is an array of target transforms for output-dimension elements. Therefore to create a valid element/category `transform` it is generally necessary to make a cube query, inspect the result dimension, and proceed from there. For categorical and multiple response variables, elements may also be obtained from the variable entity.

Transforms are designed for variables that are more stable than not, with element ids that inhere in the underlying elements, such as category or subvariable ids. Dynamic elements such as results of `bin`ning a numeric variable, may not be transformed.

Transformations stored on a variable’s `view` are the default transforms for that variable. They may be shorter, alternate versions of category names, or contain insertions, described below.

### Insertions 🔗

In addition to transforming the categories or elements already defined on a cube ‘dimension’, it is possible to insert headings and subtotals to the result. These `insertions` are attached after an `anchor` element/category id.

Insertions are processed last, after renaming, reordering, or sorting elements according to the elements/categories transform specification. They are “attached” to their anchor, always following it in the result; or, simply appended to the end of the result. If the result is sorted by some column’s value, it may make the most sense to choose to display insertions last, rather than inserting them into a result table because their values will not be considered when sorting the non-inserted elements themselves.

An insertion is defined by an anchor and a name, which will be displayed alongside the names of categories/elements. It may also contain `"function": "subtotal"` and `"args": []`, where the array of `args` are the category `id`s of elements to combine as “subtotals”.

Use an anchor of `top` to indicate an insertion before other results. Use an anchor of `bottom` to indicate an insertion after other results. Any anchor that does not match an id in the elements/categories will be included at the end of results.

### Example 🔗

Consider the following example result dimension:

Name missing id
Element A   0
Element B   1
Element C   2
Don’t know   3

An element transform can specify a new order of output elements, new names, and in the future, bases for hypothesis testing, result sorting, and aggregation of results. A `transform` has elements that look generally like the dimension’s extent, with some optional properties:

• id: (required) id of the target element/category
• name: name of new target element/category
• sort: `-1` or `1` indicating to sort results descending or ascending by this element
• compare: `neq`, `leq`, `geq` indicating to test other rows/columns against the hypothesis that they are ≠, ≤, or ≥ to the present element
• hide: suppress this element’s row/column from displaying at all. Defaults to false for valid elements, true for missing, so that if an element is added, it will be present until a transform with `hide: true` is added to suppress it.

A `transform` with object members can do lots of things. Suppose we want to put Element C first, hide the Don’t know, and more compactly represent the result as just C, A, B:

```{
"transform": {"categories": [
{"id": 2, "name": "C"},
{"id": 0, "name": "A"},
{"id": 1, "name": "B"},
{"id": 3, "hide": true}
]}
}
```

### Example transform in a multitable template 🔗

In a multitable, the `transform` is part of each dimension definition object in the `template` array.

```{
"template": [
{
"query": [
{"variable": "A"}
],
"transform": [{}, {}]
},
{
"query": [
{
"function": "rollup",
"args": [
{"value": "M"},
{"variable": "B"}
]
}
]
}
]
}
```

## More complex multitable templates 🔗

The template may contain in addition to variable references and their query arguments, an optional `transform`: To obtain their multiple output cubes, you `GET datasets/{id}/cube?query=<q>` where `<q>` is a ZCL object in JSON format (which must then be URI encoded for inclusion in the querystring). Use the “each” function to iterate over the overview variables’ `query`, producing one output cube for each one as “variable x”. For example, to cross each of the above 3 variables against another variable “449b421”:

```{
"function": "each",
"args": [
{"value": "x"},
[
{"variable": "de85b32"},
{"variable": "398620f"},
{"variable": "c116a77"}
]
],
"block": {
"function": "cube",
"args": [
[
{"variable": "449b421"},
{"variable": "x"}
],
{
"map": {
"count": {
"function": "cube_count",
"args": []
}
}
},
{"value": null}
]
}
}
```

The result will be an array of output cubes:

```{
"element": "shoji:view",
"value": [
{
"query": {},
"result": {
"element": "crunch:cube",
"dimensions": [
{
"references": "449b421",
"type": "etc."
},
{
"references": "de85b32",
"type": "etc."
}
],
"measures": {
"count": {
"function": "cube_count",
"args": []
}
}
}
},
{
"query": {},
"result": {
"element": "crunch:cube",
"dimensions": [
{
"references": "449b421",
"type": "etc."
},
{
"references": "398620f",
"type": "etc."
}
],
"measures": {
"count": {
"function": "cube_count",
"args": []
}
}
}
},
{
"query": {},
"result": {
"element": "crunch:cube",
"dimensions": [
{
"references": "449b421",
"type": "etc."
},
{
"references": "c116a77",
"type": "etc."
}
],
"measures": {
"count": {
"function": "cube_count",
"args": []
}
}
}
}
]
}
```