Deriving Variables 🔗

Derived variables are variables that, instead of having a column of values backing them, are functionally dependent on other variables. In Crunch, users with view-only permissions on a dataset can still make derived variables of their own–just as they can make filters. Dataset editors can also derive other types of variables as permanent additions to the dataset, available for all viewers.

Note

In this section, sample variable documents have been trimmed to the attributes essential to the derivation examples. Actual variable documents in the wild will have more attributes than those shown here.

Combining categories 🔗

The “combine_categories” function takes two arguments:

  • A reference to the categorical or categorical_array variable to be combined
  • A definition of the categories of the new variable, including all members found in categories, plus a “combined_ids” key that maps the derived category to one or more categories (by id) in the input variable.

Given a variable such as:

{
    "element": "shoji:entity",
    "self": "https://app.crunch.io/api/datasets/3ad42c/variables/0000f5/",
    "body": {
        "name": "Education",
        "alias": "educ",
        "type": "categorical",
        "categories": [
            {"numeric_value": null, "missing": true, "id": -1, "name": "No Data"},
            {"numeric_value": 1, "missing": false, "id": 1, "name": "No HS"},
            {"numeric_value": 2, "missing": false, "id": 2, "name": "High school graduate"},
            {"numeric_value": 3, "missing": false, "id": 3, "name": "Some college"},
            {"numeric_value": 4, "missing": false, "id": 4, "name": "2-year"},
            {"numeric_value": 5, "missing": false, "id": 5, "name": "4-year"},
            {"numeric_value": 6, "missing": false, "id": 6, "name": "Post-grad"},
            {"numeric_value": 8, "missing": true, "id": 8, "name": "Skipped"},
            {"numeric_value": 9, "missing": true, "id": 9, "name": "Not Asked"}
        ],
        "description": "Education"
    }
}

POST’ing to the private variables catalog a Shoji Entity containing a ZCL function like:

{
    "element": "shoji:entity",
    "body": {
        "name": "Education (3 category)",
        "description": "Combined from six-category education",
        "alias": "educ3",
        "derivation": {
            "function": "combine_categories",
            "args": [
                {
                    "variable": "https://app.crunch.io/api/datasets/3ad42c/variables/0000f5/"
                },
                {
                    "value": [
                        {"name": "High school or less", "numeric_value": null, "id": 1, "missing": false, "combined_ids": [1, 2]},
                        {"name": "Some college", "numeric_value": null, "id": 2, "missing": false, "combined_ids": [3, 4]},
                        {"name": "4-year college or more", "numeric_value": null, "id": 3, "missing": false, "combined_ids": [5, 6]},
                        {"name": "Missing", "numeric_value": null, "id": 4, "missing": true, "combined_ids": [8, 9]},
                        {"name": "No data", "numeric_value": null, "id": -1, "missing": true, "combined_ids": [-1]}
                    ]
                }
            ]
        }
    }
}

…results in a private categorical variable with three valid categories.

Combining the categories of a categorical array is the same as it is for categorical variables. The resulting variable is also of type “categorical_array”. This variable type also has a “subvariables_catalog”, like the variable from which it is derived, and the subvariables contained in it are derived “combine_categories” categorical variables.

Combining responses 🔗

For multiple response variables, you may combine responses rather than categories.

Given a variable such as:

{
    "element": "shoji:entity",
    "self": "https://app.crunch.io/api/datasets/455288/variables/3c2e57/",
    "body": {
        "name": "Aided awareness",
        "alias": "aided",
        "subvariables": [
            "../870a2d/",
            "../a8b0eb/",
            "../dc444f/",
            "../8e6279/",
            "../f775ab/",
            "../6405c2/"
        ],
        "type": "multiple_response",
        "categories": [
            {"numeric_value": 1, "selected": true, "id": 1, "name": "Selected", "missing": false},
            {"numeric_value": 2, "id": 2, "name": "Not selected", "missing": false},
            {"numeric_value": 8, "id": 3, "name": "Skipped", "missing": true},
            {"numeric_value": 9, "id": 4, "name": "Not asked", "missing": true},
            {"numeric_value": null, "id": -1, "name": "No data", "missing": true}
        ],
        "description": "Which of the following coffee brands do you recognize? Check all that apply."
    }
}

POSTing to the variables catalog a Shoji Entity containing a ZCL function like:

{
    "element": "shoji:entity",
    "body": {
        "name": "Aided awareness by region",
        "description": "Combined from aided brand awareness",
        "alias": "aided_region",
        "derivation": {
            "function": "combine_responses",
            "args": [
                {
                    "variable": "https://app.crunch.io/api/datasets/455288/variables/3c2e57/"
                },
                {
                    "value": [
                        {"name": "San Francisco", "combined_ids": ["../870a2d/", "../a8b0eb/","../dc444f/"]},
                        {"name": "Portland", "combined_ids": ["../8e6279/", "../f775ab/"]},
                        {"name": "Chicago", "combined_ids": ["../6405c2/" ]}
                    ]
                }
            ]
        }
    }
}

…results in a multiple response variable with three responses. The “selected” state of the responses in the derived variable is an “OR” of the combined subvariables.

Case statements 🔗

The “case” function derives a variable using values from the first argument. Each of the remaining arguments contains a boolean expression. These are evaluated in order in an IF, ELSE IF, ELSE IF, …, ELSE fashion; the first one that matches selects the corresponding value from the first argument. For example, if the first two boolean expressions do not match (return False) but the third one matches, then the third value in the first argument is placed into that row in the output. You may include an extra value for the case when none of the boolean expressions match; if not provided, it defaults to the system “No Data” missing value.

{
    "element": "shoji:entity",
    "body": {
        "name": "Market segmentation",
        "description": "Super-scientific classification of people",
        "alias": "segments",
        "derivation": {
            "function": "case",
            "args": [
                {
                    "column": [1, 2, 3, 4],
                    "type": {
                        "value": {
                            "class": "categorical",
                            "categories": [
                                {"id": 3, "name": "Hipsters", "numeric_value": null, "missing": false},
                                {"id": 1, "name": "Techies", "numeric_value": null, "missing": false},
                                {"id": 2, "name": "Yuppies", "numeric_value": null, "missing": false},
                                {"id": 4, "name": "Other", "numeric_value": null, "missing": true}
                            ]
                        }
                    }
                },
                {
                    "function": "and",
                    "args": [
                        {"function": "in", "args": [{"variable": "55fc29/"}, {"value": [5, 6]}]},
                        {"function": "<=", "args": [{"variable": "673dde/"}, {"value": 30}]}
                    ]
                },
                {
                    "function": "and",
                    "args": [
                        {"function": "in", "args": [{"variable": "889dc3/"}, {"value": [4, 5, 6]}]},
                        {"function": ">", "args": [{"variable": "673dde/"}, {"value": 40}]}
                    ]
                },
                {"function": "==", "args": [{"variable": "13cbf4/"}, {"value": 1}]}
            ]
        }
    }
}

Making ad hoc arrays 🔗

It is possible to create derived arrays reusing subvariables from other arrays using the array function and indicating the reference for each of its subvariables.

The subvariables of an array are specified using the select function, with its first map argument indicating the IDs for each of these virtual subvariables. These IDs are user defined and can be any string. They remain unique inside the parent variable so they can be reused between different arrays. The second argument of the select function indicates the order of the subvariables in the array. They are referenced by the user defined IDs.

Each of its variables must point to a variable expression, which can take an optional (but recommended) references attribute to specify a particular name and alias for the subvariable, if not specified, the same name from the original will be used and the alias will be padded to ensure uniqueness.

{
  "CA3": {
    "name": "cat array 3",
    "derivation": {
      "function": "array",
      "args": [
        {
          "function": "select",
          "args": [
            {
              "map": {
                "var1": {
                  "variable": "ca2-subvar-2",
                  "references": {"alias": "subvar2", "name": "Subvar 2"}
                },
                "var0": {
                  "variable": "ca1-subvar-1",
                  "references": {"alias": "subvar1", "name": "Subvar 1"}
                }
              }
            },
            {"value": ["var1", "var0"]}
          ]
        }
      ]
    }
  },
  "CA2": {
    "subvariables": [
      {"alias": "ca2-subvar-1", "name": "ca2-subvar-1"},
      {"alias": "ca2-subvar-2", "name": "ca2-subvar-2"}
    ],
    "type": "categorical_array",
    "name": "cat array 2",
    "categories": [
      {"numeric_value": null, "missing": false, "id": 1, "name": "yes"},
      {"numeric_value": null, "missing": false, "id": 2, "name": "no"},
      {"numeric_value": null, "missing": true, "id": -1, "name": "No Data"}
    ]
  },
  "CA1": {
    "subvariables": [
      {"alias": "ca1-subvar-1", "name": "ca1-subvar-1"},
      {"alias": "ca1-subvar-2", "name": "ca1-subvar-2"},
      {"alias": "ca1-subvar-3", "name": "ca1-subvar-3"}
    ],
    "type": "categorical_array",
    "name": "cat array 1",
    "categories": [
      {"numeric_value": null, "missing": false, "id": 1, "name": "yes"},
      {"numeric_value": null, "missing": false, "id": 2, "name": "no"},
      {"numeric_value": null, "missing": true, "id": -1, "name": "No Data"}
    ]
  }
}

On the above example, the array CA3 uses the array function and uses subvariables ca1-subvar-1 and ca2-subvar-2 from CA1 and CA2 respectively. The references attribute is used to indicate specific name/alias for these subvariables.

Weights 🔗

A numeric variable suitable for use as row weights can be constructed from one or more categorical variables and target proportions of their categories. The sample distribution is “raked” iteratively to each categorical marginal target to produce a set of joint values that can be used as weights. Note that available weight variables are shared by all; you may not create private weights. To create a weight variable, POST a JSON variable definition to the variables catalog describing the properties of the weight variable, with an “derivation” member indicating to use the “rake” function, which takes arguments containing an array of variable targets:

POST /api/datasets/{datasetid}/variables/ HTTP/1.1
Content-Type: application/shoji
Content-Length: 739
{
    "name": "weight",
    "description": "my raked weight",
    "derivation": {
        "function": "rake",
        "args": [{
            "variable": variabl1.id,
            "targets": [[1, 0.491], [2, 0.509]]
        }]
    }
}
---------
201 Created
Location: /api/datasets/{datasetid}/variables/{variableid}/

Multiple Response Views 🔗

The “select_categories” function allows you to form a multiple response array from a categorical array, or alter the “selected” categories in an existing multiple response array. It takes two arguments:

  • A reference to a categorical or categorical_array variable
  • A list of the category ids or category names to mark as “selected”

Given a variable such as:

{
    "element": "shoji:entity",
    "self": "https://app.crunch.io/api/datasets/3ad42c/variables/0000f5/",
    "body": {
        "name": "Cola",
        "alias": "cola",
        "type": "categorical",
        "categories": [
            {"id": -1, "name": "No Data", "numeric_value": null, "missing": true},
            {"id": 0, "name": "Never", "numeric_value": null, "missing": false},
            {"id": 1, "name": "Sometimes", "numeric_value": null, "missing": false},
            {"id": 2, "name": "Frequently", "numeric_value": null, "missing": false},
            {"id": 3, "name": "Always", "numeric_value": null, "missing": false}
        ],
        "subvariables": ["0001", "0002", "0003"],
        "references": {
            "subreferences": {
                "0003": {"alias": "Coke"},
                "0002": {"alias": "Pepsi"},
                "0001": {"alias": "RC"}
            }
        }
    }
}

POST’ing to the private variables catalog a Shoji Entity containing a ZCL function like:

{
    "element": "shoji:entity",
    "body": {
        "name": "Cola likes",
        "description": "Cola preferences",
        "alias": "cola_likes",
        "derivation": {
            "function": "select_categories",
            "args": [
                {"variable": "https://app.crunch.io/api/datasets/3ad42c/variables/0000f5/"},
                {"value": [2, 3]}
            ]
        }
    }
}

…results in a private multiple_response variable where the “Frequently” and “Always” categories are selected.

Categorical Logic 🔗

Crunch provides logical operations like most languages do, but with a few enhancements.

First, Crunch logic is actually based on three values so you can properly handle missing data. Many programming languages do this, too: SQL implements NULL in addition to TRUE and FALSE, and general languages often have a similar None or NA value.

In Crunch, rather than having a “boolean” type with true/false/null values, we instead allow any categorical variable to participate in logic. Categories which have the selected: true flag are “selected”; categories which have the missing: true flag are “missing”; any categories that lack both are “other”. Any categorical variable that has at least one “selected” category can function as a “logical variable”.

How can you obtain a logical variable? A few ways:

  • You can define it yourself simply by setting the selected: true flag on one or more categories when your data is uploaded. You are probably already doing this for missing: true.
  • You can use the select_categories function to derive new Multiple Response variables.
  • Logic Functions like ==, >, in, or, and not all return logical variables.
  • You can use the “filter builder” in the web UI to quickly create derived variables using the logic functions.

Given a variable such as:

{
    "element": "shoji:entity",
    "self": "https://app.crunch.io/api/datasets/3ad42c/variables/000301/",
    "body": {
        "name": "Height",
        "alias": "height",
        "type": "numeric",
    }
}

POST’ing to the private variables catalog a Shoji Entity containing a ZCL function like:

{
    "element": "shoji:entity",
    "body": {
        "name": "Tall people",
        "description": "People who are taller than 200cm",
        "alias": "tall",
        "derivation": {
            "function": ">",
            "args": [
                {"variable": "https://app.crunch.io/api/datasets/3ad42c/variables/000301/"},
                {"value": 200}
            ]
        }
    }
}

…results in a private logical variable with three categories: “Selected”, “Other”, and “No Data”.

What can you do with logical variables? A few things:

  • You can use them as filters. Any rows where the logical variable is “selected” match the filter and are retained; any that are “other” or “missing” do not match and are filtered out.
  • You can reduce them to just three values with the as_selected function. This lets you:
    • Calculate proportions and other statistics accurately for Multiple Response analyses by properly representing and propagating missingness.
    • Present views where only the “selected” slice is shown.
    • Represent the variable in clients as if it were true/false/NA; the “rcrunch” library for R does this, transforming three-valued Crunch variables into R logicals and back.
  • You can use and and or to combine them.

Logical Negation 🔗

In addition to the typical not operation, Crunch provides a separate not_selected function. The difference is in how missing values are handled. For not, missing values are left missing:

A not A
S O
O S
M M

This is called the “relative complement” and is usually what you want when you invert a set of selected categories. For example, if you have a variable telling you who likes Brad Pitt, it probably has some “missing” entries, where it’s unknown whether the respondent likes Brad Pitt or not. You often want the inverse of “who does not like Brad Pitt” to not turn those “missing” entries into “selected” ones. It also means that A != B has the same output as not(A == B).

When filtering, however, you often want the “absolute complement”. If you define a filter of those who like Brad Pitt, but then want to exclude those folks, you’ll want the “selected” entries to invert to “other” so they are excluded. But you don’t want the “missing” entries to stay “missing” or they will also be excluded. In this case, you want to use not_selected instead:

A not_selected A
S O
O S
M S

Binary Logic 🔗

The and function behaves like most languages. Any “missing” value makes the output “missing”:

and S O M
S S O M
O O O M
M M M M

The or function behaves like SQL and R: any “selected” value wins, but any row not matched by another case where any of the input variables are missing stays missing.

or S O M
S S S S
O S O M
M S M M

Contrary to lots of other languages (like SQL and R), eq returns “selected” if the two values are the same code, even if missing. For example, when comparing whether variable A is equal to missing -8, we want a -8 value in A to evaluate to “selected”, not “missing”. But other missings should result in a missing result.

eq 1 2 99 -1
missing -1 M M M S
missing 99 M M S  
valid 2 O S  
valid 1 S  
ne 1 2 99 -1
missing -1 M M M O
missing 99 M M O  
valid 2 S O  
valid 1 O  

Text Analysis 🔗

Sentiment Analysis 🔗

The “sentiment” function allows you to derive a categorical variable from text variable data, which is classified and accumulated in three categories (positive, negative, and neutral). It takes one parameter:

  • A reference to a text variable

Given a variable such as:

{
    "element": "shoji:entity",
    "self": "https://app.crunch.io/api/datasets/3ad42c/variables/0000f5/",
    "body": {
        "name": "Zest",
        "alias": "zest",
        "type": "text",
        "values": [
            "Zest is best",
            "Zest I can take it or leave it",
            "Zest is the worst"
        ]
    }
}

POSTing to the private variables catalog a Shoji Entity containing a ZCL function like:

{
    "element": "shoji:entity",
    "body": {
        "name": "Zesty Sentiment",
        "description": "Customer sentiment about Zest",
        "alias": "zest_sentiment",
        "derivation": {
            "function": "sentiment",
            "args": [
                {"variable": "https://app.crunch.io/api/datasets/3ad42c/variables/0000f5/"}
            ]
        }
    }
}

Will result in a new categorical variable, where for each row the text variable is classified as “Negative”, “Neutral”, or “Positive” using the VADER English social-media-tuned lexicon.

Other transformations 🔗

Arithmetic operations 🔗

It is possible to create new numeric variables out of pairs of other numeric variables. The following arithmetic operations are available and will take two numeric variables as their arguments.

  • “+” for adding up two numeric variables.
  • “-” returns the difference between two numeric variables.
  • “*” for the product of two numeric variables.
  • “/” Real division.
  • “//” Floor division; Returns always an integer.
  • “^” Raises the first argument to the power of the second argument
  • “%” Modulo operation; Accepts floats

The usage is as follows for all operators:

{
    "function": "+",
    "args": [
        {"variable": "https://app.crunch.io/api/datasets/123/variables/abc/"}
        {"variable": "https://app.crunch.io/api/datasets/123/variables/def/"}
    ]
}

bin 🔗

Receives a numeric variable and returns a categorical one where each category represents a bin of the numeric values.

Each category on the new variable is annotated with a “boundaries” member that contains the lower/upper bound of each bin.

{
    "function": "bin",
    "args": [
        {"variable": "https://app.crunch.io/api/datasets/123/variables/abc/"}
    ]
}

Optionally it is possible to pass a second argument indicating the desired bin size to use instead of allowing the API to decide them.

{
    "function": "bin",
    "args": [
        {"variable": "https://app.crunch.io/api/datasets/123/variables/abc/"},
        {'value': 100}
    ]
}

case 🔗

Returns a categorical variable with its categories following the specified conditions from different variables on the dataset. See Case Statements.

cast 🔗

Returns a new variable with its type and values casted. Not applicable on arrays or date variable; use Date Functions to work with date variables.

{
    "function": "cast",
    "args": [
        {"variable": "https://app.crunch.io/api/datasets/123/variables/abc/"},
        {"value": "numeric"}
    ]
}

The allowed output variable types are:

  • numeric
  • text
  • categorical

For categorical types it is necessary to indicate the categories as a type definition instead of a string name:

To cast to categorical type, the second argument value should not be a name string (numeric, text) but a type definition indicating a class and categories as follow:

{
    "function": "cast",
    "args": [
        {"variable": "https://app.crunch.io/api/datasets/123/variables/abc/"},
        {"value": {
                "class": "categorical",
                "categories": [
                    {"id": 1, "name": "one", "missing": false, "numeric_value": null},
                    {"id": 2, "name": "two", "missing": false, "numeric_value": null},
                    {"id": -1, "name": "No Data", "missing": true, "numeric_value": null},
                ]
            }
        }
    ]
}

To change the type of a variable a client should POST to the /variable/:id/cast/ endpoint. See Convert type for API examples.

char_length 🔗

Returns a numeric variable containing the text length of each value. Only applicable on text variables.

{
    "function": "char_length",
    "args": [
        {"variable": "https://app.crunch.io/api/datasets/123/variables/abc/"}
    ]
}

copy_variable 🔗

Returns a shallow copy of the indicated variable maintaining type and data.

{
    "function": "copy_variable",
    "args": [
        {"variable": "https://app.crunch.io/api/datasets/123/variables/abc/"}
    ]
}

Changes on the data of the original variable will be reflected on this copy.

combine_categories 🔗

Returns a categorical variable with values combined following the specified combination rules. See Combining categories.

combine_responses 🔗

Given a list of categorical variables, return the selected value out of them. See Combining responses.

row 🔗

Returns a numeric variable with row 0 based indices. It takes no arguments.

{
    "function": "row",
    "args": []
}

remap_missing 🔗

Given a text, numeric or datetime variable. return a new variable of the same type with its missing values mapped to new codes

{
  "function": "remap_missing",
  "args": [
    {"variable": "varid"},
    {"value": [
        {"reason": "Combined 1 and 2", "code": 1, "mapped_codes": [1, 2]},
        {"reason": "Only 3", "code": 2, "mapped_codes": [3]},
        {"reason": "No Data", "code": -1, "mapped_codes": [-1]}
    ]}
  ]
}

The example above will return a copy of the variable with id varid with the new missing_reasons grouping and mapping following the original codes.

Integrating variables 🔗

“Integrating” a variable means to remove its derived properties and turn it into a regular base variable. Doing so will make this variable stop reflecting the expression if new data is added to its original parent variable and new rows will be filled with No Data {"?": -1}.

To integrate a variable it is necessary to PATCH to the variable entity with the derived attribute set to false as so:

PATCH /api/dataset/abc/variables/123/
{
  "element": "shoji:entity",
  "body": {
    "derived": false
  }
}

Will effectively integrate the variable and make its derivation attribute contain null from now in. Note that it is only possible to set the derived attribute to false and never to true.

Creating unlinked derivations 🔗

It is possible to create a material copy, or one off copy of a variable or an expression of it.

To create such variables, proceed normally creating a derived variable with the derivation expression, but also include derived: false attribute to it. So the variable will be created with the values of the expression but will be unlinked from the original variable.

POST /api/dataset/abc/variables/