There are several ways to build a Crunch dataset. The most appropriate method for you to create a dataset depends primarily on the format in which the data is currently stored.
In some cases, you already have a file sitting on your computer which has source data, in CSV or SPSS format (or a Zip file containing a single file in CSV or SPSS format). You can upload these to Crunch and then attach them to datasets by following these steps.
POST /datasets/ HTTP/1.1
Content-Type: application/shoji
Content-Length: 974
...
{
"element": "shoji:entity",
"body": {
"name": "my survey",
...
}
}
--------
201 Created
Location: /datasets/{dataset_id}/
ds <- newDatasetFromFile("my.csv", name="my survey")
# All three steps are handled within newDatasetFromFile
POST a Dataset Entity to the datasets catalog. See the documentation for POST /datasets/ for details on valid attributes to include in the POST.
POST /sources/ HTTP/1.1
Content-Length: 8874357
Content-Type: multipart/form-data; boundary=df5b17ff463a4cb3aa61cf02224c7303
--df5b17ff463a4cb3aa61cf02224c7303
Content-Disposition: form-data; name="uploaded_file"; filename="my.csv"
Content-Type: text/csv
"case_id","q1","q2"
234375,3,"sometimes"
234376,2,"always"
...
--------
201 Created
Location: /sources/{source_id}/
POST the file to the sources catalog.
Note that if the file is large (>100 megabytes), you should consider uploading it to a file-sharing service, like Dropbox.
To import from a URL (rather than a local file), use a JSON body with a
location
property giving the URL.
POST /sources/ HTTP/1.1
Content-Length: 71
Content-Type: application/json
{"location": "https://www.dropbox.com/s/znpoawnhg0rdzhw/iris.csv?dl=1"}
POST /datasets/{dataset_id}/batches/ HTTP/1.1
Content-Type: application/json
...
{
"element": "shoji:entity",
"body": {
"source": "/sources/{source_id}/"
}
}
--------
202 Accepted
Location: /datasets/{dataset_id}/batches/{batch_id}/
...
{
"element": "shoji:view",
"value": "/progress/{progress_id}/"
}
POST the URL of the just-created source entity (the Location in the 201 response from the previous step) to the batches catalog of the dataset entity created in step 1.
The POST to the batches catalog will return 202 (Accepted) status, and the response body contains a progress URL. Poll that URL to monitor the completion of the batch addition. See Progress for more. The 202 response will also contain a Location header with the URL of the newly created batch.
This approach may be most natural for importing data from databases that store data by rows. You can dump or export your database to Crunch’s JSON metadata format, plus a CSV of data, and upload those to Crunch, without requiring much back-and-forth with the API.
POST /datasets/ HTTP/1.1
Content-Type: application/shoji
Content-Length: 974
...
{
"element": "shoji:entity",
"body": {
"name": "my survey",
...,
"table": {
"element": "crunch:table",
"metadata": {
"educ": {"name": "Education", "alias": "educ", "type": "categorical", "categories": [...], ...},
"color": {"name": "Favorite color", "alias": "color", "type": "text", ...},
"state": {"name": "State", "alias": "state", "view": {"geodata": [{"geodatum": <uri>, "feature_key": "properties.postal-code"}]}}
},
"order": ["educ", {'my group": "color"}],
"hidden": [{"My system variables": ["state"]}]
},
}
}
--------
201 Created
Location: /datasets/{dataset_id}/
POST a Dataset Entity to the datasets catalog, and in the “body”, include a Crunch Table object with variable definitions and order.
The “metadata” member in the table is an object containing all variable definitions, keyed by variable alias. See the Object Reference: Variable Definitions discussion for specific requirements for defining variables of various types, as well as the example below.
The “order” member is a Shoji Order object specifying the order, potentially hierarchically nested, of the variables in the dataset. The example below illustrates how this can be used. Shoji is JSON, which means the “metadata” object is explicitly unordered. If you wish the variables to have an order, you must supply an order object rather than relying on any order of the “metadata” object.
Additionally, an optional “hidden” member is allowed, which receives a Shoji Order object just like the “order” member. This structure will be used to construct the Hidden folder subfolders and the variables present inside it will be considered hidden variables.
Any variable that has been defined with "discarded": true
will
automatically be appended to the Hidden folder order structure generated
from the “order” member. Defining a variable as "discarded": true
is
a legacy option that cannot be combined with the “hidden” member.
Following the Variable Folders rules, a variable cannot be in two folders simultaneously, so the server will raise validation errors if any variable is present in both the public order and the hidden order.
It is possible to create derived variables using any of the derivation functions available simultaneously in one request when creating the dataset along its metadata. The variable references inside the derivation expressions must point to declared aliases of variables or subvariables.
POST /datasets/ HTTP/1.1
Content-Type: application/shoji
Content-Length: 3294
...
{
"element": "shoji:entity",
"body": {
"name": "Dataset with derived arrays",
"settings": {
"viewers_can_export": true,
"viewers_can_change_weight": false,
"min_base_size": 3,
"weight": "weight_variable",
"dashboard_deck": null
},
"table": {
"metadata": {
"element": "crunch:table"
"weight_variable": {
"name": "weight variable",
"alias": "weight_variable",
"type": "numeric"
},
"combined": {
"name": "combined CA",
"derivation": {
"function": "combine_categories",
"args": [
{
"variable": "CA1"
},
{
"value": [
{
"combined_ids": [2],
"numeric_value": 2,
"missing": false,
"name": "even",
"id": 1
},
{
"combined_ids": [1],
"numeric_value": 1,
"missing": false,
"name": "odd",
"id": 2
}
]
}
]
}
},
"numeric": {
"name": "numeric variable",
"type": "numeric"
},
"numeric_copy": {
"name": "Copy of numeric",
"derivation": {
"function": "copy_variable",
"args": [{"variable": "numeric"}]
}
},
"MR1": {
"name": "multiple response",
"derivation": {
"function": "select_categories",
"args": [
{
"variable": "CA3"
},
{
"value": [
1
]
}
]
}
},
"CA3": {
"name": "cat array 3",
"derivation": {
"function": "array",
"args": [
{
"function": "select",
"args": [
{
"map": {
"var1": {
"variable": "ca2-subvar-2",
"references": {
"alias": "subvar2",
"name": "Subvar 2"
}
},
"var0": {
"variable": "ca1-subvar-1",
"references": {
"alias": "subvar1",
"name": "Subvar 1"
}
}
}
},
{
"value": ["var1", "var0"]
}
]
}
]
}
},
"CA2": {
"subvariables": [
{
"alias": "ca2-subvar-1",
"name": "ca2-subvar-1"
},
{
"alias": "ca2-subvar-2",
"name": "ca2-subvar-2"
}
],
"type": "categorical_array",
"name": "cat array 2",
"categories": [
{
"numeric_value": null,
"missing": false,
"id": 1,
"name": "yes"
},
{
"numeric_value": null,
"missing": false,
"id": 2,
"name": "no"
},
{
"numeric_value": null,
"missing": true,
"id": -1,
"name": "No Data"
}
]
},
"CA1": {
"subvariables": [
{
"alias": "ca1-subvar-1",
"name": "ca1-subvar-1"
},
{
"alias": "ca1-subvar-2",
"name": "ca1-subvar-2"
},
{
"alias": "ca1-subvar-3",
"name": "ca1-subvar-3"
}
],
"type": "categorical_array",
"name": "cat array 1",
"categories": [
{
"numeric_value": null,
"missing": false,
"id": 1,
"name": "yes"
},
{
"numeric_value": null,
"missing": false,
"id": 2,
"name": "no"
},
{
"numeric_value": null,
"missing": true,
"id": -1,
"name": "No Data"
}
]
}
}
}
}
}
--------
201 Created
Location: /datasets/{dataset_id}/
The example above does a number of things:
numeric
and arrays CA1
and CA2
.numeric
as numeric_copy
.CA3
reusing subvariables from CA1
and
CA2
.MR1
selecting category 1 from
categorical array CA3
.null
will receive the default value for that
attribute.order
for the dataset will be handled as if no order was
passed in.hidden
(list) value for the dataset will signify there are no
hidden variables.order
or hidden
)."discarded": true
is a legacy option that cannot
be combined with the hidden
order for the dataset.By file:
POST /datasets/{dataset_id}/batches/ HTTP/1.1
Content-Type: text/csv
Content-Length: 8874357
Content-Disposition: form-data; name="file"; filename="thedata.csv"
...
"educ","color"
3,"red"
2,"yellow"
...
--------
202 Accepted
Location: /datasets/{dataset_id}/batches/{batch_id}/
...
{
"element": "shoji:view",
"value": "/progress/{progress_id}/"
}
By S3 URL:
POST /datasets/{dataset_id}/batches/ HTTP/1.1
Content-Type: application/shoji
Content-Length: 341
...
{
"element": "shoji:entity",
"body": {
"url": "s3://bucket_name/dir/subdir/?accessKey=ASILC6CBA&secretKey=KdJy7ZRK8fDIBQ&token=AQoDYXdzECAa%3D%3D"
}
}
--------
202 Accepted
Location: /datasets/{dataset_id}/batches/{batch_id}/
...
{
"element": "shoji:view",
"value": "/progress/{progress_id}/"
}
POST a CSV file or URL to the new dataset’s batches catalog. The CSV must include a header row of variable identifiers, which should be the aliases of the variables (and array subvariables) defined in step (1).
The values in the CSV MUST be the same format as the values you get out of Crunch, and it must match the metadata specified in the previous step. This includes:
Violation of any of these validation criteria will result in a 409 (Conflict) response status. To resolve, you can either (1) fix your CSV locally and re-POST it, or (2) PATCH the variable metadata that is invalid and then re-POST the CSV.
Imports are done in “strict” mode by default. Strict imports are faster,
and using strict mode will alert you if there is any mismatch between
data and metadata. However, in some cases, it may be convenient to be
more flexible and silently ignore or resolve inconsistencies. For
example, you may have a large CSV dumped out of a database, and the data
format isn’t exactly Crunch’s format, but it would be costly to
read-munge-write the whole file for minor changes. In cases like this,
you may append ?strict=0
to the URL of the POST request to loosen
that strictness.
With non-strict imports:
The CSV can be sent in one of two ways:
A 201 response to the POST request indicates success. All rows added in a single request become part of a new Batch, whose URL is returned in the response Location. You may inspect the new rows in isolation by following its batch/ link.
Here’s an example dataset metadata and corresponding csv.
Several things to note:
"missing": "true"
Crunch stores data by column internally, so if your data are stored in a column-major format as well, importing by column may be the most efficient way to import data.
POST /datasets/ HTTP/1.1
Content-Type: application/shoji
Content-Length: 974
...
{
"element": "shoji:entity",
"body": {
"name": "my survey",
...
}
}
--------
201 Created
Location: /datasets/{dataset_id}/
ds <- createDataset("my suryey")
POST a Dataset Entity to the datasets catalog, just as in the first import method.
POST /datasets/{dataset_id}/variables/ HTTP/1.1
Content-Type: application/shoji
Content-Length: 38475
...
{
"element": "shoji:entity",
"body": {
"name": "Gender",
"alias": "gender",
"type": "categorical",
"categories": [
{
"name": "Male",
"id": 1,
"numeric_value": null,
"missing": false
},
{
"name": "Female",
"id": 2,
"numeric_value": null,
"missing": false
},
{
"name": "Skipped",
"id": 9,
"numeric_value": null,
"missing": true
}
],
"values": [1, 9, 1, 2, 2, 1, 1, 1, 1, 2, 9, 1]
}
}
--------
201 Created
Location: /datasets/{dataset_id}/variables/{variable_id}/
# Here's a similar example. R's factor type becomes "categorical".
gender.names <- c("Male", "Female", "Skipped")
gen <- factor(gender.names[c(1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 3, 1)],
levels=gender.names)
# Assigning an R vector into a dataset will create a variable entity.
ds$gender <- gen
POST a Variable Entity to the newly created dataset’s variables catalog, and include with that Entity definition a “values” key that contains the column of data. Do this for all columns in your dataset.
If the values
attribute is not present, the new column will be
filled with “No Data” in all rows.
The data passed in values
can correspond to either the full data
column for the new variable or a single value, in which case it will be
used to fill up the column.
In the case of arrays, the single value should be a list containing the correct categorical values.
If the type of the values passed in does not correspond with the variable’s type, the server will return a 400 response indicating the error and the variable will not be created.