Advanced Commands¶
Note
On the command line some special characters may need to be escaped for the commands to function correctly, e.g. & as \&, ? as \? and = as \=.
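An alternative to escaping each character by hand is to quote the whole URL. The sketch below uses Python's standard shlex.quote to show what a safely quoted argument might look like; the query parameters in the URL are only illustrative.

```python
import shlex

# Illustrative URL containing the characters mentioned above (?, & and =).
url = "http://bamboo.io/datasets/8a3d74711475d8a51c84484fe73f24bd151242ea?format=csv&limit=10"

# shlex.quote wraps the string in single quotes when it contains characters
# that a POSIX shell would otherwise interpret.
quoted = shlex.quote(url)
print("curl " + quoted)
```

Pasting the printed line into a POSIX shell passes the URL to curl unchanged.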
Note
[SIC] applies to all spelling errors in the example dataset.
Creating a Dataset from a Schema¶
You can create a dataset from a schema. Your schema must be written in the Simple Data Format (SDF).
Below is an example SDF schema file:
{
"_id": {
"label": "_id",
"olap_type": "measure",
"simpletype": "float"
},
"_percentage_complete": {
"label": "_percentage_complete",
"olap_type": "measure",
"simpletype": "float"
},
"amount": {
"label": "amount",
"olap_type": "measure",
"simpletype": "float"
},
"comments": {
"label": "comments",
"olap_type": "dimension",
"simpletype": "string"
},
"food_photo": {
"label": "food_photo",
"olap_type": "dimension",
"simpletype": "string"
},
"food_type": {
"label": "food_type",
"olap_type": "dimension",
"simpletype": "string"
},
"gps": {
"cardinality": 18,
"label": "gps",
"olap_type": "dimension",
"simpletype": "string"
},
"gps_alt": {
"label": "gps_alt",
"olap_type": "measure",
"simpletype": "float"
},
"gps_latitude": {
"label": "gps_latitude",
"olap_type": "measure",
"simpletype": "float"
},
"gps_longitude": {
"label": "gps_longitude",
"olap_type": "measure",
"simpletype": "float"
},
"gps_precision": {
"label": "gps_precision",
"olap_type": "measure",
"simpletype": "float"
},
"location_name": {
"label": "location_name",
"olap_type": "dimension",
"simpletype": "string"
},
"rating": {
"label": "rating",
"olap_type": "dimension",
"simpletype": "string"
},
"risk_factor": {
"label": "risk_factor",
"olap_type": "dimension",
"simpletype": "string"
},
"submit_date": {
"label": "submit_date",
"olap_type": "measure",
"simpletype": "datetime"
}
}
Supposing this file is saved locally as /home/modilabs/good_eats.schema.json, you can create a dataset from this schema using:
curl -X POST -F schema=@/home/modilabs/good_eats.schema.json http://bamboo.io/datasets
returns:
{
"id": "8a3d74711475d8a51c84484fe73f24bd151242ea"
}
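Before uploading, it may help to check a schema file locally. The following is a minimal sketch, assuming that every column needs the label, olap_type and simpletype keys seen in the example and that olap_type is either measure or dimension; these rules are inferred from the example only, and check_schema is a hypothetical helper, not part of bamboo.

```python
import json

# Keys and values inferred from the example schema above; the full SDF
# specification may allow more (this is an assumption of the sketch).
REQUIRED_KEYS = {"label", "olap_type", "simpletype"}
OLAP_TYPES = {"measure", "dimension"}

def check_schema(schema):
    """Return a list of human-readable problems found in an SDF-style dict."""
    problems = []
    for column, spec in schema.items():
        missing = REQUIRED_KEYS - set(spec)
        if missing:
            problems.append(f"{column}: missing {sorted(missing)}")
        if spec.get("olap_type") not in OLAP_TYPES:
            problems.append(f"{column}: unexpected olap_type {spec.get('olap_type')!r}")
    return problems

example = json.loads("""
{
  "amount": {"label": "amount", "olap_type": "measure", "simpletype": "float"},
  "rating": {"label": "rating", "olap_type": "dimension", "simpletype": "string"}
}
""")
print(check_schema(example))  # an empty list means no problems were found
```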
Create a Dataset with data from a Schema¶
To create a dataset from a schema and upload a CSV file for it, use the following command:
curl -X POST -F schema=@/home/modilabs/good_eats.schema.json -F csv_file=@/home/modilabs/good_eats.csv http://bamboo.io/datasets
And similarly for a JSON file:
curl -X POST -F schema=@/home/modilabs/good_eats.schema.json -F json_file=@/home/modilabs/good_eats.json http://bamboo.io/datasets
returns:
{
"id": "8a3d74711475d8a51c84484fe73f24bd151242ea"
}
Creating Multiple Calculations via JSON¶
You can create multiple calculations by uploading a properly formatted JSON file to the create calculations endpoint.
Below is an example calculations JSON file:
[
{
"name": "sweet foods",
"formula": "food_type in [\"deserts\", \"streat_sweets\"]"
},
{
"name": "in northern hemisphere",
"formula": "gps_latitude > 0"
},
{
"name": "sum of amount",
"formula": "sum(amount)",
"groups": "risk_factor"
},
{
"name": "mean of amount",
"formula": "mean(amount)",
"groups": "risk_factor,food_type"
},
{
"name": "mean of amount",
"formula": "mean(amount)",
"groups": ["risk_factor", "food_type"]
}
]
Supposing this file is saved locally as /home/modilabs/good_eats.calculations.json, you can create calculations from this JSON file using:
curl -X POST -F json_file=@/home/modilabs/good_eats.calculations.json http://bamboo.io/calculations/8a3d74711475d8a51c84484fe73f24bd151242ea
returns:
{
"success": "created calculations from JSON",
"id": "8a3d74711475d8a51c84484fe73f24bd151242ea"
}
Note
The file can also contain a single dictionary, for example:
{
"name": "in northern hemisphere",
"formula": "gps_latitude > 0"
}
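Writing the escaped quotes in a formula by hand is error-prone. As a sketch, the calculations file could instead be generated with Python's json module, which handles the escaping; the file name mentioned in the comment is the one used above, and the list is a shortened copy of the example.

```python
import json

# The same formula as in the example above; json.dumps takes care of
# escaping the inner double quotes.
calculations = [
    {"name": "sweet foods", "formula": 'food_type in ["deserts", "streat_sweets"]'},
    {"name": "sum of amount", "formula": "sum(amount)", "groups": "risk_factor"},
]

text = json.dumps(calculations, indent=2)
print(text)
# To save it, write `text` to a file such as good_eats.calculations.json.
```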
Replicating a Calculation Across Groups¶
To create the same calculation with multiple groups you may include a groups key in your JSON file with the value as a list of groups to create the calculation for.
The groups key can also be set to a string, in which case bamboo will create a single calculation with the value as the group.
For example, below the third calculation is created with the group risk_factor, and the fourth with the single multi-group risk_factor and food_type. The final calculation is created twice, once for the group risk_factor, and once for the group food_type.
[
{
"name": "sweet foods",
"formula": "food_type in [\"deserts\", \"streat_sweets\"]"
},
{
"name": "in northern hemisphere",
"formula": "gps_latitude > 0"
},
{
"name": "sum of amount",
"formula": "sum(amount)",
"groups": "risk_factor"
},
{
"name": "mean of amount",
"formula": "mean(amount)",
"groups": "risk_factor,food_type"
},
{
"name": "mean of amount",
"formula": "mean(amount)",
"groups": ["risk_factor", "food_type"]
}
]
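The difference between a string and a list under the groups key can be illustrated locally. The sketch below only mirrors the behaviour described in this section; expand_groups is a hypothetical helper and is not bamboo's implementation.

```python
def expand_groups(calculation):
    """Return one (name, formula, group) tuple per calculation that would be created.

    A list under "groups" yields one entry per element, a string yields a
    single entry, and a missing key yields a single ungrouped entry.
    """
    groups = calculation.get("groups")
    if isinstance(groups, list):
        group_values = groups
    else:
        group_values = [groups]  # a string, or None when no group is given
    return [(calculation["name"], calculation["formula"], g) for g in group_values]

single = {"name": "mean of amount", "formula": "mean(amount)",
          "groups": "risk_factor,food_type"}
listed = {"name": "mean of amount", "formula": "mean(amount)",
          "groups": ["risk_factor", "food_type"]}

print(expand_groups(single))  # one calculation for the multi-group
print(expand_groups(listed))  # one calculation per listed group
```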
Create a “perishable” dataset¶
Suppose you want to upload a dataset, perform some calculations, and use the results, but you do not want your data to be permanently stored. To accommodate this use case, dataset creation supports an optional perish parameter. The perish parameter is an integer which specifies the number of seconds after which to delete the dataset. For example, to create a dataset that will be deleted after one day:
curl -X POST -d "url=http://formhub.org/mberg/forms/good_eats/data.csv&perish=86400" http://bamboo.io/datasets
returns:
{
"id": "8a3d74711475d8a51c84484fe73f24bd151242ea"
}
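The perish value is easy to get wrong when converting from days or hours. A small sketch of how the form body could be built in Python follows; it percent-encodes the source URL, so the text differs from the curl example but should decode to the same two fields.

```python
from datetime import timedelta
from urllib.parse import urlencode

# One day expressed in whole seconds, as the perish parameter expects.
perish_seconds = int(timedelta(days=1).total_seconds())

# urlencode percent-encodes the source URL so it is safe as a form value.
body = urlencode({
    "url": "http://formhub.org/mberg/forms/good_eats/data.csv",
    "perish": perish_seconds,
})
print(body)
```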
Additional dataset query parameters¶
Dataset queries may take the following optional parameters:
select
: This is a required argument, it can be ‘all’ or a MongoDB JSON query.
distinct
: A field to return distinct results for.
query
: If passed restrict results to rows matching this query.
limit
: If passed limit the rows to this number.
order_by
: If passed order the result using this column.
format
: Format of output data, ‘json’ or ‘csv’.
callback
: A JSONP callback function to wrap the result in.
index
: Boolean, if true also return the index. Default false.
count
: Boolean, if true returns only the number of results. Default false.
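Several of these parameters can be combined in one request. The sketch below builds such a URL with Python's urllib; the query value (food_type equal to "lunch") is a made-up example, and passing the query as a JSON string is an assumption based on the parameter descriptions above.

```python
import json
from urllib.parse import urlencode

DATASET_URL = "http://bamboo.io/datasets/8a3d74711475d8a51c84484fe73f24bd151242ea"

params = {
    # Assumed example values; adjust to the columns of your own dataset.
    "query": json.dumps({"food_type": "lunch"}),
    "limit": 10,
    "order_by": "-amount",
    "format": "csv",
}

# urlencode escapes the JSON braces and quotes so they survive in a URL.
url = DATASET_URL + "?" + urlencode(params)
print(url)
```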
Export data using the format parameter¶
You may use the format parameter as an alternative to file extension notation.
To export the data as CSV:
curl http://bamboo.io/datasets/8a3d74711475d8a51c84484fe73f24bd151242ea?format=CSV
To export the data as JSON:
curl http://bamboo.io/datasets/8a3d74711475d8a51c84484fe73f24bd151242ea?format=JSON
Ordering the results of a query¶
The order_by parameter sorts the resulting rows according to a column value and a sign indicating whether to order ascending (default) or descending (the - sign).
For example:
curl http://bamboo.io/datasets/8a3d74711475d8a51c84484fe73f24bd151242ea?order_by=amount
curl http://bamboo.io/datasets/8a3d74711475d8a51c84484fe73f24bd151242ea?order_by=-amount
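To make the sign convention concrete, here is a local sketch that sorts a few made-up rows the way the order_by value is described. It illustrates the convention only; sort_rows is a hypothetical helper and says nothing about how bamboo sorts internally.

```python
def sort_rows(rows, order_by):
    """Sort a list of dict rows following the order_by sign convention."""
    descending = order_by.startswith("-")
    column = order_by[1:] if descending else order_by
    return sorted(rows, key=lambda row: row[column], reverse=descending)

rows = [{"amount": 12.0}, {"amount": 4.5}, {"amount": 45.0}]  # made-up values
print(sort_rows(rows, "amount"))   # ascending
print(sort_rows(rows, "-amount"))  # descending
```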
Using a JSONP callback¶
If you would like to retrieve your dataset from bamboo.io and process the result using JavaScript, you can define a JavaScript callback to do so.
Suppose you have defined the JavaScript callback function handleBambooDataset(json) to process the JSON dataset.
You can pass this to bamboo using the callback parameter as follows:
curl http://bamboo.io/datasets/8a3d74711475d8a51c84484fe73f24bd151242ea?callback=handleBambooDataset
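JSONP responses conventionally wrap the JSON payload in a call to the named function. The exact text bamboo returns is not shown in this document, so the sketch below assumes the plain callback(payload) form and uses made-up rows; wrap_jsonp and unwrap_jsonp are hypothetical helpers.

```python
import json

def wrap_jsonp(callback, payload):
    """Wrap a JSON-serialisable payload in a JSONP-style function call."""
    return f"{callback}({json.dumps(payload)})"

def unwrap_jsonp(text, callback):
    """Recover the payload from text of the form callback(<json>)."""
    prefix = callback + "("
    if not (text.startswith(prefix) and text.endswith(")")):
        raise ValueError("not a JSONP response for " + callback)
    return json.loads(text[len(prefix):-1])

rows = [{"amount": 12.0, "rating": "good"}]  # made-up sample rows
wrapped = wrap_jsonp("handleBambooDataset", rows)
print(wrapped)
```

A browser that loads such a response through a script tag would call handleBambooDataset with the dataset as its argument.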
Updating Dataset Metadata¶
The following metadata fields can be added to a dataset:
attribution
: attribution and origin of this dataset.
description
: a text description of the dataset.
label
: a label or name for this dataset.
license
: the license this dataset is under.
Each field name is followed by its suggested use.
To set the metadata on a dataset, make a PUT request to info. This will also return the updated dataset info.
curl -X PUT -d "description=Good%20eats%20description&license=public&attribution=mlberg&label=goodeats" http://bamboo.io/datasets/8a3d74711475d8a51c84484fe73f24bd151242ea/info
returns:
{
"attribution": "mlberg",
"description": "Good eats description",
"id": "8a3d74711475d8a51c84484fe73f24bd151242ea",
"label": "goodeats",
"license": "public",
"schema": {
"amount": {
"label": "Amount",
"olap_type": "measure",
"simpletype": "float"
},
"rating": {
"label": "Rating",
"olap_type": "dimension",
"simpletype": "string",
"cardinality": 2
},
"food_type": {
"label": "Food Type",
"olap_type": "dimension",
"simpletype": "string",
"cardinality": 8
},
...
},
"created_at": "2012-6-18 14:43:32",
"updated_at": "2012-6-18 14:43:32",
"num_rows": "500",
"num_columns": "30",
"state": "ready"
}
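The request body in the command above is ordinary form encoding. As a sketch, it can be produced from a dictionary with Python's urllib, which avoids hand-encoding spaces as %20.

```python
from urllib.parse import quote, urlencode

metadata = {
    "description": "Good eats description",
    "license": "public",
    "attribution": "mlberg",
    "label": "goodeats",
}

# quote_via=quote encodes spaces as %20 (the default, quote_plus, uses "+").
body = urlencode(metadata, quote_via=quote)
print(body)
```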
Setting the OLAP Type¶
By default columns with numeric data are treated as measures and return summary statistics. However, if your numeric data represents identifiers it may be semantically dimensional. You can set the OLAP type to treat numeric data dimensionally.
Any numeric column can be converted to a dimension and back. Dimension columns cannot be modified. For example, to convert the numeric column amount to a dimension:
curl -X PUT -d "column=amount&olap_type=dimension" http://bamboo.io/datasets/8a3d74711475d8a51c84484fe73f24bd151242ea/set_olap_type
returns:
{"success": "set OLAP Type for column 'amount' to 'dimension'"}
Timeseries operations on a dataset¶
Resampling a dataset¶
If your dataset contains any date columns, you can resample numeric columns in your data based on any of these date columns.
Any options that can be passed to the pandas resample function (pandas docs) can be passed as parameters to bamboo.
The parameters are:
date_column
: The date column to resample on.
interval
: A code for the interval to use, any pandas codes are accepted, e.g. ‘D’ for daily, ‘W’ for weekly, ‘M’ for monthly.
how
: (Optional) How to calculate the grouped samples. The default is ‘mean’.
query
: (Optional) A MongoDB query to restrict the dataset, only data matching the query will be resampled.
format
: (Optional) Set to ‘csv’ to return a CSV of the resampled data.
For example, to resample a dataset at monthly intervals by mean use the following command:
curl http://bamboo.io/datasets/8a3d74711475d8a51c84484fe73f24bd151242ea/resample?date_column=submit_date&interval=M&how=mean
returns:
[
{
'_percentage_complete': 'null',
'submit_date': {'$date': 1325289600000},
'gps_alt': 36.78334554038333,
'amount': 7.107142857142857,
'gps_latitude': 40.622100085116664,
'_id': 358490042584000.0,
'gps_precision': 43.333333333333336,
'gps_longitude': 29.94870928221667
},
{
'_percentage_complete': 'null',
'submit_date': {'$date': 1327968000000},
'gps_alt': 168.11249542212502,
'amount': 233.5625,
'gps_latitude': 29.618330361937502,
'_id': 358490042584000.0,
'gps_precision': 47.0,
'gps_longitude': 18.21789164405
},
{
'_percentage_complete': 'null',
'submit_date': {'$date': 1330473600000},
'gps_alt': 0.0,
'amount': 45.0,
'gps_latitude': 40.8076961,
'_id': 358490042584000.0,
'gps_precision': 588.0,
'gps_longitude': -73.95805440000001
},
{
'_percentage_complete': 'null',
'submit_date': {'$date': 1333152000000},
'gps_alt': 0.0,
'amount': 12.0,
'gps_latitude': 49.26994343,
'_id': 358490042584000.0,
'gps_precision': 56.0,
'gps_longitude': -123.15297941
},
{
'_percentage_complete': 'null',
'submit_date': {'$date': 1335744000000},
'gps_alt': 281.30000305175,
'amount': 16.125,
'gps_latitude': 26.66630736555,
'_id': 358490042584000.0,
'gps_precision': 22.5,
'gps_longitude': 1.530683878799998
}
]
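The submit_date values in this response use the MongoDB extended JSON convention, where $date holds milliseconds since the Unix epoch. A minimal sketch for turning them into Python datetimes, assuming the values are in UTC (parse_mongo_date is a hypothetical helper):

```python
from datetime import datetime, timezone

def parse_mongo_date(value):
    """Convert a {'$date': <milliseconds since the Unix epoch>} dict to a datetime."""
    return datetime.fromtimestamp(value["$date"] / 1000, tz=timezone.utc)

# The first two submit_date values from the response above.
for raw in ({"$date": 1325289600000}, {"$date": 1327968000000}):
    print(parse_mongo_date(raw).date().isoformat())
```

Under that assumption the first two values fall on 2011-12-31 and 2012-01-31, which is consistent with monthly bins labelled by month end.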
Calculating rolling statistics¶
To compute moving or rolling statistics / moments you can use the rolling request.
Any options that can be passed to the pandas rolling_window function (pandas docs) can be passed as parameters to bamboo.
Window types are passed as the win_type parameter. See here for window type definitions and examples.
For example, to calculate a rolling mean with a window of 3 values, use the following command:
curl http://bamboo.io/datasets/8a3d74711475d8a51c84484fe73f24bd151242ea/rolling?win_type=boxcar&window=3
returns:
[
{
'gps_longitude': 'null',
'_percentage_complete': 'null',
'gps_latitude': 'null',
'amount': 'null',
'gps_alt': 'null',
'_id': 'null',
'gps_precision': 'null'
},
{
'gps_longitude': 'null',
'_percentage_complete': 'null',
'gps_latitude': 'null',
'amount': 'null',
'gps_alt': 'null',
'_id': 'null',
'gps_precision': 'null'
},
{
'gps_longitude': 28.97413979283333,
'_percentage_complete': 'null',
'gps_latitude': 41.018141275299996,
'amount': 4.583333333333333,
'gps_alt': 45.60001627606666,
'_id': 'null',
'gps_precision': 41.666666666666664
},
...
]
Note
The first window - 1 rows will be null, because not enough data will have been seen to calculate rolling statistics for those rows.
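The leading nulls can be reproduced with a few lines of plain Python. This is a simplified sketch of an equal-weight (boxcar) rolling mean on made-up values; it ignores missing data and the other options pandas supports, and rolling_mean is a hypothetical helper rather than the function bamboo calls.

```python
def rolling_mean(values, window):
    """Equal-weight rolling mean; positions without a full window are None."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)  # fewer than `window` values seen so far
        else:
            chunk = values[i + 1 - window:i + 1]
            result.append(sum(chunk) / window)
    return result

print(rolling_mean([1.0, 2.0, 3.0, 4.0, 5.0], 3))
# [None, None, 2.0, 3.0, 4.0]
```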