Skip to content

Latest commit

 

History

History
648 lines (612 loc) · 20.8 KB

README.md

File metadata and controls

648 lines (612 loc) · 20.8 KB

API version 4 (or API_v4)

We will show the query API_v4 for Nanocubes using the crime50k example. In order to create such a nanocube and serve it via http, we run the following script

#
# (0) go the the $NANOCUBE/data folder
#
# (1) create the nanocube index file 'crime50k.nanocube'
#
nanocube create <(gunzip -c crime50k.csv.gz) crime50k.map crime50k.nanocube -header
#
# (2) serve the index via http on port 51234 using the alias 'crimes'
#
nanocube serve 51234 crimes=crime50k.nanocube

schema

To serve the schemas of all the Nanocube indices being served in port 51234 of localhost we send a GET request to:

http://localhost:51234/schema()

Note that in our case we only have one Nanocube index being served for the crime50k example. The result of the request above is a json array with one entry: the schema of the crimes index:

[
	{
		"type":"schema",
		"name":"crimes",
		"index_dimensions":[
			{
				"index":0,
				"name":"location",
				"bits_per_level":2,
				"num_levels":25,
				"hint":"spatial",
				"aliases":{
				}
			},
			{
				"index":1,
				"name":"type",
				"bits_per_level":8,
				"num_levels":1,
				"hint":"categorical",
				"aliases":{
					"11":"CRIMINAL TRESPASS",
					"8":"ROBBERY",
					"24":"NON-CRIMINAL",
					"27":"OBSCENITY",
					"30":"NON-CRIMINAL (SUBJECT SPECIFIED)",
					"3":"BATTERY",
					"26":"GAMBLING",
					"23":"INTIMIDATION",
					"2":"THEFT",
					"4":"BURGLARY",
					"17":"LIQUOR LAW VIOLATION",
					"29":"CONCEALED CARRY LICENSE VIOLATION",
					"21":"PROSTITUTION",
					"16":"CRIM SEXUAL ASSAULT",
					"12":"WEAPONS VIOLATION",
					"20":"SEX OFFENSE",
					"7":"DECEPTIVE PRACTICE",
					"10":"OTHER OFFENSE",
					"25":"PUBLIC INDECENCY",
					"14":"PUBLIC PEACE VIOLATION",
					"15":"INTERFERENCE WITH PUBLIC OFFICER",
					"22":"ARSON",
					"1":"CRIMINAL DAMAGE",
					"13":"KIDNAPPING",
					"5":"MOTOR VEHICLE THEFT",
					"0":"NARCOTICS",
					"18":"HOMICIDE",
					"9":"ASSAULT",
					"28":"OTHER NARCOTIC VIOLATION",
					"6":"OFFENSE INVOLVING CHILDREN",
					"19":"STALKING"
				}
			},
			{
				"index":2,
				"name":"time",
				"bits_per_level":1,
				"num_levels":16,
				"hint":"temporal|2013-12-01T06:00:00Z_3600s",
				"aliases":{
				}
			}
		],
		"measure_dimensions":[
			{
				"order":1,
				"name":"count",
				"type":"u32"
			}
		]
	}
]

total

In order to get the total number of crimes indexed in crimes we send the query q(crimes):

http://localhost:51234/q(crimes)

The total is 50k, since we indexed 50k records and the value measure we defined in crime50k.map is simply a count of records.

[
	{
		"type":"table",
		"numrows":1,
		"index_columns":[
		],
		"measure_columns":[
			{
				"name":"count",
				"values":[
					50000.000000
				]
			}
		]
	}
]

location

The quadtree convention we use is the following:

0 bottom-left  quadrant (in the first split of the mercator projection contains South-America)
1 bottom-right quadrant (in the first split of the mercator projection contains Africa and South Asia)
2 top-left     quadrant (in the first split of the mercator projection contains North America)
3 top-right    quadrant (in the first split of the mercator projection contains Europe and North Asia)

If we want to count the number of crimes in a 256x256 quadtree-aligned grid of the sub-cell (top-left, bottom-right, top-left) or (2,1,2), we run the query below. Note that the cell (2,1,2) contains Chicago.

http://localhost:51234/q(crimes.b('location',dive(p(2,1,2),8)))

The result is

[
	{
		"type":"table",
		"numrows":8,
		"index_columns":[
			{
				"name":"location",
				"hint":"none",
				"values_per_row":11,
				"values":[
					2,1,2,0,0,0,0,1,2,3,3,2,1,2,0,0,0,0,1,3,0,2,2,1,2,0,0,0,0,1,3,0,3,2,1,2,0,0,0,0,1,3,1,2,2,1,2,0,0,0,0,1,3,2,0,2,1,2,0,0,0,0,1,3,2,1,2,1,2,0,0,0,0,1,3,2,2,2,1,2,0,0,0,0,1,3,2,3
				]
			}
		],
		"measure_columns":[
			{
				"name":"count",
				"values":[
					272.000000,416.000000,12268.000000,224.000000,6838.000000,16913.000000,5460.000000,7609.000000
				]
			}
		]
	}
]

Note that the way the resulting grid cells are described are as paths. Since we subdivided the cell (2,1,2) at depth 3 more 8 times (the depth of the dive), we get that each grid cell is described by a path of 11 subdivisions. So the way to read the result above is to split location 'values' array at every 11-th entry. The final correspondence can also be better seen by generating the above resul in text format.

http://localhost:51234/format('text');q(crimes.b('location',dive(p(2,1,2),8)))

which yields the following text

                                          location                           count
                             2,1,2,0,0,0,0,1,2,3,3                      272.000000
                             2,1,2,0,0,0,0,1,3,0,2                      416.000000
                             2,1,2,0,0,0,0,1,3,0,3                    12268.000000
                             2,1,2,0,0,0,0,1,3,1,2                      224.000000
                             2,1,2,0,0,0,0,1,3,2,0                     6838.000000
                             2,1,2,0,0,0,0,1,3,2,1                    16913.000000
                             2,1,2,0,0,0,0,1,3,2,2                     5460.000000
                             2,1,2,0,0,0,0,1,3,2,3                     7609.000000

If instead of the path we just want the local (x,y) coordinate of the 256x256 grid, we can use the hint 'img8' as the last parameter of the binding .b(...).

http://localhost:51234/format('text');q(crimes.b('location',dive(p(2,1,2),8),'img8'))

We now get coordinates x and y both in {0,1,...,255}

                                          location                           count
                                              11,7                      272.000000
                                              12,5                      416.000000
                                              13,5                    12268.000000
                                              14,5                      224.000000
                                              12,6                     6838.000000
                                              13,6                    16913.000000
                                              12,7                     5460.000000
                                              13,7                     7609.000000

if we want the global coord of the grid we generated we can use the hint 'img11':

http://localhost:51234/format('text');q(crimes.b('location',dive(p(2,1,2),8),'img11'))

We now get coordinates x and y both in {0,1,...,2047}. Note that the y coordinate of the cells grow bottom-up, Chicago is above the Equator and {1285,1286,1287} are in the upper half of 2048 or 2^11.

                                          location                           count
                                          523,1287                      272.000000
                                          524,1285                      416.000000
                                          525,1285                    12268.000000
                                          526,1285                      224.000000
                                          524,1286                     6838.000000
                                          525,1286                    16913.000000
                                          524,1287                     5460.000000
                                          525,1287                     7609.000000

A path on a quadtree can also be specified using tile2d(z,x,y) or img2d(z,x,y). The path p(2,1,2) is equivalent to tile2d(3,2,5) where the y coordinate grows bottom-up, or img2d(3,2,2) where the y coordinate grows top down (like the standard slippy tiles on open street maps or the conventional pixel coordinates in an image).

Quadtree path entry convention:

    2 | 3
    -----
    0 | 1

So path(2,1,2) is equivalent to tile2d at z=3 (path length) with x and y
having three (ie. z)  binary digits each following the convention:

    x binary digit is 0 if path entry is 0 or 2
                   is 1 if path entry is 1 or 3

    y binary digit is 0 if path entry is 0 or 1
                   is 1 if path entry is 2 or 3

So, in this case the tile2d coordinates are:

    path(2,1,2) ==> 2 1 2
                 x: 0 1 0 = 010b = 2
                 y: 1 0 1 = 101b = 5

    path(2,1,2) = tile2d(3,2,5)

In general

    tile2d(z,x,y) = img2d(z,x,(2^z)-1-y)

So

    path(2,1,2) = img2d(3,2,2)

We can also use the options of tile2d_range(z,x0,y0,x1,y1) or img2d_range(z,x0,y0,x1,y1) to restrict query to a rectangular range of tiles.

type

If we want to count the number of crimes by type of crime, we run the query below and retrieve the categories either by number (category id if you will) or by category name.

# by number
http://localhost:51234/q(crimes.b('type',dive(1)))

yields the following json

[
	{
		"type":"table",
		"numrows":31,
		"index_columns":[
			{
				"name":"type",
				"hint":"none",
				"values_per_row":1,
				"values":[
					0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30
				]
			}
		],
		"measure_columns":[
			{
				"name":"count",
				"values":[
					5742.000000,4660.000000,11367.000000,8990.000000,2933.000000,2226.000000,456.000000,2190.000000,2132.000000,2629.000000,3278.000000,1429.000000,489.000000,46.000000,441.000000,229.000000,181.000000,69.000000,69.000000,20.000000,119.000000,211.000000,63.000000,21.000000,1.000000,2.000000,2.000000,1.000000,2.000000,1.000000,1.000000
				]
			}
		]
	}
]

using the hint 'name' we get path aliases (ie the category names)

# by name
http://localhost:51234/q(crimes.b('type',dive(1)),'name')

results in

[
	{
		"type":"table",
		"numrows":31,
		"index_columns":[
			{
				"name":"type",
				"hint":"name",
				"values_per_row":1,
				"values":[
					"NARCOTICS","CRIMINAL DAMAGE","THEFT","BATTERY","BURGLARY","MOTOR VEHICLE THEFT","OFFENSE INVOLVING CHILDREN","DECEPTIVE PRACTICE","ROBBERY","ASSAULT","OTHER OFFENSE","CRIMINAL TRESPASS","WEAPONS VIOLATION","KIDNAPPING","PUBLIC PEACE VIOLATION","INTERFERENCE WITH PUBLIC OFFICER","CRIM SEXUAL ASSAULT","LIQUOR LAW VIOLATION","HOMICIDE","STALKING","SEX OFFENSE","PROSTITUTION","ARSON","INTIMIDATION","NON-CRIMINAL","PUBLIC INDECENCY","GAMBLING","OBSCENITY","OTHER NARCOTIC VIOLATION","CONCEALED CARRY LICENSE VIOLATION","NON-CRIMINAL (SUBJECT SPECIFIED)"
				]
			}
		],
		"measure_columns":[
			{
				"name":"count",
				"values":[
					5742.000000,4660.000000,11367.000000,8990.000000,2933.000000,2226.000000,456.000000,2190.000000,2132.000000,2629.000000,3278.000000,1429.000000,489.000000,46.000000,441.000000,229.000000,181.000000,69.000000,69.000000,20.000000,119.000000,211.000000,63.000000,21.000000,1.000000,2.000000,2.000000,1.000000,2.000000,1.000000,1.000000
				]
			}
		]
	}
]

In text form:

# numerical
http://localhost:51234/format('text');q(crimes.b('type',dive(1)),'name')

results in

                                              type                           count
                                         NARCOTICS                     5742.000000
                                   CRIMINAL DAMAGE                     4660.000000
                                             THEFT                    11367.000000
                                           BATTERY                     8990.000000
                                          BURGLARY                     2933.000000
                               MOTOR VEHICLE THEFT                     2226.000000
                        OFFENSE INVOLVING CHILDREN                      456.000000
                                DECEPTIVE PRACTICE                     2190.000000
                                           ROBBERY                     2132.000000
                                           ASSAULT                     2629.000000
                                     OTHER OFFENSE                     3278.000000
                                 CRIMINAL TRESPASS                     1429.000000
                                 WEAPONS VIOLATION                      489.000000
                                        KIDNAPPING                       46.000000
                            PUBLIC PEACE VIOLATION                      441.000000
                  INTERFERENCE WITH PUBLIC OFFICER                      229.000000
                               CRIM SEXUAL ASSAULT                      181.000000
                              LIQUOR LAW VIOLATION                       69.000000
                                          HOMICIDE                       69.000000
                                          STALKING                       20.000000
                                       SEX OFFENSE                      119.000000
                                      PROSTITUTION                      211.000000
                                             ARSON                       63.000000
                                      INTIMIDATION                       21.000000
                                      NON-CRIMINAL                        1.000000
                                  PUBLIC INDECENCY                        2.000000
                                          GAMBLING                        2.000000
                                         OBSCENITY                        1.000000
                          OTHER NARCOTIC VIOLATION                        2.000000
                 CONCEALED CARRY LICENSE VIOLATION                        1.000000
                  NON-CRIMINAL (SUBJECT SPECIFIED)                        1.000000

If we want to simply get the number of crimes of a particular type, let's say THEFT we can can request it either by the theft corresponding number (a path of length 1 described in the schema or by name

#
# by number, since by the schema query above THEFT is an alias for the number 2
# (note that there can be aliases to any path in a hierarchy, in case of
# categorical dimension which are trees of height 2, a path is simply a number).
#
#        ...
#        "23":"INTIMIDATION",
#        "2":"THEFT",
#        "4":"BURGLARY",
#        ...
#
http://localhost:51234/q(crimes.b('type',p(2)))
#
# by name, when we find a string in the TARGET parameter of a binding we assume it
# is an alias and we process as if it was the path of that alias as in the query
# above.
#
http://localhost:51234/q(crimes.b('type','THEFT'))

both queries result in

[
	{
		"type":"table",
		"numrows":1,
		"index_columns":[
		],
		"measure_columns":[
			{
				"name":"count",
				"values":[
					11367.000000
				]
			}
		]
	}
]

In case we want to aggregate multiple types, we use 'pathagg'

#
# Usage pathagg: pathagg(PATH (,PATH)*)
#                pathagg(ALIAS(,ALIAS)*)
#
# by number, let's say we want THEFT and BURGLARY added up in a single value
#
#        ...
#        "23":"INTIMIDATION",
#        "2":"THEFT",
#        "4":"BURGLARY",
#        ...
#
http://localhost:51234/q(crimes.b('type',pathagg(p(2),p(4))))
#
# by name
#
http://localhost:51234/q(crimes.b('type',pathagg('THEFT','BURGLARY')))

both result in adding up what we saw in the drilldown query of crime type:

   THEFT                    11367.000000
BURGLARY                     2933.000000
----------------------------------------
   TOTAL                    14300.000000

Here is the .json result

[
	{
		"type":"table",
		"numrows":1,
		"index_columns":[
		],
		"measure_columns":[
			{
				"name":"count",
				"values":[
					14300.000000
				]
			}
		]
	}
]

time

Assuming a dimension of a nanocube is a binary tree, we can get aggregate values for a sequence of fixed width (in the finest resolution of the tree) using the 'intseq' target.

#
# Usage intseq: intseq(OFFSET,WIDTH,COUNT[,STRIDE])
#
# In the example of crime50k the 'time' dimension was specified in
# the .map file as
#
# index_dimension('time',                         # dimension name
#                 input('Date'),                  # .csv column wher the input date of the record is specified
#                 time(16,                        # binary tree with 16 levels (slots 0 to 65535)
#                    '2013-12-01T00:00:00-06:00', # offset 0 corresponds to 0 hour of Dec 1, 2013 at CST (Central Standard Time)
#                    3600,                        # finest time bin corresponds to 1 hour (or 3600 secs)
#                    6*60                         # add 360 minutes or 6 hoursto to all input values since input comes as
#                                                 # if it is UTC while it is actually CST Central Standard Time
#                   ));
#
# So the time bin semantic is the following: there are 16 levels in the binary
# tree which yields a correspondence between its leaves and the numbers {0,1,...,65535}.
# By the `index_dimension` spec above, each of these numbers correspond to one hour
# aggregate of date (ie. 3600 secs). Tying back these (fine time resolution)
# numbers to the calendar, we have the following correspondence:
#
#     time interval                                             finest bin
#                                                               or leaf number
#     [2013-12-01T00:00:00-06:00,2013-12-01T00:00:00-06:00)     0
#     [2013-12-01T00:00:01-06:00,2013-12-01T01:00:00-06:00)     1
#     [2013-12-01T00:00:02-06:00,2013-12-01T02:00:00-06:00)     2
#     [2013-12-01T00:00:03-06:00,2013-12-01T03:00:00-06:00)     3
#     ...
#     [2013-12-21T00:00:03-06:00,2013-12-01T00:00:00-06:00)     480 (is 20 days later)
#     ...
#
# if we want to query 10 days of daily aggregates starting at the 20th day from
# the 0 time bin, we would like to aggregate
#
#     [480 + 00 * 24, 480 + 01 * 24)
#     [480 + 01 * 24, 480 + 02 * 24)
#     ...
#     [480 + 09 * 24, 480 + 10 * 24)
#
# we achieve that by setting
#
#     OFFSET = 480
#     WIDTH  = 24
#     COUNT  = 10
#     STRIDE = 24  (we can ommit STRIDE when it is the same as the WIDTH)
#
# so the query we wan here can be one of the two below
#
http://localhost:51234/q(crimes.b('time',intseq(480,24,10,24)))
http://localhost:51234/q(crimes.b('time',intseq(480,24,10)))

The result of the 'intseq' query above is

[
	{
		"type":"table",
		"numrows":10,
		"index_columns":[
			{
				"name":"time",
				"hint":"none",
				"values_per_row":1,
				"values":[
					0,1,2,3,4,5,6,7,8,9
				]
			}
		],
		"measure_columns":[
			{
				"name":"count",
				"values":[
					762.000000,724.000000,660.000000,515.000000,410.000000,584.000000,713.000000,712.000000,704.000000,617.000000
				]
			}
		]
	}
]

While we can get to query calendar time series using 'intseq', there is a more convenient way to do it using 'timeseries':

#
# Usage timeseries: timeseries(BASE_DATE,WIDTH_IN_SECONDS,COUNT[,STRIDE_IN_SECONDS])
#
http://localhost:51234/q(crimes.b('time',timeseries('2013-12-21T00:00-06',24*3600,10,24*3600)))
http://localhost:51234/q(crimes.b('time',timeseries('2013-12-21T00:00-06',24*3600,10)))

The result is the same as the previous 'intseq' version:

[
	{
		"type":"table",
		"numrows":10,
		"index_columns":[
			{
				"name":"time",
				"hint":"none",
				"values_per_row":1,
				"values":[
					0,1,2,3,4,5,6,7,8,9
				]
			}
		],
		"measure_columns":[
			{
				"name":"count",
				"values":[
					762.000000,724.000000,660.000000,515.000000,410.000000,584.000000,713.000000,712.000000,704.000000,617.000000
				]
			}
		]
	}
]

Polygons

If we want to restrict spatial queries to shapes defined by lat/lon contours, we can use the regions and poly features enabled when we build nanocubes using --with-polycover. Note that we assume the mercator projection convention in all the geospatial interpretation here.

# query a polygon defined by four lat/lon pairs (latitude first, longitude second)
http://localhost:51234/q(crime.b('location',region(18,poly('41.8595,-87.6565,41.8969,-87.6565,41.8969,-87.6013,41.8595,-87.6013'))))

This yelds the following .json result:

[
	{
		"type":"table",
		"numrows":1,
		"index_columns":[
		],
		"measure_columns":[
			{
				"name":"count",
				"values":[
					3362.000000
				]
			}
		]
	}
]

Note that the region constraint expects a resolution parameter (ie 18) and that we can define a region by combining polys with algebraic operators: poly_complement, poly_diff, poly_union, poly_symdiff, poly_intersection.

# Region
# Region is a target based on lat/lon polygons. We define a region
# by feeding RESOLUTION and POLY to the 'region' function.
    region(RESOLUTION, POLY)
# A POLY primitive is defined by the poly function with a sequence
# of lat/lon pairs that are comma separated in a string:
    poly(LAT_LON_STRING)
# We can combine POLY objects using
    poly_complement(POLY)
    poly_diff(POLY, POLY)
    poly_union(POLY [, POLY]*)
    poly_symdiff(POLY [, POLY]*)
    poly_intersection(POLY [, POLY]*)

polygon