Skip to content

Implement the rquery piped query algebra in R using data.table. Distributed under choice of GPL-2 or GPL-3 license.

License

Notifications You must be signed in to change notification settings

WinVector/rqdatatable

Repository files navigation

CRAN_Status_Badge status

rqdatatable is an implementation of the rquery piped Codd-style relational algebra hosted on data.table. rquery allow the expression of complex transformations as a series of relational operators and rqdatatable implements the operators using data.table.

A Python version of rquery/rqdatatable is under initial development as data_algebra.

For example scoring a logistic regression model (which requires grouping, ordering, and ranking) is organized as follows. For more on this example please see “Let’s Have Some Sympathy For The Part-time R User”.

library("rqdatatable")
## Loading required package: wrapr

## Loading required package: rquery
# data example
dL <- build_frame(
   "subjectID", "surveyCategory"     , "assessmentTotal" |
   1          , "withdrawal behavior", 5                 |
   1          , "positive re-framing", 2                 |
   2          , "withdrawal behavior", 3                 |
   2          , "positive re-framing", 4                 )
scale <- 0.237

# example rquery pipeline
rquery_pipeline <- local_td(dL) %.>%
  extend_nse(.,
             probability :=
               exp(assessmentTotal * scale))  %.>% 
  normalize_cols(.,
                 "probability",
                 partitionby = 'subjectID') %.>%
  pick_top_k(.,
             k = 1,
             partitionby = 'subjectID',
             orderby = c('probability', 'surveyCategory'),
             reverse = c('probability', 'surveyCategory')) %.>% 
  rename_columns(., c('diagnosis' = 'surveyCategory')) %.>%
  select_columns(., c('subjectID', 
                      'diagnosis', 
                      'probability')) %.>%
  orderby(., cols = 'subjectID')

We can show the expanded form of query tree.

cat(format(rquery_pipeline))
mk_td("dL", c(
  "subjectID",
  "surveyCategory",
  "assessmentTotal")) %.>%
 extend(.,
  probability := exp(assessmentTotal * 0.237)) %.>%
 extend(.,
  probability := probability / sum(probability),
  partitionby = c('subjectID'),
  orderby = c(),
  reverse = c()) %.>%
 extend(.,
  row_number := row_number(),
  partitionby = c('subjectID'),
  orderby = c('probability', 'surveyCategory'),
  reverse = c('probability', 'surveyCategory')) %.>%
 select_rows(.,
   row_number <= 1) %.>%
 rename_columns(.,
  c('diagnosis' = 'surveyCategory')) %.>%
 select_columns(., 
    c('subjectID', 'diagnosis', 'probability')) %.>%
 order_rows(.,
  c('subjectID'),
  reverse = c(),
  limit = NULL)

And execute it using data.table.

ex_data_table(rquery_pipeline)
##   subjectID           diagnosis probability
## 1         1 withdrawal behavior   0.6706221
## 2         2 positive re-framing   0.5589742

One can also apply the pipeline to new tables.

build_frame(
   "subjectID", "surveyCategory"     , "assessmentTotal" |
   7          , "withdrawal behavior", 5                 |
   7          , "positive re-framing", 20                ) %.>%
  rquery_pipeline
##   subjectID           diagnosis probability
## 1         7 positive re-framing   0.9722128

Initial bench-marking of rqdatatable is very favorable (notes here).

To install rqdatatable please use install.packages("rqdatatable").

Some related work includes:

Note rqdatatable has an “immediate mode” which allows direct application of pipelines stages without pre-assembling the pipeline. “Immediate mode” is a convenience for ad-hoc analyses, and has some negative performance impact, so we encourage users to build pipelines for most work. Some notes on the issue can be found here.

rqdatatable implements the rquery grammar in the style of a “Turing or Cook reduction” (implementing the result in terms of multiple oracle calls to the related system).

rqdatatable is intended for “simple column names”, in particular as rqdatatable often uses eval() to work over data.table escape characters such as “\” and “\\” are not reliable in column names. Also rqdatatable does not support tables with no columns.

About

Implement the rquery piped query algebra in R using data.table. Distributed under choice of GPL-2 or GPL-3 license.

Resources

License

Stars

Watchers

Forks

Packages

No packages published