The goal is to build a dataframe and data analysis library that integrates seamlessly with a library such as hmatrix, so that it can be used for machine learning. For now, this is primarily a learning project and should not be used in production.
Say we have a CSV file like this (data/example.csv):
name,salary,age
John,123123,56
Mary,56000,32
Erich,3200,29
Philipp,,27
>>> df <- readCsv "data/example.csv"
>>> df
name     salary  age
John     123123  56
Mary     56000   32
Erich    3200    29
Philipp          27
-- NOTE: The result type of `col` must ALWAYS be specified explicitly.
>>> col "name" df :: Series String
["John", "Mary", "Erich", "Philipp"]
>>> drop ["name", "salary"] df
age
56
32
29
27
Note that column operations such as `mean` and `median` only work when there are no missing values. hdax
will throw an error if you try to run any of them on a column with missing values.
-- Note: the `:: Double` annotation fixes the result type of `mean`, not the type of `col "age" df`.
>>> mean $ col "age" $ rows [0..2] df :: Double
39.0
>>> median $ col "salary" $ rows [0..2] df :: Double
56000.0
>>> df !> 0 -- or `row 0 df`
Record { salary: 123123, name: John, age: 56 }
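To make the missing-value rule concrete, here is a minimal sketch of a mean and a median that refuse to run on an incomplete column. It is not hdax's implementation: the `Column` type and the names `complete`, `meanOf` and `medianOf` are hypothetical, and the sketch returns `Left` where hdax throws an error.

```haskell
import Data.List (sort)

-- Hypothetical column type for this sketch: Nothing marks a missing cell.
type Column = [Maybe Double]

-- Unwrap every cell, or fail if at least one cell is missing.
complete :: Column -> Either String [Double]
complete = maybe (Left "column has missing values") Right . sequence

-- Arithmetic mean of a complete, non-empty column.
meanOf :: Column -> Either String Double
meanOf c = do
  xs <- complete c
  if null xs
    then Left "empty column"
    else Right (sum xs / fromIntegral (length xs))

-- Median of a complete, non-empty column (average of the two middle
-- values when the number of cells is even).
medianOf :: Column -> Either String Double
medianOf c = do
  xs <- sort <$> complete c
  let n = length xs
      mid = n `div` 2
  if n == 0
    then Left "empty column"
    else Right $ if even n
      then (xs !! (mid - 1) + xs !! mid) / 2
      else xs !! mid
```

With the first three ages from the example, `meanOf [Just 56, Just 32, Just 29]` evaluates to `Right 39.0`, while any column containing a `Nothing` yields the `Left` error.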
You can remove every row that has a missing value in a given column with the `dropna`
function:
>>> dropna "salary" df
name salary age
John 123123 56
Mary 56000 32
Erich 3200 29
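The idea behind dropping rows can be sketched in a few lines of plain Haskell. This is only an illustration under assumptions: the `Row` record and the `dropMissing` function are hypothetical and do not reflect how hdax stores a dataframe.

```haskell
import Data.Maybe (isJust)

-- Hypothetical row type for this sketch; Nothing marks a missing cell.
data Row = Row
  { name   :: String
  , salary :: Maybe Double
  , age    :: Maybe Double
  } deriving (Show, Eq)

-- Keep only the rows whose chosen field is present.
dropMissing :: (Row -> Maybe a) -> [Row] -> [Row]
dropMissing field = filter (isJust . field)
```

Passing the field accessor (`salary`, `age`) instead of a column name keeps the sketch type-safe; hdax itself selects the column by its string name.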
You can also fill every missing value in a column with some predetermined value:
>>> m = mean $ col "salary" df :: Double
>>> fillna "salary" m df
name salary age
John 123123 56
Mary 56000 32
Erich 3200 29
Philipp 45580.75 27
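Filling works the other way round: every missing cell is replaced by a default. The sketch below is an assumption-laden illustration, not hdax code. `fillMissing` and `meanPresent` are hypothetical names, and taking the mean of only the present cells is one possible way to choose the fill value, not necessarily the one hdax uses.

```haskell
import Data.Maybe (catMaybes, fromMaybe)

-- Mean of the cells that are present; Nothing if there are none.
meanPresent :: [Maybe Double] -> Maybe Double
meanPresent column = case catMaybes column of
  [] -> Nothing
  xs -> Just (sum xs / fromIntegral (length xs))

-- Replace every missing cell with the given value.
fillMissing :: Double -> [Maybe Double] -> [Double]
fillMissing v = map (fromMaybe v)
```

After filling, the column contains no `Nothing` values any more, which is why the result type is a plain `[Double]`.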
>>> bin "age" [(>30), (<30)] ["over30", "under30"] df
name     salary  age  over30  under30
John     123123  56   1.0     0.0
Mary     56000   32   1.0     0.0
Erich    3200    29   0.0     1.0
Philipp          27   0.0     1.0
>>> encode "name" df
name     salary  age  name_John  name_Mary  name_Erich  name_Philipp
John     123123  56   1.0        0.0        0.0         0.0
Mary     56000   32   0.0        1.0        0.0         0.0
Erich    3200    29   0.0        0.0        1.0         0.0
Philipp          27   0.0        0.0        0.0         1.0
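The encoding above is a one-hot encoding: one indicator column per distinct value. A minimal sketch of the idea, assuming a plain list as the column; `oneHot` is a hypothetical name and not hdax's `encode`:

```haskell
import Data.List (nub)

-- Return the distinct categories in order of first appearance, together
-- with one indicator row per input value.
oneHot :: Eq a => [a] -> ([a], [[Double]])
oneHot xs = (cats, [ [ if x == c then 1.0 else 0.0 | c <- cats ] | x <- xs ])
  where
    cats = nub xs
```

`nub` is quadratic in the number of values, which is fine for a small example; a larger column would probably call for a `Set` or `Map` from the containers package instead.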
-- This matrix can be fed directly to a machine learning model, for example.
>>> toHMatrix $ cols ["salary", "age"] df
(3><2)
[ 123123.0, 56.0
, 56000.0, 32.0
, 3200.0, 29.0 ]
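The matrix above has three rows although `df` has four, which suggests that rows with a missing value are left out before the conversion. A minimal sketch under that assumption; `completeRows` is a hypothetical helper and not part of hdax:

```haskell
import Data.Maybe (mapMaybe)

-- Keep only the rows in which every cell is present, and unwrap them.
completeRows :: [[Maybe Double]] -> [[Double]]
completeRows = mapMaybe sequence
```

The resulting list of rows could then be handed to hmatrix, for example with `fromLists` from `Numeric.LinearAlgebra`.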