From 065dd49e21cdc81f535d50bb57445b118cb3abfb Mon Sep 17 00:00:00 2001
From: bmcabrera <124897769+bmcabrera@users.noreply.github.com>
Date: Thu, 28 Sep 2023 11:04:22 -0400
Subject: [PATCH] first updates (#25)

---
 learnEcl/1400-filter.md    | 182 +++++++++++++++++++
 learnEcl/2200-sample.md    |  66 +++++++
 learnEcl/2500-transform.md | 193 +++++++++++++++++++++
 learnEcl/2600-project.md   |  96 ++++++++++
 learnEcl/2700-function.md  | 144 +++++++++++++++
 learnEcl/2800-module.md    | 117 +++++++++++++
 learnEcl/3000-join.md      | 346 +++++++++++++++++++++++++++++++++++++
 learnEcl/3100-table.md     | 233 +++++++++++++++++++++++++
 learnEcl/3200-normalize.md | 211 ++++++++++++++++++++++
 9 files changed, 1588 insertions(+)

diff --git a/learnEcl/1400-filter.md b/learnEcl/1400-filter.md
index 0168777..5128918 100644
--- a/learnEcl/1400-filter.md
+++ b/learnEcl/1400-filter.md
@@ -2,3 +2,185 @@
 title: Filter
 slug: filter
 ---

# FILTER

Data filtering is the process of choosing a smaller part of your data set and using that subset for further processing. It's recommended to filter down to the desired records before doing any further processing. When filtering on STRING values, keep in mind that comparisons are case sensitive; for example, Sun, sun, and SUN are not the same value.

## SQL vs. ECL

Filtering in ECL is similar to SQL's SELECT with a WHERE clause. In ECL, the filter conditions appear in parentheses ( ) immediately after the dataset name.
+
+
+
+ +## Syntax + +
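As a sketch of the difference (the dataset and field names here are assumed for illustration):

```ecl
// SQL:  SELECT * FROM Persons WHERE isEmployed = TRUE;
// ECL:  the filter goes in parentheses right after the dataset name
EmployedPersons := PersonDS(isEmployed = TRUE);   // PersonDS is an assumed dataset name
```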
+
+
+
+ +| _Value_ | _Definition_ | +| :- | :- | +| attr_name | The name by which the function will be invoked. | +| dataset_name | The dataset to perform action on. | +| Filtering condition(s) | Field or fields and required filtering conditions. Logical operators can be used to execute multiple filters. | + +### Demo Dataset + +| PersonID | FirstName | LastName | isEmployed | avgHouseIncome | +| :- | :- | :- | :- | :- | +| 102 | Fred | Smith | FALSE | 0 | +| 012 | Joe | Blow | TRUE | 11250 | +| 085 | Blue | Moon | TRUE | 185000 | +| 055 | Silver | Jo | FALSE | 5000 | +| 265 | Darling | Jo | TRUE | 5000 | +| 333 | Jane | Smith | FALSE | 50000 | + +**Example** + +
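Using the placeholder names from the parameter table, the general form is:

```ecl
attr_name := dataset_name( filtering_condition );

// multiple comma-separated conditions are implicitly ANDed together:
attr_name := dataset_name( condition1, condition2 );
```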
+
+
+ +## Logical Operators + +| Operator | Description | +| :- | :- | +| = | Equal | +| > | Greater than | +| < | Something | +| >= | Greater than or equal | +| <= | Less than or equal | +| <> | Not equal | +| != | Not equal | +| AND | Logical AND | +| OR | Logical OR | +| IN | To specify multiple possible values for a field/column. | +| NOT IN | To specify multiple possible values that are not in a field/column. | +| BETWEEN | Between a certain range. | + +**Example** + +
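A sketch using the demo dataset (the layout and dataset names are assumed):

```ecl
PersonRec := RECORD
    STRING3  PersonID;
    STRING10 FirstName;
    STRING10 LastName;
    BOOLEAN  isEmployed;
    UNSIGNED avgHouseIncome;
END;

PersonDS := DATASET([{'102','Fred','Smith',FALSE,0},
                     {'012','Joe','Blow',TRUE,11250},
                     {'085','Blue','Moon',TRUE,185000},
                     {'055','Silver','Jo',FALSE,5000},
                     {'265','Darling','Jo',TRUE,5000},
                     {'333','Jane','Smith',FALSE,50000}], PersonRec);

// keep only the employed people: Joe, Blue, and Darling
EmployedDS := PersonDS(isEmployed = TRUE);
OUTPUT(EmployedDS);
```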
+
+
+ +
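A sketch combining comparison and logical operators, assuming the demo dataset has been declared as an ECL dataset named PersonDS (name assumed):

```ecl
// employed people whose average household income is at least 5,000
// but less than 100,000
MidIncomeDS := PersonDS(isEmployed = TRUE AND
                        avgHouseIncome >= 5000 AND
                        avgHouseIncome < 100000);
OUTPUT(MidIncomeDS);
```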
+
+
\ No newline at end of file diff --git a/learnEcl/2200-sample.md b/learnEcl/2200-sample.md index d86ba47..5404f19 100644 --- a/learnEcl/2200-sample.md +++ b/learnEcl/2200-sample.md @@ -2,3 +2,69 @@ title: Sample slug: sample --- + +# SAMPLE + +SAMPLE function returns a sample set of dataset. Returned value is a dataset. + +## Syntax + +
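IN, NOT IN, and BETWEEN take a set or an inclusive range (again assuming the demo dataset is declared as PersonDS):

```ecl
// last name is one of a set of values
SmithJoDS := PersonDS(LastName IN ['Smith', 'Jo']);

// income within an inclusive range
RangeDS := PersonDS(avgHouseIncome BETWEEN 5000 AND 50000);

OUTPUT(SmithJoDS);
OUTPUT(RangeDS);
```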
+
+
+
+ +| _Value_ | _Definition_ | +| :- | :- | +| dataset | Input dataset to process. | +| interval | The intervals between records to return. | +| which | Optional. An integer specifying the ordinal number of the sample set to return. This is used to obtain multiple non-overlapping samples from the same recordset. | + +### Demo Dataset +| Color | ID | +| :- | :- | +| Red | 100 | +| Blue | 102 | +| Black | 103 | +| Yellow | 104 | +| Orange | 105 | +| White | 106 | +| Green | 107 | +| Purple | 108 | + +**Example** + +
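Using the names from the parameter table, the general form is:

```ecl
attr_name := SAMPLE( dataset, interval [, which ] );
```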
+
+
\ No newline at end of file diff --git a/learnEcl/2500-transform.md b/learnEcl/2500-transform.md index 1bcf25a..b2233e9 100644 --- a/learnEcl/2500-transform.md +++ b/learnEcl/2500-transform.md @@ -2,3 +2,196 @@ title: Transform slug: transform --- + +# TRANSFORM + +TRANSFORM function, defines specific operations that will be performed on every field in result dataset. TRANSFORM functions starts from row one and covers the entire dataset row by row. When defining a transform you need to tell the function what it needs to be done on each field in the result dataset by using input datasets fields or creating new definitions for the fields. +Transform can be used with PROJECT, JOIN, ITERATE, ROLLUP and more. + +## Syntax + +
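A sketch using the demo dataset (layout and attribute names assumed):

```ecl
ColorRec := RECORD
    STRING6  Color;
    UNSIGNED ID;
END;

ColorDS := DATASET([{'Red',100},{'Blue',102},{'Black',103},{'Yellow',104},
                    {'Orange',105},{'White',106},{'Green',107},{'Purple',108}],
                   ColorRec);

// every 3rd record starting from the 1st: Red, Yellow, Green
Sample1 := SAMPLE(ColorDS, 3);

// the second non-overlapping sample set: Blue, Orange, Purple
Sample2 := SAMPLE(ColorDS, 3, 2);

OUTPUT(Sample1);
OUTPUT(Sample2);
```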
+
+
+
+ +| _Value_ | _Definition_ | +| :- | :- | +| EXPORT | Optional. Used in MODULEs | +| return_dataset_layout | Record-definition/layout of result dataset. | +| transform_name | The name by which the transform will be invoked. | +| input_arguments_types | The argument’s data type. If passing a dataset, the data type is DATASET(record_definition). | +| arg_name | Used to reference your argument in the transform. | +| TRANSFORM | Required. | +| SELF | Reference the field in the return_data_type. | +| return_field_name | Refers to the field in result dataset. | +| input_Dataset_fieldname | Refers to the field in the input dataset. | +| SELF := [ ] | Assign default value for every field in result dataset that doesn't have a defined operation or doesn't exists in the input dataset. For example if there is a INTEGER field in the result dataset, that transform didn't assign a definition to it, the field will receive a 0 which is the default value for INTEGER. | +| END | Required. | + +## Transform Type One (Standalone TRANSFORM) + +If you need the transform to be used in multiple places, or it contains many fields or child datasets, you may want to define a standalone transform (a function that can be called multiple times) + +### Demo Dataset + +| FirstName | LastName | +| :- | :- | +| Sun | Shine | +| Blue | Moon | +| Silver | Rose | + +**Example** + +
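Using the placeholder names from the parameter table, a standalone TRANSFORM looks like:

```ecl
[EXPORT] return_dataset_layout transform_name(input_argument_type arg_name) := TRANSFORM
    SELF.return_field_name := arg_name.input_dataset_fieldname; // one per output field
    SELF := [];   // default values for any fields not assigned above
END;
```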
+
+
+ +## Transform Type Two (Explicit TRANSFORM) + +Often times TRANSFORM is small enough to used within PROJECT, JOIN, ROLLUP and other functions. Let's take a look at how it can be used within PROJECT. Similar principal is applied in using transform with other functions mentioned above, at transform definition. + +
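A sketch (layouts and names assumed) that builds a full name from the demo dataset:

```ecl
NameRec := RECORD
    STRING10 FirstName;
    STRING10 LastName;
END;

FullNameRec := RECORD
    STRING25 FullName;
END;

NameDS := DATASET([{'Sun','Shine'},{'Blue','Moon'},{'Silver','Rose'}], NameRec);

// standalone TRANSFORM: defined once, callable from PROJECT, JOIN, etc.
FullNameRec MakeFullName(NameRec L) := TRANSFORM
    SELF.FullName := TRIM(L.FirstName) + ' ' + L.LastName;
END;

OUTPUT(PROJECT(NameDS, MakeFullName(LEFT)));
```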
+
+
+
+ +| _Value_ | _Definition_ | +| :- | :- | +| EXPORT | Optional, used for constants, or in modules. | +| project_name | The name by which the project will be invoked. | +| PROJECT | Required. JOIN, ROLLUP and other functions can be used here. | +| input_dataset | Input dataset itself and not the record definition. | +| TRANSFORM | Required. | +| return_dataset_layout | Record-definition/layout of result dataset. | +| SELF | Reference the field in the return_data_type. | +| return_field_name | Refers to the field in result dataset. | +| input_Dataset_fieldname | Refers to the field in the input dataset. | +| SELF := LEFT | Get the original values from input (left) dataset for all fields that don't have an operation defined. | +| SELF := RIGHT | Used where there are two input dataset like JOIN. Get the original values from input (right) dataset for all fields that don't have an operation defined. | +| SELF := [ ] | Assign default value for every field in result dataset that doesn't have a defined operation or doesn't exists in the input dataset. For example if there is a INTEGER field in the result dataset, that transform didn't assign a definition to it, the field will receive a 0 which is the default value for INTEGER. | + +**Example** + +
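Using the placeholder names from the parameter table, an explicit (inline) TRANSFORM inside PROJECT looks like:

```ecl
[EXPORT] project_name := PROJECT(input_dataset,
                   TRANSFORM(return_dataset_layout,
                             SELF.return_field_name := LEFT.input_dataset_fieldname;
                             SELF := LEFT));   // or SELF := [] for default values
```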
+
+
\ No newline at end of file diff --git a/learnEcl/2600-project.md b/learnEcl/2600-project.md index 0b1bdd7..e7e6405 100644 --- a/learnEcl/2600-project.md +++ b/learnEcl/2600-project.md @@ -2,3 +2,99 @@ title: Project slug: project --- + +# PROJECT + +The PROJECT function processes through all records in the record-set performing the TRANSFORM function on each record in turn. PROJECT result always have the same number of rows as input dataset. + +PROJECT is like SQL's SELECT … INTO TABLE … + +## Syntax + +
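The same full-name example as a sketch with the TRANSFORM written inline (layout and names assumed):

```ecl
NameRec := RECORD
    STRING10 FirstName;
    STRING10 LastName;
END;

NameDS := DATASET([{'Sun','Shine'},{'Blue','Moon'},{'Silver','Rose'}], NameRec);

// the TRANSFORM is small enough to live inside the PROJECT call
FullNameDS := PROJECT(NameDS,
                      TRANSFORM({STRING25 FullName},
                                SELF.FullName := TRIM(LEFT.FirstName) + ' ' + LEFT.LastName));
OUTPUT(FullNameDS);
```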
+
+
+
+ +| _Value_ | _Definition_ | +| :- | :- | +| EXPORT | Optional, used within MODULEs. | +| project_name | The name by which the project will be invoked. | +| PROJECT | Required. | + +Please refer to TRANSFORM for TRANSFORM syntax. + +### Demo Dataset + +| StudentID | Name | ZipCode | Age | Major | isGraduated | +| :- | :- | :- | :- | :- | :- | +| 100 | Zorro | 30330 | 26 | History | TRUE | +| 409 | Dan | 40001 | 26 | Nursing | FALSE | +| 300 | Sarah | 30000 | 25 | Art | FALSE | +| 800 | Sandy | 30339 | 20 | Math | TRUE | +| 202 | Alan | 40001 | 33 | Math | TRUE | +| 604 | Danny | 40001 | 18 | N/A | FALSE | +| 305 | Liz | 30330 | 22 | Chem | TRUE | +| 400 | Matt | 30005 | 22 | Nursing | TRUE | + +**Example** + +
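Using the placeholder names from the parameter table, the two common forms are:

```ecl
// with a standalone transform:
project_name := PROJECT(input_dataset, transform_name(LEFT));

// or with an inline TRANSFORM:
project_name := PROJECT(input_dataset,
                        TRANSFORM(out_layout, SELF := LEFT));
```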
+
+
\ No newline at end of file diff --git a/learnEcl/2700-function.md b/learnEcl/2700-function.md index 6539680..27701eb 100644 --- a/learnEcl/2700-function.md +++ b/learnEcl/2700-function.md @@ -2,3 +2,147 @@ title: Function slug: function --- + +# FUNCTION + +A Function is a set of statements that take inputs, does some specific computation and produces an output or return a result. Result could be a value or a dataset. + +Notes + +* For function to be called/used from outside, EXPORT is required +* Function name should match the file name. if not "Error: Definition must contain EXPORT or SHARED " is generated + +## Syntax + +
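A sketch using the first few rows of the demo dataset (layout and names assumed). Note the result has exactly one row per input row:

```ecl
StudentRec := RECORD
    UNSIGNED3 StudentID;
    STRING10  Name;
    STRING5   ZipCode;
    UNSIGNED1 Age;
    STRING10  Major;
    BOOLEAN   isGraduated;
END;

StudentDS := DATASET([{100,'Zorro','30330',26,'History',TRUE},
                      {409,'Dan','40001',26,'Nursing',FALSE},
                      {300,'Sarah','30000',25,'Art',FALSE},
                      {800,'Sandy','30339',20,'Math',TRUE}], StudentRec);

// one output row per input row
StatusDS := PROJECT(StudentDS,
                    TRANSFORM({STRING10 Name; STRING10 Status},
                              SELF.Name := LEFT.Name;
                              SELF.Status := IF(LEFT.isGraduated, 'Graduated', 'Enrolled')));
OUTPUT(StatusDS);
```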
+
+
+
+ +| _Value_ | _Definition_ | +| :- | :- | +| EXPORT | Optional | +| return_data_type | Optional (compiler can infer it from return_value). If returning a dataset, the data type is DATASET(record_definition) | +| function_name | The name by which the function will be invoked | +| data_type | The argument’s data type. If passing a dataset, the data type is DATASET(record_definition) | +| ecl_code | Whatever code is needed to build return_value. Conversely, if the code does not contribute to return_value then it is ignored. Attributes defined here are scoped to the function | +| RETURN | Required | +| return_value | The result of the function | +| END | Required | + +**Example** + +
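Using the placeholder names from the parameter table, the general form is:

```ecl
[EXPORT] [return_data_type] function_name(data_type arg_name, ...) := FUNCTION
    // ecl_code: definitions here are scoped to the function
    RETURN return_value;
END;
```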
+
+
+ +## Outputs in Function - Using WHEN + +OUTPUT can be used to return multiple results from a function. PARALLEL and WHEN are the keywords used to generate multiple results. + +PARALLEL let's you run actions in parallel and WHEN behaves as a trigger. WHEN is used in scheduling. + +**Example** + +
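A minimal sketch (function and attribute names assumed):

```ecl
INTEGER AddTwo(INTEGER num1, INTEGER num2) := FUNCTION
    result := num1 + num2;   // scoped to the function
    RETURN result;
END;

OUTPUT(AddTwo(3, 4));   // 7
```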
+
+
+ +## One Line Function +If you don't have any ecl_code for the function, it is a one-liner. FUNCTION, RETURN, and END keywords are omitted. + +**Example** + +
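A sketch of the WHEN/PARALLEL pattern (names assumed): the side-effect OUTPUTs run when the function's result is evaluated.

```ecl
INTEGER DoMath(INTEGER a, INTEGER b) := FUNCTION
    sideEffects := PARALLEL(OUTPUT(a, NAMED('ValueA')),
                            OUTPUT(b, NAMED('ValueB')));
    result := a * b;
    RETURN WHEN(result, sideEffects);   // the result, plus the extra outputs
END;

OUTPUT(DoMath(3, 4), NAMED('Product'));
```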
+
+
\ No newline at end of file diff --git a/learnEcl/2800-module.md b/learnEcl/2800-module.md index cd921e1..f18e89c 100644 --- a/learnEcl/2800-module.md +++ b/learnEcl/2800-module.md @@ -2,3 +2,120 @@ title: Module slug: module --- + +# MODULE + +MODULE is s a container that allows you to group related definitions and functionalities. The parameters passed to the module are shared by all the related members definitions. + +## Notes +* OUTPUT can not be used within a module +* For modules to be called/used from outside, EXPORT is required +* Module name should match the file name. if not "Error: Definition must contain EXPORT or SHARED " is generated +* To call a module: ModuleName.attributeName; + +## Variable Scope + +* LOCAL Definitions are visible only up to an EXPORT or SHARED. + +* SHARED Definitions are visible through module. + +* EXPORT Definitions are visible within and outside of a module. + +* Modules can contain multiple, SHARED and EXPORT values. + +## Syntax + +
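The same adder as a sketch in one-line form — no FUNCTION, RETURN, or END:

```ecl
AddTwo(INTEGER num1, INTEGER num2) := num1 + num2;

OUTPUT(AddTwo(10, 5));   // 15
```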
+
+
+
+ +| _Value_ | _Definition_ | +| :- | :- | +| EXPORT | Optional, indicates that this module is available outside of this file | +| module_name | The name of the function. | +| param_data_type | Data type of each parameter (string, integer, Boolean, …). | +| MODULE | Required. | +| SHARED | The attribute or function can be accessed within the module. | +| EXPORT | The attribute or function can be accessed from outside of the module. | +| END | Indicates the end of module. | + +**Example** + +
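Using the placeholder names from the parameter table, the general form is:

```ecl
[EXPORT] module_name [(param_data_type param_name, ...)] := MODULE
    SHARED shared_attr := ...;   // visible only inside the module
    EXPORT export_attr := ...;   // visible outside: module_name.export_attr
END;
```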
+
+
+ +**Example** + +
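A sketch of SHARED vs. EXPORT scope (module and attribute names assumed):

```ecl
PersonMod := MODULE
    SHARED baseIncome := 1000;            // usable only inside the module
    EXPORT bonus      := 250;             // PersonMod.bonus
    EXPORT total      := baseIncome + bonus;
END;

OUTPUT(PersonMod.total);   // 1250
```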
+
+
\ No newline at end of file diff --git a/learnEcl/3000-join.md b/learnEcl/3000-join.md index 5f3a37c..10dbf1a 100644 --- a/learnEcl/3000-join.md +++ b/learnEcl/3000-join.md @@ -2,3 +2,349 @@ title: Join slug: join --- + +# JOIN + +JOIN is used to combine data or rows from two or more tables/datasets based on at least one boolean expression test. + +Note: When join condition is on STRING case sensitivity matters. For example Sun, SUN, and sun aren't the same. Make sure your STRING values are the same format to capture the matching dataset correctly. + +## Join Types + +Inner + +Returns all rows from both the dataset if the condition satisfies. This join will create the result-set by combining all rows from both the datasets where the condition satisfies. This is the default JOIN, so it doesn't need to be declared. +LEFT ONLY + +Returns all the rows of the dataset on the left side of the join that didn’t match any rows from right dataset. +LEFT OUTER + +Returns all the rows of the dataset on the left side of the join and matching rows for the dataset on the right side of join. +RIGHT ONLY + +Returns all the rows of the dataset on the right side of the join that didn’t match any rows from left dataset. +RIGHT OUTER + +Returns all the rows of the dataset on the right side of the join and matching rows for the dataset on the left side of join. +FULL ONLY + +Returns all rows from both left and right datasets that don't have a match in opposite dataset. +FULL OUTER + +Returns all rows from both the datasets regardless of join condition. This join will create the result-set by combining all rows from both the datasets, including matched and non matched rows. For non matched rows, the fields from opposite dataset will remain null. + +## Syntax + +
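A sketch of a parameterized module (names assumed) — the parameters are shared by all member definitions:

```ecl
MathMod(INTEGER a, INTEGER b) := MODULE
    EXPORT sumVal  := a + b;
    EXPORT prodVal := a * b;
END;

OUTPUT(MathMod(3, 4).sumVal);    // 7
OUTPUT(MathMod(3, 4).prodVal);   // 12
```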
+
+
+
+ +| _Value_ | _Definition_ | +| :- | :- | +| attribName | The name by which the function will be invoked. | +| LEFT_DatasetName | Left dataset of the join. LEFT is the first dataset passed to JOIN. | +| RIGHT_DatasetName | Right dataset of the join. RIGHT is the second dataset passed to JOIN. | +| LEFT.fieldName = RIGHT.fieldName | Join matching condition, it can use equal (=) or not-equal (!=). Join can take place on multiple conditions can exists using AND/OR. | +| Transform/xFormName | Explicit or stand-alone TRANSFORM. Keep in mind that you are passing two arguments to JOIN (left dataset and right dataset). | +| JoinType | Default is Inner join. | +| Flags | Optional. | + +### Optional Flags + +LOOKUP + +The right dataset is relatively small and there should be only one match for any LEFT record. +ALL + +The right dataset is relatively small and can be copied to every node in its entirety. +* Can have multiple matches (unlike LOOKUP) +* Supports join conditions that contain no equalities +* Required if there are no equality tests in the condition +FEW + +Few Specifies the LOOKUP right dataset has few records, so little memory is used. +NOSORT + +NOSORT Performs the JOIN without dynamically sorting the tables. This implies that the left and/or right record-set must have been previously sorted and partitioned based on the fields specified in the join Condition. +KEYED + +KEYED Specifies using indexed access into the right record-set. +LOCAL + +LOCAL JOIN performed on each supercomputer node independently, and maintains the pervious distribution of data. +KEEP(n) + +KEEP(n) Specifies the maximum number of matching records (n) to generate into the result set. If omitted, all matches are kept. +LIMIT + +LIMIT Specifies a maximum number of matching records which, if exceeded, either fails the job, or eliminates all those matches from the result set. 
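Using the placeholder names from the parameter table, the general form is:

```ecl
attribName := JOIN(LEFT_DatasetName, RIGHT_DatasetName,
                   LEFT.fieldName = RIGHT.fieldName,
                   xFormName(LEFT, RIGHT)
                   [, JoinType] [, Flags]);
```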
+ +### Demo Dataset + +Left Dataset(StudentDS) + +| StudentID | Name | ZipCode | Age | Major | isGraduated | +| :- | :- | :- | :- | :- | :- | +| 100 | Zoro | 30330 | 26 | History | TRUE | +| 409 | Dan | 40001 | 26 | Nursing | FALSE | +| 300 | Sarah | 30000 | 25 | Art | FALSE | +| 800 | Sandy | 30339 | 20 | Math | TRUE | +| 202 | Alan | 40001 | 33 | Math | TRUE | +| 604 | Danny | 40001 | 18 | N/A | FALSE | +| 305 | Liz | 30330 | 22 | Chem | TRUE | +| 400 | Matt | 30005 | 22 | Nursing | TRUE | + +Right Dataset(MajorDS) + +|MajorID|MajorName|NumOfYears|Department| +|:-|:-|:-|:-| +M101 | Dentist | 5 | medical +M102 | Nursing | 4 | Medical +M201 | Surgeon | 12 | Medical +S101 | Math | 4 | Science +S333 | Computer | 4 | Science +A101 | Art | 3 | Art +A102 | Digital Art | 3 | Art + +**Example** + +
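For instance, a LOOKUP join might be sketched like this (dataset and field names are assumed):

```ecl
// smallDS fits in memory and has at most one match per bigDS record
J := JOIN(bigDS, smallDS,
          LEFT.id = RIGHT.id,
          TRANSFORM(LEFT),   // keep the left record unchanged
          LOOKUP);
```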
+
+
+ +### Demo Dataset + +Left Dataset +ColorID|Color|isDark +|---|---|--- +1 | Blue | 1 +2 | Red | 0 +3 | Black | 1 +4 | Green | 1 +5 | Olive | 0 +11 | Maroon | 1 + +Right Dataset: + +ID|Hue|Code| +---|---|--- +2 | Red | #FF0000 +3 | Black | #000000 +4 | Green | #008000 +8 | Green | #FFC0CB +10 | Red | #000000 +12 | Lime |#00FF00 + +**Example** + +
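A sketch using a few rows of each demo dataset (layouts and names assumed):

```ecl
StudentRec := {UNSIGNED3 StudentID; STRING10 Name; STRING10 Major};
MajorRec   := {STRING4 MajorID; STRING12 MajorName; UNSIGNED1 NumOfYears; STRING10 Department};

StudentDS := DATASET([{100,'Zoro','History'},{409,'Dan','Nursing'},
                      {800,'Sandy','Math'},{300,'Sarah','Art'}], StudentRec);
MajorDS   := DATASET([{'M102','Nursing',4,'Medical'},
                      {'S101','Math',4,'Science'},
                      {'A101','Art',3,'Art'}], MajorRec);

// inner join (the default): only students whose Major matches a MajorName
StudentMajors := JOIN(StudentDS, MajorDS,
                      LEFT.Major = RIGHT.MajorName,
                      TRANSFORM({STRING10 Name; STRING12 MajorName; STRING10 Department},
                                SELF.Name := LEFT.Name;
                                SELF := RIGHT));
OUTPUT(StudentMajors);   // Dan, Sandy, Sarah — Zoro's 'History' has no match
```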
+
+
\ No newline at end of file diff --git a/learnEcl/3100-table.md b/learnEcl/3100-table.md index 64faf5b..e8cc1d9 100644 --- a/learnEcl/3100-table.md +++ b/learnEcl/3100-table.md @@ -2,3 +2,236 @@ title: Table slug: table --- + +# TABLE + +TABLE is the most commonly-used data aggregation functions in ECL. It creates a new dataset in memory while workunit is running. The new table inherits the implicit rationality the recordset has (if any), unless the optional expression is used to perform aggregation. There are two types of Table: + +Vertical Number of records in the input dataset is equal to generated table, which means no aggregation is involved. + +CrossTab There is at least one field using an aggregate function with the keyword Grouping Condition as its first parameter. The number of records produced is equal to the number of distinct values of the expression. + +## Syntax + +
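A sketch of a LEFT OUTER join on the color demo datasets (layouts and names assumed):

```ecl
LeftRec  := {UNSIGNED2 ColorID; STRING6 Color; BOOLEAN isDark};
RightRec := {UNSIGNED2 ID; STRING6 Hue; STRING7 Code};

ColorDS := DATASET([{1,'Blue',TRUE},{2,'Red',FALSE},{3,'Black',TRUE},
                    {4,'Green',TRUE},{5,'Olive',FALSE},{11,'Maroon',TRUE}], LeftRec);
HueDS   := DATASET([{2,'Red','#FF0000'},{3,'Black','#000000'},{4,'Green','#008000'},
                    {8,'Green','#FFC0CB'},{10,'Red','#000000'},{12,'Lime','#00FF00'}], RightRec);

// LEFT OUTER: every left record survives; Code stays blank where no ID matches
AllColors := JOIN(ColorDS, HueDS,
                  LEFT.ColorID = RIGHT.ID,
                  TRANSFORM({UNSIGNED2 ColorID; STRING6 Color; STRING7 Code},
                            SELF.Code := RIGHT.Code;
                            SELF := LEFT),
                  LEFT OUTER);
OUTPUT(AllColors);
```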
+
+
+
+ +| _Value_ | _Definition_ | +| :- | :- | +| out_record_def | Record definition that will contain both the grouping condition results and any new attributes computed as part of the aggregation. | +| dataset.field | Field(s) from input dataset. | +| field_name | Newly defined fields. | +| attr_name | The name by which the table will be invoked. | +| TABLE | Required. | +| dataset | Input dataset to create the table from. | +| grouping_condition | One or more comma-delimited expressions. Please see Grouping Condition for more information. | +| flags | Optional flags that can alter the behavior of TABLE. | + +
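Using the placeholder names from the parameter table, the form with a separately defined output record is:

```ecl
out_record_def := RECORD
    dataset.field;                           // grouping field(s) carried through
    field_name := <aggregate or expression>; // newly computed field(s)
END;

attr_name := TABLE(dataset, out_record_def [, grouping_condition] [, flags]);
```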
+
+
+
+ +| _Value_ | _Definition_ | +| :- | :- | +| attr_name | The name by which the table will be invoked. | +| TABLE | Required. | +| field | Field(s) from input dataset. | +| field_name | Newly defined field. | +| grouping_condition | One or more comma-delimited expressions. Please see Group for more information. | +| flags | Optional flags that can alter the behavior of TABLE. | + +### Grouping Condition + +* One or more comma-delimited expressions +* An expression could simply be an attribute name within the dataset; this is the most common usage +* An expression could be a computed value, such as (myValue % 2) to group on even/odd values +* All records within dataset that evaluate to the same set of condition values will be grouped together +* Each group will result in one output record +* Functions evaluated within outrecorddef will operate on the group + +### Optional Flags + +Flags can alter the behavior of TABLE. Commonly used flags are MERGE and LOCAL + +| _Flag_ | _Definition_ | +| :- | :- | +| FEW | Indicates that the expression will result in fewer than 10,000 distinct groups. This allows optimization to produce a significantly faster result. | +| MANY | Indicates that the expression will result in many distinct groups. | +| UNSORTED | Specifies that you don't care about the order of the groups. This allows optimization to produce a significantly faster result. | +| LOCAL | Specifies the operation is performed on each node independently; the operation maintains the distribution of any previous DISTRIBUT. | +| KEYED | Specifies the activity is part of an index read operation, which allows the optimizer to generate optimal code for the operation. | +| MERGE | Specifies that results are aggregated on each node and then the aggregated intermediaries are aggregated globally. This is a safe method of aggregation that shines particularly well if the underlying data was skewed. | +| SKEW | Indicates that you know the data will not be spread evenly across nodes. 
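The same thing with the output record written inline:

```ecl
attr_name := TABLE(dataset,
                   { field, field_name := <expression> },
                   grouping_condition [, flags]);
```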
| + +## GROUP + +The GROUP keyword is used within output format parameter (RECORD Structure) of a TABLE definition. GROUP replaces the recordset parameter of any aggregate built-in function used in the output to indicate the operation is performed for each group of the expression. This is similar to an SQL "GROUP BY" clause. + + +### Demo Dataset + +| Pickup_Date | Fare | Distance | +| :- | :- | :- | +| 1/1/2021 | 25.1 | 15.5 | +| 1/2/2021 | 40.15 | 7.2 | +| 1/3/2021 | 25.36 | 6.5 | +| 1/2/2021 | 120 | 23 | +| 1/3/2021 | 30 | 60.75 | +| 2/2/2021 | 25 | 71 | +| 1/2/2021 | 10 | 2.2 | +| 3/10/2021 | 45 | 12.23 | + + +**Example** + +
+
+
+ +### Demo Dataset + +| PersonID | FirstName | LastName | isEmployed | avgIncome | EmpGroupNum | +| :- | :- | :- | :- | :- | :- | +| 1102 | Fred | Smith | FALSE | 1000 | 900 | +| 3102 | Fact | Smith | TRUE | 200000 | 100 | +| 1012 | Joe | Blow | TRUE | 11250 | 200 | +| 2085 | Blue | Moon | TRUE | 185000 | 500 | +| 3055 | Silver | Jo | FALSE | 5000 | 900 | +| 1265 | Darling | Jo | TRUE | 5000 | 100 | +| 1265 | Darling | Alex | TRUE | 5000 | 100 | +| 5265 | Blue | Silver | TRUE | 75000 | 200 | +| 7333 | Jane | Smith | FALSE | 50000 | 900 | +| 6033 | Alex | Silver | TRUE | 102000 | 200 | +| 1024 | Nancy | Moon | TRUE | 201100 | 700 | + +**Example** + +
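A CrossTab sketch over the trip demo dataset (layout and names assumed): one output row per distinct Pickup_Date.

```ecl
TripRec := RECORD
    STRING9    Pickup_Date;
    DECIMAL8_2 Fare;
    DECIMAL8_2 Distance;
END;

TripDS := DATASET([{'1/1/2021',25.10,15.5},{'1/2/2021',40.15,7.2},
                   {'1/3/2021',25.36,6.5},{'1/2/2021',120,23},
                   {'1/3/2021',30,60.75},{'2/2/2021',25,71},
                   {'1/2/2021',10,2.2},{'3/10/2021',45,12.23}], TripRec);

// GROUP stands in for "the records in each Pickup_Date group"
FaresByDay := TABLE(TripDS,
                    {Pickup_Date,
                     UNSIGNED    Trips     := COUNT(GROUP),
                     DECIMAL10_2 TotalFare := SUM(GROUP, Fare)},
                    Pickup_Date);
OUTPUT(FaresByDay);
```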
+
+
\ No newline at end of file diff --git a/learnEcl/3200-normalize.md b/learnEcl/3200-normalize.md index 82c7d1c..29fc586 100644 --- a/learnEcl/3200-normalize.md +++ b/learnEcl/3200-normalize.md @@ -2,3 +2,214 @@ title: Normalize slug: normalize --- + +# NORMALIZE + +NORMALIZE gets a parent-child (DENORMALIZED) dataset and extract the child dataset from it. The purpose is to take variable-length flat-file records and split out the child information. + +There are two ways to normalize a child dataset: + +* All Records: Processing all child records +* With Counter: Using counter for a certain number of children + +## Normalize All Records + +This form processes through all records in the recordset executing transform function through all the child dataset records in each record. This method is used when we have embedded child dataset. + +You can think of this as a specialized JOIN where the TRANSFORM is called with, LEFT as the “main” record being processed and RIGHT as one of the records from the child dataset. + +In this form TRANSFORM is called for each parent record with child record pair. + +## Parameters +* Must have a RIGHT record of the same format as the child dataset. +* The resulting record set format does not need to be the same as the input. +* Child layout is being called as an embedded dataset. + +**Example** + +
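A sketch grouping on EmpGroupNum with the MERGE flag (only the relevant columns of the demo dataset are declared; layout and names assumed):

```ecl
PersonRec := RECORD
    UNSIGNED2 EmpGroupNum;
    BOOLEAN   isEmployed;
    UNSIGNED  avgIncome;
END;

PersonDS := DATASET([{900,FALSE,1000},{100,TRUE,200000},{200,TRUE,11250},
                     {500,TRUE,185000},{900,FALSE,5000},{100,TRUE,5000},
                     {100,TRUE,5000},{200,TRUE,75000},{900,FALSE,50000},
                     {200,TRUE,102000},{700,TRUE,201100}], PersonRec);

// one row per EmpGroupNum; MERGE aggregates on each node, then globally
IncomeByGroup := TABLE(PersonDS,
                       {EmpGroupNum,
                        UNSIGNED People    := COUNT(GROUP),
                        UNSIGNED AvgIncome := AVE(GROUP, avgIncome)},
                       EmpGroupNum, MERGE);
OUTPUT(IncomeByGroup);
```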
+
+
+ +## All Records Syntax + +
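A sketch of the all-records form (layouts, names, and data assumed): each parent carries an embedded child dataset, and the TRANSFORM sees the parent as LEFT and one child record as RIGHT.

```ecl
ChildRec := RECORD
    STRING10 PhoneType;
    STRING12 PhoneNumber;
END;

ParentRec := RECORD
    STRING10 Name;
    DATASET(ChildRec) Phones;   // embedded child dataset
END;

PeopleDS := DATASET([{'Sun', [{'home','111-222-3333'},{'cell','444-555-6666'}]},
                     {'Blue', [{'cell','777-888-9999'}]}], ParentRec);

OutRec := RECORD
    STRING10 Name;
    STRING10 PhoneType;
    STRING12 PhoneNumber;
END;

// LEFT is the parent record, RIGHT is one embedded child record
PhoneDS := NORMALIZE(PeopleDS, LEFT.Phones,
                     TRANSFORM(OutRec,
                               SELF.Name := LEFT.Name;
                               SELF := RIGHT));
OUTPUT(PhoneDS);   // one row per phone number: 3 rows
```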
+
+
+
+ +## Normalize With COUNTER + +This NORMALIZE form calls TRANSFORM times for each parent record. does not need to be the same value for every record. The TRANSFORM function must take at least a LEFT record of the same format as the input recordset. The resulting record set format does not need to be the same as the input. + +**Example** + +
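In outline (placeholder names):

```ecl
attr_name := NORMALIZE(parent_dataset, LEFT.child_dataset_field,
                       TRANSFORM(out_layout,
                                 SELF.parent_field := LEFT.parent_field;
                                 SELF := RIGHT));   // RIGHT is one child record
```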
+
+
+ +## With COUNTER Syntax + +
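A sketch of the COUNTER form (layouts, names, and data assumed): the TRANSFORM is called once per counter value for each parent record.

```ecl
NameRec := {STRING10 Name; UNSIGNED1 Kids};
ParentDS := DATASET([{'Sun',2},{'Blue',3}], NameRec);

OutRec := {STRING10 Name; UNSIGNED1 KidNum};

// TRANSFORM is called LEFT.Kids times for each parent record
KidsDS := NORMALIZE(ParentDS, LEFT.Kids,
                    TRANSFORM(OutRec,
                              SELF.Name := LEFT.Name;
                              SELF.KidNum := COUNTER));
OUTPUT(KidsDS);   // 5 rows: Sun 1, Sun 2, Blue 1, Blue 2, Blue 3
```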
+
+
+
+ +## Flags + +| Options | Description | +| :- | :- | +| UNORDERED | Specifies the output record order is not significant. | +| ORDERED | Specifies the significance of the output record order. | +| STABLE | Specifies the input record order is significant. | +| PARALLEL | Try to evaluate this activity in parallel. | \ No newline at end of file
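In outline (placeholder names):

```ecl
attr_name := NORMALIZE(input_dataset, count_expression,
                       TRANSFORM(out_layout,
                                 SELF.copy_number := COUNTER;  // runs 1..count_expression
                                 SELF := LEFT));
```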