Add type three scd upserts functionality #127

CommanderWahid · 2023-10-15T18:13:43Z

Adds new functionality to perform a type 3 SCD upsert on a target delta table:

Add the type_3_scd_upsert function.
Add a unit test for the type_3_scd_upsert function.
Update the README to include the new feature.

MrPowers · 2023-10-16T12:47:46Z

README.md

+- `delta_table` (`DeltaTable`): An object representing the delta table to be upserted.
+- `updates_df`  (`DataFrame`):  The data to be used in order to upsert the target delta table.
+- `primary_key`  (`str`):  The primary key (i.e. business key) that uniquely identifies each row in the target delta table.
+- `curr_prev_col_names`  (`dict[str,str]`): A dictionary of column names to store current and previous values.


Can we separate this into two parameters: current_col_name and previous_col_name? I generally prefer exposing public interfaces with full words instead of abbreviations. Is there any advantage to grouping these two arguments in a dictionary?

I think it depends on whether we want to:

Populate multiple columns at a time when applying the type 3 SCD upsert:

Suppose that we have a delta table having this structure:
+----+----+---+--------+--------+------------+-------------+--------------+
|pkey|name|job|prev_job| country|prev_country|    continent|prev_continent|
+----+----+---+--------+--------+------------+-------------+--------------+

The columns prev_job, prev_country and prev_continent will store the previous values for the columns job, country and continent after applying the type 3 scd.

Using a dictionary as a parameter to store the column names (for current and previous values) is less error-prone than storing them in two separate lists.

Option #1: Using a dictionary --> Triggering the type_3_scd_upsert once
col_names = {"country":"prev_country", "job":"prev_job", "continent":"prev_continent"}
mack.type_3_scd_upsert(delta_table, updates_df, "pkey",col_names)

Option #2: Using two parameters --> Triggering the type_3_scd_upsert multiple times
current_col_names = ["country","job","continent"]
previous_col_names = ["prev_country","prev_job","prev_continent"]
mack.type_3_scd_upsert(delta_table, updates_df, "pkey",current_col_names,previous_col_names)

Here, an error in the column order can cause a catastrophe, such as storing the previous values of the column 'job' in the column 'prev_country' (or the reverse):

current_col_names= ["country", "job", "continent"]
previous_col_names= ["prev_job","prev_country","prev_continent"]
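To make the ordering risk concrete, here is a minimal plain-Python sketch. It does not call mack at all; `mismatched_pairs` is a hypothetical helper, and the `prev_` naming convention it checks is only an assumption taken from the example table above.

```python
# Hypothetical illustration only; none of these names are part of mack.
current_col_names = ["country", "job", "continent"]
previous_col_names = ["prev_job", "prev_country", "prev_continent"]  # order slip

# Pairing two parallel lists by position silently accepts the wrong order.
pairs_from_lists = dict(zip(current_col_names, previous_col_names))

# A dictionary literal keeps each current/previous pair side by side.
pairs_from_dict = {
    "country": "prev_country",
    "job": "prev_job",
    "continent": "prev_continent",
}


def mismatched_pairs(pairs, prefix="prev_"):
    """Return the pairs whose previous-column name is not prefix + current name.

    The prefix convention is an assumption based on the example table.
    """
    return {cur: prev for cur, prev in pairs.items() if prev != prefix + cur}


print(mismatched_pairs(pairs_from_lists))
print(mismatched_pairs(pairs_from_dict))
```

With the two lists, the slip only shows up if something checks the pairs explicitly. With the dictionary, the same mistake would have to be written inside a single key/value pair, where it is easier to spot in review.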

Populate one column at a time when applying the type 3 SCD upsert:

Let's use the same delta table structure mentioned above.

In this case we will have something like this:

# type 3 scd upsert for column job
current_col_names = "job"
previous_col_names = "prev_job"
mack.type_3_scd_upsert(delta_table, updates_df, "pkey",current_col_names,previous_col_names)

# type 3 scd upsert for column country
current_col_names = "country"
previous_col_names = "prev_country"
mack.type_3_scd_upsert(delta_table, updates_df, "pkey",current_col_names,previous_col_names)

# type 3 scd upsert for column continent
current_col_names = "continent"
previous_col_names = "prev_continent"
mack.type_3_scd_upsert(delta_table, updates_df, "pkey",current_col_names,previous_col_names)

MrPowers · 2023-10-16T12:48:25Z

README.md

+----+----+----+-------+--------+------------+-------------+--------------+
+|pkey|name|job|prev_job| country|prev_country|    continent|prev_continent|
+----+----+---+--------+--------+------------+-------------+--------------+
+|   1|   A| AA|    null|   Japan|        null|         Asia|          null|


Let's add some examples where the prev_job column is not null.

MrPowers · 2023-10-16T12:50:44Z

README.md

+----+----+---+------------+-------------+
+|   1|  A1| AA|       Japan|         Asia|  // update on name 
+|   2|  B1|BBB|        Peru|South America|  // updates on name,job,country,continent --> storing previous values on prev_job,prev_country,prev_continent 
+|   3|   C| CC|  New Zeland|      Oceania|  // updates on country,continent --> storing previous values on prev_country,prev_continent 


We should demonstrate these two cases:

a row that's in source, but not in target

a row that's identical in source and target

Yes.
Just to be sure, by source do you mean the updates DataFrame, and by target the delta table?
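As a way to pin down the cases being discussed, here is a rough plain-Python sketch of the row-level behaviour, with lists of dicts standing in for the delta table (target) and the updates DataFrame (source). It is not the mack implementation: `type_3_scd_merge` is a hypothetical name, and inserting new rows with empty previous values is an assumption about the intended behaviour.

```python
def type_3_scd_merge(target_rows, source_rows, primary_key, curr_prev_col_names):
    """Plain-Python sketch of type 3 SCD semantics (not the mack implementation).

    Assumes the source rows carry only current values, never prev_* columns.
    """
    merged = {row[primary_key]: dict(row) for row in target_rows}
    for src in source_rows:
        key = src[primary_key]
        if key not in merged:
            # Row in source but not in target: insert it, previous values empty.
            new_row = dict(src)
            for prev_col in curr_prev_col_names.values():
                new_row.setdefault(prev_col, None)
            merged[key] = new_row
            continue
        tgt = merged[key]
        for col, value in src.items():
            if tgt.get(col) == value:
                continue  # identical value in source and target: no change
            if col in curr_prev_col_names:
                # Tracked column changed: keep the old value in its prev column.
                tgt[curr_prev_col_names[col]] = tgt.get(col)
            tgt[col] = value
    return list(merged.values())


target = [
    {"pkey": 1, "name": "A", "job": "AA", "prev_job": None},
    {"pkey": 2, "name": "B", "job": "BB", "prev_job": None},
]
source = [
    {"pkey": 1, "name": "A", "job": "AA"},    # identical in source and target
    {"pkey": 2, "name": "B1", "job": "BBB"},  # changed name and job
    {"pkey": 4, "name": "D", "job": "DD"},    # in source, not in target
]
result = type_3_scd_merge(target, source, "pkey", {"job": "prev_job"})
```

In this sketch the identical row is left untouched, the changed row moves its old job into prev_job, and the new row is appended with prev_job empty.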

MrPowers · 2023-10-16T12:51:23Z

mack/__init__.py

@@ -735,3 +736,119 @@ def rename_delta_table(
        delta_table.toDF().write.format("delta").mode("overwrite").saveAsTable(
            new_table_name
        )
+
+def type_3_scd_upsert(


How would the user use this function with an effective_date column?

I recommend making it an optional parameter.

When it is provided, the function has to update the effective_date each time a real update occurs on the current and previous columns.

The structure of a delta table with an effective_date could be something like this:

+----+----+---+--------+------------------+
|pkey|name|job|prev_job|job_effective_date|
+----+----+---+--------+------------------+

or much wider:

+----+----+---+--------+------------------+-------+------------+-----------------+---------+--------------+
|pkey|name|job|prev_job|job_effective_date|country|prev_country|country_effective|continent|prev_continent|
+----+----+---+--------+------------------+-------+------------+-----------------+---------+--------------+

Now, how will we provide the column names representing the effective_date to the function?

A single parameter holding a list of lists? --> param = ( (job,prev_job,job_effective_date),(country,prev_country,country_effective), (continent,prev_continent) )

Three parameters: current_col_names, previous_col_names and effective_date_col_names?

I don't like the second option, unless we decide that the function cannot handle multiple SCD columns at a time.
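One possible shape for the first option, sketched in plain Python: each tracked column gets a small record with named fields instead of a positional tuple, so the effective date column stays optional per column. `ScdColumn` and `validate_scd_columns` are hypothetical names for illustration only, not part of mack or of this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScdColumn:
    """Hypothetical per-column spec: current, previous, optional effective date."""

    current: str
    previous: str
    effective_date: Optional[str] = None


# Column names taken from the wider table layout above.
scd_columns = [
    ScdColumn("job", "prev_job", "job_effective_date"),
    ScdColumn("country", "prev_country", "country_effective"),
    ScdColumn("continent", "prev_continent"),  # no effective date tracked
]


def validate_scd_columns(columns, table_columns):
    """Return the configured column names that the table does not have."""
    missing = []
    for spec in columns:
        for name in (spec.current, spec.previous, spec.effective_date):
            if name is not None and name not in table_columns:
                missing.append(name)
    return missing


table_columns = [
    "pkey", "name", "job", "prev_job", "job_effective_date",
    "country", "prev_country", "country_effective",
    "continent", "prev_continent",
]
print(validate_scd_columns(scd_columns, table_columns))
```

Named fields keep the current, previous and effective date columns for one attribute together, which avoids the same ordering problem discussed for the two-list approach.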

MrPowers · 2023-10-16T12:52:18Z

@CommanderWahid - thanks for submitting this PR.

I don't know much about Type 3 SCD, but I am trying to learn here.

I added some comments. Thanks for your patience here. I am going to have to learn as we go and we'll have to collaborate to make sure we get the right abstraction!

add type three scd upserts functionality

32c4a85

MrPowers requested review from danielbeach and robertkossendey October 16, 2023 12:38
