Possible deduplication solution that doesn't require a primary key #95

MrPowers · 2023-03-09T22:00:43Z

We'll have to translate this to Python:

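import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// <pk cols> is a placeholder for the columns that identify a logical row
val duplicates = df
  .select(<pk cols>)
  .withColumn("__file_path", col("_metadata.file_path"))
  .withColumn("__row_index", col("_metadata.row_index"))
  .withColumn(
    "rank",
    row_number().over(
      Window
        .partitionBy(<pk cols>)
        .orderBy(<pk cols>)))
  .filter("rank > 1") // keep every copy except the first in each group
  .drop("rank")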
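To make the ranking step concrete, here is a small plain-Python model of what the window logic above selects. It is only a sketch, not Spark code: `flag_duplicates` and the sample rows are hypothetical, and it assumes each row already carries its `file_path` and `row_index`. In Spark the row that receives rank 1 within a key group is arbitrary, because the ordering columns are the same as the partitioning columns; this model simply keeps the first row in input order.

```python
from collections import defaultdict


def flag_duplicates(rows, pk_cols):
    """Return (file_path, row_index) locators for every row after the first per key."""
    seen = defaultdict(int)  # key tuple -> rows seen so far, i.e. the "rank"
    flagged = []
    for row in rows:
        key = tuple(row[c] for c in pk_cols)
        seen[key] += 1
        if seen[key] > 1:  # same test as filter("rank > 1")
            flagged.append((row["file_path"], row["row_index"]))
    return flagged


# Hypothetical sample data: three physical copies of id=1 spread over two files.
rows = [
    {"id": 1, "file_path": "part-0.parquet", "row_index": 0},
    {"id": 1, "file_path": "part-0.parquet", "row_index": 1},
    {"id": 2, "file_path": "part-1.parquet", "row_index": 0},
    {"id": 1, "file_path": "part-1.parquet", "row_index": 1},
]

print(flag_duplicates(rows, ["id"]))
```

The flagged locators are what make the later delete precise: the key alone matches every copy, while key plus file path plus row index matches only the surplus ones.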

And then:

// Here df must be the DeltaTable handle for the table (merge is a
// DeltaTable method), not the DataFrame used above
df.alias("old")
  .merge(
    duplicates.alias("new"),
    "old.<pk1> = new.<pk1> AND ... AND old.<pkn> = new.<pkn>" +
      " AND old._metadata.file_path = new.__file_path" +
      " AND old._metadata.row_index = new.__row_index")
  .whenMatchedDelete()
  .execute()
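For the Python translation, the merge condition could be assembled from a list of key columns rather than written out by hand. A minimal sketch follows; `build_merge_condition` and its parameters are hypothetical names, not part of Delta Lake or any library, and the `__file_path` / `__row_index` column names are taken from the snippet above.

```python
def build_merge_condition(pk_cols, target="old", source="new"):
    """Build the SQL merge condition matching on key columns plus physical row location."""
    if not pk_cols:
        raise ValueError("at least one key column is required")
    # One equality clause per key column: old.<pk> = new.<pk>
    clauses = [f"{target}.{c} = {source}.{c}" for c in pk_cols]
    # Pin the match to one physical row via its file path and row index.
    clauses.append(f"{target}._metadata.file_path = {source}.__file_path")
    clauses.append(f"{target}._metadata.row_index = {source}.__row_index")
    return " AND ".join(clauses)


print(build_merge_condition(["id", "region"]))
```

In delta-spark's Python API the resulting string would presumably be passed as the condition argument of `DeltaTable.merge`, followed by `whenMatchedDelete()` and `execute()`. Whether `_metadata` fields can be referenced on the target side of a merge is an assumption carried over from the proposal and would need to be verified.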

robertkossendey · 2023-03-09T22:24:37Z

Where is the row_index property documented?

robertkossendey · 2023-04-17T07:36:00Z

Ahh, found it! ;) https://issues.apache.org/jira/browse/SPARK-37980
