Brainstorming functions to make PySpark easier #83

MrPowers opened this issue Mar 13, 2023 · 15 comments

@MrPowers
Collaborator

MrPowers commented Mar 13, 2023

One of the overall goals of this project is to make PySpark easier. The purpose of this issue is to brainstorm additional functionality that should be added to quinn to make PySpark easier.

Let's look at the Ruby on Rails web framework, which has a well-known ActiveSupport module that makes common web development tasks easy.

When you use Rails and search "rails beginning of week", you're immediately directed to the beginning_of_week function, which is easy.

When you type in "spark beginning of week", you're redirected to this blog post I wrote, which provides a spark-daria Scala function for beginningOfWeek, which is also easy. I can see that users read this blog post every day from the Google Analytics traffic.

When you type in "pyspark beginning of week", you're redirected to this Stackoverflow question, which is scary, complicated, and makes this seem hard. I need to write a blog post that ranks #1 for "pyspark beginning of week" searches that show the quinn.week_start_date() function, so Python users think this is easy. We have a function that's easy, but it's not findable yet.

What are your thoughts for new functions we can add to quinn that will help make PySpark easy?

@robertkossendey

If I understand you correctly, we are looking for new functions that don't exist yet to make PySpark easier, right?

@MrPowers
Collaborator Author

@robertkossendey - yep, what new functions should we add to make PySpark easier?

@robertkossendey

One thing that comes to my mind immediately is calculating how many business days lie between two dates. I've always found it annoying that PySpark does not provide such a function natively.
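A hedged sketch of one way to do this without a UDF, using Spark's sequence and higher-order filter functions (Spark 3.1+; the function name is hypothetical):

```python
from pyspark.sql import functions as F

def business_days_between(start_col, end_col):
    # Generate every date in the closed interval, keep Mon-Fri, count them.
    # In Spark, dayofweek returns 1 for Sunday and 7 for Saturday.
    all_days = F.sequence(start_col, end_col)
    weekdays = F.filter(all_days, lambda d: ~F.dayofweek(d).isin(1, 7))
    return F.size(weekdays)
```

Note that this materializes one array element per day, so it's fine for typical date ranges but wasteful for spans of many years.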

@danielbeach

@MrPowers One thing I find myself doing constantly is extracting a string between two other strings with regex. It's super annoying, and I have to use it all the time. For example, you just want to find the first occurrence of a string between two other strings.
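Something like this hypothetical helper, sketched with regexp_extract (assuming the delimiters are regex-safe literals):

```python
from pyspark.sql import functions as F

def string_between(col, left, right):
    # Lazy quantifier so we capture the *first* occurrence between the markers.
    return F.regexp_extract(col, f"{left}(.*?){right}", 1)

# string_between(F.col("body"), "<title>", "</title>")
```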

@danielbeach

danielbeach commented Mar 13, 2023

@MrPowers Also, in PySpark it can be annoying to write window functions all the time. I could see some sort of function that reduces the boilerplate required for a window function.
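For example, a hypothetical wrapper along these lines could hide the Window spec for the common ranking case:

```python
from pyspark.sql import Window, functions as F

def with_row_number(df, partition_by, order_by, col_name="row_number"):
    # Hypothetical helper: callers pass columns instead of
    # building the Window spec by hand.
    w = Window.partitionBy(*partition_by).orderBy(*order_by)
    return df.withColumn(col_name, F.row_number().over(w))

# with_row_number(df, ["customer_id"], [F.col("order_date").desc()])
```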

@SemyonSinchenko
Collaborator

I think we can add some math sugar. For example, I recently worked on implementing differential privacy and found that only the uniform random distribution is available out of the box; to implement a Laplacian random distribution, you have to invest a lot of time. I also often use random integers to repartition data, but there is no out-of-the-box functionality to generate random ints.
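For instance, a Laplacian distribution can be derived from rand() via the inverse CDF; a sketch (function names are hypothetical):

```python
from pyspark.sql import functions as F

def rand_laplace(mu=0.0, b=1.0, seed=None):
    # Inverse-CDF transform: map U ~ Uniform(-0.5, 0.5) onto Laplace(mu, b):
    #   X = mu - b * sgn(U) * ln(1 - 2|U|)
    u = F.rand(seed) - F.lit(0.5)
    return F.lit(mu) - F.lit(b) * F.signum(u) * F.log(1 - 2 * F.abs(u))

def rand_int(min_val, max_val, seed=None):
    # Uniform random integer in [min_val, max_val], e.g. for repartition keys.
    return F.floor(F.rand(seed) * (max_val - min_val + 1) + min_val).cast("int")
```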

Also, I want to see aggregation functions like first_row, last_row, max_row, min_row, and so on. It is a really common task, for example, to get "the most recent row by date per customer or per product".
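A sketch of the "most recent row per group" pattern such a helper could wrap (names are hypothetical):

```python
from pyspark.sql import Window, functions as F

def latest_row_per_group(df, group_cols, order_col):
    # Keep only the newest row per group, e.g. the latest order per customer.
    w = Window.partitionBy(*group_cols).orderBy(F.col(order_col).desc())
    return (
        df.withColumn("_rn", F.row_number().over(w))
          .filter(F.col("_rn") == 1)
          .drop("_rn")
    )
```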

@SemyonSinchenko
Collaborator

Also, I really want to see some functionality for working with S3/HDFS/DBFS/etc. using only PySpark, but @MrPowers you told me that it is better to do that in the eren repository, not quinn.

@fpvmorais
Contributor

> One thing that comes to my mind immediately is calculating how many business days lie between two dates. I've always found it annoying that PySpark does not provide such a function natively.

Shameless plug :P

https://fpvmorais.com/post/databricks_workdays/

@MrPowers
Collaborator Author

@fpvmorais - looks like your workdays function would be a good contribution. We'd probably be able to figure out how to do it without a UDF. Want to submit a PR?

@fpvmorais
Contributor

I would love to submit it :D Do I just open a PR on this repo?

@MrPowers
Collaborator Author

@fpvmorais - can you please comment on this issue, so I can assign the work to you? You can submit the PR by forking the repo and then creating the PR on your fork.

@YevIgn
Contributor

YevIgn commented Apr 20, 2023

@MrPowers Small proposal - maybe add a UUID5 generator (not as complete as the Python version, obviously, but better than nothing)? https://github.com/YevIgn/pyspark-uuid5/blob/2055a4aa8429424ef79c248f78aba2a33e462806/src/research_udf_performance.py#L158 - I recently made an attempt to write one; this version doesn't use a UDF, Pandas, or PyArrow.

What is your opinion?

UPD.: opened PR for the function in question - #96
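For reference, a hedged sketch of the general idea (not necessarily the approach in the linked PR): an RFC 4122 version-5 UUID is the SHA-1 of the namespace bytes plus the name, with the version and variant bits spliced in, which can be expressed with built-in column functions:

```python
from pyspark.sql import functions as F

def uuid5(namespace_hex, name_col):
    # SHA-1 over namespace bytes + name yields 40 hex chars; a UUID uses the first 32.
    h = F.sha1(F.concat(F.unhex(F.lit(namespace_hex)), name_col.cast("binary")))
    # Variant: first hex digit of the 4th group must land in 8, 9, a, b.
    variant = F.lower(
        F.conv((F.conv(F.substring(h, 17, 1), 16, 10).cast("int") % 4 + 8).cast("string"), 10, 16)
    )
    return F.concat_ws(
        "-",
        F.substring(h, 1, 8),
        F.substring(h, 9, 4),
        F.concat(F.lit("5"), F.substring(h, 14, 3)),  # force version nibble to 5
        F.concat(variant, F.substring(h, 18, 3)),
        F.substring(h, 21, 12),
    )

# uuid5("6ba7b8109dad11d180b400c04fd430c8", F.col("name"))  # DNS namespace
```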

@MrPowers
Collaborator Author

@robertkossendey - FYI, we added the function to compute the number of business days that fall between two dates. We should probably extend it so the user can supply a list of holidays for their respective countries if that's something they need.
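A sketch of what that extension could look like, building on the sequence/filter approach above (the signature is hypothetical):

```python
from pyspark.sql import functions as F

def business_days_between(start_col, end_col, holidays=None):
    # Count Mon-Fri dates in the interval, then drop user-supplied holidays.
    days = F.filter(F.sequence(start_col, end_col),
                    lambda d: ~F.dayofweek(d).isin(1, 7))
    if holidays:
        days = F.array_except(days, F.array([F.lit(h).cast("date") for h in holidays]))
    return F.size(days)

# business_days_between(F.col("start"), F.col("end"), holidays=["2023-12-25"])
```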

@glallen01

@MrPowers A function that might be useful is one that could generate a Spark schema, including nested structs, from JSON Schema/OpenAPI.

@SemyonSinchenko
Collaborator

> @MrPowers A function that might be useful is one that could generate a Spark schema, including nested structs, from JSON Schema/OpenAPI.

Could you give an example, please? There is already a method pyspark.sql.types.StructType.fromJson(json) that creates a schema object from JSON.
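To illustrate the distinction: StructType.fromJson expects Spark's own schema JSON layout (the one produced by df.schema.jsonValue()), not a JSON Schema or OpenAPI document, so a converter would still have to map between the two formats. For example:

```python
from pyspark.sql.types import StructType

# Spark's schema JSON format, not JSON Schema:
schema = StructType.fromJson({
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": False, "metadata": {}},
        {"name": "name", "type": "string", "nullable": True, "metadata": {}},
    ],
})
```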
