Brainstorming functions to make PySpark easier #83

MrPowers opened this issue Mar 13, 2023 · 15 comments

@MrPowers
Collaborator

MrPowers commented Mar 13, 2023

One of the overall goals of this project is to make PySpark easier. The purpose of this issue is to brainstorm additional functionality that should be added to quinn to make PySpark easier.

Let's look at the Ruby on Rails web framework, which has a well-known ActiveSupport module that makes common web development tasks easy.

When you use Rails and search "rails beginning of week", you're immediately directed to the beginning_of_week function, which is easy.

When you type in "spark beginning of week", you're redirected to this blog post I wrote, which provides a spark-daria Scala function for beginningOfWeek, which is also easy. I can see that users read this blog post every day from the Google Analytics traffic.

When you type in "pyspark beginning of week", you're redirected to this Stackoverflow question, which is scary, complicated, and makes this seem hard. I need to write a blog post that ranks #1 for "pyspark beginning of week" searches that show the quinn.week_start_date() function, so Python users think this is easy. We have a function that's easy, but it's not findable yet.

What are your thoughts for new functions we can add to quinn that will help make PySpark easy?

@robertkossendey

If I understand you correctly, we are looking for new functions that don't exist yet to make PySpark easier, right?

@MrPowers
Collaborator Author

@robertkossendey - yep, what new functions should we add to make PySpark easier?

@robertkossendey

One thing that comes to my mind immediately is calculating how many business days lie between two dates. I've always found it annoying that PySpark does not provide such a function natively.
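A hedged sketch of one way to do this without a UDF, using Spark's sequence and higher-order filter functions (Spark 3.1+; the function name is hypothetical):

```python
from pyspark.sql import functions as F

def business_days_between(start_col, end_col):
    # Generate every date in the closed interval, keep Mon-Fri, count them.
    # In Spark, dayofweek returns 1 for Sunday and 7 for Saturday.
    all_days = F.sequence(start_col, end_col)
    weekdays = F.filter(all_days, lambda d: ~F.dayofweek(d).isin(1, 7))
    return F.size(weekdays)
```

Note that this materializes one array element per day, so it's fine for typical date ranges but wasteful for spans of many years.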

@danielbeach

@MrPowers One thing I find myself doing constantly is extracting a string between two other strings with regex. It's super annoying, and I have to use it all the time. For example, you just want to find the first occurrence of a string between two other strings.
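Something like this hypothetical helper, sketched with regexp_extract (assuming the delimiters are regex-safe literals):

```python
from pyspark.sql import functions as F

def string_between(col, left, right):
    # Lazy quantifier so we capture the *first* occurrence between the markers.
    return F.regexp_extract(col, f"{left}(.*?){right}", 1)

# string_between(F.col("body"), "<title>", "</title>")
```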

@danielbeach

danielbeach commented Mar 13, 2023

@MrPowers Also, in PySpark it can be annoying to write window functions all the time. I could see some sort of function that reduces the boilerplate required for a window function.
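For example, a hypothetical wrapper along these lines could hide the Window spec for the common ranking case:

```python
from pyspark.sql import Window, functions as F

def with_row_number(df, partition_by, order_by, col_name="row_number"):
    # Hypothetical helper: callers pass columns instead of
    # building the Window spec by hand.
    w = Window.partitionBy(*partition_by).orderBy(*order_by)
    return df.withColumn(col_name, F.row_number().over(w))

# with_row_number(df, ["customer_id"], [F.col("order_date").desc()])
```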

@SemyonSinchenko
Collaborator

I think we can add some math sugar. For example, I recently worked on implementing differential privacy and found that only the uniform random distribution is available out of the box; to implement a Laplacian random distribution, you have to invest a lot of time. I also often use random integers to repartition data, but there is no out-of-the-box functionality to generate random ints.
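For instance, a Laplacian distribution can be derived from rand() via the inverse CDF; a sketch (function names are hypothetical):

```python
from pyspark.sql import functions as F

def rand_laplace(mu=0.0, b=1.0, seed=None):
    # Inverse-CDF transform: map U ~ Uniform(-0.5, 0.5) onto Laplace(mu, b):
    #   X = mu - b * sgn(U) * ln(1 - 2|U|)
    u = F.rand(seed) - F.lit(0.5)
    return F.lit(mu) - F.lit(b) * F.signum(u) * F.log(1 - 2 * F.abs(u))

def rand_int(min_val, max_val, seed=None):
    # Uniform random integer in [min_val, max_val], e.g. for repartition keys.
    return F.floor(F.rand(seed) * (max_val - min_val + 1) + min_val).cast("int")
```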

Also, I want to see aggregation functions like first_row, last_row, max_row, min_row, and so on. It is a really common task, for example, to get "the most recent row by date per customer or per product".
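A sketch of the "most recent row per group" pattern such a helper could wrap (names are hypothetical):

```python
from pyspark.sql import Window, functions as F

def latest_row_per_group(df, group_cols, order_col):
    # Keep only the newest row per group, e.g. the latest order per customer.
    w = Window.partitionBy(*group_cols).orderBy(F.col(order_col).desc())
    return (
        df.withColumn("_rn", F.row_number().over(w))
          .filter(F.col("_rn") == 1)
          .drop("_rn")
    )
```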

@SemyonSinchenko
Collaborator

Also, I really want to see some functionality for working with S3/HDFS/DBFS/etc. using only PySpark, but @MrPowers you told me that it is better to do that in the eren repository, not quinn.

@fpvmorais
Contributor

> One thing that comes to my mind immediately is calculating how many business days lie between two dates. I've always found it annoying that PySpark does not provide such a function natively.

Shameless plug :P

https://fpvmorais.com/post/databricks_workdays/

@MrPowers
Collaborator Author

@fpvmorais - looks like your workdays function would be a good contribution. We'd probably be able to figure out how to do it without a UDF. Want to submit a PR?

@fpvmorais
Contributor

I would love to submit it :D Do I just open a PR on this repo?

@MrPowers
Collaborator Author

@fpvmorais - can you please comment on this issue, so I can assign the work to you? You can submit the PR by forking the repo and then creating the PR on your fork.

@YevIgn
Contributor

YevIgn commented Apr 20, 2023

@MrPowers Small proposal - maybe add a UUID5 generator (not as complete as the Python version, obviously, but better than nothing)? https://github.com/YevIgn/pyspark-uuid5/blob/2055a4aa8429424ef79c248f78aba2a33e462806/src/research_udf_performance.py#L158 - I recently made an attempt to write one; this version doesn't use a UDF, Pandas, or PyArrow.

What is your opinion?

UPD.: opened PR for the function in question - #96
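For reference, a hedged sketch of the general idea (not necessarily the approach in the linked PR): an RFC 4122 version-5 UUID is the SHA-1 of the namespace bytes plus the name, with the version and variant bits spliced in, which can be expressed with built-in column functions:

```python
from pyspark.sql import functions as F

def uuid5(namespace_hex, name_col):
    # SHA-1 over namespace bytes + name yields 40 hex chars; a UUID uses the first 32.
    h = F.sha1(F.concat(F.unhex(F.lit(namespace_hex)), name_col.cast("binary")))
    # Variant: first hex digit of the 4th group must land in 8, 9, a, b.
    variant = F.lower(
        F.conv((F.conv(F.substring(h, 17, 1), 16, 10).cast("int") % 4 + 8).cast("string"), 10, 16)
    )
    return F.concat_ws(
        "-",
        F.substring(h, 1, 8),
        F.substring(h, 9, 4),
        F.concat(F.lit("5"), F.substring(h, 14, 3)),  # force version nibble to 5
        F.concat(variant, F.substring(h, 18, 3)),
        F.substring(h, 21, 12),
    )

# uuid5("6ba7b8109dad11d180b400c04fd430c8", F.col("name"))  # DNS namespace
```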

@MrPowers
Collaborator Author

@robertkossendey - FYI, we added the function to compute the number of business days that fall between two dates. We should probably extend it so the user can supply a list of holidays for their respective countries if that's something they need.
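A sketch of what that extension could look like, building on the sequence/filter approach above (the signature is hypothetical):

```python
from pyspark.sql import functions as F

def business_days_between(start_col, end_col, holidays=None):
    # Count Mon-Fri dates in the interval, then drop user-supplied holidays.
    days = F.filter(F.sequence(start_col, end_col),
                    lambda d: ~F.dayofweek(d).isin(1, 7))
    if holidays:
        days = F.array_except(days, F.array([F.lit(h).cast("date") for h in holidays]))
    return F.size(days)

# business_days_between(F.col("start"), F.col("end"), holidays=["2023-12-25"])
```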

@glallen01

@MrPowers A function that might be useful is one that could generate a Spark schema, including nested structs, from JSON Schema/OpenAPI.

@SemyonSinchenko
Collaborator

> @MrPowers A function that might be useful is one that could generate a Spark schema, including nested structs, from JSON Schema/OpenAPI.

Could you give an example, please? There is already a method pyspark.sql.types.StructType.fromJson(json) that creates a schema object from JSON.
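To illustrate the distinction: StructType.fromJson expects Spark's own schema JSON layout (the one produced by df.schema.jsonValue()), not a JSON Schema or OpenAPI document, so a converter would still have to map between the two formats. For example:

```python
from pyspark.sql.types import StructType

# Spark's schema JSON format, not JSON Schema:
schema = StructType.fromJson({
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": False, "metadata": {}},
        {"name": "name", "type": "string", "nullable": True, "metadata": {}},
    ],
})
```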
