Brainstorming functions to make PySpark easier #83
Comments
If I understand you correctly, we are looking for new functions that do not exist yet, to make PySpark easier, right?
@robertkossendey - yep, what new functions should we add to make PySpark easier?
One thing that comes to mind immediately is calculating how many business days lie between two dates. I have always found it annoying that PySpark does not provide such a function natively.
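A minimal sketch of how such a helper could work without a UDF, assuming Spark 3.1+ for the higher-order `filter` function (the `business_days_between` name is made up for illustration, and this version ignores holidays):

```python
from pyspark.sql import Column, SparkSession
from pyspark.sql import functions as F

def business_days_between(start: Column, end: Column) -> Column:
    """Count the weekdays (Mon-Fri) in the inclusive date range [start, end]."""
    all_days = F.sequence(start, end)  # array of every date in the range
    # dayofweek() returns 1 for Sunday and 7 for Saturday, so drop those
    weekdays = F.filter(all_days, lambda d: ~F.dayofweek(d).isin(1, 7))
    return F.size(weekdays)

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2023-01-02", "2023-01-09")], ["start", "end"])
df.select(
    business_days_between(F.col("start").cast("date"), F.col("end").cast("date"))
    .alias("business_days")
).show()
```

Materializing the full date range with `sequence` is fine for typical spans, but a closed-form arithmetic version would be cheaper for ranges covering many years.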
@MrPowers One thing I find myself doing constantly is extracting a string between two other strings with a regex; it's super annoying, and I have to do it all the time. For example, just finding the first occurrence of a string between two other strings.
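A sketch of what that helper could look like on top of `regexp_extract` (the `string_between` name and the usage example are hypothetical):

```python
import re
from pyspark.sql import Column
from pyspark.sql import functions as F

def string_between(col: Column, left: str, right: str) -> Column:
    """First substring of `col` between the literal delimiters `left` and `right`."""
    # re.escape makes delimiters containing regex metacharacters safe; the
    # non-greedy (.*?) in group 1 stops at the first occurrence of `right`.
    pattern = re.escape(left) + "(.*?)" + re.escape(right)
    return F.regexp_extract(col, pattern, 1)

# Hypothetical usage: pull "42" out of strings like "id=42;name=foo".
# df = df.withColumn("extracted", string_between(F.col("raw"), "id=", ";"))
```

Note that `regexp_extract` returns an empty string rather than null when nothing matches, which the helper would probably want to document.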
@MrPowers also, in PySpark it can be annoying to write window functions all the time. I could see a function that reduces the boilerplate required to define a window.
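As one possible shape for such a boilerplate-reducer (the `with_ranked` name and its argument layout are invented for illustration):

```python
from pyspark.sql import DataFrame, Window
from pyspark.sql import functions as F

def with_ranked(df: DataFrame, col_name: str, partition_by, order_by) -> DataFrame:
    """Add a rank column without making the caller build the Window spec."""
    w = Window.partitionBy(*partition_by).orderBy(*order_by)
    return df.withColumn(col_name, F.rank().over(w))

# Hypothetical usage: rank rows within each country by descending sales.
# ranked = with_ranked(df, "rank", partition_by=["country"],
#                      order_by=[F.col("sales").desc()])
```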
I think we can add some […]. Also, I want to see aggregation functions like […].
Also, I really want to see some functionality for working with S3/HDFS/DBFS/etc. using only PySpark, but @MrPowers you told me that it is better to do it in […].
Shameless plug :P
@fpvmorais - looks like your workdays function would be a good contribution. We'd probably be able to figure out how to do it without a UDF. Want to submit a PR?
I would love to submit it :D Do I just open a PR on this repo?
@fpvmorais - can you please comment on this issue, so I can assign the work to you? You can submit the PR by forking the repo and then creating the PR from your fork.
@MrPowers Small proposal: maybe adding a UUID5 generator (not as complete as the Python version, obviously, but better than nothing)? https://github.com/YevIgn/pyspark-uuid5/blob/2055a4aa8429424ef79c248f78aba2a33e462806/src/research_udf_performance.py#L158 - I recently made an attempt to write one; this version doesn't use UDFs, Pandas, or PyArrow. What is your opinion? UPD: opened a PR for the function in question - #96
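For readers curious how a UDF-free UUID5 can work at all, here is a rough sketch of the idea (not the linked implementation): compute the RFC 4122 SHA-1 digest with `sha1`, then splice in the version and variant nibbles with string functions.

```python
import uuid
from pyspark.sql import Column
from pyspark.sql import functions as F

def uuid5(namespace: uuid.UUID, name: Column) -> Column:
    """UUIDv5 from a string column: SHA-1 over namespace bytes + name, no UDF."""
    h = F.sha1(F.concat(F.lit(bytearray(namespace.bytes)), F.encode(name, "UTF-8")))
    # The variant nibble (hex char 17) must become 8/9/a/b: (nibble & 0x3) | 0x8.
    variant = F.lower(
        F.hex(F.conv(F.substring(h, 17, 1), 16, 10).cast("int")
              .bitwiseAND(3).bitwiseOR(8))
    )
    return F.concat_ws(
        "-",
        F.substring(h, 1, 8),                         # time_low
        F.substring(h, 9, 4),                         # time_mid
        F.concat(F.lit("5"), F.substring(h, 14, 3)),  # version nibble forced to 5
        F.concat(variant, F.substring(h, 18, 3)),     # variant + clock_seq
        F.substring(h, 21, 12),                       # node
    )

# Hypothetical usage: deterministic ids from an email column.
# df = df.withColumn("id", uuid5(uuid.NAMESPACE_DNS, F.col("email")))
```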
@robertkossendey - FYI, we added the function to compute the number of business days that fall between two dates. We should probably extend it so the user can supply a list of holidays for their respective countries if that's something they need.
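That extension could be as small as an extra predicate in the weekday filter. A sketch building on the earlier `business_days_between` idea, where `holidays` is a plain Python list of `datetime.date` values supplied by the caller:

```python
import datetime
from pyspark.sql import Column
from pyspark.sql import functions as F

def business_days_between(start: Column, end: Column, holidays) -> Column:
    """Weekday count in [start, end], excluding the supplied holiday dates."""
    return F.size(
        F.filter(
            F.sequence(start, end),
            lambda d: (~F.dayofweek(d).isin(1, 7)) & (~d.isin(holidays)),
        )
    )

# e.g. holidays = [datetime.date(2023, 12, 25), datetime.date(2023, 12, 26)]
```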
@MrPowers A function that might be useful is one that could generate a Spark schema, including nested structs, from a JSON Schema / OpenAPI definition.
Could you give an example, please? Because there is already a method […]
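To make the proposal concrete, here is a rough sketch of a converter that handles a simplified subset of JSON Schema (the `schema_from_json_schema` name and the type mapping are assumptions, not an existing API):

```python
from pyspark.sql.types import (
    ArrayType, BooleanType, DoubleType, LongType,
    StringType, StructField, StructType,
)

# Minimal mapping of JSON Schema primitives to Spark types.
_TYPE_MAP = {
    "string": StringType(),
    "integer": LongType(),
    "number": DoubleType(),
    "boolean": BooleanType(),
}

def schema_from_json_schema(js: dict) -> StructType:
    """Recursively convert a (simplified) JSON Schema dict into a StructType."""
    fields = []
    required = set(js.get("required", []))
    for name, prop in js.get("properties", {}).items():
        if prop.get("type") == "object":
            dtype = schema_from_json_schema(prop)  # nested struct
        elif prop.get("type") == "array":
            dtype = ArrayType(_TYPE_MAP[prop["items"]["type"]])
        else:
            dtype = _TYPE_MAP[prop["type"]]
        fields.append(StructField(name, dtype, nullable=name not in required))
    return StructType(fields)
```

A real version would also need `$ref` resolution, string formats, nullable unions, and arrays of objects.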
One of the overall goals of this project is to make PySpark easier. The purpose of this issue is to brainstorm additional functionality that should be added to quinn to make PySpark easier.

Let's look at the Ruby on Rails web framework, which has the well-known ActiveSupport module to make common web development tasks easy. When you use Rails and search "rails beginning of week", you're immediately directed to the `beginning_of_week` function, which is easy.

When you type in "spark beginning of week", you're directed to a blog post I wrote, which provides a spark-daria Scala function for `beginningOfWeek`, which is also easy. I can see from the Google Analytics traffic that users read this blog post every day.

When you type in "pyspark beginning of week", you're directed to a Stack Overflow question that is scary, complicated, and makes this seem hard. I need to write a blog post that ranks #1 for "pyspark beginning of week" searches and shows the `quinn.week_start_date()` function, so Python users think this is easy. We have a function that's easy, but it's not findable yet.

What are your thoughts on new functions we can add to quinn that will help make PySpark easy?