
Extend GraphQL request schema to pull statistics for PRs #22

Open
wants to merge 1 commit into base: main
Conversation

robertbindar

Hi!
As part of our efforts to analyze how well we're handling community contributions, the MariaDB Foundation ended up using this awesome tool to pull GitHub activity from the MariaDB/server repo. In the process we realized it's useful to have a few more statistics for each PR (lines added, lines deleted, number of changed files).
Here is what we changed to make this work. If you think this may break existing apps/scripts using github-activity, I can put this functionality behind an "--include-pr-stats" option.
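For reference, here's a minimal sketch of how the extra statistics can be requested (the surrounding query structure is illustrative, not the exact query github-activity builds; `additions`, `deletions`, and `changedFiles` are the actual field names on GitHub's GraphQL PullRequest type):

```python
# Extra per-PR statistics fields; these exist on GitHub's PullRequest
# GraphQL type.
PR_STATS_FIELDS = ["additions", "deletions", "changedFiles"]

def build_pr_query(include_pr_stats=True):
    """Build an illustrative search query for PRs, optionally with the
    extra statistics fields spliced in."""
    stats = "\n            ".join(PR_STATS_FIELDS) if include_pr_stats else ""
    return f"""
    {{
      search(query: "repo:MariaDB/server is:pr", type: ISSUE, first: 50) {{
        nodes {{
          ... on PullRequest {{
            number
            title
            {stats}
          }}
        }}
      }}
    }}"""
```

With an opt-in flag like the suggested `--include-pr-stats`, existing users' queries would stay byte-for-byte identical unless they ask for the new fields.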

Many thanks for this work and please let me know if there's anything I can help with.

@choldgraf
Member

@robertbindar awesome to hear that this tool is useful for your community!

Something I have wrestled with is: how can we make the GraphQL queries extensible in a way that doesn't require a PR each time we want extra information. I wonder if there could be a section of the PR and ISSUES queries that is configurable at the command-line level. Basically it would insert a few query parameters at request time. Any thoughts on that?

(and either way, this PR looks good to me...I think we can leave the extensible-queries bit for another issue...my only concern is that at some point users will start hitting request limits if we ask for everything; I see you've already run into this somewhat in #23)

@choldgraf
Member

choldgraf commented Feb 20, 2020

btw - as you're doing community activity analysis, perhaps you'd be interested to see the original repository that motivated the creation of this tool for me :-)

https://github.com/choldgraf/jupyter-activity-snapshot

e.g.:

https://nbviewer.jupyter.org/github/choldgraf/jupyter-activity-snapshot/blob/master/reports/summaries/time.ipynb

It's another open source community that I've been a part of, and I've found this tool helpful in getting high-level views of what's going on.

I'm also still trying to figure out the right way to download and cache the data that is downloaded (#18)...if you have any ideas on that, I'd love to discuss or take a look at any PRs

@robertbindar
Author

@choldgraf many thanks for the quick answer! :-)
On Jupyter activity analysis,
Very cool work! Hope you won't mind if I try to find inspiration in the metrics and charts you derived there :D
I'm currently trying to convince our organization to use notebooks and nbviewer as well, since that makes it far easier to reproduce the statistics each year (so far there have been scripts scattered around the org for pulling data and plotting charts, which made it quite hard to redo the statistics). Plus, having them online for the whole community to see is more transparent, which I like.
Here's the prototype I have so far: https://nbviewer.jupyter.org/github/robertbindar/mariadb.org-tools/blob/master/reporting/MariaDB_Statistics_2019.ipynb
Repo here: https://github.com/robertbindar/mariadb.org-tools/tree/master/reporting
A particular metric I'm proud of is something I called the Contributor Frustration Metric (the total time contributors have waited for us to merge their PRs since we've been on GitHub). I think it can help us see which pain points we should address with the highest urgency in the upcoming year.

On caching downloaded data (#18),
For my use cases this would be very useful because it would considerably shorten the time I wait for results (I tend to make lots of little mistakes in the process and run the github-activity tool many times before I have the desired dataset, and I also query all-time data, 2014-2019).
If I'm understanding the problem correctly, my guess is that the GraphQL request should be made more granular (i.e. split into multiple small requests with some logic based on common use cases) so that github-activity can construct a dataset from both client-side cached data and fresh data from GraphQL. I'll try to think more about this, but I'd say the most important step now is to see how people use this tool, so you can come up with a cache design that will actually reduce their query time.

On configurable GraphQL requests,
A solution I see now is to make github-activity able to parse some sort of config file (e.g. columns.cfg) which users fill with GraphQL fields. These fields would be validated against an internal list of fields we know are valid GraphQL, and then used to construct the request.
A similar option is to allow something like --include-options "[graphql_field1, graphql_field2, ...]" on the command line and then do the same validation. In both cases, these fields should be documented somewhere (wiki page, --help output) so users know what they can request.
I'll think more about these.
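As a rough sketch of the validation step (the option name, parser, and allowlist contents here are all hypothetical examples, not an existing github-activity API):

```python
# Example allowlist of GraphQL fields we know are valid on PullRequest.
KNOWN_PR_FIELDS = {"additions", "deletions", "changedFiles", "mergedAt"}

def parse_include_options(raw, known_fields=KNOWN_PR_FIELDS):
    """Parse a value like "[additions, changedFiles]" from a
    hypothetical --include-options flag and validate each field
    against the allowlist, failing loudly on unknown names."""
    fields = [f.strip() for f in raw.strip("[] ").split(",") if f.strip()]
    unknown = [f for f in fields if f not in known_fields]
    if unknown:
        raise ValueError(f"unknown GraphQL fields: {', '.join(unknown)}")
    return fields
```

Rejecting unknown fields before building the request keeps a typo from turning into a confusing GraphQL server error, and the allowlist doubles as the documentation source for --help.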

@choldgraf
Member

Hope you won't mind if I try to find inspiration in the metrics and charts you derived there

Of course not!

On configurable GraphQL requests

Yep, I was imagining something like that too.
