Clean up old jobs data for dashboard generation #55

Open
raulcd opened this issue Jan 17, 2023 · 6 comments

raulcd (Collaborator) commented Jan 17, 2023

The nightlies job dashboard is great!!!

http://crossbow.voltrondata.com/

But after 7 months of job data we should add a way of cleaning old data from it: both to remove some of the CSVs generated in the repo (https://github.com/ursacomputing/crossbow/tree/master/csv_reports) and to make the trend graphs clearer; right now it is difficult to read the dates, etcetera.
[screenshot of the current dashboard trend graph]

raulcd (Collaborator, Author) commented Jan 17, 2023

@assignUser @boshek what do you think is a good amount of time to keep? In my opinion 120 days should be enough to cover the state for a couple of releases, so we can compare against the job status of the previous release when creating a new one. At the moment the first data points are from mid-May 2022.

boshek (Contributor) commented Jan 18, 2023

Good thought. One idea is to restrict the dates plotted and then display some aggregate values, such as the long-term mean and, say, the 120-day mean, like this:

[screenshot of the proposed plot]
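For illustration, a minimal {dplyr} sketch of that aggregation (toy data; the dashboard's real columns and data source may be named differently):

```r
library(dplyr)

# Toy stand-in for the per-day failure percentages; the dashboard's real
# column names and data source may differ.
nightly <- data.frame(
  date       = seq(Sys.Date() - 365, Sys.Date(), by = "day"),
  pct_failed = runif(366, 0, 30)
)

cutoff <- Sys.Date() - 120

# Long-term mean over everything vs. mean over the last 120 days.
nightly %>%
  summarise(
    long_term_mean = mean(pct_failed),
    mean_120d      = mean(pct_failed[date >= cutoff])
  )
```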

I also think we could introduce some cheap interactivity with plotly so that we get hover capabilities: hovering over a point could give you the exact date, maybe the percent failed, and even exactly which jobs failed if we want.
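A quick {plotly} sketch of the hover idea, reusing the toy `nightly` data frame from the sketch above (the hover text contents are just an example):

```r
library(plotly)

# Hovering a point shows the exact date and the percent failed; the hover
# text could also list exactly which jobs failed if that data is joined in.
plot_ly(
  nightly,
  x = ~date,
  y = ~pct_failed,
  type = "scatter",
  mode = "lines+markers",
  text = ~paste0("Date: ", date, "<br>Failed: ", round(pct_failed, 1), "%"),
  hoverinfo = "text"
)
```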

As for removing the CSVs, I am always a bit resistant to removing any data. Is size the issue? Perhaps we could convert them to Parquet, or maybe we could even write them to a bucket somewhere and then use arrow to query that bucket.
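The bucket idea could look something like this with {arrow} (the S3 URI is a placeholder, not a real bucket, and `date` is an assumed column name):

```r
library(arrow)
library(dplyr)

# Open a Parquet dataset stored in S3 and pull down only the recent window;
# the filter is pushed into the scan, so only matching rows are read into R.
ds <- open_dataset("s3://some-bucket/crossbow-reports/")

cutoff <- Sys.Date() - 120

recent <- ds %>%
  filter(date >= cutoff) %>%
  collect()
```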

assignUser (Contributor) commented

I have experimented with S3 before; pushing 3 CSVs a day makes it quite slow due to the number of objects. But the CSVs compress very well, so using a single Parquet file and rewriting it on each push wouldn't be a problem.
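Rewriting one Parquet file from the daily CSVs could be as small as this {arrow} sketch (the csv_reports path matches the repo layout; the output name and compression choice are illustrative):

```r
library(arrow)
library(dplyr)

# Read all the daily CSVs as one dataset, then rewrite a single
# compressed Parquet file on each push.
reports <- open_dataset("csv_reports", format = "csv") %>%
  collect()

write_parquet(reports, "reports.parquet", compression = "zstd")
```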

+1 for ✨ interactivity :D

assignUser (Contributor) commented

(also a good place for some dogfooding of {arrow})

boshek (Contributor) commented Jan 19, 2023

> single parquet file

And maybe partitioning by month and year would be a good idea too. That would give us some efficiency, especially since the OP wants to trim our look-back window. Even that long-term mean could be calculated efficiently with a query.
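A sketch of the partitioned layout, reusing the combined `reports` data from the sketch above (the dataset path and column names are assumptions, not the dashboard's actual schema):

```r
library(arrow)
library(dplyr)

# Write the reports as a Parquet dataset partitioned by year and month ...
reports %>%
  mutate(
    year  = as.integer(format(date, "%Y")),
    month = as.integer(format(date, "%m"))
  ) %>%
  write_dataset("reports_dataset", partitioning = c("year", "month"))

# ... and the long-term mean becomes a query against the dataset that only
# reads the columns it needs.
open_dataset("reports_dataset") %>%
  summarise(long_term_mean = mean(pct_failed, na.rm = TRUE)) %>%
  collect()
```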

assignUser (Contributor) commented

Ah yeah, that way we would only have to rewrite the latest partition instead of all values. Nice!
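That could look roughly like the following with {arrow}'s `existing_data_behavior = "delete_matching"`, which replaces only the partitions that appear in the data being written (`new_rows`, the column names, and the path are illustrative):

```r
library(arrow)
library(dplyr)

this_year  <- as.integer(format(Sys.Date(), "%Y"))
this_month <- as.integer(format(Sys.Date(), "%m"))

# Keep only the rows for the current year/month and rewrite just that
# partition; files in other partitions are left untouched.
current <- new_rows %>%
  mutate(
    year  = as.integer(format(date, "%Y")),
    month = as.integer(format(date, "%m"))
  ) %>%
  filter(year == this_year, month == this_month)

write_dataset(
  current,
  "reports_dataset",
  partitioning = c("year", "month"),
  existing_data_behavior = "delete_matching"
)
```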

assignUser pushed a commit that referenced this issue Aug 26, 2023
This PR is a draft to address #55. To do this I have ported the report from R Markdown to a Quarto doc and written the viz in JavaScript for more interactivity.

Because of the way this is implemented, it needs to be served via HTTPS rather than opened as a local HTML file, so screenshots it is. Here is the default, which sets the x-axis to the last 120 days, but we can slide to look at only the last ten days or out to the past 6 months.

I have also updated the build table to include passing runs. Because this adds a significant number of rows to the table, I've implemented some interactivity for the build table as well.