Clean up old jobs data for dashboard generation #55
Comments
@assignUser @boshek what do you think is a good amount of time to keep? In my opinion 120 days should be enough to give us the state for a couple of releases, so we can compare what the job status was on the previous release when we are creating a new one. At the moment the first data points are from mid-May 2022.
Good thought. One idea is to restrict the dates plotted and then display some aggregates to show the long-term mean and, say, the 120-day mean. I also think we could introduce some cheap interactivity with plotly such that we could have some hover capabilities: hover over a point and it gives you the exact date, maybe the percent failed, and even exactly what failed if we want. As far as removing the csvs, I am always a bit resistant to removing any data. Is size the issue? Perhaps we could convert to parquet, or maybe we could even write them to a bucket somewhere and then use arrow to query that bucket.
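Both aggregates are cheap to compute from the daily series. A hedged stdlib sketch with made-up percent-failed numbers:

```python
from datetime import date, timedelta

# Hypothetical daily series: (date, fraction of jobs that failed).
series = [
    (date(2022, 5, 15), 0.30),
    (date(2022, 8, 10), 0.10),
    (date(2022, 11, 20), 0.20),
]

def long_term_mean(series):
    """Mean over the entire history."""
    return sum(v for _, v in series) / len(series)

def window_mean(series, today, days=120):
    """Mean over only the trailing look-back window."""
    cutoff = today - timedelta(days=days)
    recent = [v for d, v in series if d >= cutoff]
    return sum(recent) / len(recent)

overall = long_term_mean(series)
last_120 = window_mean(series, today=date(2022, 11, 30))
```

Either number could be drawn as a horizontal reference line on the trend plot.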
I have tested around with S3 before; with 3 csvs a day it gets quite slow due to the number of objects. But the csvs compress very well, so using a single parquet file and re-writing it on each push wouldn't be a problem. +1 for ✨ interactivity :D
(also a good place for some dogfooding of {arrow})
And maybe partitioning by month and year would be a good idea too. That would give us some efficiency, especially once we trim our look-back window as proposed in the OP. Even that long-term mean could be calculated efficiently with a query.
Ah yeah, that way we would only have to rewrite the latest partition instead of all values. Nice!
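The point about rewriting only the latest partition follows from a hive-style year/month directory layout: new rows always land in the current month's partition, so older partitions never change. A stdlib sketch of the grouping (the layout and row format are assumptions; in practice arrow's dataset writer would produce the directories):

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows: (date, task, status).
rows = [
    (date(2022, 10, 3), "test-r-linux", "success"),
    (date(2022, 11, 20), "test-r-linux", "failure"),
    (date(2022, 11, 21), "test-r-linux", "success"),
]

def partition_key(d):
    # Hive-style directory name, e.g. "year=2022/month=11".
    return f"year={d.year}/month={d.month:02d}"

def group_by_partition(rows):
    """Bucket rows by the partition directory they would be written to."""
    parts = defaultdict(list)
    for row in rows:
        parts[partition_key(row[0])].append(row)
    return dict(parts)

parts = group_by_partition(rows)
# Appending today's results only touches "year=2022/month=11";
# "year=2022/month=10" never needs to be rewritten.
```

A query for the long-term mean would then scan all partitions, while the dashboard's 120-day window only needs to read the most recent few.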
This PR is a draft to address #55. To do this I have ported the report from R Markdown to a Quarto doc and written the viz in JavaScript for more interactivity. Because of the way this is implemented, it needs to be served via HTTPS rather than opened as a local HTML file, so screenshots it is. Here is the default, which sets the x-axis to extend over the last 120 days, but we can slide to look at only the last ten days or at the past 6 months. I have also updated the build table to include passing runs. Because this adds a significant number of rows, I've implemented some interactivity for the build table, which looks like this:
The nightlies job dashboard is great!!!
http://crossbow.voltrondata.com/
But after 7 months of jobs information we should add a way of cleaning old data from it, both to remove some of the csvs generated in the repo: https://github.com/ursacomputing/crossbow/tree/master/csv_reports
and to make the trend graphs clearer; right now it is difficult to read the dates, etc.