Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge changes from TfWM #58

Open
wants to merge 87 commits into
base: master
Choose a base branch
from

Conversation

oweno-tfwm
Copy link
Contributor

Large number of changes in this fork - likely to result in some instability until they are bedded in.

  1. Fix activity on LT records not being imported
  2. Fix overlay rules not being applied correctly
    Accompanied by a reasonable number of unit tests for a range of overlay situations
  3. Merge was only written for narrow use case of tables & fields expected in this product – so when it was used to merge a BODS extract with the output from UK2GTFS it resulted in tables and fields being thrown away.
    Now at a minimum maintains all input tables and fields, even if it doesn’t do any special processing on them
  4. Nailed down data type definition rather than letting some of them being determined automatically, have done for all fields present in BODS GTFS, but because some of these are not present in CIF data, this generates a number of unit test warnings.
  5. Write code also was for a specific use case of expected tables & fields, so when presented with merged BODS gtfs file threw data away, now writes all presented tables to it to the output.
    Have also changed the default options on this fn to the highest performing combination

This means you can do things like using it to convert CIF to CSV like this…
# Read In each File
mca <- importMCA(
file = "..\RAIL-CIF\RAIL-CIF.gz",
silent = FALSE,
working_timetable = FALSE,
public_only = public_only
)
writeGtfs( mca, filename = "RAIL-CIF-CSV" )

  1. Various existing filters within the code filter out services / calls for non-public use, but the control of these wasn’t exposed – now exposed up the call stack to be in control of caller.
  2. Even after all that filtering some non-public services were leaking into output, (ECS moves) so since GTFS seems to be tolerant of extra fields, have exposed some extra fields, in the shape of routes.train_category, which then allows the consumer of the data to filter some more as they wish.
  3. …and then in gtfs_clean() enhanced the ‘public only’ filtering to only retain train_category that are for public use (i.e. provides mechanism to filter out ECS moves)
  4. Along a similar lines have added a trips.power_type field so it can be analysed which services are dirty diesel and which are lovely clean electric ones
    (I’ve not actually yet tried pumping these into OTP to see if it moans about the extra fields)
  5. Added mode name onto trip long name so it’s “Bus from….” “Train from….” “Ship from…”
  6. The new loadData functionality was getting triggered every time the package was loaded – including in every child thread when started.
    So have put in a fix to stop this getting activated when creating a child process.

 Still seems to be absolutely battering the API to get fresh data every time I build, think the caching may need a bit more work to look at some dates so it only ‘phones home’ at most once a month ?

  1. Reworked the location load code to make it work with csv files, RDA files containing lat,long coordinates, and RDA files containing geometry objects.
  2. Merged in the recent changes in main trunk (was a bit of a painful merge, think I’ve got it all, but may have made some errors)
  3. Since this is a lot of change, some of which is likely to be breaking – version number uprev to 1.0

performance

  1. Changed from data.frame to data.table in main atoc load / processing and merge.
    There is a high chance that this is a breaking change in some code paths
    – while I’ve got good confidence in the main NR code path because that’s the one I’m testing heavily
    – and also confidence in paths I’ve written units tests for (although they are mainly testing the success path and don’t probe edge cases)
    I’m relying on the unit tests for all the other code paths, so it’s quite likely something somewhere is broken by this change
  2. About a dozen rounds of performance tuning – my initial cut after making all these changes didn’t perform well, so have done a lot of profiling and performance tuning on it to minimise time consuming malloc() and maximise vectorisation

So that’s a lot of change in one go, likely to be a bit of instability on code paths with low unit test coverage.

oweno-tfwm and others added 30 commits August 15, 2023 21:40
… columns not being explicitly set - and some fields assumed to be present in route table (route_desc) are not present in BODS data
…quencies' and 'feed_info' when reading / writing BODS data.

Removed assumptions about table names, all additional tables encountered in zip are read and written.

also performance improvement when writing.
…ay anything else that it wasn't expecting. OK for the GTFS tables generated by this tool from CIF files, but caused data loss when using BODS data.

modified to keep all encountered columns and tables - passes through if there isn't any specific code for the table in question.
renamed ZZ to reflect it's a generic obfuscated freight, not a specific operator
…lendar_date exists with no calendar (rare but valid data configuration)
merge up to latest leeds master
1) make load and merge agnostic about expected tables and fields, so that optional tables and fields are at least passed through rather than discarded. (allows GTFS generated by this tool and BODS to be merged)
2) tighten up data type expectations for fields in BODS data
3) fix bugs when calendar dates have no calendar (this is a legitimate but range arrangement of data)
4) performance improvement while formatting files before writing to file system (3 minute improvement when writing 400mb zipped GTFS file (all GB BODS extract)

There is one line in merge that needs resolution - complex line that I couldn't understand so commented it out.

calendar_dates <- calendar_dates[sapply(calendar_dates, function(x){ifelse(is.null(nrow(x)),0,nrow(x))}) > 0]

no idea what it's supposed to do and no explanatory comment
…o the caller being a different shaped structure being passed back
change to use data.table (likely a breaking change)
performance improvements
add setting of route_long_name "Train from..." to change based on mode
…bles, change default options to highest performing write combination
…as also been extensively profiled and reasonable level of optimisation applied
…of date+time

(this means if github can't be reached should only nag once per day to update instead of on every single package load)
2) add timeout to update check on package load and make it short (but make default when explicitly called long)
(atoc data doesn't seem to have many non-public moves in it to start with)
@mem48
Copy link
Contributor

mem48 commented Sep 8, 2023

Wow, thats quite the PR! THis may take some time to check.

@oweno-tfwm
Copy link
Contributor Author

Yeah, probably easiest way is to review the unit tests first, then review the completely new file atoc_overlay.R , then pull down both branches and diff across them rather than the line-by-line view that the online diff tool tends to give, which is a bit myopic for changes on this scale.

CI build seems to be falling over on something related to update_data stuff, then doesn't seem to get any further into the unit tests in the github CI process, they work ok in R Studio.

_Warning in file(con, "w") :
cannot open file '/home/runner/work/_temp/package/UK2GTFS/extdata/date.txt': No such file or directory
Warning in value[3L] :

  • DONE (UK2GTFS)
    2023-09-08 15:14:52.759665 Process id=7106 threw errors during package load while calling update_data() :Error in file(con, "w"): cannot open the connection

Running specific tests for package ‘UK2GTFS’
Running ‘testthat.R’
Error: Error: Failure in /home/runner/work/_temp/package/UK2GTFS/UK2GTFS-tests/testthat.Rout.fail_

…t completely follow the schema.

Also add 'metro' to long name for underground & Tyne&Wear metro services
…ass time

2) when looking at non-public services include passes marked as 'no pickup/dropoff available' from a GTFS perspective
mem48 added a commit that referenced this pull request Sep 11, 2024
Resolving conflicts with @oweno-tfwm 's PR in a new branch
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants