2024 HQTA Methodology Revisions (draft) #1252

Status: Draft. Wants to merge 45 commits into base: main. Changes shown are from all commits.

Commits (45)
7f87713  add fixed peaks, test out stop-level averaging (edasmalchi, Oct 3, 2024)
ff8443a  fix example (edasmalchi, Oct 3, 2024)
7051570  even clearer (edasmalchi, Oct 3, 2024)
b25c1ab  fix exploding merge per Tiff suggestion (edasmalchi, Oct 3, 2024)
05c9482  remove debug (edasmalchi, Oct 3, 2024)
62b1e5d  add distance to sb125 path examples (edasmalchi, Oct 4, 2024)
621b2cc  clear map (edasmalchi, Oct 4, 2024)
0f83492  test scenarios, proceed with keeping multi-route aggregation (edasmalchi, Oct 8, 2024)
c41c4e0  update script and vars with separate hq corr and major stop frequence… (edasmalchi, Oct 8, 2024)
5db76c5  switch prep pairwise to major stop precursors, also make threshold ho… (edasmalchi, Oct 8, 2024)
4f06b62  revise sjoin stops to use fixed peak (edasmalchi, Oct 8, 2024)
53211e1  debug script (edasmalchi, Oct 9, 2024)
6da262e  wip (edasmalchi, Oct 9, 2024)
a1911fe  reran gtfs digest portfolio w sept 2024 data (amandaha8, Oct 2, 2024)
b17d47c  timestamp issue when comparing scheduled and rt lags (amandaha8, Oct 1, 2024)
f814a3a  transit bunching 2 min approach, began work on agency metrics in pip… (amandaha8, Oct 1, 2024)
e9deea6  fixing some weird github thing (amandaha8, Oct 2, 2024)
edc9dc3  figuring out why merge_data segment speed portion won't run (amandaha8, Oct 2, 2024)
bb6f702  testing my script for 2024 dates (amandaha8, Oct 3, 2024)
fe0cd52  added agency metrics to makefile and concat func (amandaha8, Oct 4, 2024)
55065f1  do not drop duplicates for feed to organization_name (Oct 4, 2024)
f86e741  rerun crosswalk tables for all dates with additional integer coercing (Oct 4, 2024)
2a69e4e  (remove): empty script (Oct 4, 2024)
7ddeb71  use operator instead of agency for consistency in yml (Oct 4, 2024)
721b5ba  switch ref from helpers to publish_utils and remove it from segment_sp… (Oct 4, 2024)
3298609  break out segment speeds time-series into tabular and geometry (Oct 4, 2024)
6cc2365  add notebook for feeds to organizations (Oct 4, 2024)
d57d448  add new script to Makefile (Oct 4, 2024)
1e68877  turn sco list wide to long (KatrinaMKaiser, Oct 4, 2024)
ac9639c  remove printed contact info (KatrinaMKaiser, Oct 4, 2024)
0cca9e3  summer work dashboard refactor (shweta487, Oct 6, 2024)
b9c1c2f  schedule stop metrics, backfill all dates (Sep 30, 2024)
2f95b2f  remove flex and private datasets from published_operators.yml (Sep 30, 2024)
6e6fed3  deprecate old config.yml function (Sep 30, 2024)
b47f494  add publish_utils for patching in previous dates and test on stops file (Oct 3, 2024)
c0b656e  combine publish_utils and prep_traffic_ops and update data dict (Oct 3, 2024)
680c154  (remove): publish_utils, combined into open_data_utils (Oct 3, 2024)
e77233a  refactor create routes and add patching (Oct 4, 2024)
605e2e5  (remove): open_data script, work it into metadata_update_pro script (Oct 4, 2024)
bcd1eb3  add list of route_ids to scheduled stops, refactor geoportal routes l… (Oct 4, 2024)
13ca9c0  update metadata with new columns for stops added (Oct 8, 2024)
3d3eee0  clean up nb (edasmalchi, Oct 9, 2024)
4b2201b  change intersection buffer, use new trips_hr cols (edasmalchi, Oct 10, 2024)
007cf76  allow selecting either hq corridor or ms precursor (edasmalchi, Oct 10, 2024)
61731a0  run full pipeline, start qa (edasmalchi, Oct 10, 2024)
2 changes: 1 addition & 1 deletion _shared_utils/setup.py
```diff
@@ -4,7 +4,7 @@
 setup(
     name="shared_utils",
     packages=find_packages(),
-    version="2.6",
+    version="2.7",
     description="Shared utility functions for data analyses",
     author="Cal-ITP",
     license="Apache",
```
18 changes: 0 additions & 18 deletions _shared_utils/shared_utils/catalog_utils.py
```diff
@@ -5,7 +5,6 @@
 from typing import Literal

 import intake
-import yaml
 from omegaconf import OmegaConf  # this is yaml parser

 repo_name = "data-analyses/"
@@ -22,20 +21,3 @@ def get_catalog(catalog_name: Literal["shared_data_catalog", "gtfs_analytics_data"]

     else:
         return intake.open_catalog(catalog_path)
-
-
-def get_parameters(config_file: str, key: str) -> dict:
-    """
-    Parse the config.yml file to get the parameters needed
-    for working with route or stop segments.
-    These parameters will be passed through the scripts when working
-    with vehicle position data.
-
-    Returns a dictionary of parameters.
-    """
-    # https://aaltoscicomp.github.io/python-for-scicomp/scripts/
-    with open(config_file) as f:
-        my_dict = yaml.safe_load(f)
-    params_dict = my_dict[key]
-
-    return params_dict
```
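Note: with `get_parameters()` removed, config values are read through the intake/OmegaConf catalog instead of raw `yaml.safe_load`. A minimal sketch of the replacement pattern, assuming only the catalog name and keys that appear elsewhere in this PR:

```python
from shared_utils import catalog_utils

# OmegaConf allows attribute-style access, replacing yaml.safe_load + dict indexing.
GTFS_DATA_DICT = catalog_utils.get_catalog("gtfs_analytics_data")

# e.g. the crosswalk table name used in publish_utils below
crosswalk_file = GTFS_DATA_DICT.schedule_tables.gtfs_key_crosswalk
```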
5 changes: 5 additions & 0 deletions _shared_utils/shared_utils/gtfs_analytics_data.yml
```diff
@@ -52,14 +52,19 @@ rt_vs_schedule_tables:
   sched_route_direction_metrics: "schedule_route_dir/schedule_route_direction_metrics"
   vp_trip_metrics: "vp_trip/trip_metrics"
   vp_route_direction_metrics: "vp_route_dir/route_direction_metrics"
+  vp_operator_metrics: "vp_operator/operator_metrics"
+  sched_stop_metrics: "schedule_stop/schedule_stop_metrics"
+  #vp_stop_metrics: "vp_stop/vp_stop_metrics" # WIP: transit bunching
   schedule_rt_stop_times: "schedule_rt_stop_times"
   early_trip_minutes: -5
   late_trip_minutes: 5

+
 digest_tables:
   dir: ${gcs_paths.RT_SCHED_GCS}
   route_schedule_vp: "digest/schedule_vp_metrics"
   route_segment_speeds: "digest/segment_speeds"
+  route_segment_geometry: "digest/segment_speeds_geom"
   operator_profiles: "digest/operator_profiles"
   operator_routes_map: "digest/operator_routes"
   operator_sched_rt: "digest/operator_schedule_rt_category"
```
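Note: `dir` uses OmegaConf interpolation (`${gcs_paths.RT_SCHED_GCS}`), so a full parquet path can be assembled straight from the catalog. A sketch, with the analysis date below purely illustrative:

```python
from shared_utils import catalog_utils

GTFS_DATA_DICT = catalog_utils.get_catalog("gtfs_analytics_data")
DIGEST = GTFS_DATA_DICT.digest_tables

# The ${...} reference resolves when the value is read, so DIGEST.dir is
# already a concrete gs:// prefix here; "2024-10-16" is a placeholder date.
geom_path = f"{DIGEST.dir}{DIGEST.route_segment_geometry}_2024-10-16.parquet"
```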
32 changes: 31 additions & 1 deletion _shared_utils/shared_utils/publish_utils.py
```diff
@@ -1,12 +1,16 @@
 import os
 from pathlib import Path
-from typing import Union
+from typing import Literal, Union

 import gcsfs
 import geopandas as gpd
 import pandas as pd
+from shared_utils import catalog_utils

 fs = gcsfs.GCSFileSystem()
+SCHED_GCS = "gs://calitp-analytics-data/data-analyses/gtfs_schedule/"
+PUBLIC_BUCKET = "gs://calitp-publish-data-analysis/"
+GTFS_DATA_DICT = catalog_utils.get_catalog("gtfs_analytics_data")


 def write_to_public_gcs(
@@ -59,3 +63,29 @@ def exclude_private_datasets(
     Filter out private datasets.
     """
     return df[df[col].isin(public_gtfs_dataset_keys)].reset_index(drop=True)
+
+
+def subset_table_from_previous_date(
+    gcs_bucket: str,
+    filename: Union[str, Path],
+    operator_and_dates_dict: dict,
+    date: str,
+    crosswalk_col: str = "schedule_gtfs_dataset_key",
+    data_type: Literal["df", "gdf"] = "df",
+) -> pd.DataFrame:
+    CROSSWALK_FILE = GTFS_DATA_DICT.schedule_tables.gtfs_key_crosswalk
+
+    crosswalk = pd.read_parquet(f"{SCHED_GCS}{CROSSWALK_FILE}_{date}.parquet", columns=["name", crosswalk_col])
+
+    subset_keys = crosswalk[crosswalk.name.isin(operator_and_dates_dict[date])][crosswalk_col].unique()
+
+    if data_type == "df":
+        past_df = pd.read_parquet(
+            f"{gcs_bucket}{filename}_{date}.parquet", filters=[[(crosswalk_col, "in", subset_keys)]]
+        )
+    else:
+        past_df = gpd.read_parquet(
+            f"{gcs_bucket}{filename}_{date}.parquet", filters=[[(crosswalk_col, "in", subset_keys)]]
+        )
+
+    return past_df
```
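Note: a possible call pattern for `subset_table_from_previous_date()`. The operator names, date, and filename below are illustrative stand-ins (in practice they would come from `published_operators.yml`), not values taken from this PR:

```python
from shared_utils import publish_utils

# Hypothetical patch: pull two operators' stops from an earlier run date.
operator_and_dates_dict = {
    "2024-09-18": ["Foothill Transit", "City of Duarte"],
}

past_stops = publish_utils.subset_table_from_previous_date(
    gcs_bucket="gs://calitp-analytics-data/data-analyses/gtfs_schedule/",
    filename="stops",
    operator_and_dates_dict=operator_and_dates_dict,
    date="2024-09-18",
    data_type="gdf",  # returns a GeoDataFrame via gpd.read_parquet
)
```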
19 changes: 8 additions & 11 deletions _shared_utils/shared_utils/schedule_rt_utils.py
```diff
@@ -151,19 +151,16 @@ def get_organization_id(
     sorting = [True for c in merge_cols]
     keep_cols = ["organization_source_record_id"]

-    # Eventually, we need to move to 1 organization name, so there's
-    # no fanout when we merge it on
-    # Until then, handle it by dropping duplicates and pick 1 name
-    dim_provider_gtfs_data2 = (
-        dim_provider_gtfs_data2.sort_values(
-            merge_cols + ["_valid_to", "_valid_from"], ascending=sorting + [False, False]
-        )
-        .drop_duplicates(merge_cols)
-        .reset_index(drop=True)[merge_cols + keep_cols]
-    )
+    # We allow fanout when merging a feed to multiple organization names,
+    # but we should handle it by selecting a preferred organization name
+    # rather than an alphabetical one
+    # (e.g. organization names Foothill Transit and City of Duarte).
+    dim_provider_gtfs_data2 = dim_provider_gtfs_data2.sort_values(
+        merge_cols + ["_valid_to", "_valid_from"], ascending=sorting + [False, False]
+    ).reset_index(drop=True)[merge_cols + keep_cols]

     df2 = pd.merge(df, dim_provider_gtfs_data2, on=merge_cols, how="inner")
+    # return dim_provider_gtfs_data2

     return df2
```
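Note: the revised comment leaves "selecting a preferred" organization name as future work. One hedged sketch of what that selection could look like; the column names, feed key, and priority list here are assumptions for illustration, not part of this PR:

```python
import pandas as pd

# Assumed curated priority list; nothing like this is defined in the PR.
PREFERRED_ORG_NAMES = ["Foothill Transit"]

def pick_preferred_org(df: pd.DataFrame, feed_col: str = "feed_key") -> pd.DataFrame:
    """Collapse feed-to-organization fanout, preferring curated names
    over whichever name happens to sort first alphabetically."""
    ranked = df.assign(
        _not_preferred=(~df.organization_name.isin(PREFERRED_ORG_NAMES)).astype(int)
    )
    return (
        ranked.sort_values([feed_col, "_not_preferred", "organization_name"])
        .drop_duplicates(subset=feed_col)
        .drop(columns="_not_preferred")
        .reset_index(drop=True)
    )
```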
2 changes: 1 addition & 1 deletion ahsc_grant/ACS_eda.ipynb
```diff
@@ -6629,7 +6629,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.6"
+   "version": "3.9.13"
   }
  },
  "nbformat": 4,
```
2 changes: 1 addition & 1 deletion ahsc_grant/process_mst.ipynb
```diff
@@ -1498,7 +1498,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.5"
+   "version": "3.9.13"
   }
  },
  "nbformat": 4,
```