+
Introduction
+
+
This vignette provides users with an example analysis using the serocalculator
+package by reproducing the analysis for: Estimating
+typhoid incidence from community-based serosurveys: a multicohort
+study (Aiemjoy et al.
+(2022)). We review the methods underlying the analysis and then
+walk through an example of enteric fever incidence in Pakistan. Note
+that because this is a simplified version of the analysis, the results
+here will differ slightly from those presented in the publication.
+
In this example, users will determine the seroincidence of enteric
+fever in cross-sectional serosurveys conducted as part of the
+SeroEpidemiology and Environmental Surveillance (SEES) for enteric fever
+study in Bangladesh, Nepal, and Pakistan. Longitudinal antibody
+responses were modeled from 1420 blood culture-confirmed enteric fever
+cases enrolled from the same countries.
+
+
+
Load data
+
+
Pathogen-specific sample datasets, noise parameters, and longitudinal
+antibody dynamics for serocalculator are available on
+the Serocalculator Data Repository
+on Open Science Framework (OSF). We will pull this data directly into
+our R environment.
+
Note that each dataset has specific formatting and variable name
+requirements.
+
+
Load and prepare longitudinal parameter data
+
+
We will first load the longitudinal curve parameters to set the
+antibody decay parameters. In this example, these parameters were
+modeled with Bayesian hierarchical models to fit two-phase
+power-function decay models to the longitudinal antibody responses among
+confirmed enteric fever cases.
+
Formatting Specifications: Data should be imported as a
+“wide” dataframe with one column for each parameter and one row for each
+iteration of the posterior distribution for each antigen isotype. Column
+names must exactly match follow the naming conventions:
+
+
+
+
+y0 |
+Baseline antibody concentration |
+
+
+y1 |
+Peak antibody concentration |
+
+
+t1 |
+Time to peak antibody concentration |
+
+
+alpha |
+Antibody decay rate |
+
+
+r |
+Antibody decay shape |
+
+
+
+
Note that variable names are case-sensitive
+
+# Import longitudinal antibody parameters from OSF
+curves <-
+ "https://osf.io/download/rtw5k/" %>%
+ load_curve_params()
+
+
+
Visualize curve parameters
+
+
We can graph the decay curves with an autoplot()
+method:
+
+# Visualize curve parameters
+curves %>% filter(antigen_iso == "HlyE_IgA" |
+ antigen_iso == "HlyE_IgG") %>%
+ autoplot()
+
+
+
+
Load and prepare cross-sectional data
+
+
Next, we load our sample cross-sectional data. We will use a subset
+of results from the SEES dataset. Ideally, this will be a representative
+sample of the general population without regard to disease status.
+Later, we will limit our analysis to cross-sectional data from
+Pakistan.
+
We have selected hemolysin E (HlyE) as our target antigen
+and IgG and IgA as our target immunoglobulin isotypes.
+Users may select different serologic markers depending on what is
+available in your data. From the original dataset, we rename our result
+and age variables to the names required by
+serocalculator.
+
Formatting Specifications: Cross-sectional population data
+should be a “long” dataframe with one column for each variable and one
+row for each antigen isotype resulted for an individual. So the same
+individual will have more than one row if they have results for more
+than one antigen isotype. The dataframe can have additional variables,
+but the two below are required:
+
+
+
+
+value |
+Quantitative antibody response |
+
+
+age |
+Numeric age |
+
+
+
+
Note that variable names are case sensitive
+
+#Import cross-sectional data from OSF and rename required variables
+xs_data <- readr::read_rds("https://osf.io/download//n6cp3/") %>%
+ as_pop_data()
+
+
+
Summarize antibody data
+
+
We can compute numerical summaries of our cross-sectional antibody
+data with a summary()
method for pop_data
+objects:
+
+xs_data %>% summary()
+#>
+#> n = 3336
+#>
+#> Distribution of age:
+#>
+#> # A tibble: 1 × 7
+#> n min first_quartile median mean third_quartile max
+#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
+#> 1 3336 0.6 5 10 10.5 15 25
+#>
+#> Distributions of antigen-isotype measurements:
+#>
+#> # A tibble: 2 × 7
+#> antigen_iso Min `1st Qu.` Median `3rd Qu.` Max `# NAs`
+#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
+#> 1 HlyE_IgA 0 0.851 1.74 3.66 133. 0
+#> 2 HlyE_IgG 0 1.15 2.70 6.74 219. 0
+
+
+
Visualize antibody data
+
+
We examine our cross-sectional antibody data by visualizing the
+distribution of quantitative antibody responses. Here, we will look at
+the distribution of our selected antigen and isotype pairs, HlyE IgA and
+HlyE IgG, across participating countries.
+
+
+#color palette
+country_pal <- c("#EA6552", "#8F4B86", "#0099B4FF")
+
+xs_data %>% autoplot(strata = "Country", type = "density") +
+ scale_fill_manual(values = country_pal)
+
+
We see that across countries, our data is highly skewed with the
+majority of responses on the lower end of our data with long tails.
+Let’s get a better look at the distribution by log transforming our
+antibody response value.
+
+# Create log transformed plots
+
+xs_data %>%
+ mutate(Country = fct_relevel(Country, "Bangladesh", "Pakistan", "Nepal")) %>%
+ autoplot(strata = "Country", type = "density") +
+ scale_fill_manual(values = country_pal) +
+ scale_x_log10(labels = scales::label_comma())
+#> Warning in scale_x_log10(labels = scales::label_comma()): log-10
+#> transformation introduced infinite values.
+#> Warning: Removed 18 rows containing non-finite outside the scale range
+#> (`stat_density()`).
+
+
Once log transformed, our data looks much more normally distributed.
+In most cases, log transformation will be the best way to visualize
+serologic data.
+
Let’s also take a look at how antibody responses change by age.
+
+
+#Plot antibody responses by age
+ggplot(data = xs_data %>%
+ mutate(Country = fct_relevel(
+ Country, "Bangladesh", "Pakistan", "Nepal"
+ )),
+aes(x = age, y = value, color = Country)) +
+ geom_point(size = 0.6, alpha = 0.7) +
+ geom_smooth(method = "lm", se = FALSE) +
+ scale_y_log10() +
+ theme_linedraw() +
+ labs(title = "Quantitative Antibody Results by Age", x = "Age", y = "Value") +
+ scale_color_manual(values = country_pal)
+
+
In this plot, a steeper slope indicates a higher incidence. We can
+see that the highest burden is in Bangladesh. Nepal has a slightly
+higher incidence in the older group (higher slope).
+
+
+
Load noise parameters
+
+
Next, we must set conditions based on some assumptions about the data
+and errors that may need to be accounted for. This will differ based on
+background knowledge of the data.
+
The biological noise,
+
+(“nu”), represents error from cross-reactivity to other antibodies. It
+is defined as the 95th percentile of the distribution of antibody
+responses to the antigen-isotype in a population with no exposure.
+
Measurement noise,
+
+(“epsilon”), represents measurement error from the laboratory testing
+process. It is defined by a CV (coefficient of variation) as the ratio
+of the standard deviation to the mean for replicates. Note that the CV
+should ideally be measured across plates rather than within the same
+plate.
+
Formatting Specifications: Noise parameter data should be a
+dataframe with one row for each antigen isotype and columns for each
+noise parameter below.
+
+
+
+
+y.low |
+Lower limit of detection of the assay |
+
+
+nu |
+Biologic noise |
+
+
+y.high |
+Upper limit of detection of the assay |
+
+
+eps |
+Measurement noise |
+
+
+
+
Note that variable names are case-sensitive.
+
+# Import noise parameters from OSF
+
+noise <- url("https://osf.io/download//hqy4v/") %>% readRDS()
+
+
+
+
Estimate Seroincidence
+
+
Now we are ready to begin estimating seroincidence. We will conduct
+two separate analyses using two distinct functions,
+est.incidence
and est.incidence.by
, to
+calculate the overall seroincidence and the stratified seroincidence,
+respectively.
+
+
Overall Seroincidence
+
+
Using the function est.incidence
, we filter to sites in
+Pakistan and define the datasets for our cross-sectional data
+(pop_data), longitudinal parameters (curve_param), and noise parameters
+(noise_param). We also define the antigen-isotype pairs to be included
+in the estimate (antigen_isos). Here, we have chosen to use two antigen
+isotypes, but users can add additional pairs if available.
+
+# Using est.incidence (no strata)
+
+est1 <- est.incidence(
+ pop_data = xs_data %>% filter(Country == "Pakistan"),
+ curve_param = curves,
+ noise_param = noise %>% filter(Country == "Pakistan"),
+ antigen_isos = c("HlyE_IgG", "HlyE_IgA")
+)
+
+summary(est1)
+#> # A tibble: 1 × 10
+#> est.start incidence.rate SE CI.lwr CI.upr coverage log.lik iterations
+#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
+#> 1 0.1 0.128 0.00682 0.115 0.142 0.95 -2376. 4
+#> # ℹ 2 more variables: antigen.isos <chr>, nlm.convergence.code <ord>
+
+
+
Stratified Seroincidence
+
+
We can also produce stratified seroincidence estimates. Users can
+select one or more stratification variables in their cross-sectional
+population dataset. Let’s compare estimates across all countries and by
+age group.
+
+#Using est.incidence.by (strata)
+
+est_country_age <- est.incidence.by(
+ strata = c("Country", "ageCat"),
+ pop_data = xs_data,
+ curve_params = curves,
+ curve_strata_varnames = NULL,
+ noise_params = noise,
+ noise_strata_varnames = "Country",
+ antigen_isos = c("HlyE_IgG", "HlyE_IgA"),
+ num_cores = 8 # Allow for parallel processing to decrease run time
+)
+#> Warning in count_strata(., strata_varnames): The number of observations in `data` varies between antigen isotypes, for at least one stratum. Sample size for each stratum will be calculated as the minimum number of observations across all antigen isotypes.
+#> Warning in check_parallel_cores(.): This computer appears to have 4 cores
+#> available. `est.incidence.by()` has reduced its `num_cores` argument to 3 to
+#> avoid destabilizing the computer.
+
+summary(est_country_age)
+#> Seroincidence estimated given the following setup:
+#> a) Antigen isotypes : HlyE_IgG, HlyE_IgA
+#> b) Strata : Country, ageCat
+#>
+#> Seroincidence estimates:
+#> # A tibble: 9 × 14
+#> Stratum Country ageCat n est.start incidence.rate SE CI.lwr CI.upr
+#> <chr> <chr> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
+#> 1 Stratum 1 Banglad… <5 101 0.1 0.400 0.0395 0.330 0.485
+#> 2 Stratum 2 Banglad… 5-15 256 0.1 0.477 0.0320 0.418 0.544
+#> 3 Stratum 3 Banglad… 16+ 44 0.1 0.449 0.0763 0.322 0.627
+#> 4 Stratum 4 Nepal <5 171 0.1 0.0203 0.00444 0.0132 0.0311
+#> 5 Stratum 5 Nepal 5-15 378 0.1 0.0355 0.00311 0.0299 0.0421
+#> 6 Stratum 6 Nepal 16+ 211 0.1 0.0935 0.00776 0.0795 0.110
+#> 7 Stratum 7 Pakistan <5 126 0.1 0.106 0.0136 0.0823 0.136
+#> 8 Stratum 8 Pakistan 5-15 261 0.1 0.115 0.00845 0.0991 0.132
+#> 9 Stratum 9 Pakistan 16+ 107 0.1 0.190 0.0204 0.154 0.235
+#> # ℹ 5 more variables: coverage <dbl>, log.lik <dbl>, iterations <int>,
+#> # antigen.isos <chr>, nlm.convergence.code <ord>
+
Note that we get a warning about uneven observations between antigen
+isotypes, meaning some participants do not have results for both
+HlyE IgA and HlyE IgG. The warning indicates that the
+“Sample size for each stratum will be calculated as the minimum
+number of observations across all antigen isotypes”, so only
+participants with both antigen isotypes are included. To avoid this,
+filter the dataset to include only records with all specified antigen
+isotypes.
+
We set curve_strata_varnames = NULL
to avoid
+stratification in the “curves” dataset because it does not include our
+strata variables. Without this, a warning appears: “curve_params is
+missing all strata variables, and will be used unstratified”. To
+stratify based on variables that exist in a longitudinal curve
+parameters dataset, specify variables using
+curve_strata_varnames
, similar to how
+noise_strata_varnames
is used for “noise” above.
+
Finally, let’s visualize our seroincidence estimates by country and
+age category.
+
+# Plot seroincidence estimates
+
+## Save summary(est_country_age) as a dataframe and sort by incidence rate
+est_country_agedf <- summary(est_country_age) %>%
+ mutate(
+ Country = fct_relevel(Country, "Bangladesh", "Pakistan", "Nepal"),
+ ageCat = factor(ageCat)
+ )
+
+## Create plot by country and age category
+ggplot(est_country_agedf,
+ aes(
+ y = fct_rev(ageCat),
+ x = incidence.rate * 1000, #rescale incidence
+ fill = Country
+ )) +
+ geom_bar(stat = "identity",
+ position = position_dodge(),
+ show.legend = TRUE) +
+ geom_errorbar(
+ aes(xmin = CI.lwr * 1000, xmax = CI.upr * 1000), #rescale CIs
+ position = position_dodge(width = 0.9),
+ width = .2
+ ) +
+ labs(title = "Enteric Fever Seroincidence by Country and Age",
+ x = "Seroincidence rate per 1000 person-years",
+ y = "Age Category",
+ fill = "Country") +
+ theme_linedraw() +
+ theme(axis.text.y = element_text(size = 11),
+ axis.text.x = element_text(size = 11)) +
+ scale_x_continuous(expand = c(0, 10)) +
+ scale_fill_manual(values = country_pal)
+
+
+
+
+
Conclusions
+
+
We find that Bangladesh has the highest overall seroincidence of
+enteric fever with a rate of 449 per 1000 person-years, as well as the
+highest seroincidence by age category. In comparison, Nepal has a
+seroincidence rate over 1 times lower than that of Bangladesh (400 per
+1000 person-years) and the lowest age-specific seroincidence rates of
+the three countries in the study. serocalculator
+provides an efficient tool to conduct this analysis and produce
+actionable results.
+
+
+
Acknowledgments
+
+
We gratefully acknowledge the study participants for their valuable
+time and interest in participating in these studies. Special thanks to
+our collaborators at Sabin Vaccine Institute, Aga Khan University
+(Karachi, Pakistan), Child Health Research Foundation (Dhaka,
+Bangladesh), and Dhulikhel Hospital, Kathmandu University Hospital
+(Dhulikhel, Nepal).
+
+