Hierarchical sampling structure

Caitlin Cherryh edited this page Oct 23, 2024 · 9 revisions

Many pooled surveys employ a hierarchical or cluster sampling structure, where sampling is nested within one or more hierarchical levels. The schematic (Figure 1) shows a three-stage sampling design: in each region a number of villages are selected, from each village a number of representative sites (households) are selected, and at each site a number of units (e.g. mosquitoes, blackflies, or blood samples) are collected.

The tools in PoolTools are not restricted to a village/household/unit model; they apply to any survey that uses a hierarchical sampling frame. Surveys may have fewer levels (e.g. school/unit) or more levels (e.g. district/subdistrict/street/unit).

Figure 1. Example schematic of the sampling in SimpleExampleData.csv

Why does clustering/hierarchy matter when designing a survey?

The purpose of a survey is to assess the presence/absence or prevalence of the disease in the population of interest, including those individuals that have not been sampled. Simple random sampling is a common way of getting a representative sample that can be generalised to the rest of the population, but it is often logistically impractical. It is often easier to choose a number of clusters (households, locations, workplaces) and sample units (mosquitoes, blood samples) from them. Done correctly, cluster/hierarchical sampling can still result in a representative sample. However, units within a cluster are usually more similar to each other than they are to units from other clusters (intra-cluster correlation). This makes it harder to generalise to the rest of the population, because most of the population will be in clusters that have not been sampled.

Consequently, achieving a given degree of precision requires a larger sample size in a cluster survey than in a simple random survey. The ratio of the sample size required under the cluster design to the sample size required under simple random sampling is called the design effect. The larger the design effect, the more units need to be sampled. Cluster/hierarchical survey designs and pooled testing both increase the design effect.
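As a rough illustration of how a design effect inflates the required sample size, the sketch below uses the textbook approximation for a two-level cluster survey with equal cluster sizes, DEFF = 1 + (m - 1) * rho, where m is the number of units per cluster and rho is the intra-cluster correlation. This formula is an assumption made for illustration only: it ignores pooling and is not the random-effect calculation that PoolTools performs. The function names are hypothetical.

```python
import math


def kish_design_effect(units_per_cluster: int, icc: float) -> float:
    """Textbook design effect for equal-sized clusters (no pooling)."""
    if units_per_cluster < 1:
        raise ValueError("units_per_cluster must be at least 1")
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must be between 0 and 1")
    return 1.0 + (units_per_cluster - 1) * icc


def cluster_sample_size(srs_sample_size: int, units_per_cluster: int, icc: float) -> int:
    """Units needed under cluster sampling to match a simple random sample."""
    deff = kish_design_effect(units_per_cluster, icc)
    # Round before taking the ceiling so floating-point noise cannot add a unit.
    return math.ceil(round(srs_sample_size * deff, 6))


if __name__ == "__main__":
    # Illustrative inputs: 9 units per cluster, intra-cluster correlation 0.125.
    print("design effect:", kish_design_effect(9, 0.125))
    print("units needed:", cluster_sample_size(400, 9, 0.125))
```

Under this approximation the design effect grows with both the number of units per cluster and the intra-cluster correlation, and a single unit per cluster gives a design effect of 1.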

Important

Failing to account for clustering and pooling when designing a survey will result in an underpowered sample: you may fail to detect the disease even if present or may get a very uncertain estimate of prevalence.

Important

The efficiency of a sampling design will depend greatly on the survey setting, and common designs may be very inefficient in some settings. Using PoolTools to identify an optimal survey design can result in large cost savings.

Tip

The design effect and required sample size are larger if:

  • there is a high degree of correlation between units sampled from the same cluster
  • only a few clusters are selected
  • many units (e.g. mosquitoes, blackflies) are sampled from each cluster
  • many units are placed in each pool
  • many or most clusters only have a single pool

These same factors will increase the width of confidence intervals (decrease precision of estimates).

Why does clustering/hierarchy matter when analysing a survey?

Cluster or hierarchical sampling designs provide less information about the population than a simple random sample with the same sample size. Most tools for estimating prevalence from pooled/group tests ignore the effect of cluster/hierarchical sampling. However, PoolTools provides options to account for complex survey designs.

When estimating prevalence from a hierarchical/cluster survey, failing to account for the sampling structure makes the estimates over-confident. Specifically, the confidence intervals will be too narrow and are much more likely to omit the "true" prevalence value. This can make a practical difference if a decision to start or stop an intervention depends on whether the upper confidence limit is above or below a prevalence threshold. The effect is more noticeable when the design effect of the survey design is large (see Tip above).

However, even surveys with large sample sizes, collected from a large number of sites with more than one pool per site, should still be analysed using a method that takes the hierarchical sampling structure into account.
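The narrowing of confidence intervals can be seen in a small sketch. It compares a naive interval, which treats every unit as independent, with a simple interval based on the variation between clusters. Everything here is an assumption for illustration: the counts are hypothetical, units are tested individually rather than in pools, clusters are equal in size, and both intervals use a normal approximation. It is not the random-effect model used by PoolTools.

```python
import math
import statistics

Z_95 = 1.96  # normal quantile for an approximate 95% interval

# Hypothetical survey: 8 clusters, 25 individually tested units in each.
POSITIVES = [0, 0, 0, 1, 0, 6, 0, 1]
TESTED = [25] * 8


def naive_interval(positives, tested):
    """Wald interval that treats every unit as an independent draw."""
    n = sum(tested)
    p = sum(positives) / n
    se = math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - Z_95 * se), p + Z_95 * se


def cluster_interval(positives, tested):
    """Interval based on between-cluster variation in prevalence.

    Assumes equal cluster sizes, so the mean of the cluster-level
    proportions equals the overall proportion.
    """
    props = [x / n for x, n in zip(positives, tested)]
    p = statistics.mean(props)
    se = statistics.stdev(props) / math.sqrt(len(props))
    return p, max(0.0, p - Z_95 * se), p + Z_95 * se


if __name__ == "__main__":
    for name, fn in (("naive", naive_interval), ("cluster", cluster_interval)):
        p, lower, upper = fn(POSITIVES, TESTED)
        print(f"{name}: estimate={p:.3f}, interval=({lower:.3f}, {upper:.3f})")
```

With these hypothetical counts both approaches give the same point estimate, but the naive upper limit falls below a threshold of 0.08 while the cluster-based upper limit does not, so the two analyses would support different decisions.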

Important

We recommend always adjusting for clustering in your own analyses if you have used a cluster/hierarchical sampling design. If you do not adjust for clustering, your confidence intervals will be too narrow, and you may incorrectly conclude that prevalence is below a threshold.

How to account for cluster/hierarchical sampling in PoolTools

PoolTools provides options for adjusting for cluster/hierarchical sampling.

When analysing survey data you can select the option 'Adjust for hierarchical sampling?'. This option is not selected by default. See this tutorial for a worked example: Estimating marker prevalence in pooled data. The dataset to be analysed must include one or more columns that indicate which cluster each tested pool (i.e. each row of the dataset) was taken from. For hierarchical surveys with more than one level (e.g. village/household) the dataset should have one column for each level (e.g. columns called 'Village' and 'Household').

For more details on preparing datasets for analysis see this how-to guide: How to prepare your data for analysis
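As a minimal sketch of the layout described above, the example below builds a small dataset with one row per pool and one column per hierarchical level, then checks that the hierarchy columns are present and filled in. The 'Village' and 'Household' columns follow the example above; the 'NumInPool' and 'Result' column names, the rows themselves, and the helper function are hypothetical and are not part of PoolTools.

```python
import csv
import io
from collections import Counter

# Hypothetical rows: one row per tested pool, one column per hierarchy level.
EXAMPLE_CSV = """Village,Household,NumInPool,Result
A,A-1,10,0
A,A-1,10,1
A,A-2,8,0
B,B-1,10,0
B,B-2,10,0
"""


def pools_per_cluster(csv_text, hierarchy):
    """Count pools in each lowest-level cluster.

    Raises ValueError if a hierarchy column is missing or a label is blank.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    missing = [col for col in hierarchy if col not in fieldnames]
    if missing:
        raise ValueError(f"missing hierarchy column(s): {missing}")
    counts = Counter()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        key = tuple((row[col] or "").strip() for col in hierarchy)
        if "" in key:
            raise ValueError(f"blank cluster label on line {line_no}")
        counts[key] += 1
    return counts


if __name__ == "__main__":
    for cluster, n_pools in pools_per_cluster(EXAMPLE_CSV, ["Village", "Household"]).items():
        print(cluster, n_pools)
```

A count like this also makes it easy to spot clusters that contributed only a single pool.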

When designing a survey you can select the option 'Clustered design?', which is selected by default. See this tutorial for a worked example: Designing a cost-effective survey to estimate marker prevalence.

Under the Hood

Designing surveys

PoolTools uses a range of functions from PoolPoweR to evaluate and optimise survey designs. These functions use a random effect modelling framework to calculate design effects and expected (Fisher) information from survey designs and identify optimal survey designs that maximise survey cost efficiency.

A paper describing these statistical models and methods will be available soon.

Analysing survey data

PoolTools uses functions from PoolTestR to estimate prevalence from data. The function PoolTestR::HierPoolPrev is used for hierarchical/cluster survey designs, and PoolTestR::PoolPrev is used when not adjusting for clustering. PoolTestR::HierPoolPrev uses a random-effect model to estimate prevalence in the population and in each cluster, the degree of correlation between units in a cluster, and the uncertainty of the estimates. For more details on these models and functions please see this paper in Environmental Modelling and Software.

The point estimates of prevalence are usually slightly higher in the hierarchical model than in the non-hierarchical model. This is most notable when there is evidence of substantial correlation between sampled units from the same cluster (e.g. site or village) and the sample size is small. In this case, the model can't rule out the possibility that sampling missed some of the clusters with the highest prevalence, and so must adjust the prevalence estimate upwards from what you would expect from a non-clustered sample.
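For intuition about what a non-hierarchical estimate from pooled tests looks like, the sketch below uses the closed-form maximum likelihood estimate for the simplest case: all pools the same size, a perfect test, and no clustering. Under those assumptions a pool is negative with probability (1 - p)^s, so p can be recovered from the fraction of negative pools. This is an illustration only, with a hypothetical function name; it is not the implementation of PoolTestR::PoolPrev and it does not produce confidence intervals.

```python
def pooled_prevalence_mle(positive_pools: int, total_pools: int, pool_size: int) -> float:
    """Closed-form prevalence estimate for equal-sized pools and a perfect test."""
    if total_pools < 1 or pool_size < 1:
        raise ValueError("total_pools and pool_size must be at least 1")
    if not 0 <= positive_pools <= total_pools:
        raise ValueError("positive_pools must be between 0 and total_pools")
    negative_fraction = 1 - positive_pools / total_pools
    # A pool of size s is negative only if all s units are negative,
    # so negative_fraction estimates (1 - p) ** s.
    return 1 - negative_fraction ** (1 / pool_size)


if __name__ == "__main__":
    # Illustrative inputs: 19 positive pools out of 100 pools of 2 units each.
    print(f"{pooled_prevalence_mle(19, 100, 2):.3f}")
```

With a pool size of 1 the estimate reduces to the ordinary proportion of positive tests.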