Skip to content

How to prepare your data for analysis

Caitlin Cherryh edited this page Oct 23, 2024 · 3 revisions

Supported formats

In order to estimate prevalence with your own data, you must upload tabular data (i.e. spreadsheet). PoolTools currently supports .xlsx or .csv.

If using .xlsx, there must be no merged or nested cells. Specifically, datasets must adhere to the principles of tidy data for PoolTools to work as intended.

The minimum required data

Your data (in .xlsx or .csv) must have two columns at minimum:

  1. Results per pool
  2. Number of units per pool

This example shows a PoolTools-compatible table with the two required columns, and placeholders for additional columns containing added information for each pool. Your own data is likely to contain more rows than the example. More comprehensive datasets and templates can be found here.

Results NumInPool Additional Columns
0 10 ... ...
0 10
1 15
0 12
1 10

Important

All columns must be named in the first row of the dataset. Each subsequent row in your dataset must correspond to a single pool.

Note

Column names can be anything. It does not need to be strictly named as Results or NumInPool, as long as the data is in the correct format.

Required columns

Results

Ensure that the results column contains only 0 or 1, where:

  • 0 indicates a pool is negative for the marker
  • 1 indicates a pool is positive for the marker

If this data is in presented differently, such as present and absent, it must be replaced with 0 and 1.

Number of units per pool

This column indicates the number of units sampled in the pool. It must only contain integers (whole numbers) and cannot be 0 or negative. The number of units may be different for each pool.

Optional columns

Additional columns containing information associated with each pool can be included. Some examples include, but are not limited to the:

  • sampling location (site, region, coordinates)
  • genus/species of sample collected

There is no limit to the number of additional columns. Each column can be in any format (i.e. numerical, text). However, you must ensure that the format of data in each column is consistent along that column.

Note

There are no restrictions around the names of optional columns. You can choose whatever names best reflect your data (e.g., House, Country, Species_name).

Ensuring clusters are uniquely identified

When including information about hierarchical sampling frame (e.g. where units were sampled from sites and sites were selected from a broader region), it is critical that each cluster can be uniquely identified. Further reading on hierarchical sampling frames in PoolTools and why this is important: Hierarchical sampling structure

Important

It's not enough for the combination of columns specifying the hierarchy to be unique. Each different location needs to have a unique label.

In the following example, sampling was conducted at two sites in region A and three sites in region B. Sites are numbered within each Region. Site 1 in Region A is different from Site 1 in Region B, however this is not clear in the data as the sites do not have unique names.

Result NumInPool Region Site
0 10 A 1
1 12 A 1
1 10 A 2
0 15 A 2
0 10 B 1
0 12 B 2
1 10 B 3

To ensure each site has unique name, we can edit the data so the Site is prefixed with the Region name:

Result NumInPool Region Site
0 10 A A-1
1 12 A A-1
1 10 A A-2
0 15 A A-2
0 10 B B-1
0 12 B B-2
1 10 B B-3

An alternative is to renumber all the sites sequentially:

Result NumInPool Region Site
0 10 A 1
1 12 A 1
1 10 A 2
0 15 A 2
0 10 B 3
0 12 B 4
1 10 B 5

🚀 Where to from here?