From 332861f62f22431357f92afc2860bd13f82db8aa Mon Sep 17 00:00:00 2001
From: Yan Wong <yan.wong@bdi.ox.ac.uk>
Date: Wed, 5 Jul 2023 23:31:04 +0100
Subject: [PATCH] Add an example of creating a matrix of all pairwise
 comparisons

This is a common thing to want to do
---
 docs/stats.md | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/docs/stats.md b/docs/stats.md
index 39257e1017..039952296d 100644
--- a/docs/stats.md
+++ b/docs/stats.md
@@ -280,12 +280,10 @@ and also inefficient, because the allele frequencies for one population
 gets used in computing many of those values.
 So, statistics that take a `sample_sets` argument also take an `indexes` argument,
 which for a statistic that operates on `k` sample sets will be a list of `k`-tuples.
-If `indexes` is a length `n` list of `k`-tuples,
-then the output will have `n` columns,
-and if `indexes[j]` is a tuple `(i0, ..., ik)`,
-then the `j`-th column will contain values of the statistic computed on
-`(sample_sets[i0], sample_sets[i1], ..., sample_sets[ik])`.
-
+If `indexes` is a length `n` list of `k`-tuples, then the output will have `n`
+columns, so if `indexes[j]` is the tuple `(i0, ..., ik)`
+the `j`-th column will contain values of the statistic computed on
+`(sample_sets[i0], sample_sets[i1], ..., sample_sets[ik])`. 
 
 How multiple statistics are handled differs slightly between statistics
 that operate on single sample sets and multiple sample sets.
@@ -338,14 +336,29 @@ The `indexes` parameter is interpreted in the following way:
   statistic selecting the specified sample sets and we remove the last dimension
   in the result array as described in the {ref}`sec_stats_output_dimensions` section.
 
-- If if is `None` and `sample_sets` contains exactly `k` sample sets,
+- If it is `None` and `sample_sets` contains exactly `k` sample sets,
   this is equivalent to `indexes=range(k)`. **Note
   that we also drop the outer dimension of the result array in this case**.
 
-- If is is a list of `k`-tuples (each consisting of integers
+- If it is a list of `k`-tuples (each consisting of integers
   of integers between `0` and `len(sample_sets) - 1`) of length `n` we
   compute `n` statistics based on these selections of sample sets.
 
+For instance, a common requirement in the case of a pairwise (i.e. `k=2`) symmetrical
+statistic such as `divergence` is to obtain the statistic between all pairs of
+sample sets. This can be done by indexing all possible pairs as follows:
+
+```{code-cell} ipython3
+# Compute all pairwise divergences between diploid individuals. Or e.g. for pairwise
+# divergences between each sampled genome, use `sample_sets=[[s] for s in ts.samples()]`
+sample_sets=[i.nodes for i in ts.individuals()]
+index_all_pairs = np.tril_indices(len(sample_sets))  # ((0,0), (1,0), (1,1), (2,0), ...)
+d = ts.divergence(sample_sets, indexes=tuple(zip(*index_all_pairs))) 
+pairwise_matrix = np.zeros((len(sample_sets), len(sample_sets)))
+pairwise_matrix[index_all_pairs] = d
+
+print(pairwise_matrix)  # NB: only lower triangular values filled
+```
 
 (sec_stats_windows)=