Analysis summary
Pairwise distances between the strains
Lengths of tree compatible segments along the alignment
Pairwise analysis
Structure of SNP patterns
Format of downloadable files

In this guide we use figures from the RECOPHY report for 92 E. coli strains. The original RECOPHY report is available here.

This dataset is part of the eLife paper:
"Whole genome phylogenies reflect the distributions of recombination rates for many bacterial species"
Thomas Sakoparnig, Chris Field, Erik van Nimwegen; eLife, 2021.

All the plots in this guide are interactive. You can move the mouse cursor over data points to see corresponding values, zoom in, zoom out, and perform other interactive actions.

Analysis summary

Table. Project information

Project name	E. coli (eLife paper)
Number of strains	92
Alignment length	2,593,107
Reference sequence length	4,641,652
Estimated fraction of bi-allelic SNPs with multiple substitutions	0.02467

Project name - project name given by the user at submission time.
Number of strains - number of strains in the dataset.
Alignment length - length of the core alignment.
Reference sequence length - length of the reference sequence. If the dataset contains multiple reference sequences, the length of the shortest one is shown.
Estimated fraction of bi-allelic SNPs with multiple substitutions - estimated fraction of bi-allelic SNP columns that arose from more than one substitution event.

Table. Diversity in multiple genome alignment.

	Invariant	Bi-allelic	Tri-allelic	Tetra-allelic
Numbers	2,325,122	250,968	16,320	685
Fractions	0.896660	0.096780	0.006290	0.000260

This table shows the numbers and fractions of invariant, bi-allelic, tri-allelic, and tetra-allelic columns in the core genome alignment.
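As a quick consistency check, the fractions in the table can be recomputed from the counts. The sketch below assumes each fraction is the count divided by the sum of the four column classes (2,593,095 here, slightly below the alignment length of 2,593,107). That denominator is inferred from the numbers in this example; the report does not state it.

```python
# Column counts copied from the diversity table above.
counts = {
    "Invariant": 2_325_122,
    "Bi-allelic": 250_968,
    "Tri-allelic": 16_320,
    "Tetra-allelic": 685,
}

# Assumption: fractions are taken relative to the sum of the four classes,
# not relative to the reported alignment length.
total = sum(counts.values())
fractions = {name: n / total for name, n in counts.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.5f}")
```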

Fig. Phylogenetic tree.

This picture shows the phylogenetic tree of your set of strains. For plotting convenience, full strain names are replaced with short IDs, each unique to its strain. If you move the mouse cursor over a strain ID in the image, the original strain name appears in a pop-up box. The color of a branch indicates the fraction of SNPs that support the branch. This fraction is calculated as f = S/(S + C), where S is the number of SNPs supporting the split and C is the number of SNPs clashing with the split. Blue indicates branches where the majority of SNPs support the split, while red indicates branches where the majority of SNPs clash with it.
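The branch support formula can be written as a small helper. This is a minimal sketch: the function name and the choice to return None for a branch with no supporting or clashing SNPs are illustrative assumptions, not RECOPHY's own code.

```python
def branch_support(supporting, clashing):
    """Return f = S / (S + C) for one branch.

    supporting: number of SNPs supporting the split (S)
    clashing:   number of SNPs clashing with the split (C)
    Returns None when S + C == 0 (an assumed convention for this sketch).
    """
    total = supporting + clashing
    if total == 0:
        return None
    return supporting / total


# A branch with 30 supporting and 10 clashing SNPs has f = 0.75,
# so most SNPs support the split.
f = branch_support(30, 10)
```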

Fig. Histogram of branch supports.

This is a histogram of the fraction f of supporting SNPs (f = S/(S + C)) for all branches in the phylogenetic tree. In the given example, more than 70% of all branches have a fraction of supporting SNPs below 0.1.

Fig. SNPs per 1 KB block along the core alignment.

This plot shows how the number of SNPs changes as a 1 KB window slides along the core alignment.

Fig. Histogram of SNPs per 1 KB block along the core alignment.

Here you see the distribution of the number of SNPs per 1 KB block along the core alignment. The distribution is overlaid with a Poisson distribution (red line). The Poisson distribution corresponds to clonally inherited regions, and deviations from it show the effect of recombination on the strain set.
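To make the comparison with the Poisson curve concrete, the sketch below tabulates observed block counts next to the counts expected under a Poisson distribution. The block data and the mean `lam` are made up for illustration; how RECOPHY chooses the Poisson mean is not described here, so it is passed in as a parameter.

```python
import math
from collections import Counter


def poisson_pmf(k, lam):
    """Poisson probability of k SNPs in a block with mean lam (lam > 0)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def observed_and_expected(snps_per_block, lam):
    """Map k -> (observed number of blocks with k SNPs, Poisson expectation)."""
    observed = Counter(snps_per_block)
    n_blocks = len(snps_per_block)
    return {
        k: (observed.get(k, 0), n_blocks * poisson_pmf(k, lam))
        for k in range(max(snps_per_block) + 1)
    }


# Made-up SNP counts for ten 1 KB blocks; the last two blocks are SNP-dense.
blocks = [0, 1, 0, 2, 1, 0, 1, 0, 25, 31]
table = observed_and_expected(blocks, lam=0.6)  # lam is a hypothetical value
```

Blocks with 25 or 31 SNPs have a negligible Poisson expectation at this mean, which is the kind of excess in the tail that the plot attributes to recombination.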

Pairwise distances between the strains

Fig. Cumulative distribution of pairwise distances.

Cumulative distribution of pairwise distances between all strains for the core alignment.

Fig. Histogram of pairwise distances

This plot shows the distribution of pairwise distances between all strains for the core alignment.

Lengths of tree compatible segments along the alignment

Fig. Number of consecutive tree-compatible SNPs

Here we plot the distribution of the number of consecutive tree-compatible SNPs. To calculate this distribution, we start from each SNP s in the core genome alignment and count the number n_s of SNP columns immediately following s, until a SNP column occurs that is incompatible with at least one of the n_s SNP columns. As shown in the figure, the distribution of the lengths of tree-compatible stretches has a mode at n=3, and stretches are very rarely longer than n=30 consecutive SNPs.
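The counting procedure can be sketched as follows. The sketch assumes bi-allelic SNP columns encoded as 0/1 strings (one character per strain) and uses the four-gamete test as the pairwise compatibility criterion. It also treats the starting SNP s as part of the stretch that each new column is checked against. These are assumptions made for illustration and may differ in detail from what RECOPHY does.

```python
def compatible(col_a, col_b):
    """Four-gamete test: two bi-allelic columns fit a common tree
    unless all four allele combinations (00, 01, 10, 11) are observed."""
    return len(set(zip(col_a, col_b))) < 4


def compatible_run_lengths(columns):
    """For each SNP s, count the columns following s that can be added
    before one is incompatible with any column already in the stretch."""
    lengths = []
    for s in range(len(columns)):
        n = 0
        for j in range(s + 1, len(columns)):
            # columns[s:j] holds s plus the n columns accepted so far
            if all(compatible(columns[j], prev) for prev in columns[s:j]):
                n += 1
            else:
                break
        lengths.append(n)
    return lengths


# Toy alignment of 4 strains and 4 SNP columns (made-up data).
snp_columns = ["0011", "0001", "0101", "0111"]
runs = compatible_run_lengths(snp_columns)
```

Stretches that reach the end of the alignment are cut short in this sketch; on a real alignment of millions of columns that edge effect is negligible.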

Fig. Number of consecutive tree-compatible nucleotides.

This plot shows the distribution of the number of consecutive tree-compatible nucleotides. Similar to the previous plot, to obtain this distribution we start from each position p in the core genome alignment and count the number n_p of consecutive nucleotides until a SNP column occurs that is incompatible with at least one of the SNP columns among the n_p nucleotides. As you can see, the number of tree-compatible nucleotides rarely exceeds 500.

Fig. Rate of phylogeny changes to SNPs for random subset of strains.

This figure shows the ratio C/S for random subsets of all strains as a function of the number of strains in the subset, using the core genome alignment with 5% of potentially homoplasic positions removed. C is the number of phylogeny changes and S is the number of substitutions along the core alignment. The C/S ratio provides a lower bound for the ratio between the total number of phylogeny changes and the total number of substitutions that occur along the core alignment.

Pairwise analysis

Table. Parameters of the mixture model fit

	Clonal fraction	SNP rate in clonal	SNP rate in recombined	SNPs in clonal	SNPs in recombined
	0.999274	0.000002	0.001500	4	2

Clonal fraction - clonal fraction of SNPs for the selected pair.
SNP rate in clonal - SNP rate in clonally inherited SNP columns.
SNP rate in recombined - SNP rate in recombined SNP columns.
SNPs in clonal - number of SNPs in clonal SNP columns.
SNPs in recombined - number of SNPs in recombined SNP columns.

Fig. Local SNP densities along the core alignment.

Here we plot the SNP densities (SNPs per 1 KB) for the selected pair of strains.

Fig. Histogram of SNPs per 1 KB block.

This plot shows the distribution of the number of SNPs per 1 KB block for the selected pair of strains, together with a fit (red line) of a mixture model. The mixture model consists of a Poisson distribution for the clonally inherited regions plus a negative binomial distribution for the recombined regions.
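The shape of such a mixture can be sketched as a weighted sum of the two probability mass functions. The parameter names, the (r, p) parameterization of the negative binomial, and the example values below are assumptions for illustration; the report does not specify how RECOPHY parameterizes or fits the model.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of k SNPs in a block (lam > 0)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def neg_binomial_pmf(k, r, p):
    """Negative binomial: Gamma(k + r) / (k! Gamma(r)) * p**r * (1 - p)**k,
    with r > 0 and 0 < p < 1 (one common parameterization)."""
    log_coeff = math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
    return math.exp(log_coeff + r * math.log(p) + k * math.log(1 - p))


def mixture_pmf(k, clonal_weight, lam, r, p):
    """Probability of k SNPs in a block under the two-component mixture."""
    clonal = clonal_weight * poisson_pmf(k, lam)
    recombined = (1 - clonal_weight) * neg_binomial_pmf(k, r, p)
    return clonal + recombined


# Hypothetical parameters: 80% clonal blocks with a low SNP rate,
# 20% recombined blocks with a broad, overdispersed SNP count.
curve = [mixture_pmf(k, 0.8, 0.5, 2.0, 0.3) for k in range(40)]
```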

Fig. Divergence vs. clonal fractions.

This plot shows the fraction of the genome that was inherited clonally as a function of the nucleotide divergence of the pair. Each dot corresponds to a pair of strains.

Fig. Clonal fraction vs. fraction SNPs from recombination.

Here we show the fraction of all SNPs that lie in recombined regions as a function of the clonally inherited fraction of the genome. Each dot corresponds to a pair of strains.

Fig. CDF of clonal fractions.

Cumulative distribution of the clonal fractions of the pairs.

Fig. Sizes of recombined stretches.

Cumulative distribution of the lengths of recombined segments for pairs that are in the mostly clonal regime.

Structure of SNP patterns

Fig. Reverse CDF of SNP pattern counts.

Reverse cumulative distributions of the frequencies of all observed 2-SNPs (blue line), 3-SNPs (orange line), etc. Here an n-SNP is a SNP whose minor allele is carried by n strains.

Fig. N-SNP entropy profiles.

Entropy profiles of the N-SNP distribution for each strain. The profile is computed from N=2 up to the maximum number of clades in your dataset, with N on the x-axis. For each strain, the entropy is calculated on the distribution of frequencies with which that strain shares SNPs with N-1 other strains. Click twice on a trace name in the legend to hide all other traces. Click twice again on the trace name to show all the traces.
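One way to read this definition is sketched below: for a given strain and N, collect every N-SNP that includes the strain, count how often each group of N-1 partner strains occurs, and take the Shannon entropy of those counts. The function names, the input format (one set of minor-allele strains per SNP), and the use of base-2 logarithms are assumptions for illustration.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy in bits of a list of positive counts."""
    counts = list(counts)
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


def entropy_profile_point(strain, snp_groups, n):
    """Entropy for one strain at one value of N.

    snp_groups: iterable of sets, each holding the strains that carry
    the minor allele of one SNP. Returns None if the strain is in no N-SNP.
    """
    partners = Counter(
        frozenset(group - {strain})
        for group in snp_groups
        if len(group) == n and strain in group
    )
    if not partners:
        return None
    return shannon_entropy(partners.values())


# Made-up SNPs: strain A shares 2-SNPs with B twice, with C once, with D once.
snps = [{"A", "B"}, {"A", "B"}, {"A", "C"}, {"A", "D"}, {"B", "C"}, {"A", "B", "C"}]
h_a = entropy_profile_point("A", snps, 2)
```

A strain that always shares its N-SNPs with the same partners gets entropy 0, while a strain whose N-SNPs are spread evenly over many partner groups gets a high entropy.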

Format of downloadable files

Tree file - phylogenetic tree inferred by REALPHY in Newick format.
Tree image - tree image in SVG format. We recommend using a web browser to view the image.
Core genome alignment - core alignment in PHYLIP (PHY) format compressed with gzip.
Branch support file - two-column text file. Columns are separated by a space. Column 1: branch definition, with leaf nodes separated by semicolons, 2: branch support value (ratio of SNPs supporting the branch to all SNPs).
Pairwise results - table containing the results of fitting the mixture model. Columns are tab separated. Column 1: first strain in the pair, 2: second strain in the pair, 3: divergence between strains, 4: clonal fraction of SNPs in the pair, 5: SNP rate in clonally inherited SNP columns, 6: SNP rate in recombined SNP columns, 7: SNP number in clonal columns, 8: SNP number in recombined columns.
Distribution of segment-lengths of compatible SNPs - text file. First line: list of lengths of stretches of compatible SNPs, second line: list of lengths of stretches of compatible nucleotides.
SNP types along the core genome alignment - two-column text file. Columns are tab separated. Column 1: core genome position, 2: binary string coding the SNP.
Counts of SNP types - five-column text file. Columns are tab separated. Column 1: reverse cumulative distribution, 2: number of SNPs of this type, 3: number of strains having the minor allele, 4: binary string coding the SNP type, 5: names of strains having the minor allele.
Distribution of SNP type counts - three-column text file. Columns are tab separated. Column 1: count of SNP type, 2: reverse cumulative distribution, 3: type of SNP, given as the number of strains having the minor allele (e.g., 2-SNP, 3-SNP, ...).
Entropy profiles - three-column text file. Column 1: entropy, 2: number of strains which share the minor allele with strains in the third column, 3: strain.
Mapping of strain short IDs to original names - text file containing two columns. Column 1: short ID assigned to a strain, 2: original name. Columns are space separated.
Download RECOPHY figures - TAR archive containing all the figures from the RECOPHY report in separate HTML files. Use your web browser to open these HTML files.
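
As an illustration of working with these files, the sketch below reads the pairwise results table using the eight-column layout described above. The field names, the sample line, and the 0.9 threshold are made up for the example. Whether the real file starts with a header row is not stated here, so lines that do not parse as numbers are skipped.

```python
import csv
import io
from typing import NamedTuple


class PairResult(NamedTuple):
    strain1: str
    strain2: str
    divergence: float
    clonal_fraction: float
    snp_rate_clonal: float
    snp_rate_recombined: float
    snps_clonal: int
    snps_recombined: int


def read_pairwise_results(handle):
    """Parse tab-separated pairwise results with the 8 columns listed above."""
    rows = []
    for rec in csv.reader(handle, delimiter="\t"):
        if len(rec) != 8:
            continue  # blank or malformed line
        try:
            rates = [float(x) for x in rec[2:6]]
            n_clonal, n_recombined = int(float(rec[6])), int(float(rec[7]))
        except ValueError:
            continue  # e.g. a header row, if the file has one
        rows.append(PairResult(rec[0], rec[1], *rates, n_clonal, n_recombined))
    return rows


# Made-up example line; a real file would be opened with open(path, newline="").
sample = "strainA\tstrainB\t0.000003\t0.999274\t0.000002\t0.001500\t4\t2\n"
pairs = read_pairwise_results(io.StringIO(sample))
mostly_clonal = [p for p in pairs if p.clonal_fraction > 0.9]  # hypothetical cutoff
```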

RECOPHY results guide.