In this guide we use figures from the RECOPHY report for 92
E. coli strains. The original RECOPHY report is
available here.
This dataset is part of the eLife paper:
"Whole genome phylogenies reflect the distributions of
recombination rates for many bacterial
species"
Thomas Sakoparnig, Chris Field, Erik
van Nimwegen; eLife, 2021.
All the plots in this guide are interactive. You can
move the mouse cursor over data points to see
corresponding values, zoom in, zoom out, and perform
other interactive actions.
Project name | E. coli (eLife paper) |
Number of strains | 92 |
Alignment length | 2,593,107 |
Reference sequence length | 4,641,652 |
Estimated fraction of bi-allelic SNPs with multiple substitutions | 0.02467 |
Column type | Invariant | Bi-allelic | Tri-allelic | Tetra-allelic |
---|---|---|---|---|
Numbers | 2,325,122 | 250,968 | 16,320 | 685 |
Fractions | 0.896660 | 0.096780 | 0.006290 | 0.000260 |
This table shows the numbers and fractions of invariant, bi-allelic, tri-allelic, and tetra-allelic columns in the core genome alignment.
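As an illustration of how such a table could be produced, the following minimal sketch counts alignment columns by the number of distinct nucleotides they contain. The function names and the toy alignment are hypothetical, and the sketch assumes a gap-free alignment containing only A, C, G, and T; it is not the RECOPHY implementation.

```python
def classify_columns(alignment):
    """Count columns by number of distinct nucleotides.

    `alignment` is a list of equal-length strings, one per strain.
    Assumption: only A, C, G, T occur (no gaps or ambiguity codes).
    """
    names = {1: "invariant", 2: "bi-allelic", 3: "tri-allelic", 4: "tetra-allelic"}
    counts = {name: 0 for name in names.values()}
    for column in zip(*alignment):  # iterate over alignment columns
        counts[names[len(set(column))]] += 1
    return counts


def fractions(counts):
    """Convert the column counts into fractions of all counted columns."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Toy alignment of four strains and five columns (made-up data).
toy = ["ACGTA", "ACGAA", "ATGCA", "ATGGA"]
counts = classify_columns(toy)
print(counts)
print(fractions(counts))
```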
This picture shows the phylogenetic tree of your set of strains. For plotting convenience, the full strain names are replaced with short IDs, each unique to its strain. If you move the mouse cursor over a strain ID in the image, the original strain name appears in a pop-up box. The color of a branch indicates the fraction of SNPs that support it. This fraction is calculated as f = S/(S + C), where S is the number of SNPs supporting the split and C is the number of SNPs clashing with the split. Blue therefore marks branches where the majority of SNPs support the split, while red marks branches where the majority of SNPs clash with it.
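The formula f = S/(S + C) can be sketched in a few lines. In this sketch, a bi-allelic SNP is represented by the set of strains carrying one of its two alleles. It is assumed to support a split when it divides the strains exactly as the branch does, and to clash when all four combinations of branch side and allele occur (the four-gamete test). These definitions and the function name are assumptions made for illustration, not a description of RECOPHY's code.

```python
def support_fraction(split, snps, all_strains):
    """Return f = S / (S + C) for one branch, or None if S + C == 0.

    split       : strains on one side of the branch
    snps        : iterable of sets, each holding the strains with one allele
    all_strains : every strain in the tree
    """
    all_strains = frozenset(all_strains)
    side = frozenset(split)
    other = all_strains - side
    S = C = 0
    for snp in snps:
        carriers = frozenset(snp)
        rest = all_strains - carriers
        if carriers == side or carriers == other:
            S += 1  # SNP partitions the strains exactly like the branch
        elif (side & carriers) and (side & rest) and (other & carriers) and (other & rest):
            C += 1  # all four combinations occur: SNP clashes with the split
        # otherwise the SNP is compatible but uninformative for this branch
    return S / (S + C) if S + C else None


strains = "ABCDE"
snps = [{"A", "B"}, {"C", "D", "E"}, {"A", "C"}, {"C", "D"}]
f = support_fraction({"A", "B"}, snps, strains)
print(f)
```

In the toy data, two SNPs support the split {A, B}, one clashes with it, and one is neither, so f is 2/3.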
This is a histogram of the fraction f of supporting SNPs (f = S/(S + C)) for all branches in the phylogenetic tree. In the given example, more than 70% of all branches have a fraction of supporting SNPs below 0.1.
This plot shows how the number of SNPs changes as you slide a 1 KB window along the core alignment.
Here you see the distribution of the number of SNPs per 1 KB block along the core alignment. The distribution is overlaid with a Poisson distribution (red line). The Poisson distribution corresponds to clonally inherited regions, and deviations from it show the effect of recombination in the strain set.
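As a rough illustration of how a per-block SNP histogram can be compared with a Poisson curve, the sketch below bins SNP positions into non-overlapping 1 KB blocks and evaluates the Poisson probability for each observed count. The function names, the 0-based position convention, and the Poisson mean (left to the caller) are assumptions; this guide does not state how the mean of the red curve is chosen.

```python
import math
from collections import Counter


def snps_per_block(snp_positions, alignment_length, block=1000):
    """Count SNPs in consecutive non-overlapping blocks (0-based positions)."""
    counts = [0] * math.ceil(alignment_length / block)
    for pos in snp_positions:
        counts[pos // block] += 1
    return counts


def poisson_pmf(k, mean):
    """Poisson probability of k events, computed in log space (mean > 0)."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def observed_vs_poisson(block_counts, mean):
    """Map each observed count k to (observed fraction, Poisson probability)."""
    histogram = Counter(block_counts)
    n = len(block_counts)
    return {k: (histogram[k] / n, poisson_pmf(k, mean)) for k in sorted(histogram)}


# Made-up SNP positions in a 3 KB alignment.
block_counts = snps_per_block([0, 5, 999, 1000, 2500], 3000)
comparison = observed_vs_poisson(block_counts, mean=1.0)
for k, (observed, expected) in comparison.items():
    print(k, round(observed, 3), round(expected, 3))
```

Blocks whose counts are far more frequent than the Poisson value predicts are the ones that the caption attributes to recombination.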
This plot shows the cumulative distribution of pairwise distances between all strains for the core alignment.
This plot shows the distribution of pairwise distances between all strains for the core alignment.
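The two distance plots can be reproduced in outline as follows. The sketch assumes that the distance between two strains is the fraction of core-alignment positions at which they differ; that definition, the function names, and the toy sequences are assumptions for illustration.

```python
from itertools import combinations


def pairwise_divergence(alignment):
    """Fraction of differing positions for every pair of strains.

    `alignment` maps strain name -> aligned sequence (equal lengths).
    """
    distances = {}
    for a, b in combinations(sorted(alignment), 2):
        seq_a, seq_b = alignment[a], alignment[b]
        differences = sum(x != y for x, y in zip(seq_a, seq_b))
        distances[(a, b)] = differences / len(seq_a)
    return distances


def empirical_cdf(values):
    """Return (value, cumulative fraction) points sorted by value."""
    ordered = sorted(values)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]


toy = {"s1": "AAAA", "s2": "AAAT", "s3": "AATT"}
distances = pairwise_divergence(toy)
cdf = empirical_cdf(distances.values())
print(distances)
print(cdf)
```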
Here we plot the distribution of the number of consecutive tree-compatible SNPs. To calculate it, we start from each SNP s in the core genome alignment and count the number n_s of SNP columns immediately following s, until a SNP column occurs that is incompatible with at least one of the n_s SNP columns. As shown in the figure, the distribution of the lengths of tree-compatible stretches has a mode at n = 3, and stretches are very rarely longer than n = 30 consecutive SNPs.
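The counting procedure described above can be sketched as follows. Two bi-allelic SNP columns are treated as compatible when they pass the four-gamete test, that is, when fewer than four allele combinations occur across strains. The sketch counts the starting SNP as part of the stretch; that convention, the restriction to bi-allelic columns, and the function names are assumptions made for illustration.

```python
def compatible(col_a, col_b):
    """Four-gamete test for two bi-allelic columns (one character per strain)."""
    return len(set(zip(col_a, col_b))) < 4


def compatible_stretch_lengths(snp_columns):
    """For each SNP, length of the run of mutually compatible SNPs starting there."""
    lengths = []
    for start in range(len(snp_columns)):
        stretch = [snp_columns[start]]
        for candidate in snp_columns[start + 1:]:
            # Stop at the first column that clashes with any column in the run.
            if any(not compatible(candidate, previous) for previous in stretch):
                break
            stretch.append(candidate)
        lengths.append(len(stretch))
    return lengths


# Four made-up SNP columns over four strains.
columns = ["AATT", "AATT", "ATAT", "AAAT"]
lengths = compatible_stretch_lengths(columns)
print(lengths)
```

In the toy data the third column clashes with the first two, so the runs starting at the first and second SNPs end there.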
This plot shows the distribution of the number of consecutive tree-compatible nucleotides. As in the previous plot, we start from each position p in the core genome alignment and count the number n_p of consecutive nucleotides until a SNP column occurs that is incompatible with at least one of the SNP columns among the n_p nucleotides. As you can see, the number of tree-compatible nucleotides rarely exceeds 500.
This figure shows the ratio C/S for random subsets of all strains as a function of the number of strains in the subset, using the core genome alignment with the 5% of potentially homoplasic positions removed. Here C is the number of phylogeny changes and S is the number of substitutions along the core alignment. The C/S ratio provides a lower bound for the ratio between the total number of phylogeny changes and the total number of substitutions that occur along the core alignment.
Clonal fraction | SNP rate in clonal | SNP rate in recombined | SNPs in clonal | SNPs in recombined |
---|---|---|---|---|
0.999274 | 0.000002 | 0.001500 | 4 | 2 |
Here we plot the SNP densities (SNPs per 1 KB) for the selected pair of strains.
This plot shows the number of SNPs per 1 KB block for the selected pair of strains together with a fit (red line) of a mixture model. The mixture model consists of a Poisson distribution for the clonally inherited regions plus a negative binomial distribution for the recombined regions.
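To make the shape of such a mixture concrete, the sketch below evaluates the probability of observing k SNPs in a block under a weighted sum of a Poisson and a negative binomial distribution. The parameter names (`clonal_weight`, `lam`, `r`, `p`), the negative binomial parameterization (failures before `r` successes with success probability `p`), and the example values are assumptions; the sketch only evaluates the model and does not fit it.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of k SNPs in a block (lam > 0)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def neg_binomial_pmf(k, r, p):
    """Negative binomial probability of k, with r > 0 and 0 < p < 1."""
    log_coef = math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
    return math.exp(log_coef + r * math.log(p) + k * math.log(1 - p))


def mixture_pmf(k, clonal_weight, lam, r, p):
    """Weighted sum: clonal (Poisson) part plus recombined (neg. binomial) part."""
    return (clonal_weight * poisson_pmf(k, lam)
            + (1 - clonal_weight) * neg_binomial_pmf(k, r, p))


def clonal_posterior(k, clonal_weight, lam, r, p):
    """Probability that a block with k SNPs belongs to the clonal component."""
    return clonal_weight * poisson_pmf(k, lam) / mixture_pmf(k, clonal_weight, lam, r, p)


# Made-up parameters: 70% clonal blocks, few SNPs in clonal blocks.
params = dict(clonal_weight=0.7, lam=0.5, r=2.0, p=0.1)
for k in (0, 1, 5, 20):
    print(k, mixture_pmf(k, **params), clonal_posterior(k, **params))
```

With parameters like these, blocks with few SNPs are attributed mostly to the clonal component and blocks with many SNPs to the recombined component.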
This plot shows the fraction of the genome that was inherited clonally as a function of the nucleotide divergence of the pair. Each dot corresponds to a pair of strains.
Here we show the fraction of all SNPs that lie in recombined regions as a function of the clonally inherited fraction of the genome. Each dot corresponds to a pair of strains.
Cumulative distribution of the clonal fractions of the pairs.
Cumulative distribution of the lengths of recombined segments for pairs that are in the mostly clonal regime.
Reverse cumulative distributions of the frequencies of all observed 2-SNPs (blue line), 3-SNPs (orange line), etc.
Entropy profiles of the N-SNP distribution for each strain. The entropy is computed for every N from N = 2 up to the maximum number of clades in your dataset, and N is shown on the x-axis. For each strain, the entropy is calculated on the distribution of frequencies with which that strain shares SNPs with N-1 other strains. Double-click a trace name in the legend to hide all other traces. Double-click the trace name again to show all traces.
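One possible reading of this entropy calculation is sketched below. For a given strain and N, it collects every SNP shared by exactly N strains including that strain, tallies how often each set of N-1 partner strains occurs, and computes the Shannon entropy of those frequencies. The use of base-2 logarithms, the representation of SNPs as sets of strains, and the function name are assumptions; the guide does not specify these details.

```python
import math
from collections import Counter


def nsnp_entropy(snp_strain_sets, strain, n):
    """Entropy (in bits) of the partner-set frequencies of `strain` among n-SNPs.

    snp_strain_sets : iterable of sets, each holding the strains sharing a SNP
    Returns None when the strain takes part in no n-SNP.
    """
    partners = Counter(
        frozenset(shared) - {strain}
        for shared in snp_strain_sets
        if len(shared) == n and strain in shared
    )
    total = sum(partners.values())
    if total == 0:
        return None
    return -sum((c / total) * math.log2(c / total) for c in partners.values())


# Made-up SNPs, each given as the set of strains that share it.
snps = [{"A", "B"}, {"A", "B"}, {"A", "C"}, {"A", "D"}, {"B", "C"}, {"A", "B", "C"}]
profile = {n: nsnp_entropy(snps, "A", n) for n in (2, 3, 4)}
print(profile)
```

A low entropy means the strain shares its N-SNPs with only a few distinct partner sets, while a high entropy means its N-SNPs are spread across many different partner sets.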