TY - JOUR
T1 - Using nearly full-genome HIV sequence data improves phylogeny reconstruction in a simulated epidemic
AU - PANGEA-HIV Consortium
AU - ICONIC Project
AU - Yebra, Gonzalo
AU - Hodcroft, Emma B.
AU - Ragonnet-Cronin, Manon L.
AU - Pillay, Deenan
AU - Leigh Brown, Andrew J.
AU - Fraser, Christophe
AU - Kellam, Paul
AU - De Oliveira, Tulio
AU - Dennis, Ann
AU - Hoppe, Anne
AU - Kityo, Cissy
AU - Frampton, Dan
AU - Ssemwanga, Deogratius
AU - Tanser, Frank
AU - Keshani, Jagoda
AU - Lingappa, Jairam
AU - Herbeck, Joshua
AU - Wawer, Maria
AU - Essex, Max
AU - Cohen, Myron S.
AU - Paton, Nicholas
AU - Ratmann, Oliver
AU - Kaleebu, Pontiano
AU - Hayes, Richard
AU - Fidler, Sarah
AU - Quinn, Thomas
AU - Novitsky, Vladimir
AU - Hayward, Andrew
AU - Nastouli, Eleni
AU - Morris, Steven
AU - Clark, Duncan
AU - Kozlakidis, Zisis
N1 - Publisher Copyright: © The Author(s) 2016.
PY - 2016/12/23
Y1 - 2016/12/23
N2 - HIV molecular epidemiology studies analyse viral pol gene sequences because of their availability, but whole-genome sequencing allows other genes to be used. We aimed to determine which gene(s) provide(s) the best approximation to the real phylogeny by analysing a simulated epidemic (created as part of the PANGEA-HIV project) with a known transmission tree. We sub-sampled a simulated dataset of 4662 sequences into different combinations of genes (gag-pol-env, gag-pol, gag, pol, env and partial pol) and sampling depths (100%, 60%, 20% and 5%), generating 100 replicates for each case. We built maximum-likelihood trees for each combination using RAxML (GTR + Γ), and compared their topologies to that of the corresponding true tree using CompareTree. The accuracy of the trees was significantly proportional to the length of the sequences used, with the gag-pol-env datasets showing the best performance and the gag and partial pol sequences showing the worst. The lowest sampling depths (20% and 5%) greatly reduced the accuracy of tree reconstruction and showed high variability among replicates, especially when using the shortest gene datasets. In conclusion, using longer sequences derived from nearly whole genomes will improve the reliability of phylogenetic reconstruction. With low sample coverage, results can be highly variable, particularly when based on short sequences.
AB - HIV molecular epidemiology studies analyse viral pol gene sequences because of their availability, but whole-genome sequencing allows other genes to be used. We aimed to determine which gene(s) provide(s) the best approximation to the real phylogeny by analysing a simulated epidemic (created as part of the PANGEA-HIV project) with a known transmission tree. We sub-sampled a simulated dataset of 4662 sequences into different combinations of genes (gag-pol-env, gag-pol, gag, pol, env and partial pol) and sampling depths (100%, 60%, 20% and 5%), generating 100 replicates for each case. We built maximum-likelihood trees for each combination using RAxML (GTR + Γ), and compared their topologies to that of the corresponding true tree using CompareTree. The accuracy of the trees was significantly proportional to the length of the sequences used, with the gag-pol-env datasets showing the best performance and the gag and partial pol sequences showing the worst. The lowest sampling depths (20% and 5%) greatly reduced the accuracy of tree reconstruction and showed high variability among replicates, especially when using the shortest gene datasets. In conclusion, using longer sequences derived from nearly whole genomes will improve the reliability of phylogenetic reconstruction. With low sample coverage, results can be highly variable, particularly when based on short sequences.
UR - http://www.scopus.com/inward/record.url?scp=85007277444&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85007277444&partnerID=8YFLogxK
U2 - 10.1038/srep39489
DO - 10.1038/srep39489
M3 - Article
C2 - 28008945
AN - SCOPUS:85007277444
SN - 2045-2322
VL - 6
JO - Scientific Reports
JF - Scientific Reports
M1 - 39489
ER -