Using nearly full-genome HIV sequence data improves phylogeny reconstruction in a simulated epidemic

PANGEA-HIV Consortium; ICONIC Project

doi:10.1038/srep39489

Research output: Contribution to journal › Article › peer-review

15 Scopus citations

Abstract

HIV molecular epidemiology studies analyse viral pol gene sequences due to their availability, but whole genome sequencing allows the use of other genes. We aimed to determine which gene(s) provide(s) the best approximation to the real phylogeny by analysing a simulated epidemic (created as part of the PANGEA-HIV project) with a known transmission tree. We sub-sampled a simulated dataset of 4662 sequences into different combinations of genes (gag-pol-env, gag-pol, gag, pol, env and partial pol) and sampling depths (100%, 60%, 20% and 5%), generating 100 replicates for each case. We built maximum-likelihood trees for each combination using RAxML (GTR + Γ), and compared their topologies to the corresponding true tree's using CompareTree. The accuracy of the trees was significantly proportional to the length of the sequences used, with the gag-pol-env datasets showing the best performance and gag and partial pol sequences showing the worst. The lowest sampling depths (20% and 5%) greatly reduced the accuracy of tree reconstruction and showed high variability among replicates, especially when using the shortest gene datasets. In conclusion, using longer sequences derived from nearly whole genomes will improve the reliability of phylogenetic reconstruction. With low sample coverage, results can be highly variable, particularly when based on short sequences.
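The sub-sampling design described in the abstract (six gene sets, four sampling depths, replicated draws) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `subsample_ids` and `build_design`, the placeholder sequence IDs, sampling without replacement, rounding to the nearest whole sequence, and reusing the same ID subsets across gene sets are all assumptions made for illustration. Tree building with RAxML and comparison with CompareTree are not shown.

```python
import random

# Gene sets and sampling depths as listed in the abstract.
GENE_SETS = ("gag-pol-env", "gag-pol", "gag", "pol", "env", "partial pol")
DEPTHS = (1.0, 0.6, 0.2, 0.05)


def subsample_ids(seq_ids, depth, n_replicates, seed=0):
    """Draw n_replicates random subsets, each holding a `depth` fraction of seq_ids.

    Assumption: sampling is without replacement and the subset size is
    rounded to the nearest whole sequence.
    """
    rng = random.Random(seed)
    k = max(1, round(len(seq_ids) * depth))
    return [sorted(rng.sample(seq_ids, k)) for _ in range(n_replicates)]


def build_design(seq_ids, n_replicates=100, seed=0):
    """Map each (gene set, depth) pair to its list of replicate ID subsets.

    Assumption: the same ID subsets are reused for every gene set, so that
    gene sets are compared on identical samples.
    """
    design = {}
    for i, depth in enumerate(DEPTHS):
        replicates = subsample_ids(seq_ids, depth, n_replicates, seed + i)
        for genes in GENE_SETS:
            design[(genes, depth)] = replicates
    return design


# Placeholder IDs standing in for the 4662 simulated sequences.
ids = [f"seq{i:04d}" for i in range(4662)]
design = build_design(ids, n_replicates=3)
print({depth: len(design[("pol", depth)][0]) for depth in DEPTHS})
```

Under these assumptions every replicate at 100% depth contains the full dataset, so variation among replicates only arises at the lower depths.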

Original language	English (US)
Article number	39489
Journal	Scientific Reports
Volume	6
DOIs	https://doi.org/10.1038/srep39489
State	Published - Dec 23 2016

ASJC Scopus subject areas

General

Cite this

@article{fd1460a719a14344aa31901b2a22ba8d,
title = "Using nearly full-genome HIV sequence data improves phylogeny reconstruction in a simulated epidemic",
abstract = "HIV molecular epidemiology studies analyse viral pol gene sequences due to their availability, but whole genome sequencing allows the use of other genes. We aimed to determine which gene(s) provide(s) the best approximation to the real phylogeny by analysing a simulated epidemic (created as part of the PANGEA-HIV project) with a known transmission tree. We sub-sampled a simulated dataset of 4662 sequences into different combinations of genes (gag-pol-env, gag-pol, gag, pol, env and partial pol) and sampling depths (100%, 60%, 20% and 5%), generating 100 replicates for each case. We built maximum-likelihood trees for each combination using RAxML (GTR + Γ), and compared their topologies to the corresponding true tree's using CompareTree. The accuracy of the trees was significantly proportional to the length of the sequences used, with the gag-pol-env datasets showing the best performance and gag and partial pol sequences showing the worst. The lowest sampling depths (20% and 5%) greatly reduced the accuracy of tree reconstruction and showed high variability among replicates, especially when using the shortest gene datasets. In conclusion, using longer sequences derived from nearly whole genomes will improve the reliability of phylogenetic reconstruction. With low sample coverage, results can be highly variable, particularly when based on short sequences.",
author = "{PANGEA-HIV Consortium} and {ICONIC Project} and Gonzalo Yebra and Hodcroft, {Emma B.} and Ragonnet-Cronin, {Manon L.} and Deenan Pillay and {Leigh Brown}, {Andrew J.} and Christophe Fraser and Paul Kellam and {De Oliveira}, Tulio and Ann Dennis and Anne Hoppe and Cissy Kityo and Dan Frampton and Deogratius Ssemwanga and Frank Tanser and Jagoda Keshani and Jairam Lingappa and Joshua Herbeck and Maria Wawer and Max Essex and Cohen, {Myron S.} and Nicholas Paton and Oliver Ratmann and Pontiano Kaleebu and Richard Hayes and Sarah Fidler and Thomas Quinn and Vladimir Novitsky and Andrew Haywards and Eleni Nastouli and Steven Morris and Duncan Clark and Zisis Kozlakidis",
note = "Publisher Copyright: {\textcopyright} The Author(s) 2016.",
year = "2016",
month = dec,
day = "23",
doi = "10.1038/srep39489",
language = "English (US)",
volume = "6",
journal = "Scientific Reports",
issn = "2045-2322",
publisher = "Nature Publishing Group",
}

TY - JOUR
T1 - Using nearly full-genome HIV sequence data improves phylogeny reconstruction in a simulated epidemic
AU - PANGEA-HIV Consortium
AU - ICONIC Project
AU - Yebra, Gonzalo
AU - Hodcroft, Emma B.
AU - Ragonnet-Cronin, Manon L.
AU - Pillay, Deenan
AU - Leigh Brown, Andrew J.
AU - Fraser, Christophe
AU - Kellam, Paul
AU - De Oliveira, Tulio
AU - Dennis, Ann
AU - Hoppe, Anne
AU - Kityo, Cissy
AU - Frampton, Dan
AU - Ssemwanga, Deogratius
AU - Tanser, Frank
AU - Keshani, Jagoda
AU - Lingappa, Jairam
AU - Herbeck, Joshua
AU - Wawer, Maria
AU - Essex, Max
AU - Cohen, Myron S.
AU - Paton, Nicholas
AU - Ratmann, Oliver
AU - Kaleebu, Pontiano
AU - Hayes, Richard
AU - Fidler, Sarah
AU - Quinn, Thomas
AU - Novitsky, Vladimir
AU - Haywards, Andrew
AU - Nastouli, Eleni
AU - Morris, Steven
AU - Clark, Duncan
AU - Kozlakidis, Zisis
PY - 2016/12/23
Y1 - 2016/12/23
AB - HIV molecular epidemiology studies analyse viral pol gene sequences due to their availability, but whole genome sequencing allows the use of other genes. We aimed to determine which gene(s) provide(s) the best approximation to the real phylogeny by analysing a simulated epidemic (created as part of the PANGEA-HIV project) with a known transmission tree. We sub-sampled a simulated dataset of 4662 sequences into different combinations of genes (gag-pol-env, gag-pol, gag, pol, env and partial pol) and sampling depths (100%, 60%, 20% and 5%), generating 100 replicates for each case. We built maximum-likelihood trees for each combination using RAxML (GTR + Γ), and compared their topologies to the corresponding true tree's using CompareTree. The accuracy of the trees was significantly proportional to the length of the sequences used, with the gag-pol-env datasets showing the best performance and gag and partial pol sequences showing the worst. The lowest sampling depths (20% and 5%) greatly reduced the accuracy of tree reconstruction and showed high variability among replicates, especially when using the shortest gene datasets. In conclusion, using longer sequences derived from nearly whole genomes will improve the reliability of phylogenetic reconstruction. With low sample coverage, results can be highly variable, particularly when based on short sequences.
UR - http://www.scopus.com/inward/record.url?scp=85007277444&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85007277444&partnerID=8YFLogxK
U2 - 10.1038/srep39489
DO - 10.1038/srep39489
M3 - Article
C2 - 28008945
AN - SCOPUS:85007277444
SN - 2045-2322
VL - 6
JO - Scientific Reports
JF - Scientific Reports
M1 - 39489
ER -
