TY - JOUR
T1 - What do neighbors tell about you
T2 - The local context of cis-regulatory modules complicates prediction of regulatory variants
AU - Penzar, Dmitry D.
AU - Zinkevich, Arsenii O.
AU - Vorontsov, Ilya E.
AU - Sitnik, Vasily V.
AU - Favorov, Alexander V.
AU - Makeev, Vsevolod J.
AU - Kulakovskiy, Ivan V.
N1 - Funding Information:
This study was supported by the Russian Foundation for Basic Research grants 18-34-20024 and 19-29-04131, Skoltech Systems Biology Fellowship (to IV), Program “Postgenomic technologies and perspective solutions in the biomedicine” of the RAS Presidium, project АААА-А19-119091090024-4, and Russian Program of Fundamental Research for State Academies. The CAGI experiment coordination was supported by NIH U41 HG007446 and the CAGI conference by NIH R13 HG006650.
We thank CAGI organizers and personally Gaia Andreoletti and Lipika Ray. We personally thank Martin Kircher for sharing the complete experimental data of the “Regulation Saturation” challenge. We thank the Institute of Systems Biology, Ltd, BIOSOFT.RU, and personally Fedor Kolpakov for providing direct access to the GTRD data.
Publisher Copyright:
© 2019 Penzar, Zinkevich, Vorontsov, Sitnik, Favorov, Makeev and Kulakovskiy.
PY - 2019
Y1 - 2019
N2 - Many problems of modern genetics and functional genomics require the assessment of functional effects of sequence variants, including gene expression changes. Machine learning is considered to be a promising approach for solving this task, but its practical applications remain a challenge due to the insufficient volume and diversity of training data. A promising source of valuable data is a saturation mutagenesis massively parallel reporter assay, which quantitatively measures changes in transcription activity caused by sequence variants. Here, we explore the computational predictions of the effects of individual single-nucleotide variants on gene transcription measured in the massively parallel reporter assays, based on the data from the recent “Regulation Saturation” Critical Assessment of Genome Interpretation challenge. We show that the estimated prediction quality strongly depends on the structure of the training and validation data. Particularly, training on the sequence segments located next to the validation data results in the “information leakage” caused by the local context. This information leakage allows reproducing the prediction quality of the best CAGI challenge submissions with a fairly simple machine learning approach, and even obtaining notably better-than-random predictions using irrelevant genomic regions. Validation scenarios preventing such information leakage dramatically reduce the measured prediction quality. The performance at independent regulatory regions entirely excluded from the training set appears to be much lower than needed for practical applications, and even the performance estimation will become reliable only in the future with richer data from multiple reporters. The source code and data are available at https://bitbucket.org/autosomeru_cagi2018/cagi2018_regsat and https://genomeinterpretation.org/content/expression-variants.
KW - Enhancers
KW - Machine learning
KW - Promoters
KW - RSNP
KW - Regulatory variants
KW - Saturation mutagenesis massively parallel reporter assay
UR - http://www.scopus.com/inward/record.url?scp=85074793774&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85074793774&partnerID=8YFLogxK
U2 - 10.3389/fgene.2019.01078
DO - 10.3389/fgene.2019.01078
M3 - Article
C2 - 31737053
AN - SCOPUS:85074793774
SN - 1664-8021
VL - 10
JO - Frontiers in Genetics
JF - Frontiers in Genetics
IS - OCT
M1 - 1078
ER -