Integration Analysis of Three Omics Data Using Penalized Regression Methods: An Application to Bladder Cancer

Silvia Pineda; Francisco X. Real; Manolis Kogevinas; Alfredo Carrato; Stephen J. Chanock; Núria Malats; Kristel Van Steen

doi:10.1371/journal.pgen.1005689

School of Medicine

Research output: Contribution to journal › Article › peer-review

38 Scopus citations

Abstract

Omics data integration is becoming necessary to investigate the genomic mechanisms involved in complex diseases. During the integration process, many challenges arise, such as data heterogeneity, the small number of individuals relative to the number of parameters, multicollinearity, and the interpretation and validation of results, given their complexity and the lack of knowledge about biological processes. To overcome some of these issues, innovative statistical approaches are being developed. In this work, we propose a permutation-based method to concomitantly assess significance and correct for multiple testing with the MaxT algorithm. This was applied with penalized regression methods (LASSO and ENET) when exploring relationships between common genetic variants, DNA methylation and gene expression measured in bladder tumor samples. The overall analysis flow consisted of three steps: (1) SNPs/CpGs were selected for each gene probe within a 1 Mb window upstream and downstream of the gene; (2) LASSO and ENET were applied to assess the association between each expression probe and the selected SNPs/CpGs in three multivariable models (SNP, CpG, and Global models, the latter integrating SNPs and CpGs); and (3) the significance of each model was assessed using the permutation-based MaxT method. We identified 48 genes whose expression levels were significantly associated with both SNPs and CpGs. Importantly, 36 (75%) of them were replicated in an independent data set (TCGA), and the performance of the proposed method was checked with a simulation study. We further support our results with a biological interpretation based on an enrichment analysis. The approach we propose reduces computational time and is flexible and easy to implement when analyzing several types of omics data. Our results highlight the importance of integrating omics data by applying appropriate statistical strategies to discover new insights into the complex genetic mechanisms involved in disease conditions.
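
The permutation step (3) described in the abstract can be sketched roughly as follows. This is only a minimal illustration of single-step maxT adjustment under stated assumptions, not the authors' implementation: the statistic `model_stat` (largest absolute correlation between a probe and its predictors) is a hypothetical stand-in for a LASSO/ENET model statistic, and every function name, parameter name, and the toy data are assumptions introduced for illustration.

```python
import random
from statistics import mean


def abs_corr(x, y):
    """Absolute Pearson correlation; 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def model_stat(expr, predictors):
    """Hypothetical stand-in for a penalized-regression model statistic:
    the largest |correlation| between one probe and its SNP/CpG predictors."""
    return max(abs_corr(expr, p) for p in predictors)


def maxt_adjusted_pvalues(expr_by_probe, preds_by_probe, n_perm=200, seed=0):
    """Single-step maxT: each observed statistic is compared with the
    permutation distribution of the maximum statistic over all probes."""
    rng = random.Random(seed)
    probes = list(expr_by_probe)
    observed = {g: model_stat(expr_by_probe[g], preds_by_probe[g]) for g in probes}
    # Assumes every probe is measured on the same samples, in the same order.
    n = len(next(iter(expr_by_probe.values())))
    exceed = {g: 0 for g in probes}
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)  # one shared shuffle of sample labels per permutation
        max_stat = max(
            model_stat([expr_by_probe[g][i] for i in order], preds_by_probe[g])
            for g in probes
        )
        for g in probes:
            if max_stat >= observed[g]:
                exceed[g] += 1
    # Add-one correction keeps the adjusted p-values strictly positive.
    return {g: (exceed[g] + 1) / (n_perm + 1) for g in probes}


if __name__ == "__main__":
    toy = random.Random(1)
    n = 20
    snp = [float(i) for i in range(n)]
    expr = {
        "probe_A": [2.0 * v + 1.0 for v in snp],       # driven by its SNP
        "probe_B": [toy.gauss(0, 1) for _ in range(n)],  # pure noise
    }
    preds = {
        "probe_A": [snp, [toy.gauss(0, 1) for _ in range(n)]],
        "probe_B": [[toy.gauss(0, 1) for _ in range(n)]],
    }
    print(maxt_adjusted_pvalues(expr, preds))
```

Because the same shuffled sample order is used for all probes within a permutation, the correlation structure among probes is preserved, and comparing against the per-permutation maximum yields p-values that are already adjusted for the number of models tested. In a fuller version, `model_stat` would be replaced by a statistic derived from the fitted penalized model.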

Original language	English (US)
Article number	e1005689
Journal	PLoS Genetics
Volume	11
Issue number	12
DOIs	https://doi.org/10.1371/journal.pgen.1005689
State	Published - 2015

ASJC Scopus subject areas

Ecology, Evolution, Behavior and Systematics
Molecular Biology
Genetics
Genetics (clinical)
Cancer Research

Cite this

@article{b53f08acb5fa4cb4a536b118b6a4e384,

title = "Integration Analysis of Three Omics Data Using Penalized Regression Methods: An Application to Bladder Cancer",

abstract = "Omics data integration is becoming necessary to investigate the genomic mechanisms involved in complex diseases. During the integration process, many challenges arise, such as data heterogeneity, the small number of individuals relative to the number of parameters, multicollinearity, and the interpretation and validation of results, given their complexity and the lack of knowledge about biological processes. To overcome some of these issues, innovative statistical approaches are being developed. In this work, we propose a permutation-based method to concomitantly assess significance and correct for multiple testing with the MaxT algorithm. This was applied with penalized regression methods (LASSO and ENET) when exploring relationships between common genetic variants, DNA methylation and gene expression measured in bladder tumor samples. The overall analysis flow consisted of three steps: (1) SNPs/CpGs were selected for each gene probe within a 1 Mb window upstream and downstream of the gene; (2) LASSO and ENET were applied to assess the association between each expression probe and the selected SNPs/CpGs in three multivariable models (SNP, CpG, and Global models, the latter integrating SNPs and CpGs); and (3) the significance of each model was assessed using the permutation-based MaxT method. We identified 48 genes whose expression levels were significantly associated with both SNPs and CpGs. Importantly, 36 (75%) of them were replicated in an independent data set (TCGA), and the performance of the proposed method was checked with a simulation study. We further support our results with a biological interpretation based on an enrichment analysis. The approach we propose reduces computational time and is flexible and easy to implement when analyzing several types of omics data. Our results highlight the importance of integrating omics data by applying appropriate statistical strategies to discover new insights into the complex genetic mechanisms involved in disease conditions.",

author = "Silvia Pineda and Real, {Francisco X.} and Manolis Kogevinas and Alfredo Carrato and Chanock, {Stephen J.} and N{\'u}ria Malats and {Van Steen}, Kristel",

note = "Publisher Copyright: {\textcopyright} 2015 This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication",

year = "2015",

doi = "10.1371/journal.pgen.1005689",

language = "English (US)",

volume = "11",

journal = "PLoS Genetics",

issn = "1553-7390",

publisher = "Public Library of Science",

number = "12",

}

TY - JOUR

T1 - Integration Analysis of Three Omics Data Using Penalized Regression Methods

T2 - An Application to Bladder Cancer

AU - Pineda, Silvia

AU - Real, Francisco X.

AU - Kogevinas, Manolis

AU - Carrato, Alfredo

AU - Chanock, Stephen J.

AU - Malats, Núria

AU - Van Steen, Kristel

N1 - Publisher Copyright: © 2015 This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication

PY - 2015

Y1 - 2015

AB - Omics data integration is becoming necessary to investigate the genomic mechanisms involved in complex diseases. During the integration process, many challenges arise, such as data heterogeneity, the small number of individuals relative to the number of parameters, multicollinearity, and the interpretation and validation of results, given their complexity and the lack of knowledge about biological processes. To overcome some of these issues, innovative statistical approaches are being developed. In this work, we propose a permutation-based method to concomitantly assess significance and correct for multiple testing with the MaxT algorithm. This was applied with penalized regression methods (LASSO and ENET) when exploring relationships between common genetic variants, DNA methylation and gene expression measured in bladder tumor samples. The overall analysis flow consisted of three steps: (1) SNPs/CpGs were selected for each gene probe within a 1 Mb window upstream and downstream of the gene; (2) LASSO and ENET were applied to assess the association between each expression probe and the selected SNPs/CpGs in three multivariable models (SNP, CpG, and Global models, the latter integrating SNPs and CpGs); and (3) the significance of each model was assessed using the permutation-based MaxT method. We identified 48 genes whose expression levels were significantly associated with both SNPs and CpGs. Importantly, 36 (75%) of them were replicated in an independent data set (TCGA), and the performance of the proposed method was checked with a simulation study. We further support our results with a biological interpretation based on an enrichment analysis. The approach we propose reduces computational time and is flexible and easy to implement when analyzing several types of omics data. Our results highlight the importance of integrating omics data by applying appropriate statistical strategies to discover new insights into the complex genetic mechanisms involved in disease conditions.

UR - http://www.scopus.com/inward/record.url?scp=84953313516&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84953313516&partnerID=8YFLogxK

U2 - 10.1371/journal.pgen.1005689

DO - 10.1371/journal.pgen.1005689

M3 - Article

C2 - 26646822

AN - SCOPUS:84953313516

SN - 1553-7390

VL - 11

JO - PLoS Genetics

JF - PLoS Genetics

IS - 12

M1 - e1005689

ER -
