Bayesian copy number detection and association in large-scale studies

Stephen Cristiano, David McKean, Jacob Carey, Paige Bracci, Paul Brennan, Michael Chou, Mengmeng Du, Steven Gallinger, Michael G. Goggins, Manal M. Hassan, Rayjean J. Hung, Robert C. Kurtz, Donghui Li, Lingeng Lu, Rachel Neale, Sara Olson, Gloria Petersen, Kari G. Rabe, Jack Fu, Harvey Risch, Gary L. Rosner, Ingo Ruczinski, Alison P. Klein, Robert B. Scharpf

Research output: Contribution to journal › Article › peer-review


Background: Germline copy number variants (CNVs) increase risk for many diseases, yet detection of CNVs and quantifying their contribution to disease risk in large-scale studies is challenging due to biological and technical sources of heterogeneity that vary across the genome within and between samples.

Methods: We developed an approach called CNPBayes to identify latent batch effects in genome-wide association studies involving copy number, to provide probabilistic estimates of integer copy number across the estimated batches, and to fully integrate the copy number uncertainty in the association model for disease.

Results: Applying a hidden Markov model (HMM) to identify CNVs in a large multi-site Pancreatic Cancer Case Control study (PanC4) of 7598 participants, we found CNV inference was highly sensitive to technical noise that varied appreciably among participants. Applying CNPBayes to this dataset, we found that the major sources of technical variation were linked to sample processing by the centralized laboratory and not the individual study sites. Modeling the latent batch effects at each CNV region hierarchically, we developed probabilistic estimates of copy number that were directly incorporated in a Bayesian regression model for pancreatic cancer risk. Candidate associations aided by this approach include deletions of 8q24 near regulatory elements of the tumor oncogene MYC and of Tumor Suppressor Candidate 3 (TUSC3).

Conclusions: Laboratory effects may not account for the major sources of technical variation in genome-wide association studies. This study provides a robust Bayesian inferential framework for identifying latent batch effects, estimating copy number, and evaluating the role of copy number in heritable diseases.

Original language: English (US)
Article number: 856
Journal: BMC Cancer
Issue number: 1
State: Published - Sep 7 2020


Keywords

  • Batch effects
  • CNPBayes
  • Copy number variants
  • Genome-wide association
  • Pancreatic cancer
  • SNP array

ASJC Scopus subject areas

  • Oncology
  • Genetics
  • Cancer Research

