Using machine learning to identify major shifts in human gut microbiome protein family abundance in disease

Mehrdad Yazdani, Bryn C. Taylor, Justine W. Debelius, Weizhong Li, Rob Knight, Larry Smarr

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Inflammatory Bowel Disease (IBD) is an autoimmune condition that is observed to be associated with major alterations in the gut microbiome taxonomic composition. Here we classify major changes in microbiome protein family abundances between healthy subjects and IBD patients. We use machine learning to analyze results obtained previously from computing relative abundance of ∼10,000 KEGG orthologous protein families in the gut microbiome of a set of healthy individuals and IBD patients. We develop a machine learning pipeline, involving the Kolomogorv-Smirnov test, to identify the 100 most statistically significant entries in the KEGG database. Then we use these 100 as a training set for a Random Forest classifier to determine ∼5% the KEGGs which are best at separating disease and healthy states. Lastly, we developed a Natural Language Processing classifier of the KEGG description files to predict KEGG relative over-or under-abundance. As we expand our analysis from 10,000 KEGG protein families to one million proteins identified in the gut microbiome, scalable methods for quickly identifying such anomalies between health and disease states will be increasingly valuable for biological interpretation of sequence data.

Original languageEnglish (US)
Title of host publicationProceedings - 2016 IEEE International Conference on Big Data, Big Data 2016
EditorsRonay Ak, George Karypis, Yinglong Xia, Xiaohua Tony Hu, Philip S. Yu, James Joshi, Lyle Ungar, Ling Liu, Aki-Hiro Sato, Toyotaro Suzumura, Sudarsan Rachuri, Rama Govindaraju, Weijia Xu
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages1272-1280
Number of pages9
ISBN (Electronic)9781467390040
DOIs
StatePublished - 2016
Externally publishedYes
Event4th IEEE International Conference on Big Data, Big Data 2016 - Washington, United States
Duration: Dec 5 2016Dec 8 2016

Publication series

NameProceedings - 2016 IEEE International Conference on Big Data, Big Data 2016

Conference

Conference4th IEEE International Conference on Big Data, Big Data 2016
Country/TerritoryUnited States
CityWashington
Period12/5/1612/8/16

Keywords

  • ibd
  • kegg
  • machine learning
  • microbiome
  • pca
  • random forest

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Hardware and Architecture

Fingerprint

Dive into the research topics of 'Using machine learning to identify major shifts in human gut microbiome protein family abundance in disease'. Together they form a unique fingerprint.

Cite this