Locating Protein Coding Regions in Human DNA Using a Decision Tree Algorithm

Steven Salzberg

doi:10.1089/cmb.1995.2.473

Research output: Contribution to journal › Article › peer-review

44 Scopus citations

Abstract

Genes in eukaryotic DNA cover hundreds or thousands of base pairs, while the regions of those genes that code for proteins may occupy only a small percentage of the sequence. Identifying the coding regions is of vital importance in understanding these genes. Many recent research efforts have studied computational methods for distinguishing between coding and noncoding regions, and several promising results have been reported. We describe here a new approach, using a machine learning system that builds decision trees from the data. This approach combines several coding measures to produce classifiers with consistently higher accuracies than previous methods, on DNA sequences ranging from 54 to 162 base pairs in length. The algorithm is very efficient, and it can easily be adapted to different sequence lengths. Our conclusion is that decision trees are a highly effective tool for identifying protein coding regions.
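
The abstract describes combining several coding measures in a decision tree applied to fixed-length windows of 54 to 162 base pairs. The following is only a minimal sketch of that general idea, not the paper's method: the two features (`gc_content`, `min_stop_codons`), the 0.5 threshold, and the hand-written tree in `classify` are all hypothetical stand-ins chosen for illustration, whereas the system in the paper builds its trees from data.

```python
# Illustrative sketch only: features, thresholds, and tree shape are assumptions,
# not the coding measures or learned trees used in the paper.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def windows(seq, length=54, step=3):
    """Yield (start, subsequence) for each fixed-length window over seq."""
    for start in range(0, len(seq) - length + 1, step):
        yield start, seq[start:start + length]


def gc_content(window):
    """Fraction of bases in the window that are G or C."""
    return sum(base in "GC" for base in window) / len(window)


def min_stop_codons(window):
    """Fewest stop codons found in any of the three forward reading frames."""
    counts = []
    for frame in range(3):
        codons = (window[i:i + 3] for i in range(frame, len(window) - 2, 3))
        counts.append(sum(codon in STOP_CODONS for codon in codons))
    return min(counts)


def classify(window):
    """Hand-written two-level tree; a learned tree would pick its own tests."""
    if min_stop_codons(window) > 0:
        # Every forward frame is interrupted by a stop codon.
        return "noncoding"
    return "coding" if gc_content(window) >= 0.5 else "noncoding"


if __name__ == "__main__":
    sequence = "ATGGCC" * 12  # toy 72 bp sequence, not real data
    for start, window in windows(sequence):
        print(start, classify(window))
```

In practice the thresholds and the order of the tests would be induced from labeled coding and noncoding windows rather than fixed by hand, and the window length can be changed through the `length` parameter.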

Original language	English (US)
Pages (from-to)	473-485
Number of pages	13
Journal	Journal of Computational Biology
Volume	2
Issue number	3
DOIs	https://doi.org/10.1089/cmb.1995.2.473
State	Published - 1995
Externally published	Yes

Keywords

coding regions
decision trees
exons
machine learning

ASJC Scopus subject areas

Modeling and Simulation
Molecular Biology
Genetics
Computational Mathematics
Computational Theory and Mathematics

Cite this

@article{d7c8c3ea19e1406e807139a0fc8ad6aa,
  title = "Locating Protein Coding Regions in Human DNA Using a Decision Tree Algorithm",
  abstract = "Genes in eukaryotic DNA cover hundreds or thousands of base pairs, while the regions of those genes that code for proteins may occupy only a small percentage of the sequence. Identifying the coding regions is of vital importance in understanding these genes. Many recent research efforts have studied computational methods for distinguishing between coding and noncoding regions, and several promising results have been reported. We describe here a new approach, using a machine learning system that builds decision trees from the data. This approach combines several coding measures to produce classifiers with consistently higher accuracies than previous methods, on DNA sequences ranging from 54 to 162 base pairs in length. The algorithm is very efficient, and it can easily be adapted to different sequence lengths. Our conclusion is that decision trees are a highly effective tool for identifying protein coding regions.",
  keywords = "coding regions, decision trees, exons, machine learning",
  author = "Steven Salzberg",
  year = "1995",
  doi = "10.1089/cmb.1995.2.473",
  language = "English (US)",
  volume = "2",
  pages = "473--485",
  journal = "Journal of Computational Biology",
  issn = "1066-5277",
  publisher = "Mary Ann Liebert Inc.",
  number = "3",
}

TY  - JOUR
T1  - Locating Protein Coding Regions in Human DNA Using a Decision Tree Algorithm
AU  - Salzberg, Steven
PY  - 1995
Y1  - 1995
N2  - Genes in eukaryotic DNA cover hundreds or thousands of base pairs, while the regions of those genes that code for proteins may occupy only a small percentage of the sequence. Identifying the coding regions is of vital importance in understanding these genes. Many recent research efforts have studied computational methods for distinguishing between coding and noncoding regions, and several promising results have been reported. We describe here a new approach, using a machine learning system that builds decision trees from the data. This approach combines several coding measures to produce classifiers with consistently higher accuracies than previous methods, on DNA sequences ranging from 54 to 162 base pairs in length. The algorithm is very efficient, and it can easily be adapted to different sequence lengths. Our conclusion is that decision trees are a highly effective tool for identifying protein coding regions.
AB  - Genes in eukaryotic DNA cover hundreds or thousands of base pairs, while the regions of those genes that code for proteins may occupy only a small percentage of the sequence. Identifying the coding regions is of vital importance in understanding these genes. Many recent research efforts have studied computational methods for distinguishing between coding and noncoding regions, and several promising results have been reported. We describe here a new approach, using a machine learning system that builds decision trees from the data. This approach combines several coding measures to produce classifiers with consistently higher accuracies than previous methods, on DNA sequences ranging from 54 to 162 base pairs in length. The algorithm is very efficient, and it can easily be adapted to different sequence lengths. Our conclusion is that decision trees are a highly effective tool for identifying protein coding regions.
KW  - coding regions
KW  - decision trees
KW  - exons
KW  - machine learning
UR  - http://www.scopus.com/inward/record.url?scp=0029365976&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=0029365976&partnerID=8YFLogxK
U2  - 10.1089/cmb.1995.2.473
DO  - 10.1089/cmb.1995.2.473
M3  - Article
C2  - 8521276
AN  - SCOPUS:0029365976
SN  - 1066-5277
VL  - 2
SP  - 473
EP  - 485
JO  - Journal of Computational Biology
JF  - Journal of Computational Biology
IS  - 3
ER  - 
