TY - JOUR
T1 - Monte-Carlo methods for determining optimal number of significant variables. Application to mouse urinary profiles
AU - Wongravee, Kanet
AU - Lloyd, Gavin R.
AU - Hall, John
AU - Holmboe, Maria E.
AU - Schaefer, Michele L.
AU - Reed, Randall R.
AU - Trevejo, Jose
AU - Brereton, Richard G.
N1 - Funding Information:
Acknowledgements: We thank Dr. Sarah Dixon and Dr. Yun Xu of the Centre of Chemometrics for developing software used in this project and for valuable discussions. This work was sponsored by ARO Contract DAAD19-03-1-0215. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.
PY - 2009/12
Y1 - 2009/12
AB - Three methods for variable selection are described, namely the t-statistic, Partial Least Squares Discriminant Analysis (PLS-DA) weights and PLS-DA regression coefficients, with the aim of determining which variables are the most significant markers for discriminating between two groups: a variable's level of significance is related to its magnitude. Monte-Carlo methods are employed to determine the empirical significance of variables, by randomly permuting the class membership 5000 times to obtain null distributions and comparing the observed statistic for each variable with its null distribution. Seven simulated datasets, each consisting of 200 samples divided equally between two classes and 300 variables, are constructed: in one dataset there are no induced correlations between variables; in two datasets correlations are induced but there is no induced separation between the classes; and in four datasets separation is induced by selecting 20 of the variables to be discriminators. In addition, two metabolomic datasets, consisting of GCMS of urinary extracts from mice, were analysed to determine the effect of stress and the effect of diet on the urinary chemosignal. It is shown that the t-statistic combined with Monte-Carlo permutations provides similar results to PLS weights. PLS regression coefficients find the fewest markers but, for the simulations, give the lowest false positive rates.
KW - GCMS
KW - Monte-Carlo methods
KW - Mouse urine
KW - Partial Least Squares Discriminant Analysis
KW - Variable selection
KW - Volatiles
UR - http://www.scopus.com/inward/record.url?scp=74449084100&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=74449084100&partnerID=8YFLogxK
U2 - 10.1007/s11306-009-0164-4
DO - 10.1007/s11306-009-0164-4
M3 - Article
AN - SCOPUS:74449084100
SN - 1573-3882
VL - 5
SP - 387
EP - 406
JO - Metabolomics
JF - Metabolomics
IS - 4
ER -