Flexible propensity score estimation strategies for clustered data in observational studies

Ting Hsuan Chang; Trang Quynh Nguyen; Youjin Lee; John W. Jackson; Elizabeth A. Stuart

doi:10.1002/sim.9551

Bloomberg School of Public Health

Research output: Contribution to journal › Article › peer-review

Abstract

Existing studies have suggested superior performance of nonparametric machine learning over logistic regression for propensity score estimation. However, it is unclear whether the advantages of nonparametric propensity score modeling are carried to settings where there is clustering of individuals, especially when there is unmeasured cluster-level confounding. In this work we examined the performance of logistic regression (all main effects), Bayesian additive regression trees and generalized boosted modeling for propensity score weighting in clustered settings, with the clustering being accounted for by including either cluster indicators or random intercepts. We simulated data for three hypothetical observational studies of varying sample and cluster sizes. Confounders were generated at both levels, including a cluster-level confounder that is unobserved in the analyses. A binary treatment and a continuous outcome were generated based on seven scenarios with varying relationships between the treatment and confounders (linear and additive, nonlinear/nonadditive, nonadditive with the unobserved cluster-level confounder). Results suggest that when the sample and cluster sizes are large, nonparametric propensity score estimation may provide better covariate balance, bias reduction, and 95% confidence interval coverage, regardless of the degree of nonlinearity or nonadditivity in the true propensity score model. When the sample or cluster sizes are small, however, nonparametric approaches may become more vulnerable to unmeasured cluster-level confounding and thus may not be a better alternative to multilevel logistic regression. We applied the methods to the National Longitudinal Study of Adolescent to Adult Health data, estimating the effect of team sports participation during adolescence on adulthood depressive symptoms.
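The clustered-data problem described in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' simulation design: the sample sizes, coefficients, ridge penalty, weight clipping, and the helper names `fit_logistic`, `ipw_ate`, and `simulate` are all hypothetical choices made for illustration. It generates one observed individual-level confounder and one unobserved cluster-level confounder, then compares inverse probability weighting based on a main-effects logistic propensity model with and without cluster indicators.

```python
import numpy as np


def fit_logistic(X, t, ridge=1e-2, iters=50):
    """Logistic regression by Newton's method with a small ridge penalty.

    The ridge term is an assumption added for numerical stability when
    cluster indicators are included; it is not part of the source.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        grad = X.T @ (t - p) - ridge * beta
        hess = (X * w[:, None]).T @ X + ridge * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < 1e-8:
            break
    return beta


def ipw_ate(X, t, y):
    """Normalized inverse-probability-weighted difference in mean outcomes."""
    beta = fit_logistic(X, t)
    ps = 1.0 / (1.0 + np.exp(-X @ beta))
    ps = np.clip(ps, 0.01, 0.99)  # hypothetical clipping to limit extreme weights
    w1 = t / ps
    w0 = (1 - t) / (1 - ps)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)


def simulate(n_clusters=50, cluster_size=100, effect=1.0, seed=0):
    """Clustered data with an unobserved cluster-level confounder u.

    All coefficients below are hypothetical.
    """
    rng = np.random.default_rng(seed)
    cluster = np.repeat(np.arange(n_clusters), cluster_size)
    u = rng.normal(size=n_clusters)[cluster]  # unobserved, shared within cluster
    x = rng.normal(size=cluster.size)         # observed, individual level
    p_treat = 1.0 / (1.0 + np.exp(-(0.5 * x + 1.0 * u)))
    t = rng.binomial(1, p_treat).astype(float)
    y = effect * t + x + 1.5 * u + rng.normal(size=cluster.size)
    return cluster, x, t, y


true_effect = 1.0
cluster, x, t, y = simulate(effect=true_effect)

# Propensity model that ignores clustering: intercept plus observed confounder.
X_naive = np.column_stack([np.ones_like(x), x])

# Propensity model with cluster indicators (the indicators absorb the intercept).
indicators = np.eye(cluster.max() + 1)[cluster]
X_fixed = np.column_stack([x, indicators])

est_naive = ipw_ate(X_naive, t, y)
est_fixed = ipw_ate(X_fixed, t, y)

print("true effect:", true_effect)
print("IPW, clustering ignored:", est_naive)
print("IPW, cluster indicators:", est_fixed)
```

In this large-cluster configuration the cluster indicators stand in for the unobserved cluster-level confounder, so the second estimate should fall closer to the true effect than the first. The sketch covers only the logistic regression arm; the random-intercept specification, Bayesian additive regression trees, and generalized boosted modeling examined in the article are not implemented here.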

Original language	English (US)
Pages (from-to)	5016-5032
Number of pages	17
Journal	Statistics in Medicine
Volume	41
Issue number	25
DOIs	https://doi.org/10.1002/sim.9551
State	Published - Nov 10 2022

Keywords

clustering
machine learning
observational studies
propensity score weighting
unmeasured confounder

ASJC Scopus subject areas

Epidemiology
Statistics and Probability

Access to Document

10.1002/sim.9551

Cite this

@article{218e2a466b8340448ecbc4a8290de290,

title = "Flexible propensity score estimation strategies for clustered data in observational studies",

abstract = "Existing studies have suggested superior performance of nonparametric machine learning over logistic regression for propensity score estimation. However, it is unclear whether the advantages of nonparametric propensity score modeling are carried to settings where there is clustering of individuals, especially when there is unmeasured cluster-level confounding. In this work we examined the performance of logistic regression (all main effects), Bayesian additive regression trees and generalized boosted modeling for propensity score weighting in clustered settings, with the clustering being accounted for by including either cluster indicators or random intercepts. We simulated data for three hypothetical observational studies of varying sample and cluster sizes. Confounders were generated at both levels, including a cluster-level confounder that is unobserved in the analyses. A binary treatment and a continuous outcome were generated based on seven scenarios with varying relationships between the treatment and confounders (linear and additive, nonlinear/nonadditive, nonadditive with the unobserved cluster-level confounder). Results suggest that when the sample and cluster sizes are large, nonparametric propensity score estimation may provide better covariate balance, bias reduction, and 95% confidence interval coverage, regardless of the degree of nonlinearity or nonadditivity in the true propensity score model. When the sample or cluster sizes are small, however, nonparametric approaches may become more vulnerable to unmeasured cluster-level confounding and thus may not be a better alternative to multilevel logistic regression. We applied the methods to the National Longitudinal Study of Adolescent to Adult Health data, estimating the effect of team sports participation during adolescence on adulthood depressive symptoms.",

keywords = "clustering, machine learning, observational studies, propensity score weighting, unmeasured confounder",

author = "Chang, {Ting Hsuan} and Nguyen, {Trang Quynh} and Lee, Youjin and Jackson, {John W.} and Stuart, {Elizabeth A.}",

note = "Publisher Copyright: {\textcopyright} 2022 John Wiley \& Sons Ltd.",

year = "2022",

month = nov,

day = "10",

doi = "10.1002/sim.9551",

language = "English (US)",

volume = "41",

pages = "5016--5032",

journal = "Statistics in Medicine",

issn = "0277-6715",

publisher = "John Wiley and Sons Ltd",

number = "25",

}

TY - JOUR

T1 - Flexible propensity score estimation strategies for clustered data in observational studies

AU - Chang, Ting Hsuan

AU - Nguyen, Trang Quynh

AU - Lee, Youjin

AU - Jackson, John W.

AU - Stuart, Elizabeth A.

PY - 2022/11/10

Y1 - 2022/11/10

AB - Existing studies have suggested superior performance of nonparametric machine learning over logistic regression for propensity score estimation. However, it is unclear whether the advantages of nonparametric propensity score modeling are carried to settings where there is clustering of individuals, especially when there is unmeasured cluster-level confounding. In this work we examined the performance of logistic regression (all main effects), Bayesian additive regression trees and generalized boosted modeling for propensity score weighting in clustered settings, with the clustering being accounted for by including either cluster indicators or random intercepts. We simulated data for three hypothetical observational studies of varying sample and cluster sizes. Confounders were generated at both levels, including a cluster-level confounder that is unobserved in the analyses. A binary treatment and a continuous outcome were generated based on seven scenarios with varying relationships between the treatment and confounders (linear and additive, nonlinear/nonadditive, nonadditive with the unobserved cluster-level confounder). Results suggest that when the sample and cluster sizes are large, nonparametric propensity score estimation may provide better covariate balance, bias reduction, and 95% confidence interval coverage, regardless of the degree of nonlinearity or nonadditivity in the true propensity score model. When the sample or cluster sizes are small, however, nonparametric approaches may become more vulnerable to unmeasured cluster-level confounding and thus may not be a better alternative to multilevel logistic regression. We applied the methods to the National Longitudinal Study of Adolescent to Adult Health data, estimating the effect of team sports participation during adolescence on adulthood depressive symptoms.

KW - clustering

KW - machine learning

KW - observational studies

KW - propensity score weighting

KW - unmeasured confounder

UR - http://www.scopus.com/inward/record.url?scp=85136939111&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85136939111&partnerID=8YFLogxK

U2 - 10.1002/sim.9551

DO - 10.1002/sim.9551

M3 - Article

C2 - 36263918

AN - SCOPUS:85136939111

SN - 0277-6715

VL - 41

SP - 5016

EP - 5032

JO - Statistics in Medicine

JF - Statistics in Medicine

IS - 25

ER -
