Telehealth/m-Health
Using Machine Learning to Predict Dropout from a Guided Self-Help Intervention Despite Low Sample Size: Testing a Novel Method
Isabella Starvaggi, B.S. (she/her/hers)
Ph.D. Student
Indiana University Bloomington
Bloomington, Indiana
Jacqueline Howard, B.A.
Lab Manager
Indiana University Bloomington
Bloomington, Indiana
Allison Peipert, B.S.
Graduate Student
Indiana University Bloomington
Bloomington, Indiana
Robinson de Jesus-Romero, M.S. (he/him/his)
Graduate Student
Indiana University
Bloomington, Indiana
John F. Buss, B.S.
Ph.D. Student
Indiana University Bloomington
Bloomington, Indiana
Colton M. Lind, B.S. (he/him/his)
Undergraduate Student
Indiana University Bloomington
Bloomington, Indiana
Kassandra Botts, B.A.
Coalition and Capacity Building Manager
National Viral Hepatitis Roundtable
Bloomington, Indiana
Lorenzo Lorenzo-Luaces, Ph.D.
Assistant Professor
Indiana University Bloomington
Bloomington, Indiana
Background: Low-intensity interventions are a promising way to address the public health burden of internalizing disorders. For example, unguided self-help is more scalable than face-to-face individual therapy. However, engagement is often low. Guided self-help is more costly but may improve engagement rates. The ability to accurately predict dropout from these interventions may aid in the efficient allocation of treatment resources. Supervised machine learning (ML) models have great potential utility for this application, but sample sizes in psychotherapy research are often too low to adequately train these models.
Aim: In this study, we aimed to build an ML model for predicting dropout in a guided self-help trial via a novel method: training the model on a large, inexpensive analog dataset in which individuals predicted their likelihood of hypothetically engaging with the same intervention.
Method: We used two samples: data previously collected in a trial of an online guided self-help intervention for internalizing distress (N = 198) and an analog dataset collected for the present study via the online research platform Prolific (N = 533). Participants in both samples met the same inclusion criteria and completed the same battery of questionnaires, including measures of demographics, emotion regulation, psychopathology, and personality. Participants in the Prolific sample read about the intervention and answered questions about their best guesses regarding their hypothetical engagement in the trial.
Analyses: After elastic-net variable screening, we trained a random forest model on the Prolific data to predict dropout. We then used that model to perform out-of-sample prediction on the trial sample. We also fit a logistic regression model to use as a benchmark comparator. For all analyses, we used 10-fold cross-validation and propensity-score weighting to adjust for baseline differences between the two samples.
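The analysis pipeline described above can be sketched in scikit-learn. This is an illustrative reconstruction, not the authors' code: the data below are synthetic stand-ins for the two samples, the hyperparameters (elastic-net mixing ratio, number of trees) are assumptions, and the propensity-score weighting step is omitted for brevity.

```python
# Hypothetical sketch of the pipeline: elastic-net variable screening,
# a random forest trained on the analog (Prolific-style) sample, and
# out-of-sample prediction on a trial-style sample. All data synthetic;
# propensity-score weighting from the abstract is not implemented here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

# Synthetic stand-ins for the analog (N = 533) and trial (N = 198) samples
X_analog, y_analog = make_classification(n_samples=533, n_features=40,
                                         n_informative=8, random_state=0)
X_trial, y_trial = make_classification(n_samples=198, n_features=40,
                                       n_informative=8, random_state=1)

# Step 1: elastic-net screening -- retain predictors with nonzero coefficients
screen = LogisticRegression(penalty="elasticnet", solver="saga",
                            l1_ratio=0.5, C=1.0, max_iter=5000)
screen.fit(X_analog, y_analog)
keep = np.abs(screen.coef_[0]) > 0

# Step 2: random forest on the screened analog data, 10-fold CV AUC
rf = RandomForestClassifier(n_estimators=500, random_state=0)
cv_auc = cross_val_score(rf, X_analog[:, keep], y_analog,
                         cv=10, scoring="roc_auc").mean()

# Step 3: out-of-sample prediction on the trial data
rf.fit(X_analog[:, keep], y_analog)
oos_auc = roc_auc_score(y_trial, rf.predict_proba(X_trial[:, keep])[:, 1])
print(f"CV AUC (analog): {cv_auc:.2f}  Out-of-sample AUC (trial): {oos_auc:.2f}")
```

Because the two synthetic samples here are unrelated, the out-of-sample AUC will hover near chance, which loosely mirrors the transfer failure the abstract reports.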
Results: Our ML model performed adequately on the held-out test set from the Prolific data (AUC = 0.70). However, in the out-of-sample predictions on the trial data, the model performed poorly (AUC = 0.53). The benchmark model performed comparably to the ML model both on the test set (AUC = 0.70) and out-of-sample (AUC = 0.52).
Conclusions: Our results suggest that survey respondents’ predictions about their engagement with guided self-help may have limited utility for predicting trial participants’ actual behavior. Our results also echo common criticisms of ML models: our ML model appears to have overfit to the training data and failed to outperform the benchmark regression model.