Dissemination & Implementation Science
McCall A. Schruff, B.A.
Clinical Psychology Graduate Student
University of Mississippi
Oxford, Mississippi
John Young, Ph.D.
Professor of Psychology
University of Mississippi
University, Mississippi
Carolyn Humphrey, B.A.
Student
University of Mississippi
Oxford, Mississippi
Introduction: Artificial intelligence (AI) is expected to transform mental health science (Luxton, 2014), yet AI approaches for assessing patient-provider interactions remain underutilized (Miller et al., 2018). One AI-driven fidelity feedback system, Lyssn.io, evaluates recordings of patient-provider interactions and assigns a Cognitive Therapy Rating Scale (CTRS; Young & Beck, 1980) score using natural language processing. The CTRS measures CBT fidelity, which is useful in a variety of contexts related to dissemination and implementation science (Goldberg et al., 2020). The ability to produce automated feedback from uploaded recordings could therefore make fidelity monitoring and numerous aspects of training more efficient. Although commercially available products such as Lyssn have this capacity, their accuracy has not been extensively examined. The current study therefore aimed to evaluate the reliability of Lyssn's scores relative to human raters' scores.
Method: Data collection entailed evaluation of 35 role-play sessions conducted by three inexperienced student clinicians (two females, ages 22 and 24, and one male, age 27) using the Unified Protocol (Barlow et al., 2017). Both female students had participated in a prior pilot study that also evaluated clinical skills, but they otherwise had limited exposure to the treatment materials. Each video was rated on the CTRS by two human research assistants and, independently, by the Lyssn software. CTRS ratings were analyzed for consistency using mixed-model intraclass correlations (ICC; McGraw & Wong, 1996; Shrout & Fleiss, 1979), first between the two human raters and then with the AI-generated scores included.
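As an illustration only, the sketch below shows how a mixed-model ICC analysis of this kind might be computed in Python with the pingouin library; the data frame, column names, and scores are hypothetical placeholders rather than the study's actual ratings.

```python
# Minimal sketch of a two-way mixed-model ICC analysis using pingouin.
# All variable names and scores are illustrative placeholders.
import pandas as pd
import pingouin as pg

# Long-format data: one row per (session, rater) pair, holding the total
# CTRS score that rater assigned to that session.
ratings = pd.DataFrame({
    "session": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "rater":   ["human_1", "human_2", "lyssn"] * 3,
    "ctrs":    [32, 35, 34, 28, 31, 30, 40, 37, 39],  # placeholder scores
})

# Agreement between the two human raters only.
humans = ratings[ratings["rater"] != "lyssn"]
icc_humans = pg.intraclass_corr(
    data=humans, targets="session", raters="rater", ratings="ctrs"
)

# Agreement with Lyssn's ratings included as a third "rater".
icc_all = pg.intraclass_corr(
    data=ratings, targets="session", raters="rater", ratings="ctrs"
)

# ICC3 / ICC3k rows correspond to the two-way mixed-effects forms
# described by McGraw & Wong (1996).
print(icc_humans[icc_humans["Type"].isin(["ICC3", "ICC3k"])])
print(icc_all[icc_all["Type"].isin(["ICC3", "ICC3k"])])
```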
Results: Reliability between the two human raters was at the high end of the moderate range (ICC = 0.63), and including Lyssn's ratings increased reliability (ICC = 0.72). When each rater was compared individually to Lyssn, the first rater again exhibited moderate reliability (ICC = 0.57) and the second exhibited good reliability (ICC = 0.70). Notably, when the raters' scores diverged, they consistently fell on opposite sides of Lyssn's rating. The human raters' scores were therefore averaged for the final comparison to the scores assigned by Lyssn's AI software, which indicated strong concordance (ICC = 0.74).
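Continuing the hypothetical sketch above, the averaging step and final comparison could look like the following; the values remain placeholders and do not reproduce the ICCs reported here.

```python
# Sketch of the final comparison: average the two human raters' CTRS
# scores per session, then compute an ICC treating the averaged human
# score and Lyssn's score as two "raters". Uses the placeholder
# `ratings` and `humans` data frames defined in the previous sketch.
avg_human = (
    humans.groupby("session", as_index=False)["ctrs"]
    .mean()
    .assign(rater="human_avg")
)
lyssn = ratings[ratings["rater"] == "lyssn"]
final = pd.concat([avg_human, lyssn], ignore_index=True)

icc_final = pg.intraclass_corr(
    data=final, targets="session", raters="rater", ratings="ctrs"
)
print(icc_final[icc_final["Type"] == "ICC3"])
```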
Discussion: These results suggest that Lyssn's ratings were reliable relative to human raters, potentially making AI-generated feedback useful for distinguishing clinicians' strongest skills from areas in need of improvement. Applying this commercially available resource in applied training environments could thus make clinical skill development more efficient. As research in this area expands, many other uses relevant to quality assurance, clinical training, and practice management could also become feasible (which will be discussed).