Using Machine Learning in the Evolving Landscape of Real-World Data

Zeynep Icten, Ph.D., Director of Data Science Solutions at Panalgo

According to the Food and Drug Administration (FDA), the term real-world data (RWD) refers to routinely collected data relating to patient health status and the delivery of healthcare services, and real-world evidence (RWE) is the clinical evidence regarding the usage and potential benefits or risks of a medical product derived from the analysis of RWD. Both RWD and RWE have attracted growing attention in the healthcare industry for years, and rightly so, considering that the healthcare analytics market is expected to expand at a compound annual growth rate of 28.9% between now and 2026. There’s no doubt that this massive data trove holds countless insights that could streamline care delivery, help physicians diagnose disease faster, and improve treatment strategies – if only we could identify them.

The data revolution underway in the healthcare industry calls for tools and approaches that can work with higher-dimensional data sources and truly harvest the insights buried in RWD. Machine learning (ML), an area of artificial intelligence (AI) comprising methodologies that algorithmically learn efficient representations of data and extract insights from them, offers promise and has been steadily gaining traction within the industry in the context of RWD. In 2018, the Centers for Medicare and Medicaid Services kicked off its first AI Health Outcomes Challenge, a cross-industry competition to innovate how AI can be implemented in current and future healthcare models. And in January 2021, the FDA released an Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) Action Plan, a clear indication of the technology’s rapid growth and legitimacy within healthcare.

Healthcare analytics is at a tipping point in adoption, and it is important that decision-makers develop clear strategies for leveraging ML-driven technologies. The true question is not why organizations should adopt machine learning, but how. How do companies use these advanced methodologies in a field like healthcare where there is an abundance of ambiguity and, at times, intuition-driven decision making? When are ML approaches preferable and when do they provide benefits over traditional statistical approaches? The answer depends on the type and quantity of data available, as well as the goals of the research. It is also crucial to understand the types of use cases and analyses where ML can shine and how these applications can help drive the industry forward.

The Two Cultures: Machine Learning and Statistics

ML offers several benefits that can improve healthcare analyses for initiatives like product development and launch, understanding patient populations, determining unmet medical needs, predicting patient outcomes or disease recurrence and scrutinizing real-world drug performance. ML methodologies work best with wide, clinically rich datasets where relationships between data elements can be highly nonlinear and complex. As the number of data elements and the complexity of associations among them increase, traditional approaches that require rigid assumptions about the data start to underperform. ML approaches, by contrast, learn the patterns in the data and make few or no assumptions about the data-generating process.

Traditional statistical approaches have a long-standing focus on inference, while the primary focus of ML has been on prediction, because ML can learn generalizable patterns in data. ML techniques can also operate without a priori hypotheses – that is, hypotheses based on assumed principles and deductions from previous research – and can suggest novel insights that might otherwise be overlooked in complex datasets. The trade-off is that gains in the predictive performance of ML models usually come at some cost to model interpretability. For example, sophisticated neural network models can often outperform other approaches, but they also require more effort to explain and interpret. The key questions, then, are what type of inference is the main goal and whether predictive performance or interpretability matters more.

To summarize, the main advantages of ML approaches include their ability to operate without a priori hypotheses, which allows maximum utility from large volumes of data, and, ultimately, the ability to suggest novel insights and offer better confounding control. Additionally, ML allows for the efficient handling of high-dimensional datasets and the modeling of nonlinear phenomena and interactions, which can lead to better predictive performance and model generalizability. It’s also important to emphasize that, despite these advantages, ML techniques are not necessarily a replacement for traditional statistical approaches; rather, they should be used to complement, expand upon and strengthen findings obtained through traditional methods. ML approaches are a valuable addition to the RWE toolbox and, when used in the right context, will help deliver the true value of RWD.

A Use Case for Machine Learning in Predicting Patient Outcomes  

One way to better understand how machine learning can improve healthcare analytics is by examining a use case where the technology has shown promising results. For example, we recently worked on a study that aimed to identify predictors of inpatient relapse among multiple sclerosis (MS) patients using administrative claims data, and to create a data-driven decision rule to discriminate between MS patients with and without an inpatient relapse. There are nearly a million patients in the U.S. living with MS, and relapse is usually associated with disability progression and worsening outcomes. Understanding which factors are associated with relapse can lead to better disease management and help stave off more significant impacts. To determine this, we used ML approaches focused on maximizing insights from RWD to develop robust predictive models while performing feature selection.

Using machine learning models, we achieved an area under the curve (AUC) of 79.3%, a sensitivity of 69% and a specificity of 75%. We identified the primary predictors of MS relapse, which include a previous inpatient or emergency room visit with an MS diagnosis, the number of MS-related encounters, the number of comorbidities, the use of home care services and durable medical equipment, epilepsy or convulsions, paralysis, urinary tract infections, and the use of muscle relaxants, anticonvulsants and antidepressants. Notable factors protective against relapse were a higher proportion of days covered (PDC), a measure of medication adherence, along with older age, being female and disease-modifying therapies administered as an infusion. Additionally, we were able to define a compact decision rule indicating that patients were more likely to have a relapse if they:

– Have 30 or more unique comorbidities, or

– Have a previous emergency room visit with an MS diagnosis and 10 or more previous MS-related encounters, or 

– Have 20 or more previous MS-related encounters. 
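
As a minimal sketch, the rule above can be written as a single function and checked against observed outcomes. The field names, the helper functions and the six-patient cohort below are hypothetical and invented purely for illustration; they are not taken from the study data, and the sketch does not reproduce how the rule itself was derived.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class PatientSummary:
    """Hypothetical per-patient summary built from claims data."""
    unique_comorbidities: int
    prior_ms_encounters: int
    prior_er_visit_with_ms_dx: bool


def flagged_for_relapse(p: PatientSummary) -> bool:
    """Return True if any of the three branches of the rule applies."""
    return (
        p.unique_comorbidities >= 30
        or (p.prior_er_visit_with_ms_dx and p.prior_ms_encounters >= 10)
        or p.prior_ms_encounters >= 20
    )


def sensitivity_specificity(
    predicted: Sequence[bool], observed: Sequence[bool]
) -> Tuple[float, float]:
    """Compute sensitivity and specificity from paired binary labels."""
    pairs = list(zip(predicted, observed))
    tp = sum(1 for p, o in pairs if p and o)
    fn = sum(1 for p, o in pairs if not p and o)
    tn = sum(1 for p, o in pairs if not p and not o)
    fp = sum(1 for p, o in pairs if p and not o)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Invented toy cohort: (patient summary, observed inpatient relapse).
toy_cohort = [
    (PatientSummary(32, 4, False), True),
    (PatientSummary(5, 12, True), True),
    (PatientSummary(8, 25, False), False),
    (PatientSummary(3, 12, False), False),
    (PatientSummary(10, 2, True), True),
    (PatientSummary(1, 0, False), False),
]

flags = [flagged_for_relapse(patient) for patient, _ in toy_cohort]
outcomes = [relapsed for _, relapsed in toy_cohort]
sens, spec = sensitivity_specificity(flags, outcomes)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Because the rule needs only comorbidity and encounter counts plus one visit indicator, the same function could in principle be applied to any claims database that exposes those three measures, which is what makes external validation straightforward to set up.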

This decision rule has several advantages. It is completely data-driven and evidence-based. It is also interpretable and intuitive. Lastly, it uses measures available in all claims datasets and is therefore portable to other databases. After appropriate external validation, this decision rule can be used as a proxy for disease severity and applied to stratify patients by their likelihood of relapse in database studies. It can also be used to target interventions proactively, by identifying the patients likely to benefit most from additional preventive measures such as support for medication adherence or a new medication regimen. These proactive interventions can improve patients’ quality of life and reduce the need for additional interactions with the healthcare system, which can ease the pressure on already overburdened provider organizations.

The Next Frontier for Machine Learning 

ML approaches have the uncanny ability to comb through high-dimensional, real-world datasets and separate out key insights, which can lead to healthcare cost savings and improve patient quality of life. The sciences of AI and ML continue to evolve at high speed, and new technologies and platforms enable the application of that science to real-world questions. But the breakthrough in successfully adopting these methodologies will come not only from increased computing power, more data or better software. It will also come from a deep understanding of the types of problems best suited to be formulated as ML problems, and from tools and methods that render the models transparent and explainable.

Companies looking to implement ML in their analytics programs can benefit from a sophisticated, healthcare-specific platform to help them cut through the ambiguity and complexity of healthcare data. As more companies embrace the technology and learn how to properly capitalize on its abilities, ML has the potential to substantially change the healthcare industry and how we use data and analytics moving forward.


About Zeynep Icten, Ph.D.
Zeynep is the Director of Data Science Solutions at Panalgo, where she oversees training and support of the IHD Data Science Module. Her expertise lies in data-driven approaches in healthcare analytics using real-world data sources. Zeynep has a wealth of experience working with various key players in the healthcare space, including pharmaceutical companies, payers and providers, and helping them extract actionable insights using data science and machine learning methods. She holds an M.S. and Ph.D. in Industrial Engineering from the University of Pittsburgh.