Within the pharmaceutical industry, clinical trials are essential for delivering new drugs and treatments to the public. At minimum, a novel drug must pass three phases of study to receive FDA approval. Phase 3 trials are usually the last and most difficult hurdle because of their scale, complexity, and duration, which ranges from 2 to 15 years.

Because of their logistical complexity, Phase 3 trials carry a very high risk of both delay and failure, which translates directly into delays in getting life-saving drugs to patients. In addition, trial success typically depends on the personal experience of the trial organizer. We hypothesize that objective, data-driven insight into trial duration can increase the chances of trial success.

Despite the large number of trials and a global clinical research services market valued at $55.86 billion, there is no robust, publicly accessible tool for predicting trial durations. Our work aims to fill this gap by developing a machine learning pipeline that provides duration estimates for Phase 3 oncology trials, the most resource-intensive phase of drug development.

We propose a Clinical AI Web API that uses machine learning to predict the duration of clinical trials. The tool is intended for Clinical Research Organizations (CROs), pharmaceutical companies, and researchers. Users enter their preliminary study parameters, and the API returns duration predictions that can support resource management, budgeting, and strategic planning.

Our Team

With diverse academic and professional expertise, we enrich our project with in-depth industry knowledge, innovative analytical perspectives, and advanced data management skills.

Cynthia Xu

Data and ML Engineering Lead
Backend Developer

Applied ML Fellow,
Los Alamos National Laboratory

Adeline Chin

Market Research Lead
Frontend Developer

Clinical Data Associate,
Translational Drug Development

Jooyeon Hahm

ML Research Lead
Website Developer

ML Engineer,
EBSCO Information Services

Data

Our project uses a substantial dataset from ClinicalTrials.gov, a comprehensive global registry of clinical research studies. Specifically, we sourced data from 19,049 completed interventional cancer studies conducted between January 1, 2011, and May 30, 2024, including 5,053 Phase 1 studies, 5,982 Phase 2 studies, and 1,634 Phase 3 studies. The dataset provides extensive details on each trial, including trial locations, enrollment numbers, participant eligibility, types of interventions, and methods of outcome measurement, which gives our machine learning model a robust foundation for predicting clinical trial durations.

Data to Features

Our feature engineering process transformed the study protocol data into trainable features. These include unique study identifiers, the durations of primary and overall study completion, and binned categories for those durations. We also captured the number of conditions and groups examined, participant age groups, and study locations, along with the types and numbers of interventions, sponsor types, intervention models, responsible parties, the presence of a data monitoring committee, allocation types, masking levels, enrollment counts, and the inclusion of healthy volunteers. Finally, the features cover study purposes (treatment, diagnostic, prevention, supportive), types of interventions (procedural, device, behavioral, drug, radiation, biological), and outcome measures (overall survival, duration of response, adverse outcome).
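As a rough illustration of this step, the sketch below turns one study record into a flat feature dictionary using counts, binary flags, and one-hot columns. The record layout, the field names, the category vocabularies, and the `build_features` helper are all assumptions made for the example; they are not the project's actual schema or code.

```python
# Hypothetical category vocabularies; the real pipeline's values are not listed in the text.
MASKING_LEVELS = ["NONE", "SINGLE", "DOUBLE", "TRIPLE", "QUADRUPLE"]
INTERVENTION_TYPES = ["PROCEDURE", "DEVICE", "BEHAVIORAL", "DRUG", "RADIATION", "BIOLOGICAL"]


def one_hot(value, vocabulary, prefix):
    """Return one 0/1 column per vocabulary item, set to 1 where it equals value."""
    return {f"{prefix}_{item.lower()}": int(value == item) for item in vocabulary}


def build_features(study):
    """Convert a study record (assumed dict layout) into numeric features."""
    features = {
        "num_conditions": len(study["conditions"]),
        "num_locations": len(study["locations"]),
        "num_interventions": len(study["interventions"]),
        "enrollment": study["enrollment"],
        "has_dmc": int(study["has_dmc"]),
        "healthy_volunteers": int(study["healthy_volunteers"]),
    }
    features.update(one_hot(study["masking"], MASKING_LEVELS, "masking"))
    # A study can use several intervention types, so these flags are multi-hot.
    kinds = {item["type"] for item in study["interventions"]}
    for kind in INTERVENTION_TYPES:
        features[f"intervention_{kind.lower()}"] = int(kind in kinds)
    return features


# Made-up record, for illustration only.
example_study = {
    "conditions": ["Lung Cancer", "NSCLC"],
    "locations": ["US", "DE", "JP"],
    "interventions": [{"type": "DRUG"}, {"type": "RADIATION"}],
    "enrollment": 420,
    "has_dmc": True,
    "healthy_volunteers": False,
    "masking": "DOUBLE",
}
features = build_features(example_study)
```

Keeping every feature numeric at this stage means the same dictionary can be fed to any of the tree-based models without further encoding.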

Our Unique Features

We extracted new variables from text data, such as the durations of primary and secondary outcome measurements and the numbers of inclusion and exclusion criteria. We also incorporated 5-year survival rates for specific cancer types to improve the model's performance in predicting study durations.

Durations of primary and secondary outcome measures
Numbers of patient inclusion and exclusion criteria
5-year survival rates for categorical cancer types
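One way the first two text-derived features could be computed is sketched below. It assumes that outcome time frames are free text such as "Up to 2 years" and that eligibility text lists criteria as bulleted or numbered lines under "Inclusion Criteria" and "Exclusion Criteria" headings. The regular expressions, the month conversion factors, and both function names are hypothetical and are not taken from the project's pipeline.

```python
import re

# Assumed conversion factors to months; the project's real units are not stated.
_MONTHS_PER_UNIT = {"day": 1 / 30.44, "week": 7 / 30.44, "month": 1.0, "year": 12.0}

_TIME_FRAME = re.compile(r"(\d+(?:\.\d+)?)\s*(day|week|month|year)s?", re.IGNORECASE)


def time_frame_months(text):
    """Return the longest duration mentioned in the text, in months, or None."""
    durations = [
        float(value) * _MONTHS_PER_UNIT[unit.lower()]
        for value, unit in _TIME_FRAME.findall(text)
    ]
    return max(durations) if durations else None


def count_criteria(eligibility_text):
    """Count bulleted or numbered lines under the inclusion and exclusion headings."""
    counts = {"inclusion": 0, "exclusion": 0}
    section = None
    for line in eligibility_text.splitlines():
        stripped = line.strip()
        lowered = stripped.lower()
        if lowered.startswith("inclusion criteria"):
            section = "inclusion"
        elif lowered.startswith("exclusion criteria"):
            section = "exclusion"
        elif section and re.match(r"(?:[-*•]|\d+[.)])\s+\S", stripped):
            counts[section] += 1
    return counts


# Made-up eligibility text, for illustration only.
sample_eligibility = """Inclusion Criteria:
- Age 18 or older
- Histologically confirmed diagnosis

Exclusion Criteria:
* Pregnancy
1. Prior chemotherapy within 4 weeks
"""
criteria_counts = count_criteria(sample_eligibility)
```

Real registry text is far less regular than this sample, so a production version would need to handle unlabeled sections, nested bullets, and time frames written out in words.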

Models

Using these features, we trained Random Forest, LightGBM, and XGBoost classification models to forecast study durations, with binning strategies of 2, 3, 4, and 5 bins. Our best-performing model achieved an accuracy of 0.803 and a precision of 0.805, which to our knowledge makes it the best publicly available model for forecasting study durations. This performance demonstrates the effectiveness of our feature engineering process and the robustness of our approach to complex clinical trial data. Sample model specifications and metrics follow.

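Model 1: Random Forest
Accuracy: 0.803
Training Set Size: 1,134
Study Phase: (value not shown)
Bins: (value not shown)

Model 2: LightGBM
Accuracy: 0.735
Training Set Size: 4,187
Study Phase: (value not shown)
Bins: (value not shown)

Model 3: XGBoost
Accuracy: 0.603
Training Set Size: 3,537
Study Phase: (value not shown)
Bins: (value not shown)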
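To make the binning strategies concrete, here is a minimal sketch of converting continuous durations into 2, 3, 4, or 5 class labels. It assumes equal-frequency (quantile) bins, which is only one possible choice; the text does not say how the project's bin edges were chosen. The sample durations and both function names are hypothetical.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_edges(durations, n_bins):
    """Return the n_bins - 1 cut points that split durations into equal-frequency bins."""
    return quantiles(durations, n=n_bins, method="inclusive")


def to_bin(duration, edges):
    """Map a duration to a bin index between 0 and len(edges)."""
    return bisect_right(edges, duration)


# Made-up trial durations in months, for illustration only.
durations_months = [12, 18, 24, 30, 36, 48, 60, 72]

for n_bins in (2, 3, 4, 5):
    edges = quantile_edges(durations_months, n_bins)
    labels = [to_bin(d, edges) for d in durations_months]
    print(n_bins, edges, labels)
```

With 2 bins the single cut point is the median, so the task reduces to a binary "shorter or longer than typical" classification. More bins give finer estimates but leave fewer training examples per class.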

Results

Our binary predictive model is more generalizable and more accurate than the most recently published models. It predicted whether Phase 3 trials, which are the longest and most complex, would align with the average duration, achieving an accuracy of 0.803. This result highlights the robustness and reliability of our approach and its potential to improve the planning and management of clinical trials. By providing duration estimates, our model can help clinical research organizations streamline operations and optimize resource allocation.

Long et al. (2024)

Trained on Phase 1 lymphoma studies
Tested only on lymphoma and lung cancer
Small testing set size
Did not address text data
Highest Accuracy 0.725

Our Model

Trained on all Phase 1, 2, 3 studies
Tested on all cancer types
Robust testing set size
Extracted trainable features from text data
Highest Accuracy 0.803
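For readers unfamiliar with the metrics quoted above, the following sketch shows how accuracy and precision are computed for a binary duration classifier. The labels are made up for illustration, and the single-class precision shown here is an assumption: the text does not state whether the reported precision is per-class, macro-averaged, or weighted.

```python
def accuracy(y_true, y_pred):
    """Fraction of trials whose predicted bin matches the true bin."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision(y_true, y_pred, positive=1):
    """Of the trials predicted as `positive`, the fraction that truly are."""
    predicted_positive = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted_positive:
        return 0.0  # No positive predictions, so precision is reported as 0 here.
    return sum(t == positive for t in predicted_positive) / len(predicted_positive)


# Made-up labels: 1 = longer than the reference duration, 0 = not.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy(y_true, y_pred))   # 6 of 8 predictions match
print(precision(y_true, y_pred))  # 3 of 4 predicted positives are correct
```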

Acknowledgements

We would like to express our sincere gratitude to Professor Puya Vahabi (Berkeley MIDS) for his invaluable guidance and support throughout this project. We also extend our heartfelt thanks to Professor Korin Reid (Berkeley MIDS) for her unwavering encouragement and assistance during challenging times. Their expertise and mentorship have been instrumental in the successful completion of this work. We also thank our classmates for their feedback and encouragement.
