Within the pharmaceutical industry, clinical trials are essential for delivering new drugs and treatments to the public. At minimum, a novel drug must pass three phases of study to receive FDA approval. Phase 3 trials are usually the last and most difficult hurdle to clear because of their scale, complexity, and duration, which can range from 2 to 15 years.
Because of their logistical complexity and purpose, Phase 3 trials carry a very high risk of both delay and failure, which translates directly into delays in getting life-saving drugs to patients. In addition, trial success typically depends on the personal experience of the trial organizer. We hypothesize that providing objective, data-driven insight into trial duration, rather than relying on individual experience alone, can increase the chances of trial success.
Despite the large number of trials and a global clinical research services market valued at $55.86 billion, there is no robust, publicly accessible tool for predicting trial durations. Our work aims to fill this gap with a machine learning pipeline that estimates the duration of Phase 3 oncology trials, the most resource-intensive phase in drug development.
We propose a Clinical AI Web API that uses machine learning to predict the duration of clinical trials. The tool is intended for Clinical Research Organizations (CROs), pharmaceutical companies, and researchers. Users enter their preliminary study parameters, and the API returns a duration prediction that can support resource management, budgeting, and strategic planning.
With diverse academic and professional expertise, we enrich our project with in-depth industry knowledge, innovative analytical perspectives, and advanced data management skills.
Applied ML Fellow, Los Alamos National Laboratory
Clinical Data Associate, Translational Drug Development
ML Engineer, EBSCO Information Services
Our project uses a substantial dataset from ClinicalTrials.gov, a comprehensive global registry of clinical research studies. Specifically, we sourced data from 19,049 completed interventional cancer studies conducted between January 1, 2011, and May 30, 2024, including 5,053 Phase 1 studies, 5,982 Phase 2 studies, and 1,634 Phase 3 studies. The dataset provides extensive details on each trial, including trial locations, enrollment numbers, participant eligibility, types of interventions, and methods of outcome measurement, which gives our machine learning model a solid foundation for predicting clinical trial durations.
Our feature engineering process transformed the study protocol data into trainable features. These include unique study identifiers, the durations of primary and overall study completion, and binned categories for those durations. We also recorded the number of conditions and groups examined, the age groups of participants, and the study locations. Further features cover the types and numbers of interventions, sponsor types, intervention models, responsible parties, the presence of a data monitoring committee, allocation types, masking levels, enrollment counts, and the inclusion of healthy volunteers. Finally, the features capture study purposes (treatment, diagnostic, prevention, supportive), types of interventions (procedural, device, behavioral, drug, radiation, biological), and outcome measures (overall survival, duration of response, adverse outcome).
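To make the encoding step concrete, here is a minimal sketch of how one study record might be turned into numeric features. The field names, the category vocabularies, and the `encode_study` helper are all hypothetical illustrations. The actual schema and encoding used in this project are not described here.

```python
from typing import Any, Dict

# Hypothetical category vocabularies; the levels actually used are an assumption.
MASKING_LEVELS = ["NONE", "SINGLE", "DOUBLE", "TRIPLE", "QUADRUPLE"]
INTERVENTION_TYPES = ["PROCEDURE", "DEVICE", "BEHAVIORAL", "DRUG", "RADIATION", "BIOLOGICAL"]


def encode_study(study: Dict[str, Any]) -> Dict[str, float]:
    """Turn one study record (hypothetical field names) into numeric features."""
    features: Dict[str, float] = {}

    # Counts and plain numeric fields.
    features["num_conditions"] = float(len(study.get("conditions", [])))
    features["num_interventions"] = float(len(study.get("interventions", [])))
    features["num_locations"] = float(len(study.get("locations", [])))
    features["enrollment"] = float(study.get("enrollment", 0))

    # Boolean flags become 0/1.
    features["has_dmc"] = 1.0 if study.get("has_dmc") else 0.0
    features["healthy_volunteers"] = 1.0 if study.get("healthy_volunteers") else 0.0

    # One-hot encode the masking level (exactly one level applies per study).
    masking = study.get("masking", "NONE")
    for level in MASKING_LEVELS:
        features[f"masking_{level.lower()}"] = 1.0 if masking == level else 0.0

    # Multi-hot encode intervention types, since a study may combine several.
    present = {item["type"] for item in study.get("interventions", [])}
    for kind in INTERVENTION_TYPES:
        features[f"intervention_{kind.lower()}"] = 1.0 if kind in present else 0.0

    return features


example_study = {
    "conditions": ["Breast Cancer", "Metastatic Disease"],
    "interventions": [
        {"type": "DRUG", "name": "Drug A"},
        {"type": "RADIATION", "name": "Radiotherapy"},
    ],
    "locations": ["Site 1", "Site 2", "Site 3"],
    "enrollment": 450,
    "has_dmc": True,
    "healthy_volunteers": False,
    "masking": "DOUBLE",
}
encoded = encode_study(example_study)
```

Tree-based models such as the ones used in this project can consume a flat numeric vector like `encoded` directly, which is one reason one-hot and count features are a common choice for this kind of data.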
We also extracted new variables from the text data, such as the durations of primary and secondary outcome measurements and the numbers of inclusion and exclusion criteria. In addition, we incorporated 5-year survival rates for specific cancer types to improve the model's performance in predicting study durations.
Durations of primary and secondary outcome measures
Numbers of patient inclusion and exclusion criteria
5-year survival rates for categorical cancer types
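The text-derived variables listed above could be extracted along the following lines. This is a rough sketch under stated assumptions: the helpers `time_frame_to_days` and `count_criteria` are hypothetical, the eligibility text is assumed to use "Inclusion Criteria" and "Exclusion Criteria" headings followed by bulleted or numbered items, and months and years are converted with average lengths. The project's actual parsing rules may differ.

```python
import re
from typing import Optional, Tuple

# Approximate days per unit; month and year lengths are averages (assumption).
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30.44, "year": 365.25}
TIME_PATTERN = re.compile(r"(\d+(?:\.\d+)?)[\s-]*(day|week|month|year)s?", re.IGNORECASE)
ITEM_PATTERN = re.compile(r"^(?:[-*•]|\d+[.)])\s+\S")


def time_frame_to_days(text: str) -> Optional[float]:
    """Return the longest duration mentioned in a time-frame string, in days."""
    durations = [
        float(number) * DAYS_PER_UNIT[unit.lower()]
        for number, unit in TIME_PATTERN.findall(text)
    ]
    return max(durations) if durations else None


def count_criteria(eligibility: str) -> Tuple[int, int]:
    """Count list items under the inclusion and exclusion headings."""
    counts = {"inclusion": 0, "exclusion": 0}
    section = None
    for line in eligibility.splitlines():
        stripped = line.strip()
        lowered = stripped.lower()
        if lowered.startswith("inclusion criteria"):
            section = "inclusion"
        elif lowered.startswith("exclusion criteria"):
            section = "exclusion"
        elif section and ITEM_PATTERN.match(stripped):
            counts[section] += 1
    return counts["inclusion"], counts["exclusion"]


sample_eligibility = """Inclusion Criteria:
- Age 18 or older
- Histologically confirmed carcinoma
- ECOG performance status 0-1

Exclusion Criteria:
1. Pregnant or breastfeeding
2. Prior chemotherapy within 4 weeks
"""
n_inclusion, n_exclusion = count_criteria(sample_eligibility)
primary_days = time_frame_to_days("Up to 24 months")
```

Taking the longest duration in a time frame is a design choice for this sketch. Free-text time frames often mention several intervals, and the longest one is the most relevant lower bound on how long the outcome must be observed.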
Using these features, we trained Random Forest, LightGBM, and XGBoost classification models with binning strategies of 2, 3, 4, and 5 bins to forecast study durations. Our best-performing model achieved an accuracy of 0.803 and a precision of 0.805, which to our knowledge makes it the best publicly available model for forecasting study durations. This performance demonstrates the effectiveness of our feature engineering process and the robustness of our approach in handling complex clinical trial data. Below are sample model specs and metrics.
Study Phase: 3 | Training Set Size: 1,134 | Bins: 2
Study Phase: 2 | Training Set Size: 4,187 | Bins: 2
Study Phase: 1 | Training Set Size: 3,537 | Bins: 3
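The binning step that turns a continuous duration into a class label can be sketched as follows. This example assumes equal-frequency (quantile) bins, which is only one possible strategy. How the bin edges were actually chosen in this project is not stated, and the helper names `quantile_edges` and `assign_bin` are hypothetical.

```python
import bisect
import statistics
from typing import List, Sequence


def quantile_edges(durations: Sequence[float], n_bins: int) -> List[float]:
    """Return the n_bins - 1 cut points that split durations into equal-frequency bins."""
    if n_bins < 2:
        raise ValueError("n_bins must be at least 2")
    return statistics.quantiles(durations, n=n_bins, method="inclusive")


def assign_bin(duration: float, edges: Sequence[float]) -> int:
    """Return the 0-based bin index; a value equal to an edge falls in the lower bin."""
    return bisect.bisect_left(edges, duration)


# Illustrative study durations in days (made-up values, not project data).
durations_days = [200, 400, 600, 800, 1000, 1200]

# Binary target: below or above the median duration.
edges_2 = quantile_edges(durations_days, 2)
labels_2 = [assign_bin(d, edges_2) for d in durations_days]

# Three-class target: short, medium, long.
edges_3 = quantile_edges(durations_days, 3)
labels_3 = [assign_bin(d, edges_3) for d in durations_days]
```

The resulting labels would serve as the classification target for each binning strategy. In practice the edges should be computed on the training set only and then reused for validation and test data, to avoid leaking information.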
Our product takes information about your clinical trial, feeds it into our model, and returns a prediction of your study duration. You can enter the information through our user-friendly interface. Watch our demo for detailed guidance.
Our binary predictive model is more generalizable and accurate than the most recently published models. It predicted whether Phase 3 trials, which are the longest and most complex, would align with the average duration, achieving an accuracy of 0.803. This result highlights the robustness and reliability of our approach and its potential to improve the planning and management of clinical trials. By providing duration estimates early in planning, our model can help clinical research organizations streamline operations and optimize resource allocation.