Data sets provided to the Project Data Sphere cancer research platform include four key components:

Research protocol: 

Each clinical trial has a master plan called a protocol. This plan explains how the trial will be conducted and outlines the criteria by which patients are to be selected for the trial, the procedures and tests that patients will receive and the types of data that will be collected from the patients.

Annotated case report form (CRF):

A CRF is the paper or electronic instrument used to record patient data in a clinical trial. An “annotated” CRF indicates how the recorded information maps to the structure of the stored electronic patient data.

Data dictionary:

The data dictionary describes the electronic patient data field by field: which data tables each field can be found in, how the individual data tables are related to one another and further details about the data fields themselves.
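As a rough illustration of the kind of lookup a data dictionary supports, the sketch below models a dictionary as a list of field entries and answers two questions: which tables contain a given field, and which fields two tables share (and could therefore be joined on). The table names, field names and the `FieldEntry` structure are hypothetical and are not taken from any actual data set on the platform; real data dictionaries vary in format from provider to provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldEntry:
    table: str        # data table in which the field is stored
    name: str         # field (column) name
    dtype: str        # storage type, e.g. "text" or "integer"
    description: str  # human-readable meaning of the field


# Hypothetical entries for illustration only.
DATA_DICTIONARY = [
    FieldEntry("DEMOG", "PATID", "text", "Patient identifier"),
    FieldEntry("DEMOG", "AGE_YRS", "integer", "Age at enrollment, in years"),
    FieldEntry("LABS", "PATID", "text", "Patient identifier"),
    FieldEntry("LABS", "HGB", "float", "Hemoglobin, g/dL"),
]


def tables_containing(field_name):
    """Return the sorted names of all tables that store the given field."""
    return sorted({e.table for e in DATA_DICTIONARY if e.name == field_name})


def shared_fields(table_a, table_b):
    """Return the sorted field names present in both tables (candidate join keys)."""
    names_a = {e.name for e in DATA_DICTIONARY if e.table == table_a}
    names_b = {e.name for e in DATA_DICTIONARY if e.table == table_b}
    return sorted(names_a & names_b)


patid_tables = tables_containing("PATID")
join_keys = shared_fields("DEMOG", "LABS")
```

In this toy dictionary, the patient identifier appears in both tables, which is what relates the demographic records to the laboratory records.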

Patient-level data sets: 

The patient-level data sets represent the individual data points that have been captured for each patient. With a careful understanding of the research protocol, the annotated CRF and the data dictionary, data scientists can apply analytical tools to the patient-level data sets and discover new scientific insights. The patient-level data sets available within the Project Data Sphere cancer research platform can be investigated individually, or can be aggregated for more comprehensive investigation.

Although industry data standards such as CDISC SDTM and ADaM are now widely adopted, there may be considerable differences in how the data sets provided to the Project Data Sphere platform are structured. These differences arise from a variety of factors, including each provider organization’s interpretation of the standards, the maturity of the data standards at the time each trial was completed and whether the trial was considered for registration purposes. Through ongoing curation and standardization efforts, Project Data Sphere is continually working to increase the efficiency with which researchers can aggregate and investigate patient-level data sets that span data providers, research domains (industry and academia) and data standards eras.
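To make the aggregation problem concrete, here is a minimal sketch of pooling two trials whose demographic data use different column names. It is not the platform's own curation process. It assumes that a per-trial mapping to shared SDTM-style variable names (`STUDYID`, `USUBJID`, `AGE`, `SEX`) has already been worked out from each trial's data dictionary and annotated CRF. The trial identifiers, the legacy column names and the records are all hypothetical.

```python
# Hypothetical per-trial mappings from provider-specific column names to
# shared SDTM-style names. In practice these would be derived from each
# trial's data dictionary and annotated CRF.
COLUMN_MAPS = {
    "TRIAL_A": {"PATID": "USUBJID", "AGE_YRS": "AGE", "GENDER": "SEX"},
    "TRIAL_B": {"USUBJID": "USUBJID", "AGE": "AGE", "SEX": "SEX"},
}


def harmonize(study_id, rows):
    """Rename each row's columns to the shared names and tag it with its study."""
    mapping = COLUMN_MAPS[study_id]
    harmonized = []
    for row in rows:
        record = {"STUDYID": study_id}
        for source_name, value in row.items():
            if source_name in mapping:  # fields without an agreed mapping are dropped
                record[mapping[source_name]] = value
        harmonized.append(record)
    return harmonized


def aggregate(datasets):
    """Pool harmonized rows from several trials into one list of records."""
    pooled = []
    for study_id, rows in datasets.items():
        pooled.extend(harmonize(study_id, rows))
    return pooled


# Made-up records: trial A uses legacy column names, trial B uses SDTM-style names.
trial_a = [{"PATID": "A-001", "AGE_YRS": 64, "GENDER": "F", "SITE": "12"}]
trial_b = [{"USUBJID": "B-001", "AGE": 58, "SEX": "M"}]

pooled = aggregate({"TRIAL_A": trial_a, "TRIAL_B": trial_b})
```

Keeping the study identifier on every pooled record preserves the link back to the originating protocol, which matters because selection criteria and procedures differ between trials. A real harmonization would also need to reconcile coded values and units, which this sketch does not attempt.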