Kit Howard, owner and principal at Kestrel Consultants, explains how standardisation is key to ensuring data quality; however, it’s important that users fully understand all aspects of those standards.

Organisations are increasingly aware of the value of leveraging data across their enterprise. For this, they need ‘quality’ data: data that is ‘fit for use’ and can appropriately be used to answer their questions. These may be study questions, metrics questions or even questions never considered when writing the protocol. The data may come from studies, submissions or even repositories, and may be within or across therapy areas. Part of the solution to ensuring this quality is standards, but to be successful, users must understand all aspects of those standards.

There are three main facets of clinical data standards:

  1. technical: common variable names and formats, associated questions, dataset structures, derivations etc.
  2. content: what the variables hold; for instance, the answers to the questions, often using controlled terminology (code lists)
  3. process: business and scientific rules used in generating the data; for example, implementation, handling and cleaning rules reflecting assumptions about the data and how to store it.

The first two are fairly straightforward, and reflect most definitions of standards. The third is more complex, poses much higher potential risks to organisations, and is best understood using an example (see figure).

The figure imagines three studies. Study one decides to start collecting adverse events (AEs) and serious adverse events (SAEs) when the informed consent (IC) is signed. Study two starts capturing SAEs at IC and AEs on day one (first day of treatment). Study three starts collecting both on day one. Each study by itself is fine, but when all three are added to a data repository and someone looks for the incidence of AEs in the IC-day one period, they may not see all the AEs because study two didn’t record AEs in that baseline period. Worse, the user does not know that.
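The three-study scenario can be sketched in a few lines of Python. This is only an illustration: the study identifiers, the `CollectionRule` structure and the `studies_covering_baseline` helper are hypothetical names invented here, not part of any standard. The sketch shows that, once the collection rules are recorded explicitly, a repository user can tell which studies actually contribute AE data for the IC-day one period.

```python
from dataclasses import dataclass

# Hypothetical labels for when a study begins collecting a given event type.
INFORMED_CONSENT = "IC"
DAY_ONE = "DAY1"


@dataclass(frozen=True)
class CollectionRule:
    study: str
    ae_start: str   # when non-serious AE collection begins
    sae_start: str  # when SAE collection begins


# The three studies described above (identifiers are illustrative).
RULES = [
    CollectionRule("study_1", ae_start=INFORMED_CONSENT, sae_start=INFORMED_CONSENT),
    CollectionRule("study_2", ae_start=DAY_ONE, sae_start=INFORMED_CONSENT),
    CollectionRule("study_3", ae_start=DAY_ONE, sae_start=DAY_ONE),
]


def studies_covering_baseline(rules, event_type):
    """Return the studies that collected `event_type` between IC and day one."""
    attr = {"AE": "ae_start", "SAE": "sae_start"}[event_type]
    return [r.study for r in rules if getattr(r, attr) == INFORMED_CONSENT]


print(studies_covering_baseline(RULES, "AE"))   # ['study_1']
print(studies_covering_baseline(RULES, "SAE"))  # ['study_1', 'study_2']
```

Without rules like these alongside the data, an empty baseline period in study two or three looks the same as "no events occurred", when it really means "not collected".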

This example illustrates a process decision typically made on a study-by-study basis, and there is rarely a mechanism or incentive to pass this information to the next study team. The resulting differences may appear between sites in a study (for example, surveys completed at different sites) or, more commonly, between studies in a submission. The biggest impact is between studies in a repository, and it is compounded by the time the data reach Janus, the FDA’s data repository, which holds studies from different companies.

Process consistency

Many of these processes are defined in the protocol, monitoring guidelines, CRF completion guidelines or data management plans, but these rarely remain available after the study completes. Even when protocols are accessible, searching 200 PDF files to find process rules is not feasible. There are potentially dozens to hundreds of such rules for each study, and the greatest risk is that users are unaware of them and cannot judge the quality of the data for the desired uses.

The solution can’t be to make everyone follow the same processes, as there is too much legitimate variability in research. A mechanism must be developed to identify these processes and document the decisions so that they accompany the data throughout its lifecycle.
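One way to picture such a mechanism is a small, machine-readable record of each process decision that is bundled with the data it describes. The following is a minimal sketch under stated assumptions: the `ProcessDecision` fields, the `package` and `decisions_for` helpers, and the sample record are all invented for illustration and do not correspond to an existing standard or tool.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ProcessDecision:
    domain: str    # data domain the rule applies to, e.g. "AE"
    topic: str     # what the rule governs
    decision: str  # the choice the study team made
    source: str    # document in which the rule was defined


def package(study_id, records, decisions):
    """Bundle data records with the process decisions that shaped them."""
    return {
        "study_id": study_id,
        "records": records,
        "process_decisions": [asdict(d) for d in decisions],
    }


def decisions_for(bundle, domain):
    """Look up the documented decisions for one domain in a bundle."""
    return [d for d in bundle["process_decisions"] if d["domain"] == domain]


# Illustrative data only: one made-up AE record for the hypothetical study two.
bundle = package(
    "study_2",
    records=[{"domain": "AE", "term": "headache", "study_day": 3}],
    decisions=[
        ProcessDecision("AE", "collection start (non-serious)", "day one", "protocol"),
        ProcessDecision("AE", "collection start (serious)", "informed consent", "protocol"),
    ],
)

# Round-trip through JSON to show the decisions survive serialisation.
restored = json.loads(json.dumps(bundle))
print(decisions_for(restored, "AE")[0]["decision"])  # day one
```

Because the decisions are stored with the records rather than in a separate PDF, a later user can query them directly instead of searching the original study documents.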

Within an organisation, the issue can be partly handled using centralised cross-functional standards called ‘Data Lifecycle Plans®’ (DLPs) that are organised by domain (AEs, demographics) and define all data-related requirements for each function – including protocol, statistical plan, study report, database, tables/listings/graphs and CRFs. DLPs integrate data requirements from protocol to submission, in the same way that many organisations integrate CRFs and databases, and provide an excellent framework for identifying and documenting the related process decisions.

This can address the process definition issue within an organisation, but the problem is more challenging across companies. Without common recognition of this issue, there is little incentive to address it. Perhaps this can eventually be an add-on to the CDISC electronic data submission standards, or have standards of its own. Regardless of how it could be done, until the process assumptions made during data generation, capture and storage travel with the data, repositories will lack a critical component of their quality definition.