The Health Informatics Trial Enhancement Project (HITE): Using routinely collected primary care data to identify potential participants for a depression trial

Background Recruitment to clinical trials can be challenging. We identified anonymous potential participants to an existing pragmatic randomised controlled depression trial to assess the feasibility of using routinely collected data to identify potential trial participants. We discuss the strengths and limitations of this approach, assess its potential value, and report the challenges and ethical issues encountered. Methods Swansea University's Health Information Research Unit's Secure Anonymised Information Linkage (SAIL) database of routinely collected health records was interrogated using Structured Query Language (SQL). Read codes were used to create an algorithm of inclusion/exclusion criteria with which to identify suitable anonymous participants. Two independent clinicians rated the eligibility of the potential participants identified. Inter-rater reliability was assessed using the kappa statistic and the intra-class correlation coefficient. Results The study population (N = 37263) comprised all adults registered at five general practices in Swansea, UK. Using the algorithm, 867 anonymous potential participants were identified. Sensitivity and specificity values > 0.9 suggested a high degree of accuracy from the algorithm. The inter-rater reliability results indicated strong agreement between the confirming raters: the intra-class correlation coefficient (Cronbach's alpha) was > 0.9, suggesting excellent agreement, and the kappa coefficient was > 0.8, suggesting almost perfect agreement. Conclusions This proof of concept study showed that routinely collected primary care data can be used to identify potential participants for a pragmatic randomised controlled trial of folate augmentation of antidepressant therapy for the treatment of depression. Further work will be needed to assess generalisability to other conditions and settings and the inclusion of this approach to support Electronic Enhanced Recruitment (EER).


Background
Recruitment to clinical trials in primary care can be challenging [1]. Recent papers in Trials have reported a variety of strategies to improve trial recruitment [2][3][4]. Drawing on the expanding field of health informatics, we report on a strategy to identify potential trial participants using routinely collected anonymised data that complements other approaches to this question [5]. Virtually all general practices in the UK hold patient medical records in electronic format. This level of computerisation is in line with the NHS 1998 Information for Health Strategy's goal of full implementation of person-based Electronic Health Records (EHRs) at the primary care level by 2005. Routinely collected data are recorded in both narrative and structured formats. In the structured format, data are presented in codes. The coding system adopted by the Department of Health for general practice is the Read Terminology [6], although plans are underway to migrate to Systematised Nomenclature of Medicine - Clinical Terms (SNOMED CT) [6], which has been selected as the standard terminology scheme for the NHS Care Records Service and for the National Programme for IT and will eventually replace the current Clinical (Read) codes. Large volumes of routinely collected data held in electronic format are becoming increasingly available. Improvements in data quality, together with technological advances and growing expertise in retrieving, transporting, storing, linking and analysing these data, are leading to health informatics emerging as a field rich with potential for research [7].
Most randomised studies in general practice use conventional methods for patient selection, recruitment and data collection. One conventional method is General Practitioner (GP) referral to research trials. Doctors normally recruit when patients present themselves at appointments. However, a limitation of this method is that no referral will be made for patients who do not attend their appointments, so little can be said about the generalisability of the data. Other recruitment strategies include manual searches through patient records, or database searches using diagnostic criteria, to select and recruit patients, followed by sending out participant information sheets. Further strategies include use of multi-media, such as the internet [5], newsletters and also mail shots. All searches for potential participants that involve disclosure of identifiable information (without patients' consent) are undertaken by the patients' direct healthcare team. However, if there is no other practicable alternative to conducting the research, an application can be made to the National Information Governance Board (NIGB). NIGB oversees applications for the common law duty of confidentiality to be set aside in specific circumstances, in accordance with Section 251 of the NHS Act [8].
Many trials in primary care fail to achieve satisfactory levels of recruitment. Difficulties with achieving the target recruitment populations within fixed timeframes were observed as common problems [9]. A number of barriers to clinician participation have been identified, including time constraints, lack of staff and training, and concern about the impact on the doctor-patient relationship [10,11]. In addition, barriers to GP referrals in depression trials have included the unsuitability of the content and style of depression consultations and the perceived intrusiveness of introducing research into a complex consultation [12]. It seems that the demands on patients and clinicians need to be kept to a minimum [10].
Routine data may overcome some of these issues. It may eliminate the need for doctors to identify suitable patients when they attend the practice. The significant advantage is that larger numbers of suitable patients can be identified by this method in a shorter period of time, thus maximising recruitment and minimising costs. However, it should be noted that routine data requires validation, which needs to be factored into the resource and economic planning.
The Health Information Research Unit (HIRU) [13], based in the School of Medicine, Swansea University, has been formed to harness the potential of routinely collected data. HIRU has established the Secure Anonymised Information Linkage (SAIL) database, which is a vast data repository of anonymised person-level data, as provided by an expanding group of Data Providers [14]. In total so far, around 700 million records pertaining to health and social care events have been loaded into the SAIL databank. HIRU, in conjunction with Health Solution Wales, UK (HSW), has developed a robust anonymisation system to ensure confidentiality whilst making the data available for research [15,16].
The purpose of this study was to construct a methodology for identifying potential participants for a trial using the routinely collected data stored in the SAIL databank and to determine if the methodology could correctly identify potential participants for a clinical trial. The trial identified for this project is the FolATED study, which is a pragmatic randomised controlled trial of folate augmentation of antidepressant therapy in the treatment of depression. It is currently being conducted in Wales, UK [17].

Aim
To determine whether anonymised routine data can be used to accurately identify the numbers of eligible patients suitable for recruitment to an existing randomised controlled trial (RCT).

Objectives
• To construct an algorithm to identify suitable participants for a clinical trial using routinely collected, anonymised primary care data stored in the SAIL databank.
• To carry out a validation exercise to establish whether the algorithm could correctly identify potential participants.

Algorithm construction
The FolATED RCT inclusion and exclusion criteria and the timeframes for each criterion (see Tables 1 & 2) were used as the basis to determine whether appropriate patients could be identified within the SAIL databank. These clinical criteria were translated into codified controlled measures (Read codes Version 2; see Additional File 1). The NHS Read Terminology Version 2 (5-byte) browser was used to identify appropriate Read codes. To ensure that all exclusion criteria had been taken into account, the new GP Contract Quality and Outcomes Framework (QOF) Version 10 Indicator Sets were used as an additional tool to identify codes for depression, learning disability, psychoses, lithium prescription, palliative care and cancer. The British National Formulary (BNF) Version 54 [18] was checked for criteria relating to drugs and medicines, although all the relevant Read codes were identified from the NHS browser, both by generic and brand names.

Inclusion and Exclusion Criteria
Inclusion and exclusion criteria are summarised in Table 3. From the five practices we included all persons aged 18 and over. We then used recent antidepressant therapy (within the last three months) as a proxy measure to identify a large enough sample of people who were currently suffering from depression. We then applied a diagnostic criterion of a diagnosis of moderate to severe depression within their medical history recorded on the SAIL databank (1993-2007). This latter criterion was intended to increase specificity and to reduce the possibility of capturing patients who suffered from mild depression, senile dementia with depression and other medical conditions. Finally, we applied the FolATED study exclusion criteria to the algorithm in order to remove ineligible patients from the sample. Meeting any one of these exclusion criteria led to exclusion.

Data analysis
A database query in Structured Query Language (SQL) was constructed using the identified Read codes. The algorithm was run against the General Practice Database (GPD) within SAIL.
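The shape of such a query can be illustrated with a minimal sketch. This is not the study's actual SQL: the table and column names (`patient`, `gp_event`, `alf`, `code`, `event_date`), the placeholder codes (which are not real Read codes) and the use of an in-memory SQLite database are all assumptions made for illustration, and the exclusion criteria are simplified to "ever recorded" rather than using the per-criterion timeframes.

```python
import sqlite3

# Hypothetical placeholder codes; these are NOT real Read codes.
ANTIDEPRESSANT = "ANTIDEP"
DEPRESSION = "DEP_MODSEV"
EXCLUSIONS = ("EXCL_PSYCHOSIS", "EXCL_LITHIUM")

# Last date of practice data reported in the text (31 August 2007).
REFERENCE_DATE = "2007-08-31"

QUERY = """
SELECT p.alf
FROM patient AS p
WHERE p.age >= 18                                   -- adults only
  AND EXISTS (                                      -- recent antidepressant
        SELECT 1 FROM gp_event AS e
        WHERE e.alf = p.alf AND e.code = :antidep
          AND e.event_date > date(:ref, '-3 months')
          AND e.event_date <= :ref)
  AND EXISTS (                                      -- depression in history
        SELECT 1 FROM gp_event AS e
        WHERE e.alf = p.alf AND e.code = :dep
          AND e.event_date <= :ref)
  AND NOT EXISTS (                                  -- any exclusion code
        SELECT 1 FROM gp_event AS e
        WHERE e.alf = p.alf AND e.code IN (:excl1, :excl2))
ORDER BY p.alf
"""


def build_demo_db():
    """Create an in-memory database holding five invented patients."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        "CREATE TABLE patient (alf TEXT PRIMARY KEY, age INTEGER);"
        "CREATE TABLE gp_event (alf TEXT, code TEXT, event_date TEXT);"
    )
    conn.executemany(
        "INSERT INTO patient VALUES (?, ?)",
        [("A", 45), ("B", 17), ("C", 60), ("D", 30), ("E", 52)],
    )
    conn.executemany(
        "INSERT INTO gp_event VALUES (?, ?, ?)",
        [
            ("A", ANTIDEPRESSANT, "2007-07-15"), ("A", DEPRESSION, "2005-03-01"),
            ("B", ANTIDEPRESSANT, "2007-07-15"), ("B", DEPRESSION, "2006-03-01"),
            ("C", ANTIDEPRESSANT, "2007-01-10"), ("C", DEPRESSION, "2004-09-09"),
            ("D", ANTIDEPRESSANT, "2007-08-01"),
            ("E", ANTIDEPRESSANT, "2007-06-20"), ("E", DEPRESSION, "2001-02-02"),
            ("E", "EXCL_LITHIUM", "2006-11-30"),
        ],
    )
    return conn


def identify_candidates(conn, ref=REFERENCE_DATE):
    """Return the anonymised identifiers that pass every criterion."""
    params = {"ref": ref, "antidep": ANTIDEPRESSANT, "dep": DEPRESSION,
              "excl1": EXCLUSIONS[0], "excl2": EXCLUSIONS[1]}
    return [row[0] for row in conn.execute(QUERY, params)]


candidates = identify_candidates(build_demo_db())
print(candidates)
```

In this toy data only patient A passes: B is under 18, C has no antidepressant in the three months before the reference date, D has no depression diagnosis, and E carries an exclusion code.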

Validation
Two samples were identified in order to check the validity of this methodology for selecting potential participants. A 10% random sample of the eligible patients was chosen, and a second sample of eighty ineligible patients was selected from the main SAIL database. The two samples were combined to form one dataset. Two independent mental health clinicians were given secure access to the anonymised health records of these selected patients and rated their eligibility for the trial. The sensitivity and specificity of the method of routine data capture to select eligible patients were calculated using clinical judgement of diagnosis as the 'gold standard'.

Statistical analysis
Sensitivity and specificity were calculated to measure the accuracy of the results from the validation exercise, comparing the algorithm decision with the clinical judgement. The intra-class correlation coefficient and the kappa statistic were calculated to measure the agreement between the two clinicians. Statistical analyses were performed using SPSS version 13 [19].
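For readers unfamiliar with these measures, the following minimal sketch shows how sensitivity, specificity and Cohen's kappa are computed from binary ratings. It is illustrative only: the function names are hypothetical, the toy ratings are invented and are not study data, and the study itself used SPSS. The intra-class correlation coefficient is not shown.

```python
def sensitivity_specificity(algorithm, gold):
    """Compare binary algorithm decisions with a binary gold standard.

    Both arguments are equal-length sequences of 0/1, where 1 = eligible.
    Assumes the gold standard contains both eligible and ineligible cases.
    """
    pairs = list(zip(algorithm, gold))
    tp = sum(1 for a, g in pairs if a == 1 and g == 1)
    fn = sum(1 for a, g in pairs if a == 0 and g == 1)
    tn = sum(1 for a, g in pairs if a == 0 and g == 0)
    fp = sum(1 for a, g in pairs if a == 1 and g == 0)
    return tp / (tp + fn), tn / (tn + fp)


def cohen_kappa(rater1, rater2):
    """Cohen's kappa for two raters giving binary (0/1) ratings."""
    n = len(rater1)
    observed = sum(1 for a, b in zip(rater1, rater2) if a == b) / n
    p1 = sum(rater1) / n  # proportion rated eligible by rater 1
    p2 = sum(rater2) / n  # proportion rated eligible by rater 2
    expected = p1 * p2 + (1 - p1) * (1 - p2)  # chance agreement
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Invented toy ratings (NOT study data): 1 = eligible, 0 = ineligible.
algorithm_decision = [1, 1, 1, 1, 0, 0, 0, 0]
clinician_a = [1, 1, 1, 0, 0, 0, 0, 1]
clinician_b = [1, 1, 1, 0, 0, 0, 0, 0]

sens, spec = sensitivity_specificity(algorithm_decision, clinician_a)
kappa = cohen_kappa(clinician_a, clinician_b)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} kappa={kappa:.2f}")
```

Kappa corrects the raw proportion of agreement for the agreement expected by chance, which is why it is reported alongside, rather than instead of, simple agreement.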

Data anonymisation
HIRU has a protocol in place with HSW to ensure that all data is anonymised. This has been achieved through the split file approach to data management. The demographic data is separated from the clinical data by the source organisation, and a system linking field is used to ensure that the data can be rejoined later. The demographic data is sent to HSW and the clinical data is sent to HIRU. HSW uses encryption technology for pseudonymisation, replacing the personal data in each record with an Anonymous Linking Field (ALF). This product is then transferred to HIRU, where it is joined to the clinical data via the system linking field. As a final safeguard HIRU further encrypts the ALF, thus ensuring that no single organisation can decrypt the records. This split file method ensures that anonymity and confidentiality are maintained while retaining the facility of data linkage at the individual level. The data is then ready for research applications [16]. Only the source organisation (i.e. the treating physician) has access to both personal and clinical data. The data is provided to the SAIL database on the grounds that it is never deanonymised, so patient records can never be traced back to individual patients.
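The split file flow described above can be sketched in a few lines. This is a conceptual illustration under stated assumptions, not the HIRU/HSW implementation: the text does not specify the encryption technology, so a keyed hash (HMAC-SHA256) is swapped in as a stand-in, and all function names, field names, keys and records are hypothetical.

```python
import hashlib
import hmac


def keyed_digest(key: bytes, value: str) -> str:
    """Stand-in for the unspecified encryption step (assumption: HMAC-SHA256)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def split_at_source(records):
    """Source organisation: split each record and tag both halves with a linking field."""
    demographic = [{"link": i, "name": r["name"], "dob": r["dob"]}
                   for i, r in enumerate(records)]
    clinical = [{"link": i, "code": r["code"]} for i, r in enumerate(records)]
    return demographic, clinical


def pseudonymise(demographic, key):
    """Trusted third party (HSW role): replace personal data with an ALF."""
    return [{"link": d["link"],
             "alf": keyed_digest(key, d["name"] + "|" + d["dob"])}
            for d in demographic]


def join_and_reencrypt(alf_file, clinical, key):
    """Research unit (HIRU role): join on the linking field, re-encrypt the ALF."""
    alf_by_link = {a["link"]: a["alf"] for a in alf_file}
    return [{"alf_e": keyed_digest(key, alf_by_link[c["link"]]),
             "code": c["code"]}
            for c in clinical]


# Invented records: the first two belong to the same (fictional) person.
source_records = [
    {"name": "Person One", "dob": "1960-01-01", "code": "CODE_X"},
    {"name": "Person One", "dob": "1960-01-01", "code": "CODE_Y"},
    {"name": "Person Two", "dob": "1975-05-05", "code": "CODE_X"},
]

demographic_file, clinical_file = split_at_source(source_records)
alf_file = pseudonymise(demographic_file, key=b"hypothetical-hsw-key")
research_rows = join_and_reencrypt(alf_file, clinical_file,
                                   key=b"hypothetical-hiru-key")
print(research_rows)
```

In this sketch the research rows carry no personal fields, records for the same person share the same re-encrypted identifier (so linkage is preserved), and neither key holder alone can map that identifier back to the demographic data.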

Ethics
The work of the SAIL databank is conducted in strict accordance with a suite of data management policies which take account of the Data Protection Act (1998) [20], the Principles of the Caldicott report (1997)

Identifying potential participants in SAIL
The practice data on the SAIL databank at the time of analysis included data up to 31 August 2007. From the sample population of 37,263 patients aged 18 and over, 2650 (7.11%) were identified as having been prescribed antidepressants within the last three months. Of these 2650 patients, 1247 had a diagnosis of moderate to severe depression within their medical history recorded on the SAIL databank (since ~1993). After application of all the inclusion and exclusion criteria, 867 potential trial participants for the FolATED study were identified (Table 1).
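The selection funnel can be checked with simple arithmetic. The sketch below uses only the counts reported above; the step labels and the helper name `funnel_rows` are illustrative.

```python
# Counts as reported in the text; labels are shorthand for each selection step.
FUNNEL = [
    ("Adults registered at the five practices", 37263),
    ("Antidepressant prescribed in last three months", 2650),
    ("Moderate to severe depression in history", 1247),
    ("Remaining after all exclusion criteria", 867),
]


def funnel_rows(steps):
    """Express each step as a percentage of the population and of the prior step."""
    population = steps[0][1]
    rows = []
    previous = population
    for label, count in steps:
        rows.append({
            "label": label,
            "count": count,
            "pct_of_population": 100 * count / population,
            "pct_of_previous": 100 * count / previous,
        })
        previous = count
    return rows


rows = funnel_rows(FUNNEL)
for row in rows:
    print(f"{row['label']}: {row['count']} "
          f"({row['pct_of_population']:.2f}% of population, "
          f"{row['pct_of_previous']:.1f}% of previous step)")
```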
Validation
Eighty-eight cases identified using the algorithm and 80 randomly selected individuals who did not meet the inclusion criteria were drawn from the SAIL databank. The accuracy of the algorithm was compared against the judgment of two psychiatrists (PC and JC), who independently rated all 168 records for evidence of a diagnosis of current depression. Table 2 shows the sensitivity and specificity of the algorithm against the clinical gold standard.

Discussion
The algorithm for identifying suitable participants for the FolATED study appears to be valid, based on the clinical judgment of the raters. The sensitivity and specificity results suggested a high degree of accuracy (≥ 80%) from the algorithm. Although some minor methodological issues were encountered, we have demonstrated that it is possible to identify anonymous potential trial participants using routinely collected primary care data.

Limitations of the proposed method
A system based on anonymised data cannot be applied directly to recruitment, because the data housed in SAIL can never be deanonymised.
The method we are exploring is therefore a two-part process. The first part is to create, test and validate an algorithm that identifies suitable participants using the anonymised data in SAIL. The second is to make this algorithm available on a live, practice-based computer facility (such as Audit+ [26]), where a physician can run the query themselves and generate a list of suitable named participants within the practice with minimal time or effort. This should reduce GPs' workload, with the potential of maximising recruitment. The method ensures confidentiality of personal data, as the identification and recruitment process remains within the practices. This process itself requires validation. Furthermore, missing or implausible data values in the electronic records cannot be corrected, as it is not possible to identify the patient.
In addition, the researcher needed to seek clinical expertise to identify appropriate Read codes. For example, medical advice was sought as to whether to include Read codes relating to post viral depression and pre-senile dementia with depression in the algorithm.

Limitations of routinely collected data
There are also a number of general limitations to the use of routinely collected data. The accuracy of using proxy measures needs to be evaluated. Lack of linkage between diagnosis and therapy makes the use of proxy measures unreliable. This issue is not limited to this methodology but applies to live database searches too. In this study, recent antidepressant therapy was used as a proxy measure for depression to try to capture patients who were currently depressed, because for an ongoing condition the diagnosis may not be recorded as frequently as the treatment prescribed. The use of antidepressants as a proxy measure for depression is unreliable because disorders cannot be linked to specific interventions, i.e. drugs [27]. An attempt to counter this was made by selecting people who had a diagnosis of moderate to severe depression in their medical history; however, there was no way of knowing whether their current antidepressant therapy was related to that diagnosis. Antidepressant therapy may have been prescribed for other conditions, such as anxiety disorders, attention deficit disorder or dementia. It would be useful if there were a standardised 'problem number' field in all primary care data entry systems that linked the prescription to the diagnosis. The Meditel system has this field [28]. A particular challenge is establishing the end date of an episode of depression and whether or not the patient is in remission. The codes that might assist in identifying this, such as depression resolved, medication stopped and medication changed, may be infrequently employed and therefore cannot be relied upon as accurate measures in themselves.
Routinely collected data are captured for administrative reasons rather than for research purposes. To be fit for research purposes the validity, accuracy and completeness of the routine data itself need to be considered. Although studies have reported that routinely collected diagnostic data held on general practice information systems are accurate and reliable for research purposes [29][30][31], there is always room for initiatives to standardise systems and to improve data quality in primary care [32].
The purpose of this study was to use anonymised data to model a new method of identifying suitable participants from routinely collected data, one that would make it easier for practices to identify potential subjects for a clinical trial and consequently reduce their workload, whilst potentially maximising recruitment and reducing costs. In the future we will seek to test this algorithm on clinical data sets within primary care settings. The algorithm that was created in this study successfully identified suitable anonymous participants for the trial within the SAIL environment. However, the data within SAIL can never be deanonymised. The next phase is therefore a pilot project to translate the algorithm from anonymised SAIL data to live clinical systems, where the individual physician can generate a list of potential identifiable participants with minimal time and effort. The method ensures confidentiality of personal data, as the identification and recruitment process remains within the practices.

Conclusions
Routinely collected, digitally stored clinical data from primary care can be used as a means of selecting anonymous possible participants for a trial of folate augmentation of antidepressant therapy. Future work is required to run this algorithm on patient identifiable systems within the primary care practice setting and then to compare this method with the traditional non-electronic method of participant identification for recruitment, in terms of numbers recruited, time, cost and reliability.