Evaluating the re-identification risk of a clinical study report anonymized under EMA Policy 0070 and Health Canada Regulations

Background: Regulatory agencies, such as the European Medicines Agency and Health Canada, are requiring the public sharing of the clinical trial reports that are used to make drug approval decisions. Both agencies have provided guidance for the quantitative anonymization of these clinical reports before they are shared. There is limited empirical information on the effectiveness of this approach in protecting patient privacy for clinical trial data.
Methods: In this paper we empirically test the hypothesis that these guidelines, when implemented in practice, provide adequate privacy protection to patients. An anonymized clinical study report for a trial of a non-steroidal anti-inflammatory drug sold as a prescription eye drop was subjected to a re-identification attack. The target was 500 patients in the USA. Only suspected matches to real identities were reported.
Results: Six suspected matches, all with low confidence scores, were identified. Each suspected match took 24.2 h of effort. Social media and death records provided the most useful information for arriving at the suspected matches.
Conclusions: These results suggest that the anonymization guidance from these agencies can provide adequate privacy protection for patients, and that the modes of attack can inform further refinements of the methodologies recommended in their guidance for manufacturers.

The second case was commissioned by the Department of Health and Human Services in the US to determine the re-identification risk of data de-identified using the HIPAA Safe Harbor standard [4], [5].
The researchers used a patient data set that consisted of 15,000 Safe Harbor de-identified admission records from a regional hospital and matched these against a marketing data set of 30,000 records with similar attributes purchased from InfoUSA, a marketing research firm. All of the records were from individuals who self-identified as Hispanic and the variables that were used to match records included age or year of birth, sex, the first 3 digits of ZIP codes, and marital status. The best results were obtained when matching on age (vs. year of birth) and other attributes. They found 20 patient records that matched to 22 records in the marketing data set, with 2 of these confirmed to be "true" identity matches. That is equivalent to a re-identification rate of 0.013% [4].
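The reported rate can be checked directly from the counts given above; a minimal sketch, where the re-identification rate is taken as confirmed true matches over the number of de-identified records in the target data set:

```python
# Counts reported for the Safe Harbor study [4]
deidentified_records = 15_000   # de-identified admission records (target data set)
suspected_matches = 20          # patient records matched to the marketing data
confirmed_matches = 2           # matches verified as "true" identity matches

# Re-identification rate: confirmed matches over target records
rate = confirmed_matches / deidentified_records
print(f"Re-identification rate: {rate:.3%}")  # prints 0.013%
```

Note that only 2 of the 20 suspected matches were true, so the verification step materially lowers the measured risk relative to the raw match count.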
Elliot undertook a similar study in 2007, linking the Sample of Anonymised Records (SARs) from the 2001 UK census to the microdata from the Spring 2001 UK Labour Force Survey (LFS) [6]. The focus of this test was to assess whether the statistical disclosure control (SDC) methods applied to the 2001 SARs protected against the risk of re-identification posed by linkage with external datasets. All unique matches found between the datasets were sent to the Office for National Statistics (ONS) for verification. Between the released SARs file and the LFS, 3130 matches were found; however, no name or address could be found for many of these matches, reducing the number that could be verified to 2234. Of these, 51 were verified to be correct re-identifications (2.28%). When using a fishing method of attack to target high-risk records in the data, Elliot had a similarly low level of success in re-identifying individuals. He concludes that "the SDC that was employed on the 2001 SARs appears to seriously undermine intrusion attempts" [6].
In a 2011 study, Elliot undertook a disclosure risk analysis of the Supporting People dataset disseminated by the Department for Communities and Local Government (DCLG) [7]. His analysis examined "cross matching attacks" and "response knowledge based attacks" [7]. Similar to the previous study, the cross matching attack used a large data set of identifiable information to match and re-identify records in a de-identified data set. For this type of attack, Elliot aimed to uncover "the probability of a correct match given a unique match" [7]: in other words, the probability that a cross match is a "true" match that actually identifies an individual in the data set. For this data set, he found that the probability was very low: 0.0177. The response knowledge based attack, however, increased the risk considerably (to 12.18%) because of the high number of unique records in the data set. In this type of attack, an intruder knows that a particular individual is included in the data set; if that individual is unique in the data set, they can easily be identified.
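Elliot's cross matching measure is a conditional probability, P(correct match | unique match). A minimal sketch of the calculation; the counts below are hypothetical, chosen only so that the ratio reproduces the reported 0.0177 (the actual counts are not given above):

```python
def p_correct_given_unique(correct_matches: int, unique_matches: int) -> float:
    """Cross matching risk measure: P(correct match | unique match)."""
    return correct_matches / unique_matches

# Hypothetical counts that reproduce the published figure of 0.0177
p = p_correct_given_unique(correct_matches=177, unique_matches=10_000)
print(f"{p:.4f}")  # prints 0.0177
```

By contrast, the response knowledge figure (12.18%) is driven by the proportion of records that are unique in the data set, since an intruder who knows someone is in the data only needs that person's record to be unique.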
In another study [8], Elliot and his colleagues simulated an attack by an intruder who has response knowledge about an individual in the data set, focusing on samples from two social service databases: the UK Labour Force Survey (LFS) and the Living Costs and Food Survey (LCF). They used web-based information and a commercial database to re-identify 50 sampled records from each survey. For the LFS, they were able to correctly match 6 of the 50 sampled records (12%) using web-based information alone and 14 (28%) when the commercial data was used as well [8]. The LCF included an Output Area Classifier (OAC) which was not included in the LFS, and which can be derived from a postal code by someone with the knowledge of how to do so. Because this conversion is not obvious, the researchers tested matching both with and without the OAC. Without the OAC, they matched 20 records to 8 addresses, 2 of which were verified as true matches by the Office for National Statistics. With the OAC, they matched 42 records to 27 addresses, 18 of which were confirmed to be true matches. In this case, more specific location information led to a greater number of individuals being re-identified.
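The effect of adding the OAC can be summarized as match precision, i.e. verified matches over matched addresses; a short sketch using the counts reported above:

```python
# Verification counts from the LCF experiment [8]
without_oac = {"matched_addresses": 8, "verified": 2}
with_oac = {"matched_addresses": 27, "verified": 18}

for label, result in (("without OAC", without_oac), ("with OAC", with_oac)):
    # Precision: fraction of matched addresses confirmed as true matches
    precision = result["verified"] / result["matched_addresses"]
    print(f"{label}: {precision:.0%} of matched addresses verified")
# prints:
#   without OAC: 25% of matched addresses verified
#   with OAC: 67% of matched addresses verified
```

Adding the derived location variable thus improved both the yield (18 vs. 2 true matches) and the precision of the attack, which is the sense in which "more specific location information led to a greater number of individuals being re-identified."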
Tudor, Spicer and Cornish [9], [10] conducted an intruder test on pre-publication census data in the UK to examine the potential re-identification risk associated with the data release. The data they targeted was tabular in nature, consisting of 89 tables that were determined to be potentially high risk, containing varying numbers of fields and more or less specificity in terms of location. The goal of the disclosure control techniques used for the 2011 Census was to create "sufficient uncertainty" as to the identity of any individual re-identified, and targeted record swapping was chosen as the primary method of creating such uncertainty for tabular data. Not knowing which records have been swapped, an intruder in this case can never be sure if a re-identification results in a true disclosure. The authors recruited "intruders" from within the statistical agency to attempt to re-identify the data using only public information found on the web. They asked the volunteers to examine different re-identification scenarios, such as [9], [10]:
I. Can they identify themselves or their household?
II. Can they identify someone they know, either individually or within a group, and their characteristics?
III. Starting with public information, can they then identify someone, or a group of people, in a table (and learn more than the public information)?
IV. Starting with the census tables, can they identify a person or a group of people, and link this to some public information?
Eighteen intruders were recruited and there were more than 50 claims of identity and/or attribute disclosure from the group. The volunteers also noted the level of confidence they had in the correctness of their claims (from "Not at all confident" to "Very confident"). Researchers found that the volunteers claimed to identify only people about whom they had personal knowledge (save for 1 claim), and the claims were most accurate when they were about a family member or someone living in the same household as the intruder. In terms of confidence, the claims that testers had greater confidence in were more likely to be correct; however, there were more correct claims at the "Reasonably confident" level than at the "Very confident" level [9], [10]. In the end, the majority of the claims were found to be incorrect, leading to the conclusion that "it is very difficult to re-identify respondents correctly in the 2011 UK Census and moreover, it is virtually impossible in this case to identify anyone correctly without any personal knowledge about them." [9], [10]

The UK Department of Energy and Climate Change underwent intruder testing prior to the release of the National Energy Efficiency Data (NEED) in 2014 [11]. The anonymized NEED release consisted of 2 datasets: a public use file (PUF) of 50,000 records and an end user license file of 4 million records. Government analysts from the department's IT sector were recruited to act as motivated intruders, using publicly available information in conjunction with their own knowledge to attempt to re-identify records in NEED. In almost every case, energy performance certificate (EPC) information was used to try to identify households, as EPC data is both publicly available and includes dates. As a result, the EPC variables were determined to be high risk and were aggregated or removed in the final data release.
Further intruder testing of the PUF was conducted by post-graduate students in Electronics and Computer Science. The students were unable to correctly identify any household in the PUF dataset.
Ramachandran et al. conducted a case study looking at the re-identification of sensitive data [12]. They examined the success of efforts to re-identify a large de-identified dataset by matching it with a publicly available dataset purchased from wholesale data sellers. The de-identified dataset contained over 2 million people; the purchased dataset contained demographic data for 700,000 people, with variables including name, date of birth, address, ethnicity, sex, and income. Using an algorithm to match the two datasets, the researchers found that the probability of a successful match was less than 0.005%. They concluded that there is little risk posed by this type of large-scale data matching attack.
The Heritage Health Prize dataset was made available for individuals to participate in a data analysis competition with an ultimate $3 million cash prize [13]. This longitudinal claims dataset covered 113,000 patients over a three-year period, accounting for re-admissions. Before the dataset was made available for the competition, the sponsor commissioned a re-identification attack [14]. The attack explored multiple avenues to re-identify patients with different characteristics but did not re-identify any data subjects in the competition dataset.