As mentioned above, we used various criteria to create a well-defined set of 35,121 trials, which we processed to yield 71,359 pairs of investigators’ names and their roles. We wished to determine how often individual names occurred in these records, but discovered that many “names” were junk information, which prevented any meaningful assessment of the number of PIs or their frequency. Overall, we encountered four categories of errors with PI (or RP) information in ClinicalTrials.gov data, as detailed below.
Missing data
In two of the several steps of data processing we found that a notable amount of data was missing. First, in trying to match name and role we found that one or both pieces of information were missing in 3729 (11.9%) of the 35,121 trial records (Table 1). Also, in 17 cases the number of names and number of roles in a given trial record did not match (Table 1). Second, since a given record may have more than one name and role, subsequent processing led us to a list of 71,359 pairs of names and roles. In 10,572 (17.4%) of these (Table 1), the “name” field contained junk information instead of the name of a real person. Examples of this “non-person” junk information were Bioscience Center, Central Contact, Chief Development Officer, Chief Medical Officer, Clinical Development Manager, Clinical Development Support, Clinical Director Vaccines, Clinical Program Leader, Clinical Project Lead, Clinical R&D, Clinical Sciences & Operations, Clinical Study Operations, Clinical Trial Management, Clinical Trials, [company’s call center number], [company’s name], Global Clinical Registry, Investigational Site, MD, Medical Director, Medical Director Clinical Science, Medical Monitor, Medical Responsible, Professor, Program Director, Sponsor’s Medical Expert, Study Physician, TBD TBD, Use Central Contact, Vice President Medical Affairs, VP Biological Sciences, and VP Clinical Science Strategy. After removing such junk “names”, we were left with 60,787 pairs of names and roles.
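The two checks described above (pairing names with roles, then discarding non-person “names”) can be sketched as follows. This is a minimal illustration, not the authors' procedure, which is described in Additional file 1: S1 Text. The function names, the exact-match rule, and the sample record (including the name “Jane Doe”) are assumptions made for the example; the term list is a small subset of the examples quoted above.

```python
# Hypothetical sketch: pair names with roles and flag non-person "names".
# NON_PERSON_TERMS is a small subset of the examples quoted in the text.
NON_PERSON_TERMS = {
    "central contact", "chief medical officer", "clinical trials",
    "investigational site", "md", "medical director", "medical monitor",
    "professor", "program director", "study physician", "tbd tbd",
    "use central contact",
}


def pair_names_and_roles(names, roles):
    """Return (name, role) pairs, or None if the record cannot be paired."""
    if not names or not roles or len(names) != len(roles):
        return None  # missing data, or name and role counts do not match
    return list(zip(names, roles))


def is_non_person(name):
    """True if the whole field is a known title or placeholder."""
    return name.strip().lower() in NON_PERSON_TERMS


# Invented record, for illustration only.
record_names = ["Jane Doe", "Medical Director"]
record_roles = ["Principal Investigator", "Study Director"]
pairs = pair_names_and_roles(record_names, record_roles)
kept = [(n, r) for n, r in pairs if not is_non_person(n)]
```

An exact match on the whole field is deliberately conservative: it would not reject a real person whose name merely contains a word such as “Director”.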
For the rejected records in both Additional file 10: Table S8 and Additional file 13: Table S11, we also wished to determine whether a PI had been listed at any point during the history of the trial. To do this, we examined the history of a sample of records (Additional file 1: S1 Text and Additional file 15: Table S13). We used a 5% sample each of the NCT IDs of the 3729 rejects in Additional file 10: Table S8 and of the 8907 unique NCT IDs rejected in Additional file 13: Table S11 (Additional file 14: Table S12), which amounted to 211 and 422 trials, respectively. We found that only 16 (7.5%) of the 211 rejects from Additional file 10: Table S8 and only 9 (2%) of the 422 rejects from Additional file 13: Table S11 had a PI in at least one history record. Together, this amounted to 25 of 633, or 4%, of the sampled rejects. Applying these percentages, the 3729 rejects in Additional file 10: Table S8 are reduced to 3449, and the 8907 rejects in Additional file 13: Table S11 are reduced to 8729.
Finally, we summarized the data above. The overall number of records with missing or junk information was as follows: (a) 3449/35,121 in Additional file 10: Table S8; (b) 17/35,121 in Additional file 11: Table S9; and (c) 8729/35,121 in Additional file 13: Table S11. These add up to 12,195/35,121 (35%) of NCT IDs with missing or junk information in the PI field.
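The arithmetic behind these totals can be reproduced in a few lines. The sketch below uses only figures stated in the text and assumes that the adjusted counts were obtained by scaling each reject count by the rounded percentage reported for its sample (7.5% and 2%); the function and variable names are illustrative.

```python
# Reproduce the adjustment and summary using only figures from the text.
TOTAL_TRIALS = 35_121


def adjust(rejects, share_with_pi_in_history):
    """Scale a reject count down by the share of sampled rejects
    that had a PI in at least one history record."""
    return round(rejects * (1 - share_with_pi_in_history))


s8_adjusted = adjust(3729, 0.075)   # Table S8 rejects, 7.5% had a PI in history
s11_adjusted = adjust(8907, 0.02)   # Table S11 rejects, 2% had a PI in history
mismatched = 17                     # Table S9: name and role counts differ

total_flagged = s8_adjusted + mismatched + s11_adjusted
share = total_flagged / TOTAL_TRIALS

print(s8_adjusted, s11_adjusted, total_flagged, f"{share:.0%}")
```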
Variations in names
Next, we wished to determine the frequency with which a given person’s name appeared as the PI in the set of 60,787 names in Additional file 13: Table S11. It turned out that, of the 60,787 names, 82% were those of a PI, with the rest being those of sub-investigators (5%), Study Directors (9%), and Study Chairs (4%). For the purpose of the results described below, however, this variety of designations did not matter. We took several steps to clean up the names to ensure that each individual was represented by a single name. However, names had been entered in the database in inconsistent ways, which made this process challenging. These problems fall into the 18 categories listed below.
a) Extraneous information along with the name:
   (i) The name may have had a prefix (e.g., Prf., Prof. Dr., COL) or suffix (e.g., MD; Jr.; III; M.D., Principal Investigator; BSc, MBCHB, MD, Study Director) of varying lengths.
   (ii) The name may have included a punctuation mark.
b) Different kinds of variations of the name:
   (i) The name may have had spelling mistakes.
   (ii) One or more parts of the name may have been abbreviated or truncated.
   (iii) Parts of the name may have been ordered differently.
   (iv) The middle name may or may not have been mentioned.
   (v) Parts of the name may or may not have been hyphenated.
   (vi) The surname may have been modified.
   (vii) The surname may have been repeated.
   (viii) The person’s initials may or may not have been separated by spaces.
   (ix) The entire name, or parts of it, may have been in capitals.
   (x) The name may have contained a non-English character or the closest English character.
   (xi) The first name may have been split into two, or the first and middle name may have been merged.
   (xii) The surname may have been split into two, or the middle and surname may have been merged.
   (xiii) A nickname, in brackets, may have been mentioned in the middle of the name.
   (xiv) The Americanized nickname of part of a foreign name may have replaced the original.
c) Other complications with the names:
   (i) A person’s entire name may have been represented by just one word.
   (ii) Two individuals may have shared the same name.
We went on to eliminate or quantify categories a(i, ii), b(iv, ix, xiii) and c(i) (Additional file 1: S1 Text and Additional file 16: Table S14, Additional file 17: Table S15, Additional file 18: Table S16, Additional file 19: Table S17 and Additional file 20: Table S18), and obtained an estimate of 12.8% of names that could not be identified unambiguously. Although we have not quantified the other categories of errors, preliminary work leads us to believe that they are not numerous.
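To make a few of the categories above concrete, the following sketch shows one way to reduce a raw name to a comparison key, covering a(i), a(ii), b(ix), b(x) and b(xiii). It is an illustration under stated assumptions, not the cleaning procedure used in this study (see Additional file 1: S1 Text). The prefix and suffix sets are short illustrative lists, `normalize_name` is a hypothetical helper, and the example name is invented.

```python
import re
import unicodedata

# Illustrative lists only; a real run would need far longer ones.
PREFIXES = {"prf", "prof", "dr", "col"}
SUFFIXES = {"md", "jr", "iii", "bsc", "mbchb",
            "principal", "investigator", "study", "director"}


def normalize_name(raw):
    """Return a lower-case key with titles, punctuation and accents removed."""
    # b(x): fold accented letters to the closest unaccented letter.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # b(xiii): drop a bracketed nickname.
    text = re.sub(r"\([^)]*\)", " ", text)
    # a(ii): join dotted abbreviations ("M.D." -> "MD"), then turn any
    # remaining punctuation into spaces (hyphens are kept).
    text = text.replace(".", "")
    text = re.sub(r"[^A-Za-z\s-]", " ", text)
    # b(ix): ignore capitalisation.
    tokens = text.lower().split()
    # a(i): strip known prefixes from the front and suffixes from the back.
    while tokens and tokens[0] in PREFIXES:
        tokens.pop(0)
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


# Invented example name, for illustration only.
key = normalize_name("Prof. Dr. José (Pepe) GARCÍA, M.D., Principal Investigator")
```

Letters that have no Unicode decomposition (for example ø) are not folded by this approach and would need an explicit mapping. The remaining categories, such as spelling mistakes, reordered parts, or two people sharing a name, cannot be resolved by string normalization alone.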
Multiple PIs per trial
Another category of error concerned trials that listed more than one person as PI. Examples included NCT01954056 (with 18 PIs), NCT00405704 (21 PIs), NCT01819272 (50 PIs), NCT00419263 (73 PIs), and NCT01361308 (74 PIs).
Missing RP tag
Finally, we wished to know whether PI information was available from the RP tag. For this, we examined the 35,121 records from Additional file 10: Table S8. We found that the RP tag was missing in 1221 (3.5%) of the 35,121 records (Additional file 21: Table S19). As explained in Additional file 1: S1 Text, the RP details were usually provided both at the top of the NCT ID record and at the bottom. At the top, the exact wording was usually “Information provided by (Responsible Party):...”. However, in 1221 records the wording was “Information provided by:...”. These records did not have the RP information at the bottom either. Thus, anybody using automated methods to search for RP information based on the RP tag would not find it in these records.
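The difference between the two header wordings can be detected mechanically. The sketch below is a minimal illustration that assumes the record is available as plain text containing one of the two wordings quoted above; the function name, the category labels, and the sample strings are hypothetical.

```python
import re

# Header wording that carries the RP tag, and the wording that lacks it.
RP_HEADER = re.compile(r"Information provided by\s*\(Responsible Party\)\s*:")
PLAIN_HEADER = re.compile(r"Information provided by\s*:")


def classify_header(record_text):
    """Classify a record by which 'Information provided by' wording it uses."""
    if RP_HEADER.search(record_text):
        return "rp_tag_present"
    if PLAIN_HEADER.search(record_text):
        return "rp_tag_missing"
    return "no_header"


# Invented sample strings, for illustration only.
with_tag = classify_header("Information provided by (Responsible Party): Example University")
without_tag = classify_header("Information provided by: Example University")
```

Checking the tagged wording first matters, because the plain pattern is written so that it does not match the tagged header; a record is counted as missing the tag only when the plain wording appears on its own.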
To assess whether the RP field was useful for obtaining PI information, we used a sample of 500 records and found PI information in only 138 of them (Additional file 21: Table S19). All of these cases already had PI information, as determined in Additional file 10: Table S8. Thus, the RP field did not yield any additional PI information.