We present our findings on problems with the CTRI database below. The data processing for the first few sections of results is presented in Fig. 1. For most of these problems, error rates have been calculated over time, and are presented in Fig. 2 and Additional file 6. Aside from the results, we identified two challenges in accessing the data. These are also described below.
Type of Study
We first examined the Type of Study of the 12,673 trials. There were 1331 categories, which are listed, along with their frequencies, in Additional file 5. The top five categories (Fig. 1) were (1) drugs (2732 or 22%), (2) Not Available (884, 7%), (3) Surgical/Anesthesia (850, 7%), (4) Ayurveda (737, 6%), which is a system of alternative medicine practiced in India, and (5) Cross Sectional Study (684, 5%).
Quantification of problem over time: In Fig. 2a and Additional files 5 and 6, we quantify the problem of too many categories of Types of Study. We examined the number of categories, with respect to the number of trials over time, in four 3-year intervals. The percentages were 6.5, 19.3, 25.2, and 48.9, respectively. As such, the number of categories increased more than sevenfold from time period one to four. Although this is not strictly an ‘error rate’, we have labeled it as such in Fig. 2a, since all other problems quantified in Fig. 2 are error rates.
In ClinicalTrials.gov, the equivalent field was Intervention. This had 11 categories: Behavioral, Biological, Combination Product, Device, Diagnostic Test, Dietary Supplement, Drug, Genetic, Other, Procedure, and Radiation. One category could be chosen multiple times, and more than one category could also be chosen. However, in downloaded data, multiple interventions were listed in a discrete and unambiguous manner. We give three examples of this, starting with the unique ID of the trial concerned: (1) NCT00736645 – Dietary Supplement: selenomethionine|Drug: finasteride|Other: placebo; (2) NCT01282515 – Drug: clobetasolpropionate|Drug: hexaminolevulinate; and (3) NCT00787969 – Biological: rituximab|Drug: cladribine|Drug: temsirolimus|Biological: Filgrastim|Biological: Pegfilgrastim.
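To illustrate why this pipe-delimited listing is discrete and unambiguous, the following minimal Python sketch splits such a string into category and intervention pairs. The function name `parse_interventions` is our own illustrative choice, and the sketch assumes that every item follows the "Category: name" pattern seen in the three examples above.

```python
from collections import Counter


def parse_interventions(field: str) -> list[tuple[str, str]]:
    """Split a pipe-delimited intervention string into (category, name) pairs.

    Assumes each item has the form "Category: name", as in the examples above.
    """
    pairs = []
    for item in field.split("|"):
        # partition() splits on the first colon only, so names containing
        # colons would be kept intact.
        category, _, name = item.partition(":")
        pairs.append((category.strip(), name.strip()))
    return pairs


example = "Dietary Supplement: selenomethionine|Drug: finasteride|Other: placebo"
pairs = parse_interventions(example)

# Tally how often each category occurs within the one trial.
category_counts = Counter(category for category, _ in pairs)
```

Because each item carries its own category label, a trial with several interventions of the same category (such as the two Drug entries of NCT01282515) can be counted without any free-text interpretation.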
In all the following sections except two, which are specified, we focused our attention on the largest category of trials, i.e., drugs.
Countries of Recruitment
We then investigated the Countries of Recruitment of the 2732 drug trials. There were 2070 (76%) trials conducted only in India (hereafter, Indian trials), 640 (23%) were conducted in India as well as in other countries (Multinational trials), and 22 (1%) were conducted only outside India (Foreign trials), as shown in Fig. 1.
We looked at the set of 22 Foreign trials more closely (Additional file 5). Although none of them listed India as a Country of Recruitment, in one case no country was listed. Further examination of this case showed that (1) Recruitment Status of Trial (Global) was “Not applicable” and Recruitment Status of Trial (India) was “Open to recruitment”; (2) Date of First Enrollment (Global) did not list a date but Date of First Enrollment (India) did; and (3) all 200 subjects were recruited from India. This appeared to be an Indian trial.
Of the remaining 21 Foreign trials, only three trials appeared to be truly foreign, since they had no recruitment from India, and other fields were also as expected for Foreign trials. Thus, for each of these three trials, (1) Recruitment Status of Trial (Global) was either “Completed” or “Not yet recruiting”, and Recruitment Status of Trial (India) was “Not applicable”; (2) a Date of First Enrollment (Global) was provided, whereas a Date of First Enrollment (India) was not; (3) Total Sample Size had a non-zero value, whereas Sample Size from India was nil.
For the remaining 18 trials, (1) in no case was Recruitment Status of Trial (Global) or Recruitment Status of Trial (India) listed as “Not applicable”; (2) all of them listed a date in Date of First Enrollment (Global), and all but two listed one in Date of First Enrollment (India); and (3) Total Sample Size ranged from 120 to 10,000 and the Sample Size from India ranged from 1 to 1000.
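The three cross-checks applied to the Foreign trials can be sketched as a single consistency test. This is an illustrative sketch only: the record type, its attribute names, and the function `looks_truly_foreign` are hypothetical stand-ins for the CTRI fields named above, and treating any Global status other than "Not applicable" as acceptable is our assumption (the three truly foreign trials had "Completed" or "Not yet recruiting"). The actual processing steps are those in Additional file 5.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrialRecord:
    # Hypothetical attribute names standing in for the CTRI fields in the text.
    status_global: str                      # Recruitment Status of Trial (Global)
    status_india: str                       # Recruitment Status of Trial (India)
    first_enrollment_global: Optional[str]  # Date of First Enrollment (Global)
    first_enrollment_india: Optional[str]   # Date of First Enrollment (India)
    total_sample_size: int                  # Total Sample Size
    sample_size_india: int                  # Sample Size from India


def looks_truly_foreign(trial: TrialRecord) -> bool:
    """Return True only if all three pairs of fields agree with a Foreign trial."""
    status_ok = (
        trial.status_global != "Not applicable"  # assumption, see lead-in
        and trial.status_india == "Not applicable"
    )
    dates_ok = (
        trial.first_enrollment_global is not None
        and trial.first_enrollment_india is None
    )
    sizes_ok = trial.total_sample_size > 0 and trial.sample_size_india == 0
    return status_ok and dates_ok and sizes_ok


# Two made-up records: the dates and the sample size of 500 are placeholders.
foreign_like = TrialRecord("Completed", "Not applicable", "date given", None, 500, 0)
indian_like = TrialRecord("Not applicable", "Open to recruitment", None, "date given", 200, 200)
```

The second record mirrors the case with no country listed, in which all 200 subjects were recruited from India; it fails all three checks, consistent with its being an Indian trial.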
In order to ascertain whether, in fact, there were 2070 Indian trials and 640 Multinational trials, we proceeded to examine these three other pairs of fields for both those datasets as well. Details of each step of processing are available in Additional file 5. Only 1764 (85%) of the 2070 were unambiguously Indian trials, and only 609 (95%) of the 640 were unambiguously Multinational trials. Additional file 5 summarizes the processing of the Indian, Multinational, and Foreign trial data to determine the unambiguously correct cases in each of these three categories.
Quantification of problem over time: In Fig. 2a, Additional file 6 and Additional file 5, we quantify the problem of unambiguously identifying (1) the Indian trials, and (2) the Multinational trials. The percentages of trials with errors, over four time periods, were 63.1, 23.9, 4.4, and 3, respectively, for the Indian trials and 9.2, 5.1, 0.9, and 3.8, respectively, for the Multinational trials. As such, for the Indian trials, the error rates decreased 21-fold from time period one to four, and for the Multinational trials, they decreased 10-fold from time period one to three, but then increased again to 40% of the peak value.
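The fold changes and percent-of-peak values quoted in these quantification paragraphs follow from simple ratios of the reported percentages. As a worked check, the short Python sketch below recomputes them for the error rates just given; the helper names are our own and are not part of the original analysis.

```python
def fold_decrease(start: float, end: float) -> float:
    """Ratio of an earlier error rate to a later one."""
    return start / end


def percent_of_peak(rates: list[float], index: int) -> float:
    """Express the rate at `index` as a percentage of the highest rate."""
    return 100 * rates[index] / max(rates)


# Error rates (%) over the four time periods, as reported above.
indian = [63.1, 23.9, 4.4, 3.0]
multinational = [9.2, 5.1, 0.9, 3.8]

indian_fold = fold_decrease(indian[0], indian[3])                # about 21
multinational_fold = fold_decrease(multinational[0], multinational[2])  # about 10
multinational_rebound = percent_of_peak(multinational, 3)        # about 41, rounded to 40% in the text
```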
Of these four fields, or pairs of fields, ClinicalTrials.gov only had Country (of recruitment). Therefore, for a global trial, there was no way to check the status of the trial in the USA or elsewhere.
Relationship of Type of Trial and Phase of Trial
We went on to look at Type of Trial. The four options for this field were Observational, Interventional, PMS (that is, postmarketing surveillance), and BA/BE (Bioavailability/Bioequivalence). The 1764 Indian trials fell into either the Interventional (1655, or 94%) or PMS (109, or 6%) categories. Likewise, the 609 Multinational trials fell into either the Interventional (606, or 99.5%) or PMS (3, or 0.5%) categories. We proceeded to use the two sets of Interventional cases for all the analyses mentioned below, except where specified.
We first explored Phase of Trial. The options for this field were Phase 1, Phase 1/2, Phase 2, Phase 2/3, Phase 3, Phase 3/4, Phase 4, N/A, and PMS. For the Multinational set there were no cases of Phase listed as PMS, but for the Indian set there were 55 (3%) PMS cases (Additional file 5).
Quantification of problem over time: In Fig. 2a and Additional files 5 and 6, we quantify the problem of Interventional trials with Phase listed as PMS, for the Indian trials. The percentages of trials with errors, over four time periods, were 0, 0.7, 4.2, and 4.6, respectively. As such, the error rate increased from 0 to almost 5% over the four time periods.
In ClinicalTrials.gov, Study type had three options: Interventional studies, Observational studies (including Patient Registries), and Expanded Access studies. PMS was not an option, and therefore we could not compare this field in the two databases.
Confusion between PMS and Phase 4 trials
Continuing from the preceding “Relationship of Type of Trial and Phase of Trial” section, we examined whether trials which listed PMS as the Type of Trial had Phase 4 as Phase of Trial, and identified such cases among the Indian, but not the Multinational trials.
Quantification of problem over time: In Fig. 2b and Additional files 5 and 6, we quantify the problem of the Type of Trial being PMS, but the Phase being Phase 4. This was done for a redefined set of Indian trials (wherein we started with the PMS trials rather than the Interventional trials), as detailed in Additional file 5. The percentages of trials with errors, over four time periods, were 100, 33.3, 22.5, and 37.9, respectively. As such, the error rates decreased more than fourfold, but then increased to 40% of the peak value.
As mentioned above, PMS was not an option in ClinicalTrials.gov, and therefore we could not compare this field in the two databases.
Type of Trial: BA/BE versus Phases 1–4
Continuing with problems related to Type of Trial, we found that although BA/BE was a separate category, such trials were sometimes classified as having Phases 1, 1/2, 2, 2/3, 3, 3/4, or 4 (Additional file 5). Most such cases were among the Indian trials.
Quantification of problem over time: In Fig. 2b and Additional files 5 and 6, we quantify the problem of the Type of Trial being BA/BE, but the Phase being 1, 1/2, 2, 2/3, 3, 3/4, or 4. This was done for a redefined set of Indian trials (wherein we started with the original set of 12,673 trials, used all the filters to generate the unambiguously Indian trials, and used the filter BA/BE for Type of Trial) as detailed in Additional file 5. The percentages of trials with errors, over four time periods, were 83.3, 19.6, 6.5, and 13.7, respectively. As such, for the Indian trials, the error rates decreased 13-fold over three time periods, but then increased again to almost 20% of the initial value.
In the Multinational set, there was just one trial each in Phases 1 and 2, so we could not investigate the error rates over time.
For ClinicalTrials.gov, there were three options in Study type: Interventional, Observational (subsection: Patient registries), and Expanded access. BA/BE was not an option, and therefore we could not compare this field in the two databases.
Sites of study: incorrect listing of cities
In investigating the cities in which trials took place, we found some cases with incomplete or non-standard information.
Quantification of problem over time: In Fig. 2b and Additional files 5 and 6, we quantify the problem of the incorrect listing of cities in the Indian and the Multinational trials. The percentages of trials with errors, over four time periods, were 0, 2.5, 2.8, and 5.3, respectively for the Indian trials and 0.3, 1.3, 3.3, and 1.3, respectively for the Multinational trials. As such, for the Indian trials, the error rates increased from 0 to 5% from time period one to four, and for the Multinational trials, they increased 10-fold from time period one to three, but then decreased to 40% of the peak value in time period four.
It is not known how well the cities were classified in ClinicalTrials.gov.
Missing data
Above, in the section on Countries of Recruitment, we noted that some data were missing. We identified four additional fields for which data were missing. These were (1) Name of Principal Investigator (PI), (2) Type of Study, (3) Name of Primary Sponsor, and (4) the state hosting a trial. We quantify these errors in the following sections.
Name of PI not listed
In examining the Details of Principal Investigator or overall Trial Coordinator (multi-center study) we found that for the Indian and Multinational cases, 5% and 40%, respectively, did not have any details in this field (Additional file 5).
Quantification of problem over time: In Fig. 2c and Additional files 5 and 6, we quantify the problem of the PI name not being listed for the Indian and Multinational trials. The percentages of trials with errors, over four time periods, were 10.3, 9, 4.4, and 3.5, respectively for the Indian trials and 54.1, 42.1, 32, and 33.1, respectively, for the Multinational trials. As such, for the Indian trials, the error rates decreased threefold from time period one to four, and for the Multinational trials, they decreased twofold from time period one to three, but then plateaued.
In earlier work, we found that in ClinicalTrials.gov, too, PI names were missing in many records, since it was a non-compulsory field [17].
Type of Study not listed
We identified trials that had no information for Type of Study but that listed Phases 1, 1/2, 2, 2/3, 3, 3/4, or 4. We identified such cases only among the Indian trials.
Quantification of problem over time: In Fig. 2c and Additional files 5 and 6, we quantify the problem of the Type of Study not being listed, but the Phase being 1, 1/2, 2, 2/3, 3, 3/4, or 4. This was done for a redefined set of Indian trials (wherein we started with the original set of 12,673 trials, used all the filters to generate the unambiguously Indian trials, and used the filter “Not available” for Type of Study) as detailed in Additional file 5. The percentages of trials with errors, over four time periods, were 87.5, 20.7, 6.2, and 13.4, respectively. As such, the error rates decreased 14-fold, but then doubled in time period four.
Name of Primary Sponsor not listed
We identified trials that did not mention the name of the Primary Sponsor. We identified such cases only among the Indian trials.
Quantification of problem over time: In Fig. 2c and Additional files 5 and 6, we quantify the problem of the Primary Sponsor not being named. The percentages of trials with errors, over four time periods, were 1.5, 2.7, 1.8, and 1, respectively. As such, the error rates almost doubled from time period one to two, but then dropped in the next two time periods to end up at 40% of the peak value.
The state hosting a trial not listed
We identified trials that did not list the state in which the trial took place.
Quantification of problem over time: In Fig. 2c and Additional files 5 and 6, we quantify the problem of the state hosting the trial not being listed, for the Indian and Multinational trials. The percentages of trials with errors, over four time periods, were 5.9, 2.1, 0.2, and 0.1, respectively, for the Indian trials and 12.5, 3.3, 1.3, and 0.1, respectively, for the Multinational trials. As such, for both the Indian and the Multinational trials, the low initial error rates dropped to almost nothing over time.
Many fields in ClinicalTrials.gov were compulsory, and it is therefore likely that the record of each trial was much more complete.
Variations in a PI’s name
There were several possible variants of the name of a PI, which made it difficult to establish unambiguously that, for instance, two names represented the same person. This became a particular challenge if automated methods were being used to process large numbers of names.
Examples of categories of these variations are listed below, where we have substituted the actual letters in names by the letters a, b, or c to protect the identity of the PI. CTRI records that illustrate these examples are listed in Additional file 5:
1. The presence or absence of the middle name (example Dr Aaaaa Bbbbb Ccccc and Dr Aaaaa Ccccc)
2. Parts of the name abbreviated (Dr Aaaaaaaa B Cccc and Dr A B Cccc)
3. Spelling mistakes (Dr Aaaaaa Bbbbbbb Cccccc and Dr Aaaaaa BbbbbbbbbCccccc)
4. Different ordering of parts of the name (DrAaaaaaaaaB and Dr B Aaaaaaaaa)
5. Different spacings in the name (Aaaaaa B C and Aaaaaa BC)
6. Variable use of capitals (Dr Aaaaaa Bbbbbbb and Dr AAAAAA BBBBBBB)
7. Extraneous information with the name (Aaaa Bbbbbb and Aaaa Bbbbbb MD).
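To show why such variants complicate automated processing, the following Python sketch reduces a name to a comparison key. It is a minimal illustration under stated assumptions, not the method used in this study: the title and suffix lists are hypothetical and incomplete, and the rule that an all-capital token of up to three letters is a run of initials is a heuristic that would wrongly split short names typed in capitals.

```python
import re

# Hypothetical, deliberately short lists; a real list would need curation.
TITLES = {"dr", "prof", "mr", "mrs", "ms"}
SUFFIXES = {"md", "phd", "mbbs"}


def normalize_name(raw: str) -> tuple[str, ...]:
    """Build an order-, case-, and spacing-insensitive key for a PI name."""
    key = []
    for token in re.findall(r"[A-Za-z]+", raw):
        lowered = token.lower()
        if lowered in TITLES or lowered in SUFFIXES:
            continue  # drop titles and extraneous qualifications
        if token.isupper() and len(token) <= 3:
            key.extend(lowered)  # treat "BC" as the initials "b" and "c"
        else:
            key.append(lowered)
    # Sorting makes the key independent of the order of the name parts.
    return tuple(sorted(key))


same_capitals = normalize_name("Dr Aaaaaa Bbbbbbb") == normalize_name("Dr AAAAAA BBBBBBB")
same_spacing = normalize_name("Aaaaaa B C") == normalize_name("Aaaaaa BC")
same_suffix = normalize_name("Aaaa Bbbbbb") == normalize_name("Aaaa Bbbbbb MD")
same_middle = normalize_name("Dr Aaaaa Bbbbb Ccccc") == normalize_name("Dr Aaaaa Ccccc")
```

A key of this kind can reconcile differences in ordering, spacing, capitals, and extraneous information, but not a missing middle name, an abbreviation, or a spelling mistake. Those cases would need approximate matching and, most likely, manual review.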
In earlier work, we identified many such problems with the names of PIs in ClinicalTrials.gov as well [17].
The name and classification of the Primary Sponsor
There were many cases of variations in the name of a given Primary Sponsor. Examples included the following variations for a given company: (1) Bristol Myers Squibb, BRISTOL MYERS SQUIBB, Bristol Myers Squibb India Pvt. Ltd., BristolMyers Squibb India pvt Ltd., and BristolMyers Squibb Research and Development; (2) Merck Sharp Dohme, Merck Sharp Dohme Corp, and Merck Sharp Dohme Corp a subsidiary of Merck Co Inc.; (3) Novo Nordisk India Private Limited, Novo Nordisk India Private Limited AS, and Novo Nordisk India Private Ltd.; and (4) Sanofi Synthelabo India Limited, SanofiSynthelabo IndiaLtd, and SanofiSynthelabo India Limited.
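One way to group such variants automatically is to strip case, spacing, and punctuation and then match each name against a curated list of canonical sponsor names. The Python sketch below illustrates the idea under the assumption that such a curated list exists; the list, the function names, and the prefix-matching rule are ours and were not part of the analysis reported here.

```python
import re
from collections import defaultdict


def squash(name: str) -> str:
    """Lower-case a name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def group_by_canonical(names: list[str], canonical: list[str]) -> dict:
    """Assign each name to the first canonical sponsor whose squashed form it starts with.

    Names matching no canonical sponsor are collected under the key None.
    """
    keys = {squash(c): c for c in canonical}
    groups = defaultdict(list)
    for name in names:
        squashed = squash(name)
        match = next((c for k, c in keys.items() if squashed.startswith(k)), None)
        groups[match].append(name)
    return dict(groups)


variants = [
    "Bristol Myers Squibb",
    "BRISTOL MYERS SQUIBB",
    "Bristol Myers Squibb India Pvt. Ltd.",
    "BristolMyers Squibb India pvt Ltd.",
    "BristolMyers Squibb Research and Development",
    "Sanofi Synthelabo India Limited",
    "SanofiSynthelabo IndiaLtd",
    "SanofiSynthelabo India Limited",
]
# Hypothetical curated list of canonical names.
groups = group_by_canonical(variants, ["Bristol Myers Squibb", "Sanofi Synthelabo"])
```

Prefix matching deliberately merges subsidiaries (for example, the India and Research and Development entities) under the parent name. Whether that is appropriate depends on the question being asked.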
In ClinicalTrials.gov, the sponsor name seemed to have been chosen through a drop-down menu, since each organization appeared to be represented by just one version of a name. By way of examples, each of the following organizations was listed multiple times in the database in exactly the same manner: Acotec Scientific Co., Ltd.; Merck Sharp & Dohme Corp.; National Institute of Allergy and Infectious Diseases (NIAID); Albert Einstein College of Medicine; Baxter Healthcare Corporation; and Bausch & Lomb Incorporated.
Aside from variations in a given company’s name, we also noted variations in a given organization’s classification. For example, (1) each of the following companies was variably classified as Pharmaceutical industry-Global or Pharmaceutical industry-Indian in different trials: AstraZeneca, Boehringer Ingelheim, BristolMyers Squibb India Pvt. Ltd., and Eisai Limited; (2) Biogen Idec was classified as Other [Biotech Company], whereas Biogen Idec MA Inc. and Biogen Idec United Kingdom were classified as Pharmaceutical industry-Global; (3) Forest Research Institute Inc. was classified either as Research institution or as Pharmaceutical industry-Global; and (4) The National Institute of Allergy and Infectious Diseases of the National Institutes of Health, USA was classified either as a Government funding agency or as a Research institution and hospital.
For the classification of the Primary Sponsor, CTRI had quite a large number of categories, as follows: (1) Pharmaceutical industry-Global, (2) Pharmaceutical industry-Indian, (3) Contract research organization, (4) Government funding agency, (5) Research institution, (6) Research institution and hospital, and (7) Others. The following are examples of Others: Other [Healthcare industry], Other [international non-governmental and not-for-profit organization], Other [National public health institute of the United States], Other [Non profit organization works to improve health focused on Neglected Tropical Diseases], Other [Not for Profit Organisation], and so on.
In contrast to CTRI, the six organizations listed above as test cases appeared to be classified in ClinicalTrials.gov in one category each.
Details of Ethics Committee
Next, we investigated Details of Ethics Committee and made several observations. These were (1) the lack of enough information to identify each ethics committee (EC) unambiguously; (2) lack of clarity on which site sought approval from which committee; (3) the listing of more ECs than sites of a given trial; and (4) the listing of foreign ECs along with Indian ones, for certain Multinational trials. For (1) and (2) we just identified a few examples, whereas for (3) and (4) we identified all the cases, and quantified the problem. Further details are provided in Additional file 5.
Lack of enough information to identify each EC unambiguously
Not all ECs had an address or a clear hospital affiliation, and some may have been listed only by their names. As such, the affiliations and locations of such ECs could not always be established unambiguously. Examples of committee names were (1) Human welfare Ethics Committee for Human Sciences and Research; (2) Institutional Ethics Committee For Human Research; (3) Integrity Ethics Committee; (4) Regional Ethics Committee; and (5) LPR Ethics Committee.
Lack of clarity on which site sought approval from which committee
It was unclear which site sought approval from which EC. Multiple ECs may have approved a given trial, and if, for each site in Sites of Study, we looked for the corresponding institution or address in Details of Ethics Committee, we could not always infer which committee it was linked to.
The listing of more ECs than sites of a given trial
There were trials for which there were more ECs than sites. An example was one which had seven trial sites but 28 committees.
Quantification of problem over time: In Fig. 2a and Additional files 5 and 6, we quantify the problem of there being more ECs than trial sites in the Indian and Multinational trials. The percentages of trials with errors, over four time periods, were 9.4, 9.9, 4.6, and 9.5, respectively, for the Indian trials and 7.4, 6.6, 7.1, and 3.4, respectively, for the Multinational trials. As such, for the Indian trials, the error rate was close to 10% in all time periods except the third, when it halved. For the Multinational trials, it was around 7% in all time periods except the last, when it halved.
The listing of foreign ECs along with Indian ones
In the Multinational dataset there were two trial records in which foreign committees were included in the list of ECs. Examples of such committees included (1) Comite National D’Ethique pour la Recherche en Sante, Senegal; (2) Comite National d’Ethique et de Recherche (CNER) de Côte dIvoire; and (3) Convite nacional De Bioetica Para A Saude, Mozambique.
ClinicalTrials.gov did not have a field for EC approval, and therefore we could not compare this field in the two databases.
Finally, and aside from the findings listed above, we noted two challenges related to accessing data in the CTRI database, one concerned with the search function and the other with the download options. These are described in the following sections.
The search function of the database
The search function of the database did not work well, as illustrated by the following examples. (1) If, for Type of Trial we chose “Interventional”, 16 records were pulled up instead of thousands. (2) Likewise, if, for Phase of Trial we chose Phase 3, five records were pulled up instead of thousands. (3) Another example, concerning the search for trials run by one particular hospital, is detailed in Additional file 5. (4) If one wanted the list of all the trials hosted by the database, unless one entered the term “CTRI” as a keyword, no records were pulled up.
We did not carry out a systematic exploration of the search function of ClinicalTrials.gov.
Download options for trial data
It was not a straightforward task to download data related to a large number of trials at a time. The obvious option was to select individual trial records, open each in the browser, and download one HTML record at a time. Users with programming skills could use Python both as a web-scraping bot and as a parser to reformat the data from an unstructured, hard-to-query HTML format into a structured SQLite database.
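The parsing half of this route can be sketched with Python's standard library alone. The example below is a minimal illustration, not the code used in this study: the HTML layout (two-cell table rows holding a field label and its value), the table name `trial_fields`, and the identifier `EXAMPLE-0001` are all hypothetical, and the step of fetching pages from the registry is omitted.

```python
import sqlite3
from html.parser import HTMLParser


class FieldTableParser(HTMLParser):
    """Collect (label, value) pairs from table rows that have exactly two cells."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td":
            self._in_cell = True
            self._cells.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and len(self._cells) == 2:
            label, value = (cell.strip() for cell in self._cells)
            self.rows.append((label, value))


def store_record(conn: sqlite3.Connection, trial_id: str, html: str) -> None:
    """Parse one HTML record and store its fields as rows in SQLite."""
    parser = FieldTableParser()
    parser.feed(html)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trial_fields (trial_id TEXT, field TEXT, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO trial_fields VALUES (?, ?, ?)",
        [(trial_id, field, value) for field, value in parser.rows],
    )
    conn.commit()


# A made-up record in the assumed layout.
SAMPLE_HTML = """
<table>
  <tr><td>Type of Trial</td><td>Interventional</td></tr>
  <tr><td>Phase of Trial</td><td>Phase 3</td></tr>
</table>
"""

conn = sqlite3.connect(":memory:")
store_record(conn, "EXAMPLE-0001", SAMPLE_HTML)
phase = conn.execute(
    "SELECT value FROM trial_fields WHERE trial_id = ? AND field = ?",
    ("EXAMPLE-0001", "Phase of Trial"),
).fetchone()[0]
```

Once the fields are stored one per row, questions such as "which trials list Phase 3?" become ordinary SQL queries rather than manual inspection of HTML pages.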
At ClinicalTrials.gov, for up to 10,000 trials, up to 25 fields of information could be downloaded into a single file at the click of a button. This file could be in any of the following formats: comma-separated values, tab-separated values, plain text, PDF, or XML.