Over the last 10 years, the term "big data" has become a keyword in many different contexts. Although the term is widespread, it is not always clear what it refers to. The most appropriate definition breaks it down into the so-called "4 Vs": Volume, Variety, Velocity and Veracity, i.e. a large amount of heterogeneous data that can be analysed rapidly and that systematically undergo quality checks.1
"Big data" is an increasingly cited term in health care, because of the potential use of the huge amount of data collected from electronic medical records or administrative sources (e.g. drug prescriptions, hospital discharge forms, healthcare services etc.)2 and also as a support for regulatory decisions, including pharmacovigilance. Since the 1990s, the use of health databases has become widespread, especially in Europe and America and more recently in Asia. These databases are used to evaluate pharmacological treatments in the post-marketing phase, in particular to analyse prescription appropriateness, comparative effectiveness and safety.3-5
Post-marketing safety analyses draw on a wide range of data sources, including the information systems for the spontaneous reporting of suspected adverse reactions.6 The Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS) is publicly accessible. In 2006, FAERS received almost half a million reports, reaching 1.2 million in 2014.7 Other reporting systems that collect information in their databases are VigiBase, the World Health Organisation (WHO) information system, which so far has received almost 15 million adverse reaction reports,8 and EudraVigilance, run by the European Medicines Agency (EMA), which has collected almost 11 million reports.9
Besides spontaneous reporting databases, other types of databases are used in pharmacovigilance, for example those that collect administrative data and electronic medical records (EMR). Administrative databases, such as those concerning services and products reimbursed by the National Health Service, are very common in Italy. Administrative data include drug prescriptions and hospital discharge letters, which record the diagnoses and medical procedures for every hospitalisation. By combining these data with the data collected by the National Health Service, it is possible to reconstruct the medical history of any patient. In many Italian regions, health administrative data are managed by the Local Health Authorities, which use them for pharmacovigilance activities and drug safety evaluations.10,11 EMR databases, on the other hand, contain patients' demographic and medical information, such as diagnoses and prescriptions, recorded by the general practitioner at every patient visit.
The simultaneous use of several health databases to evaluate the effectiveness and safety of drugs, including the integration of databases of different types (for example, administrative and EMR databases), has become increasingly frequent. Multi-database initiatives have tended to create large data infrastructures for the post-marketing evaluation of drug effectiveness and safety, but also to increase the power of studies on rare diseases that, because of the small number of cases, can hardly be studied in randomised clinical trials (RCTs).12
The increasing use of health database networks has allowed the development of new analytical methodologies for processing large volumes of heterogeneous data. For example, Observational Health Data Sciences and Informatics (OHDSI) has developed methods and tools for building database network infrastructures. Similarly, the PROTECT initiative (http://www.imi-protect.eu/) has shown that analyses can be conducted on more than one database by adopting common protocols instead of a centralised data analysis. To foster cooperative research through health database networks, EMA has for almost a decade coordinated the European Network of Centres for Pharmacoepidemiology and Pharmacovigilance (ENCePP), which includes around 200 public institutions and Contract Research Organisations (CROs) involved in pharmacoepidemiology and pharmacovigilance activities.13 As regards the database networks used for pharmacovigilance, two relevant initiatives have been implemented internationally: Sentinel and EU-ADR.14
Sentinel is a post-marketing surveillance system launched in 2008 by the FDA, based principally on the medical data of 193 million people in the United States.15 All data are collected and harmonised according to a common data model, which has made it possible to manage and process the data through a single coordinating centre while guaranteeing the privacy of each patient.
EU-ADR, on the other hand, is an initiative funded by the European Union in 2008 with the goal of identifying drug safety signals early by means of data-mining techniques applied to eight medical databases from four European countries (Denmark, Italy, the Netherlands and the United Kingdom),16 covering a total of around 30 million subjects.
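The common data model approach mentioned above for Sentinel can be illustrated with a minimal Python sketch. It is not the Sentinel or EU-ADR implementation: the record layout (`CommonRecord`), the two source layouts and every function name are hypothetical, and the idea that each site returns only aggregated counts to the coordinating centre is an assumption made here to show how harmonised data and patient privacy can coexist. The ATC code M01AB05 (diclofenac) and the ICD-10 prefix K92 are used purely as example codes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommonRecord:
    """One drug exposure and one diagnosis for a patient, in the shared layout."""
    patient_id: str
    atc_code: str
    icd_code: str


def from_admin_row(row: dict) -> CommonRecord:
    # Hypothetical administrative layout with keys "pid", "atc", "dx".
    return CommonRecord(row["pid"], row["atc"].upper(), row["dx"].upper())


def from_emr_row(row: dict) -> CommonRecord:
    # Hypothetical EMR layout with keys "patient", "drug_code", "diagnosis_code".
    return CommonRecord(row["patient"], row["drug_code"].upper(),
                        row["diagnosis_code"].upper())


def local_summary(records, atc_code, icd_prefix):
    """Run the shared query at one site and return aggregated counts only."""
    exposed = {r.patient_id for r in records if r.atc_code == atc_code}
    cases = {r.patient_id for r in records
             if r.atc_code == atc_code and r.icd_code.startswith(icd_prefix)}
    return {"exposed": len(exposed), "cases": len(cases)}


def pool(summaries):
    """Coordinating centre: add up the per-site counts (no patient-level data)."""
    return {"exposed": sum(s["exposed"] for s in summaries),
            "cases": sum(s["cases"] for s in summaries)}


# Toy data for two sites that store the same information differently.
admin_site = [from_admin_row(r) for r in [
    {"pid": "A1", "atc": "m01ab05", "dx": "K92.2"},
    {"pid": "A2", "atc": "M01AB05", "dx": "I10"},
]]
emr_site = [from_emr_row(r) for r in [
    {"patient": "B1", "drug_code": "M01AB05", "diagnosis_code": "k92.0"},
    {"patient": "B2", "drug_code": "C09AA02", "diagnosis_code": "K92.2"},
]]

pooled = pool([local_summary(site, "M01AB05", "K92")
               for site in (admin_site, emr_site)])
print(pooled)  # {'exposed': 3, 'cases': 2}
```

Because both sites are first mapped to the same layout, the same query runs unchanged at each of them, and only the two counts per site need to be shared.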
Other international networks have been developed in recent years to increase the power of post-marketing studies on medicines and vaccines, including ARITMO (http://www.aritmo-project.org), SAFEGUARD (http://www.safeguard-diabetes.org/), ADVANCE (http://www.advance-vaccines.eu/), SOS (https://www.sosnsaids-project.org/) and EUROmediCAT (http://www.euromedicat.eu/) in Europe, CNODES in Canada (https://www.cnodes.ca/) and the Asian Pharmacoepidemiology Network (AsPEN) in Asia and Australia (http://aspennet.asia/index.html).
In databases, most of the information is usually coded: for example, medical diagnoses are usually coded with the International Classification of Diseases (ICD) or the International Classification of Primary Care (ICPC), whilst information on drugs is often coded with the Anatomical Therapeutic Chemical (ATC) classification system. The data of interest can be selected by identifying the relevant codes, ideally in accordance with published validation studies on disease identification, to ensure the accuracy and reproducibility of the results. However, clinical information in these databases can also be entered as free text, for instance by specialists or general practitioners who record the details of their clinical practice as text notes in their electronic medical records. Artificial intelligence, and specifically "machine learning", could be used to analyse data collected as free text. This methodology can be supervised or unsupervised, and both approaches have been used in pharmacovigilance for the automatic identification of safety signals.
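The code-based selection of data described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the toy records are hypothetical, ICD-10 is assumed as the diagnosis coding version (the text does not specify one), and the code lists (ICD-10 E11 for type 2 diabetes mellitus, ATC A10 for drugs used in diabetes) are examples rather than a validated case definition.

```python
def matches_any(code: str, prefixes) -> bool:
    """True if the code starts with any of the prefixes (dots and case ignored)."""
    normalised = code.replace(".", "").upper()
    return any(normalised.startswith(p.replace(".", "").upper()) for p in prefixes)


def select_patients(diagnoses, prescriptions, icd_prefixes, atc_prefixes):
    """Patients with at least one matching diagnosis AND one matching prescription."""
    with_diagnosis = {pid for pid, code in diagnoses if matches_any(code, icd_prefixes)}
    with_drug = {pid for pid, code in prescriptions if matches_any(code, atc_prefixes)}
    return sorted(with_diagnosis & with_drug)


# Toy records as (patient_id, code) pairs; all values are invented.
diagnoses = [("p1", "E11.9"), ("p2", "I10"), ("p3", "e11.65")]
prescriptions = [("p1", "A10BA02"), ("p2", "C09AA02"),
                 ("p3", "C10AA05"), ("p3", "A10BB12")]

selected = select_patients(diagnoses, prescriptions,
                           icd_prefixes=["E11"], atc_prefixes=["A10"])
print(selected)  # ['p1', 'p3']
```

Matching on code prefixes follows the hierarchical structure of both classifications, so a single entry such as A10 covers every drug in that group; in a real study the lists would come from a validated algorithm.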
Unsupervised machine learning (UML) is an automatic learning approach that aims to identify links among the provided inputs, without labelling them as correct or incorrect. This approach has been used to identify drug safety signals and usage patterns.17,18 In supervised machine learning (SML), on the other hand, possible inputs are provided together with their desired outputs, with the goal of identifying patterns that associate the input with the output. An example of SML is the correct decoding of free text.19 Adverse reactions in databases are identified by a particular application of machine learning called Natural Language Processing (NLP). To identify potential adverse events, NLP has been applied not just to health databases but also to social media such as Twitter, which mostly contain strings of data in text format. In particular, free-text data from social network forums where users share their clinical experience have been aggregated with data from case reports describing potential adverse events.20 In health care, the importance of social media is acknowledged because of the huge number of users registered on social networks such as Facebook (almost 2 billion users) and Twitter (more than 300 million users).21
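As an illustration of the supervised approach described above, in which inputs are supplied together with their desired outputs, the following Python sketch trains a very small text classifier on labelled free-text notes. The choice of a multinomial naive Bayes model with add-one smoothing is an assumption made for brevity and is not the method used in the cited studies; the notes, the labels (`adverse_event`, `no_event`) and the function names are all invented.

```python
import math
from collections import Counter, defaultdict


def tokenize(text: str):
    return text.lower().split()


def train(examples):
    """Count words per label from (text, label) pairs: the supervised step."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return word_counts, label_counts, vocabulary


def predict(model, text: str) -> str:
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training notes with their desired outputs.
training_notes = [
    ("patient developed rash after starting amoxicillin", "adverse_event"),
    ("severe nausea and vomiting following the new drug", "adverse_event"),
    ("routine visit blood pressure well controlled", "no_event"),
    ("prescription renewed no complaints reported", "no_event"),
]
model = train(training_notes)

print(predict(model, "rash and nausea after the drug"))  # adverse_event
print(predict(model, "routine visit no complaints"))     # no_event
```

An unsupervised method would receive the same notes without the labels and would have to group them by similarity alone; real systems rely on far larger annotated corpora and richer language models than this toy example.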
Research in pharmacovigilance based on the analysis of social media data has attracted increasing interest. Although social media can play a role in the post-marketing evaluation of drug safety, to date these tools do not seem to specifically improve the identification of drug safety signals.22 With more and more devices collecting data, such as apps or electronic devices that monitor clinical or lifestyle parameters (known as wearables), it will be interesting to see whether and how rapidly these technologies will have an impact on pharmacovigilance.23-25
Despite the several advantages of using medical databases, some limitations must be taken into consideration. One of the main critical issues concerns data quality: if it is poor (for example because of erroneous data recording or a high frequency of missing data), the value of the results will be limited. It is therefore essential to understand thoroughly the limitations of the data sources employed and to choose the most appropriate study design for the analysis at hand, carefully evaluating whether the data source is fit to answer the clinical question correctly rather than adapting the clinical question at the core of the study to the data source.
In general, the availability of different data sources and the growing number of analytical tools represent an opportunity to carry out studies on the use and safety of drugs on an ever wider scale and in ever greater detail. However, it is important to remember that the results obtained from the analysis of big data from different sources must be supported by a robust and critical clinical interpretation. The process that leads to the identification of a drug safety signal cannot be completely automatic, and careful evaluation by experts in the field remains essential. The data in administrative databases, disease registers, general practitioners' EMRs and spontaneous adverse event reporting systems should not be evaluated separately, but should be considered parts of a wider context. In general, these data have very limited value in themselves, but when correctly analysed and interpreted they can be a very useful tool to support and guide regulatory decisions on drugs.26
- Yearb Med Inform 2014;9:14-20.
- Yearb Med Inform 2014;9:97-104.
- J Am Med Inform Assoc 2013;20:117-21.
- J Med Syst 2012;36:3029-49.
- JAMA 2011;305:400-1.
- Mann's Pharmacovigil 2014:331-54.
- https://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Survei... Accessed 30-01-2018
- https://www.who-umc.org/vigibase/vigibase/ Accessed 30-01-2018
- Expert Opin Drug Saf 2016;15(sup2):61-7.
- PLoS One 2013;8(12):e82990.
- Br J Clin Pharmacol 2015;80:304-14.
- Pharmacoepidemiol Drug Saf 2012;21:690-6.
- N Engl J Med 2009;361:645-7.
- https://www.fda.gov/Safety/FDAsSentinelInitiative/ucm149340.htm Accessed 30-01-2018
- Stud Health Technol Inform 2009;148:43-9.
- Clin Pharmacol Ther 2012;91:1010-21.
- BMC Bioinform 2011;12(Suppl 10):S11.
- J Biomed Inform 2015;56:356-68.
- J Biomed Inform 2015;53:196-207.
- Drug Saf 2017;40:317-31.
- PLoS Med 2016;13(2):e1001953.
- Int J Commun Syst 2012;25(9):1101.
- Proc IEEE 2010;98(11):1947-60.
Janet Sultana1,2, Valentine Ientile3, Gianluca Trifirò1,2
1 Dentistry and functional imaging biomedical sciences Department, University of Messina
2 Medical Informatics Department, Erasmus Medical Centre, Rotterdam, the Netherlands
3 Clinical pharmacology Unit, University Hospital 'G.Martino', Messina