J Korean Med Sci. 2024 Mar;39(9):e92. doi: 10.3346/jkms.2024.39.e92.

Dark Data in Real-World Evidence: Challenges, Implications, and the Imperative of Data Literacy in Medical Research

Affiliations
  • 1. Department of Medical Informatics, College of Medicine, The Catholic University of Korea, Seoul, Korea
  • 2. Division of Endocrinology and Metabolism, Department of Internal Medicine, Seoul St. Mary’s Hospital, College of Medicine, The Catholic University of Korea, Seoul, Korea

Abstract

Randomized controlled trials (RCTs) and real-world evidence (RWE) studies are crucial and complementary in generating clinical evidence. RCTs provide controlled settings to validate the clinical effect of specific drugs or medical devices, whereas RWE incorporates extrinsic factors, that is, external influences that affect real-world scenarios, and can thus challenge RCT results in practical applications. In this study, we explore the impact of extrinsic factors on RWE outcomes, focusing on “dark data,” which refers to data that are collected but not used, or that are excluded from the analyses. Dark data can arise in many ways during the research process, from selecting study samples to data collection and analysis. However, even unused or unanalyzed dark data hold potential insights, providing a comprehensive view of clinical contexts. Extrinsic factors can cause RWE outcomes to diverge from RCT results in ways that are beyond the scope of statistical correction. Two main types of dark data exist: “known-unknown” and “unknown-unknown.” The distinction between these dark data types highlights RWE’s complexity. The transformation of unknown into known depends on data literacy, the capability to use data effectively and interpret them on the basis of medical expertise. Shifting the focus to excluded subjects or unused data in real-world contexts reveals unexplored potential. Understanding the significance of dark data is vital in reflecting the complexity of clinical settings. Connecting RCTs and RWE requires medical data literacy, enabling clinicians to decipher meaningful insights. In the big data and artificial intelligence era, medical staff must navigate data complexities while promoting the core role of medicine. Prepared clinicians will lead this transformative journey, ensuring data value shapes the medical landscape.

Keywords

Big Data; Dark Data; Data Literacy; Real-World Data; Real-World Evidence; Randomized Controlled Trials

Figures

  • Fig. 1 Contrasting study designs of RWE and RCT. The significance of inclusion criteria is widely acknowledged, whereas RWE places greater emphasis on exclusion criteria because of the pivotal role of the circumstances and reasons for exclusion. In RCTs, research is conducted on groups selected through inclusion criteria; the study population is therefore homogeneous. In RWE, however, the patients who fall outside the inclusion group are also important. The study population is therefore mainly heterogeneous and actively reflects actual clinical practice. RWE = real-world evidence, RCT = randomized controlled trial.

  • Fig. 2 Various research designs in clinical fields: RCT and RWE. (A) Exemplary RCT design. This conventional design involves a homogeneous sample, and the findings are extrapolated to represent the entire population. (B) Archetypal RWE design. Although the sample is heterogeneous, the similarity in sample size to the total population facilitates a representative reflection of the larger population. (C) Inadequate RWE design. Characterized by a small, heterogeneous sample, this design combines the limitations of both RCT and RWE. Such situations often arise owing to rigorous research design, leading to substantial data exclusion. (D) Ideal RWE design. This design aims to minimize exclusions and employs subgroup analysis on heterogeneous samples. It strives for personalized treatment tailored to individual patients. RCT = randomized controlled trial, RWE = real-world evidence.

  • Fig. 3 Illustrations of “known-unknown” dark data. (A) Examples of the large volume of data excluded by the clinical study design. Often, a mere 10% of the extracted data is used in research. (B) Various examples of data excluded by exclusion criteria. Exclusions based on diagnosis and on discrepancies in operational definitions encompass a range of scenarios. The accuracy of data extraction becomes questionable, potentially impacting research outcomes. Notably, patients excluded owing to alternative drug use or impaired liver/kidney function frequently use the study drugs in real-world practice. Unexpectedly terminating or altering medication constitutes a shift in the patient’s condition, which could align with the study’s objective.

  • Fig. 4 Perspectives on data, information, knowledge, and wisdom/theory (reference 33).


References

1. Sackett DL, Straus SE, Richardson WS, Rosenberg W, Haynes RB. Evidence-Based Medicine: How to Practice and Teach EBM. 2nd ed. Edinburgh, UK: Churchill Livingstone;2000. p. 173–177.
2. Stanley K. Design of randomized controlled trials. Circulation. 2007; 115(9):1164–1169. PMID: 17339574.
3. Sherman RE, Anderson SA, Dal Pan GJ, Gray GW, Gross T, Hunter NL, et al. Real-world evidence - what is it and what can it tell us? N Engl J Med. 2016; 375(23):2293–2297. PMID: 27959688.
4. Kim HS, Kim JH. Proceed with caution when using real world data and real world evidence. J Korean Med Sci. 2019; 34(4):e28. PMID: 30686950.
5. Klonoff DC. The expanding role of real-world evidence trials in health care decision making. J Diabetes Sci Technol. 2020; 14(1):174–179. PMID: 30841738.
6. Jager KJ, Zoccali C, Macleod A, Dekker FW. Confounding: what it is and how to deal with it. Kidney Int. 2008; 73(3):256–260. PMID: 17978811.
7. Grimes DA, Schulz KF. Bias and causal associations in observational research. Lancet. 2002; 359(9302):248–252. PMID: 11812579.
8. Edelman SV, Polonsky WH. Type 2 diabetes in the real world: the elusive nature of glycemic control. Diabetes Care. 2017; 40(11):1425–1432. PMID: 28801473.
9. Brod M, Rana A, Barnett AH. Adherence patterns in patients with type 2 diabetes on basal insulin analogues: missed, mistimed and reduced doses. Curr Med Res Opin. 2012; 28(12):1933–1946. PMID: 23150949.
10. Collins R, Bowman L, Landray M, Peto R. The magic of randomization versus the myth of real-world evidence. N Engl J Med. 2020; 382(7):674–678. PMID: 32053307.
11. Andrade S. Compliance in the real world. Value Health. 1998; 1(3):171–173. PMID: 16674348.
12. Ard J, Cannon A, Lewis CE, Lofton H, Vang Skjøth T, Stevenin B, et al. Efficacy and safety of liraglutide 3.0 mg for weight management are similar across races: subgroup analysis across the SCALE and phase II randomized trials. Diabetes Obes Metab. 2016; 18(4):430–435. PMID: 26744025.
13. Park JH, Kim JY, Choi JH, Park HS, Shin HY, Lee JM, et al. Effectiveness of liraglutide 3 mg for the treatment of obesity in a real-world setting without intensive lifestyle intervention. Int J Obes. 2021; 45(4):776–786.
14. Zabor EC, Kaizer AM, Hobbs BP. Randomized controlled trials. Chest. 2020; 158(1S):S79–S87. PMID: 32658656.
15. Benson K, Hartz AJ. A comparison of observational studies and randomized, controlled trials. N Engl J Med. 2000; 342(25):1878–1886. PMID: 10861324.
16. Kim HS, Lee S, Kim JH. Real-world evidence versus randomized controlled trial: clinical research based on electronic medical records. J Korean Med Sci. 2018; 33(34):e213. PMID: 30127705.
17. Kim HS, Kim DJ, Yoon KH. Medical big data is not yet available: why we need realism rather than exaggeration. Endocrinol Metab (Seoul). 2019; 34(4):349–354. PMID: 31884734.
18. Harrington L. New data of the digital age: big, dark, and deep. AACN Adv Crit Care. 2017; 28(3):239–242. PMID: 28847857.
19. Hand DJ. Dark Data: Why What You Don’t Know Matters. Princeton, NJ, USA: Princeton University Press;2020.
20. Zhang C, Shin J, Ré C, Cafarella M, Niu F. Extracting databases from dark data with deepdive. Proc ACM SIGMOD Int Conf Manag Data. 2016; 2016:847–859. PMID: 28316365.
21. Suto Y. Unknowns and unknown unknowns: from dark sky to dark matter and dark energy. In: Stepp LM, Gilmozzi R, Hall HJ, editors. Proceedings of the SPIE Astronomical Telescopes + Instrumentation, Ground-based and Airborne Telescopes III, Vol 7733; 2010 June 27-July 2; San Diego, CA, USA. Bellingham, WA, USA: SPIE;2010. p. 1–11.
22. Faurholt-Jepsen M, Busk J, Frost M, Vinberg M, Christensen EM, Winther O, et al. Voice analysis as an objective state marker in bipolar disorder. Transl Psychiatry. 2016; 6(7):e856. PMID: 27434490.
23. Truesdell AG, Sauer AJ, Alasnag M. Known knowns, known unknowns, and unknown unknowns. Cardiovasc Revasc Med. 2020; 21(12):1472–1473. PMID: 32988744.
24. Koltay T. Data governance, data literacy and the management of data quality. IFLA J. 2016; 42(4):303–312.
25. Shin SI, Kwon MM. Dark data: why what you don’t know matters. J Inf Technol Case Appl Res. 2023; 25(2):112–118.
26. Perini DJ, Batarseh FA, Tolman A, Anuga A, Nguyen MA. Bringing dark data to light with AI for evidence-based policymaking. In: Batarseh FA, Freeman LJ, editors. AI Assurance: Towards Trustworthy, Explainable, Safe, and Ethical AI. 1st ed. Cambridge, MA, USA: Academic Press;2022. p. 531–557.
27. Qiu S, Liu Q, Zhou S, Wu C. Review of artificial intelligence adversarial attack and defense technologies. Appl Sci. 2019; 9(5):909.
28. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. In: Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2015 June 7-12; Boston, MA, USA. New York, NY, USA: IEEE;2015. p. 1–9.
29. Zednik C. Solving the black box problem: a normative framework for explainable artificial intelligence. Philos Technol. 2021; 34(2):265–288.
30. Koltay T. Data literacy for researchers and data librarians. J Librarian Inform Sci. 2017; 49(1):3–14.
31. Lee JA. Data, information, and knowledge. Lancet Oncol. 2002; 3(6):384. PMID: 12107028.
32. Georgiou A. Data information and knowledge: the health informatics model and its role in evidence-based medicine. J Eval Clin Pract. 2002; 8(2):127–130. PMID: 12180361.
33. Hänsel K, Dudgeon SN, Cheung KH, Durant TJS, Schulz WL. From data to wisdom: biomedical knowledge graphs for real-world data insights. J Med Syst. 2023; 47(1):65. PMID: 37195430.
Copyright © 2024 by Korean Association of Medical Journal Editors. All rights reserved. E-mail: koreamed@kamje.or.kr