Advancing the democratization of generative artificial intelligence in healthcare: a narrative review

Anjun Chen; Lei Liu; Tongyu Zhu

doi:10.21037/jhmhp-24-54

Review Article

Anjun Chen¹,², Lei Liu², Tongyu Zhu²

¹Health System Sciences, ELHS Institute, Palo Alto, CA, USA; ²Healthcare AI Institute, Fudan University Medical School, Shanghai, China

Contributions: (I) Conception and design: A Chen; (II) Administrative support: All authors; (III) Provision of study materials or patients: All authors; (IV) Collection and assembly of data: A Chen; (V) Data analysis and interpretation: A Chen; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Anjun Chen, PhD. PI, Health System Sciences, ELHS Institute, 748 Matadero Ave, Palo Alto, CA 94306, USA; Healthcare AI Institute, Fudan University Medical School, Shanghai, China. Email: aj@elhsi.org.

Background and Objective: The emergence of ChatGPT-like generative artificial intelligence (GenAI) has dramatically transformed the healthcare landscape, bringing new hope for the democratization of artificial intelligence (AI) in healthcare—a topic that has not been comprehensively reviewed. This review aims to analyze the reasons propelling the democratization of healthcare GenAI, outline the initial evidence in the literature, and propose future directions to advance GenAI democratization.

Methods: We conducted a deep literature search for GenAI studies using Google Scholar, PubMed, ChatGPT, Journal of the American Medical Association (JAMA), Nature, Springer Link, and Journal of Medical Internet Research (JMIR). We performed an abstraction analysis on the nature of GenAI versus traditional AI and the applications of GenAI in medical education and clinical care.

Key Content and Findings: (I) A detailed comparison of traditional AI and GenAI in healthcare reveals that large language model (LLM)-based GenAI’s unprecedented general-purpose capabilities and natural language interaction ability, coupled with its free public availability, make GenAI ideal for democratization in healthcare. (II) We have identified ample initial evidence for GenAI democratization in medical education and clinical care, marking the start of an emerging trend of GenAI democratization in a host of impactful applications. Applications in medical education include medical exam preparation, medical teaching and training, and simulation. Applications in clinical care include diagnosis assistance, disease risk prediction, new generalist chatbots, treatment decision support, surgery support, medical image analysis, patient communication, physician communication, documentation automation, clinical trial automation, informatics tasks automation, and specialized or custom LLMs. (III) Responsible AI is essential for the future of healthcare GenAI. National initiatives and regulatory efforts are working to ensure that safety, efficacy, accountability, equity, security, and privacy are built into healthcare GenAI. Responsible GenAI requires a human-machine collaboration approach, where AI augments human expertise rather than replaces it.

Conclusions: The democratization of GenAI in healthcare has just begun, driven by the nature of GenAI and guided by the principle of human-machine collaboration. To further advance GenAI democratization, we propose three key future directions: integrating GenAI in medical education curricula, democratizing GenAI clinical evaluation research, and building learning health systems (LHS) with GenAI for system-level enforcement of democratization. Democratizing GenAI in healthcare will revolutionize medicine and significantly impact care delivery and health policies.

Keywords: Generative AI (GenAI); ChatGPT; AI democratization; healthcare; learning health system (LHS)

Received: 21 March 2024; Accepted: 07 June 2024; Published online: 27 June 2024.

Introduction

In 2019, the U.S. National Academy of Medicine (NAM) issued a landmark report highlighting the role of artificial intelligence (AI) and machine learning (ML) in healthcare, catalyzing discussions on the democratization of healthcare AI technologies (1). By 2023, under the leadership of NAM President Dr. Dzau, a critical evaluation of AI’s progress exposed substantial challenges in equity, oversight, and regulation, emphasizing the need for concerted efforts to establish a robust foundation for AI to effectively address complex global health issues (2).

Against this backdrop, the emergence of generative AI (GenAI) technologies based on large language models (LLMs), exemplified by chatbots that can perform tasks ranging from creative writing to disease prediction, marked a pivotal shift, bringing AI capabilities directly to the public through platforms such as OpenAI’s ChatGPT. GenAI refers to a class of AI techniques designed to generate new content that is similar to, but distinct from, the data on which they are trained. In one common form, anyone can simply talk to a GenAI chatbot like ChatGPT in human language, and the chatbot will respond in fluent human language. Surprisingly, a GenAI chatbot’s responses can exhibit clear, human-like logic and reasoning, demonstrating an unprecedented level of AI capability.

Furthermore, general-purpose LLMs trained on vast amounts of Internet data can provide correct answers to many healthcare questions, indicating that they may be primed for evaluation in healthcare applications. To encourage healthcare GenAI research, the Journal of the American Medical Association (JAMA) has called for AI papers on a wide spectrum of topics (3). Future rigorous original research will examine whether the promise of optimizing healthcare with AI can be delivered to improve patient outcomes, clinician experience, medical education, and health systems. Healthcare AI democratization will mean that everyone, whether patient or provider, has the opportunity to use GenAI and benefit from AI in every aspect and step of healthcare, increasing diagnostic accuracy and treatment efficacy and reducing health disparities at scale globally.

This literature review aims to examine the reasons propelling GenAI democratization in healthcare and to distill evidence from original research highlighting the initial stages of GenAI democratization across medical education, clinical research, and healthcare delivery, thereby contributing to a comprehensive understanding of its impacts and future potential. We present this article in accordance with the Narrative Review reporting checklist (available at https://jhmhp.amegroups.com/article/view/10.21037/jhmhp-24-54/rc).

Methods

Literature search

Our review process involved a deep literature search and reference traversal for peer-reviewed publications across various databases and publisher platforms, including Google Scholar, PubMed, ChatGPT, JAMA, Nature, Springer Link, and Journal of Medical Internet Research (JMIR). Utilizing a comprehensive set of search terms and their variants (see Table 1), we aimed to capture the breadth of GenAI applications within the healthcare domain. The search was focused primarily on studies exploring the utilization of leading LLM models and publicly accessible generative pre-trained transformer (GPT) chatbots, such as OpenAI’s ChatGPT and Google Gemini, in healthcare.

Table 1

Search strategy summary

Items	Specification
Date of search	3/1/2024–3/15/2024
Databases and other sources searched	Google Scholar, PubMed, ChatGPT, JAMA, Nature, Springer Link, and JMIR
Search terms	Democratization of generative AI in healthcare or medicine
	Democratization of ChatGPT in healthcare or medicine
	Generative AI or ChatGPT in healthcare or medicine
	Generative AI or ChatGPT and medical education
	Generative AI or ChatGPT and diagnosis, risk prediction, primary care, treatment, surgery, imaging, patient question, professional question, automation, documentation, ethics, equity
Timeframe	Up to March 2024
Inclusion criteria	Peer-reviewed articles primarily; preprint articles only if no similar published articles were available; NAM reports. English only
Selection process	The corresponding author conducted the literature search and selection
Additional considerations	Preference given to papers from top journals and studies from top universities

ChatGPT, Chat Generative Pre-trained Transformer; JAMA, Journal of the American Medical Association; JMIR, Journal of Medical Internet Research; AI, artificial intelligence; NAM, National Academy of Medicine.

Abstraction analysis

GenAI was first abstractly compared to traditional AI in the context of healthcare. Comparing every key aspect of AI in detail was intended to reveal the potential for democratization of these two distinct AI technologies. To synthesize the expansive array of healthcare tasks applicable for GenAI, we reviewed each original GenAI study in the selected papers, abstracted the GenAI tasks, and dissected the underlying reasons for their democratization potential. From this analysis, we organized the collected studies and AI tasks under the abstracted application concepts, presenting initial evidence of GenAI’s democratizing effects in two critical domains, namely medical education and clinical care. The results were expected to shed light on any emerging trend of democratizing access to GenAI tools for enhancing healthcare education and delivery.

Simple definitions of key technical terms used in this review

Neural networks: neural networks are computational models inspired by the structure and function of the human brain. They are designed to recognize patterns by processing data through layers of interconnected nodes or neurons. These networks perform tasks by learning from examples, generally without being programmed with task-specific rules.
Deep learning: deep learning is a subset of ML that employs neural networks with multiple layers (deep neural networks) to analyze various levels of abstract features in data. This technique automates learning from experience without the need for human-defined rules.
Transformer models: transformer models are a type of deep learning architecture that utilizes mechanisms called attention and self-attention to weigh the influence of different parts of the input data. This enables them to process sequences of data in parallel and capture complex dependencies.
LLM: an LLM is a type of deep learning program developed to understand and generate human-like text by training on vast amounts of textual data. It performs tasks such as translation, summarization, and question answering.
GenAI agent: a GenAI agent refers to a specialized AI system designed to generate content or perform tasks by simulating human-like abilities in creative or analytical processes. It integrates generative models like language models or image generators to autonomously produce outputs such as text, images, or even actions in response to specific tasks or prompts.
Retrieval-augmented generation (RAG): RAG is an ML method that enhances the generative capabilities of a model by dynamically retrieving relevant information from a large dataset during the generation process. This approach combines the strengths of neural network-based generation with information retrieval to improve the accuracy and richness of the generated content.
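
To make the RAG definition concrete, the following minimal Python sketch shows the two steps it describes: retrieving the most relevant text for a question and placing that text in the prompt given to a generative model. It is an illustration only and is not taken from any study reviewed here. The document snippets are placeholder text rather than clinical guidance, the function names (`tokenize`, `retrieve`, `build_prompt`) are hypothetical, and simple word-overlap cosine similarity stands in for the embedding-based retrieval that production systems typically use. The call to an actual LLM is deliberately left out.

```python
import math
import re
from collections import Counter

# Placeholder snippets for illustration only; NOT clinical guidance.
# A real system would index vetted guideline documents instead.
DOCUMENTS = [
    "Placeholder A: blood pressure is measured on more than one visit.",
    "Placeholder B: a blood test is used when diabetes is screened in adults.",
    "Placeholder C: inhalers are discussed during asthma follow-up.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, top_k=1):
    """Return the top_k documents most similar to the query."""
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine_similarity(query_vec, Counter(tokenize(doc))), doc)
        for doc in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_prompt(query, documents, top_k=1):
    """Combine retrieved context and the question into one prompt string."""
    context = "\n".join(retrieve(query, documents, top_k))
    return (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer using only the context above."
    )


if __name__ == "__main__":
    # The resulting prompt would be sent to an LLM of the user's choice.
    print(build_prompt("How is diabetes screened?", DOCUMENTS))
```

The design point is that the model answers from retrieved, checkable text rather than from its training data alone, which is why RAG is often proposed for delivering guideline content.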

Results

Challenges of democratization in healthcare AI

The democratization of AI in healthcare aims to make advanced AI tools broadly accessible to a wide range of healthcare stakeholders, including patients, professionals, researchers, and organizations. This movement seeks to empower the healthcare community to utilize AI for enhancing medical education, clinical research, patient care, and public health outcomes, thereby promoting effective, ethical, equitable, and personalized healthcare solutions. Dr. Rajkomar and colleagues from Harvard Medical School have underscored the democratization potential for healthcare AI, envisioning a future where ML will analyze all relevant data to optimize patient care globally (4). Yet, traditional healthcare AI (THAI) encounters substantial barriers to widespread adoption. Despite abundant research, AI’s impact on clinical practice has been minimal compared to its influence in other sectors (5).

The democratization of THAI faces the following key obstacles:

Frequent deployment failure: THAI models, trained on limited datasets, often lack generalizability, leading to successful deployment in some healthcare settings but failure in others (6,7).
Model inefficiency: THAI typically develops one model for a specific task or a small set of tasks, necessitating numerous models for various healthcare tasks (8).
Technical complexity: THAI involves complex algorithms and requires deep AI expertise, making it challenging for non-experts to develop or apply AI solutions (8).
Lack of training data access: due to significant privacy and security concerns, access to large patient datasets for ML training is very limited (9).
High computational costs: THAI models, particularly deep learning networks, require significant computational resources, often making them cost-prohibitive (10).
Limited accessibility and usability: tools based on THAI models have been traditionally designed for expert use, demanding structured patient input data and presenting steep learning curves, thus limiting broader access and use (11).

GenAI: new hope for AI democratization in healthcare

In contrast, GenAI, exemplified by GPT models, offers a promising path to overcoming these barriers. The release of ChatGPT-4 in early 2023 highlighted GenAI’s key advantages, including general-purpose utility, ease of use, cost-effectiveness, and widespread availability, marking a significant step towards democratizing AI. Comparing the democratization potential of generative healthcare AI (GHAI) versus THAI in detail (Table 2), we posit that GHAI is, in theory, more likely than previous technologies to reach the democratization stage in healthcare. Echoing this, Stanford University researchers argued that traditional AI and GenAI differ in their capabilities and current deployment modes, which has implications for the effective adoption of GenAI in health systems (12).

Table 2

Comparison of traditional and generative healthcare AI for considerations of democratization potential

Democratization consideration*	Traditional healthcare AI	Generative healthcare AI
Purpose	Designed for specific tasks with limited application scopes. A large number of AI models are required to cover numerous healthcare tasks	General-purpose for a wide range of tasks in clinical care, public health, and medical education. A smaller number of models are enough for numerous healthcare tasks
Technology	Highly technical and complex to build AI models, often for AI experts	Using available general GPT models, accessible to most developers
Data	Requiring structured patient datasets for training, complex and resource-intensive, with safety and privacy concerns. Available to only a select few researchers	No need for training data in most cases, requiring small patient datasets only if models require fine-tuning
Computation	High computation costs	No upfront cost, low model usage costs
Accessibility	Often within guarded walls	Publicly accessible
Usability	Usually requires unavailable structured data for input, difficult to use	Using any human natural languages in AI chatbots, easy to use
Deployment	Models often lack generalizability, performing well in some healthcare settings but failing in others	Expecting similar performance by the same general GPT models across various healthcare settings
Ethics	Requiring regulations	Requiring stronger regulations to control GenAI’s creative power
Equity	Potentially worsen disparities in low-resource healthcare settings	Improve access to AI in low-resource settings and promote equity

*, generative healthcare AI overcomes many challenges of traditional healthcare AI and is thus more feasible to democratize. AI, artificial intelligence; GPT, generative pre-trained transformer; GenAI, generative artificial intelligence.

We believe the following reasons are key to GHAI democratization:

General purpose: GHAI models are versatile LLMs not tailored to specific tasks but capable of performing a wide range of healthcare tasks, benefiting all healthcare stakeholders.
Ease of use: GHAI tools, like chatbots, can interact in human natural language, simplifying their integration into healthcare delivery processes.
Free public availability: high-performing GHAI chatbots are accessible online to anyone for free or at a low cost, encouraging widespread experimentation with AI, a key aspect of technology democratization.

Reviewing selected literature on GenAI in healthcare reveals a trend towards democratization. Many studies have shown initial evidence for an unexpected milestone: general-purpose models, such as OpenAI’s ChatGPT, can achieve predictive outcomes comparable to or better than healthcare professionals without needing retraining on costly health-specific datasets. Given that this milestone arrived earlier than expected, we propose a simple strategy to democratize GenAI in healthcare: leveraging general GPT/LLM models for most predictable healthcare tasks, supplemented by specialized models for specific challenges.

To illustrate how clinical AI research will differ and improve by using GenAI instead of traditional AI, consider an important use case: developing and implementing a diagnostic monitoring process across all diseases in a large hospital, as recommended by the 2015 report from the National Academies, “Improving Diagnosis in Health Care” (13). If there are 1,000 different diseases, the hospital would need to develop 1,000 different ML models for diagnosis using the traditional AI approach, which is usually cost-prohibitive and contributes to the slow adoption of recommendations for enhancing diagnostic decision-making and detecting errors. With the GenAI approach, however, only one automated process needs to be developed by combining several top general-purpose LLMs, including ChatGPT, Gemini, Llama, and Claude. In this illustration, one shared process replaces 1,000 separate model-development efforts, making a task traditionally seen as infeasible roughly 1,000 times more efficient to build and setting a technological foundation for every patient and provider to use GenAI in future healthcare journeys.
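
The idea of combining several general-purpose LLMs into one automated process can be sketched as a simple agreement check. The Python example below is one possible design under stated assumptions, not a method described in the reviewed literature. Each model is represented by a hypothetical callable that takes a case description and returns a diagnosis string; the stub functions stand in for real chatbot calls, and the `min_agreement` threshold and exact-match comparison of normalized labels are illustrative choices. Cases without sufficient agreement are flagged for clinician review, in keeping with the human-machine collaboration principle.

```python
from collections import Counter


def normalize(label):
    """Lowercase a diagnosis label and collapse surrounding/extra whitespace."""
    return " ".join(label.lower().split())


def consensus_diagnosis(case_text, models, min_agreement=2):
    """Ask every model for a diagnosis and report whether enough of them agree.

    models: dict mapping a model name to a callable(case_text) -> str.
    Returns the individual answers, the consensus label (or None), and a
    flag telling a clinician that the case needs manual review.
    """
    if not models:
        raise ValueError("at least one model is required")
    answers = {name: normalize(ask(case_text)) for name, ask in models.items()}
    votes = Counter(answers.values())
    top_label, top_count = votes.most_common(1)[0]
    needs_review = top_count < min_agreement
    return {
        "answers": answers,
        "consensus": None if needs_review else top_label,
        "needs_review": needs_review,
    }


if __name__ == "__main__":
    # Hypothetical stubs; real code would call each vendor's chatbot here.
    stub_models = {
        "model_a": lambda case: "Pneumonia",
        "model_b": lambda case: "pneumonia",
        "model_c": lambda case: "Bronchitis",
    }
    result = consensus_diagnosis("cough and fever for three days", stub_models)
    print(result["consensus"], result["needs_review"])
```

Because the same function serves any disease, the hospital maintains one process rather than one model per disease; the cost moves from model development to evaluating how reliable the agreement signal is in each clinical setting.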

JAMA publications recognize that GenAI could deliver significant healthcare improvements more swiftly than earlier technologies (14,15). The 2024 annual AI reflection survey by the journal Nature Machine Intelligence highlights ongoing developments in LLMs and GenAI (16). The subsequent sections review original studies evaluating general LLM applications in medical education and clinical care, followed by studies on specialized LLM models. Selected GenAI applications and the corresponding initial evidence for GenAI democratization are listed in Table 3.

Table 3

Selected initial evidence for democratization of generative AI in healthcare: educational and clinical applications

Domain	Application	Initial evidence for GenAI democratization
Medical education	Medical exam preparation	ChatGPT passed the USMLE
		ChatGPT scored high in specialty exams, symptom checking, and diagnosis
		Self-directed learning with GenAI
	Medical teaching and training	GenAI can be integrated into medical school curricula
	Medical teaching and training	ChatGPT served as a copilot, preparing LHS training content
	Simulation	ChatGPT simulated a patient in history-taking practice
	Simulation	GenAI was used in surgical procedure simulations
Clinical care	Diagnosis assistance	ChatGPT outperformed traditional AI chatbots in diagnosing patient cases
		Top general LLM models scored high in theoretical diagnosis benchmarking in different languages
		ChatGPT followed clinicians’ clinical reasoning processes in making diagnoses
		ChatGPT analyzed complex patient cases, providing useful diagnostic considerations to doctors
	Disease risk prediction	ChatGPT outperformed traditional chatbots in symptom checking benchmarking
		ChatGPT made comparable CVD risk predictions from biobank data
		ChatGPT predicted in-hospital risks and generated alerts
	New generalist chatbots	A new idea of foundation models for generalists was proposed
	New generalist chatbots	ChatGPT outperformed family medicine residents in a residency test
	Treatment decision support	ChatGPT provided useful treatment information for prostate cancer
	Treatment decision support	ChatGPT was used as a decision support tool in a breast cancer board
	Surgery support	ChatGPT generated postoperative instructions tailored to the patient’s health literacy levels
	Medical image analysis	GenAI produced comparable reports from radiology images in the ER
	Medical image analysis	ChatGPT made diagnoses from imaging data
	Patient communication	ChatGPT answered patient questions, outperforming traditional chatbots
	Patient communication	ChatGPT answered questions well in eye care and skin care
	Physician communication	ChatGPT answered physician questions effectively
	Physician communication	Prompt engineering improved answer accuracy
	Documentation automation	ChatGPT generated patient emails, clinical notes, visit summaries, consent forms
	Clinical trial automation	ChatGPT created ML models from clinical trial data
	Informatics tasks automation	ChatGPT extracted medical terms from clinical texts
	Informatics tasks automation	RAG approach to deliver medical guidelines
	Specialized LLMs	Google Med-PaLM passed the USMLE
		Google LLM optimized for differential diagnosis outperformed clinicians
		GatorTron LLM was built from EHR data
	Custom LLMs	The open-source LLaMA-7B model was fine-tuned for DRG prediction
		The custom SkinGPT-4 model predicted skin diseases
		Generate synthetic data to reduce training data bias
	GenAI agents	Google LLM-based agent conversed autonomously with patients
	GenAI agents	Multi-agent collaboration to solve clinical tasks

AI, artificial intelligence; USMLE, US Medical Licensing Exam; GenAI, generative artificial intelligence; LHS, learning health systems; LLM, large language model; CVD, cardiovascular diseases; ER, emergency room; ML, machine learning; RAG, retrieval-augmented generation; EHR, electronic health record; DRG, diagnosis-related group.

Democratization of general LLM models in medical education

GenAI is expected to drastically transform medical education in both learning and teaching. A report from the National Academies of Sciences, Engineering, and Medicine recognizes that AI currently has a minimal presence in medical education curricula and anticipates that AI will be firmly integrated into curricula in the coming years (17). Echoing this, Dean Minor from Stanford University Medical School predicts that GenAI will radically affect how we educate physicians, enabling doctors to focus less on memorization and more on understanding and deploying AI models.

GenAI impact in medical exam and clinical learning

Dr. Chang, Dean for Medical Education at Harvard Medical School, believes future physicians will be AI-enabled (18). With GenAI, medical students can progress more quickly from the basics to advanced levels of reasoning and communication, supported by AI for fundamental decision analysis and communication. GenAI may enable physicians to return to being more humanistic, spending more time with patients.

ChatGPT has passed the US Medical Licensing Exam (USMLE) (19,20). One study compared the performance of first- and second-year Stanford School of Medicine students versus GenAI models in clinical reasoning final examinations, and the results showed that the best GenAI model on average scored higher than the students (21). These unprecedented AI achievements suggest that GenAI chatbots can serve as a “study buddy” from which medical students may learn during medical exam preparation. ChatGPT also performed well in answering exam questions for various specialties such as ophthalmology (22) and neurology (23). Similarly, a study showed that the ChatGPT-4 model outperformed human physicians in diagnosing a collection of complex diagnostic cases (24). Using the Mayo Clinic Symptom Checker as a benchmark, a comprehensive benchmarking study showed that ChatGPT-4 achieved high accuracy in predicting disease causes from patient symptoms (25). All this initial evidence supports the use of GenAI tools as supplementary aids for medical students in clinical case learning (26). A literature review by Stanford University physicians revealed diverse potential applications for GenAI in medical education, including self-directed learning with GenAI as a new competency in “AI literacy” (27).

GenAI impact in medical teaching and training

University of California, San Francisco (UCSF) educators have recommended integrating GenAI for admissions, learning, assessment, and medical education research to help medical educators navigate and start planning for the new AI environment (28). A responding letter added that, because GenAI holds the potential to democratize AI research for all students and residents, medical schools should train every student and clinician to use new GenAI tools for evaluating the impact of GenAI on their specific health care tasks (26). A medical educator can use an LLM to create a wide array of simulated patient scenarios, which can be highly realistic and varied, enabling students to gain exposure to a broad range of medical conditions and patient interactions (29). The multilingual nature of LLMs enables advanced machine translation, fostering global collaboration and knowledge exchange. ChatGPT was used to prepare content for learning health system (LHS) training, exemplifying the potential of making full medical knowledge accessible to populations speaking different languages (30,31). Medical schools are evolving curricula to reflect the new GenAI era, moving students more quickly towards higher levels of cognitive analysis, understanding of individual patient nuances, and compassionate and culturally competent communication, while returning to the primacy of the physical exam (18).

GenAI usage in simulation for medical training

GenAI has been evaluated in patient simulation and surgery simulation for clinical training. For instance, a medical student could interact with a simulated patient with a rare disease, ask questions, and receive responses that mimic real patient responses, allowing the student to practice clinical reasoning skills in a safe and controlled environment. Using ChatGPT for practicing history taking, a feasibility study showed that the GenAI chatbot can provide a simulated patient experience and yield a good user experience with a majority of plausible answers (32). A case study showed that ChatGPT could simulate the complete medical record interrogation mode without disconnection (33). Extending AI-based surgical procedure simulation, GenAI is being tested for surgical planning, answering surgical questions, real-time decision-making, and providing feedback on proper technique in an interactive surgery environment (34).

Democratization of general LLM models in clinical care

General LLMs, as enablers and assistants, have vast potential to expand the healthcare services that physicians can provide to patients (35). Various applications of GenAI chatbots, like ChatGPT, without specific training with health data, have demonstrated GenAI’s promising impact in nearly every aspect of clinical care delivery, including diagnosis, treatment, screening, management, communication, and the doctor-patient relationship (36). GenAI chatbots are expected to be increasingly used by both medical professionals and patients (37). Before GenAI is applied in clinical practice, the general models and tools must undergo evaluation in clinical studies for accuracy, safety, and validity. Ethical considerations must also be addressed. Currently, there is no publicly available, nationwide mechanism for objective evaluation of health AI models in clinical care settings. Thus, a public-private partnership has been proposed to support a nationwide network of health AI assurance labs (38). This network would apply community best practices in testing health AI models, producing reports on their performance that can be widely shared to manage the lifecycle of AI models over time and across populations and sites where these models are deployed. The democratization of general LLMs will increasingly benefit our society by supporting the clinical decision-making process with human oversight, rather than replacing physicians.

Assisting diagnosis

Utilizing 38 complex clinical case challenges with comprehensive full-text information published online by NEJM, ChatGPT-4 was compared with NEJM journal readers in diagnosing patient case challenges (39). ChatGPT-4 correctly diagnosed 57% of cases, whereas the journal readers on average correctly diagnosed 36% of cases. A similar study used 70 clinicopathological cases published by NEJM to evaluate ChatGPT’s diagnostic accuracy, which showed good performance (40). Another study assessed LLM diagnostic accuracy across JAMA Pediatrics and NEJM pediatric case challenges, with promising results (41). In an evaluation of the clinical accuracy of ChatGPT for suggesting initial diagnosis, examination steps, and treatment for 110 medical cases across diverse clinical disciplines, ChatGPT-4 performed well (42). Systematic diagnosis benchmarking of a wide range of diseases across various specialties using hypothetical patient cases showed that OpenAI ChatGPT-4, Google Gemini, and Baidu Ernie-4 performed well in both English and Chinese (unpublished results from Anjun Chen).

A study from Stanford University found that ChatGPT-4 could be prompted to mimic the common clinical reasoning processes of clinicians without sacrificing diagnostic accuracy (43). By providing an interpretable rationale, ChatGPT can offer physicians a way to evaluate whether an LLM response is likely correct and can be trusted for patient care. This feature of LLMs has the potential to mitigate the “black box” limitations of LLMs, bringing them one step closer to safe and effective use in medicine. Harvard Medical School researchers used 36 clinical vignettes to demonstrate that ChatGPT can recommend diagnostic workup, decide on the clinical management course, and ultimately make the diagnosis, thus working through the entire clinical encounter with an overall accuracy of 71.7% (44).

In analyzing the clinical history of patients with complex and delayed diagnoses, ChatGPT-4 suggested diagnoses not considered by clinicians before definitive investigations (45). The GenAI analysis may increase confidence in diagnosis and earlier commencement of appropriate treatment, alert clinicians to important missed diagnoses, and offer suggestions similar to specialists to achieve the correct clinical diagnosis. This is especially valuable in low-income countries with a lack of specialist care.

Making a correct and timely diagnosis remains a challenge in modern medicine despite decades of technological advances. Therefore, any emerging technology with the potential to reduce diagnostic errors warrants serious examination (46). ChatGPT-4’s surprisingly high accuracy in diagnosis led researchers to conclude that GenAI may prove useful as clinical decision-making support for complex diagnoses today.

Predicting disease risks

A systematic benchmarking study evaluated ChatGPT’s ability to check symptoms for 194 diseases across different specialties covered in the Mayo Clinic Symptom Checker (25). The disease prediction accuracy of ChatGPT-4 reached 78.8%, outperforming previous symptom chatbot technologies. This result provided initial evidence for the potential use of ChatGPT-4 in symptom checking by patients and for triage applications. In a related exploratory study, ChatGPT provided largely appropriate responses to simple cardiovascular disease prevention questions as evaluated by preventive cardiology clinicians (47). Using the UK Biobank data, a study showed that ChatGPT could achieve a 10-year risk prediction performance for cardiovascular diseases comparable to conventional models (48). ChatGPT may also be applied to in-hospital risk prediction and alert generation (49). These early studies suggest the promising application of GenAI chatbots in risk-based screening in various settings.
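
For readers who wish to run a similar benchmark, accuracy figures of this kind are typically computed by comparing a chatbot’s ranked disease predictions against the reference disease for each test case. The Python sketch below shows one simple way to do this. It is an assumption-laden illustration and not the scoring procedure of the cited study: the function name `top_k_accuracy` is hypothetical, the example cases are invented, and matching is by normalized exact string comparison, whereas a real evaluation would usually need clinician judgment or a terminology mapping to decide whether two disease names are equivalent.

```python
def normalize(name):
    """Lowercase a disease name and collapse surrounding/extra whitespace."""
    return " ".join(name.lower().split())


def top_k_accuracy(cases, k=1):
    """Fraction of cases whose reference disease appears in the top k predictions.

    cases: list of (reference_disease, ranked_predictions) pairs, where
    ranked_predictions is a list of disease names, most likely first.
    """
    if not cases:
        raise ValueError("at least one case is required")
    hits = 0
    for reference, predictions in cases:
        top_predictions = [normalize(p) for p in predictions[:k]]
        if normalize(reference) in top_predictions:
            hits += 1
    return hits / len(cases)


if __name__ == "__main__":
    # Invented example cases for illustration; not data from any study.
    example_cases = [
        ("Migraine", ["migraine", "tension headache"]),
        ("Asthma", ["COPD", "asthma"]),
        ("Gout", ["gout"]),
        ("Anemia", ["hypothyroidism", "depression", "insomnia"]),
    ]
    print(top_k_accuracy(example_cases, k=1))
    print(top_k_accuracy(example_cases, k=3))
```

Reporting both top-1 and top-k accuracy is a common choice for symptom checkers, since a tool that lists the correct disease among a few candidates may still be useful for triage.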

Empowering primary care

A review proposed the concept of a generalist medical AI based on foundation models, which could tackle a variety of applications, from chatbots with patients to note-taking, all the way to bedside decision support for doctors (50). AI based on LLMs can enable first-line screening, performed either in general practice or by patients themselves. ChatGPT-4 scored 82.4% in the University of Toronto Family Medicine Residency Progress Test, compared to 56.9% by Family Medicine residents (51). This study suggested that general practice may be improved to higher standards with the support of GenAI.

Standardizing and personalizing disease treatment

ChatGPT can provide standard information and evidence on cancer care, although a human in the loop is required to avoid potential misinformation (52). For example, LLMs could provide correct and useful information on prostate cancer screening, prevention, treatment options, and postoperative complications, contributing to the democratization of medical knowledge (53). In a proof-of-concept study, ChatGPT was used as a decision support tool in a breast tumor board with promising results (54). However, for GenAI chatbots to provide treatment option information in precision cancer care, studies found the AI recommendations did not reach the quality and credibility of human experts (55,56).

Assisting surgery

LLMs hold promise for enhancing surgical efficiency while still prioritizing patient care (57). The review authors recommend that the academic surgical community further investigate the potential applications of LLMs while being cautious about potential harms. For generating postoperative patient instructions, while ChatGPT currently cannot supplant a human clinician, it can serve as a medical knowledge source (58). The qualitative study assessed the value of ChatGPT in augmenting patient knowledge and generating postoperative instructions for use in populations with low educational or health literacy levels.

Prediction from images

Many traditional AI tools are already in use to help radiologists analyze complex scans more quickly and accurately. GenAI’s image recognition capabilities may have potential in medical imaging (59). In a diagnostic study on a representative sample of 500 emergency department chest radiographs from 500 unique patients, a GenAI model produced reports of similar clinical accuracy and textual quality to radiology reports while providing higher textual quality than teleradiology reports (60). The results suggest that use of the GenAI tool may facilitate timely interpretation of chest radiography by emergency department physicians. Another feasibility study of GenAI in diagnostic imaging showed that ChatGPT-4 attained a concordance of 68.8% with expert consensus at determining top differential diagnoses based on imaging patterns, and 93.8% of differential diagnoses proposed by ChatGPT-4 were deemed acceptable alternatives (61). ChatGPT-4 was used to process ophthalmic imaging data of 136 ophthalmic cases and was able to answer 70% of all multiple-choice questions correctly (62).

Improving answers to patient questions and patient communication

In a study comparing physician and GenAI chatbot responses to 195 randomly drawn patient questions posted to a public social media forum, the chatbot responses were preferred over physician responses and rated significantly higher for both quality and empathy (63). These results suggest that GenAI assistants may be able to aid in drafting responses to patient questions. A follow-up study showed ChatGPT consistently provided evidence-based answers to public health questions (64). ChatGPT outperformed benchmark evaluations of other AI assistants. Given the same addiction questions, Amazon Alexa, Apple Siri, Google Assistant, Microsoft’s Cortana, and Samsung’s Bixby collectively recognized 5% of the questions and made one referral, compared with 91% recognition and two referrals with ChatGPT.

In assessing ChatGPT responses to 200 eye care questions from an online advice forum, a study found that ophthalmologists and ChatGPT may provide comparable quality of ophthalmic advice (65). For specific eye diseases, an LLM chatbot demonstrated proficiency comparable to that of glaucoma and retina subspecialists in addressing ophthalmological questions and patient case management (66). For questions concerning six skin conditions and common queries, ChatGPT answered with high accuracy, but inaccurate and incomplete responses existed, emphasizing the importance of professional dermatologist consultations (67).

Improving answers to professional questions

Studies showed ChatGPT is promising as a tool for providing accurate medical information in clinical settings (68). A chatbot generated largely accurate information in response to diverse medical queries, as judged by academic physician specialists, although it had important limitations (69). One study examined how prompt engineering affects the consistency and reliability of LLM responses with respect to an evidence-based guideline, revealing that appropriate prompts could improve the accuracy of responses to professional medical questions (70).

Automating documentation

GenAI LLMs have been tested to draft clinical content or documents in support of clinicians, including responses to patient emails, clinical notes, visit summaries, consent forms, and postoperative instructions (58,71,72). Using ambient listening in an exam room with patient consent, GenAI is being tested to automatically generate a formatted medical note of the clinical interaction.

There are studies raising issues with the automated documentation by GenAI tools, such as inaccurate information or statements without evidence (73,74). A consensus strategy to address these issues is to create human-in-the-loop interaction with the AI, allowing care providers to edit and approve the document before it is finalized. GenAI-based document automation may reduce clinicians’ feelings of burnout.
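
The human-in-the-loop strategy can be illustrated with a small sketch in which an AI-drafted note cannot be finalized until a named clinician has approved it. The `DraftNote` class, its fields, and its status values are hypothetical and are not drawn from any cited system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DraftNote:
    """An AI-drafted clinical document awaiting clinician review (hypothetical schema)."""
    text: str
    status: str = "draft"  # "draft" -> "approved" -> "finalized"
    approved_by: Optional[str] = None
    edit_history: List[str] = field(default_factory=list)

    def edit(self, clinician: str, new_text: str) -> None:
        if self.status == "finalized":
            raise RuntimeError("finalized notes cannot be edited")
        self.edit_history.append(f"{clinician} edited the draft")
        self.text = new_text
        # Any edit invalidates a previous approval.
        self.status = "draft"
        self.approved_by = None

    def approve(self, clinician: str) -> None:
        if self.status == "finalized":
            raise RuntimeError("note is already finalized")
        self.status = "approved"
        self.approved_by = clinician

    def finalize(self) -> str:
        # The gate: nothing enters the record without a named approver.
        if self.status != "approved" or self.approved_by is None:
            raise PermissionError("a clinician must approve the note first")
        self.status = "finalized"
        return self.text
```

Resetting the approval on every edit is a deliberate design choice in this sketch: the text that is finalized is always the exact text a clinician last signed off on.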

Automating clinical trial analysis

ChatGPT has been explored as a way to automatically create ML prediction models from clinical trial data, with promising results (75). On traditional performance metrics, the auto-created ML models were comparable to their manually crafted counterparts, opening the possibility of automating ML analysis of clinical trial data.

Automating medical informatics tasks

ChatGPT was leveraged for clinical named entity recognition tasks, including extraction of medical problems, treatments, and tests from clinical notes (76). Although ChatGPT’s performance trails state-of-the-art models like BioClinicalBERT, the study suggested that fine-tuning general LLMs with task-specific knowledge may improve performance. One study applied the retrieval-augmented generation (RAG) approach to improve the interpretation of medical guidelines for chronic hepatitis C virus infection management, demonstrating that RAG can enhance the efficacy of integrating LLMs into clinical decision support systems for guideline delivery (77).
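
To make the retrieval-augmented generation (RAG) idea concrete, the sketch below retrieves the guideline passages most similar to a question and places them in the prompt. It is a simplification under stated assumptions: word-overlap (Jaccard) scoring stands in for the embedding search a real system would likely use, the passages are placeholders rather than real guideline text, and no LLM is actually called.

```python
import re
from typing import List, Tuple


def tokenize(text: str) -> set:
    """Lowercase word set; a crude stand-in for an embedding model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, passages: List[str], top_n: int = 2) -> List[Tuple[float, str]]:
    """Rank passages by Jaccard word overlap with the question."""
    q = tokenize(question)
    scored = []
    for passage in passages:
        p = tokenize(passage)
        union = q | p
        score = len(q & p) / len(union) if union else 0.0
        scored.append((score, passage))
    # Highest overlap first; the sort is stable, so ties keep input order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_n]


def build_prompt(question: str, passages: List[str], top_n: int = 2) -> str:
    """Assemble a prompt that asks the model to answer only from retrieved text."""
    context = "\n".join(f"- {text}" for _, text in retrieve(question, passages, top_n))
    return (
        "Answer using only the guideline excerpts below. "
        "If they do not cover the question, say so.\n"
        f"Excerpts:\n{context}\n"
        f"Question: {question}"
    )


# Placeholder passages, not real guideline text.
guideline_passages = [
    "Placeholder section on initial testing and screening criteria.",
    "Placeholder section on treatment selection and duration.",
    "Placeholder section on monitoring after treatment.",
]

prompt = build_prompt("How is treatment duration chosen?", guideline_passages)
```

The resulting `prompt` string would then be sent to whichever LLM is in use. Grounding the answer in retrieved excerpts is what allows a reviewer to check a response against its source.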

Democratization of GenAI agents and specialized or custom GenAI in healthcare

There are a few healthcare-specialized LLMs built from healthcare data from scratch (78,79). Med-PaLM, Google’s LLM, is designed to provide high-quality answers to medical questions (80). Med-PaLM 2 set a record with 86.5% accuracy on the MedQA medical exam benchmark, achieving accuracy comparable to that of human experts (81). Google’s LLM for differential diagnosis exhibited higher accuracy than clinicians and has potential to improve clinicians’ diagnostic reasoning and accuracy in challenging cases. This merits further real-world clinical evaluation for its ability to empower physicians and expand patients’ access to specialist-level expertise (82).

Google introduced Articulate Medical Intelligence Explorer (AMIE), an LLM-based AI system optimized for diagnostic dialogue (83). AMIE’s performance was compared to that of primary care physicians in a randomized, double-blind crossover study of text-based consultations with validated patient actors for 149 clinical case scenarios. AMIE demonstrated greater diagnostic accuracy and superior performance. While further research is required before AMIE can be applied in real-world settings, the results represent a new milestone towards conversational diagnostic AI. Researchers say this type of autonomous GenAI agent could help democratize medicine.

Researchers at the University of California commented that we should view LLMs as intelligent “agents” capable of interacting with stakeholders in open-ended conversations and even influencing clinical decision-making. These LLM agents can be developed for a variety of clinical use cases by providing the LLM access to different sources of information and tools, including clinical guidelines, electronic health records, and clinical software tools. Different agents can also collaborate with each other in “multi-agent” settings to model medical conversations and decision-making processes (84). Drawing from an analogy to autonomous vehicle development, Stanford University researchers proposed a pathway to autonomous behavioral healthcare using GenAI agents (85).

GatorTron is an LLM utilizing over 90 billion words of clinical text (86). It showed improvement in five clinical natural language processing (NLP) tasks including clinical concept extraction, medical relation extraction, semantic textual similarity, natural language inference, and medical question answering. NYUTron is an LLM for medical language, subsequently fine-tuned across a wide range of clinical and operational predictive tasks (87).

For specific tasks, customizing or fine-tuning existing open-source LLMs with health data is a more cost-effective GenAI strategy. Using LLaMA-7B as the foundational model and optimizing it through Low-Rank Adaptation (LoRA) on hospital discharge summaries, a fine-tuned model surpassed the performance of prior leading models, such as ClinicalBERT and CAML, in diagnosis-related group prediction (88). SkinGPT-4, a fine-tuned LLM for skin disease images, could evaluate user-uploaded skin images, identify the characteristics and categories of skin conditions, perform in-depth analysis, and provide interactive treatment recommendations. A pilot study of SkinGPT-4 demonstrated approximately 78% diagnostic accuracy (89).
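
The cost advantage of LoRA comes from training two small matrices instead of the full weight matrix. The sketch below shows the standard LoRA update, W + (alpha / r) · B·A, in plain Python on a toy matrix, together with a parameter count. The matrix values, the 4096 × 4096 layer size, and the rank of 8 are arbitrary illustrations and are not the configuration used in the cited study.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, adequate for a toy example."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_merge(w: Matrix, b: Matrix, a: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(a)  # A has shape (r, k); B has shape (d, r)
    delta = matmul(b, a)
    scale = alpha / r
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


def trainable_params(d: int, k: int, r: int) -> int:
    """Adapter parameters: B is d x r and A is r x k."""
    return r * (d + k)


# Toy 2 x 3 frozen weight with a rank-1 adapter (values are arbitrary).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
B = [[1.0],
     [2.0]]
A = [[0.5, 0.0, 1.0]]
merged = lora_merge(W, B, A, alpha=2.0)

# Adapter size versus full fine-tuning for one hypothetical 4096 x 4096 layer.
full = 4096 * 4096
adapter = trainable_params(4096, 4096, r=8)
```

In this illustration the adapter has 65,536 trainable parameters against 16,777,216 for the full layer, which is the kind of reduction that makes fine-tuning feasible on modest hardware. When B is all zeros, as it is at the start of training, the merged weights equal the frozen weights.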

When making specialized or custom models for a specific domain in healthcare, domain generalization is a ubiquitous challenge, often due to data bias and underrepresentation of data in some patient groups or conditions. A promising strategy to address this issue is to use generative models to generate synthetic data from limited sample data to improve data representations in training datasets. For example, a recent study demonstrated that complementing real data with synthetic ones generated by a denoising diffusion probabilistic model (DDPM) improved the robustness of medical image classifiers and increased fairness by improving the accuracy of clinical diagnosis within underrepresented groups under distribution shifts (90). DDPM may be applied to small datasets, although more research is needed (91).
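
Before any synthetic data are generated, one has to decide how many samples each underrepresented group needs. The following sketch covers only that planning step, under the assumption that groups are topped up to a chosen fraction of the largest group. The function name, group labels, and counts are hypothetical, and the cited study's actual sampling scheme may differ.

```python
from typing import Dict


def synthetic_sample_budget(group_counts: Dict[str, int],
                            target_fraction: float = 1.0) -> Dict[str, int]:
    """Number of synthetic samples each group needs to reach
    target_fraction of the largest group's size."""
    if not 0.0 < target_fraction <= 1.0:
        raise ValueError("target_fraction must be in (0, 1]")
    if not group_counts:
        return {}
    target = round(max(group_counts.values()) * target_fraction)
    # Groups already at or above the target receive no synthetic samples.
    return {group: max(0, target - count) for group, count in group_counts.items()}


# Hypothetical counts of real training images per patient group.
real_counts = {"group_a": 800, "group_b": 150, "group_c": 50}
budget = synthetic_sample_budget(real_counts, target_fraction=0.5)
```

A generative model such as a DDPM would then be asked to produce the budgeted number of samples per group. Whether the synthetic samples actually improve fairness still has to be verified on held-out real data.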

Responsible GenAI in healthcare

Because GenAI, like ChatGPT, demonstrates unprecedented capabilities in many areas, profound ethical issues are affecting the new relationship between GenAI and human society. It is essential to address these issues appropriately for humans to benefit from GenAI democratization in healthcare. New laws and regulations will be required to enforce responsible GenAI development and applications that ensure safety, efficacy, privacy, and equity. Although the rapid pace of GenAI development makes it challenging for governments to create effective regulations in time, Dr. Robert Califf, the commissioner of the US Food and Drug Administration (FDA), described the regulation of LLMs as critical to healthcare’s future in May 2023.

Drawing from real-life scenarios and insights shared at the Responsible AI for Social and Ethical Healthcare (RAISE) conference, a Nature Medicine commentary highlights the critical need for AI in healthcare to primarily benefit patients and address current shortcomings in health care systems, such as medical errors and access disparities (92). Badal et al. proposed guiding principles for the responsible development of AI tools for healthcare, such as reducing health disparities, improving clinical outcomes, reducing overdiagnosis and overtreatment, promoting learning healthcare system, and facilitating shared decision-making (93).

Ethics concerns: safety, efficacy, transparency, accountability, security, and privacy

Observing that applications powered by LLMs are increasingly used to perform medical tasks without appropriate clinical verification, Shah et al. have called for regulations that require GenAI tools to specify the desired benefits and evaluate these benefits through testing in real-world deployments (94). Meskó and Topol argued that regulatory oversight should assure medical professionals and patients can use LLMs without causing harm or compromising their data or privacy, summarizing some practical recommendations for regulations (95). For data privacy and security, regulatory oversight should establish robust frameworks to protect patient data and prevent unauthorized access or misuse of these powerful healthcare LLM models.

At their current stage, LLMs may generate factually incorrect outputs, i.e., GenAI’s hallucination behavior, that can result in harm to patients. Implementing transparency and accountability policies for GenAI applications is critical to patient safety. For instance, healthcare professionals and patients should be made aware of the AI’s involvement in the decision-making process and be provided with explanations for the AI’s recommendations.

However, AI presents an ethical dilemma regarding liability and accountability. When an AI system makes a decision that results in patient harm, it can be difficult to determine who is at fault—the healthcare provider who used the AI, the developers of the AI, or even the AI itself. For instance, if an AI diagnostic tool incorrectly identifies a benign tumor as malignant and the decision is not critically evaluated, leading to unnecessary treatment, determining liability can be complex. Oniani et al. defined accountability as the property of being able to trace activities on a system back to individuals who may then be held responsible for their actions (96). They recommended monitoring GenAI applications for potential errors, deactivating to prevent more damage when an error occurs, and remedying errors in time.

Mello and Guha considered ChatGPT and its risk for malpractice, believing that physicians should use LLMs only to supplement more traditional forms of information seeking (97). Using GenAI predictions as a new supporting information source can help capture the distinctive value of LLMs today while avoiding their pitfalls. This recommendation highlights the role of human oversight in ensuring patient safety. Haupt and Marks distinguish three GPT use cases: AI within the patient-physician relationship that augments rather than replaces clinician judgment, patient-facing AI in healthcare delivery that substitutes for clinician judgment, and direct-to-consumer health advice-giving (98). Clinicians have distinct concerns in each context. Patients might benefit from using GPT as a medical resource. However, unless AI-generated medical advice is filtered through healthcare practitioners, false or misleading information could endanger patient safety, which emphasizes the importance of clinician oversight in applying GenAI to healthcare settings.

Gottlieb and Silvis discussed how to safely integrate LLMs into healthcare, focusing on the crucial aspect of data sharing and interoperability (99). They expect LLMs to be poised to assist physicians in diagnosing, treating, and managing diseases, thereby enhancing medical care. These new GenAI tools could be especially valuable in managing high-volume and routine encounters and by providing more ongoing support to patients than what clinicians are able to achieve unassisted. With physician oversight, GenAI will drive some routine aspects of healthcare, aiming to broaden the scope of patient engagement rather than replacing interactions with clinicians.

Equity concerns: data bias and access disparities

Evaluating potential racial and ethnic biases of ChatGPT in the diagnosis and triage of health conditions, Ito et al. demonstrated that ChatGPT-4’s ability to diagnose and triage typical clinical vignettes was comparable to that of board-certified physicians and the performance of ChatGPT-4 did not vary by patient race and ethnicity (100). However, when Kim et al. assessed biases in medical decisions via GenAI chatbot responses to patient vignettes, they observed that GenAI chatbots provided different recommendations based on a patient’s gender, race and ethnicity, and socioeconomic status in certain clinical scenarios (101). Omiye et al. asked some LLMs, including ChatGPT, medical questions and found that these LLMs could potentially cause harm by perpetuating debunked, racist ideas (102). They raised an alarm in their study report titled “Large language models propagate race-based medicine”. We should caution that the original studies of LLM bias used small sample sizes, and their findings should be viewed as inconclusive. More research covering a broader range of diseases and using a larger number of patient cases is needed to create reliable evidence on the issue of LLM bias behavior.
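
One simple way to quantify the kind of bias discussed above is to compare accuracy across patient groups on the same set of vignettes. The sketch below computes per-group accuracy and the largest gap between groups. The record format, group labels, and toy outcomes are assumptions for illustration and do not reproduce the methods of the cited studies.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_group(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Accuracy per group from (group_label, was_answer_correct) records."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / totals[group] for group in totals}


def max_accuracy_gap(per_group: Dict[str, float]) -> float:
    """Difference between the best- and worst-served groups."""
    if not per_group:
        raise ValueError("no groups to compare")
    return max(per_group.values()) - min(per_group.values())


# Hypothetical outcomes: the same vignettes presented with different group labels.
records = [
    ("group_a", True), ("group_a", True), ("group_a", True),
    ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", True), ("group_b", True),
    ("group_b", False), ("group_b", False),
]
per_group = accuracy_by_group(records)
gap = max_accuracy_gap(per_group)
```

With only five cases per group, as in this toy example, a gap of this size could easily arise by chance, which is why larger case sets are needed before drawing conclusions.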

In the real world, disadvantaged populations are mostly underrepresented in the patient datasets used for ML training (data bias), and the same underserved populations also have less access to advanced medical AI tools (access disparities), a situation sometimes called the “double whammy”. Despite the few negative reports above, general-purpose LLM-based GenAI has been developed and tested as a potential large-scale revolutionary force against access disparities in underserved populations because it is publicly accessible to anyone (92). To ease data bias, new strategies leveraging generative models have been explored, such as generating synthetic data to improve data representation (90). Moreover, we have proposed a new architecture for deploying LHS over clinical research networks as an advanced strategy to combat data bias and access disparities simultaneously (103).

Developing responsible GenAI: policies and regulations

NAM launched the AI Code of Conduct initiative in mid-2023, convening health, tech, research, and bioethics leaders to produce a code of conduct for the development and use of AI in health, medical care, and health research. Among the most challenging tasks is developing AI that does not preserve biases built over generations into the US healthcare system (104). Given that the use of algorithms in medicine is ubiquitous and the automation of these algorithms is increasingly common, Dorr et al. proposed that LHS can and should use algorithms to improve the effectiveness, reliability, person-centeredness, and efficiency of care as they seek constant improvement, but do so responsibly (105).

To address the explosive promise and grave risks posed by GenAI, the Biden administration of the United States introduced a landmark executive order on AI on October 30, 2023, outlining a vision for ensuring that AI is developed and used responsibly (106). The executive order calls on agencies to establish standards for safe and trustworthy AI and policies for ensuring those standards are met. The secretary of the US Department of Health and Human Services (HHS) will initiate development of an “AI assurance policy” that will provide infrastructure for conducting evaluations of healthcare AI tools in real-world settings and taking steps to address concerns identified. In early 2024, the European Union (EU) established the first comprehensive AI Act, which sets out mandatory requirements for AI systems and clarifies which AI uses will be acceptable across the EU.

The US FDA has been adapting its regulatory framework to specifically provide regulatory policies or guidelines for AI/ML-based software solutions that are used in the prevention, diagnosis, treatment, or monitoring of various diseases or conditions. The FDA released a discussion paper proposing a total product lifecycle (TPLC) approach to regulating Software as a Medical Device (SaMD). This approach focuses on the continuous monitoring and improvement of these technologies throughout their lifespan (107).

While there has been progress in regulating AI before the new GenAI era, the FDA has found bigger regulatory challenges with more advanced GenAI, which has characteristics very different from traditional AI. Notably, GenAI produces generative instead of deterministic outputs, and unsupervised training and self-learning by GenAI may evolve unexpectedly. These major uncertainties inherent in GenAI make it difficult to ensure quality and control risks within the existing regulatory frameworks. Meskó and Topol proposed creating a new regulatory category for LLMs and, similar to the FDA’s Digital Health Pre-Cert Program, suggested regulating companies that develop LLMs instead of regulating every single LLM iteration (95).

Discussion

Compared to THAI, GHAI possesses unique features that make it well-suited for democratization, potentially transforming healthcare into intelligent LHS in the future. In this review, we identified initial evidence of GenAI democratization in healthcare, marking the start of a global trend towards democratizing healthcare AI. Equally important, we outlined the ethical and equity issues that must be addressed to ensure GenAI augments rather than replaces physicians, thereby promoting responsible AI that delivers safe, effective, and equitable healthcare to everyone.

For the implementation of various aspects of GenAI democratization in healthcare, we propose three promising future directions:

Democratizing GenAI research to foster innovations in healthcare delivery: Initial evidence suggests that GenAI has the potential to provide valuable insights and predictions for numerous tasks and diseases in clinical care delivery. It is now time to systematically evaluate GenAI chatbots and tools across all specific tasks and diseases to generate the reliable evidence of GenAI’s safety and efficacy required for clinical applications. Unlike previous AI technologies, GenAI can be utilized by all health professionals to conduct clinical evaluation research within their own care delivery environments. Study methodologies such as comparative effectiveness research and pragmatic clinical trials allow doctors to study real-world data and generate real-world evidence. We believe this democratization of clinical research will foster innovations that solve clinical problems on a large scale, reinforcing the democratization process itself.
Integrating GenAI education into medical education to prepare an AI-equipped workforce: Students are highly enthusiastic about using GenAI tools like ChatGPT for learning and training. New courses focusing on GenAI concepts and applications, chatbots, basic prompt engineering, study designs, research methods, critical thinking, and GenAI-enabled scientific writing should be seamlessly integrated into the current curriculum. Regular classes can also be supplemented by webinar series focusing on the progress of GenAI healthcare applications and facilitating cross-disciplinary collaborations, including clinical care, public health, computer science, media, philosophy, law, and policy. The goal is to make GenAI a fundamental skill for every medical student, providing a desired workforce capable of democratizing GenAI practice and research in healthcare.
Implementing LHS to provide system-level enforcement for responsible AI development and applications: As suggested by the perspectives of NAM and academic colleagues (30,79,105), the LHS vision from NAM for healthcare transformation can be perfectly applied to responsibly developing and applying GenAI in care delivery. We have three main reasons for this match between GenAI and LHS. First, LHS requires the embedding of GenAI research in care delivery, allowing for AI model refinement and evaluation using the full spectrum of diseases and patient populations, thereby reducing data bias. Second, LHS operates in cycles of continuous learning and validation, and can thus measure the effectiveness of any GenAI intervention and promote only safer and more effective models for routine care practices. Third, as proposed in our work on ML-enabled LHS simulation and real data ML-LHS initialization (103,108), LHS can form clinical research networks encompassing lead hospitals and clinics from communities and remote areas, effectively enabling low-resource clinics to access high-performance GenAI tools developed at the lead hospitals.

Available publicly, GenAI allows any hospital and medical school to explore these future directions. As an example, the Healthcare AI Institute at Fudan University Medical School is exploring the three directions, i.e., supporting clinical teams to evaluate GenAI’s clinical benefits, designing an elective course on healthcare GenAI for all medical students, and collaborating with clinical teams to build LHS empowering community and rural clinics. We believe the collective efforts around the world will demonstrate that augmenting physicians with GenAI in LHS infrastructure can improve patient outcomes effectively, efficiently, and sustainably.

An initial study from Stanford Medical School shows that doctors are receptive to AI collaboration in simulated clinical cases without introducing bias. The study paired physicians with a prototype ChatGPT-like tool in a mock medical environment and found that the doctors were willing to collaborate with the AI tool to improve patient outcomes. We expect GenAI-enabled LHS will provide such an unbiased, human-machine collaborative care delivery environment.

Because this review aims to highlight early evidence of healthcare GenAI democratization and propose ways to advance this new trend, it has some limitations. The review is restricted to research on GenAI in healthcare delivery, not including other important fields like life sciences, drug discovery, and medical devices. It focuses on the clinical applications and impact of GenAI, not LLM algorithm development. The review does not cover all peer-reviewed publications on GenAI healthcare exhaustively but focuses on selected studies relevant to the democratization of GenAI in healthcare.

Conclusions

By identifying the intrinsic reasons for GenAI to be democratized in healthcare and reviewing the initial evidence generated from original GenAI studies in many aspects of healthcare, this review outlines the beginning of the democratization of GenAI in medical education, clinical research, and care delivery. Responsible development and applications of GenAI require a human-machine collaboration approach, where GenAI augments rather than replaces human expertise, ensuring that everyone benefits from GenAI safely and effectively. Immediate future directions are proposed for democratizing clinical research on GenAI impact, integrating GenAI into medical education, and building LHS with GenAI. These steps may further accelerate the democratization of GenAI in healthcare, benefiting patients and society as a whole.

Acknowledgments

The authors thank Professor Zhigang Pan of Fudan University Affiliated Zhongshan Hospital for discussion of AI-enabled primary care and its democratization.

Funding: None.

Footnote

Reporting Checklist: The authors have completed the Narrative Review reporting checklist. Available at https://jhmhp.amegroups.com/article/view/10.21037/jhmhp-24-54/rc

Peer Review File: Available at https://jhmhp.amegroups.com/article/view/10.21037/jhmhp-24-54/prf

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jhmhp.amegroups.com/article/view/10.21037/jhmhp-24-54/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

Matheny ME, Whicher D, Thadaney Israni S. Artificial Intelligence in Health Care: A Report From the National Academy of Medicine. JAMA 2020;323:509-10. [Crossref] [PubMed]
Dzau VJ, Laitner MH, Temple A, et al. Achieving the promise of artificial intelligence in health and medicine: Building a foundation for the future. PNAS Nexus 2023;2:pgad410. [Crossref]
Khera R, Butte AJ, Berkwits M, et al. AI in Medicine-JAMA's Focus on Clinical Outcomes, Patient-Centered Care, Quality, and Equity. JAMA 2023;330:818-20. [Crossref] [PubMed]
Rajkomar A, Dean J, Kohane I. Machine Learning in Medicine. N Engl J Med 2019;380:1347-58. [Crossref] [PubMed]
Deo RC. Machine Learning in Medicine. Circulation 2015;132:1920-30. [Crossref] [PubMed]
Cohen JP, Cao T, Viviano JD, et al. Problems in the deployment of machine-learned models in health care. CMAJ 2021;193:E1391-4. [Crossref] [PubMed]
Wilkinson J, Arnold KF, Murray EJ, et al. Time to reality check the promises of machine learning-powered precision medicine. Lancet Digit Health 2020;2:e677-80. [Crossref] [PubMed]
Chen PC, Liu Y, Peng L. How to develop machine learning models for healthcare. Nat Mater 2019;18:410-4. [Crossref] [PubMed]
Seastedt KP, Schwab P, O'Brien Z, et al. Global healthcare fairness: We should be sharing more, not less, data. PLOS Digit Health 2022;1:e0000102. [Crossref] [PubMed]
Egger J, Gsaxner C, Pepe A, et al. Medical deep learning-A systematic meta-review. Comput Methods Programs Biomed 2022;221:106874. [Crossref] [PubMed]
Kelly CJ, Karthikesalingam A, Suleyman M, et al. Key challenges for delivering clinical impact with artificial intelligence. BMC Med 2019;17:195. [Crossref] [PubMed]
Jindal JA, Lungren MP, Shah NH. Ensuring useful adoption of generative artificial intelligence in healthcare. J Am Med Inform Assoc 2024;31:1441-4. [Crossref] [PubMed]
Committee on Diagnostic Error in Health Care; Board on Health Care Services; Institute of Medicine; et al. Improving Diagnosis in Health Care. Balogh EP, Miller BT, Ball JR, editors. Washington (DC): National Academies Press (US); 2015 Dec 29.
Wachter RM, Brynjolfsson E. Will Generative Artificial Intelligence Deliver on Its Promise in Health Care? JAMA 2024;331:65-9. [Crossref] [PubMed]
Howell MD, Corrado GS, DeSalvo KB. Three Epochs of Artificial Intelligence in Health Care. JAMA 2024;331:242-4. [Crossref] [PubMed]
Ferruz N, Zitnik M, Oudeyer PY, et al. Anniversary AI reflections. Nature Machine Intelligence 2024;6:6-12. [Crossref]
National Academies of Sciences, Engineering, and Medicine. Artificial Intelligence in Health Professions Education: Proceedings of a Workshop. Washington, DC: The National Academies Press. 2023. Available online: https://doi.org/10.17226/27174
Chang BS. Transformation of Undergraduate Medical Education in 2023. JAMA 2023;330:1521-2. [Crossref] [PubMed]
Gilson A, Safranek CW, Huang T, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ 2023;9:e45312. Erratum in: JMIR Med Educ 2024;10:e57594. [Crossref] [PubMed]
Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health 2023;2:e0000198. [Crossref] [PubMed]
Strong E, DiGiammarino A, Weng Y, et al. Chatbot vs Medical Student Performance on Free-Response Clinical Reasoning Examinations. JAMA Intern Med 2023;183:1028-30. [Crossref] [PubMed]
Mihalache A, Popovic MM, Muni RH. Performance of an Artificial Intelligence Chatbot in Ophthalmic Knowledge Assessment. JAMA Ophthalmol 2023;141:589-97. [Crossref] [PubMed]
Schubert MC, Wick W, Venkataramani V. Performance of Large Language Models on a Neurology Board-Style Examination. JAMA Netw Open 2023;6:e2346721. [Crossref] [PubMed]
Abdullahi T, Singh R, Eickhoff C. Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models. JMIR Med Educ 2024;10:e51391. [Crossref] [PubMed]
Chen A, Chen DO, Tian L. Benchmarking the symptom-checking capabilities of ChatGPT for a broad range of diseases. J Am Med Inform Assoc 2023;ocad245. [Crossref] [PubMed]
Chen A, Chen W, Liu Y. Impact of Democratizing Artificial Intelligence: Using ChatGPT in Medical Education and Training. Acad Med 2024;99:589. [Crossref] [PubMed]
Preiksaitis C, Rose C. Opportunities, Challenges, and Future Directions of Generative Artificial Intelligence in Medical Education: Scoping Review. JMIR Med Educ 2023;9:e48785. [Crossref] [PubMed]
Boscardin CK, Gin B, Golde PB, et al. ChatGPT and Generative Artificial Intelligence for Medical Education: Potential Impact and Opportunity. Acad Med 2024;99:22-7. [Crossref] [PubMed]
Karabacak M, Ozkara BB, Margetis K, et al. The Advent of Generative Language Models in Medical Education. JMIR Med Educ 2023;9:e48163. [Crossref] [PubMed]
Smith M, Saunders R, Stuckhardt L, et al. Best Care at Lower Cost: The Path to Continuously Learning Health Care in America. Washington (DC): National Academies Press (US); May 10, 2013.
Chen A, Chen DO. Accuracy of Chatbots in Citing Journal Articles. JAMA Netw Open 2023;6:e2327647. [Crossref] [PubMed]
Holderried F, Stegemann-Philipps C, Herschbach L, et al. A Generative Pretrained Transformer (GPT)-Powered Chatbot as a Simulated Patient to Practice History Taking: Prospective, Mixed Methods Study. JMIR Med Educ 2024;10:e53961. [Crossref] [PubMed]
Liu X, Wu C, Lai R, et al. ChatGPT: when the artificial intelligence meets standardized patients in clinical training. J Transl Med 2023;21:447. [Crossref] [PubMed]
Grigorian A, Shipley J, Nahmias J, et al. Implications of Using Chatbots for Future Surgical Education. JAMA Surg 2023;158:1220-2. [Crossref] [PubMed]
Haug CJ, Drazen JM. Artificial Intelligence and Machine Learning in Clinical Medicine, 2023. N Engl J Med 2023;388:1201-8. [Crossref] [PubMed]
Liu J, Wang C, Liu S. Utility of ChatGPT in Clinical Practice. J Med Internet Res 2023;25:e48568. [Crossref] [PubMed]
Lee P, Bubeck S, Petro J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. N Engl J Med 2023;388:1233-9. [Crossref] [PubMed]
Shah NH, Halamka JD, Saria S, et al. A Nationwide Network of Health AI Assurance Laboratories. JAMA 2024;331:245-9. [Crossref] [PubMed]
Eriksen A, Möller S, Ryg J. Use of GPT-4 to Diagnose Complex Clinical Cases. NEJM AI 2023;1:AIp2300031. [Crossref]
Kanjee Z, Crowe B, Rodman A. Accuracy of a Generative Artificial Intelligence Model in a Complex Diagnostic Challenge. JAMA 2023;330:78-80. [Crossref] [PubMed]
Barile J, Margolis A, Cason G, et al. Diagnostic Accuracy of a Large Language Model in Pediatric Case Studies. JAMA Pediatr 2024;178:313-5. [Crossref] [PubMed]
Sandmann S, Riepenhausen S, Plagwitz L, et al. Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks. Nat Commun 2024;15:2050. [Crossref] [PubMed]
Savage T, Nayak A, Gallo R, et al. Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. NPJ Digit Med 2024;7:20. [Crossref] [PubMed]
Rao A, Pang M, Kim J, et al. Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study. J Med Internet Res 2023;25:e48659. [Crossref] [PubMed]
Shea YF, Lee CMY, Ip WCT, et al. Use of GPT-4 to Analyze Medical Records of Patients With Extensive Investigations and Delayed Diagnosis. JAMA Netw Open 2023;6:e2325000. [Crossref] [PubMed]
Kulkarni PA, Singh H. Artificial Intelligence in Clinical Diagnosis: Opportunities, Challenges, and Hype. JAMA 2023;330:317-8. [Crossref] [PubMed]
Sarraju A, Bruemmer D, Van Iterson E, et al. Appropriateness of Cardiovascular Disease Prevention Recommendations Obtained From a Popular Online Chat-Based Artificial Intelligence Model. JAMA 2023;329:842-4. [Crossref] [PubMed]
Han C, Kim DW, Kim S, et al. Evaluation of GPT-4 for 10-year cardiovascular risk prediction: Insights from the UK Biobank and KoGES data. iScience 2024;27:109022. [Crossref] [PubMed]
Liu S, Wright AP, Patterson BL, et al. Using AI-generated suggestions from ChatGPT to optimize clinical decision support. J Am Med Inform Assoc 2023;30:1237-45. [Crossref] [PubMed]
Moor M, Banerjee O, Abad ZSH, et al. Foundation models for generalist medical artificial intelligence. Nature 2023;616:259-65. [Crossref] [PubMed]
Huang RS, Lu KJQ, Meaney C, et al. Assessment of Resident and AI Chatbot Performance on the University of Toronto Family Medicine Residency Progress Test: Comparative Study. JMIR Med Educ 2023;9:e50514. [Crossref] [PubMed]
Iannantuono GM, Bracken-Clarke D, Floudas CS, et al. Applications of large language models in cancer care: current evidence and future perspectives. Front Oncol 2023;13:1268915. [Crossref] [PubMed]
Zhu L, Mou W, Chen R. Can the ChatGPT and other large language models with internet-connected database solve the questions and concerns of patient with prostate cancer and help democratize medical knowledge? J Transl Med 2023;21:269. [Crossref] [PubMed]
Sorin V, Klang E, Sklair-Levy M, et al. Large language model (ChatGPT) as a support tool for breast tumor board. NPJ Breast Cancer 2023;9:44. [Crossref] [PubMed]
Benary M, Wang XD, Schmidt M, et al. Leveraging Large Language Models for Decision Support in Personalized Oncology. JAMA Netw Open 2023;6:e2343689. [Crossref] [PubMed]
Chen S, Kann BH, Foote MB, et al. Use of Artificial Intelligence Chatbots for Cancer Treatment Information. JAMA Oncol 2023;9:1459-62. [Crossref] [PubMed]
Rengers TA, Thiels CA, Salehinejad H. Academic Surgery in the Era of Large Language Models: A Review. JAMA Surg 2024;159:445-50. [Crossref] [PubMed]
Ayoub NF, Lee YJ, Grimm D, et al. Comparison Between ChatGPT and Google Search as Sources of Postoperative Patient Instructions. JAMA Otolaryngol Head Neck Surg 2023;149:556-8. [Crossref] [PubMed]
Koohi-Moghadam M, Bae KT. Generative AI in Medical Imaging: Applications, Challenges, and Ethics. J Med Syst 2023;47:94. [Crossref] [PubMed]
Huang J, Neill L, Wittbrodt M, et al. Generative Artificial Intelligence for Chest Radiograph Interpretation in the Emergency Department. JAMA Netw Open 2023;6:e2336100. [Crossref] [PubMed]
Kottlors J, Bratke G, Rauen P, et al. Feasibility of Differential Diagnosis Based on Imaging Patterns Using a Large Language Model. Radiology 2023;308:e231167. [Crossref] [PubMed]
Mihalache A, Huang RS, Popovic MM, et al. Accuracy of an Artificial Intelligence Chatbot's Interpretation of Clinical Ophthalmic Images. JAMA Ophthalmol 2024;142:321-6. [Crossref] [PubMed]
Ayers JW, Poliak A, Dredze M, et al. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Intern Med 2023;183:589-96. [Crossref] [PubMed]
Ayers JW, Zhu Z, Poliak A, et al. Evaluating Artificial Intelligence Responses to Public Health Questions. JAMA Netw Open 2023;6:e2317517. [Crossref] [PubMed]
Bernstein IA, Zhang YV, Govil D, et al. Comparison of Ophthalmologist and Large Language Model Chatbot Responses to Online Patient Eye Care Questions. JAMA Netw Open 2023;6:e2330320. [Crossref] [PubMed]
Huang AS, Hirabayashi K, Barna L, et al. Assessment of a Large Language Model's Responses to Questions and Cases About Glaucoma and Retina Management. JAMA Ophthalmol 2024;142:371-5. Erratum in: JAMA Ophthalmol 2024;142:393. [Crossref] [PubMed]
Ferreira AL, Chu B, Grant-Kels JM, et al. Evaluation of ChatGPT Dermatology Responses to Common Patient Queries. JMIR Dermatol 2023;6:e49280. [Crossref] [PubMed]
Harris E. Large Language Models Answer Medical Questions Accurately, but Can't Match Clinicians' Knowledge. JAMA 2023;330:792-4. [Crossref] [PubMed]
Goodman RS, Patrinely JR, Stone CA Jr, et al. Accuracy and Reliability of Chatbot Responses to Physician Questions. JAMA Netw Open 2023;6:e2336483. [Crossref] [PubMed]
Wang L, Chen X, Deng X, et al. Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. NPJ Digit Med 2024;7:41. [Crossref] [PubMed]
Decker H, Trang K, Ramirez J, et al. Large Language Model-Based Chatbot vs Surgeon-Generated Informed Consent Documentation for Common Procedures. JAMA Netw Open 2023;6:e2336997. [Crossref] [PubMed]
Nayak A, Alkaitis MS, Nayak K, et al. Comparison of History of Present Illness Summaries Generated by a Chatbot and Senior Internal Medicine Residents. JAMA Intern Med 2023;183:1026-7. [Crossref] [PubMed]
Goodman KE, Yi PH, Morgan DJ. AI-Generated Clinical Summaries Require More Than Accuracy. JAMA 2024;331:637-8. [Crossref] [PubMed]
Tang L, Sun Z, Idnay B, et al. Evaluating large language models on medical evidence summarization. NPJ Digit Med 2023;6:158. [Crossref] [PubMed]
Tayebi Arasteh S, Han T, Lotfinia M, et al. Large language models streamline automated machine learning for clinical studies. Nat Commun 2024;15:1603. [Crossref] [PubMed]
Hu Y, Chen Q, Du J, et al. Improving large language models for clinical named entity recognition via prompt engineering. J Am Med Inform Assoc 2024;ocad259. [Crossref] [PubMed]
Kresevic S, Giuffrè M, Ajcevic M, et al. Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework. NPJ Digit Med 2024;7:102. [Crossref] [PubMed]
Wornow M, Xu Y, Thapa R, et al. The shaky foundations of large language models and foundation models for electronic health records. NPJ Digit Med 2023;6:135. [Crossref] [PubMed]
Clusmann J, Kolbinger FR, Muti HS, et al. The future landscape of large language models in medicine. Commun Med (Lond) 2023;3:141. [Crossref] [PubMed]
Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature 2023;620:172-80. [Crossref] [PubMed]
Singhal K, Tu T, Gottweis J, et al. Towards Expert-Level Medical Question Answering with Large Language Models. arXiv 2023. arXiv:2305.09617.
McDuff D, Schaekermann M, Tu T, et al. Towards accurate differential diagnosis with large language models. arXiv 2023. arXiv:2312.00164.
Tu T, Palepu A, Schaekermann M, et al. Towards conversational diagnostic AI. arXiv 2024. arXiv:2401.05654.
Mehandru N, Miao BY, Almaraz ER, et al. Evaluating large language models as agents in the clinic. NPJ Digit Med 2024;7:84. [Crossref] [PubMed]
Stade EC, Stirman SW, Ungar LH, et al. Large language models could change the future of behavioral healthcare: a proposal for responsible development and evaluation. Npj Ment Health Res 2024;3:12. [Crossref] [PubMed]
Yang X, Chen A, PourNejatian N, et al. A large language model for electronic health records. NPJ Digit Med 2022;5:194. [Crossref] [PubMed]
Jiang LY, Liu XC, Nejatian NP, et al. Health system-scale language models are all-purpose prediction engines. Nature 2023;619:357-62. [Crossref] [PubMed]
Wang H, Gao C, Dantona C, et al. DRG-LLaMA: tuning LLaMA model to predict diagnosis-related group for hospitalized patients. NPJ Digit Med 2024;7:16. [Crossref] [PubMed]
Zhou J, He X, Sun L, et al. Pre-trained Multimodal Large Language Model Enhances Dermatological Diagnosis using SkinGPT-4. medRxiv 2023. doi: 10.1101/2023.06.10.23291127
Ktena I, Wiles O, Albuquerque I, et al. Generative models improve fairness of medical classifiers under distribution shifts. Nat Med 2024;30:1166-73. [Crossref] [PubMed]
Kim HK, Ryu IH, Choi JY, et al. A feasibility study on the adoption of a generative denoising diffusion model for the synthesis of fundus photographs using a small dataset. Discov Appl Sci 2024;6:188. [Crossref]
Goldberg CB, Adams L, Blumenthal D, et al. To do no harm - and the most good - with AI in health care. NEJM AI 2024;1:AIp2400036. [Crossref]
Badal K, Lee CM, Esserman LJ. Guiding principles for the responsible development of artificial intelligence tools for healthcare. Commun Med (Lond) 2023;3:47. [Crossref] [PubMed]
Shah NH, Entwistle D, Pfeffer MA. Creation and Adoption of Large Language Models in Medicine. JAMA 2023;330:866-9. [Crossref] [PubMed]
Meskó B, Topol EJ. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. NPJ Digit Med 2023;6:120. [Crossref] [PubMed]
Oniani D, Hilsman J, Peng Y, et al. Adopting and expanding ethical principles for generative artificial intelligence from military to healthcare. NPJ Digit Med 2023;6:225. [Crossref] [PubMed]
Mello MM, Guha N. ChatGPT and Physicians' Malpractice Risk. JAMA Health Forum 2023;4:e231938. [Crossref] [PubMed]
Haupt CE, Marks M. AI-Generated Medical Advice-GPT and Beyond. JAMA 2023;329:1349-50. [Crossref] [PubMed]
Gottlieb S, Silvis L. How to Safely Integrate Large Language Models Into Health Care. JAMA Health Forum 2023;4:e233909. [Crossref] [PubMed]
Ito N, Kadomatsu S, Fujisawa M, et al. The Accuracy and Potential Racial and Ethnic Biases of GPT-4 in the Diagnosis and Triage of Health Conditions: Evaluation Study. JMIR Med Educ 2023;9:e47532. [Crossref] [PubMed]
Kim J, Cai ZR, Chen ML, et al. Assessing Biases in Medical Decisions via Clinician and AI Chatbot Responses to Patient Vignettes. JAMA Netw Open 2023;6:e2338050. [Crossref] [PubMed]
Omiye JA, Lester JC, Spichak S, et al. Large language models propagate race-based medicine. NPJ Digit Med 2023;6:195. [Crossref] [PubMed]
Chen A, Wu E, Huang R, et al. Development of inclusive and practical machine learning risk prediction models from electronic medical records for lung cancer screening. JMIR AI 2024; [Crossref]
Hswen Y, Voelker R. New AI Tools Must Have Health Equity in Their DNA. JAMA 2023;330:1604-7. [Crossref] [PubMed]
Dorr DA, Adams L, Embí P. Harnessing the Promise of Artificial Intelligence Responsibly. JAMA 2023;329:1347-8. [Crossref] [PubMed]
Mello MM, Shah NH, Char DS. President Biden's Executive Order on Artificial Intelligence-Implications for Health Care Organizations. JAMA 2024;331:17-8. [Crossref] [PubMed]
Abràmoff MD, Tarver ME, Loyo-Berrios N, et al. Considerations for addressing bias in artificial intelligence for health equity. NPJ Digit Med 2023;6:170. [Crossref] [PubMed]
Chen A, Chen DO. Simulation of a machine learning enabled learning health system for risk prediction using synthetic patient data. Sci Rep 2022;12:17917. [Crossref] [PubMed]

doi: 10.21037/jhmhp-24-54
Cite this article as: Chen A, Liu L, Zhu T. Advancing the democratization of generative artificial intelligence in healthcare: a narrative review. J Hosp Manag Health Policy 2024;8:12.