An independent study published in Nature Medicine on February 24, 2026, reveals that ChatGPT Health, OpenAI’s AI-powered health guidance tool, fails to recommend emergency care in more than half of simulated medical emergencies. The findings raise urgent safety concerns as millions of Americans increasingly rely on AI for health advice.
Study Details and Key Findings
Researchers at the Icahn School of Medicine at Mount Sinai conducted a rigorous evaluation involving 60 clinical scenarios across 21 medical specialties, ranging from minor ailments to life-threatening emergencies. Three independent physicians, drawing on guidelines from 56 medical societies, established the correct level of urgency for each case. The team then tested ChatGPT Health under 16 contextual variations per scenario (such as changes in patient race, gender, and social dynamics), for a total of 960 interactions.
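To make the protocol concrete, the sketch below shows how a triage benchmark of this shape (physician-labeled scenarios crossed with contextual variations) might be scored. The scenario fields, urgency labels, and query_model call are hypothetical placeholders for illustration, not the study’s actual harness, prompts, or data.

```python
# Minimal sketch of a triage-benchmark loop, assuming hypothetical data.
# `query_model` is a placeholder; the study's real prompts, labels, and
# model interface are not reproduced here.
from dataclasses import dataclass
from itertools import product

# Urgency levels ordered from least to most urgent (illustrative labels).
URGENCY = ["self_care", "routine_visit", "eval_24_48h", "emergency_dept"]

@dataclass
class Scenario:
    vignette: str         # clinical case description
    physician_label: str  # ground-truth urgency set by a physician panel

def query_model(prompt: str) -> str:
    """Placeholder: send `prompt` to the tool under test and map its
    advice onto one of the URGENCY levels."""
    raise NotImplementedError

def under_triage_rate(scenarios: list[Scenario], variations: list[str]) -> float:
    """Share of true emergencies where the model advised anything less
    urgent than the emergency department."""
    emergencies = missed = 0
    # 60 scenarios x 16 variations = 960 interactions, as in the study design.
    for scenario, variation in product(scenarios, variations):
        predicted = query_model(variation.format(case=scenario.vignette))
        if scenario.physician_label == "emergency_dept":
            emergencies += 1
            if URGENCY.index(predicted) < URGENCY.index("emergency_dept"):
                missed += 1
    return missed / emergencies if emergencies else 0.0
```

Under this framing, the study’s headline figure corresponds to an under-triage rate of 0.52 among the emergency-labeled interactions.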
The results were alarming: ChatGPT Health under-triaged 52% of cases that physicians deemed true emergencies. In scenarios involving diabetic ketoacidosis or impending respiratory failure, the AI advised evaluation within 24–48 hours rather than immediate emergency department care. While the tool performed adequately in textbook emergencies like stroke and anaphylaxis, it struggled with more nuanced, rapidly escalating conditions.
Suicide Risk Alerts and Inconsistencies
The study also uncovered troubling inconsistencies in ChatGPT Health’s crisis intervention system. Although the tool is designed to direct users to the 988 Suicide and Crisis Lifeline in high-risk situations, alerts appeared more reliably when users described vague self-harm thoughts than when they articulated a concrete plan. Dr. Girish Nadkarni, Mount Sinai’s Chief AI Officer, described this inversion of risk and safeguard activation as “beyond inconsistency.”
Expert Reactions and Safety Concerns
Experts have voiced strong concerns about the potential for harm. Alex Ruani, a doctoral researcher at University College London, called the results “unbelievably dangerous,” noting that in one asthma scenario, ChatGPT Health advised waiting rather than seeking emergency treatment despite clear signs of respiratory failure. In another simulation, 84% of responses sent a suffocating patient to a future appointment they would likely not survive to attend.
Lead author Dr. Ashwin Ramaswamy emphasized the study’s core question: “If someone is having a real medical emergency and asks ChatGPT Health what to do, will it tell them to go to the emergency department?”
OpenAI’s Response and Context
OpenAI acknowledged the study and welcomed independent evaluations of its systems. However, a spokesperson argued that the study does not reflect how people typically use ChatGPT Health in real life and emphasized that the model is continuously updated and refined.
The study arrives amid rapid consumer adoption of AI health tools. OpenAI reports that approximately 40 million people use ChatGPT for health-related questions every day, and that over 200 million users ask at least one health-related prompt each week. Meanwhile, ECRI, a nonprofit patient safety organization, has ranked misuse of AI chatbots in healthcare as the top health technology hazard of early 2026.
Implications for Stakeholders
Patients and Consumers
Millions of users rely on ChatGPT Health for medical guidance. Under-triaging emergencies could delay critical care, potentially resulting in severe harm or death.
Healthcare Providers
Clinicians may need to reinforce that AI tools are not substitutes for professional judgment. The study underscores the importance of verifying AI-generated advice, especially in ambiguous or high-stakes situations.
Policymakers and Regulators
The findings highlight the urgent need for independent auditing, safety standards, and oversight of AI health tools. Regulators may consider requiring transparency in triage logic and performance metrics.
AI Developers
OpenAI and other developers must address the tool’s blind spots, particularly in nuanced emergencies and crisis detection. Future iterations should prioritize safety and consistency across diverse scenarios.
Future Directions and Research
The Mount Sinai team plans to continue evaluating updated versions of ChatGPT Health and other consumer AI tools. Future research will expand into pediatric care, medication safety, and non-English-language use.
Independent oversight and real-world testing are essential to ensure AI tools support, rather than endanger, public health.
Conclusion
The study’s findings are a stark reminder that AI health tools, while promising, are not infallible. ChatGPT Health’s under-triaging of over half of simulated medical emergencies and inconsistent suicide-risk alerts expose critical safety gaps. As millions of Americans turn to AI for health guidance, developers, regulators, and clinicians must act swiftly to ensure these tools are reliable, transparent, and safe.
Frequently Asked Questions
What does “under-triaged” mean in this context?
In this context, under-triage means the AI recommended less urgent care, such as a 24–48-hour follow-up, instead of immediate emergency department evaluation when the situation required urgent attention.
How many scenarios did the study test?
The study evaluated 60 clinical scenarios across 21 specialties, each tested under 16 contextual variations, for a total of 960 interactions.
Did ChatGPT Health perform well in any cases?
Yes. The tool handled textbook emergencies like stroke and anaphylaxis correctly. However, it struggled with more complex or subtle emergencies.
Were there demographic biases in the AI’s performance?
The study found no statistically significant differences in triage outcomes based on patient race, gender, or socioeconomic barriers. However, the confidence intervals did not rule out clinically meaningful differences.
What about suicide-risk detection?
ChatGPT Health’s crisis alerts were inconsistent: they appeared more reliably when users described vague self-harm thoughts than when users described concrete plans, an inversion of the expected risk response.
What steps are being taken to improve safety?
Researchers plan further evaluations covering pediatric care, medication safety, and non-English-language contexts. OpenAI continues to refine the model, and experts are calling for independent auditing and safety standards.