The Healthcare Vision of ChatGPT-4o and Multimodal LLMs
Medical AI can’t interpret complex cases yet. The arrival of multimodal large language models like ChatGPT-4o starts the real revolution.

Key Takeaways
The development of multimodal large language models (M-LLMs) is crucial for the future of medicine as they can process and interpret multiple types of data simultaneously, unlike current unimodal AI systems. This will enable comprehensive analysis in medicine, facilitate communication between healthcare providers and patients speaking different languages, and serve as a central hub for various unimodal AI applications in hospitals.
While the public debut of Large Language Models (LLMs) like ChatGPT has been a resounding success, current AI systems lack the capability to process multiple types of data, making them inadequate for the multimodal nature of medicine. The transition to M-LLMs will be necessary to substantially reduce the workload of healthcare professionals.
This journey is challenging but necessary to move medical AI from being a ‘calculator’ to matching the ‘supercomputers’ we call doctors.
This article was originally published on 5 September 2023 and updated in May 2024 to include significant new developments in the segment.
The future of medicine is inextricably linked to the development of artificial intelligence (AI). Although this revolution has been brewing for years, the past few months marked a major change, as algorithms finally moved out of the specialized labs and into our daily lives.
This revolution accelerated as major tech companies began rolling out their multimodal large language models, promising they will soon be available to the general public. The latest, and perhaps biggest, hit was the announcement of ChatGPT-4o from OpenAI. The 4o model is described as “natively multimodal,” a feature also claimed for Google’s Gemini at its launch. However, while everyday subscribers still lack access to Gemini’s multimodal features, 4o’s partial multimodality is available, albeit with limited use, even for free accounts.
So how did we get here, and why is this important? Let’s review the path we’ve traveled over the past 18 months and then look ahead, so we can understand the significance!
The public debut of Large Language Models (LLMs), like ChatGPT which became the fastest-growing consumer application of all time, has been a roaring success. LLMs are machine learning models trained on a vast amount of text data which enables them to understand and generate human-like text based on the patterns and structures they’ve learned. They differ significantly from prior deep learning methods in scale, capabilities, and potential impact.

Large language models will soon find their way into everyday clinical settings, simply because the global shortage of healthcare personnel is becoming dire and AI will lend a hand with tasks that do not require skilled medical professionals. But even before this can happen, and before we have a sufficiently robust regulatory framework in place, we are already seeing how this new technology is being used in everyday life.
To better understand what lies ahead, let’s explore another key concept that will play a significant role in the transformation of medicine: multimodality.
Doctors and nurses are supercomputers, medical AI is a calculator
A multimodal system can process and interpret multiple types of input data, such as text, images, audio, and video, simultaneously. Current medical AIs only process one type of data, for example, text or X-ray images.
However, medicine is by nature multimodal, and so are humans. To diagnose and treat a patient, a healthcare professional listens to the patient, reads their health files, looks at medical images and interprets laboratory results. This is far beyond what any AI is capable of today.
The difference between the two can be likened to the difference between a runner and a pentathlete. A runner excels in one discipline, whereas a pentathlete must excel in multiple disciplines to succeed.
Most current Large Language Models (LLMs) are the runners: they are unimodal. Humans in medicine are the pentathlon champions.

At the moment, most Large Language Models (LLMs) are unimodal, meaning they can only analyze text. GPT-4 can analyze images and understand voice commands in the phone app, and so can ChatGPT-4o. These models can also generate images. The rest of the multimodal capabilities are not yet available to everyday subscribers. Other widely used LLMs, like Google’s Gemini or Claude AI, can interpret image prompts (such as a chart), but can’t generate image responses yet. Meanwhile, Google is reportedly working on pioneering the medical large language model arena with a range of models, including the latest: Med-Gemini.
All in all, from The Medical Futurist’s perspective, it’s clear that multimodal LLMs (M-LLMs) with full functionality will arrive soon; otherwise, AI won’t be able to contribute meaningfully to the inherently multimodal practice of medicine and care. These systems will considerably reduce the workload of human healthcare professionals, but they will not replace them.
The future is M-LLMs
The development of M-LLMs will have at least three significant consequences:
1. AI will handle multiple types of content, from images to audio
An M-LLM will be able to process and interpret various kinds of content, which is crucial for comprehensive analysis in medicine. We could list hundreds of examples of the benefits of such a system, but will mention only a few in the following five categories (a brief illustrative sketch follows the list):
- Text analysis: M-LLMs will be capable of handling a vast amount of administrative, clinical, educational and marketing tasks, from updating electronic medical records to solving case studies.
- Image analysis: another broad area in terms of potential use cases, spanning from reading handwritten notes to analysing radiology, ophthalmology, neurology, pathology and other medical images.
- Sound analysis: M-LLMs will eventually become competent in disease monitoring, such as checking heart and lung sounds for abnormalities to support early detection, but sounds can also provide valuable information in mental health and rehabilitation applications.
- Video analysis: an advanced algorithm will be able to guide a medical student in virtual reality surgical training on how to aim, move and proceed precisely, but videos could also be used to detect neurological conditions or to support patients communicating in sign language.
- Complex document analysis: this will include assistance in literature review and research, analysis of medical guidelines for clinical decision-making, and clinical coding, among many other forms of use.
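To make the idea a bit more tangible, here is a minimal sketch of what a single multimodal request could look like in code. It assumes the official OpenAI Python SDK and an API key; the file name, the fictional clinical note and the prompt are purely illustrative, and nothing here is a diagnostic tool.

```python
import base64
from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical, de-identified sample image; the file name is illustrative only.
with open("sample_scan.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            # Text modality: a short, fictional clinical note
            {"type": "text",
             "text": "Clinical note: 57-year-old, persistent cough for three weeks. "
                     "Describe what additional information the attached image provides."},
            # Image modality: the same request carries the picture alongside the text
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

The point is simply that text and image travel in one request and are interpreted together, which is exactly what unimodal systems cannot do.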
2. It will break language barriers
These M-LLMs will easily facilitate communication between healthcare providers and patients who speak different languages, translating between various languages in real time, just as we’ve seen with ChatGPT-4o’s live translation. It’s obvious what potential the removal of language barriers holds for medical appointments.
Specialist: “Can you please point to where it hurts?”
M-LLM (Translating for Patient): “¿Puede señalar dónde le duele?”
Patient points to lower abdomen.
M-LLM (Translating for Specialist): “The patient is pointing to the lower abdomen.”
Specialist: “On a scale from 1 to 10, how would you rate your pain?”
M-LLM (Translating for Patient): “En una escala del 1 al 10, ¿cómo calificaría su dolor?”
Patient: “Es un 8.”
M-LLM (Translating for Specialist): “It is an 8.”
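As a rough illustration of how such an interpreter could be wired up, here is a minimal, text-only sketch using the OpenAI Python SDK. It is an assumption-laden toy: the model name and prompts are just examples, and a real consultation assistant would handle live speech in both directions, as GPT-4o demonstrated, rather than typed text.

```python
from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def interpret(utterance: str, source: str, target: str) -> str:
    """Translate a single utterance between the clinician's and the patient's languages."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": f"You are a medical interpreter. Translate the {source} utterance "
                        f"into {target}, preserving the clinical meaning exactly. "
                        f"Return only the translation."},
            {"role": "user", "content": utterance},
        ],
    )
    return response.choices[0].message.content

# Clinician to patient, then patient back to clinician
print(interpret("Can you please point to where it hurts?", "English", "Spanish"))
print(interpret("Es un 8.", "Spanish", "English"))
```

The appeal of this design is that the same model handles both directions of the conversation, so no separate translation engine needs to be integrated into the consultation.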
3. Finally, the arrival of interoperability can connect and harmonise various hospital systems
An M-LLM could serve as a central hub that facilitates access to various unimodal AIs used in the hospital, such as radiology software, insurance handling software, Electronic Medical Records (EMR), etc. The situation today is as follows:
One company manufactures the software that the radiology department uses for its AI-supported daily work. Another company’s algorithm works with the hospital’s electronic medical records, and yet another third-party supplier creates AI to compile insurance reports. However, doctors typically only have access to the system strictly related to their field: a radiologist has access to the radiological AI, but a cardiologist does not. And of course, these algorithms don’t communicate with each other. If the cardiology department used an algorithm that analysed heart and lung sounds, gastroenterologists or psychiatrists very likely wouldn’t have access to it, even though its findings may be useful for their diagnoses as well.
The significant step will be when M-LLMs eventually become capable of understanding the language and format of all these software applications and of helping people communicate with them. An average doctor will then be able to easily work with the radiological AI software, the AI software managing the EMRs, and every other AI used in the hospital.
This potential is very important because such a breakthrough won’t come about in any other way. No single company will come up with such software, because no one vendor has access to the data and AI systems developed by all the others. The M-LLM, however, will be able to communicate with these systems individually and, as a central hub, will provide a tool of immense importance to doctors.
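The “central hub” idea can be sketched in a few lines of Python. Everything below is hypothetical: the departmental adapters are placeholders rather than real products, and a real M-LLM hub would use the model itself to detect intent and to translate between the formats the individual systems expect, instead of the simple keyword routing shown here.

```python
from typing import Callable, Dict

# Hypothetical adapters for the departmental systems mentioned above; the names
# and return values are placeholders, not real products or APIs.
def radiology_ai(query: str) -> str:
    return f"[radiology AI] findings for: {query}"

def emr_ai(query: str) -> str:
    return f"[EMR AI] record summary for: {query}"

def insurance_ai(query: str) -> str:
    return f"[insurance AI] draft report for: {query}"

ADAPTERS: Dict[str, Callable[[str], str]] = {
    "radiology": radiology_ai,
    "emr": emr_ai,
    "insurance": insurance_ai,
}

def hub(request: str) -> str:
    """Toy 'central hub': decide which departmental system a request concerns,
    forward it, and return the answer in one place."""
    lowered = request.lower()
    if "x-ray" in lowered or "scan" in lowered:
        target = "radiology"
    elif "insurance" in lowered or "claim" in lowered:
        target = "insurance"
    else:
        target = "emr"
    return ADAPTERS[target](request)

print(hub("Summarise the chest X-ray for the patient in bed 12"))
```

The design choice matters more than the code: clinicians talk to one interface in natural language, and the hub takes on the burden of speaking to each isolated system in its own format.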
The transition from unimodal to multimodal AI is a necessary step to fully harness the potential of AI in medicine. By developing M-LLMs that can process multiple types of content, break language barriers, and facilitate access to other AI applications, we can revolutionize the way we practice medicine. The journey from being a calculator to matching the supercomputers we call doctors is challenging, but it is a revolution happening in front of our eyes.