Nightingale Open Science data show what actually happened to patients rather than relying on doctors’ notes
Ziad Obermeyer launched Nightingale Open Science last month. He has made the data sets free to use © Handout
Madhumita Murgia in London
Ziad Obermeyer, a physician and machine learning scientist at the University of California, Berkeley, launched Nightingale Open Science last month — a treasure trove of unique medical data sets, each curated around an unsolved medical mystery that artificial intelligence could help to solve.
The data sets, released after the project received $2m of funding from former Google chief executive Eric Schmidt, could help to train computer algorithms to predict medical conditions earlier, triage better and save lives.
The data include 40 terabytes of medical imagery, such as X-rays, electrocardiogram waveforms and pathology specimens, from patients with a range of conditions, including high-risk breast cancer, sudden cardiac arrest, fractures and Covid-19. Each image is labelled with the patient’s medical outcomes, such as the stage of breast cancer and whether it resulted in death, or whether a Covid patient needed a ventilator.
Obermeyer has made the data sets free to use. He worked mainly with hospitals in the US and Taiwan to build them over two years, and plans to expand the project to Kenya and Lebanon in the coming months to reflect as much medical diversity as possible.
“Nothing exists like it,” said Obermeyer, who announced the new project in December alongside colleagues at NeurIPS, the global academic conference for artificial intelligence. “What sets this apart from anything available online is the data sets are labelled with the ‘ground truth’, which means with what really happened to a patient and not just a doctor’s opinion.”
This means that data sets of ECGs from cardiac arrest patients, for example, are labelled not according to whether a cardiologist detected something suspicious, but according to whether the patient eventually had a heart attack. “We can learn from actual patient outcomes, rather than replicate flawed human judgment,” Obermeyer said.
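To make the distinction concrete, here is a minimal sketch of outcome-based labelling versus opinion-based labelling, under stated assumptions: the column names, file names and tabular layout are hypothetical for illustration, not Nightingale’s actual schema.

```python
# Sketch: "ground truth" (outcome) labels vs clinician-opinion labels.
# All column names and values are hypothetical and synthetic.
import pandas as pd

records = pd.DataFrame({
    "patient_id":         [1, 2, 3],
    "ecg_waveform_file":  ["ecg_001.npy", "ecg_002.npy", "ecg_003.npy"],
    "cardiologist_flag":  [True, False, False],   # subjective read of the ECG
    "had_cardiac_arrest": [False, False, True],   # what actually happened
})

# Opinion-based target: a model trained on this learns to replicate the
# cardiologist's judgment, including misses (patient 3) and false alarms
# (patient 1).
y_opinion = records["cardiologist_flag"]

# Outcome-based target: a model trained on this learns from the patient's
# real trajectory, which is the labelling approach the article describes.
y_outcome = records["had_cardiac_arrest"]
```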
In the past year, the AI community has shifted from collecting “big data” — as much data as possible — to gathering meaningful data: information that is curated and relevant to a specific problem, and that can be used to tackle issues such as ingrained human biases in healthcare, image recognition and natural language processing.
Many healthcare algorithms have been shown to amplify existing health disparities. For instance, Obermeyer found that an AI system used by hospitals treating up to 70m Americans, which allocated extra medical support to patients with chronic illnesses, was prioritising healthier white patients over sicker black patients who needed the help. It assigned risk scores based on data that included an individual’s total healthcare costs in a year; in effect, the model used healthcare costs as a proxy for healthcare needs.
The crux of this problem, which was reflected in the model’s underlying data, is that not everyone generates healthcare costs in the same way. Minorities and other underserved populations may lack access to and resources for healthcare, be less able to get time off work for doctors’ visits, or experience discrimination within the system by receiving fewer treatments or tests, which can lead to them being classed as less costly in data sets. This doesn’t necessarily mean they have been less sick.
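As an illustration of why a cost proxy encodes this disparity, the following is a minimal sketch with entirely synthetic numbers; it is not the scoring logic of any real system.

```python
# Sketch of the proxy problem: two patients with identical medical need
# but different access to care generate different costs, so a model that
# predicts cost ranks them differently. All figures are synthetic.

patients = [
    # (name, true_need_score, annual_cost_usd)
    ("patient_A", 8.0, 12_000),  # good access: need shows up as spending
    ("patient_B", 8.0,  4_000),  # poor access: same need, less spending
]

for name, need, cost in patients:
    # Hypothetical cost-based risk score: dollars stand in for sickness.
    risk_score = cost / 1_000
    print(f"{name}: true need={need}, cost-based risk={risk_score:.1f}")

# patient_B comes out as "lower risk" despite identical need: the bias
# lives in the choice of label (cost), not in the learning algorithm.
```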
The researchers calculated that nearly 47 per cent of black patients should have been referred for extra care, but the algorithmic bias meant that only 17 per cent were.
“Your costs are going to be lower even though your needs are the same. And that was the root of the bias that we found,” Obermeyer said. He found that several other similar AI systems also used cost as a proxy, a decision that he estimates is affecting the lives of about 200m patients.