Making sense of COVID-19 predictions: Part 1
A data scientist’s guide to coronavirus forecasts
If you weren’t familiar with data science before COVID-19, you likely are now. Everything from estimates of mortality rates to predictions of the demand for hospital beds is being plastered in charts and graphs across our screens. Some days it can feel as though the news, as well as life-and-death decisions by heads of state, is being driven by a strong faith in the accuracy of COVID-19 predictions. Data science is having its moment.
The models presented in the media can be highly fluid, with details that appear blurry or even contradictory. For example, a report from Imperial College London seemed to completely reverse course, initially predicting 500,000 deaths in the U.K., then revising that number to 20,000. With so many ostensibly incompatible results, even within a single lab, it can be confusing, to say the least.
Fortunately, the models, when taken together and understood well, are often not as contradictory as they seem. Most models will give ranges of predictions that are far more trustworthy than the headline attached to them. Models will also make different assumptions, such as the level of social distancing enacted, so comparing their predictions simply shows how things will turn out if we make different decisions. Understanding the data quality issues behind these models will help better frame some of the challenges data scientists and epidemiologists are encountering as they try to predict the future.
“Errors using inadequate data are much less than those using no data at all.” – Charles Babbage
Predictive models take things that happened in the past, make some assumptions about how they’ll continue into the future, and then draw the line forward. If we don’t know what happened in the past or don’t make the right assumptions about the relationship between past and future data, our predictions won’t be accurate.
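To make "drawing the line forward" concrete, here is a minimal Python sketch. It assumes simple constant daily growth, which is far cruder than any real epidemiological model, and the function name, starting case count, and growth rates are all hypothetical values chosen for illustration. The point is only that the same starting data leads to very different forecasts under different assumptions.

```python
def project_cases(current_cases, daily_growth_rate, days):
    """Project a case count forward, assuming a constant daily growth rate.

    Returns a list with one value per day, starting with day 0 (today).
    """
    return [current_cases * (1 + daily_growth_rate) ** day
            for day in range(days + 1)]


# Hypothetical inputs: 1,000 known cases today, projected 14 days ahead.
# The growth rates are illustrative assumptions, not measured values.
scenarios = {
    "no distancing (assumed 20% daily growth)": 0.20,
    "strong distancing (assumed 5% daily growth)": 0.05,
}

for label, rate in scenarios.items():
    final_count = project_cases(1000, rate, 14)[-1]
    print(f"{label}: about {final_count:,.0f} cases after 14 days")
```

The two scenarios start from identical data and differ only in the assumed growth rate, yet after two weeks the projections differ several times over. Neither forecast is "wrong"; they answer different what-if questions.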
So far, limited or poor data about the past is the main issue impacting predictions on the future of the coronavirus. There are several factors causing these data quality issues:
- Access to tests
- Measurement inaccuracies
- Lack of standardized reporting
Obstacles to accuracy in COVID-19 predictions
Access to tests
With the coronavirus, we’ve mostly collected data about people who have shown strong symptoms of the disease. Tests are still limited in availability across the U.S., so testing centers prioritize people who are the most in danger. We call this a “non-random sample,” where the people we have data on are not selected to represent the population as a whole, but rather to serve some other purpose. Non-random sampling is a type of biased sampling, and it forces professional epidemiologists to make educated guesses about the true values of inputs to their models, given that the information they have isn’t representative of the whole population.
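A small Python sketch shows how a non-random sample can distort a model input. It uses the fatality rate as the example input, and every number is made up for illustration: a hypothetical infected population in which only the severe cases get tested.

```python
def fatality_rate(groups):
    """Deaths divided by cases, pooled across the given groups."""
    cases = sum(group["cases"] for group in groups)
    deaths = sum(group["deaths"] for group in groups)
    return deaths / cases


# Hypothetical infected population, split by symptom severity.
# These counts are invented for illustration only.
mild = {"cases": 8000, "deaths": 8}
severe = {"cases": 2000, "deaths": 192}

true_rate = fatality_rate([mild, severe])   # everyone who is infected
observed_rate = fatality_rate([severe])     # only the people who were tested

print(f"True fatality rate:     {true_rate:.1%}")
print(f"Observed fatality rate: {observed_rate:.1%}")
```

Under these assumed numbers, the rate measured from tested patients is several times higher than the rate across everyone infected, simply because the mild cases never enter the data.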
The fact that we aren’t testing everyone presents another issue: we don’t have a good sense of how many people are asymptomatic. This also affects many of our inputs, like the infection rate: the number of people each patient spreads the disease to. If a person tests positive for the coronavirus and their spouse does as well, then that person appears to have an infection rate of 1. But what if their two children had the disease as well, but didn’t show any symptoms and hence didn’t receive testing? In reality, the person’s infection rate would be 3. Again, epidemiologists are making guesses about values like these based on prior experience.
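The household example can be written out as a short Python sketch. The Contact class and its field names are hypothetical, introduced only to illustrate the gap between what testing records and what actually happened.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    """One person the original patient was in contact with (hypothetical)."""
    name: str
    infected: bool
    tested: bool


household = [
    Contact("spouse", infected=True, tested=True),
    Contact("child 1", infected=True, tested=False),  # asymptomatic, never tested
    Contact("child 2", infected=True, tested=False),  # asymptomatic, never tested
]


def observed_infections(contacts):
    """Infections that show up in the data: infected AND tested."""
    return sum(1 for c in contacts if c.infected and c.tested)


def actual_infections(contacts):
    """Infections that really occurred, whether or not anyone tested for them."""
    return sum(1 for c in contacts if c.infected)


print("Observed infection rate:", observed_infections(household))
print("Actual infection rate:  ", actual_infections(household))
```

The recorded value is 1 while the real value is 3, matching the example above. A model fed the recorded value would underestimate how quickly the disease spreads.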
Measurement inaccuracies
Tests for the coronavirus have shown an unusually high “false negative” rate, where someone who is infected is incorrectly told that they don’t have COVID-19. This might be happening because the sample that was taken didn’t pick up any of the virus (the coronavirus is primarily a respiratory disease, and it’s difficult to access the places where it typically lives) or because there’s an inherent problem with the test. Regardless of the reason, there have been credible estimates for a general false-negative rate as high as 30 percent.
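As a rough illustration of what a 30 percent false-negative rate means for the data, the Python sketch below adjusts a reported count of positive tests. It rests on two simplifying assumptions that are not from the article: the test never produces false positives, and the false-negative rate is the same for every person tested. The function name and the count of 700 positives are hypothetical.

```python
def estimate_infected_among_tested(reported_positives, false_negative_rate):
    """Estimate how many tested people were really infected.

    Assumes no false positives, so every reported positive is real, and
    that a fixed share of infected people test negative by mistake.
    """
    if not 0 <= false_negative_rate < 1:
        raise ValueError("false_negative_rate must be in the range [0, 1)")
    # Only (1 - false_negative_rate) of infected people are detected.
    return reported_positives / (1 - false_negative_rate)


# Hypothetical example: 700 positive tests, 30 percent false-negative rate.
reported = 700
estimated = estimate_infected_among_tested(reported, 0.30)
missed = estimated - reported

print(f"Reported positives:     {reported}")
print(f"Estimated infected:     {estimated:.0f}")
print(f"Missed by the test:     {missed:.0f}")
```

Under these assumptions, 700 reported positives would correspond to roughly 1,000 infected people among those tested, with about 300 wrongly told they were negative.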
Lack of standardized reporting
Governments and health workers are still developing standardized methods to record when a person was diagnosed with or died from COVID-19. Because of the inaccuracy and limited availability of tests, some countries are relying on the presence of symptoms to count a person as having the disease. Different clinicians are making different decisions about whether to report a death as coronavirus-related if the patient also had an underlying condition. There’s also debate about whether to classify a death as coronavirus-related if a person did not have COVID-19, but died because they couldn’t access medical equipment, like a ventilator, that was being used by a COVID-19 patient.
While the communication lines are improving, throughout the crisis governments and media outlets have used makeshift systems to collect data from hospitals and morgues. As an example, the dashboard provided by Johns Hopkins’s Center for Systems Science and Engineering relies on an array of sources, such as press briefings, reports, tweets, and Facebook posts, all from trustworthy health departments and media sources. There are still some discrepancies between major data sources, such as the U.S. CDC, the COVID Tracking Project, and The New York Times; however, the datasets are becoming more homogeneous as the processes mentioned here improve.
Forecasting the novel coronavirus is very, very hard. It requires data being captured on the fly and a rigorous understanding of a disease that’s only existed for a few months. Poor data quality only accounts for some of the disparate predictions that are being passed around.
In part 2 of this 2-part article, we’ll look at the problems researchers face when making models to give you greater insight into how to put them to use.
About the Author
Isaac Slaughter is a Data Scientist in PK’s Intelligence and Analytics Practice. He has a background in statistics and has worked on predictive analytics projects for clients including Daimler and Cabi.