Making sense of COVID-19 predictions: Part 2
A data scientist’s guide to coronavirus forecasts
In part one of this two-part blog post, we looked at the issues involved in aggregating data about COVID-19. As time passes, testing becomes more available, and workers hone their systems for communicating data, our understanding of COVID-19 is improving. There is little doubt that the COVID-19 data environment is evolving rapidly.
States and countries are now deciding the extent to which they’ll allow people back into their daily routines. But even as researchers resolve data quality issues, differences in modeling techniques are still causing some disparity between forecasts. In part two of the series, we’ll take a look at the methods researchers rely on to turn past data into predictions.
Most labs base their models on a handful of general types, although each research group has developed its own procedures and tweaks for coding its version. Teams release thousands of pages of new work each year, so understanding the precise details of each model isn’t practical, or even useful, for most audiences. But an overview of the general techniques can hopefully help you feel a bit more comfortable with the predictions you come across.
The technique receiving the most news coverage early on was a type of compartmental modeling, called SIR modeling. This technique is being used by Neil Ferguson at Imperial College London, for a model mentioned in part one of this series, which reportedly spurred strong action from both the U.K. and the White House. Compartmental models try to anticipate how a disease will travel between different groups. For COVID-19 forecasting, SIR models track how the disease spreads through the Susceptible (S), Infected (I), and Recovered (R) populations.
The entire population often starts as Susceptible, on the assumption that there is no natural immunity to the virus and that the body’s general defenses are not enough to prevent infection. (This appears to be the case with the novel coronavirus, although as with all things COVID-19, the jury is still out.) A group of people from the Susceptible group will first get the disease and move into the Infected group. As they encounter those remaining in the Susceptible population, the Infected population grows and the Susceptible population shrinks. Next, people will hopefully move into the Recovered population.
Much of the difficulty in implementing SIR models lies in determining how people move between the three groups. The most basic model assumes that each infected person will infect a constant number of susceptible people for each day or week that they have the disease, and that recovery takes a constant number of days. But in practice, researchers add in a host of different details to make the models look more like real life.
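To make the most basic version concrete, here is a minimal Python sketch of a day-by-day SIR simulation. It is an illustration under stated assumptions, not the Imperial College model: the function name `simulate_sir`, the population size, the 0.3 infections per day, and the 10-day recovery period are all made-up values. The sketch also scales new infections by the share of people who are still Susceptible, a common convention that keeps infected people from "infecting" those who have already had the disease.

```python
def simulate_sir(population, initial_infected, infections_per_day, recovery_days, days):
    """Step a basic SIR model forward one day at a time.

    All parameter values passed in below are illustrative assumptions,
    not estimates for COVID-19.
    """
    s = float(population - initial_infected)  # Susceptible
    i = float(initial_infected)               # Infected
    r = 0.0                                   # Recovered
    history = [(s, i, r)]
    for _ in range(days):
        # Each infected person infects a constant number of people per day,
        # scaled by the share of the population that is still Susceptible.
        new_infections = min(s, infections_per_day * i * s / population)
        # A constant recovery time means roughly 1/recovery_days of the
        # Infected group recovers each day.
        new_recoveries = i / recovery_days
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


history = simulate_sir(
    population=1_000_000,
    initial_infected=10,
    infections_per_day=0.3,
    recovery_days=10,
    days=200,
)
# Day on which the Infected group is largest.
peak_day = max(range(len(history)), key=lambda d: history[d][1])
print("Infections peak on day", peak_day)
```

Changing `infections_per_day` or `recovery_days` even slightly moves the peak noticeably, which is one reason forecasts built on the same framework can disagree.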
Because a person’s characteristics seem to change how COVID-19 affects them, researchers fill out their models with data about the population, allowing them to use specific rates for different groups of people. They might use a slower recovery rate for older people, for example, meaning that a person’s age would factor into how soon they move from the Infected population to the Recovered population. Certain members of the population, say people who take public transit, might also be more likely to spread the disease to one another, so details can be added to dictate which members of the Infected group are likely to pass the disease to which members of the Susceptible group.
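One way to picture these refinements is to give each group its own Susceptible, Infected, and Recovered counts, its own recovery time, and a table saying how often one group infects another. The Python sketch below does that for two hypothetical age groups. The group names, populations, recovery times, and the `daily_infections` table are invented for illustration and are not drawn from any published model.

```python
def simulate_grouped_sir(groups, daily_infections, days):
    """Run a day-by-day SIR model with separate compartments per group.

    daily_infections[(source, target)] is the assumed number of people in
    `target` that one infected person in `source` would infect per day if
    everyone in `target` were still Susceptible.
    """
    state = {
        name: {
            "S": float(g["population"] - g["infected"]),
            "I": float(g["infected"]),
            "R": 0.0,
        }
        for name, g in groups.items()
    }
    for _ in range(days):
        new_infections = {}
        for target, g in groups.items():
            # Total infection pressure on this group from every group.
            pressure = sum(
                daily_infections[(source, target)] * state[source]["I"]
                for source in groups
            )
            share_susceptible = state[target]["S"] / g["population"]
            new_infections[target] = min(state[target]["S"], pressure * share_susceptible)
        for name, g in groups.items():
            # Group-specific recovery time: older people recover more slowly here.
            recoveries = state[name]["I"] / g["recovery_days"]
            state[name]["S"] -= new_infections[name]
            state[name]["I"] += new_infections[name] - recoveries
            state[name]["R"] += recoveries
    return state


# Hypothetical groups and rates, chosen only to show the mechanics.
groups = {
    "under_65": {"population": 800_000, "infected": 10, "recovery_days": 10},
    "over_65": {"population": 200_000, "infected": 0, "recovery_days": 20},
}
daily_infections = {
    ("under_65", "under_65"): 0.25,
    ("under_65", "over_65"): 0.05,
    ("over_65", "under_65"): 0.05,
    ("over_65", "over_65"): 0.10,
}
final_state = simulate_grouped_sir(groups, daily_infections, days=120)
for name, counts in final_state.items():
    print(name, {k: round(v) for k, v in counts.items()})
```

Every extra group adds more rates that have to be estimated from data, which is part of why two teams using the same framework can still produce different forecasts.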
Parametric models are being created by teams like the University of Washington’s Institute for Health Metrics and Evaluation, whose forecast has also been cited a handful of times by the White House’s Coronavirus Response Coordinator. These models assume that the data follows a known statistical distribution, and simply try to estimate the inputs to the distribution.
The IHME refers to its methodology as “Curve Fitting,” as it involves finding the best line to summarize past data, then carrying the line forward to estimate how deaths will continue. The team has chosen to predict coronavirus deaths, under the assumption that death data will be more accurate than incidence data, but these types of models can also be used to predict the number of people who will have the disease. Parametric models are a time-tested statistical technique and are often a data scientist’s go-to strategy, in epidemiology or otherwise. Like the other model types mentioned here, they will usually produce a range of credible estimates to account for uncertainty in the input data.
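As a rough illustration of the parametric idea, the Python sketch below assumes that daily deaths follow a bell-shaped (Gaussian) curve over time. The logarithm of such a curve is a quadratic in time, so its three inputs can be estimated with ordinary least squares. This is not the IHME's code or exact method. The helper names are hypothetical, and the "observed" data is synthetic and noise-free, generated from a curve that peaks at 500 deaths on day 30.

```python
import math


def solve_3x3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [row[:] + [value] for row, value in zip(matrix, rhs)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n + 1):
                a[row][k] -= factor * a[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(a[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (a[row][n] - tail) / a[row][row]
    return x


def fit_bell_curve(days, daily_counts):
    """Least-squares fit of log(count) = a + b*t + c*t^2 (counts must be > 0)."""
    logs = [math.log(y) for y in daily_counts]
    power_sums = [sum(t ** k for t in days) for k in range(5)]
    matrix = [[power_sums[k + j] for j in range(3)] for k in range(3)]
    rhs = [sum(log_y * t ** k for t, log_y in zip(days, logs)) for k in range(3)]
    return solve_3x3(matrix, rhs)


def predict(params, t):
    """Carry the fitted curve forward (or backward) to day t."""
    a, b, c = params
    return math.exp(a + b * t + c * t * t)


# Synthetic data: only the first 25 days, all before the peak, are "observed".
observed_days = list(range(25))
observed_deaths = [500 * math.exp(-((t - 30) ** 2) / 200) for t in observed_days]

params = fit_bell_curve(observed_days, observed_deaths)
peak_day = -params[1] / (2 * params[2])  # vertex of the fitted quadratic
print("Estimated peak day:", round(peak_day, 2))
print("Estimated deaths on peak day:", round(predict(params, peak_day), 1))
```

With real, noisy data the fitted peak would shift as each new day of data arrives, and nothing guarantees that the true curve has the assumed shape in the first place.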
Survey-based forecasts, like some of those used by Carnegie Mellon’s Delphi lab, rely on the wisdom of crowds to make predictions. A significant body of research in recent years has shown that human predictions, when aggregated, can at times be more accurate than advanced mathematical, statistical, or machine learning techniques. The Delphi team’s survey-based model has won first place in three of the CDC’s flu prediction tasks since 2014, and these types of models have been shown to perform well on a wide range of problems.
Survey models ask participants to spend time researching whatever they’re trying to predict, in this case COVID-19 incidence, before answering a set of questions about the likelihood of different outcomes. Researchers then aggregate the participants’ predictions in some way. Some models survey both experts and laypeople, and some survey only one group. A variety of aggregation methods are used as well: some use the average of survey-takers’ predictions; some take the median; some give more weight to experts; some give more weight to high performers; and some do not do any weighting.
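The aggregation methods listed above are simple to express in code. The Python sketch below applies three of them to a made-up set of five survey responses. The predictions, the expert flags, and the choice to count each expert twice are assumptions for illustration, not how any particular lab weights its respondents.

```python
from statistics import mean, median

# Hypothetical survey responses: predicted new cases next week.
responses = [
    {"prediction": 1000, "expert": True},
    {"prediction": 1200, "expert": True},
    {"prediction": 1500, "expert": False},
    {"prediction": 900, "expert": False},
    {"prediction": 4000, "expert": False},
]


def weighted_mean(values, weights):
    """Average of values where each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


predictions = [resp["prediction"] for resp in responses]
# Assumed weighting scheme: an expert's answer counts twice as much.
weights = [2 if resp["expert"] else 1 for resp in responses]

simple_average = mean(predictions)
middle_value = median(predictions)
expert_weighted = weighted_mean(predictions, weights)

print("Average:", simple_average)
print("Median:", middle_value)
print("Expert-weighted average:", round(expert_weighted, 1))
```

The single very high response pulls the average well above the median, which shows why the choice of aggregation method can matter as much as who is surveyed.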
How to use coronavirus predictions
With some background on COVID-19 forecasts, it should be easier to understand why there are discrepancies between models. There’s no secret sauce to discerning the accuracy of a prediction, but we do have some general tips to help guide you through the onslaught of data science you’re seeing.
Be generally suspicious. There is a lot of debate in the forecasting community about the novel coronavirus, so strong claims are usually not warranted. Don’t place all your trust in one model.
Be especially suspicious of machine learning. You may have noticed that machine learning solutions were not mentioned in this article. ML models don’t rely on outside information about how the disease spreads, and they require a lot of data to learn on their own. That means they aren’t a good solution for modeling something with as little historical precedent as COVID-19.
Trust shorter-term predictions. Uncertainty compounds, so the further away a model is from real data, the more uncertain its predictions will be.
Look at the range of predictions a model makes. “Interval Estimates,” where a model comes up with a range of likely values, are a powerful tool that statisticians use to account for uncertainty. The middle prediction that a model comes up with will almost never be correct, but the true outcome will often fall in the Interval Estimate of a good model.
See where the modelers are getting their data. Some hyper-specific models, such as spread models based on flight patterns, are using trustworthy information.
Pay attention to the assumptions. Many models are used to explore what-if scenarios and will only be accurate if their assumptions are accurate. If a model offers rosy predictions by assuming people will strictly observe social distancing, that doesn’t mean that we shouldn’t worry about the virus and can now relax our social distancing measures.
Identify what models agree on and what they disagree on. Scientific consensus is a good indicator of how certain you should be about predictions.
Don’t be surprised if predictions change over time. Models can’t, and won’t, anticipate everything, especially large government actions or cultural shifts. A changed prediction may mean that the model is adapting to new circumstances.
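Two of the tips above, trusting shorter-term predictions and looking at the range a model produces, can be seen in a toy simulation. The Python sketch below assumes the daily growth rate of cases is uncertain, draws it many times at random, and reports the middle 95% of the resulting projections as an interval estimate. The function name, the starting case count, and the growth-rate distribution (mean 5% per day, standard deviation 2%) are invented for illustration.

```python
import random
from statistics import quantiles


def interval_estimate(current_cases, horizon_days, n_runs=2000, seed=0):
    """Return (low, middle, high) projections: the 2.5%, 50%, and 97.5% points."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    outcomes = []
    for _ in range(n_runs):
        # Assumed uncertainty: the true daily growth rate is unknown.
        growth = rng.gauss(0.05, 0.02)
        outcomes.append(current_cases * (1 + growth) ** horizon_days)
    # n=40 gives 39 cut points at 2.5%, 5%, ..., 97.5%.
    cuts = quantiles(outcomes, n=40)
    return cuts[0], cuts[19], cuts[-1]


one_week = interval_estimate(current_cases=1000, horizon_days=7)
four_weeks = interval_estimate(current_cases=1000, horizon_days=28)

for label, (low, middle, high) in [("7 days", one_week), ("28 days", four_weeks)]:
    print(f"{label}: middle {middle:,.0f}, interval {low:,.0f} to {high:,.0f}")
```

The same uncertainty about the growth rate produces a much wider interval four weeks out than one week out, because the error compounds with every additional day.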
About the Author
Isaac Slaughter is a Data Scientist in PK’s Intelligence and Analytics Practice. He has a background in statistics and has worked on predictive analytics projects for clients including Daimler and Cabi.
Tags: COVID-19, Data Science