Understanding Data During a Pandemic

Are you tired of the phrases “exponential growth”, “unprecedented” and “flatten the curve”? I sure am. We’re being inundated daily with data and news reports with bold claims about how many people carry the infection, death rates, potential cures, and so much more.

Researchers are publishing new studies at an astounding rate, giving our media ample material to create headlines that inspire hope (“NY doctor reports 100% success treating coronavirus with drug combination”) and despair (“Imperial College Epidemiologists Report Forecasts up to 2.2 Million COVID-19 Deaths in US, 510,000 in UK”). The data landscape is shifting rapidly, so reading articles thoughtfully can help us wade through the deluge of information to better understand the current situation and where this pandemic might take us.

There are three main categories of data-related topics coming out: 

  1. Reports on the numbers of cases and deaths that rely on COVID-19 testing
  2. Projections of hospital capacity, deaths, and timelines
  3. Medical research on treatments and risk factors

Unfortunately, many reports on these topics highlight only a piece of the research, ignoring the limitations the authors themselves identified. So here we detail some major limitations in each of these three categories of data so that you can engage more thoughtfully in conversations about pandemic data.

Think About the Data

First, take a quick refresher on the five things that we should consider about a reported data point.

The COVID-19 pandemic is a unique data situation because everything is moving so quickly. Existing data collection strategies are difficult to scale to the rapidly increasing number of cases, methods don’t line up between different places, tests vary in accuracy, testing protocols vary by hospital, and many more issues are at play. We developed four major tips to keep in mind when you encounter a story that uses data.

Tip 1 for Understanding the Data: Every Place is Different and Changing

Every country reports data differently, and some are even changing their reporting methods and definitions as the pandemic continues. For a well-known, non-coronavirus example, infant mortality rates aren’t directly comparable between countries because of how each country defines “infant”. In the U.S., any live birth counts as an infant, while some countries apply weight or gestational-age thresholds. That changes the entire sample of infants and leads to lower reported mortality rates (assuming extremely premature babies die at a higher rate than healthy infants).

So, comparing U.S. coronavirus trends to other countries is difficult because we often don’t yet know how this data is being collected and reported. Officials are starting to be more transparent about this issue, revealing that death tolls are much higher in some countries because of shortcomings in the data collection process, such as France excluding nursing home deaths from its reporting for many weeks.

For a more local example, the State of Michigan has reported daily case counts since the first reported case on March 10th. We’ve been collecting the daily counts by county to feed into a tool that reports cases per capita and helps compare Detroit to surrounding counties. However, a few days ago, Michigan stopped reporting new cases by county, keeping the breakdown only for cumulative cases.

Not a problem, we thought: we could just subtract yesterday’s cumulative total from today’s and get a new case count! Unfortunately, that didn’t work, because the state occasionally revises case numbers from previous days, meaning that some counties that previously reported cases no longer had them once those cases were reassigned to other counties.
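A minimal sketch of why that naive subtraction fails, using invented county figures (not real Michigan data):

```python
# Hypothetical cumulative case counts for two consecutive days.
# All numbers are made up for illustration only.
yesterday = {"Wayne": 120, "Oakland": 80, "Macomb": 60}
today = {"Wayne": 135, "Oakland": 78, "Macomb": 66}  # Oakland revised downward


def naive_new_cases(prev, curr):
    """Subtract yesterday's cumulative total from today's, per county."""
    return {county: curr[county] - prev[county] for county in curr}


new_cases = naive_new_cases(yesterday, today)
# Oakland comes out at -2 "new" cases: a revision moved previously
# reported cases elsewhere, so the subtraction no longer measures
# actual new infections.
```

Negative “new case” counts like Oakland’s are the telltale sign that the underlying cumulative series was revised, not that infections reversed.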

While the state is entirely transparent about the rapidly changing nature of the situation, most news reports don’t pass along the multiple caveats attached to the data. Which brings us to our next tip: know the limitations of the data you’re reading about.

Tip 2 for Understanding the Data: Know the Limitations

When reading about data, it’s important to know its limitations. Right now, coronavirus data has many limitations around the world. Many countries are only testing the sickest people, which skews the number of positive tests, makes it difficult to estimate the true case load, and makes it almost impossible to realistically estimate the number of asymptomatic cases. Tests also take varying lengths of time to process. In Detroit, the lag time on testing is about one week, which is why the city is piloting 15-minute tests for first responders and bus drivers. Some more rural communities have to ship their test swabs to larger cities, creating an even longer lag before a case is reported through official channels.

In Michigan, the testing data reported doesn’t match up with the case data. The state’s website attributes this to not including data from every laboratory, and to multiple tests being run for a single individual. The state also reports the race and ethnicity of individuals with confirmed cases and deaths, but as of April 6th, a significant number of cases are missing this information, which makes the report less reliable.

Additionally, in many places being “recovered” from coronavirus requires two negative tests, which means that at minimum three tests are run on a single individual’s specimens (one to confirm a positive, then two negative tests). Meanwhile, there is a growing number of anecdotal reports of false negative tests in the US, which would further skew testing data if someone has to be tested three or four times before receiving a positive result.

The qualifications on reported data are numerous. It’s always a good idea to dig a little deeper than a headline to see what the data source says the limitations are on what is being reported.

Think About the Projections

When data quality is limited, the models built on that data are limited too. Models are predictions, much like weather forecasts, with varying degrees of accuracy depending on the data being used, the mathematical model selected, and the expertise of the people making the assumptions. The fact is: no one knows how many people are going to die or get sick; therefore, no one knows when ICU beds will be full or when a particular community will run out of ventilators. We’ll only know these numbers for certain after the pandemic settles down.

Tip 3 for Understanding the Data: Predictions are Never Certain

Even when there is a lot of high-quality data, projections are inherently difficult because they always rely on assumptions. With little reliable data to model with, it is especially difficult to accurately predict the outcomes of the coronavirus pandemic.

In this case, data scientists are using their professional expertise to choose what mortality rates, infection rates, and other inputs to feed into their models. FiveThirtyEight does a great job explaining the difficulties of modeling the current pandemic.

The Imperial College projection of 2.2 million deaths in the United States relies on the assumption that the country would do nothing in terms of mitigation. The Coronavirus Task Force recently released projections that 100,000-240,000 Americans will die even with our current “stay at home” rules. The University of Washington’s Institute for Health Metrics and Evaluation (IHME) assumes social distancing through the end of May and projects over 90,000 deaths.

The assumption about how Americans will behave is just one of many that researchers make as they build models to predict the future. A slight change in a single assumption can alter the projections significantly, which is why experts’ best- and worst-case scenarios vary so widely.
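To see how sensitive projections are to a single assumption, here is a toy exponential-growth sketch. The numbers are invented, and real epidemiological models are far more sophisticated, but the lesson carries over:

```python
def projected_cases(initial_cases, daily_growth_rate, days):
    """Toy model: cases compound at a fixed daily growth rate."""
    return initial_cases * (1 + daily_growth_rate) ** days


# Two modelers start from the same 1,000 cases but assume slightly
# different daily growth rates.
base = projected_cases(1_000, 0.20, days=30)  # roughly 237,000 cases
alt = projected_cases(1_000, 0.25, days=30)   # roughly 808,000 cases
# A five-point difference in one assumption more than triples
# the 30-day projection.
```

Because growth compounds, small disagreements about inputs blow up into large disagreements about outcomes, which is exactly why published projections diverge so much.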

Tip 4 for Understanding the Data: Account for Margins of Error 

An important piece of the equation is missing from almost all of the reporting on projections: margins of error. A margin of error accompanies an estimate and tells us how precise researchers believe their projection is, given the various inputs and assumptions they’re making. Read our blog about margins of error and other reliability measures to learn more about how they’re used in real life.
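As a simple illustration of the idea (this is the classic margin of error for a sampled proportion, not the method IHME or other epidemiological modelers actually use), a few lines suffice:

```python
import math


def margin_of_error(p, n, z=1.96):
    """95% margin of error for a proportion p estimated from n samples."""
    return z * math.sqrt(p * (1 - p) / n)


# If 5% of 1,000 sampled tests came back positive, the estimate would
# be reported as 5% +/- about 1.35 percentage points.
moe = margin_of_error(0.05, 1_000)
```

Notice that the margin shrinks as the sample grows: with limited testing, samples are small and skewed, so the honest uncertainty around any pandemic estimate is wide.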

While most of the reporting doesn’t include margins of error, IHME does visualize the range of predictions in their continuously updating model, which can help demonstrate the difficulty of sure predictions.

In this graph, IHME predicts that April 16th will be the peak day for coronavirus-related deaths in the United States as a whole (remember that what’s true for the US isn’t necessarily true for each state). The red dotted line is the predicted death count per day, and the red shaded area is the range the model considers plausible. It’s also important to note that IHME does not graph its previous projections, so the closeness of the solid and dotted red lines does not mean earlier models were incredibly accurate. It does tell us that, with new, updated data, the model can predict tomorrow fairly accurately; as you can see, though, long-term predictions still vary widely.

So, on April 16th, deaths per day could range from 1,294 to 4,140. The model also shows a different possible peak day, April 22nd, where the upper limit of possible fatalities is 4,526. IHME’s model also projects total deaths at 93,765 through August 4th, with a range of 41,399-177,381.

While these ranges are wide, they still demonstrate relative confidence that a significant number of Americans will die. The takeaway shouldn’t be that this is not a critical public health issue, but that we should be judicious in how we respond to headlines that claim to have the “facts” when these estimates are based on highly educated guesses.

Remember as you read articles with data and modeling to be critical of the methods, to consider where the data is from, and to look for what assumptions were made when analyzing the data.

Think About the Medical Research

Lastly, a word about medical research. There is a lot of media coverage about medical treatments, vaccines, infection rates, and more. All of the previously mentioned issues come into play with these stories. 

How many people are being tested? Who is being tested? Are they the same in France and Italy as in the U.S. or in Detroit? How does this data actually compare to who is being treated in my community?

Data and Clinical Trials

It’s important to remember how clinical trials test whether a drug is useful for treating a disease. In a trial, half of the participants receive the treatment and the other half receive a placebo, without knowing which group they fall into. With this method, researchers can separate the effect of the treatment from the effect of doing nothing. This is also where the term “placebo effect” comes from: it turns out that giving someone a placebo can sometimes improve their outcomes through the power of suggestion alone.
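The random assignment step at the heart of such a trial can be sketched in a few lines (simplified; real trials add stratification, blinding protocols, and much more):

```python
import random


def assign_arms(participant_ids, seed=42):
    """Randomly split participants into treatment and placebo arms."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# A trial enrolling 1,500 people yields two arms of 750 each.
treatment, placebo = assign_arms(range(1500))
# Neither arm is told which group it is in; comparing outcomes between
# the arms separates the drug's effect from the placebo effect.
```

Randomizing (rather than letting doctors or patients choose) is what makes the two arms comparable, so any difference in outcomes can be attributed to the drug.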

This is what is happening at the University of Minnesota, where researchers are conducting a clinical trial of an antimalarial drug as a possible preventative measure, enrolling 1,500 people: half receive the drug and the other half receive a vitamin. By contrast, the French doctor who reported positive outcomes from the antimalarial drug enrolled only 24 patients in the first test and 80 in the second, and didn’t have a control group to compare outcomes against.

In a pandemic, with hospitals being pushed to capacity, many are implementing various medication regimens without clinical trials, relying instead on anecdotal experience. It’s important to recognize that clinical experience is not the same as the results of a clinical trial: the methods are less rigorous, the samples aren’t always well defined, and claiming widespread success without a clear control group is not sound scientific methodology.

As Henry Ford Health Systems says:

“In the absence of a formal clinical trial, the next best evidence would be clinical experience and how people do when they get one treatment or another… the potential benefit of keeping someone out of an intensive care unit, getting them out of the hospital earlier, reducing mortality, we feel far outweighs any potential risk.”

As you read about possible treatments, vaccines, and other developments related to COVID-19, keep these potential limitations in mind. Many research papers are being written rapidly, and some are quickly retracted, like a report that a significant portion of asymptomatic cases in China were false positives. And the doctors claiming a 100% success rate treating coronavirus with a medication aren’t comparing their results to a control group, so there’s no way to know whether it’s the medication or some other factor driving patient recovery.

The Bottom Line

As the FiveThirtyEight article says, “Numbers aren’t facts. They’re the result of a lot of subjective choices that have to be documented transparently and in detail before you can even begin to consider treating the output as fact. How data is gathered—and whether it is gathered the same way each time—matters.”

We highly encourage you to always try to find the original data source or research and see what limitations are reported. All data has limitations, and there are significant data limitations in everything being reported right now. If you need help with data related to COVID-19 for your organization, please reach out to AskD3.

Copyright © 2022 Data Driven Detroit. All Rights Reserved.