Our model predicts the average daily new cases for each of the next 4 weeks. For comparison, the graph above also shows the same metric, which is known from testing data, for the past 4 weeks.
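To make the prediction target concrete, here is a minimal sketch of how "average daily new cases for each of the next 4 weeks" can be computed from a daily series. The function name `weekly_average_targets`, the list-based input, and the sample numbers are hypothetical and chosen only for illustration.

```python
def weekly_average_targets(daily_new_cases, forecast_start, weeks=4):
    """Mean daily new cases for each of `weeks` consecutive 7-day blocks,
    starting at index `forecast_start` of a daily series."""
    targets = []
    for w in range(weeks):
        start = forecast_start + 7 * w
        week = daily_new_cases[start:start + 7]
        if len(week) < 7:
            raise ValueError("not enough days to fill every forecast week")
        targets.append(sum(week) / 7)
    return targets


# Hypothetical daily new-case counts: 28 days, each week at a constant level.
daily = [10] * 7 + [20] * 7 + [30] * 7 + [40] * 7
print(weekly_average_targets(daily, forecast_start=0))
```

The same function applied to the 28 days before the forecast date gives the past-4-weeks values used for comparison.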
We apply machine learning to a dataset of county-level predictors of COVID-19 outbreaks and outbreak potential in the U.S., producing a regression model that predicts future daily case counts.
Socioeconomic Data: We use county-level socioeconomic metrics, including the Surgo Foundation's COVID-19 Community Vulnerability Index (CCVI), the proportions of elderly, Black, Hispanic, and male inhabitants, and population density from the US Census, as these factors are closely linked to COVID-19 outbreak risk.
Health Data: We use county-level data from IHME on mortality rates for various diseases and on smoking prevalence to gauge each population's susceptibility to COVID-19.
Rt Data: We use state-level estimates of the COVID-19 reproduction number (Rt), computed with standard epidemiological models by covid19-projections.com and rt.live, to understand the level of statewide transmission.
Testing Data: We use state-level data on total tests conducted and positive tests received from covidtracking.com, smoothed with a 7-day moving average, to understand current diagnostic efforts.
Cases Data: We use county-level data on cases reported in each county every day from Johns Hopkins University. This data is smoothed with a 7-day moving average to understand epidemics at the local level.
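As an illustration of the 7-day smoothing applied to the testing and case series, here is a minimal sketch. It assumes a trailing window (the current day plus the six days before it) and a shorter window for the first few days. Whether the window is trailing or centered is not specified above, and the function name `moving_average` and the sample counts are hypothetical.

```python
def moving_average(values, window=7):
    """Trailing moving average. The first `window - 1` entries are averaged
    over however many days are available so far (an assumption)."""
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the day that left the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


# Hypothetical county that reports a whole week's cases on one day.
reported = [0, 0, 0, 0, 0, 0, 70, 0, 0, 0, 0, 0, 0, 70]
print(moving_average(reported))
```

Smoothing of this kind spreads reporting artifacts, such as weekend backlogs, over the week instead of letting them appear as single-day spikes.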
Feature Engineering: Our raw dataset includes a wide range of statistics that are linked to COVID-19 for dates ranging from March 2020 to the present. We are actively tuning our models with dimensionality reduction and normalization techniques to improve performance and reduce overfitting.
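The normalization step can be pictured with a minimal sketch. Z-score standardization is used here purely as an example, since the specific normalization and dimensionality reduction techniques are not named above. The function name `standardize` and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a feature column to zero mean and unit variance.
    A constant column is mapped to all zeros to avoid division by zero."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical population densities (people per square mile) for four counties.
density = [50.0, 200.0, 1200.0, 8000.0]
print(standardize(density))
```

In practice the mean and standard deviation would be computed on the training data only and then reused for unseen data, so that no information leaks from the evaluation set.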
Algorithms: We currently utilize Random Forests and Artificial Neural Networks on our datasets. These regression models output the projected average daily cases for each of the upcoming 4 weeks.
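A minimal sketch of the Random Forest side of this setup, using scikit-learn on synthetic data. It assumes a single multi-output regressor with one output per forecast week. Whether the actual pipeline uses one model or a separate model per week is not stated above, and the feature count, data, and hyperparameters below are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 county-date rows with 6 made-up predictors.
X = rng.random((200, 6))
# Four non-negative targets: average daily cases for weeks 1 to 4 ahead.
y = np.column_stack(
    [X[:, 0] * 10 * (k + 1) + rng.random(200) for k in range(4)]
)

# RandomForestRegressor accepts a 2-D target and predicts all outputs jointly.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X[:150], y[:150])

# One row per held-out sample, one column per forecast week.
predictions = model.predict(X[150:])
print(predictions.shape)
```

Holding out the last 50 rows mirrors the idea of evaluating on data the model has not been trained on.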
Mobility: A challenge we encountered is that mobility data is unavailable for some rural counties. To maximize the population that we serve, we use two models, one of which incorporates mobility data and the other of which does not.
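The two-model arrangement can be sketched as a simple fallback: use the mobility-aware model when a county has mobility data, and the other model otherwise. The function `predict_county`, the field name `mobility_index`, and the stand-in models below are hypothetical.

```python
def predict_county(features, model_with_mobility, model_without_mobility,
                   mobility_keys=("mobility_index",)):
    """Route a county to the model that matches its available data."""
    has_mobility = all(features.get(key) is not None for key in mobility_keys)
    if has_mobility:
        return model_with_mobility(features)
    # Drop the missing mobility fields before calling the fallback model.
    reduced = {k: v for k, v in features.items() if k not in mobility_keys}
    return model_without_mobility(reduced)


# Stand-in "models" that only report which path was taken.
with_mobility = lambda f: ("with_mobility", sorted(f))
without_mobility = lambda f: ("without_mobility", sorted(f))

urban = {"population_density": 8000.0, "mobility_index": 0.7}
rural = {"population_density": 12.0, "mobility_index": None}

print(predict_county(urban, with_mobility, without_mobility))
print(predict_county(rural, with_mobility, without_mobility))
```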
Performance: Our latest models predict well on data they were not trained on, with a relatively low mean absolute error that is usually below 5 cases per 100,000 people. To put this in perspective, for a county with a population of 100,000 and this error in a given prediction week, the actual average daily cases for that week would differ from our prediction by only 5.
[Performance table. Columns: whether the model was trained with mobility data; the week ahead that is forecasted; a measure of goodness-of-fit; absolute error in cases per 100,000, nationwide.]
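The error metric described above can be worked through in a short sketch. It assumes the error is computed per county, scaled to cases per 100,000 residents, and then averaged across counties. The exact aggregation is not spelled out here, and the function names and numbers are hypothetical.

```python
def mae_per_100k(predicted, actual, populations):
    """Mean absolute error of average daily cases, scaled per 100,000 residents."""
    errors = [
        abs(p - a) * 100_000 / pop
        for p, a, pop in zip(predicted, actual, populations)
    ]
    return sum(errors) / len(errors)


def error_in_cases(mae_100k, population):
    """Convert a per-100,000 error back into daily cases for one county."""
    return mae_100k * population / 100_000


# Two hypothetical counties for a single forecast week.
predicted = [25.0, 12.0]        # predicted average daily cases
actual = [20.0, 10.0]           # observed average daily cases
populations = [100_000, 50_000]

print(mae_per_100k(predicted, actual, populations))
print(error_in_cases(5.0, 1_000_000))
```

Because the metric is scaled by population, the same error of 5 per 100,000 corresponds to 5 daily cases in a county of 100,000 people but 50 daily cases in a county of 1,000,000.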
A machine learning model is only as reliable as the data used to train it. We have done our best to obtain the most reliable datasets available that are relevant to the current COVID-19 epidemic in the United States. However, these datasets may still contain errors or skewing factors that are beyond our control.
Case counts per county may be skewed by inaccurate test results and disparities in testing capacity, especially between rural and urban regions. We attempt to account for this skew by including testing datasets, but these are at the state level. In fact, many of our other datasets, such as Rt values, are also at the state level and may not reflect conditions in some individual counties.
Our model and almost all other epidemiological models for the US only take into account case data based on testing, so their predictions are based on a limited view of the actual scope of the epidemic.
We have chosen not to build a traditional epidemiological model due to scarcity of necessary data and our desire to easily take into account a diverse set of relevant risk factors at the county level. While our model has shown good performance thus far, please focus on the general trends of the predictions instead of taking them too literally.
We hope to release a licensed version of our source code in the coming weeks to aid other researchers once we are certain that our pipeline is well-optimized and bug-free.
We are working hard to make sure that every county in the US is represented in our dataset.
Please use the form at the top of this page to contact us with any questions, comments, or concerns.
COVID-19 affects some populations within the U.S. more severely than others. For instance, men experience higher mortality than women, and minority groups such as African Americans experience higher mortality than Caucasians and Asians. Urban regions experience larger outbreaks than rural regions, but rural regions may have less access to testing.
Consequently, our model takes into account racial and socioeconomic factors for every county to more accurately predict outbreaks.