Our model predicts the number of cases detected by testing for each of the next 4 weeks. For comparison, the graph above also shows the same metric for the past 4 weeks, which is known in retrospect.
We apply machine learning to a dataset of predictors of COVID-19 outbreaks, or of the potential for outbreaks, at the U.S. county level, producing a regression model that predicts future daily case counts.
Socioeconomic Data: We use county-level socioeconomic metrics, including the CCVI from the Surgo Foundation, the proportions of elderly, Black, Hispanic, and male inhabitants, and population density from the US Census, as these factors may be linked to COVID-19 outbreak risk.
Health Data: We use county-level mortality and prevalence statistics for various diseases from IHME to gauge the population's susceptibility to COVID-19.
Rt Data: We use state-level and county-level reproduction rates of COVID-19, computed by CovidActNow.org using standard epidemiological models, to understand the level of transmission.
Testing Data: We use state-level data on total tests conducted and positive tests received from covidtracking.com to understand current diagnostic efforts.
Cases Data: We use daily county-level data on cases from Johns Hopkins University. This data is smoothed with a 7-day moving average to understand epidemics at the local level.
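The smoothing step can be illustrated with a minimal sketch. It assumes a trailing window (each day averaged with the six days before it); the source does not say whether the window is trailing or centered. The function name `moving_average_7day` and the sample counts are hypothetical.

```python
from collections import deque


def moving_average_7day(daily_cases, window=7):
    """Return a trailing moving average of daily case counts.

    Assumption: the window is trailing, and the first few days are
    averaged over however many days are available so far.
    """
    recent = deque(maxlen=window)  # holds at most the last `window` values
    smoothed = []
    for count in daily_cases:
        recent.append(count)
        smoothed.append(sum(recent) / len(recent))
    return smoothed


# Hypothetical counts with a weekend reporting dip and a catch-up spike.
daily = [10, 12, 11, 13, 12, 2, 3, 25, 12, 11]
smoothed = moving_average_7day(daily)
```

Averaging over a full week dampens day-of-week reporting artifacts, such as the dip and spike in the sample data, so that they do not look like real changes in local transmission.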
Feature Engineering: Our raw dataset includes a wide range of statistics that are linked to COVID-19 for dates ranging from March 2020 to the present. We are actively tuning our models with dimensionality reduction and normalization techniques to improve performance and reduce overfitting.
Algorithms: We currently utilize Random Forests on our datasets. These regression models output the projected cases detected by testing for each of the upcoming 4 weeks.
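As a rough illustration of this setup, the sketch below fits a Random Forest regressor with four target columns, one per forecast week, using scikit-learn. The data is synthetic, and the choice of a single multi-output model (rather than four separate models), the number of trees, and the feature count are all assumptions; the source does not describe these details.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 county-date rows, 6 predictor columns.
X = rng.normal(size=(200, 6))

# Four made-up target columns, one per forecast week (cases per 100,000).
weights = rng.normal(size=(6, 4))
y = X @ weights + rng.normal(scale=0.1, size=(200, 4))

# RandomForestRegressor accepts a 2-D target and predicts all columns at once.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# One row per county, one column per week ahead (weeks 1 to 4).
forecast = model.predict(X[:3])
```

Here `forecast` has one row per input row and four columns, matching the four upcoming weeks described above.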
Mobility: A challenge we encountered is that mobility data is unavailable for some rural counties. To maximize the population that we serve, we use two models, one of which incorporates mobility data and the other of which does not.
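The two-model arrangement can be sketched as a simple fallback: use the mobility-aware model when mobility features exist for a county, and the other model otherwise. The function `predict_county` and both stand-in models are hypothetical and only show the routing logic, not the actual pipeline.

```python
def predict_county(base_features, mobility_features,
                   model_with_mobility, model_without_mobility):
    """Route a county to the model that matches its available data."""
    if mobility_features is None:
        # No mobility data for this county (e.g. some rural counties).
        return model_without_mobility(list(base_features))
    return model_with_mobility(list(base_features) + list(mobility_features))


# Stand-in "models" so the sketch runs; real models would be trained regressors.
def with_mobility(features):
    return sum(features) / len(features)


def without_mobility(features):
    return min(features)


urban = predict_county([10.0, 20.0], [30.0], with_mobility, without_mobility)
rural = predict_county([10.0, 20.0], None, with_mobility, without_mobility)
```

Routing at prediction time means a county without mobility data still receives a forecast instead of being dropped.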
Performance: Our latest models make good predictions on data they have not been trained on, with a relatively low mean absolute error, usually below 30 cases per 100,000. To put this in perspective, for a county with a population of 100,000 and this error in a given prediction week, our projection would be off by only 30 cases.
Columns of the performance table: whether the model was trained with mobility data; the week ahead that is forecasted; a measure of goodness-of-fit; absolute cases per 100,000, nationwide.
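To make the error figure concrete, here is a minimal sketch of how a mean absolute error in cases per 100,000 could be computed and then converted back to a case count for a county of a given size. The function names and the sample numbers are hypothetical, and averaging per-county rates with equal weight is an assumption; the source does not specify how its nationwide figure is aggregated.

```python
def mae_per_100k(predicted, actual, populations):
    """Mean absolute error, with each county's error scaled to cases per 100,000."""
    errors = [
        abs(p - a) * 100_000 / pop
        for p, a, pop in zip(predicted, actual, populations)
    ]
    return sum(errors) / len(errors)


def error_in_cases(rate_per_100k, population):
    """Convert an error rate per 100,000 into a case count for one county."""
    return rate_per_100k * population / 100_000


# Two hypothetical counties: weekly predicted vs. actual case counts.
mae = mae_per_100k(predicted=[120, 45], actual=[100, 50],
                   populations=[100_000, 50_000])

# An error of 30 per 100,000 in counties of two different sizes.
small = error_in_cases(30, 100_000)
large = error_in_cases(30, 250_000)
```

In this sketch the same rate of 30 per 100,000 corresponds to 30 cases in a county of 100,000 people and 75 cases in a county of 250,000, which is why the error is reported per capita.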
A machine learning model is only as reliable as the data used to train it. We have done our best to obtain the most reliable datasets available that are relevant to the current COVID-19 epidemic in the United States. However, these datasets may still contain errors or skewing factors that are beyond our control.
Case counts per county may be skewed by inaccurate test results and disparities in testing capacity, especially between rural and urban regions. We attempt to account for this skew by including testing datasets, but these are at the state level.
Our model and almost all other epidemiological models for the US only take into account case data based on testing, so their predictions are based on a limited view of the actual scope of the epidemic.
We have chosen not to build a traditional epidemiological model due to scarcity of necessary data and our desire to easily take into account a diverse set of relevant risk factors at the county level. While our model has shown good performance thus far, please focus on the general trends of the predictions instead of taking them too literally.
We hope to release a licensed version of our source code in the coming weeks to aid other researchers once we are certain that our pipeline is well-optimized and bug-free.
We are working hard to make sure that every county in the US is represented in our dataset.
Please use the form at the top of this page to contact us with any questions, comments, or concerns.
COVID-19 affects some populations within the U.S. more severely than others. For instance, men experience higher mortality than women, and minorities such as African Americans experience higher mortality than Caucasians and Asians. Urban regions experience larger outbreaks than rural regions, but rural regions may have lower access to testing.
Consequently, our model takes into account racial and socioeconomic factors for every county to more accurately predict outbreaks.