Coronavirus (COVID-19) Visualization & Prediction Notes
Project link: https://www.kaggle.com/therealcyberlord/coronavirus-covid-19-visualization-prediction
Summary
A note up front: I expected this project to be complicated, with a heavy data-processing pipeline, but it turned out to be fairly manageable. Because there are so many figures and tables, the summary comes first.
First, no particularly advanced algorithms are used, and the demands on Python scripting are modest.
Second, the part I found most valuable is the forecasting of future COVID-19 numbers, using regression methods such as SVM, Polynomial Regression, and Bayesian Ridge Regression, along with the related sklearn utilities such as train_test_split and PolynomialFeatures.
Finally, the data visualization part largely matches what the news has been saying: the US stands far above everyone else.
There are a huge number of figures and tables; feel free to click through them. Some of the results carry comments.
Main Text
Coronaviruses are a family of viruses named for the spiky crown on their surface. The novel coronavirus, SARS-CoV-2, is a contagious respiratory virus that was first reported in Wuhan, China. On February 11, 2020, the World Health Organization named the disease caused by this novel coronavirus COVID-19. This notebook aims to explore COVID-19 through data analysis and prediction.
Novel coronavirus case data provided by Johns Hopkins University
Mobility data provided by Apple
More information available from the World Health Organization
More information available from the Centers for Disease Control and Prevention
Map visualizations available on the JHU CSSE Dashboard
Source code: my Github
Last updated: data tables updated 5:13 PM Eastern Time, October 20, 2020
Latest update: daily report data updated for October 20
Time series data updated through 10/19; mobility data updated through 10/19
The prediction models are trained on data starting from March 13, 2020, so predictions for earlier dates may be inaccurate.
Table of Contents
- Exploring Global Coronavirus Cases
- Exploring Coronavirus Cases From Different Countries
- Worldwide Confirmed Cases Prediction
- Data Table
- Pie Charts
- Bar Charts
- US Testing Data
- Mobility Data for Hotspots
# Import modules; all fairly standard
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import pandas as pd
import random
import math
import time
from sklearn.linear_model import LinearRegression, BayesianRidge
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error
import datetime
import operator
plt.style.use('fivethirtyeight')
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
Reading the data
# The data can also be downloaded first and then read from local files, as done here
confirmed_df = pd.read_csv('./20201021_data/time_series_covid19_confirmed_global.csv')
deaths_df = pd.read_csv('./20201021_data/time_series_covid19_deaths_global.csv')
recoveries_df = pd.read_csv('./20201021_data/time_series_covid19_recovered_global.csv')
latest_data = pd.read_csv('./20201021_data/10-19-2020-latest.csv')
us_medical_data = pd.read_csv('./20201021_data/10-19-2020-us.csv')
apple_mobility = pd.read_csv('./20201021_data/applemobilitytrends-2020-10-19.csv')
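As an alternative to local copies, the same time series can be read straight from the JHU CSSE GitHub repository. This is only a minimal sketch, assuming the repository layout used in late 2020; base_url is my own shorthand, not part of the original notebook.
# Read the time series directly from the JHU CSSE repository (sketch; assumes internet access)
base_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/'
confirmed_df = pd.read_csv(base_url + 'time_series_covid19_confirmed_global.csv')
deaths_df = pd.read_csv(base_url + 'time_series_covid19_deaths_global.csv')
recoveries_df = pd.read_csv(base_url + 'time_series_covid19_recovered_global.csv')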
# This file holds global confirmed cases over time
print(confirmed_df.columns)
confirmed_df.head()
Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',
'1/24/20', '1/25/20', '1/26/20', '1/27/20',
...
'10/11/20', '10/12/20', '10/13/20', '10/14/20', '10/15/20', '10/16/20',
'10/17/20', '10/18/20', '10/19/20', '10/20/20'],
dtype='object', length=277)
Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | … | 10/11/20 | 10/12/20 | 10/13/20 | 10/14/20 | 10/15/20 | 10/16/20 | 10/17/20 | 10/18/20 | 10/19/20 | 10/20/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Afghanistan | 33.93911 | 67.709953 | 0 | 0 | 0 | 0 | 0 | 0 | … | 39799 | 39870 | 39928 | 39994 | 40026 | 40073 | 40141 | 40200 | 40287 | 40357 |
1 | NaN | Albania | 41.15330 | 20.168300 | 0 | 0 | 0 | 0 | 0 | 0 | … | 15399 | 15570 | 15752 | 15955 | 16212 | 16501 | 16774 | 17055 | 17350 | 17651 |
2 | NaN | Algeria | 28.03390 | 1.659600 | 0 | 0 | 0 | 0 | 0 | 0 | … | 53072 | 53325 | 53399 | 53584 | 53777 | 53998 | 54203 | 54402 | 54616 | 54829 |
3 | NaN | Andorra | 42.50630 | 1.521800 | 0 | 0 | 0 | 0 | 0 | 0 | … | 2696 | 2995 | 2995 | 3190 | 3190 | 3377 | 3377 | 3377 | 3623 | 3623 |
4 | NaN | Angola | -11.20270 | 17.873900 | 0 | 0 | 0 | 0 | 0 | 0 | … | 6366 | 6488 | 6680 | 6846 | 7096 | 7222 | 7462 | 7622 | 7829 | 8049 |
5 rows × 277 columns
# This file holds global deaths over time
print(deaths_df.columns)
deaths_df.head()
Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',
'1/24/20', '1/25/20', '1/26/20', '1/27/20',
...
'10/11/20', '10/12/20', '10/13/20', '10/14/20', '10/15/20', '10/16/20',
'10/17/20', '10/18/20', '10/19/20', '10/20/20'],
dtype='object', length=277)
Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | … | 10/11/20 | 10/12/20 | 10/13/20 | 10/14/20 | 10/15/20 | 10/16/20 | 10/17/20 | 10/18/20 | 10/19/20 | 10/20/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Afghanistan | 33.93911 | 67.709953 | 0 | 0 | 0 | 0 | 0 | 0 | … | 1477 | 1479 | 1480 | 1481 | 1481 | 1485 | 1488 | 1492 | 1497 | 1499 |
1 | NaN | Albania | 41.15330 | 20.168300 | 0 | 0 | 0 | 0 | 0 | 0 | … | 420 | 424 | 429 | 434 | 439 | 443 | 448 | 451 | 454 | 458 |
2 | NaN | Algeria | 28.03390 | 1.659600 | 0 | 0 | 0 | 0 | 0 | 0 | … | 1801 | 1809 | 1818 | 1827 | 1827 | 1841 | 1846 | 1856 | 1865 | 1873 |
3 | NaN | Andorra | 42.50630 | 1.521800 | 0 | 0 | 0 | 0 | 0 | 0 | … | 55 | 57 | 57 | 59 | 59 | 59 | 59 | 59 | 62 | 62 |
4 | NaN | Angola | -11.20270 | 17.873900 | 0 | 0 | 0 | 0 | 0 | 0 | … | 218 | 219 | 222 | 227 | 228 | 234 | 241 | 247 | 248 | 251 |
5 rows × 277 columns
# This file holds global recovered cases over time
print(recoveries_df.columns)
recoveries_df.head()
Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',
'1/24/20', '1/25/20', '1/26/20', '1/27/20',
...
'10/11/20', '10/12/20', '10/13/20', '10/14/20', '10/15/20', '10/16/20',
'10/17/20', '10/18/20', '10/19/20', '10/20/20'],
dtype='object', length=277)
Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | … | 10/11/20 | 10/12/20 | 10/13/20 | 10/14/20 | 10/15/20 | 10/16/20 | 10/17/20 | 10/18/20 | 10/19/20 | 10/20/20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Afghanistan | 33.93911 | 67.709953 | 0 | 0 | 0 | 0 | 0 | 0 | … | 33114 | 33118 | 33308 | 33354 | 33447 | 33516 | 33561 | 33614 | 33760 | 33790 |
1 | NaN | Albania | 41.15330 | 20.168300 | 0 | 0 | 0 | 0 | 0 | 0 | … | 9500 | 9585 | 9675 | 9762 | 9864 | 9957 | 10001 | 10071 | 10167 | 10225 |
2 | NaN | Algeria | 28.03390 | 1.659600 | 0 | 0 | 0 | 0 | 0 | 0 | … | 37170 | 37382 | 37492 | 37603 | 37603 | 37856 | 37971 | 38088 | 38215 | 38346 |
3 | NaN | Andorra | 42.50630 | 1.521800 | 0 | 0 | 0 | 0 | 0 | 0 | … | 1814 | 1928 | 1928 | 2011 | 2011 | 2057 | 2057 | 2057 | 2273 | 2273 |
4 | NaN | Angola | -11.20270 | 17.873900 | 0 | 0 | 0 | 0 | 0 | 0 | … | 2743 | 2744 | 2761 | 2801 | 2928 | 3012 | 3022 | 3030 | 3031 | 3037 |
5 rows × 277 columns
# COVID-19 statistics by country as of 2020-10-21
print(latest_data.columns)
latest_data.head()
Index(['FIPS', 'Admin2', 'Province_State', 'Country_Region', 'Last_Update',
'Lat', 'Long_', 'Confirmed', 'Deaths', 'Recovered', 'Active',
'Combined_Key', 'Incidence_Rate', 'Case-Fatality_Ratio'],
dtype='object')
FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key | Incidence_Rate | Case-Fatality_Ratio | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | Afghanistan | 2020-10-20 04:24:22 | 33.93911 | 67.709953 | 40287 | 1497 | 33760 | 5030.0 | Afghanistan | 103.490154 | 3.715839 |
1 | NaN | NaN | NaN | Albania | 2020-10-20 04:24:22 | 41.15330 | 20.168300 | 17350 | 454 | 10167 | 6729.0 | Albania | 602.891097 | 2.616715 |
2 | NaN | NaN | NaN | Algeria | 2020-10-20 04:24:22 | 28.03390 | 1.659600 | 54616 | 1865 | 38215 | 14536.0 | Algeria | 124.548919 | 3.414750 |
3 | NaN | NaN | NaN | Andorra | 2020-10-20 04:24:22 | 42.50630 | 1.521800 | 3623 | 62 | 2273 | 1288.0 | Andorra | 4689.057141 | 1.711289 |
4 | NaN | NaN | NaN | Angola | 2020-10-20 04:24:22 | -11.20270 | 17.873900 | 7829 | 248 | 3031 | 4550.0 | Angola | 23.820776 | 3.167710 |
# US COVID-19 statistics as of 2020-10-21
print(us_medical_data.columns)
us_medical_data.head()
Index(['Province_State', 'Country_Region', 'Last_Update', 'Lat', 'Long_',
'Confirmed', 'Deaths', 'Recovered', 'Active', 'FIPS', 'Incident_Rate',
'People_Tested', 'People_Hospitalized', 'Mortality_Rate', 'UID', 'ISO3',
'Testing_Rate', 'Hospitalization_Rate'],
dtype='object')
Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | FIPS | Incident_Rate | People_Tested | People_Hospitalized | Mortality_Rate | UID | ISO3 | Testing_Rate | Hospitalization_Rate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | US | 2020-10-20 04:30:29 | 32.3182 | -86.9023 | 173485 | 2789 | 74238.0 | 96458.0 | 1.0 | 3538.210367 | 1260100.0 | NaN | 1.607632 | 84000001 | USA | 25699.621776 | NaN |
1 | Alaska | US | 2020-10-20 04:30:29 | 61.3707 | -152.4044 | 11182 | 67 | 6516.0 | 4599.0 | 2.0 | 1528.545749 | 536223.0 | NaN | 0.599177 | 84000002 | USA | 73300.070399 | NaN |
2 | American Samoa | US | 2020-10-20 04:30:29 | -14.2710 | -170.1320 | 0 | 0 | NaN | 0.0 | 60.0 | 0.000000 | 1616.0 | NaN | NaN | 16 | ASM | 2904.333136 | NaN |
3 | Arizona | US | 2020-10-20 04:30:29 | 33.7298 | -111.4312 | 231897 | 5830 | 38553.0 | 187514.0 | 4.0 | 3185.959833 | 1639785.0 | NaN | 2.514047 | 84000004 | USA | 22528.489568 | NaN |
4 | Arkansas | US | 2020-10-20 04:30:29 | 34.9697 | -92.3731 | 99597 | 1714 | 89217.0 | 8666.0 | 5.0 | 3300.313738 | 1223914.0 | NaN | 1.720935 | 84000005 | USA | 40556.444355 | NaN |
## Apple mobility data
print(apple_mobility.columns)
apple_mobility.head()
Index(['geo_type', 'region', 'transportation_type', 'alternative_name',
'sub-region', 'country', '2020-01-13', '2020-01-14', '2020-01-15',
'2020-01-16',
...
'2020-10-10', '2020-10-11', '2020-10-12', '2020-10-13', '2020-10-14',
'2020-10-15', '2020-10-16', '2020-10-17', '2020-10-18', '2020-10-19'],
dtype='object', length=287)
geo_type | region | transportation_type | alternative_name | sub-region | country | 2020-01-13 | 2020-01-14 | 2020-01-15 | 2020-01-16 | … | 2020-10-10 | 2020-10-11 | 2020-10-12 | 2020-10-13 | 2020-10-14 | 2020-10-15 | 2020-10-16 | 2020-10-17 | 2020-10-18 | 2020-10-19 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | country/region | Albania | driving | NaN | NaN | NaN | 100.0 | 95.30 | 101.43 | 97.20 | … | 144.47 | 148.87 | 123.94 | 111.80 | 113.31 | 111.52 | 117.39 | 128.99 | 137.88 | 114.93 |
1 | country/region | Albania | walking | NaN | NaN | NaN | 100.0 | 100.68 | 98.93 | 98.46 | … | 167.16 | 142.52 | 150.36 | 141.02 | 155.39 | 134.41 | 142.26 | 142.22 | 125.67 | 149.77 |
2 | country/region | Argentina | driving | NaN | NaN | NaN | 100.0 | 97.07 | 102.45 | 111.21 | … | 79.72 | 49.19 | 49.69 | 61.16 | 65.26 | 68.45 | 84.94 | 88.93 | 48.76 | 52.73 |
3 | country/region | Argentina | walking | NaN | NaN | NaN | 100.0 | 95.11 | 101.37 | 112.67 | … | 62.15 | 36.57 | 43.66 | 52.51 | 56.79 | 55.10 | 69.59 | 62.42 | 34.40 | 42.97 |
4 | country/region | Australia | driving | AU | NaN | NaN | 100.0 | 102.98 | 104.21 | 108.63 | … | 83.24 | 88.85 | 91.45 | 93.17 | 96.06 | 104.24 | 99.65 | 85.42 | 92.72 | 94.60 |
5 rows × 287 columns
Extracting the dates since the outbreak
# Extract the time series from the confirmed, deaths, and recovered tables: column index 4 (the first date column) through the last column
cols = confirmed_df.keys()
confirmed = confirmed_df.loc[:, cols[4]:cols[-1]]
deaths = deaths_df.loc[:, cols[4]:cols[-1]]
recoveries = recoveries_df.loc[:, cols[4]:cols[-1]]
# Compute cumulative confirmed, death, and recovered counts, then plot
dates = confirmed.keys()
world_cases = []
total_deaths = []
mortality_rate = []
recovery_rate = []
total_recovered = []
total_active = []
for i in dates:
confirmed_sum = confirmed[i].sum()
death_sum = deaths[i].sum()
recovered_sum = recoveries[i].sum()
# confirmed, deaths, recovered, and active
world_cases.append(confirmed_sum)
total_deaths.append(death_sum)
total_recovered.append(recovered_sum)
total_active.append(confirmed_sum-death_sum-recovered_sum)
# calculate rates
mortality_rate.append(death_sum/confirmed_sum)
recovery_rate.append(recovered_sum/confirmed_sum)
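As a side note, the loop above can be written without explicit iteration, since each date column just needs to be summed. A minimal vectorized sketch; the _v names are mine, chosen so they do not clobber the variables used below.
# Vectorized equivalent of the loop above (sketch)
world_cases_v = confirmed.sum(axis=0).values        # total confirmed per date
total_deaths_v = deaths.sum(axis=0).values          # total deaths per date
total_recovered_v = recoveries.sum(axis=0).values   # total recoveries per date
total_active_v = world_cases_v - total_deaths_v - total_recovered_v
mortality_rate_v = total_deaths_v / world_cases_v
recovery_rate_v = total_recovered_v / world_cases_v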
# Compute daily increases and moving averages for the series above (confirmed, deaths, recovered, and active), as well as the rates
def daily_increase(data):
d = []
for i in range(len(data)):
if i == 0:
d.append(data[0])
else:
d.append(data[i]-data[i-1])
return d
def moving_average(data, window_size):
moving_average = []
for i in range(len(data)):
if i + window_size < len(data):
moving_average.append(np.mean(data[i:i+window_size]))
else:
moving_average.append(np.mean(data[i:len(data)]))
return moving_average
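A quick worked example of what these two helpers return, on made-up cumulative counts (the numbers below are hypothetical). Note that moving_average uses a forward-looking window starting at each index, which is how the notebook defines it.
example_cumulative = [0, 2, 5, 9, 14, 20]        # hypothetical cumulative totals
print(daily_increase(example_cumulative))        # [0, 2, 3, 4, 5, 6]
print(moving_average(example_cumulative, 3))     # approx. [2.33, 5.33, 9.33, 14.33, 17.0, 20.0]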
# window size
window = 7
# confirmed cases
world_daily_increase = daily_increase(world_cases)
world_confirmed_avg= moving_average(world_cases, window)
world_daily_increase_avg = moving_average(world_daily_increase, window)
# deaths
world_daily_death = daily_increase(total_deaths)
world_death_avg = moving_average(total_deaths, window)
world_daily_death_avg = moving_average(world_daily_death, window)
# recoveries
world_daily_recovery = daily_increase(total_recovered)
world_recovery_avg = moving_average(total_recovered, window)
world_daily_recovery_avg = moving_average(world_daily_recovery, window)
# active
world_active_avg = moving_average(total_active, window)
# Count days from January 22 onward and reshape the data into n x 1 arrays
days_since_1_22 = np.array([i for i in range(len(dates))]).reshape(-1, 1)
world_cases = np.array(world_cases).reshape(-1, 1)
total_deaths = np.array(total_deaths).reshape(-1, 1)
total_recovered = np.array(total_recovered).reshape(-1, 1)
print(days_since_1_22.shape)
(273, 1)
# For the forecast: build day indices extending 10 days into the future, counted from January 22
days_in_future = 10
future_forcast = np.array([i for i in range(len(dates)+days_in_future)]).reshape(-1, 1)
adjusted_dates = future_forcast[:-10]
# Convert the integer day indices from the previous step into date strings for visualization, using the datetime module
start = '1/22/2020'
start_date = datetime.datetime.strptime(start, '%m/%d/%Y')
future_forcast_dates = []
for i in range(len(future_forcast)):
future_forcast_dates.append((start_date + datetime.timedelta(days=i)).strftime('%m/%d/%Y'))
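A quick check on the generated strings: index 0 maps back to the start date, and the last entry is 10 days past the final observed date.
print(future_forcast_dates[0])     # 01/22/2020
print(future_forcast_dates[-1])    # 10/30/2020, i.e. 10 days past the last observed date (10/20/2020)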
# Prediction setup: train_test_split (sklearn.model_selection) takes the worldwide confirmed cases from day 50 onward, with a test fraction of 0.15 and no shuffling
X_train_confirmed, X_test_confirmed, y_train_confirmed, y_test_confirmed = train_test_split(days_since_1_22[50:], world_cases[50:], test_size=0.15, shuffle=False)
print(X_train_confirmed.shape,X_test_confirmed.shape)
(189, 1) (34, 1)
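Because shuffle=False, this is a chronological split: the earliest 85% of the days form the training set and the most recent 15% are held out for testing. A tiny illustration on toy data; the toy_ names and numbers are made up.
toy_X = np.arange(10).reshape(-1, 1)   # day indices 0..9
toy_y = toy_X * 100                    # hypothetical cumulative counts
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(toy_X, toy_y, test_size=0.2, shuffle=False)
print(toy_X_train.ravel())   # [0 1 2 3 4 5 6 7] -> earliest days used for training
print(toy_X_test.ravel())    # [8 9]             -> most recent days held out for testing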
# Predict confirmed cases using support vector machine, Bayesian ridge, and linear regression models
# SVM sklearn.svm
svm_confirmed = SVR(shrinking=True, kernel='poly', gamma=0.01, epsilon=1, degree=3, C=0.1)  # define the model
svm_confirmed.fit(X_train_confirmed, y_train_confirmed)  # train
svm_pred = svm_confirmed.predict(future_forcast)  # predict
# Plot it and take a look
# mean_absolute_error (sklearn.metrics) is the mean absolute error (MAE); see https://blog.csdn.net/StupidAutofan/article/details/79556087
svm_test_pred = svm_confirmed.predict(X_test_confirmed)
plt.plot(y_test_confirmed)
plt.plot(svm_test_pred)
plt.legend(['Test Data', 'SVM Predictions'])
print('MAE:', mean_absolute_error(svm_test_pred, y_test_confirmed))
print('MSE:',mean_squared_error(svm_test_pred, y_test_confirmed))
MAE: 4400479.84939107
MSE: 21084654795537.24

# transform our data for polynomial regression
# This step generates polynomial feature terms of different powers,
# using sklearn.preprocessing.PolynomialFeatures: given two features a and b, the degree-2 expansion is (1, a, b, a^2, ab, b^2)
poly = PolynomialFeatures(degree=4)
poly_X_train_confirmed = poly.fit_transform(X_train_confirmed)  # i.e. generates powers 0 through 4 of the single feature x
poly_X_test_confirmed = poly.fit_transform(X_test_confirmed)
poly_future_forcast = poly.fit_transform(future_forcast)
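To make the expansion concrete for our single day-number feature, here is a tiny check (demo_days is a made-up array). The leading column of ones is also why the linear model below is fit with fit_intercept=False.
demo_days = np.array([[1], [2], [3]])   # hypothetical day numbers
print(PolynomialFeatures(degree=4).fit_transform(demo_days))
# [[ 1.  1.  1.  1.  1.]
#  [ 1.  2.  4.  8. 16.]
#  [ 1.  3.  9. 27. 81.]]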
# Bayesian ridge uses a degree-5 polynomial expansion
bayesian_poly = PolynomialFeatures(degree=5)
bayesian_poly_X_train_confirmed = bayesian_poly.fit_transform(X_train_confirmed)
bayesian_poly_X_test_confirmed = bayesian_poly.fit_transform(X_test_confirmed)
bayesian_poly_future_forcast = bayesian_poly.fit_transform(future_forcast)
# polynomial regression
# fit the polynomial regression and predict
linear_model = LinearRegression(normalize=True, fit_intercept=False)
linear_model.fit(poly_X_train_confirmed, y_train_confirmed)
test_linear_pred = linear_model.predict(poly_X_test_confirmed)
linear_pred = linear_model.predict(poly_future_forcast)
print('MAE:', mean_absolute_error(test_linear_pred, y_test_confirmed))
print('MSE:',mean_squared_error(test_linear_pred, y_test_confirmed))
MAE: 1125828.6734114974
MSE: 2107924255435.714
# Plot it; somewhat better than the plain SVR above
plt.plot(y_test_confirmed)
plt.plot(test_linear_pred)
plt.legend(['Test Data', 'Polynomial Regression Predictions'])
<matplotlib.legend.Legend at 0x234c8336588>
# bayesian ridge polynomial regression
# Bayesian ridge regression on polynomial features, with a randomized hyperparameter search
tol = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2]
alpha_1 = [1e-7, 1e-6, 1e-5, 1e-4, 1e-3]
alpha_2 = [1e-7, 1e-6, 1e-5, 1e-4, 1e-3]
lambda_1 = [1e-7, 1e-6, 1e-5, 1e-4, 1e-3]
lambda_2 = [1e-7, 1e-6, 1e-5, 1e-4, 1e-3]
normalize = [True, False]
bayesian_grid = {'tol': tol, 'alpha_1': alpha_1, 'alpha_2' : alpha_2, 'lambda_1': lambda_1, 'lambda_2' : lambda_2,
'normalize' : normalize}
bayesian = BayesianRidge(fit_intercept=False)
bayesian_search = RandomizedSearchCV(bayesian, bayesian_grid, scoring='neg_mean_squared_error', cv=3, return_train_score=True, n_jobs=-1, n_iter=40, verbose=1)
bayesian_search.fit(bayesian_poly_X_train_confirmed, y_train_confirmed)
Fitting 3 folds for each of 40 candidates, totalling 120 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done 18 tasks | elapsed: 1.1s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed: 1.2s finished
RandomizedSearchCV(cv=3, estimator=BayesianRidge(fit_intercept=False),
n_iter=40, n_jobs=-1,
param_distributions={'alpha_1': [1e-07, 1e-06, 1e-05, 0.0001,
0.001],
'alpha_2': [1e-07, 1e-06, 1e-05, 0.0001,
0.001],
'lambda_1': [1e-07, 1e-06, 1e-05,
0.0001, 0.001],
'lambda_2': [1e-07, 1e-06, 1e-05,
0.0001, 0.001],
'normalize': [True, False],
'tol': [1e-06, 1e-05, 0.0001, 0.001,
0.01]},
return_train_score=True, scoring='neg_mean_squared_error',
verbose=1)
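Before using the refit estimator, it can be worth checking which hyperparameter combination the random search actually picked; the exact values vary from run to run.
print(bayesian_search.best_params_)   # best sampled hyperparameter combination
print(bayesian_search.best_score_)    # its mean CV score (negative MSE, since that was the scoring metric)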
# Likewise, plot it; not bad
bayesian_confirmed = bayesian_search.best_estimator_
test_bayesian_pred = bayesian_confirmed.predict(bayesian_poly_X_test_confirmed)
bayesian_pred = bayesian_confirmed.predict(bayesian_poly_future_forcast)
print('MAE:', mean_absolute_error(test_bayesian_pred, y_test_confirmed))
print('MSE:',mean_squared_error(test_bayesian_pred, y_test_confirmed))
plt.plot(y_test_confirmed)
plt.plot(test_bayesian_pred)
plt.legend(['Test Data', 'Bayesian Ridge Polynomial Predictions'])
MAE: 607921.3030146289
MSE: 390856736680.26746
<matplotlib.legend.Legend at 0x234c8396fc8>
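To see the three models side by side, the test-set errors printed above can be collected into one small table. This is just a sketch using the prediction arrays already defined; model_comparison is my own name, not from the original notebook.
model_comparison = pd.DataFrame({
    'Model': ['SVM', 'Polynomial Regression', 'Bayesian Ridge'],
    'MAE': [mean_absolute_error(svm_test_pred, y_test_confirmed),
            mean_absolute_error(test_linear_pred, y_test_confirmed),
            mean_absolute_error(test_bayesian_pred, y_test_confirmed)],
    'MSE': [mean_squared_error(svm_test_pred, y_test_confirmed),
            mean_squared_error(test_linear_pred, y_test_confirmed),
            mean_squared_error(test_bayesian_pred, y_test_confirmed)],
})
print(model_comparison)   # Bayesian ridge gives the lowest test error of the three here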
Graphing the number of confirmed cases, active cases, deaths, recoveries, mortality rate (CFR), and recovery rate
# helper method for flattening the data, so it can be displayed on a bar graph
# Define a helper to flatten the data for the bar graphs; np.ndarray.reshape or np.ndarray.flatten would do the same
def flatten(arr):
a = []
arr = arr.tolist()
for i in arr:
a.append(i[0])
return a
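As the comment above suggests, the helper mirrors NumPy's own flatten; a quick check on a made-up column vector:
demo_col = np.array([[1], [2], [3]])   # hypothetical (n, 1) array
print(flatten(demo_col))               # [1, 2, 3]
print(demo_col.flatten().tolist())     # [1, 2, 3] -- same result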
# Plotting; fairly basic, not much to add
adjusted_dates = adjusted_dates.reshape(1, -1)[0]  # flatten to 1-D
plt.figure(figsize=(16, 10))
plt.plot(adjusted_dates, world_cases)
plt.plot(adjusted_dates, world_confirmed_avg, linestyle='dashed', color='orange')
plt.title('# of Coronavirus Cases Over Time', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.legend(['Worldwide Coronavirus Cases', 'Moving Average {} Days'.format(window)], prop={'size': 20})
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()
plt.figure(figsize=(16, 10))
plt.plot(adjusted_dates, total_deaths)
plt.plot(adjusted_dates, world_death_avg, linestyle='dashed', color='orange')
plt.title('# of Coronavirus Deaths Over Time', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.legend(['Worldwide Coronavirus Deaths', 'Moving Average {} Days'.format(window)], prop={'size': 20})
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()
plt.figure(figsize=(16, 10))
plt.plot(adjusted_dates, total_recovered)
plt.plot(adjusted_dates, world_recovery_avg, linestyle='dashed', color='orange')
plt.title('# of Coronavirus Recoveries Over Time', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.legend(['Worldwide Coronavirus Recoveries', 'Moving Average {} Days'.format(window)], prop={'size': 20})
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()
plt.figure(figsize=(16, 10))
plt.plot(adjusted_dates, total_active)
plt.plot(adjusted_dates, world_active_avg, linestyle='dashed', color='orange')
plt.title('# of Coronavirus Active Cases Over Time', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Active Cases', size=30)
plt.legend(['Worldwide Coronavirus Active Cases', 'Moving Average {} Days'.format(window)], prop={'size': 20})
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()



plt.figure(figsize=(16, 10))
plt.bar(adjusted_dates, world_daily_increase)
plt.plot(adjusted_dates, world_daily_increase_avg, color='orange', linestyle='dashed')
plt.title('World Daily Increases in Confirmed Cases', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.legend(['Moving Average {} Days'.format(window), 'World Daily Increase in COVID-19 Cases'], prop={'size': 20})
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()
plt.figure(figsize=(16, 10))
plt.bar(adjusted_dates, world_daily_death)
plt.plot(adjusted_dates, world_daily_death_avg, color='orange', linestyle='dashed')
plt.title('World Daily Increases in Confirmed Deaths', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.legend(['Moving Average {} Days'.format(window), 'World Daily Increase in COVID-19 Deaths'], prop={'size': 20})
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()
plt.figure(figsize=(16, 10))
plt.bar(adjusted_dates, world_daily_recovery)
plt.plot(adjusted_dates, world_daily_recovery_avg, color='orange', linestyle='dashed')
plt.title('World Daily Increases in Confirmed Recoveries', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.legend(['Moving Average {} Days'.format(window), 'World Daily Increase in COVID-19 Recoveries'], prop={'size': 20})
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()


plt.figure(figsize=(16, 10))
plt.plot(adjusted_dates, np.log10(world_cases))
plt.title('Log of # of Coronavirus Cases Over Time', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()
plt.figure(figsize=(16, 10))
plt.plot(adjusted_dates, np.log10(total_deaths))
plt.title('Log of # of Coronavirus Deaths Over Time', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()
plt.figure(figsize=(16, 10))
plt.plot(adjusted_dates, np.log10(total_recovered))
plt.title('Log of # of Coronavirus Recoveries Over Time', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()


# Plots by country
def country_plot(x, y1, y2, y3, y4, country):
# window is set to 7 at the beginning of the notebook
confirmed_avg = moving_average(y1, window)
confirmed_increase_avg = moving_average(y2, window)
death_increase_avg = moving_average(y3, window)
recovery_increase_avg = moving_average(y4, window)
plt.figure(figsize=(16, 10))
plt.plot(x, y1)
plt.plot(x, confirmed_avg, color='red', linestyle='dashed')
plt.legend(['{} Confirmed Cases'.format(country), 'Moving Average {} Days'.format(window)], prop={'size': 20})
plt.title('{} Confirmed Cases'.format(country), size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()
plt.figure(figsize=(16, 10))
plt.bar(x, y2)
plt.plot(x, confirmed_increase_avg, color='red', linestyle='dashed')
plt.legend(['Moving Average {} Days'.format(window), '{} Daily Increase in Confirmed Cases'.format(country)], prop={'size': 20})
plt.title('{} Daily Increases in Confirmed Cases'.format(country), size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()
plt.figure(figsize=(16, 10))
plt.bar(x, y3)
plt.plot(x, death_increase_avg, color='red', linestyle='dashed')
plt.legend(['Moving Average {} Days'.format(window), '{} Daily Increase in Confirmed Deaths'.format(country)], prop={'size': 20})
plt.title('{} Daily Increases in Deaths'.format(country), size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()
plt.figure(figsize=(16, 10))
plt.bar(x, y4)
plt.plot(x, recovery_increase_avg, color='red', linestyle='dashed')
plt.legend(['Moving Average {} Days'.format(window), '{} Daily Increase in Confirmed Recoveries'.format(country)], prop={'size': 20})
plt.title('{} Daily Increases in Recoveries'.format(country), size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()
# helper function for getting country's cases, deaths, and recoveries
def get_country_info(country_name):
country_cases = []
country_deaths = []
country_recoveries = []
for i in dates:
country_cases.append(confirmed_df[confirmed_df['Country/Region']==country_name][i].sum())
country_deaths.append(deaths_df[deaths_df['Country/Region']==country_name][i].sum())
country_recoveries.append(recoveries_df[recoveries_df['Country/Region']==country_name][i].sum())
return (country_cases, country_deaths, country_recoveries)
def country_visualizations(country_name):
country_info = get_country_info(country_name)
country_cases = country_info[0]
country_deaths = country_info[1]
country_recoveries = country_info[2]
country_daily_increase = daily_increase(country_cases)
country_daily_death = daily_increase(country_deaths)
country_daily_recovery = daily_increase(country_recoveries)
country_plot(adjusted_dates, country_cases, country_daily_increase, country_daily_death, country_daily_recovery, country_name)
# country_cases, country_deaths, country_recoveries for each country; the full list is too long, so only the US and China are shown here
# countries = ['US', 'Russia', 'India', 'Brazil', 'South Africa', 'China', 'Italy',
# 'Germany', 'Spain', 'France', 'United Kingdom', 'Peru', 'Mexico', 'Colombia', 'Saudi Arabia', 'Iran', 'Bangladesh',
# 'Pakistan', 'Turkey', 'Philippines', 'Iraq', 'Indonesia', 'Israel', 'Ukraine', 'Ecuador', 'Bolivia', 'Netherlands']
countries = ['US','China']
for country in countries:
country_visualizations(country)
# Country Comparison
# removed redundant code
# Compare the countries below
compare_countries = ['US', 'Brazil', 'India', 'Russia', 'South Africa']
graph_name = ['Coronavirus Confirmed Cases', 'Coronavirus Confirmed Deaths', 'Coronavirus Confirmed Recoveries']
for num in range(3):
plt.figure(figsize=(16, 10))
for country in compare_countries:
plt.plot(get_country_info(country)[num])
plt.legend(compare_countries, prop={'size': 20})
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.title(graph_name[num], size=30)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()


## Predictions for confirmed coronavirus cases worldwide
## Plotting the predictions
def plot_predictions(x, y, pred, algo_name, color):
plt.figure(figsize=(16, 10))
plt.plot(x, y)
plt.plot(future_forcast, pred, linestyle='dashed', color=color)
plt.title('Worldwide Coronavirus Cases Over Time', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.legend(['Confirmed Cases', algo_name], prop={'size': 20})
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()
plot_predictions(adjusted_dates, world_cases, svm_pred, 'SVM Predictions', 'purple')
plot_predictions(adjusted_dates, world_cases, linear_pred, 'Polynomial Regression Predictions', 'orange')
plot_predictions(adjusted_dates, world_cases, bayesian_pred, 'Bayesian Ridge Regression Predictions', 'green')

# Future predictions using SVM
svm_df = pd.DataFrame({'Date': future_forcast_dates[-10:], 'SVM Predicted # of Confirmed Cases Worldwide': np.round(svm_pred[-10:])})
svm_df.style.background_gradient(cmap='Reds')
Date | SVM Predicted # of Confirmed Cases Worldwide | |
---|---|---|
0 | 10/21/2020 | 47874251.000000 |
1 | 10/22/2020 | 48390574.000000 |
2 | 10/23/2020 | 48910680.000000 |
3 | 10/24/2020 | 49434582.000000 |
4 | 10/25/2020 | 49962294.000000 |
5 | 10/26/2020 | 50493831.000000 |
6 | 10/27/2020 | 51029205.000000 |
7 | 10/28/2020 | 51568431.000000 |
8 | 10/29/2020 | 52111522.000000 |
9 | 10/30/2020 | 52658492.000000 |
# Future predictions using polynomial regression
linear_pred = linear_pred.reshape(1,-1)[0]
linear_df = pd.DataFrame({'Date': future_forcast_dates[-10:], 'Polynomial Predicted # of Confirmed Cases Worldwide': np.round(linear_pred[-10:])})
linear_df.style.background_gradient(cmap='Reds')
Date | Polynomial Predicted # of Confirmed Cases Worldwide | |
---|---|---|
0 | 10/21/2020 | 37750457.000000 |
1 | 10/22/2020 | 37917642.000000 |
2 | 10/23/2020 | 38080320.000000 |
3 | 10/24/2020 | 38238382.000000 |
4 | 10/25/2020 | 38391715.000000 |
5 | 10/26/2020 | 38540207.000000 |
6 | 10/27/2020 | 38683745.000000 |
7 | 10/28/2020 | 38822215.000000 |
8 | 10/29/2020 | 38955502.000000 |
9 | 10/30/2020 | 39083490.000000 |
# Future predictions using Bayesian Ridge
bayesian_df = pd.DataFrame({'Date': future_forcast_dates[-10:], 'Bayesian Ridge Predicted # of Confirmed Cases Worldwide': np.round(bayesian_pred[-10:])})
bayesian_df.style.background_gradient(cmap='Reds')
Date | Bayesian Ridge Predicted # of Confirmed Cases Worldwide | |
---|---|---|
0 | 10/21/2020 | 41655203.000000 |
1 | 10/22/2020 | 42000051.000000 |
2 | 10/23/2020 | 42345665.000000 |
3 | 10/24/2020 | 42692026.000000 |
4 | 10/25/2020 | 43039116.000000 |
5 | 10/26/2020 | 43386918.000000 |
6 | 10/27/2020 | 43735412.000000 |
7 | 10/28/2020 | 44084580.000000 |
8 | 10/29/2020 | 44434403.000000 |
9 | 10/30/2020 | 44784863.000000 |
# Worldwide mortality rate over time
mean_mortality_rate = np.mean(mortality_rate)
plt.figure(figsize=(16, 10))
plt.plot(adjusted_dates, mortality_rate, color='orange')
plt.axhline(y = mean_mortality_rate,linestyle='--', color='black')
plt.title('Worldwide Mortality Rate of Coronavirus Over Time', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('Case Mortality Rate', size=30)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()
# Worldwide recovery rate over time
mean_recovery_rate = np.mean(recovery_rate)
plt.figure(figsize=(16, 10))
plt.plot(adjusted_dates, recovery_rate, color='blue')
plt.title('Worldwide Recovery Rate of Coronavirus Over Time', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('Case Recovery Rate', size=30)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()
# Compare total deaths and recoveries over time
plt.figure(figsize=(16, 10))
plt.plot(adjusted_dates, total_deaths, color='r')
plt.plot(adjusted_dates, total_recovered, color='green')
plt.legend(['death', 'recoveries'], loc='best', fontsize=25)
plt.title('Worldwide Coronavirus Cases', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()
# Deaths vs. recoveries
plt.figure(figsize=(16, 10))
plt.plot(total_recovered, total_deaths)
plt.title('# of Coronavirus Deaths vs. # of Coronavirus Recoveries', size=30)
plt.xlabel('# of Coronavirus Recoveries', size=30)
plt.ylabel('# of Coronavirus Deaths', size=30)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()
# Gather confirmed-case information by country/region
unique_countries = list(latest_data['Country_Region'].unique())
country_confirmed_cases = []
country_death_cases = []
country_active_cases = []
country_recovery_cases = []
country_incidence_rate = []
country_mortality_rate = []
no_cases = []
for i in unique_countries:
cases = latest_data[latest_data['Country_Region']==i]['Confirmed'].sum()
if cases > 0:
country_confirmed_cases.append(cases)
else:
no_cases.append(i)
for i in no_cases:
unique_countries.remove(i)
# sort countries by the number of confirmed cases
unique_countries = [k for k, v in sorted(zip(unique_countries, country_confirmed_cases), key=operator.itemgetter(1), reverse=True)]
for i in range(len(unique_countries)):
country_confirmed_cases[i] = latest_data[latest_data['Country_Region']==unique_countries[i]]['Confirmed'].sum()
country_death_cases.append(latest_data[latest_data['Country_Region']==unique_countries[i]]['Deaths'].sum())
country_recovery_cases.append(latest_data[latest_data['Country_Region']==unique_countries[i]]['Recovered'].sum())
country_active_cases.append(latest_data[latest_data['Country_Region']==unique_countries[i]]['Active'].sum())
country_incidence_rate.append(latest_data[latest_data['Country_Region']==unique_countries[i]]['Incidence_Rate'].sum())
country_mortality_rate.append(country_death_cases[i]/country_confirmed_cases[i])
country_df = pd.DataFrame({'Country Name': unique_countries, 'Number of Confirmed Cases': country_confirmed_cases,
'Number of Deaths': country_death_cases, 'Number of Recoveries' : country_recovery_cases,
'Number of Active Cases' : country_active_cases, 'Incidence Rate' : country_incidence_rate,
'Mortality Rate': country_mortality_rate})
# number of cases per country/region
country_df.style.background_gradient(cmap='Oranges')
Country Name | Number of Confirmed Cases | Number of Deaths | Number of Recoveries | Number of Active Cases | Incidence Rate | Mortality Rate | |
---|---|---|---|---|---|---|---|
0 | US | 8212981 | 220119 | 3272603 | 4720260.000000 | 7590091.746066 | 0.026801 |
1 | India | 7597063 | 115197 | 6733328 | 748538.000000 | 26859.671287 | 0.015163 |
2 | Brazil | 5250727 | 154176 | 4526393 | 570158.000000 | 90720.527859 | 0.029363 |
3 | Russia | 1406667 | 24205 | 1070920 | 311542.000000 | 83205.304230 | 0.017207 |
4 | Argentina | 1002662 | 26716 | 803965 | 171981.000000 | 2218.486032 | 0.026645 |
5 | Spain | 974449 | 33992 | 150376 | 790081.000000 | 38939.187511 | 0.034883 |
6 | Colombia | 965883 | 29102 | 867961 | 68820.000000 | 55190.061088 | 0.030130 |
7 | France | 952600 | 33647 | 109611 | 809342.000000 | 13307.247668 | 0.035321 |
8 | Peru | 868675 | 33759 | 784056 | 50860.000000 | 65217.342575 | 0.038863 |
9 | Mexico | 854926 | 86338 | 727759 | 40829.000000 | 22262.251352 | 0.100989 |
10 | United Kingdom | 744122 | 43816 | 2613 | 697693.000000 | 10559.859426 | 0.058883 |
unique_provinces = list(latest_data['Province_State'].unique())
province_confirmed_cases = []
province_country = []
province_death_cases = []
# province_recovery_cases = []
province_active = []
province_incidence_rate = []
province_mortality_rate = []
no_cases = []
for i in unique_provinces:
cases = latest_data[latest_data['Province_State']==i]['Confirmed'].sum()
if cases > 0:
province_confirmed_cases.append(cases)
else:
no_cases.append(i)
# remove areas with no confirmed cases
for i in no_cases:
unique_provinces.remove(i)
unique_provinces = [k for k, v in sorted(zip(unique_provinces, province_confirmed_cases), key=operator.itemgetter(1), reverse=True)]
for i in range(len(unique_provinces)):
province_confirmed_cases[i] = latest_data[latest_data['Province_State']==unique_provinces[i]]['Confirmed'].sum()
province_country.append(latest_data[latest_data['Province_State']==unique_provinces[i]]['Country_Region'].unique()[0])
province_death_cases.append(latest_data[latest_data['Province_State']==unique_provinces[i]]['Deaths'].sum())
# province_recovery_cases.append(latest_data[latest_data['Province_State']==unique_provinces[i]]['Recovered'].sum())
province_active.append(latest_data[latest_data['Province_State']==unique_provinces[i]]['Active'].sum())
province_incidence_rate.append(latest_data[latest_data['Province_State']==unique_provinces[i]]['Incidence_Rate'].sum())
province_mortality_rate.append(province_death_cases[i]/province_confirmed_cases[i])
# keep only the top 100 provinces/states
province_limit = 100
province_df = pd.DataFrame({'Province/State Name': unique_provinces[:province_limit], 'Country': province_country[:province_limit], 'Number of Confirmed Cases': province_confirmed_cases[:province_limit],
'Number of Deaths': province_death_cases[:province_limit],'Number of Active Cases' : province_active[:province_limit],
'Incidence Rate' : province_incidence_rate[:province_limit], 'Mortality Rate': province_mortality_rate[:province_limit]})
# number of cases per country/region
province_df.style.background_gradient(cmap='Oranges')
Province/State Name | Country | Number of Confirmed Cases | Number of Deaths | Number of Active Cases | Incidence Rate | Mortality Rate | |
---|---|---|---|---|---|---|---|
0 | Maharashtra | India | 1601365 | 42240 | 174246.000000 | 1300.397989 | 0.026377 |
1 | Sao Paulo | Brazil | 1064039 | 38035 | 117805.000000 | 2317.206090 | 0.035746 |
2 | California | US | 879645 | 16982 | 862663.000000 | 101246.390300 | 0.019306 |
3 | Texas | US | 856948 | 17481 | 839467.000000 | 635233.906196 | 0.020399 |
4 | Andhra Pradesh | India | 786050 | 6453 | 35065.000000 | 1458.256997 | 0.008209 |
5 | Karnataka | India | 770604 | 10542 | 106233.000000 | 1140.576323 | 0.013680 |
6 | Florida | US | 756727 | 16021 | 740706.000000 | 265466.585181 | 0.021171 |
7 | Tamil Nadu | India | 690936 | 10691 | 38093.000000 | 887.621729 | 0.015473 |
8 | England | United Kingdom | 629211 | 38783 | 590428.000000 | 1124.048720 | 0.061638 |
9 | New York | US | 485279 | 33366 | 451913.000000 | 74530.743741 | 0.068756 |
10 | Uttar Pradesh | India | 456865 | 6685 | 31495.000000 | 192.054719 | 0.014632 |
# return the data table with province/state info for a given country
def country_table(country_name):
states = list(latest_data[latest_data['Country_Region']==country_name]['Province_State'].unique())
state_confirmed_cases = []
state_death_cases = []
# state_recovery_cases = []
state_active = []
state_incidence_rate = []
state_mortality_rate = []
no_cases = []
for i in states:
cases = latest_data[latest_data['Province_State']==i]['Confirmed'].sum()
if cases > 0:
state_confirmed_cases.append(cases)
else:
no_cases.append(i)
# remove areas with no confirmed cases
for i in no_cases:
states.remove(i)
states = [k for k, v in sorted(zip(states, state_confirmed_cases), key=operator.itemgetter(1), reverse=True)]
for i in range(len(states)):
state_confirmed_cases[i] = latest_data[latest_data['Province_State']==states[i]]['Confirmed'].sum()
state_death_cases.append(latest_data[latest_data['Province_State']==states[i]]['Deaths'].sum())
# state_recovery_cases.append(latest_data[latest_data['Province_State']==states[i]]['Recovered'].sum())
state_active.append(latest_data[latest_data['Province_State']==states[i]]['Active'].sum())
state_incidence_rate.append(latest_data[latest_data['Province_State']==states[i]]['Incidence_Rate'].sum())
state_mortality_rate.append(state_death_cases[i]/state_confirmed_cases[i])
state_df = pd.DataFrame({'State Name': states, 'Number of Confirmed Cases': state_confirmed_cases,
'Number of Deaths': state_death_cases, 'Number of Active Cases' : state_active,
'Incidence Rate' : state_incidence_rate, 'Mortality Rate': state_mortality_rate})
# number of cases per country/region
return state_df
# US data
us_table = country_table('US')
us_table.style.background_gradient(cmap='Oranges')
State Name | Number of Confirmed Cases | Number of Deaths | Number of Active Cases | Incidence Rate | Mortality Rate | |
---|---|---|---|---|---|---|
0 | California | 879645 | 16982 | 862663.000000 | 101246.390300 | 0.019306 |
1 | Texas | 856948 | 17481 | 839467.000000 | 635233.906196 | 0.020399 |
2 | Florida | 756727 | 16021 | 740706.000000 | 265466.585181 | 0.021171 |
3 | New York | 485279 | 33366 | 451913.000000 | 74530.743741 | 0.068756 |
4 | Illinois | 350744 | 9496 | 341248.000000 | 222737.038481 | 0.027074 |
5 | Georgia | 341310 | 7657 | 333653.000000 | 545676.734025 | 0.022434 |
6 | North Carolina | 247172 | 3939 | 243233.000000 | 241343.144415 | 0.015936 |
7 | Tennessee | 232061 | 2922 | 229139.000000 | 349619.782941 | 0.012592 |
8 | Arizona | 231897 | 5830 | 226068.000000 | 49228.321751 | 0.025140 |
9 | New Jersey | 221205 | 16214 | 204991.000000 | 46343.379793 | 0.073299 |
10 | Pennsylvania | 188381 | 8475 | 179906.000000 | 66883.526616 | 0.044989 |
# China data
china_table = country_table('China')
china_table.style.background_gradient(cmap='Oranges')
State Name | Number of Confirmed Cases | Number of Deaths | Number of Active Cases | Incidence Rate | Mortality Rate | |
---|---|---|---|---|---|---|
0 | Hubei | 68139 | 4512 | 0.000000 | 115.158019 | 0.066218 |
1 | Hong Kong | 5256 | 105 | 169.000000 | 70.108155 | 0.019977 |
2 | Guangdong | 1889 | 8 | 40.000000 | 1.664904 | 0.004235 |
3 | Zhejiang | 1283 | 1 | 3.000000 | 2.236360 | 0.000779 |
4 | Henan | 1281 | 22 | 3.000000 | 1.333680 | 0.017174 |
5 | Shanghai | 1095 | 7 | 75.000000 | 4.517327 | 0.006393 |
6 | Hunan | 1019 | 4 | 0.000000 | 1.477026 | 0.003925 |
7 | Anhui | 991 | 6 | 0.000000 | 1.567046 | 0.006054 |
8 | Heilongjiang | 948 | 13 | 0.000000 | 2.512589 | 0.013713 |
9 | Beijing | 938 | 9 | 2.000000 | 4.354689 | 0.009595 |
10 | Jiangxi | 935 | 1 | 0.000000 | 2.011618 | 0.001070 |
total_world_cases = np.sum(country_confirmed_cases)
us_confirmed = latest_data[latest_data['Country_Region']=='US']['Confirmed'].sum()
outside_us_confirmed = total_world_cases - us_confirmed
plt.figure(figsize=(16, 9))
plt.barh('United States', us_confirmed)
plt.barh('Outside United States', outside_us_confirmed)
plt.title('# of Total Coronavirus Confirmed Cases', size=20)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()
plt.figure(figsize=(16, 9))
plt.barh('United States', us_confirmed/total_world_cases)
plt.barh('Outside United States', outside_us_confirmed/total_world_cases)
plt.title('# of Coronavirus Confirmed Cases Expressed in Percentage', size=20)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()

# Only show the 10 countries with the most confirmed cases; the rest are grouped into the 'Others' category
visual_unique_countries = []
visual_confirmed_cases = []
others = np.sum(country_confirmed_cases[10:])
for i in range(len(country_confirmed_cases[:10])):
visual_unique_countries.append(unique_countries[i])
visual_confirmed_cases.append(country_confirmed_cases[i])
visual_unique_countries.append('Others')
visual_confirmed_cases.append(others)
def plot_bar_graphs(x, y, title):
plt.figure(figsize=(16, 12))
plt.barh(x, y)
plt.title(title, size=20)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()
# good for a lot of x values
def plot_bar_graphs_tall(x, y, title):
plt.figure(figsize=(19, 18))
plt.barh(x, y)
plt.title(title, size=25)
plt.xticks(size=25)
plt.yticks(size=25)
plt.show()
plot_bar_graphs(visual_unique_countries, visual_confirmed_cases, '# of Covid-19 Confirmed Cases in Countries/Regions')
log_country_confirmed_cases = [math.log10(i) for i in visual_confirmed_cases]
plot_bar_graphs(visual_unique_countries, log_country_confirmed_cases, 'Common Log # of Coronavirus Confirmed Cases in Countries/Regions')
# Only show 10 provinces with the most confirmed cases, the rest are grouped into the other category
visual_unique_provinces = []
visual_confirmed_cases2 = []
others = np.sum(province_confirmed_cases[10:])
for i in range(len(province_confirmed_cases[:10])):
visual_unique_provinces.append(unique_provinces[i])
visual_confirmed_cases2.append(province_confirmed_cases[i])
visual_unique_provinces.append('Others')
visual_confirmed_cases2.append(others)
plot_bar_graphs(visual_unique_provinces, visual_confirmed_cases2, '# of Coronavirus Confirmed Cases in Provinces/States')
log_province_confirmed_cases = [math.log10(i) for i in visual_confirmed_cases2]
plot_bar_graphs(visual_unique_provinces, log_province_confirmed_cases, 'Log of # of Coronavirus Confirmed Cases in Provinces/States')

## US testing data
us_medical_data.fillna(value=0, inplace=True)
def plot_us_medical_data():
states = us_medical_data['Province_State'].unique()
testing_number = []
testing_rate = []
for i in states:
testing_number.append(us_medical_data[us_medical_data['Province_State']==i]['People_Tested'].sum())
testing_rate.append(us_medical_data[us_medical_data['Province_State']==i]['Testing_Rate'].max())
# sort the states; only the top 30 are plotted below
testing_states = [k for k, v in sorted(zip(states, testing_number), key=operator.itemgetter(1), reverse=True)]
testing_rate_states = [k for k, v in sorted(zip(states, testing_rate), key=operator.itemgetter(1), reverse=True)]
for i in range(len(states)):
testing_number[i] = us_medical_data[us_medical_data['Province_State']==testing_states[i]]['People_Tested'].sum()
testing_rate[i] = us_medical_data[us_medical_data['Province_State']==testing_rate_states[i]]['Testing_Rate'].sum()
top_limit = 30
plot_bar_graphs_tall(testing_states[:top_limit], testing_number[:top_limit], 'Total Testing per State (Top 30)')
plot_bar_graphs_tall(testing_rate_states[:top_limit], testing_rate[:top_limit], 'Testing Rate per 100,000 People (Top 30)')
plot_us_medical_data()

## Mobility data
def get_mobility_by_state(transport_type, state, day):
return apple_mobility[apple_mobility['sub-region']==state][apple_mobility['transportation_type']==transport_type].sum()[day]
get_mobility_by_state('walking', 'Connecticut', '2020-07-30')
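The chained boolean indexing in get_mobility_by_state is one reason warnings are silenced at the top of the notebook. Below is an equivalent lookup with a single combined mask, as a sketch; get_mobility_by_state_v2 is my own name, not part of the original notebook.
def get_mobility_by_state_v2(transport_type, state, day):
    # combine both conditions into one mask, then sum the requested day's column
    mask = (apple_mobility['sub-region'] == state) & (apple_mobility['transportation_type'] == transport_type)
    return apple_mobility.loc[mask, day].sum()
get_mobility_by_state_v2('walking', 'Connecticut', '2020-07-30')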
revised_dates = []
for i in range(len(dates)):
revised_dates.append(datetime.datetime.strptime(dates[i], '%m/%d/%y').strftime('%Y-%m-%d'))
def weekday_or_weekend(date):
date_obj = datetime.datetime.strptime(date, '%Y-%m-%d')
day_of_the_week = date_obj.weekday()
if (day_of_the_week+1) % 6 == 0 or (day_of_the_week+1) % 7 == 0:
return True
else:
return False
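The modulo check above is just a roundabout test for Saturday or Sunday (equivalent to weekday() >= 5); a quick sanity check on two known dates:
print(weekday_or_weekend('2020-07-30'))   # False -- a Thursday
print(weekday_or_weekend('2020-08-01'))   # True  -- a Saturday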
revised_day_since_1_22 = [i for i in range(len(revised_dates))]
import matplotlib.dates as mdates
states = ['New York', 'Connecticut', 'Florida', 'California', 'Texas', 'Georgia', 'Arizona', 'Illinois', 'Louisiana', 'Ohio',
'Tennessee', 'North Carolina', 'South Carolina', 'Alabama', 'Missouri', 'Kansas', 'Pennsylvania', 'Wisconsin', 'Virginia', 'Massachusetts', 'Utah', 'Minnesota',
'Oklahoma', 'Iowa', 'Arkansas', 'Kentucky', 'Puerto Rico', 'Colorado', 'New Jersey', 'Idaho', 'Nevada', 'Maryland']
states.sort()
# making sure the dates are in sync
mobility_latest_date = apple_mobility.columns[-1]
mobility_latest_index = revised_dates.index(mobility_latest_date)
for state in states:
# weekend and weekday mobility are separated
weekday_mobility = []
weekday_mobility_dates = []
weekend_mobility = []
weekend_mobility_dates = []
for i in range(len(revised_dates)):
if i <= mobility_latest_index:
if weekday_or_weekend(revised_dates[i]):
weekend_mobility.append(get_mobility_by_state('walking', state, revised_dates[i]))
weekend_mobility_dates.append(i)
else:
weekday_mobility.append(get_mobility_by_state('walking', state, revised_dates[i]))
weekday_mobility_dates.append(i)
else:
pass
# remove null values (they are counted as 0)
for i in range(len(weekend_mobility)):
if weekend_mobility[i] == 0 and i != 0:
weekend_mobility[i] = weekend_mobility[i-1]
elif weekend_mobility[i] == 0 and i == 0:
weekend_mobility[i] = weekend_mobility[i+1]
else:
pass
for i in range(len(weekday_mobility)):
if weekday_mobility[i] == 0 and i != 0:
weekday_mobility[i] = weekday_mobility[i-1]
elif weekday_mobility[i] == 0 and i == 0:
weekday_mobility[i] = weekday_mobility[i+1]
else:
pass
weekday_mobility_average = moving_average(weekday_mobility, 7)
weekend_mobility_average = moving_average(weekend_mobility, 7)
plt.figure(figsize=(16, 10))
plt.bar(weekday_mobility_dates, weekday_mobility, color='cornflowerblue')
plt.plot(weekday_mobility_dates, weekday_mobility_average, color='green')
plt.bar(weekend_mobility_dates, weekend_mobility, color='salmon')
plt.plot(weekend_mobility_dates, weekend_mobility_average, color='black')
plt.legend(['Moving average (7 days) weekday mobility', 'Moving Average (7 days) weekend mobility', 'Weekday mobility', 'Weekend mobility'], prop={'size': 25})
plt.title('{} Walking Mobility Data'.format(state), size=25)
plt.xlabel('Days since 1/22', size=25)
plt.ylabel('Mobility Value', size=25)
plt.xticks(size=25)
plt.yticks(size=25)
plt.show()