最近工作上接触到一些Covid-19新冠病毒一些信息,平时也常常看到国内外各个地区确诊患者不断的增加,但是还有一些我想了解的数据并没有在一些主流媒体上呈现出来;因此想通过自己的方式来可视化这些数据,依次解决以下问题:
如果对于自己爬虫比较熟练的人,可以考虑爬取国内腾讯或者丁香园上的数据(我未尝试。。。);还可以用github上整理并每日更新的Covid-19数据,网址:https://github.com/CSSEGISandData/COVID-19,在此Github上提供了一个Covid-19数据可视化网站https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6,值得学习下~
我选择了直接读取github上的数据(省事并且效率),其大致分了三大类数据:
- 确诊数据:
time_series_covid19_confirmed_US.csv
和time_series_covid19_confirmed_global.csv
,两者区别在于前者是详细的美国各个州区的数据,后者的全球的数据(也包括美国的总数据),以下类似 - 死亡数据:
time_series_covid19_deaths_US.csv
和time_series_covid19_deaths_global.csv
- 治愈数据:
time_series_covid19_recovered_global.csv
接着通过data.table::fread
直接读取github上的数据文件,由于考虑到美国确诊数据地区过于详细,需要先对于处理再和global数据合并,我的方式如下(将每个Provinc_State下的所有数据进行加和):
library(dplyr)
library(plotly)
library(forcats)
library(tidyr)
#-------------------------------------------------------------------
# Covid19 data url
url <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/"
url_comfirmed_global <- paste0(url, "time_series_covid19_confirmed_global.csv")
url_comfirmed_US <- paste0(url, "time_series_covid19_confirmed_US.csv")
url_death_global <- paste0(url, "time_series_covid19_deaths_global.csv")
url_recovered_global <- paste0(url, "time_series_covid19_recovered_global.csv")
#-------------------------------------------------------------------
# Input data
data_comfirmed_global <- data.table::fread(url_comfirmed_global) %>%
rename(Province_State = `Province/State`, Country_Region = `Country/Region`)
data_comfirmed_US_all <- data.table::fread(url_comfirmed_US) %>%
rename(Long = Long_) %>%
select(one_of(names(data_comfirmed_global)))
data_comfirmed_US_part1 <- data_comfirmed_US_all %>%
group_by(Country_Region, Province_State) %>%
summarise_at((5 - 2):(ncol(.) - 2), sum)
data_comfirmed_US_part2 <- data_comfirmed_US_all %>%
group_by(Country_Region, Province_State) %>%
summarise_at(1:2, median)
data_comfirmed_US <- left_join(data_comfirmed_US_part2, data_comfirmed_US_part1)
data_comfirmed <- bind_rows(filter(data_comfirmed_global, Country_Region != "US"), data_comfirmed_US) %>%
mutate(Total = .[[ncol(.)]])
data_death_global <- data.table::fread(url_death_global) %>%
rename(Province_State = `Province/State`, Country_Region = `Country/Region`)
data_recovered_global <- data.table::fread(url_recovered_global) %>%
rename(Province_State = `Province/State`, Country_Region = `Country/Region`)
以下示例以确诊数据为例
Plotly map visualization
Map可视化在R中有不少的包可以实现,简单罗列下有如下几个:
- plotly
- leaflet
- mapdeck
- mapview
- mapdeck
- ggmap
- tmap
基于我需要有一定的交互并且能与shiny结合的功能,我选择了plotly,其关于map的示例可看:https://plotly.com/r/maps/
以scattermapbox为例,此可视化方式需要Mapbox的账户,并且加载其token,参考:https://plotly.com/r/scattermapbox/
#-------------------------------------------------------------------
# Mapbox Setup
Sys.setenv("MAPBOX_TOKEN" = "your token")
#-------------------------------------------------------------------
data_comfirmed$Size <- ifelse(data_comfirmed$Total > 200000, 90,
ifelse(data_comfirmed$Total > 100000, 60,
ifelse(data_comfirmed$Total > 50000, 45,
ifelse(data_comfirmed$Total > 15000, 30,
ifelse(data_comfirmed$Total > 8000, 24,
ifelse(data_comfirmed$Total > 4000, 16,
ifelse(data_comfirmed$Total > 2000, 8,
ifelse(data_comfirmed$Total > 1000, 4,
ifelse(data_comfirmed$Total > 500, 2, 1)
)
)
)
)
)
)
)
)
data_comfirmed$Size_title <- ifelse(data_comfirmed$Size == 90, "> 200000",
ifelse(data_comfirmed$Size == 60, "> 100000-200000",
ifelse(data_comfirmed$Size == 45, "> 50000-100000",
ifelse(data_comfirmed$Size == 30, "> 15000-50000",
ifelse(data_comfirmed$Size == 24, "> 8000-15000",
ifelse(data_comfirmed$Size == 16, "> 4000-8000",
ifelse(data_comfirmed$Size == 8, "> 2000-4000",
ifelse(data_comfirmed$Size == 4, "> 1000-2000",
ifelse(data_comfirmed$Size == 2, "> 500-1000", "0-500")
)
)
)
)
)
)
)
)
data_comfirmed$Size_title <- fct_relevel(data_comfirmed$Size_title, c("0-500", "> 500-1000", "> 1000-2000", "> 2000-4000", "> 4000-8000", "> 8000-15000", "> 15000-50000", "> 50000-100000", "> 100000-200000", "> 200000"))
fig <- data_comfirmed %>%
plot_mapbox(
lat = ~Lat, lon = ~Long,
split = ~Size_title,
color = I("red"),
size = ~Size,
text = ~ paste(Total, Country_Region, Province_State, sep = ", "),
mode = "scattermapbox", hoverinfo = "text"
) %>%
layout(
title = list(text = "COVID19 by Country", x = 0),
font = list(color = "white"),
plot_bgcolor = "#191A1A", paper_bgcolor = "#191A1A",
mapbox = list(style = "dark"), # carto-positron, dark
# showlegend = F,
legend = list(
orientation = "h",
font = list(size = 8)
),
margin = list(
l = 25, r = 25,
b = 25, t = 25,
pad = 2
)
) %>%
config(mapboxAccessToken = Sys.getenv("MAPBOX_TOKEN"))
fig
通过Mapbox Choropleth Maps可以展示每个国家的新增确诊人数(类似于heatmap),参考:https://plotly.com/r/mapbox-county-choropleth/;另还需要对确诊数据做些转化,然后增加国家code以及新增人数变量;增加国家code可以用到countrycode
包
data_comfirmed_newcase <- data.table::fread(url_comfirmed_global) %>%
rename(Province_State = `Province/State`, Country_Region = `Country/Region`) %>%
group_by(Country_Region) %>%
summarise_at((5 - 1):(ncol(.) - 1), sum) %>%
mutate(
Code = countrycode(Country_Region, origin = "country.name", destination = "iso3c"),
Total = .[[ncol(.)]],
Newcase = .[[ncol(.)]] - .[[ncol(.) - 1]]
)
对于Mapbox Choropleth Maps而言,我们还需要一个全球的geojson文件,对于此文件的获取,我直接从kaggle一个比赛的数据中下载并调用了,下载地址: https://github.com/datasets/geo-countries
#----------------------------------------------------------------
cty <- rjson::fromJSON(file = "../Desktop/world-countries/world-countries.json")
fig <- plot_ly() %>%
add_trace(
type = "choroplethmapbox",
geojson = cty,
locations = data_comfirmed_newcase$Code,
z = data_comfirmed_newcase$Newcase,
colorscale = "Portland",
zmin = 0,
zmax = 10000,
marker = list(
line = list(
width = 0
),
opacity = 0.5
)
) %>%
layout(
title = list(text = "COVID19 by Country", x = 0),
font = list(color = "white"),
plot_bgcolor = "#191A1A", paper_bgcolor = "#191A1A",
mapbox = list(
style = "dark",
zoom = 1,
center = list(lon = -30, lat = 30)
),
legend = list(
x = 0.3, y = -2,
orientation = "h",
font = list(size = 8)
),
margin = list(
l = 25, r = 25,
b = 25, t = 25,
pad = 2
)
) %>%
config(mapboxAccessToken = Sys.getenv("MAPBOX_TOKEN"))
fig
如果不用mapbox的话,也可以展示一个比较简洁的map图(Choropleth maps),参考:https://plotly.com/r/choropleth-maps/,此时不一定需要geojson文件,只需要经纬度坐标和国家code即可
fig <- plot_geo(data_comfirmed_newcase) %>%
add_trace(
z = ~Newcase, color = ~Newcase, colors = "Blues",
text = ~Country_Region, locations = ~Code, marker = list(line = list(color = toRGB("grey"), width = 0.5))
) %>%
layout(
title = 'COVID19 by Country',
geo = list(
showframe = FALSE,
showcoastlines = FALSE,
projection = list(type = "Mercator")
)
)
fig
如果想看下城市之间的确诊人数密度分布,可以参考:https://plotly.com/r/mapbox-density-heatmaps/,主要用到城市对应的经纬度参数以及城市名称
fig <- data_comfirmed %>%
plot_ly(
type = 'densitymapbox',
lat = ~Lat,
lon = ~Long,
coloraxis = 'coloraxis',
radius = 12
) %>%
layout(
mapbox = list(
style="stamen-terrain",
center= list(lon=180)), coloraxis = list(colorscale = "Viridis")
)
fig
Data visualization-animation
比如我想看下随着时间的变化,各国确诊人数以及新增确诊人数的变化趋势,以气泡图展示;对于此问题,可以通过动态气泡图来实现,刚好plotly也可以实现
首先需要数据进行转化:
- 每日确诊总人数转化为每日新增人数,用
diff()
函数即可 - 挑选出确诊人数Top10的国家
- 将Year变量从行变成列
- 合并确诊人数和新增确诊人数数据
数据转化如下:
#---------------------------------------------------------------------
data_allcountry_tf <- data.table::fread(url_comfirmed_global) %>%
rename(Province_State = `Province/State`, Country_Region = `Country/Region`) %>%
group_by(Country_Region) %>%
summarise_at((5 - 1):(ncol(.) - 1), sum)
top_country <- data_allcountry_tf %>%
mutate(Total = .[[ncol(.)]]) %>%
arrange(desc(Total)) %>%
head(10) %>% .$Country_Region
data_allcountry_comfirmed <- data_allcountry_tf %>% gather("year", "comfirm", colnames(.)[2:ncol(.)])
data_allcountry_newcase <- plyr::ddply(data_allcountry_tf, "Country_Region", function(x) {
diff(c(0, as.numeric(x[2:ncol(x)])))
})
names(data_allcountry_newcase)[2:ncol(data_allcountry_newcase)] <- colnames(data_allcountry_tf)[2:ncol(data_allcountry_tf)]
data_allcountry_newcase <- data_allcountry_newcase %>% gather("year", "newcase", colnames(.)[2:ncol(.)])
data_allcountry <- left_join(data_allcountry_comfirmed, data_allcountry_newcase) %>%
filter(Country_Region %in% top_country)
data_allcountry$year <- factor(data_allcountry$year, unique(data_allcountry$year))
data_allcountry$Size <- ifelse(data_allcountry$comfirm > 200000, 180,
ifelse(data_allcountry$comfirm > 100000, 120,
ifelse(data_allcountry$comfirm > 50000, 90,
ifelse(data_allcountry$comfirm > 15000, 60,
ifelse(data_allcountry$comfirm > 8000, 45,
ifelse(data_allcountry$comfirm > 4000, 35,
ifelse(data_allcountry$comfirm > 2000, 16,
ifelse(data_allcountry$comfirm > 1000, 8,
ifelse(data_allcountry$comfirm > 500, 4, 2)
)
)
)
)
)
)
)
)
然后通过plotly
的animation_opts()
,animation_button()
以及animation_slider()
函数绘制动画气泡图
pty <- data_allcountry %>%
plot_ly(x = ~comfirm, y = ~newcase) %>%
layout(showlegend = F) %>%
add_text(
x = ~comfirm,
y = ~newcase,
text = ~Country_Region,
frame = ~year,
size = I(18),
textposition = "top right"
) %>%
add_markers(size = ~ I(comfirm / 1000), sizes = c(10000, 1000), color = ~Country_Region, frame = ~year, ids = ~Country_Region) %>%
animation_opts(400, easing = "linear", redraw = FALSE) %>%
animation_button(
x = 1, xanchor = "right", y = 0, yanchor = "bottom"
) %>%
animation_slider(
currentvalue = list(prefix = "YEAR", font = list(color = "red"))
)
pty
看起来确实略微有点简陋,但是Plotly的R库中还有其他可视化方式,比如我想用饼图来展示各个大洲下各国的死亡人数,如下:
library(plotly)
library(dplyr)
library(countrycode)
# Country to continent code
cty2ctt <- data.table::fread("https://pkgstore.datahub.io/JohnSnowLabs/country-and-continent-codes-list/country-and-continent-codes-list-csv_csv/data/b7876b7f496677669644f3d1069d3121/country-and-continent-codes-list-csv_csv.csv")
# Input world death data
death_df <- data.table::fread(url_death_global) %>%
rename(Province_State = `Province/State`, Country_Region = `Country/Region`) %>%
group_by(Country_Region) %>%
summarise_at((5 - 1):(ncol(.) - 1), sum) %>%
mutate(Total = .[[ncol(.)]],
Code = countrycode(Country_Region, origin = "country.name", destination = "iso2c")) %>%
left_join(cty2ctt, by = c("Code" = "Two_Letter_Country_Code")) %>%
filter(!is.na(Continent_Name)) %>%
dplyr::distinct(Country_Region, .keep_all = TRUE) %>%
select(one_of(c("Country_Region", "Continent_Name", "Total")))
death_top <- arrange(death_df, desc(Total)) %>% head(40)
# Combined data for pie graph visual
temp <- dplyr::group_by(death_top, Continent_Name) %>%
dplyr::summarise(Total = sum(as.numeric(Total)))
temp <- data.frame(labels = temp$Continent_Name, parents = "World", values = temp$Total, stringsAsFactors = F)
world <- data.frame(labels = "World", parents = "", values = sum(temp$values), stringsAsFactors = F)
names(death_top) <- c("labels", "parents", "values")
death_res <- rbind(world, temp, death_top)
death_res <- as.data.frame(sapply(death_res, function(x){as.factor(x)} ))
fig <- plot_ly(
labels = death_res$labels,
parents = death_res$parents,
values = death_res$values,
type = 'sunburst',
branchvalues = 'total'
)
fig
其他还有很多可视化的方法,比如展现各国确诊人数的时序变化曲线图,各国峰值变化图等等;以及不同的数据来源,如死亡数据/治愈数据等等,均可参考Plotly的R库:https://plotly.com/r/
本文出自于http://www.bioinfo-scrounger.com转载请注明出处