tidyverse是一组处理与可视化R包的集合,其中ggplot2dplyr最广为人知。

核心包有以下一些:

  • ggplot2 - 可视化数据
  • dplyr - 数据操作语法,可以用它解决大部分数据处理问题
  • tidyr - 清理数据
  • readr - 读入表格数据
  • purrr - 提供一个完整一致的工具集增强R的函数编程
  • tibble - 新一代数据框
  • stringr - 提供函数集用来处理字符数据
  • forcats - 提供有用工具用来处理因子问题

有几个包没接触过,R包太多了,这些强力包还是有必要接触和学习下使用,碰到问题事半功倍。

安装tidyverse

install.packages("tidyverse")

导入:

library(tidyverse)
## -- Attaching packages ----------------------------------------------------------------- tidyverse 1.2.1 --
## √ ggplot2 3.0.0     √ purrr   0.2.5
## √ tibble  1.4.2     √ dplyr   0.7.6
## √ tidyr   0.8.1     √ stringr 1.3.1
## √ readr   1.1.1     √ forcats 0.3.0
## Warning: 程辑包'tidyr'是用R版本3.5.1 来建造的
## Warning: 程辑包'purrr'是用R版本3.5.1 来建造的
## Warning: 程辑包'dplyr'是用R版本3.5.1 来建造的
## -- Conflicts -------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

有用的函数

# tidyverse与其他包的冲突
tidyverse_conflicts()
# 列出所有tidyverse的依赖包
tidyverse_deps()
#获取tidyverse的logo
tidyverse_logo()
# 列出所有tidyverse包
tidyverse_packages()
# 更新tidyverse包
tidyverse_update()

载入数据

library(datasets)
#install.packages("gapminder")
library(gapminder)
attach(iris)

dplyr

过滤

filter()函数可以用来取数据子集。

iris %>% 
    filter(Species == "virginica") # 指定满足的行
##    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 1           6.3         3.3          6.0         2.5 virginica
## 2           5.8         2.7          5.1         1.9 virginica
## 3           7.1         3.0          5.9         2.1 virginica
## 4           6.3         2.9          5.6         1.8 virginica
## 5           6.5         3.0          5.8         2.2 virginica
## 6           7.6         3.0          6.6         2.1 virginica
## 7           4.9         2.5          4.5         1.7 virginica
## 8           7.3         2.9          6.3         1.8 virginica
## 9           6.7         2.5          5.8         1.8 virginica
## 10          7.2         3.6          6.1         2.5 virginica
## 11          6.5         3.2          5.1         2.0 virginica
## 12          6.4         2.7          5.3         1.9 virginica
## 13          6.8         3.0          5.5         2.1 virginica
## 14          5.7         2.5          5.0         2.0 virginica
## 15          5.8         2.8          5.1         2.4 virginica
## 16          6.4         3.2          5.3         2.3 virginica
## 17          6.5         3.0          5.5         1.8 virginica
## 18          7.7         3.8          6.7         2.2 virginica
## 19          7.7         2.6          6.9         2.3 virginica
## 20          6.0         2.2          5.0         1.5 virginica
## 21          6.9         3.2          5.7         2.3 virginica
## 22          5.6         2.8          4.9         2.0 virginica
## 23          7.7         2.8          6.7         2.0 virginica
## 24          6.3         2.7          4.9         1.8 virginica
## 25          6.7         3.3          5.7         2.1 virginica
## 26          7.2         3.2          6.0         1.8 virginica
## 27          6.2         2.8          4.8         1.8 virginica
## 28          6.1         3.0          4.9         1.8 virginica
## 29          6.4         2.8          5.6         2.1 virginica
## 30          7.2         3.0          5.8         1.6 virginica
## 31          7.4         2.8          6.1         1.9 virginica
## 32          7.9         3.8          6.4         2.0 virginica
## 33          6.4         2.8          5.6         2.2 virginica
## 34          6.3         2.8          5.1         1.5 virginica
## 35          6.1         2.6          5.6         1.4 virginica
## 36          7.7         3.0          6.1         2.3 virginica
## 37          6.3         3.4          5.6         2.4 virginica
## 38          6.4         3.1          5.5         1.8 virginica
## 39          6.0         3.0          4.8         1.8 virginica
## 40          6.9         3.1          5.4         2.1 virginica
## [到达getOption("max.print") -- 略过10行]]
iris %>% 
    filter(Species == "virginica", Sepal.Length > 6) # 多个条件用,分隔
##    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 1           6.3         3.3          6.0         2.5 virginica
## 2           7.1         3.0          5.9         2.1 virginica
## 3           6.3         2.9          5.6         1.8 virginica
## 4           6.5         3.0          5.8         2.2 virginica
## 5           7.6         3.0          6.6         2.1 virginica
## 6           7.3         2.9          6.3         1.8 virginica
## 7           6.7         2.5          5.8         1.8 virginica
## 8           7.2         3.6          6.1         2.5 virginica
## 9           6.5         3.2          5.1         2.0 virginica
## 10          6.4         2.7          5.3         1.9 virginica
## 11          6.8         3.0          5.5         2.1 virginica
## 12          6.4         3.2          5.3         2.3 virginica
## 13          6.5         3.0          5.5         1.8 virginica
## 14          7.7         3.8          6.7         2.2 virginica
## 15          7.7         2.6          6.9         2.3 virginica
## 16          6.9         3.2          5.7         2.3 virginica
## 17          7.7         2.8          6.7         2.0 virginica
## 18          6.3         2.7          4.9         1.8 virginica
## 19          6.7         3.3          5.7         2.1 virginica
## 20          7.2         3.2          6.0         1.8 virginica
## 21          6.2         2.8          4.8         1.8 virginica
## 22          6.1         3.0          4.9         1.8 virginica
## 23          6.4         2.8          5.6         2.1 virginica
## 24          7.2         3.0          5.8         1.6 virginica
## 25          7.4         2.8          6.1         1.9 virginica
## 26          7.9         3.8          6.4         2.0 virginica
## 27          6.4         2.8          5.6         2.2 virginica
## 28          6.3         2.8          5.1         1.5 virginica
## 29          6.1         2.6          5.6         1.4 virginica
## 30          7.7         3.0          6.1         2.3 virginica
## 31          6.3         3.4          5.6         2.4 virginica
## 32          6.4         3.1          5.5         1.8 virginica
## 33          6.9         3.1          5.4         2.1 virginica
## 34          6.7         3.1          5.6         2.4 virginica
## 35          6.9         3.1          5.1         2.3 virginica
## 36          6.8         3.2          5.9         2.3 virginica
## 37          6.7         3.3          5.7         2.5 virginica
## 38          6.7         3.0          5.2         2.3 virginica
## 39          6.3         2.5          5.0         1.9 virginica
## 40          6.5         3.0          5.2         2.0 virginica
## [到达getOption("max.print") -- 略过1行]]

排序

arrange()函数用来对观察值排序,默认是升序。

iris %>% 
    arrange(Sepal.Length)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            4.3         3.0          1.1         0.1     setosa
## 2            4.4         2.9          1.4         0.2     setosa
## 3            4.4         3.0          1.3         0.2     setosa
## 4            4.4         3.2          1.3         0.2     setosa
## 5            4.5         2.3          1.3         0.3     setosa
## 6            4.6         3.1          1.5         0.2     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            4.6         3.6          1.0         0.2     setosa
## 9            4.6         3.2          1.4         0.2     setosa
## 10           4.7         3.2          1.3         0.2     setosa
## 11           4.7         3.2          1.6         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.8         3.4          1.9         0.2     setosa
## 15           4.8         3.1          1.6         0.2     setosa
## 16           4.8         3.0          1.4         0.3     setosa
## 17           4.9         3.0          1.4         0.2     setosa
## 18           4.9         3.1          1.5         0.1     setosa
## 19           4.9         3.1          1.5         0.2     setosa
## 20           4.9         3.6          1.4         0.1     setosa
## 21           4.9         2.4          3.3         1.0 versicolor
## 22           4.9         2.5          4.5         1.7  virginica
## 23           5.0         3.6          1.4         0.2     setosa
## 24           5.0         3.4          1.5         0.2     setosa
## 25           5.0         3.0          1.6         0.2     setosa
## 26           5.0         3.4          1.6         0.4     setosa
## 27           5.0         3.2          1.2         0.2     setosa
## 28           5.0         3.5          1.3         0.3     setosa
## 29           5.0         3.5          1.6         0.6     setosa
## 30           5.0         3.3          1.4         0.2     setosa
## 31           5.0         2.0          3.5         1.0 versicolor
## 32           5.0         2.3          3.3         1.0 versicolor
## 33           5.1         3.5          1.4         0.2     setosa
## 34           5.1         3.5          1.4         0.3     setosa
## 35           5.1         3.8          1.5         0.3     setosa
## 36           5.1         3.7          1.5         0.4     setosa
## 37           5.1         3.3          1.7         0.5     setosa
## 38           5.1         3.4          1.5         0.2     setosa
## 39           5.1         3.8          1.9         0.4     setosa
## 40           5.1         3.8          1.6         0.2     setosa
## [到达getOption("max.print") -- 略过110行]]

iris %>% 
    arrange(desc(Sepal.Length)) # 降序
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            7.9         3.8          6.4         2.0  virginica
## 2            7.7         3.8          6.7         2.2  virginica
## 3            7.7         2.6          6.9         2.3  virginica
## 4            7.7         2.8          6.7         2.0  virginica
## 5            7.7         3.0          6.1         2.3  virginica
## 6            7.6         3.0          6.6         2.1  virginica
## 7            7.4         2.8          6.1         1.9  virginica
## 8            7.3         2.9          6.3         1.8  virginica
## 9            7.2         3.6          6.1         2.5  virginica
## 10           7.2         3.2          6.0         1.8  virginica
## 11           7.2         3.0          5.8         1.6  virginica
## 12           7.1         3.0          5.9         2.1  virginica
## 13           7.0         3.2          4.7         1.4 versicolor
## 14           6.9         3.1          4.9         1.5 versicolor
## 15           6.9         3.2          5.7         2.3  virginica
## 16           6.9         3.1          5.4         2.1  virginica
## 17           6.9         3.1          5.1         2.3  virginica
## 18           6.8         2.8          4.8         1.4 versicolor
## 19           6.8         3.0          5.5         2.1  virginica
## 20           6.8         3.2          5.9         2.3  virginica
## 21           6.7         3.1          4.4         1.4 versicolor
## 22           6.7         3.0          5.0         1.7 versicolor
## 23           6.7         3.1          4.7         1.5 versicolor
## 24           6.7         2.5          5.8         1.8  virginica
## 25           6.7         3.3          5.7         2.1  virginica
## 26           6.7         3.1          5.6         2.4  virginica
## 27           6.7         3.3          5.7         2.5  virginica
## 28           6.7         3.0          5.2         2.3  virginica
## 29           6.6         2.9          4.6         1.3 versicolor
## 30           6.6         3.0          4.4         1.4 versicolor
## 31           6.5         2.8          4.6         1.5 versicolor
## 32           6.5         3.0          5.8         2.2  virginica
## 33           6.5         3.2          5.1         2.0  virginica
## 34           6.5         3.0          5.5         1.8  virginica
## 35           6.5         3.0          5.2         2.0  virginica
## 36           6.4         3.2          4.5         1.5 versicolor
## 37           6.4         2.9          4.3         1.3 versicolor
## 38           6.4         2.7          5.3         1.9  virginica
## 39           6.4         3.2          5.3         2.3  virginica
## 40           6.4         2.8          5.6         2.1  virginica
## [到达getOption("max.print") -- 略过110行]]

新增变量

mutate()可以更新或者新增数据框一列。

iris %>% 
    mutate(Sepal.Length = Sepal.Length * 10) # 将该列数值变成以mm为单位
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1             51         3.5          1.4         0.2     setosa
## 2             49         3.0          1.4         0.2     setosa
## 3             47         3.2          1.3         0.2     setosa
## 4             46         3.1          1.5         0.2     setosa
## 5             50         3.6          1.4         0.2     setosa
## 6             54         3.9          1.7         0.4     setosa
## 7             46         3.4          1.4         0.3     setosa
## 8             50         3.4          1.5         0.2     setosa
## 9             44         2.9          1.4         0.2     setosa
## 10            49         3.1          1.5         0.1     setosa
## 11            54         3.7          1.5         0.2     setosa
## 12            48         3.4          1.6         0.2     setosa
## 13            48         3.0          1.4         0.1     setosa
## 14            43         3.0          1.1         0.1     setosa
## 15            58         4.0          1.2         0.2     setosa
## 16            57         4.4          1.5         0.4     setosa
## 17            54         3.9          1.3         0.4     setosa
## 18            51         3.5          1.4         0.3     setosa
## 19            57         3.8          1.7         0.3     setosa
## 20            51         3.8          1.5         0.3     setosa
## 21            54         3.4          1.7         0.2     setosa
## 22            51         3.7          1.5         0.4     setosa
## 23            46         3.6          1.0         0.2     setosa
## 24            51         3.3          1.7         0.5     setosa
## 25            48         3.4          1.9         0.2     setosa
## 26            50         3.0          1.6         0.2     setosa
## 27            50         3.4          1.6         0.4     setosa
## 28            52         3.5          1.5         0.2     setosa
## 29            52         3.4          1.4         0.2     setosa
## 30            47         3.2          1.6         0.2     setosa
## 31            48         3.1          1.6         0.2     setosa
## 32            54         3.4          1.5         0.4     setosa
## 33            52         4.1          1.5         0.1     setosa
## 34            55         4.2          1.4         0.2     setosa
## 35            49         3.1          1.5         0.2     setosa
## 36            50         3.2          1.2         0.2     setosa
## 37            55         3.5          1.3         0.2     setosa
## 38            49         3.6          1.4         0.1     setosa
## 39            44         3.0          1.3         0.2     setosa
## 40            51         3.4          1.5         0.2     setosa
## [到达getOption("max.print") -- 略过110行]]
iris %>% 
    mutate(SLMn = Sepal.Length * 10) # 创建新的一列
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species SLMn
## 1            5.1         3.5          1.4         0.2     setosa   51
## 2            4.9         3.0          1.4         0.2     setosa   49
## 3            4.7         3.2          1.3         0.2     setosa   47
## 4            4.6         3.1          1.5         0.2     setosa   46
## 5            5.0         3.6          1.4         0.2     setosa   50
## 6            5.4         3.9          1.7         0.4     setosa   54
## 7            4.6         3.4          1.4         0.3     setosa   46
## 8            5.0         3.4          1.5         0.2     setosa   50
## 9            4.4         2.9          1.4         0.2     setosa   44
## 10           4.9         3.1          1.5         0.1     setosa   49
## 11           5.4         3.7          1.5         0.2     setosa   54
## 12           4.8         3.4          1.6         0.2     setosa   48
## 13           4.8         3.0          1.4         0.1     setosa   48
## 14           4.3         3.0          1.1         0.1     setosa   43
## 15           5.8         4.0          1.2         0.2     setosa   58
## 16           5.7         4.4          1.5         0.4     setosa   57
## 17           5.4         3.9          1.3         0.4     setosa   54
## 18           5.1         3.5          1.4         0.3     setosa   51
## 19           5.7         3.8          1.7         0.3     setosa   57
## 20           5.1         3.8          1.5         0.3     setosa   51
## 21           5.4         3.4          1.7         0.2     setosa   54
## 22           5.1         3.7          1.5         0.4     setosa   51
## 23           4.6         3.6          1.0         0.2     setosa   46
## 24           5.1         3.3          1.7         0.5     setosa   51
## 25           4.8         3.4          1.9         0.2     setosa   48
## 26           5.0         3.0          1.6         0.2     setosa   50
## 27           5.0         3.4          1.6         0.4     setosa   50
## 28           5.2         3.5          1.5         0.2     setosa   52
## 29           5.2         3.4          1.4         0.2     setosa   52
## 30           4.7         3.2          1.6         0.2     setosa   47
## 31           4.8         3.1          1.6         0.2     setosa   48
## 32           5.4         3.4          1.5         0.4     setosa   54
## 33           5.2         4.1          1.5         0.1     setosa   52
## [到达getOption("max.print") -- 略过117行]]

整合函数流:

iris %>% 
    filter(Species == "Virginica") %>% 
    mutate(SLMm = Sepal.Length) %>% 
    arrange(desc(SLMm))
## [1] Sepal.Length Sepal.Width  Petal.Length Petal.Width  Species     
## [6] SLMm        
## <0 行> (或0-长度的row.names)

汇总

summarize()函数可以让我们将很多变量汇总为单个的数据点。

iris %>% 
    summarize(medianSL = median(Sepal.Length))
##   medianSL
## 1      5.8
iris %>% 
    filter(Species == "virginica") %>% 
    summarize(medianSL=median(Sepal.Length))
##   medianSL
## 1      6.5

还可以一次性汇总多个变量

iris %>% 
    filter(Species == "virginica") %>% 
    summarize(medianSL = median(Sepal.Length),
              maxSL = max(Sepal.Length))
##   medianSL maxSL
## 1      6.5   7.9

group_by()可以让我们安装指定的组别进行汇总数据,而不是针对整个数据框

iris %>% 
    group_by(Species) %>% 
    summarize(medianSL = median(Sepal.Length),
              maxSL = max(Sepal.Length))
## # A tibble: 3 x 3
##   Species    medianSL maxSL
##   <fct>         <dbl> <dbl>
## 1 setosa          5     5.8
## 2 versicolor      5.9   7  
## 3 virginica       6.5   7.9

iris %>% 
    filter(Sepal.Length>6) %>% 
    group_by(Species) %>% 
    summarize(medianPL = median(Petal.Length), 
              maxPL = max(Petal.Length))
## # A tibble: 2 x 3
##   Species    medianPL maxPL
##   <fct>         <dbl> <dbl>
## 1 versicolor      4.6   5  
## 2 virginica       5.6   6.9

ggplot2

散点图

散点图可以帮助我们理解两个变量的数据关系,使用geom_point()可以绘制散点图:

iris_small <- iris %>% 
    filter(Sepal.Length > 5)

ggplot(iris_small, aes(x = Petal.Length,
                       y = Petal.Width)) + 
    geom_point()

额外的美学映射

  • 颜色
ggplot(iris_small, aes(x = Petal.Length,
                       y = Petal.Width,
                       color = Species)) + 
    geom_point()

  • 大小
ggplot(iris_small, aes(x = Petal.Length,
                       y = Petal.Width,
                       color = Species,
                       size = Sepal.Length)) + 
    geom_point()

  • 分面
ggplot(iris_small, aes(x = Petal.Length,
                       y = Petal.Width)) + 
    geom_point() + 
    facet_wrap(~Species)

线图

by_year <- gapminder %>% 
    group_by(year) %>% 
    summarize(medianGdpPerCap = median(gdpPercap))

ggplot(by_year, aes(x = year,
                    y = medianGdpPerCap)) +
    geom_line() + 
    expand_limits(y=0)

条形图

by_species <- iris %>%  
    filter(Sepal.Length > 6) %>% 
    group_by(Species) %>% 
    summarize(medianPL=median(Petal.Length))

ggplot(by_species, aes(x = Species, y=medianPL)) + 
    geom_col()

直方图

ggplot(iris_small, aes(x = Petal.Length)) + 
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

箱线图

ggplot(iris_small, aes(x=Species, y=Sepal.Length)) + 
    geom_boxplot()

资料来源:DataCamp