Load the packages:

library(tidyverse)
#> ─ Attaching packages ─────────────────────────────────────────────────── tidyverse 1.2.1 ─
#> ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
#> ✔ tibble  1.4.2     ✔ dplyr   0.7.6
#> ✔ tidyr   0.8.1     ✔ stringr 1.3.1
#> ✔ readr   1.1.1     ✔ forcats 0.3.0
#> ─ Conflicts ──────────────────────────────────────────────────── tidyverse_conflicts() ─
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
library(stringr)
x = c("\"", "\\")

Show the raw contents of a string:

writeLines(x)
#> "
#> \

String length

str_length(c("a", "R for data science", NA))
#> [1]  1 18 NA

Combining strings

Combine two or more strings:

str_c("x", "y")
#> [1] "xy"
str_c("x", "y", "z")
#> [1] "xyz"

Control the separator with the sep argument:

str_c("x", "y", sep = ",")
#> [1] "x,y"

Missing values are contagious; to print them as "NA", use str_replace_na():

x = c("abc", NA)
str_c("|-", x, "-|")
#> [1] "|-abc-|" NA
str_c("|-", str_replace_na(x), "-|")
#> [1] "|-abc-|" "|-NA-|"

str_c() is vectorized:

str_c("prefix-", c("a", "b", "c"), "-suffix")
#> [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"

Use collapse to combine a vector of strings into a single string:

str_c(c("x", "y", "z"), collapse = ", ")
#> [1] "x, y, z"

Subsetting strings

x = c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
#> [1] "App" "Ban" "Pea"

Negative numbers count backwards from the end:

str_sub(x, -3, -1)
#> [1] "ple" "ana" "ear"

Note that if a string is too short, str_sub() still returns as many characters as possible:

str_sub("a", 1, 5)
#> [1] "a"

Use the assignment form of str_sub() to modify a string:

str_sub(x, 1, 1) = str_to_lower(str_sub(x, 1, 1))
x
#> [1] "apple"  "banana" "pear"

Locales

How strings are handled can vary from locale to locale. For example, with Turkish rules the uppercase form of i is İ:

str_to_upper(c("i", "l"))
#> [1] "I" "L"
str_to_upper(c("i", "l"), locale = "tr")
#> [1] "İ" "L"

Sorting is also affected by the locale:

x = c("apple", "eggplant", "banana")

str_sort(x, locale = "en")
#> [1] "apple"    "banana"   "eggplant"

str_sort(x, locale = "haw")
#> [1] "apple"    "eggplant" "banana"

Matching patterns with regular expressions

We can learn regular expressions with the str_view() and str_view_all() functions. Both take a character vector and a regular expression.

Basic matches

The simplest patterns match exact strings:

x = c("apple", "banana", "pear")
str_view(x, "an")

The next step up in complexity is ., which matches any character except a newline:

str_view(x, ".a.")

Anchors

  • ^ matches the start of the string
  • $ matches the end of the string
str_view(x, "^a")
str_view(x, "a$")
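
To force a regular expression to match a complete string, anchor it with both ^ and $. A minimal sketch (the example vector below is my own):

y = c("apple pie", "apple", "apple cake")
str_view(y, "apple")
str_view(y, "^apple$")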

Character classes and alternatives

Besides ., there are four other common character classes:

  • \d matches any digit
  • \s matches any whitespace character
  • [abc] matches a, b, or c
  • [^abc] matches anything except a, b, or c

Because the backslash itself must be escaped in an R string, you write \\d to match a digit, \\s to match whitespace, and so on.
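
A minimal sketch of these classes (the example strings are my own); note the doubled backslash in the R string:

str_view(c("abc123", "a b c"), "\\d")
str_view(c("abc123", "a b c"), "\\s")
str_view(c("grey", "gray"), "gr[ea]y")
str_view(c("grey", "groy"), "gr[^ea]y")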

You can use | to pick between alternative patterns; for example, abc|xyz matches abc or xyz. This operator has very low precedence.

str_view(c("grey", "gray"), "gr(e|a)y")

Repetition

These operators control how many times a pattern matches:

  • ? - 0 or 1 time
  • + - 1 or more times
  • * - 0 or more times
x = "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

str_view(x, "CC?")

str_view(x ,"CC+")

You can also specify the number of matches precisely:

  • {n} - exactly n times
  • {n,} - n or more times
  • {,m} - at most m times
  • {n, m} - between n and m times
str_view(x, "C{2}")

str_view(x, "C{2,}")

str_view(x, "C{2,3}")

By default these matches are greedy: the regular expression matches the longest possible string. Appending ? makes the match lazy, so it matches the shortest possible string instead.

str_view(x, "C{2,3}?")

str_view(x, "C[LX]+?")

Grouping and backreferences

Besides disambiguating complex expressions, parentheses also define groups, which you can refer to with backreferences such as \1 and \2.

str_view(fruit, "(..)\\1", match = TRUE)

Tools

Now we turn to a range of stringr functions that let you:

  • Determine which strings match a pattern
  • Find the positions of matches
  • Extract the content of matches
  • Replace matches with new values
  • Split strings based on a match

Detect matches

To determine whether a character vector matches a pattern, use str_detect():

x = c("apple", "banana", "pear")

str_detect(x, "e")
#> [1]  TRUE FALSE  TRUE

Because FALSE is 0 and TRUE is 1 in a numeric sense, functions such as sum() and mean() work on the result, which can be extremely useful:

sum(str_detect(words, "^t"))
#> [1] 65
mean(str_detect(words, "[aeiou]$"))
#> [1] 0.277

When the logical conditions get more complicated, it is often easier to combine several str_detect() calls with logical operators than to build a single regular expression.

For example, both of the following find all words that contain no vowels:

no_vowel_1 = !str_detect(words, "[aeiou]")

no_vowel_2 = str_detect(words, "^[^aeiou]+$")

identical(no_vowel_1, no_vowel_2)
#> [1] TRUE

The two methods give identical results, but the first is easier to understand.

A common use of str_detect() is to select the elements that match a pattern by taking a subset; the str_subset() wrapper performs both steps in one call:

words[str_detect(words, "x$")]
#> [1] "box" "sex" "six" "tax"

str_subset(words, "x$")
#> [1] "box" "sex" "six" "tax"

In practice, strings are usually a column of a data frame, so we can combine str_detect() with filter():

df = tibble(
    word = words,
    i = seq_along(words)
)

df %>% 
    filter(str_detect(word, "x$"))
#> # A tibble: 4 x 2
#>   word      i
#>   <chr> <int>
#> 1 box     108
#> 2 sex     747
#> 3 six     772
#> 4 tax     841

A variation on str_detect() is str_count(), which tells you how many matches there are in a string:

str_count(x, "a")
#> [1] 1 3 1

It is natural to use str_count() together with mutate():

df %>% 
    mutate(
        vowels = str_count(word, "[aeiou]"),
        consonants = str_count(word, "[^aeiou]")
    )
#> # A tibble: 980 x 4
#>    word         i vowels consonants
#>    <chr>    <int>  <int>      <int>
#>  1 a            1      1          0
#>  2 able         2      2          2
#>  3 about        3      3          2
#>  4 absolute     4      4          4
#>  5 accept       5      2          4
#>  6 account      6      3          4
#>  7 achieve      7      4          3
#>  8 across       8      2          4
#>  9 act          9      1          2
#> 10 active      10      3          3
#> # ... with 970 more rows

Note that matches never overlap; for example, in "abababa" the pattern "aba" matches only twice, not three times:

str_count("abababa", "aba")
#> [1] 2

str_view_all("abababa", "aba")

str_view_all() shows all the matches.

Extract matches

To extract the actual text of a match, use str_extract(). Here we use the Harvard sentences (from Wikipedia) as a more complex example:

length(sentences)
#> [1] 720

head(sentences)
#> [1] "The birch canoe slid on the smooth planks." 
#> [2] "Glue the sheet to the dark blue background."
#> [3] "It's easy to tell the depth of a well."     
#> [4] "These days a chicken leg is a rare dish."   
#> [5] "Rice is often served in round bowls."       
#> [6] "The juice of lemons makes fine punch."

Suppose we want to find all sentences that contain a colour. We first create a vector of colour names and then turn it into a single regular expression:

colors = c(
    "red", "orange", "yellow", "green", "blue", "purple"
)

color_match = str_c(colors, collapse = "|")

color_match
#> [1] "red|orange|yellow|green|blue|purple"

Now we can select the sentences that contain a colour and then extract the colour:

has_color = str_subset(sentences, color_match)

matches = str_extract(has_color, color_match)

head(matches)
#> [1] "blue" "blue" "red"  "red"  "red"  "blue"

Note that str_extract() only extracts the first match. Selecting the sentences with more than one match makes it easier to see all of them:

more = sentences[str_count(sentences, color_match) > 1]
str_view_all(more, color_match)

str_extract(more, color_match)
#> [1] "blue"   "green"  "orange"

This is a common pattern for stringr functions: working with a single match allows simpler data structures. To get all matches, use str_extract_all(), which returns a list:

str_extract_all(more, color_match)
#> [[1]]
#> [1] "blue" "red" 
#> 
#> [[2]]
#> [1] "green" "red"  
#> 
#> [[3]]
#> [1] "orange" "red"

If you set simplify = TRUE, the result is a matrix in which shorter matches are expanded to the same length as the longest:

str_extract_all(more, color_match, simplify = TRUE)
#>      [,1]     [,2] 
#> [1,] "blue"   "red"
#> [2,] "green"  "red"
#> [3,] "orange" "red"
x = c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
#>      [,1] [,2] [,3]
#> [1,] "a"  ""   ""  
#> [2,] "a"  "b"  ""  
#> [3,] "a"  "b"  "c"

Grouped matches

In a regular expression, parentheses not only clarify precedence but also define groups, which can be referred to with backreferences during matching. We can therefore also use them to extract parts of a complex match.

For example, suppose we want to extract nouns from the sentences. As a heuristic, we look for any word that comes after a or the. Defining a "word" with a regular expression is a little tricky, so we use a simple approximation: a sequence of at least one character that isn't a space.

noun = "(a|the) ([^ ]+)"

has_noun = sentences %>% 
    str_subset(noun) %>% 
    head(10)

has_noun %>% 
    str_extract(noun)
#>  [1] "the smooth" "the sheet"  "the depth"  "a chicken"  "the parked"
#>  [6] "the sun"    "the huge"   "the ball"   "the woman"  "a helps"

str_extract() gives the complete match; str_match() gives each individual group. Instead of a character vector, the latter returns a matrix, with one column for the complete match followed by one column for each group:

has_noun %>% 
    str_match(noun)
#>       [,1]         [,2]  [,3]     
#>  [1,] "the smooth" "the" "smooth" 
#>  [2,] "the sheet"  "the" "sheet"  
#>  [3,] "the depth"  "the" "depth"  
#>  [4,] "a chicken"  "a"   "chicken"
#>  [5,] "the parked" "the" "parked" 
#>  [6,] "the sun"    "the" "sun"    
#>  [7,] "the huge"   "the" "huge"   
#>  [8,] "the ball"   "the" "ball"   
#>  [9,] "the woman"  "the" "woman"  
#> [10,] "a helps"    "a"   "helps"

Unsurprisingly, this heuristic for detecting nouns isn't great: it also picks up adjectives such as smooth and parked.

If your data is in a tibble, it is often easier to use tidyr::extract(). It works like str_match(), but you supply a name for each group and they become new columns in the result:

tibble(sentences = sentences) %>% 
    tidyr::extract(
        sentences, c("article", "noun"), "(a|the) ([^ ]+)",
        remove = FALSE
    )
#> # A tibble: 720 x 3
#>    sentences                                   article noun   
#>    <chr>                                       <chr>   <chr>  
#>  1 The birch canoe slid on the smooth planks.  the     smooth 
#>  2 Glue the sheet to the dark blue background. the     sheet  
#>  3 It's easy to tell the depth of a well.      the     depth  
#>  4 These days a chicken leg is a rare dish.    a       chicken
#>  5 Rice is often served in round bowls.        <NA>    <NA>   
#>  6 The juice of lemons makes fine punch.       <NA>    <NA>   
#>  7 The box was thrown beside the parked truck. the     parked 
#>  8 The hogs were fed chopped corn and garbage. <NA>    <NA>   
#>  9 Four hours of steady work faced us.         <NA>    <NA>   
#> 10 Large size in stockings is hard to sell.    <NA>    <NA>   
#> # ... with 710 more rows

Like str_extract(), if you want all matches for each string you need str_match_all().
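
A minimal sketch (output omitted) reusing the noun pattern defined above; str_match_all() returns a list with one match matrix per input string:

has_noun %>% 
    head(3) %>% 
    str_match_all(noun)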

Replacing matches

str_replace() and str_replace_all() let you replace matches with new strings. The simplest use is to replace a pattern with a fixed string:

x = c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
#> [1] "-pple"  "p-ar"   "b-nana"
str_replace_all(x, "[aeiou]", "-")
#> [1] "-ppl-"  "p--r"   "b-n-n-"

By supplying a named vector to str_replace_all() we can perform multiple replacements at once:

x = c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
#> [1] "one house"    "two cars"     "three people"

Instead of a fixed string, we can use backreferences to insert components of the match. In the code below we flip the order of the second and third words:

sentences %>% head(5)
#> [1] "The birch canoe slid on the smooth planks." 
#> [2] "Glue the sheet to the dark blue background."
#> [3] "It's easy to tell the depth of a well."     
#> [4] "These days a chicken leg is a rare dish."   
#> [5] "Rice is often served in round bowls."
sentences %>% 
    str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% 
    head(5)
#> [1] "The canoe birch slid on the smooth planks." 
#> [2] "Glue sheet the to the dark blue background."
#> [3] "It's to easy tell the depth of a well."     
#> [4] "These a days chicken leg is a rare dish."   
#> [5] "Rice often is served in round bowls."

Splitting

Use str_split() to split a string up into pieces, for example to split sentences into words:

sentences %>% 
    head(5) %>% 
    str_split(" ")
#> [[1]]
#> [1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
#> [8] "planks."
#> 
#> [[2]]
#> [1] "Glue"        "the"         "sheet"       "to"          "the"        
#> [6] "dark"        "blue"        "background."
#> 
#> [[3]]
#> [1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."
#> 
#> [[4]]
#> [1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"      
#> [8] "rare"    "dish."  
#> 
#> [[5]]
#> [1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls."

Because each sentence contains a different number of words, the result is a list. To return a matrix instead, set simplify = TRUE:

sentences %>% 
    head(5) %>% 
    str_split(" ", simplify = TRUE)
#>      [,1]    [,2]    [,3]    [,4]      [,5]  [,6]    [,7]    
#> [1,] "The"   "birch" "canoe" "slid"    "on"  "the"   "smooth"
#> [2,] "Glue"  "the"   "sheet" "to"      "the" "dark"  "blue"  
#> [3,] "It's"  "easy"  "to"    "tell"    "the" "depth" "of"    
#> [4,] "These" "days"  "a"     "chicken" "leg" "is"    "a"     
#> [5,] "Rice"  "is"    "often" "served"  "in"  "round" "bowls."
#>      [,8]          [,9]   
#> [1,] "planks."     ""     
#> [2,] "background." ""     
#> [3,] "a"           "well."
#> [4,] "rare"        "dish."
#> [5,] ""            ""

We can also set the maximum number of pieces:

fields = c("Name:诗翔", "Country:CN", "Age:24")
fields %>% str_split(":", n = 2, simplify = TRUE)
#>      [,1]      [,2]  
#> [1,] "Name"    "诗翔"
#> [2,] "Country" "CN"  
#> [3,] "Age"     "24"

Instead of a pattern, we can also split by character, line, sentence, or word boundaries using boundary():

x = "This is a sentence. This is another sentence."
str_view_all(x, boundary("word"))
str_split(x, " ")[[1]]
#> [1] "This"      "is"        "a"         "sentence." "This"      "is"       
#> [7] "another"   "sentence."

str_split(x, boundary("word"))[[1]]
#> [1] "This"     "is"       "a"        "sentence" "This"     "is"      
#> [7] "another"  "sentence"

Find matches

str_locate() and str_locate_all() give the start and end position of each match. They are particularly useful when none of the other functions does exactly what you want: we can use str_locate() to find the matching pattern and then use str_sub() to extract or modify it.
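
A minimal sketch of that combination, reusing the small fruit vector from earlier:

x = c("apple", "banana", "pear")
loc = str_locate(x, "an")
loc
#>      start end
#> [1,]    NA  NA
#> [2,]     2   3
#> [3,]    NA  NA
# feed the start/end columns to str_sub(); non-matches should come back as NA
str_sub(x, loc[, "start"], loc[, "end"])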

Other types of pattern

When you use a string as a pattern, R automatically wraps it in a call to regex():

# the regular call:
str_view(fruit, "nana")
# is shorthand for:
str_view(fruit, regex("nana"))

We can therefore control the details of the match by setting the other arguments of regex() (see the sketch after this list).

  • ignore_case = TRUE allows characters to match either their uppercase or lowercase forms
  • multiline = TRUE allows ^ and $ to match the start and end of each line rather than the start and end of the complete string
  • comments = TRUE lets you add comments and whitespace to make a complex regular expression easier to understand; whitespace and anything after # is ignored, so to match a literal space you need to escape it as "\\ "

    phone = regex("
                  \\(?          # optional opening parenthesis
                  (\\d{3})      # area code
                  [)- ]?        # optional closing parenthesis, dash, or space
                  (\\d{3})      # another three digits
                  [ -]?         # optional space or dash
                  (\\d{3})      # another three digits
                  ", comments = TRUE)
    str_match("514-791-8141", phone)
    #>      [,1]          [,2]  [,3]  [,4] 
    #> [1,] "514-791-814" "514" "791" "814"
  • dotall = TRUE allows . to match everything, including \n
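
A minimal sketch of the first two arguments (the example strings are my own):

str_detect("R FOR DATA SCIENCE", regex("data", ignore_case = TRUE))
#> [1] TRUE
str_extract_all("Line 1\nLine 2\nLine 3", regex("^Line", multiline = TRUE))[[1]]
#> [1] "Line" "Line" "Line"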

Besides regex(), there are three other functions we can use in its place:

  • fixed() matches the exact specified sequence of bytes. It ignores all special regular-expression characters and operates at a very low level, which lets us avoid complex escaping and is also much faster. In the simple benchmark below it is roughly 3x faster than an ordinary regular expression (a small escaping sketch follows the benchmark).
microbenchmark::microbenchmark(
    fixed = str_detect(sentences, fixed("the")),
    regex = str_detect(sentences, "the"),
    times = 20
)
#> Unit: microseconds
#>   expr   min  lq mean median  uq max neval
#>  fixed  99.7 103  122    106 120 316    20
#>  regex 347.8 369  388    384 397 513    20
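
A small sketch of the "no escaping needed" point: as a regular expression, . matches any character, while fixed() treats it as a literal dot (the example strings are my own):

str_detect(c("a.b", "aXb"), "a.b")
#> [1] TRUE TRUE
str_detect(c("a.b", "aXb"), fixed("a.b"))
#> [1]  TRUE FALSE
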
  • coll() compares strings using standard collation rules, which is useful for case-insensitive matching but relatively slow. Note that we can set the locale argument of coll() to decide which rules are used to compare characters; these rules differ around the world. We can check the default locale with the code below (a case-insensitive sketch follows it):
stringi::stri_locale_info()
#> $Language
#> [1] "zh"
#> 
#> $Country
#> [1] "CN"
#> 
#> $Variant
#> [1] ""
#> 
#> $Name
#> [1] "zh_CN"
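
A minimal sketch of case-insensitive matching with coll(); the input vector and locale choice are my own, echoing the Turkish dotted/dotless i discussed earlier:

i = c("I", "İ", "i", "ı")
# with the default locale, only the ASCII pair should match
str_subset(i, coll("i", ignore_case = TRUE))
# with Turkish rules, İ is treated as the uppercase form of i
str_subset(i, coll("i", ignore_case = TRUE, locale = "tr"))
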
  • boundary() matches boundaries; we can also use it with the other string functions:
x = "This is  a sentence"
str_view_all(x, boundary("word"))

str_extract_all(x, boundary("word"))
#> [[1]]
#> [1] "This"     "is"       "a"        "sentence"

Other uses of regular expressions

There are two useful base R functions that also use regular expressions:

  • apropos() searches all objects available from the global environment, which is handy when you can't quite remember a function's name:
apropos("replace")
#> [1] "%+replace%"       "replace"          "replace_na"      
#> [4] "setReplaceMethod" "str_replace"      "str_replace_all" 
#> [7] "str_replace_na"   "theme_replace"
  • dir() lists all the files in a directory; its pattern argument accepts a regular expression:
head(dir(pattern = "\\.Rmd$"))
#> [1] "2015-07-23-r-rmarkdown.Rmd"                   
#> [2] "2018-03-21-data-transformation-with-dplyr.Rmd"
#> [3] "2018-03-22-combine-tech-with-art.Rmd"         
#> [4] "2018-03-23-test_poem.Rmd"                     
#> [5] "2018-04-13-Pubmed_trend_for_report.Rmd"       
#> [6] "2018-04-23-our-city-here.Rmd"

stringi

stringr is built on top of stringi. stringr is easier to learn (the book calls it very easy; personally I don't entirely agree): it provides only a small, carefully chosen set of functions that cover the most common string operations. stringi, by contrast, aims to be comprehensive and contains almost every function you could need, 234 in total.

Moving from stringr to stringi is straightforward: the corresponding functions simply swap the str_ prefix for stri_.
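
A small sketch of that transition, assuming stringi is installed (stringr depends on it):

str_length("R for data science")
#> [1] 18
stringi::stri_length("R for data science")
#> [1] 18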

Session info

sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.5
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] zh_CN.UTF-8/zh_CN.UTF-8/zh_CN.UTF-8/C/zh_CN.UTF-8/zh_CN.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] bindrcpp_0.2.2  forcats_0.3.0   stringr_1.3.1   dplyr_0.7.6    
#>  [5] purrr_0.2.5     readr_1.1.1     tidyr_0.8.1     tibble_1.4.2   
#>  [9] ggplot2_3.0.0   tidyverse_1.2.1 pacman_0.4.6   
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_0.2.4     xfun_0.3             haven_1.1.2         
#>  [4] lattice_0.20-35      colorspace_1.3-2     htmltools_0.3.6     
#>  [7] yaml_2.2.0           utf8_1.1.4           rlang_0.2.2         
#> [10] pillar_1.3.0         glue_1.3.0           withr_2.1.2         
#> [13] modelr_0.1.2         readxl_1.1.0         bindr_0.1.1         
#> [16] plyr_1.8.4           munsell_0.5.0        blogdown_0.8        
#> [19] gtable_0.2.0         cellranger_1.1.0     rvest_0.3.2         
#> [22] htmlwidgets_1.2      evaluate_0.11        knitr_1.20          
#> [25] fansi_0.3.0          broom_0.5.0          Rcpp_0.12.18        
#> [28] scales_1.0.0         backports_1.1.2      jsonlite_1.5        
#> [31] microbenchmark_1.4-4 hms_0.4.2            digest_0.6.15       
#> [34] stringi_1.2.4        bookdown_0.7         grid_3.5.1          
#> [37] rprojroot_1.3-2      cli_1.0.0            tools_3.5.1         
#> [40] magrittr_1.5         lazyeval_0.2.1       crayon_1.3.4        
#> [43] pkgconfig_2.0.2      xml2_1.2.0           lubridate_1.7.4     
#> [46] assertthat_0.2.0     rmarkdown_1.10       httr_1.3.1          
#> [49] rstudioapi_0.7       R6_2.2.2             nlme_3.1-137        
#> [52] compiler_3.5.1