r - Display grep pattern and result side by side

<div class="post-text" itemprop="text">
  <P>
    I suggest an approach that is based on two observations:
  </p>
  <OL>
    <LI>
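      The country names include entries such as
       <code>
        "Hong Kong, Special Administrative Region of China"
      </code>
      ,
       <code>
        "Korea, Republic of"
      </code>
      , or
       <code>
        "Venezuela (Bolivarian Republic of)"
      </code>
       which will rarely produce a match when used as a pattern in
       <code>
        grep
      </code>
      . Either the list has to be cleaned up beforehand, or partial matching is required.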
    </LI>
    <LI>
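      The OP commented: "
      <EM>
        At this stage, I need to comb through all the geographical data for possibly misplaced country data (country data is missing in these records). This is a first pass, and I may have to sort out false positives manually.
      </EM>
      "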
    </LI>
  </OL>
  <P>
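    The basic idea is to split both the country names and the target text into words, find the matching words (i.e., partial matches of country names with the target text), and finally append
    <EM>
      all
    </EM>
     country names that partially match to the respective row of the target text.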
  </p>
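  <P>
    To make the idea concrete, here is a minimal base R sketch on toy data. The helper names
     <code>
      split_words()
    </code>
     and
     <code>
      partial_matches()
    </code>
     as well as the sample strings are made up for illustration. They are not part of the solution below, which uses
     <code>
      data.table
    </code>
     for the same purpose.
  </p>
   <pre>
    <code>
      # toy data, for illustration only
country_names <- c("Korea, Republic of", "Norway", "South Africa")

# a full country name used as a fixed pattern misses an obvious hit
grepl("Korea, Republic of", "Seoul, Korea", fixed = TRUE)
# [1] FALSE

# split a string into words on anything that is not a letter, drop empty pieces
split_words <- function(x) {
  w <- unlist(strsplit(x, "[^[:alpha:]]+"))
  w[nzchar(w)]
}

# a country name matches partially if it shares at least one word with the text
partial_matches <- function(text, cn) {
  tw <- split_words(text)
  hit <- vapply(cn, function(n) any(split_words(n) %in% tw), logical(1))
  cn[hit]
}

partial_matches("Seoul, Korea", country_names)
# [1] "Korea, Republic of"
partial_matches("Aberdeen, UK and Oslo, Norway", country_names)
# [1] "Norway"

</code>
  </pre>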
  <H2>
    Prepare country names
  </H2>
  <P>
     <code>
      data.table
    </code>
     version 1.9.8 is used for performance reasons (please run
     <code>
      packageVersion("data.table")
    </code>
     to verify that you have installed the latest version from CRAN).
  </p>
   <pre>
    <code>
      library(data.table)
setDT(countries)

# split country names into words
countries <- countries[, stringr::str_split(name, stringr::boundary("word")), by = name]

print(countries)
#               name          V1
#  1:    Afghanistan Afghanistan
#  2:  Aland Islands       Aland
#  3:  Aland Islands     Islands
#  4:        Albania     Albania
#  5:        Algeria     Algeria
# ---                           
#406: Western Sahara     Western
#407: Western Sahara      Sahara
#408:          Yemen       Yemen
#409:         Zambia      Zambia
#410:       Zimbabwe    Zimbabwe

# show top ten entries after split 
countries[, .N, by = V1][order(-N)][1:10]
#          V1  N
# 1:  Islands 17
# 2:       of 13
# 3:      and 12
# 4: Republic 12
# 5:    Saint  7
# 6:   United  5
# 7:   Island  4
# 8:    China  4
# 9:   Guinea  4
#10:   French  4

# remove some unspecific entries
countries <- countries[!V1 %chin% c("and", "of")]

</code>
  </pre>
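  <P>
    As a quick check of what the word split does with punctuation, the following sketch applies the same
     <code>
      stringr
    </code>
     call to two of the awkward names mentioned above. The table
     <code>
      toy
    </code>
     is a hypothetical stand-in for
     <code>
      countries
    </code>
    .
  </p>
   <pre>
    <code>
      library(data.table)
library(stringr)

# word boundaries drop commas, parentheses and white space
str_split("Venezuela (Bolivarian Republic of)", boundary("word"))
# [[1]]
# [1] "Venezuela"  "Bolivarian" "Republic"   "of"

# hypothetical stand-in for the countries table
toy <- data.table(name = c("Korea, Republic of", "Norway"))

# one row per word, the full name is repeated for each of its words
toy_words <- toy[, .(V1 = unlist(str_split(name, boundary("word")))), by = name]

# dropping the unspecific words leaves "Korea", "Republic" and "Norway"
toy_words <- toy_words[!V1 %chin% c("and", "of")]
toy_words$V1
# [1] "Korea"    "Republic" "Norway"

</code>
  </pre>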
  <H2>
    Prepare target text
  </H2>
   <pre>
    <code>
      # keep original row names as unique identifier for each record, verify uniqueness
setDT(df, keep.rownames = TRUE)
stopifnot(!df[, anyDuplicated(rn)])

# reshape from wide to long form to have all target text in one column
mdf <- melt(df, id.vars = "rn")

# split target text into words
mdf <- mdf[, stringr::str_split(value, stringr::boundary("word")), 
           by = .(rn, variable, value)]

</code>
  </pre>
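  <P>
    The reshaping step can be tried on a small example. The data frame
     <code>
      toy_df
    </code>
     below is invented and has only two character columns, whereas the real
     <code>
      df
    </code>
     has many more columns, some of them presumably numeric, so
     <code>
      melt()
    </code>
     may warn there about coercing columns of different types.
  </p>
   <pre>
    <code>
      library(data.table)

# invented two-column stand-in for df, row names act as record ids
toy_df <- data.frame(
  name = c("Aberdeen (UK)", "Headquarters"),
  city = c("Aberdeen, UK and Oslo, Norway", "France"),
  stringsAsFactors = FALSE,
  row.names = c("110439", "145391")
)

# convert by reference, row names are kept in the new column "rn"
setDT(toy_df, keep.rownames = TRUE)

# long form: one row per record and column, all text in column "value"
toy_long <- melt(toy_df, id.vars = "rn")
names(toy_long)
# [1] "rn"       "variable" "value"
nrow(toy_long)
# [1] 4

</code>
  </pre>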
  <H2>
    Find matching words = partial matches of country names with the target text
  </H2>
   <pre>
    <code>
      # right join of mdf with countries on words, all rows of mdf are kept 
mw <- countries[mdf, on = .(V1)]

</code>
  </pre>
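  <P>
    The join semantics may be easier to see on a toy example. In
     <code>
      X[Y, on = ...]
    </code>
     every row of
     <code>
      Y
    </code>
     is kept, and the columns of
     <code>
      X
    </code>
     are filled with
     <code>
      NA
    </code>
     where no match exists. The tables
     <code>
      cw
    </code>
     and
     <code>
      tw
    </code>
     are hypothetical stand-ins for
     <code>
      countries
    </code>
     and
     <code>
      mdf
    </code>
    .
  </p>
   <pre>
    <code>
      library(data.table)

# stand-in for the split country names
cw <- data.table(name = c("Norway", "South Africa", "South Africa"),
                 V1   = c("Norway", "South", "Africa"))
# stand-in for the split target text
tw <- data.table(rn = c("1", "1", "2"),
                 V1 = c("Oslo", "Norway", "Paris"))

# all three rows of tw survive, only the word "Norway" finds a country name
joined <- cw[tw, on = .(V1)]
joined$name
# [1] NA       "Norway" NA

</code>
  </pre>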
  <H2>
    Prepare result
  </H2>
  <P>
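    For each unique row id, the country names with a partial match are collapsed into one string, which can be appended to the respective row of the original
     <code>
      df
    </code>
    .
     <code>
      unique()
    </code>
     ensures that a country name appears only once even if there are multiple partial matches (e.g.,
     <code>
      "South Africa"
    </code>
     may be matched by both words
     <code>
      "South"
    </code>
     and
     <code>
      "Africa"
    </code>
    ).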
  </p>
  <P>
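    I decided to use the slash
     <code>
      "/"
    </code>
     as the character to separate the collapsed entries because a slash does not seem to be used in country names, whereas the comma
     <code>
      ","
    </code>
     already is.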
  </p>
   <pre>
    <code>
      # collapse country names which have a partial match
ccn <- mw[, .(pmcn = paste(sort(unique(name)), collapse = "/")), by = rn]

# append collapsed list of partially matching country names to original df by join on row id
result <- ccn[df, on = "rn"]
# re-order columns to have new column as last column
setcolorder(result, c(names(df), "pmcn"))

# show statistics on matches
result[, .N, by = pmcn]
#                                                                    pmcn  N
#1:                                                                       44
#2:                                                                Norway  1
#3:                                                                France  2
#4:                                                          South Africa  1
#5:                                  France/Holy See (Vatican City State)  1
#6: South Africa/South Georgia and the South Sandwich Islands/South Sudan  1

</code>
  </pre>
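  <P>
    The collapsing expression can be checked in isolation. The vector
     <code>
      hits
    </code>
     is made up. The second call shows why rows without any match end up with an empty string in
     <code>
      pmcn
    </code>
    :
     <code>
      sort()
    </code>
     drops
     <code>
      NA
    </code>
     by default.
  </p>
   <pre>
    <code>
      # made-up matches for one row, "South Africa" was hit by two words
hits <- c("South Sudan", "South Africa", "South Africa")
collapsed <- paste(sort(unique(hits)), collapse = "/")
collapsed
# [1] "South Africa/South Sudan"

# a row without any match carries only NA, which sort() removes
no_match <- paste(sort(unique(NA_character_)), collapse = "/")
no_match
# [1] ""

</code>
  </pre>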
  <H2>
    Discussion of sample results
  </H2>
   <pre>
    <code>
      # show one row where partially matching country names were found
result[pmcn != ""][1]
#       rn org_order organizations.api_path    type                             uuid
#1: 110439    168297     organizations/saev Address b6332226dec612644868f7491a901f4a
#            name          street_1              street_2 postal_code
#1: Aberdeen (UK) Abercrombie Court Arnhall Business Park    AB32 6FE
#                            city city_web_path region region_code2 region_web_path
#1: Aberdeen, UK and Oslo, Norway                                                  
#   country country_code2 country_code3 country_web_path latitude longitude created_at
#1:                                                                         1366007686
#   updated_at   pmcn
#1: 1424818704 Norway

</code>
  </pre>
  <P>
    For this row, a match was found for
     <code>
      Norway
    </code>
     from the
     <code>
      city
    </code>
     column but not for
     <code>
      UK
    </code>
     because the list of country names only contains
     <code>
      United Kingdom
    </code>
    .
  </p>
   <pre>
    <code>
      result[pmcn != ""][5]
#       rn org_order     organizations.api_path    type
#1: 145391    214549 organizations/adn-contents Address
#                               uuid         name                        street_1
#1: 76e1d4c1a1c56c5fe2b1c21c656a3a69 Headquarters 2 passage of Mulhouse City Hall
#   street_2 postal_code   city city_web_path region region_code2 region_web_path
#1:                68100 France                                                  
#   country country_code2 country_code3 country_web_path latitude longitude created_at
#1:                                                                         1422885253
#   updated_at                                 pmcn
#1: 1422902286 France/Holy See (Vatican City State)

</code>
  </pre>
  <P>
    For this row, a match was found for
     <code>
      France
    </code>
     from the
     <code>
      city
    </code>
     column, along with a false positive for
     <code>
      Vatican City State
    </code>
     due to the word
     <code>
      City
    </code>
     in column
     <code>
      street_1
    </code>
    .
  </p>
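  <P>
    Both effects could probably be reduced by tuning the word list before the join. The following is only a sketch on invented data. Which words count as too generic (
     <code>
      generic_words
    </code>
    ) and which aliases to add (here
     <code>
      "UK"
    </code>
    ) are assumptions that would need to be checked against the real data.
  </p>
   <pre>
    <code>
      library(data.table)

# invented excerpt of the word table, same shape as countries after the split
cw <- data.table(
  name = c(rep("Holy See (Vatican City State)", 5), rep("United Kingdom", 2)),
  V1   = c("Holy", "See", "Vatican", "City", "State", "United", "Kingdom")
)

# assumption: these words cause more false positives than true hits
generic_words <- c("City", "State")
cw <- cw[!V1 %chin% generic_words]

# assumption: "UK" is a useful alias for "United Kingdom"
cw <- rbind(cw, data.table(name = "United Kingdom", V1 = "UK"))

# one word from each of the two sample rows discussed above
tw <- data.table(rn = c("110439", "145391"), V1 = c("UK", "City"))
tuned <- cw[tw, on = .(V1)]
tuned$name
# [1] "United Kingdom" NA

</code>
  </pre>
  <P>
    The trade-off is that every removed word also removes true matches that relied on it, so the list of generic words should stay short.
  </p>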
</DIV>