I suggest an approach based on two observations:
"Hong Kong, Special Administrative Region of China"
"Korea, Republic of"
"Venezuela?(Bolivarian Republic of)"
grep
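The basic idea is to split both the country names and the target text into words, find the matching words (i.e., partial matches between a country name and the target text), and finally append all country names that partially match the corresponding row of the target text.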
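The idea can be tried out on a few made-up rows first. This is only a minimal sketch: the toy `countries` and `texts` tables and the helper `split_words()` are assumptions for illustration, and base `strsplit()` stands in for the `stringr` word splitting.

```r
library(data.table)

# toy data, made up for illustration only
countries <- data.table(name = c("Norway", "South Africa", "United Kingdom"))
texts <- data.table(
  rn    = c("1", "2"),
  value = c("Aberdeen, UK and Oslo, Norway", "Cape Town, South Africa")
)

# hypothetical helper: split on runs of non-alphanumeric characters
split_words <- function(x) strsplit(x, "[^[:alnum:]]+")

# one row per (country name, word) and per (row id, word)
cw <- countries[, .(word = unlist(split_words(name))), by = name]
tw <- texts[, .(word = unlist(split_words(value))), by = rn]

# join on words: every text word is kept, unmatched words get name = NA
mw <- cw[tw, on = "word"]

# collapse the distinct matching country names per row id
res <- mw[, .(pmcn = paste(sort(unique(name)), collapse = "/")), by = rn]
print(res)
```

In this toy data, "UK" finds no match because only "United Kingdom" is in the list, while "South Africa" is matched by two words but reported once.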
`data.table` version 1.9.8 is used for performance reasons (please run `packageVersion("data.table")` to verify that you have installed the latest version from CRAN).
```r
library(data.table)
setDT(countries)
# split country names into words
# (boundary() is qualified with stringr:: because stringr is not attached)
countries <- countries[, stringr::str_split(name, stringr::boundary("word")), by = name]
print(countries)
#                name          V1
#   1:    Afghanistan Afghanistan
#   2:  Aland Islands       Aland
#   3:  Aland Islands     Islands
#   4:        Albania     Albania
#   5:        Algeria     Algeria
#  ---
# 406: Western Sahara     Western
# 407: Western Sahara      Sahara
# 408:          Yemen       Yemen
# 409:         Zambia      Zambia
# 410:       Zimbabwe    Zimbabwe

# show top ten entries after split
countries[, .N, by = V1][order(-N)][1:10]
#           V1  N
#  1:  Islands 17
#  2:       of 13
#  3:      and 12
#  4: Republic 12
#  5:    Saint  7
#  6:   United  5
#  7:   Island  4
#  8:    China  4
#  9:   Guinea  4
# 10:   French  4

# remove some unspecific entries
countries <- countries[!V1 %chin% c("and", "of")]
```
```r
# keep original row names as unique identifier for each record, verify uniqueness
setDT(df, keep.rownames = TRUE)
stopifnot(!df[, anyDuplicated(rn)])
# reshape from wide to long form to have all target text in one column
mdf <- melt(df, id.vars = "rn")
# split target text into words
# (boundary() is qualified with stringr:: because stringr is not attached)
mdf <- mdf[, stringr::str_split(value, stringr::boundary("word")),
           by = .(rn, variable, value)]
```
```r
# right join of mdf with countries on words, all rows of mdf are kept
mw <- countries[mdf, on = .(V1)]
```
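For each unique row id, the country names with a partial match are collapsed into one string, which can then be appended to the corresponding row of the original `df`. `unique()` ensures that a country name appears only once even if several of its words match (e.g., `"South Africa"` is matched by both `"South"` and `"Africa"`).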
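I decided to use a slash `"/"` as the separator for the collapsed entries because a slash does not seem to be used in any country name, whereas the comma `","` already is.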
```r
# collapse country names which have a partial match
ccn <- mw[, .(pmcn = paste(sort(unique(name)), collapse = "/")), by = rn]
# append collapsed list of partially matching country names to original df by join on row id
result <- ccn[df, on = "rn"]
# re-order columns to have new column as last column
setcolorder(result, c(names(df), "pmcn"))
# show statistics on matches
result[, .N, by = pmcn]
#                                                                     pmcn  N
# 1:                                                                       44
# 2:                                                                Norway  1
# 3:                                                                France  2
# 4:                                                          South Africa  1
# 5:                                  France/Holy See?(Vatican City State)  1
# 6: South Africa/South Georgia and the South Sandwich Islands/South Sudan  1
```
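The choice of separator can be sanity-checked against the country list. The following is a minimal sketch on three sample names rather than the full `countries` table; the vector names `sample_names` and `sep_candidates` are made up for illustration.

```r
# a few sample country names (the full list would be used in practice)
sample_names <- c("Korea, Republic of", "Norway", "South Africa")

# candidate separators to test
sep_candidates <- c(",", "/")

# TRUE means the candidate already occurs in at least one name
# and would therefore be ambiguous as a separator
in_use <- sapply(sep_candidates,
                 function(s) any(grepl(s, sample_names, fixed = TRUE)))
print(in_use)
```

A candidate reported as `TRUE` should be avoided; for these sample names that is the comma, not the slash.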
```r
# show one row where partially matching country names were found
result[pmcn != ""][1]
#        rn org_order organizations.api_path    type                             uuid
# 1: 110439    168297     organizations/saev Address b6332226dec612644868f7491a901f4a
#             name          street_1              street_2 postal_code
# 1: Aberdeen (UK) Abercrombie Court Arnhall Business Park    AB32 6FE
#                             city city_web_path region region_code2 region_web_path
# 1: Aberdeen, UK and Oslo, Norway
#    country country_code2 country_code3 country_web_path latitude longitude created_at
# 1:                                                                         1366007686
#    updated_at   pmcn
# 1: 1424818704 Norway
```
For this row, a match for `Norway` was found in the `city` column, but none for `UK`, because the list of country names contains `United Kingdom` instead.
```r
result[pmcn != ""][5]
#        rn org_order     organizations.api_path    type
# 1: 145391    214549 organizations/adn-contents Address
#                                uuid         name                          street_1
# 1: 76e1d4c1a1c56c5fe2b1c21c656a3a69 Headquarters 2 passage of Mulhouse City Hall
#    street_2 postal_code   city city_web_path region region_code2 region_web_path
# 1:                68100 France
#    country country_code2 country_code3 country_web_path latitude longitude created_at
# 1:                                                                         1422885253
#    updated_at                                 pmcn
# 1: 1422902286 France/Holy See?(Vatican City State)
```
For this row, a match for `France` was found in the `city` column, together with a false positive, `Vatican City State`, caused by the word `City` in column `street_1`.
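False positives of this kind could probably be reduced by extending the list of unspecific words that are removed from the split country names. The following is only a sketch of that idea and has not been run on the full data: the toy `countries` table and the extra entries in `stop_words` are assumptions.

```r
library(data.table)

# toy table in the shape produced by the word split (rows made up for illustration)
countries <- data.table(
  name = c("France", rep("Holy See (Vatican City State)", 5)),
  V1   = c("France", "Holy", "See", "Vatican", "City", "State")
)

# assumption: "City" and "State" are judged too unspecific,
# in addition to "and" and "of" which are removed above
stop_words <- c("and", "of", "City", "State")
countries <- countries[!V1 %chin% stop_words]
print(countries)
```

The trade-off is that a text mentioning only "City" no longer matches; the more specific word "Vatican" still does.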