I have a dataset like this:

    id  date        customer_id
    1   02/03/2018  undefined
    1   04/23/2018  12
    1   05/22/2018  12
    1   06/25/2018  undefined
    2   01/14/2017  undefined
    2   02/23/2018  ……
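You can try:

    library(dplyr)

    df %>%
      group_by(id) %>%
      # keep a group only if its 'undefined' rows are consecutive
      # and its last row is not 'undefined'
      filter(all(diff(row_number()[customer_id == 'undefined']) == 1) &
               customer_id[n()] != 'undefined')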
Output:

    # A tibble: 4 x 3
    # Groups:   id [1]
         id date       customer_id
      <int> <fct>      <fct>
    1     2 01/14/2017 undefined
    2     2 02/23/2018 undefined
    3     2 03/04/2018 23
    4     2 04/04/2018 23
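This code assumes your data frame is already sorted by date. Otherwise:

    df %>%
      # sort chronologically; the date column itself is left unchanged
      arrange(as.Date(date, "%m/%d/%Y")) %>%
      group_by(id) %>%
      filter(all(diff(row_number()[customer_id == 'undefined']) == 1) &
               customer_id[n()] != 'undefined')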
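Basically, for each group we check whether the differences between the row numbers of the undefined cases are always 1 (i.e. they are consecutive) and whether the last value is not undefined. The groups that satisfy both conditions are kept (id 2 in your case).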
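To see the two conditions at work outside of dplyr, here is a minimal base R sketch using the customer_id values of both groups from the example. The helper name keep_group is hypothetical and introduced only for illustration; which() stands in for the row_number() subsetting used above.

```r
# Hypothetical helper (not part of the answer above): TRUE when all
# "undefined" entries are consecutive and the last entry is defined.
keep_group <- function(customer_id) {
  undefined_pos <- which(customer_id == "undefined")
  all(diff(undefined_pos) == 1) &&
    customer_id[length(customer_id)] != "undefined"
}

group1 <- c("undefined", "12", "12", "undefined")  # id 1
group2 <- c("undefined", "undefined", "23", "23")  # id 2

diff(which(group1 == "undefined"))  # positions 1 and 4, so the gap is 3
diff(which(group2 == "undefined"))  # positions 1 and 2, so the gap is 1

keep_group(group1)  # FALSE
keep_group(group2)  # TRUE
```

Note that a group without any undefined value also passes, because all() applied to an empty vector is TRUE.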
You can achieve this by checking whether the run-length encoding of the customer_id column has length 1 or 2, which means customer_id is either never undefined or undefined only in a single consecutive run (possibly spanning several rows):

    library(dplyr)

    data <- read.table(text = "id date customer_id
    1 02/03/2018 undefined
    1 04/23/2018 12
    1 05/22/2018 12
    1 06/25/2018 undefined
    2 01/14/2017 undefined
    2 02/23/2018 undefined
    2 03/04/2018 23
    2 04/04/2018 23", header = TRUE, stringsAsFactors = FALSE)

    data$date <- as.Date(data$date, "%m/%d/%Y")
    # "undefined" is not a number, so it becomes NA (R warns about the coercion)
    data$customer_id <- as.integer(data$customer_id)

    data %>%
      dplyr::group_by(id) %>%
      dplyr::arrange(date, .by_group = TRUE) %>%
      # at most two runs of NA / non-NA values, and the last value is not NA
      dplyr::filter(length(rle(is.na(customer_id))$values) < 3 &&
                      !is.na(tail(customer_id, 1)))

    # A tibble: 4 x 3
    # Groups:   id [1]
         id date       customer_id
      <int> <date>           <int>
    1     2 2017-01-14          NA
    2     2 2018-02-23          NA
    3     2 2018-03-04          23
    4     2 2018-04-04          23
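Note that you also have to make sure the last entry of each group is valid; otherwise a group that goes from a valid ID to undefined would pass the test.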
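The following base R sketch illustrates why the run count alone is not enough. It applies rle() to logical vectors that mark undefined entries in date order. The third vector is a made-up group that is not in the question's data, and the helper name passes is hypothetical.

```r
# TRUE marks an undefined (NA) customer_id, in date order.
group1 <- c(TRUE, FALSE, FALSE, TRUE)  # id 1: undefined, 12, 12, undefined
group2 <- c(TRUE, TRUE, FALSE, FALSE)  # id 2: undefined, undefined, 23, 23
group3 <- c(FALSE, FALSE, TRUE)        # hypothetical: valid, valid, undefined

length(rle(group1)$values)  # 3 runs
length(rle(group2)$values)  # 2 runs
length(rle(group3)$values)  # 2 runs, yet the last entry is undefined

# Hypothetical helper combining the run count with the last-entry check.
passes <- function(is_undefined) {
  length(rle(is_undefined)$values) < 3 && !tail(is_undefined, 1)
}

passes(group1)  # FALSE: three runs
passes(group2)  # TRUE
passes(group3)  # FALSE: rejected only by the last-entry check
```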