编辑:好吧,我想我明白了!
数据:
df <- data.frame( "Reporter" = c("USA", "USA", "USA", "USA", "USA", "USA","USA", "USA", "USA","USA"), "Partner" = c( "EU", "EU","EU","EU", "EU","EU","EU", "EU","EU","EU"), "Product cat." = c("1", "11","111", "12","2", "21", "211", "212", "22", "3"), "Val" = c(200, 170, 170, 30, 100, 50, 25, 5, 40, 220), stringsAsFactors = FALSE)
我们首先创建两个辅助变量 Prodcat1 和 Prodcat2 :
Prodcat1
Prodcat2
# create new variable Prodcat1 df1 <- df %>% group_by(Reporter, Partner) %>% mutate(Prodcat1 = str_extract(Product.cat., "^.{1}")) # create new variable Prodcat2 for my 2nd level product category df1 <- df1 %>% group_by(Reporter, Partner) %>% mutate(Prodcat2 = str_extract(Product.cat., "^.{2}"))
现在我们将数据拆分为两部分,一部分要完成,一部分不需要在第三级进行任何操作:
# to be completed df2 <- df1 %>% group_by(Reporter, Partner, Prodcat2) %>% filter(sum(Val[2:n()]) < Val[1]) # no operation on third level df3 <- df1 %>% group_by(Reporter, Partner, Prodcat2) %>% filter(!sum(Val[2:n()]) < Val[1] | n() == 1)
我们分手了 df2 通过 Prodcat2 ,控制 Reporter 和 Partner
df2
Reporter
Partner
split1 <- split(df2, interaction(df2$Reporter, df2$Partner, df2$Prodcat2)) split1 <- split1[sapply(split1, nrow) != 0]
并在必要时添加新行:
split2 <- lapply(split1, function(x){ y <- rbind.data.frame(x, x[1,]) y[nrow(y), "Product.cat."] <- paste0(y[nrow(y), "Prodcat2"], "k") y[nrow(y), "Val"] <- x[1, "Val"] - sum(x[2:nrow(x), "Val"]) return(y) })
然后我们第一次将数据重新组合在一起 的 并排序 强> 由原件 Product.cat. 。
Product.cat.
split3 <- do.call(rbind, split2) newdf <- do.call(rbind, list(split3, df3)) newdf <- newdf %>% arrange(Product.cat.)
迄今为止的数据:
# A tibble: 11 x 6 # Groups: Reporter, Partner, Prodcat2 [5] Reporter Partner Product.cat. Val Prodcat1 Prodcat2 <chr> <chr> <chr> <dbl> <chr> <chr> 1 USA EU 1 200 1 NA 2 USA EU 11 170 1 11 3 USA EU 111 170 1 11 4 USA EU 12 30 1 12 5 USA EU 2 100 2 NA 6 USA EU 21 50 2 21 7 USA EU 211 25 2 21 8 USA EU 212 5 2 21 9 USA EU 21k 20 2 21 10 USA EU 22 40 2 22 11 USA EU 3 220 3 NA
现在我们进入第二级。首先,我们创建三个部分:
# part to complete df4 <- newdf %>% group_by(Reporter, Partner, Prodcat1) %>% filter(nchar(Product.cat.) < 3) %>% filter(n() == 1 | sum(Val[2:n()]) < Val[1]) # third level rows, which are not necessary here df5 <- newdf %>% group_by(Reporter, Partner, Prodcat1) %>% filter(nchar(Product.cat.) == 3) # second level part already complete df6 <- newdf %>% group_by(Reporter, Partner, Prodcat1) %>% filter(nchar(Product.cat.) < 3) %>% filter(sum(Val[2:n()]) == Val[1])
我们现在再次拆分数据 Prodcat1 ,控制 Reporter 和 Partner :
split3 <- split(df4, interaction(df4$Reporter, df4$Partner, df4$Prodcat1)) split3 <- split3[sapply(split3, nrow) != 0]
我们创建新行:
split4 <- lapply(split3, function(x){ if(nrow(x) == 1){ y <- rbind.data.frame(x, x) y[2, "Product.cat."] <- paste0(y[2, "Prodcat1"], "m") }else{ y <- rbind.data.frame(x, x[1,]) y[nrow(y), "Product.cat."] <- paste0(y[nrow(y), "Prodcat1"], "k") y[nrow(y), "Val"] <- x[1, "Val"] - sum(x[2:nrow(x), "Val"]) } return(y) })
然后我们将它们全部重新组合在一起,再次排序并删除辅助变量。
split5 <- do.call(rbind, split4) finaldf <- do.call(rbind, list(split5, df5, df6)) finaldf <- finaldf %>% ungroup() %>% arrange(Product.cat.) %>% select(-c("Prodcat1", "Prodcat2"))
最终数据:
# A tibble: 13 x 4 Reporter Partner Product.cat. Val <chr> <chr> <chr> <dbl> 1 USA EU 1 200 2 USA EU 11 170 3 USA EU 111 170 4 USA EU 12 30 5 USA EU 2 100 6 USA EU 21 50 7 USA EU 211 25 8 USA EU 212 5 9 USA EU 21k 20 10 USA EU 22 40 11 USA EU 2k 10 12 USA EU 3 220 13 USA EU 3m 220
最后,我们清除了我们所需的所有临时对象的环境
rm(df1, df2, df3, df4, df5, df6, newdf, split1, split2, split3, split4, split5)
这给我们留下了原始数据集 df 以及最终完成的数据集 finaldata :)
df
finaldata