How do I split documents into a training set and a test set?

Author: 圈圈红
Posted: 2025-04-08 11:29:06 (2 months ago)

4 replies

0#
妖邪 | 2019-08-31 10-32

<div class="post-text" itemprop="text"> <p>You can use the train_test_split method provided by sklearn. See the documentation here:</p> <p><a href="http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html" rel="nofollow noreferrer">http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html</a></p> </div>
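
A minimal sketch of this suggestion, assuming the documents are already loaded into a Python list. The `docs` list, the 30% test size, and the seed are illustrative choices, not part of the original reply:

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the loaded documents.
docs = [f"document {i}" for i in range(10)]

# Hold out 30% of the documents for testing. random_state fixes the
# shuffle so the same split comes back on every run.
train_docs, test_docs = train_test_split(docs, test_size=0.3, random_state=42)

print(len(train_docs), len(test_docs))
```

If each document has a label, passing the label list as a second argument should split both in step, so the pairs stay aligned.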

1#
v-star*위위 | 2019-08-31 10-32

<div class="post-text" itemprop="text"> <p>Just build a list of file names with <code>os.listdir()</code>. Shuffle it with <code>random.shuffle()</code>, then take <code>training_files = filenames[:700]</code> and <code>testing_files = filenames[700:]</code>.</p> </div>
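
The steps in this reply can be sketched as follows. The helper name `split_filenames`, the temporary demo directory, and the 7/3 split are assumptions made for illustration; the reply itself uses a cut-off of 700, which presumably matches the asker's corpus:

```python
import os
import random
import tempfile


def split_filenames(directory, train_count, seed=None):
    """Shuffle the file names in `directory` and split them into two lists."""
    # os.listdir() returns names in arbitrary order, so sort first to get
    # a stable starting point before shuffling.
    filenames = sorted(os.listdir(directory))
    random.Random(seed).shuffle(filenames)
    return filenames[:train_count], filenames[train_count:]


# Demo on a throwaway directory holding 10 empty files.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(10):
        open(os.path.join(tmp, f"doc_{i}.txt"), "w").close()
    training_files, testing_files = split_filenames(tmp, train_count=7, seed=0)

print(len(training_files), len(testing_files))
```

Passing a fixed `seed` keeps the split reproducible; leaving it as `None` gives a different split on each run.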

2#
荧惑 | 2019-08-31 10-32

<div class="post-text" itemprop="text">
<p>If you use numpy this is simple. First load the documents into a numpy array, then:</p>
<pre><code>import numpy as np

docs = np.array([
    'one', 'two', 'three', 'four', 'five',
    'six', 'seven', 'eight', 'nine', 'ten',
])

# Build a 0/1 mask: seven ones (train) and three zeros (test).
idx = np.hstack((np.ones(7), np.zeros(3)))
# Shuffle the mask so the train/test assignment is random.
np.random.shuffle(idx)

train = docs[idx == 1]
test = docs[idx == 0]

print(train)
print(test)
</code></pre>
<p>Result (from one run; the shuffle makes it vary):</p>
<pre><code>['one' 'two' 'three' 'six' 'eight' 'nine' 'ten']
['four' 'five' 'seven']
</code></pre>
</div>
