You can use the train_test_split function provided by sklearn. See the documentation here:
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
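As a minimal sketch of that suggestion, the snippet below splits a list of file names into 700 training items and the rest for testing. The file names and the total of 1000 are made up for illustration, and random_state is only set here so the split is repeatable.

```python
from sklearn.model_selection import train_test_split

# Hypothetical list of 1000 document file names, for illustration only.
filenames = [f"doc_{i:04d}.txt" for i in range(1000)]

# An integer train_size is an absolute count: 700 items go to training,
# and the remaining 300 go to the test set.
# random_state fixes the shuffle so the same split comes back on every run.
training_files, testing_files = train_test_split(
    filenames, train_size=700, random_state=42
)

print(len(training_files), len(testing_files))  # 700 300
```

Passing a float such as train_size=0.7 instead asks for a proportion rather than a fixed count, which is handy when the number of documents is not known in advance.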
Simply build a list of file names with os.listdir(), shuffle it in place with random.shuffle() (the shuffle function lives in the random module, not in collections), and then slice it: training_files = filenames[:700] and testing_files = filenames[700:].
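The standard-library approach can be sketched roughly as follows. It uses random.shuffle (the collections module has no shuffle function). The helper name split_files, the seed parameter, and the temporary directory of 1000 empty files are assumptions added so the example runs on its own; in practice you would point it at your real document folder.

```python
import os
import random
import tempfile


def split_files(directory, n_train, seed=None):
    """Shuffle the file names in `directory` and split them into two lists."""
    # os.listdir returns names in arbitrary order; sorting first means a
    # fixed seed always produces the same split.
    filenames = sorted(os.listdir(directory))
    # Shuffle in place with a private generator so global random state is untouched.
    random.Random(seed).shuffle(filenames)
    return filenames[:n_train], filenames[n_train:]


# Demo on a throwaway directory holding 1000 empty files (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    for i in range(1000):
        open(os.path.join(tmp, f"doc_{i:04d}.txt"), "w").close()
    training_files, testing_files = split_files(tmp, 700, seed=0)

print(len(training_files), len(testing_files))  # 700 300
```

Leaving seed as None gives a different split on each run, which matches the behavior of a plain random.shuffle(filenames) call.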
If you use numpy this is straightforward: first load the documents into a numpy array, then:
    import numpy as np

    docs = np.array([
        'one', 'two', 'three', 'four', 'five',
        'six', 'seven', 'eight', 'nine', 'ten',
    ])

    idx = np.hstack((np.ones(7), np.zeros(3)))  # generate indices
    np.random.shuffle(idx)  # shuffle to make training data and test data random

    train = docs[idx == 1]
    test = docs[idx == 0]

    print(train)
    print(test)
Result (one possible run; the split changes each time because the shuffle is random):

    ['one' 'two' 'three' 'six' 'eight' 'nine' 'ten']
    ['four' 'five' 'seven']