Following user3914041's minimal example and Andreus' answer, this works as intended. Indeed, I got:
```
Validation Sample Score: 10.176958 (mean squared).
Fitting 1 folds for each of 1 candidates, totalling 1 fits
mean: 10.19074, std: 0.00000, params: {'n_estimators': 1000}
```
In this case we get the same result with both approaches (up to some rounding). Here is the code that reproduces the same scores:
```python
from sklearn.cross_validation import train_test_split, PredefinedSplit
from sklearn.ensemble import RandomForestRegressor
import numpy as np
from sklearn import metrics
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.datasets import load_boston

b = load_boston()
X = b.data
y = b.target

folds = train_test_split(range(len(y)), test_size=0.5, random_state=10)
train_X = X[folds[0], :]
train_y = y[folds[0]]
test_X = X[folds[1], :]
test_y = y[folds[1]]

folds_split = np.zeros_like(y)
folds_split[folds[0]] = -1
ps = PredefinedSplit(folds_split)

rf = RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(train_X, train_y)
y_submission = rf.predict(test_X)
print "Validation Sample Score: %f (mean squared)." % mean_squared_error(test_y, y_submission)

mse_scorer = make_scorer(mean_squared_error)
parameters = {'n_estimators': [1000]}
grid_search = GridSearchCV(RandomForestRegressor(random_state=42), cv=ps,
                           param_grid=parameters, verbose=1, scoring=mse_scorer)
grid_search.fit(X, y)
print grid_search.grid_scores_[0]
```
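The snippet above targets the old Python 2 / pre-0.20 scikit-learn API. A sketch of the same equivalence check against the modern API is below; it assumes a recent scikit-learn (`sklearn.model_selection` instead of the removed `cross_validation`/`grid_search` modules) and swaps in `load_diabetes`, since `load_boston` has also been removed. Sorting the indices up front makes the manual fit see the rows in the same order `PredefinedSplit` yields them, so the two scores should agree:

```python
import numpy as np
from sklearn.model_selection import train_test_split, PredefinedSplit, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.datasets import load_diabetes  # stand-in for the removed load_boston

X, y = load_diabetes(return_X_y=True)

# sort so the manual fit sees rows in the same order PredefinedSplit yields them
train_idx, test_idx = train_test_split(np.arange(len(y)), test_size=0.5,
                                       random_state=10)
train_idx, test_idx = np.sort(train_idx), np.sort(test_idx)

# manual evaluation on the held-out half
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X[train_idx], y[train_idx])
manual_mse = mean_squared_error(y[test_idx], rf.predict(X[test_idx]))

# same evaluation routed through GridSearchCV with one predefined fold
test_fold = np.zeros_like(y)
test_fold[train_idx] = -1  # -1 marks rows that never appear in a test set
gs = GridSearchCV(RandomForestRegressor(random_state=42),
                  param_grid={'n_estimators': [100]},
                  cv=PredefinedSplit(test_fold),
                  scoring=make_scorer(mean_squared_error))
gs.fit(X, y)
grid_mse = gs.cv_results_['mean_test_score'][0]

print(manual_mse, grid_mse)
```

With identical row order and identical seeds, both paths fit the same forest, so the two printed numbers match.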
In your first example, try removing greater_is_better=True. Indeed, the Gini coefficient should be minimized, not maximized.
Try that and see whether it fixes the problem. You could also add a random seed to make sure the split is done in exactly the same way.
I can spot one difference between the two code blocks: by using cv=2, you split the data into two 50%-sized chunks, and the resulting Gini is then averaged between them.
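To make the cv=2 behavior concrete, here is a minimal sketch (modern `sklearn.model_selection` import paths assumed; `LinearRegression` is just a stand-in model) showing that an integer cv=2 means two ~50%-sized folds, each scored once, with the quoted number being their mean:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)

# cv=2 is shorthand for a 2-fold split: each half serves once as the test set
scores = cross_val_score(LinearRegression(), X, y, cv=2)
print(scores)         # one score per 50% fold
print(scores.mean())  # the single reported number is their average

# the underlying folds really are two ~50% chunks
for train_index, test_index in KFold(n_splits=2).split(X):
    print(len(train_index), len(test_index))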
As a side note, are you sure you want greater_is_better=True in your scorer? From your post, I gather you want that score to go down. Be especially careful on this point, because GridSearchCV maximizes the score.
From the GridSearchCV documentation:
"The parameters selected are those that maximize the score of the left out data, unless an explicit score is passed in which case it is used instead."
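The practical consequence: for a loss you want to drive down, pass greater_is_better=False so make_scorer flips the sign and GridSearchCV's maximization ends up picking the lowest error. A small sketch (DummyRegressor is just a stand-in model, not part of the original code):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_squared_error

X = np.zeros((3, 1))          # dummy features
y = np.array([1.0, 2.0, 3.0])

est = DummyRegressor(strategy='mean').fit(X, y)  # always predicts 2.0
mse = mean_squared_error(y, est.predict(X))

# greater_is_better=False makes the scorer return the *negated* metric,
# so a maximizer like GridSearchCV effectively minimizes the error
loss_scorer = make_scorer(mean_squared_error, greater_is_better=False)
print(mse)                   # 0.666...
print(loss_scorer(est, X, y))  # -0.666...
```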
This thread is pretty old now, so I assume you've all figured this out by now, but for the record, there were at least 3 problems in the original two blocks that led them to produce different results. In short: failing to set random seeds, failing to couple the two blocks on the folds returned by train_test_split, and failing to use the PredefinedSplit iterator (iterating it can end up reordering the split). Here is self-contained code illustrating this, using a different gini implementation:
```python
import sys
import numpy as np
import pandas as pd
from sklearn.cross_validation import train_test_split, PredefinedSplit
from sklearn.feature_extraction import DictVectorizer as DV
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics

def gini(expected, predicted):
    assert expected.shape[0] == predicted.shape[0], \
        'unequal number of rows: [ %d vs %d ]' % (expected.shape[0], predicted.shape[0])
    _all = np.asarray(np.c_[expected, predicted, np.arange(expected.shape[0])],
                      dtype=np.float)
    _EXPECTED = 0
    _PREDICTED = 1
    _INDEX = 2
    # sort by predicted descending, then by index ascending
    sort_order = np.lexsort((_all[:, _INDEX], -1 * _all[:, _PREDICTED]))
    _all = _all[sort_order]
    total_losses = _all[:, _EXPECTED].sum()
    gini_sum = _all[:, _EXPECTED].cumsum().sum() / total_losses
    gini_sum -= (expected.shape[0] + 1.0) / 2.0
    return gini_sum / expected.shape[0]

def gini_normalized(solution, submission, gini=gini):
    solution = np.array(solution)
    submission = np.array(submission)
    return gini(solution, submission) / gini(solution, solution)

gini_scorer = metrics.make_scorer(gini_normalized, greater_is_better=True)

dat = pd.read_table('train.csv', sep=',')
y = dat[['Hazard']].values.ravel()
dat = dat.drop(['Hazard', 'Id'], axis=1)

# 1. set seed for train_test_split()
folds = train_test_split(range(len(y)), test_size=0.7, random_state=15)  # 70% test

dat_dict = dat.T.to_dict().values()
vectorizer = DV(sparse=False)
vectorizer.fit(dat_dict)
dat = vectorizer.transform(dat_dict)
dat = pd.DataFrame(dat)

# 2. instead of using the raw folds returned by train_test_split,
#    use the PredefinedSplit iterator, just like GridSearchCV does
if 0:
    train_X = dat.iloc[folds[0]]
    train_y = y[folds[0]]
    test_X = dat.iloc[folds[1]]
    test_y = y[folds[1]]
else:
    folds_split = np.zeros_like(y)
    folds_split[folds[0]] = -1
    ps = PredefinedSplit(folds_split)
    # in this example, there's only one iteration here
    for train_index, test_index in ps:
        train_X, test_X = dat.iloc[train_index], dat.iloc[test_index]
        train_y, test_y = y[train_index], y[test_index]

n_estimators = [100, 200]

# 3. also set seed for RFR
rfr_params = {'n_jobs': 7, 'random_state': 15}

######################################################################
# manual grid search ( block 1 )
for n_est in n_estimators:
    print 'n_estimators = %d:' % n_est; sys.stdout.flush()
    rfr = RandomForestRegressor(n_estimators=n_est, **rfr_params)
    rfr.fit(train_X, train_y)
    y_pred = rfr.predict(test_X)
    gscore = gini_normalized(test_y, y_pred)
    print '  validation score: %.5f (normalized gini)' % gscore

######################################################################
# GridSearchCV grid search ( block 2 )
ps = PredefinedSplit(folds_split)
rfr = RandomForestRegressor(**rfr_params)
grid_params = {'n_estimators': n_estimators}
gcv = GridSearchCV(rfr, grid_params, scoring=gini_scorer, cv=ps)
gcv.fit(dat, y)
print gcv.grid_scores_
```
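Point (3) can be shown in isolation with a toy sketch (modern PredefinedSplit from `sklearn.model_selection`, which is iterated via `.split()`): train_test_split hands back shuffled indices, while the PredefinedSplit iterator yields the same index *set* sorted ascending, so indexing rows directly with the raw folds feeds the model a differently ordered sample than GridSearchCV sees:

```python
import numpy as np
from sklearn.model_selection import train_test_split, PredefinedSplit

idx = np.arange(10)
train_raw, test_raw = train_test_split(idx, test_size=0.5, random_state=15)
print(train_raw)  # shuffled order

test_fold = np.zeros(10)
test_fold[train_raw] = -1  # -1 = sample never appears in a test set
ps = PredefinedSplit(test_fold)
for train_index, test_index in ps.split():
    print(train_index)  # same set of indices, but sorted ascending
```

With a seeded model whose fit depends on row order (e.g. a bootstrapped forest), that reordering alone is enough to produce slightly different scores between the two blocks.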