Project author: KienVu2368

Project description: Tabular data interpretation and explanation
Primary language: Jupyter Notebook
Project address: git://github.com/KienVu2368/tabint.git
Created: 2018-09-07T14:33:01Z
Project community: https://github.com/KienVu2368/tabint

License: MIT License

Welcome to tabint

NB: this project is under development; there are many things we want to build that are not done yet. If you want to contribute, please feel free to do so. We follow the nbdev style, so if you do contribute, please do so accordingly. For more information about nbdev, please see the nbdev documentation.

Installing

  git clone https://github.com/KienVu2368/tabint
  cd tabint
  conda env create -f environment.yml
  conda activate tabint

Pre-processing

  import pandas as pd
  df = pd.read_csv('df_sample.csv')
  df_proc, y, pp_outp = tabular_proc(df, 'TARGET', [fill_na(), app_cat(), dummies()])

Pre-processing classes share a unified interface: subclass TBPreProc and implement a static func method.

  class cls(TBPreProc):
      @staticmethod
      def func(df, pp_outp, na_dict = None):
          ...
          return df

For example, the fill_na class:

  class fill_na(TBPreProc):
      @staticmethod
      def func(df, pp_outp, na_dict = None):
          na_dict = {} if na_dict is None else na_dict.copy()
          na_dict_initial = na_dict.copy()
          for n, c in df.items(): na_dict = fix_missing(df, c, n, na_dict)
          if len(na_dict_initial.keys()) > 0:
              df.drop([a + '_na' for a in list(set(na_dict.keys()) - set(na_dict_initial.keys()))], axis=1, inplace=True)
          pp_outp['na_dict'] = na_dict
          return df
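Any new pre-processing step follows the same pattern. As a further illustration, a hypothetical drop_const class (not part of tabint, just a sketch of the interface) could remove constant columns and record what it dropped in pp_outp:

  class drop_const(TBPreProc):
      @staticmethod
      def func(df, pp_outp):
          # hypothetical example: drop columns that hold a single unique value
          const_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
          df.drop(const_cols, axis=1, inplace=True)
          pp_outp['const_cols'] = const_cols  # record the dropped columns, as fill_na records na_dict
          return df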

Dataset

The Dataset class holds the training set, validation set and test set.

A dataset can be built with scikit-learn's split method:

  ds = TBDataset.from_SKSplit(df_proc, y, cons, cats, ratio = 0.2)

Or with tabint's own split method, which tries to keep the distribution of categorical variables the same between the training set and the validation set:

  ds = TBDataset.from_TBSplit(df_proc, y, cons, cats, ratio = 0.2)
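As a rough sanity check of that property, the category proportions of the two sets can be compared with plain pandas. This is only a sketch: it assumes the dataset exposes x_trn and x_val DataFrames (x_trn is used later in this README; x_val is assumed by analogy):

  for col in cats:
      # proportions of each category should be similar in the training and validation sets
      trn_dist = ds.x_trn[col].value_counts(normalize=True)
      val_dist = ds.x_val[col].value_counts(normalize=True)
      print(pd.concat([trn_dist, val_dist], axis=1, keys=['train', 'valid']))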

The Dataset class has methods that edit the training set, validation set and test set simultaneously.

The drop method removes one or many columns from the training set, validation set and test set.

  ds.drop('DAYS_LAST_PHONE_CHANGE_na')

Or, if we only need to keep the important columns (for example those found with the Importance class described below), use the dataset's keep method:

  impt_features = impt.top_features(24)
  ds.keep(impt_features)

The Dataset class in tabint can also apply a function simultaneously to the training set, validation set and test set:

  ds.apply('DAYS_BIRTH', lambda df: -df['DAYS_BIRTH']/365)

Or we can pass many transformation functions at once:

  tfs = {'drop 1': ['AMT_REQ_CREDIT_BUREAU_HOUR_na', 'AMT_REQ_CREDIT_BUREAU_YEAR_na'],
         'apply': {'DAYS_BIRTH': lambda df: -df['DAYS_BIRTH']/365,
                   'DAYS_EMPLOYED': lambda df: -df['DAYS_EMPLOYED']/365,
                   'NEW_EXT_SOURCES_MEAN': lambda df: df[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean(axis=1, skipna=True),
                   'NEW_EXT_SOURCES_GEO': lambda df: (df['EXT_SOURCE_1']*df['EXT_SOURCE_2']*df['EXT_SOURCE_3'])**(1/3),
                   'AMT_CREDIT/AMT_GOODS_PRICE': lambda df: df['AMT_CREDIT']/df['AMT_GOODS_PRICE'],
                   'AMT_CREDIT/AMT_CREDIT': lambda df: df['AMT_CREDIT']/df['AMT_CREDIT'],
                   'DAYS_EMPLOYED/DAYS_BIRTH': lambda df: df['DAYS_EMPLOYED']/df['DAYS_BIRTH'],
                   'DAYS_BIRTH*EXT_SOURCE_1_na': lambda df: df['DAYS_BIRTH']*df['EXT_SOURCE_1_na']},
         'drop 2': ['AMT_ANNUITY', 'AMT_CREDIT', 'AMT_GOODS_PRICE']}
  ds.transform(tfs)

Learner

The Learner class unifies the training interface across model libraries. For an LGBM model:

  learner = LGBLearner()
  params = {'task': 'train', 'objective': 'binary', 'metric': 'binary_logloss'}
  learner.fit(params, *ds.trn, *ds.val)
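For reference, training the same model with the LightGBM API directly would look roughly like the sketch below (plain lightgbm, not tabint's internals; it assumes ds.trn and ds.val unpack to (features, target) pairs, as the learner.fit call above suggests):

  import lightgbm as lgb

  x_trn, y_trn = ds.trn                      # assumed to unpack to (features, target)
  x_val, y_val = ds.val
  trn_data = lgb.Dataset(x_trn, label=y_trn)
  val_data = lgb.Dataset(x_val, label=y_val, reference=trn_data)
  booster = lgb.train(params, trn_data, valid_sets=[val_data])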

For a scikit-learn model:

  learner = SKLearner(RandomForestClassifier())
  learner.fit(*ds.trn, *ds.val)

Support for XGB models is a work in progress (WIP).

Feature correlation

tabint uses a dendrogram to make it easy to see and pick features with high correlation:

  ddg = Dendogram.from_df(ds.x_trn)
  ddg.plot()
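Under the hood, this kind of plot can be built with hierarchical clustering on a rank-correlation matrix. The sketch below uses scipy directly and is not necessarily tabint's exact implementation:

  import numpy as np
  import matplotlib.pyplot as plt
  import scipy.cluster.hierarchy as hc
  from scipy.spatial.distance import squareform
  from scipy.stats import spearmanr

  corr = np.round(spearmanr(ds.x_trn).correlation, 4)   # rank correlation between features
  dist = squareform(1 - corr)                           # turn similarity into condensed distances
  linkage = hc.linkage(dist, method='average')
  plt.figure(figsize=(14, 8))
  hc.dendrogram(linkage, labels=ds.x_trn.columns, orientation='left')
  plt.show()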

Feature importance

tabint uses permutation importance: each column or group of columns in the dataset's validation set is permuted to calculate its importance.

  group_cols = [['AMT_CREDIT', 'AMT_GOODS_PRICE', 'AMT_ANNUITY'], ['FLAG_OWN_CAR_N', 'OWN_CAR_AGE_na']]
  impt = Importance.from_Learner(learner, ds, group_cols)
  impt.plot()
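Conceptually, permutation importance shuffles one column (or group of columns) at a time and measures how much the validation score drops. A minimal sketch with scikit-learn metrics, not tabint's implementation, assuming a fitted classifier model and validation data x_val, y_val:

  import numpy as np
  from sklearn.metrics import roc_auc_score

  def permutation_importance(model, x_val, y_val, col):
      # score with the original column, then with the column shuffled
      base = roc_auc_score(y_val, model.predict_proba(x_val)[:, 1])
      shuffled = x_val.copy()
      shuffled[col] = np.random.permutation(shuffled[col].values)
      permuted = roc_auc_score(y_val, model.predict_proba(shuffled)[:, 1])
      return base - permuted  # the larger the drop, the more important the column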

We can easily get the most important features with a method of the Importance class:

  impt.top_features(24)

Model performance

Classification problem

Receiver operating characteristic

  roc = ReceiverOperatingCharacteristic.from_learner(learner, ds)
  roc.plot()

Probability distribution

  kde = KernelDensityEstimation.from_learner(learner, ds)
  kde.plot()

Precision and Recall

  pr = PrecisionRecall.from_series(y_true, y_pred)
  pr.plot()

Regression problem

Actual vs Predict

  avp = actual_vs_predict.from_learner(learner, ds)
  avp.plot(hue = 'Height')

Interpretation and explanation

Partial dependence

tabint uses the PDPbox library to visualize partial dependence.

  pdp = PartialDependence.from_Learner(learner, ds)

info target plot

  pdp.info_target_plot('EXT_SOURCE_3')

We can view the result as a table:

  pdp.info_target_data()

isolate plot

  pdp.isolate_plot('EXT_SOURCE_3')
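For comparison, partial dependence for a single feature can also be computed with scikit-learn's own tooling (a sketch, assuming a fitted scikit-learn estimator named model rather than tabint's learner wrapper):

  from sklearn.inspection import PartialDependenceDisplay

  # average model response as EXT_SOURCE_3 varies, averaged over the observed data
  PartialDependenceDisplay.from_estimator(model, ds.x_trn, ['EXT_SOURCE_3'])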

Tree interpreter

  Tf = Traterfall.from_SKTree(learner, ds.x_trn, 3)
  Tf.plot(formatting = "$ {:,.3f}")

We can view and filter the result table:

  Tf.data.pos(5)
  Tf.data.neg(5)
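The waterfall decomposes a single prediction into a bias term plus one contribution per feature. With a scikit-learn tree ensemble, the same decomposition can be obtained from the treeinterpreter package (a sketch; rf_model stands for the underlying fitted scikit-learn model, and row 3 matches the observation used above):

  from treeinterpreter import treeinterpreter as ti

  row = ds.x_trn.iloc[[3]]                                  # the observation explained above
  prediction, bias, contributions = ti.predict(rf_model, row)
  # for each output, prediction = bias + sum of contributions over features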

SHAP

tabint visualizes SHAP values from the SHAP library. The SHAP library uses red and blue as its default colors; tabint changes these to green and blue, which are easier to read and consistent with the PDPbox library.

  Shap = SHAP.from_Tree(learner, ds)

force plot

  Shap.one_force_plot(3)

And we can view the result as a table as well:

  Shap.one_force_data.pos(5)
  Shap.one_force_data.neg(5)

dependence plot

  Shap.dependence_plot('EXT_SOURCE_2')
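For reference, the same kind of plot can be produced with the shap library directly (a sketch; model stands for the underlying fitted tree model, which tabint's wrapper hides, and exact array shapes depend on the model type and shap version):

  import shap

  explainer = shap.TreeExplainer(model)
  shap_values = explainer.shap_values(ds.x_trn)
  # for binary classifiers shap_values may be a list of two arrays; use shap_values[1] in that case
  shap.dependence_plot('EXT_SOURCE_2', shap_values, ds.x_trn)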