Project author: bdwilliamson

Project description:
Perform inference on algorithm-agnostic variable importance in Python
Primary language: Python
Repository: git://github.com/bdwilliamson/vimpy.git
Created: 2017-04-04T21:46:16Z
Project community: https://github.com/bdwilliamson/vimpy

License: MIT License

" class="reference-link">Python/vimpy: inference on algorithm-agnostic variable importance

Software author: Brian Williamson

Methodology authors: Brian Williamson, Peter Gilbert, Noah Simon, Marco Carone

R package: https://github.com/bdwilliamson/vimp

Introduction

In predictive modeling applications, it is often of interest to determine the relative contribution of subsets of features in explaining an outcome; this is often called variable importance. It is useful to consider variable importance as a function of the unknown, underlying data-generating mechanism rather than the specific predictive algorithm used to fit the data. This package provides functions that, given fitted values from predictive algorithms, compute nonparametric estimates of variable importance based on $R^2$, deviance, classification accuracy, and area under the receiver operating characteristic curve, along with asymptotically valid confidence intervals for the true importance.
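As a rough sketch of the estimand for the $R^2$-based measure (the notation here is schematic; the manuscripts below give the precise definitions), the importance of a feature subset $s$ is the drop in population-level predictiveness incurred by removing $X_s$ from the oracle regression:

$$
\psi_{0,s} \;=\; V(f_0, P_0) \;-\; V(f_{0,-s}, P_0),
\qquad
V(f, P_0) \;=\; 1 - \frac{E_{P_0}\left[\{Y - f(X)\}^2\right]}{\operatorname{Var}_{P_0}(Y)},
$$

where $f_0(x) = E_{P_0}(Y \mid X = x)$ is the full conditional mean and $f_{0,-s}$ is the analogous conditional mean using all features except those in $s$.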

For more details, please see the accompanying manuscripts “Nonparametric variable importance assessment using machine learning techniques” by Williamson, Gilbert, Carone, and Simon (Biometrics, 2020), “A unified approach for inference on algorithm-agnostic variable importance” by Williamson, Gilbert, Simon, and Carone (arXiv, 2020), and “Efficient nonparametric statistical inference on population feature importance using Shapley values” by Williamson and Feng (arXiv, 2020; to appear in the Proceedings of the Thirty-seventh International Conference on Machine Learning [ICML 2020]).

Installation

You may install a stable release of vimpy using pip by running pip install vimpy from a Terminal window. Alternatively, you may install within a virtualenv environment.

You may install the current dev release of vimpy by downloading this repository directly.
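In either case, you can quickly confirm that the package is importable (a minimal sanity check; the printed path is machine-specific):

```python
## confirm that vimpy imports cleanly and report where it was installed
import vimpy
print(vimpy.__file__)
```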

Issues

If you encounter any bugs or have any specific feature requests, please file an issue.

Example

This example shows how to use vimpy in a simple setting with simulated data, using a single regression function. For more examples and a more detailed explanation, please see the vignette for the companion R package.

```python
## load required libraries
import numpy as np
import vimpy
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

## -------------------------------------------------------------
## problem setup
## -------------------------------------------------------------
## define a function for the conditional mean of Y given X
def cond_mean(x = None):
    f1 = np.where(np.logical_and(-2 <= x[:, 0], x[:, 0] < 2), np.floor(x[:, 0]), 0)
    f2 = np.where(x[:, 1] <= 0, 1, 0)
    f3 = np.where(x[:, 2] > 0, 1, 0)
    f6 = np.absolute(x[:, 5] / 4) ** 3
    f7 = np.absolute(x[:, 6] / 4) ** 5
    f11 = (7. / 3) * np.cos(x[:, 10] / 2)
    ret = f1 + f2 + f3 + f6 + f7 + f11
    return ret

## create data
np.random.seed(4747)
n = 100
p = 15
s = 1 # importance desired for X_1
x = np.zeros((n, p))
for i in range(0, x.shape[1]):
    x[:, i] = np.random.normal(0, 2, n)
y = cond_mean(x) + np.random.normal(0, 1, n)

## -------------------------------------------------------------
## preliminary step: get regression estimators
## -------------------------------------------------------------
## use grid search to get optimal number of trees and learning rate
ntrees = np.arange(100, 500, 100)
lr = np.arange(.01, .1, .05)
param_grid = [{'n_estimators': ntrees, 'learning_rate': lr}]
## set up cv objects
## (note: scikit-learn >= 1.0 renamed loss = 'ls' to loss = 'squared_error')
cv_full = GridSearchCV(GradientBoostingRegressor(loss = 'ls', max_depth = 1), param_grid = param_grid, cv = 5)
cv_small = GridSearchCV(GradientBoostingRegressor(loss = 'ls', max_depth = 1), param_grid = param_grid, cv = 5)

## -------------------------------------------------------------
## get variable importance estimates
## -------------------------------------------------------------
## set seed
np.random.seed(12345)
## set up the vimp object
vimp = vimpy.vim(y = y, x = x, s = 1, pred_func = cv_full, measure_type = "r_squared")
## get the point estimate of variable importance
vimp.get_point_est()
## get the influence function estimate
vimp.get_influence_function()
## get a standard error
vimp.get_se()
## get a confidence interval
vimp.get_ci()
## do a hypothesis test, compute p-value
vimp.hypothesis_test(alpha = 0.05, delta = 0)
## display the estimates, etc.
vimp.vimp_
vimp.se_
vimp.ci_
vimp.p_value_
vimp.hyp_test_

## -------------------------------------------------------------
## using precomputed fitted values
## -------------------------------------------------------------
np.random.seed(12345)
folds_outer = np.random.choice(a = np.arange(2), size = n, replace = True, p = np.array([0.5, 0.5]))
## fit the full regression on one half of the data
cv_full.fit(x[folds_outer == 1, :], y[folds_outer == 1])
full_fit = cv_full.best_estimator_.predict(x[folds_outer == 1, :])
## fit the reduced regression on the other half
x_small = np.delete(x[folds_outer == 0, :], s, 1) # delete the columns in s
cv_small.fit(x_small, y[folds_outer == 0])
small_fit = cv_small.best_estimator_.predict(x_small)
## get variable importance estimates
np.random.seed(12345)
vimp_precompute = vimpy.vim(y = y, x = x, s = 1, f = full_fit, r = small_fit, measure_type = "r_squared", folds = folds_outer)
## get the point estimate of variable importance
vimp_precompute.get_point_est()
## get the influence function estimate
vimp_precompute.get_influence_function()
## get a standard error
vimp_precompute.get_se()
## get a confidence interval
vimp_precompute.get_ci()
## do a hypothesis test, compute p-value
vimp_precompute.hypothesis_test(alpha = 0.05, delta = 0)
## display the estimates, etc.
vimp_precompute.vimp_
vimp_precompute.se_
vimp_precompute.ci_
vimp_precompute.p_value_
vimp_precompute.hyp_test_

## -------------------------------------------------------------
## get variable importance estimates using cross-validation
## -------------------------------------------------------------
np.random.seed(12345)
## set up the vimp object
vimp_cv = vimpy.cv_vim(y = y, x = x, s = 1, pred_func = cv_full, V = 5, measure_type = "r_squared")
## get the point estimate
vimp_cv.get_point_est()
## get the influence function estimate
vimp_cv.get_influence_function()
## get a standard error
vimp_cv.get_se()
## get a confidence interval
vimp_cv.get_ci()
## do a hypothesis test, compute p-value
vimp_cv.hypothesis_test(alpha = 0.05, delta = 0)
## display estimates, etc.
vimp_cv.vimp_
vimp_cv.se_
vimp_cv.ci_
vimp_cv.p_value_
vimp_cv.hyp_test_
```
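Each of the three objects above exposes the same result attributes once the corresponding get_* methods have run. A minimal sketch of collecting them into a readable summary (the attribute names are those used above; the print formatting is illustrative):

```python
## summarize the cross-validated importance estimate for X_1
print("Point estimate:", vimp_cv.vimp_)
print("Standard error:", vimp_cv.se_)
print("95% confidence interval:", vimp_cv.ci_)
print("p-value:", vimp_cv.p_value_)
print("Reject the null of zero importance?", vimp_cv.hyp_test_)
```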

The logo was created using hexSticker, lisa, and a python image distributed under the CC0 license. Many thanks to the maintainers of these packages and the Color Lisa team.