Feature/variable importance after a PCA analysis

<div class="post-text" itemprop="text">
  
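    <strong>
      First of all, I assume that by
       <code>features</code>
       you mean the variables and
       <code>not the samples/observations</code>
      . In that case, you can do something like the following by creating a
       <code>biplot</code>
       function that shows everything in one plot. In this example I am using the iris data:
    </strong>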
  
  
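    <strong>
      Before the example, note that the basic idea when using PCA as a feature selection tool is to select variables according to the magnitude of their coefficients (loadings), ordered from the largest to the smallest absolute value. See the last paragraph after the plot for more details.
    </strong>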
  
  <HR />
   <pre>
 <code>
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data
y = iris.target

# In general, it is a good idea to scale the data before PCA.
scaler = StandardScaler()
X = scaler.fit_transform(X)

pca = PCA()
x_new = pca.fit_transform(X)

def myplot(score, coeff, labels=None):
    """Draw a biplot: rescaled scores as points, loadings as arrows."""
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    # Rescale the scores so they fit in the same window as the loadings.
    scalex = 1.0 / (xs.max() - xs.min())
    scaley = 1.0 / (ys.max() - ys.min())
    # Points are colored by the global iris target y.
    plt.scatter(xs * scalex, ys * scaley, c=y)
    for i in range(n):
        # One arrow per original feature: its loadings on PC1 and PC2.
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        name = "Var" + str(i + 1) if labels is None else labels[i]
        plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, name,
                 color='g', ha='center', va='center')
    plt.xlim(-1, 1)
    plt.ylim(-1, 1)
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()

# Call the function. Use only the first 2 PCs.
myplot(x_new[:, 0:2], np.transpose(pca.components_[0:2, :]))
plt.show()

</code>
 </pre>
  <HR />
  
    <strong>
      Visualize what is going on using the biplot
    </strong>
  
  
    <a href="https://i.stack.imgur.com/r4fLJ.png" rel="nofollow noreferrer">
      <img src="https://i.stack.imgur.com/r4fLJ.png" alt="BIPLOT RESULT" />
    </a>
  
  <HR />
  
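    <strong>
      Now, the importance of each feature is reflected by the magnitude of the corresponding values in the eigenvectors (the higher the magnitude, the higher the importance)
    </strong>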
  
  
    Let's first see how much of the variance each PC explains.
  
   <pre>
 <code>
pca.explained_variance_ratio_
[0.72770452, 0.23030523, 0.03683832, 0.00515193]

</code>
 </pre>
  
     <code>PC1 explains 72.8%</code>
     and
     <code>PC2 23.0%</code>
    of the variance. Together, if we keep only PC1 and PC2, they explain
     <code>95.8%</code>
    .
  
  
    Now, let's find the most important features.
  
   <pre>
 <code>
print(abs(pca.components_))

[[0.52237162 0.26335492 0.58125401 0.56561105]
 [0.37231836 0.92555649 0.02109478 0.06541577]
 [0.72101681 0.24203288 0.14089226 0.6338014 ]
 [0.26199559 0.12413481 0.80115427 0.52354627]]

</code>
 </pre>
  <HR />
  
    <strong>
      Here,
       <code>pca.components_</code>
       has shape
       <code>[n_components, n_features]</code>
      . Thus, by looking at
       <code>PC1</code>
       (the first principal component), which is the first row:
       <code>[0.52237162 0.26335492 0.58125401 0.56561105]</code>
       we can conclude that
       <code>feature 1, 3 and 4</code>
       (or Var 1, 3 and 4 in the biplot) are the most important for PC1.
    </strong>
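    Reading the ranking off the matrix by eye gets tedious with more features. A minimal sketch of doing it programmatically, assuming the same standardized iris data; the variable names loadings and order are introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris.data)
pca = PCA().fit(X)

# Absolute loadings, shape (n_components, n_features).
loadings = np.abs(pca.components_)

# For each component, feature indices ordered from largest to smallest loading.
order = np.argsort(-loadings, axis=1)

for pc, idx in enumerate(order, start=1):
    ranked = [f"{iris.feature_names[j]} ({loadings[pc - 1, j]:.3f})" for j in idx]
    print(f"PC{pc}: " + ", ".join(ranked))
```

    Each printed row lists the features of one principal component in decreasing order of absolute loading, so the first few names in the PC1 row are the candidates discussed above.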
  
  
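    To sum up, look at the absolute values of the components of the eigenvectors that correspond to the k largest eigenvalues. In
     <code>sklearn</code>
     the components are sorted by
     <code>explained_variance_</code>
    . The larger these absolute values are, the more a specific feature contributes to that principal component.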
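    The link between eigenvectors, eigenvalues and the fitted PCA object can be checked numerically. This is a sketch under the assumption that the data is the standardized iris matrix used above; it compares an eigen-decomposition of the sample covariance matrix with what PCA stores, and compares eigenvectors by absolute value because their sign is arbitrary.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(datasets.load_iris().data)
pca = PCA().fit(X)

# Eigen-decomposition of the sample covariance matrix (features as columns).
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))

# eigh returns eigenvalues in ascending order; reorder to descending.
desc = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[desc], eigvecs[:, desc]

# Eigenvalues should match explained_variance_, which is sorted in descending order.
print(np.allclose(eigvals, pca.explained_variance_))
# Eigenvectors should match the rows of components_ up to sign.
print(np.allclose(np.abs(eigvecs.T), np.abs(pca.components_)))
```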
  
</DIV>