Feature engineering for time series datasets in Python

<div class="post-text" itemprop="text">
  
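    You could try
    <a href="https://www.featuretools.com" rel="nofollow noreferrer">
      Featuretools
    </a>
    . It is an open-source automated feature engineering library that explicitly handles time, so that you do not introduce label leakage.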
  
  
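    For your data, you could create two entities:
     <code>
 "observations"
 </code>
     and
     <code>
 "timesteps"
 </code>
    , and then apply
     <code>
 featuretools.dfs
 </code>
     (
    <a href="https://www.featurelabs.com/blog/deep-feature-synthesis/" rel="nofollow noreferrer">
      Deep Feature Synthesis
    </a>
    ) to generate features for each timestep. You can think of an entity as the equivalent of a table in a relational database.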
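
To make the table analogy concrete, here is a minimal pandas-only sketch (it does not use Featuretools). The data is made up, and the column name `first_timestep_time` is just an illustrative assumption: the child table has one row per timestep, the parent table has one row per `obs_id`, and `obs_id` plays the role of a foreign key.

```python
import pandas as pd

# Child table: one row per timestep (values are made up for illustration).
timesteps = pd.DataFrame({
    "ts_id": range(6),
    "timestamp": pd.date_range("2018-01-01", periods=6, freq="D"),
    "attr1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "obs_id": [1, 2, 3] * 2,
})

# Parent table: one row per observation, holding the time it was first seen.
observations = (
    timesteps.groupby("obs_id", as_index=False)["timestamp"]
    .min()
    .rename(columns={"timestamp": "first_timestep_time"})
)

# obs_id links each timestep (child row) to its observation (parent row).
joined = timesteps.merge(observations, on="obs_id", how="left")

print(observations)
print(joined)
```

This mirrors the idea of splitting one flat table into two related ones; Featuretools keeps track of that relationship for you.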
  
  
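    Particularly useful for your problem are the cumulative primitives in Featuretools. These are operations that use many instances, ordered by time, to compute a single value. In your case, if an observation has multiple timesteps and each timestep has some value, you can compute the mean over the previous timesteps with the
    <a href="https://docs.featuretools.com/generated/featuretools.primitives.CumMean.html?highlight=cummean" rel="nofollow noreferrer">
      CumMean primitive
    </a>
    .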
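
As a rough illustration of what a cumulative mean grouped by observation computes, here is a plain pandas sketch. It is not how Featuretools implements CumMean; the helper name `cumulative_mean_by_group` and the sample values are assumptions made for this example.

```python
import pandas as pd

def cumulative_mean_by_group(df, group_col, value_col, time_col):
    """Running mean of value_col within each group, in time order."""
    ordered = df.sort_values(time_col)
    grouped = ordered.groupby(group_col)[value_col]
    # Running sum divided by the running count (1, 2, 3, ...) in each group.
    cum_mean = grouped.cumsum() / (grouped.cumcount() + 1)
    # Return the values in the original row order.
    return cum_mean.reindex(df.index)

# Made-up data: two observations with three timesteps each.
example = pd.DataFrame({
    "timestamp": pd.date_range("2018-01-01", periods=6, freq="D"),
    "obs_id": [1, 2, 1, 2, 1, 2],
    "attr1": [1.0, 10.0, 3.0, 20.0, 5.0, 30.0],
})

result = cumulative_mean_by_group(example, "obs_id", "attr1", "timestamp")
print(result.tolist())
```

Each row only sees values from its own observation at the same or an earlier timestamp, which is why this kind of feature does not leak future information.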
  
  
    Here is an example:
  
   <pre>
 <code>
from featuretools.primitives import Day, Weekend, Percentile, CumMean, CumSum
import featuretools as ft
import pandas as pd
import numpy as np

# Toy data: 12 daily timesteps spread over 3 observations
timesteps = pd.DataFrame({'ts_id': range(12),
                          'timestamp': pd.date_range(start='1/1/2018', freq='1d', periods=12),
                          'attr1': np.random.random(12),
                          'obs_id': [1, 2, 3] * 4})
print(timesteps)

       attr1  obs_id  timestamp  ts_id
0   0.663216       1 2018-01-01      0
1   0.455353       2 2018-01-02      1
2   0.800848       3 2018-01-03      2
3   0.938645       1 2018-01-04      3
4   0.442037       2 2018-01-05      4
5   0.724044       3 2018-01-06      5
6   0.304241       1 2018-01-07      6
7   0.134359       2 2018-01-08      7
8   0.275078       3 2018-01-09      8
9   0.499343       1 2018-01-10      9
10  0.608565       2 2018-01-11     10
11  0.340991       3 2018-01-12     11

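# Note: this uses the entity-based Featuretools API (entity_from_dataframe,
# normalize_entity, target_entity); later releases renamed these calls.
entityset = ft.EntitySet("timeseries")

# Register the timesteps table, telling Featuretools which column holds the time
entityset.entity_from_dataframe("timesteps",
                                timesteps,
                                index='ts_id',
                                time_index='timestamp')

# Derive a parent "observations" entity with one row per obs_id
entityset.normalize_entity(base_entity_id='timesteps',
                           new_entity_id='observations',
                           index='obs_id',
                           make_time_index=True)

# per timestep: each ts_id gets its own timestamp as the cutoff time
cutoffs = timesteps[['ts_id', 'timestamp']]
feature_matrix, feature_list = ft.dfs(entityset=entityset,
                                      target_entity='timesteps',
                                      cutoff_time=cutoffs,
                                      trans_primitives=[Day, Weekend, Percentile, CumMean, CumSum],
                                      agg_primitives=[])
print(feature_matrix.iloc[:, -6:])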

       CUMMEAN(attr1 by obs_id)  CUMSUM(attr1 by obs_id)  CUMMEAN(PERCENTILE(attr1) by obs_id)  CUMSUM(CUMMEAN(attr1 by obs_id) by obs_id)  CUMSUM(PERCENTILE(attr1) by obs_id)  observations.DAY(first_timesteps_time)
ts_id
0                      0.100711                 0.100711                              1.000000                                    0.100711                             1.000000                                       1
1                      0.811898                 0.811898                              1.000000                                    0.811898                             1.000000                                       2
2                      0.989166                 0.989166                              1.000000                                    0.989166                             1.000000                                       3
3                      0.442035                 0.442035                              0.500000                                    0.442035                             0.500000                                       1
4                      0.910106                 0.910106                              0.800000                                    0.910106                             0.800000                                       2
5                      0.427610                 0.427610                              0.333333                                    0.427610                             0.333333                                       3
6                      0.832516                 0.832516                              0.714286                                    0.832516                             0.714286                                       1
7                      0.035121                 0.035121                              0.125000                                    0.035121                             0.125000                                       2
8                      0.178202                 0.178202                              0.333333                                    0.178202                             0.333333                                       3
9                      0.085608                 0.085608                              0.200000                                    0.085608                             0.200000                                       1
10                     0.891033                 0.891033                              0.818182                                    0.891033                             0.818182                                       2
11                     0.044010                 0.044010                              0.166667                                    0.044010                             0.166667                                       3

</code>
 </pre>
  
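    This example also uses "cutoff times" to tell the feature computation engine to use only the data available before the specified time for each "ts_id" or "obs_id". You can read more about cutoff times on
    <a href="https://docs.featuretools.com/automated_feature_engineering/handling_time.html" rel="nofollow noreferrer">
      this page
    </a>
     in the documentation.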
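
To show what a cutoff time buys you, here is a small pandas-only sketch that compares a feature computed with and without one. The helper `mean_up_to_cutoff` is hypothetical, the data is made up, and treating the cutoff as inclusive (`<=`) is an assumption of this sketch rather than a statement about Featuretools' exact behaviour.

```python
import pandas as pd

# Made-up data: 3 observations with 2 timesteps each.
timesteps = pd.DataFrame({
    "ts_id": range(6),
    "timestamp": pd.date_range("2018-01-01", periods=6, freq="D"),
    "attr1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "obs_id": [1, 2, 3] * 2,
})

def mean_up_to_cutoff(df, obs_id, cutoff):
    """Mean of attr1 for one observation, using only rows at or before cutoff."""
    visible = df[(df["obs_id"] == obs_id) & (df["timestamp"] <= cutoff)]
    return visible["attr1"].mean()

# One cutoff per timestep: each row may only use data up to its own timestamp.
with_cutoff = [
    mean_up_to_cutoff(timesteps, row.obs_id, row.timestamp)
    for row in timesteps.itertuples()
]

# Without a cutoff, every row sees the whole series, including future values.
without_cutoff = timesteps.groupby("obs_id")["attr1"].transform("mean").tolist()

print(with_cutoff)
print(without_cutoff)
```

The first timestep of each observation gets a different value in the two lists: without a cutoff it already "knows" about the later measurement, which is exactly the kind of leakage cutoff times are meant to prevent.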
  
  
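    Another cool thing Featuretools lets you do is build features for each observation in the "observations" table, rather than for each timestep. To do this, change the "target_entity" parameter. In the example below, we use the last timestamp of each observation as its cutoff time, which ensures that no data from after that time is used (e.g., the data from 2018-01-11 for obs_id = 2 will not be included in the Percentile() calculation for obs_id = 1, whose cutoff time is 2018-01-10).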
  
   <pre>
 <code>
# per observation: use each observation's last timestamp as its cutoff time
ocutoffs = timesteps[['obs_id', 'timestamp']].drop_duplicates(['obs_id'], keep='last')

# agg_primitives is not passed here, so the default aggregations are applied
ofeature_matrix, ofeature_list = ft.dfs(entityset=entityset,
                                        target_entity='observations',
                                        cutoff_time=ocutoffs,
                                        trans_primitives=[Day, Weekend, Percentile, CumMean, CumSum])
print(ofeature_matrix.iloc[:, -6:])

        PERCENTILE(STD(timesteps.attr1))  PERCENTILE(MAX(timesteps.attr1))  PERCENTILE(SKEW(timesteps.attr1))  PERCENTILE(MIN(timesteps.attr1))  PERCENTILE(MEAN(timesteps.attr1))  PERCENTILE(COUNT(timesteps))
obs_id
1                               0.666667                          1.000000                           0.666667                          0.666667                           0.666667                      1.000000
2                               0.333333                          0.666667                           0.666667                          0.666667                           0.333333                      0.833333
3                               1.000000                          1.000000                           0.333333                          0.333333                           1.000000                      0.666667

</code>
 </pre>
  
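    Finally, it is actually possible to use tsfresh together with Featuretools as a "custom primitive". This is an advanced feature, but I can explain it if you are interested.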
  
</div>