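This data is a great fit for Featuretools. It is an open-source automated feature engineering library that handles time explicitly, so that you don't introduce label leakage. For your music data, you could create two entities, "users" and "artist_plays", and then apply featuretools.dfs (Deep Feature Synthesis) to generate features. Think of an entity as the equivalent of a table in a relational database. Deep Feature Synthesis builds a single-table feature matrix, ready for modeling, out of many different tables by applying high-level statistical functions across them. Here is a short write-up explaining how it works.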
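To make the "handles time explicitly" point concrete, here is a library-free sketch of the idea behind a cutoff time: per-user aggregates are computed only from plays recorded before the moment a prediction is made, so later events cannot leak into the features. This is not Featuretools code. The column names (`user_id`, `artist_id`, `time`), the helper `user_features_as_of`, and the choice to exclude events exactly at the cutoff are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical play log: one row per play event (a child table of "users").
plays = [
    {"user_id": 1, "artist_id": "a", "time": datetime(2018, 1, 1)},
    {"user_id": 1, "artist_id": "b", "time": datetime(2018, 1, 5)},
    {"user_id": 1, "artist_id": "a", "time": datetime(2018, 2, 1)},
    {"user_id": 2, "artist_id": "a", "time": datetime(2018, 1, 3)},
]


def user_features_as_of(plays, cutoff):
    """Aggregate plays per user, using only events strictly before `cutoff`."""
    counts = defaultdict(int)
    artists = defaultdict(set)
    for row in plays:
        if row["time"] < cutoff:  # anything at or after the cutoff is ignored
            counts[row["user_id"]] += 1
            artists[row["user_id"]].add(row["artist_id"])
    return {
        uid: {
            "COUNT(plays)": counts[uid],
            "NUM_UNIQUE(plays.artist_id)": len(artists[uid]),
        }
        for uid in counts
    }


# The play on 2018-02-01 happens after the cutoff, so it is not counted.
features = user_features_as_of(plays, cutoff=datetime(2018, 1, 10))
```

Deep Feature Synthesis automates this kind of aggregation across every relationship in the entity set, rather than requiring each feature to be written by hand.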
This example uses plain Python, but it could be adapted to Spark or Dask.
    # Create the entityset.
    # Note: written against the older (pre-1.0) Featuretools API
    # (entity_from_dataframe, target_entity, ignore_variables).
    import pickle

    import featuretools as ft
    import pandas as pd
    from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer
    from sklearn.preprocessing import StandardScaler


    def load_entityset(user_df, artist_plays_df):
        es = ft.EntitySet("artist plays")
        es.entity_from_dataframe("users", user_df, index="user_id")
        es.entity_from_dataframe("artist_plays", artist_plays_df, index="artist_id")
        # One user has many artist plays.
        es.add_relationship(ft.Relationship(es['users']['user_id'],
                                            es['artist_plays']['user_id']))
        return es


    user_df = pd.read_csv("training_user.csv")
    artist_plays_df = pd.read_csv("training_artist_plays.csv")
    es = load_entityset(user_df, artist_plays_df)

    # Ignore the label column ('play') so it is not used to build features.
    feature_matrix, feature_defs = ft.dfs(entityset=es,
                                          target_entity='artist_plays',
                                          ignore_variables={'artist_plays': ['play']})

    # Encode categoricals
    encoded_fm, encoded_fl = ft.encode_features(feature_matrix, feature_defs)

    # Impute/scale using scikit-learn
    imputer = SimpleImputer()
    scaler = StandardScaler()
    imputed_fm = imputer.fit_transform(encoded_fm)
    scaled_fm = scaler.fit_transform(imputed_fm)

    # Save the encoded feature list and the fitted imputer/scaler
    # to files so they can be reused in production
    ft.save_features(encoded_fl, 'fl.p')
    with open('imputer.p', 'wb') as f:
        pickle.dump(imputer, f)
    with open('scaler.p', 'wb') as f:
        pickle.dump(scaler, f)
Then, in production:
    import pickle

    import featuretools as ft
    import pandas as pd

    # load_entityset is the helper defined in the training script above.

    # Load previous data
    old_user_df = pd.read_csv("training_user.csv")
    old_artist_plays_df = pd.read_csv("training_artist_plays.csv")
    es_old = load_entityset(old_user_df, old_artist_plays_df)

    # Load new data
    user_df = pd.read_csv("new_user.csv")
    artist_plays_df = pd.read_csv("new_artist_plays.csv")
    es_updated = load_entityset(user_df, artist_plays_df)

    # Merge both data sources
    es = es_old.concat(es_updated)

    # Load the saved encoded features and compute them for the new rows only
    features = ft.load_features('fl.p', es)
    fm = ft.calculate_feature_matrix(features,
                                     entityset=es,
                                     instance_ids=es_updated['artist_plays'].get_all_instances())

    # Impute and scale with the objects fitted on the training data.
    # Pickle files must be opened in binary mode ('rb').
    with open('imputer.p', 'rb') as f:
        imputer = pickle.load(f)
    imputed_fm = imputer.transform(fm)
    with open('scaler.p', 'rb') as f:
        scaler = pickle.load(f)
    scaled_fm = scaler.transform(imputed_fm)
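The important detail in the production step is that the imputer and scaler are fitted once on the training data and only applied to new data; refitting them on production data would shift the feature scale the model was trained on. The following library-free sketch shows that pattern on a single column. The helpers `fit_params` and `apply_params` are hypothetical, mean imputation and population standard deviation are assumed as the transforms, and an in-memory pickle stands in for the files on disk.

```python
import math
import pickle


def fit_params(column):
    """Learn the mean and standard deviation from training values (None = missing)."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    # Missing values are imputed with the mean before computing the spread.
    filled = [mean if v is None else v for v in column]
    variance = sum((v - mean) ** 2 for v in filled) / len(filled)
    # Fall back to 1.0 so a constant column does not cause division by zero.
    return {"mean": mean, "std": math.sqrt(variance) or 1.0}


def apply_params(column, params):
    """Impute and scale a column using previously fitted parameters only."""
    mean, std = params["mean"], params["std"]
    return [((mean if v is None else v) - mean) / std for v in column]


# Training time: fit on training data and serialize the fitted parameters.
train_column = [1.0, None, 3.0]
params = fit_params(train_column)
blob = pickle.dumps(params)

# Production time: restore the parameters and transform new data without refitting.
restored = pickle.loads(blob)
new_column = [2.0, None, 5.0]
scaled_new = apply_params(new_column, restored)
```

Here the new value 5.0 is scaled relative to the training mean of 2.0, not the mean of the new batch, which is the behaviour the saved `imputer.p` and `scaler.p` files are meant to preserve.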
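We have built a few demos with this workflow; check out this example, which predicts what a grocery shopper will buy in the future.
I have also used this workflow in a live production environment to predict delivery metrics for large software projects. See the white paper we published for details on the live deployment using this approach and on the results.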