A submission for HUAWEI - 2020 DIGIX GLOBAL AI CHALLENGE
team: Melbourne dağları
members: @mustafahakkoz, @Aysenuryilmazz
rank: 94/343
score (AUC): 0.679876
dataset: a heavily imbalanced, very large (out-of-core) dataset containing advertising behavior data collected over seven consecutive days.
training dataset (6.09 GB, 43M rows, 36 cols)
2 testing datasets (153 MB, 1M rows, 36 cols)
The main ideas of the project are:
Reading the dataset in chunks and downcasting to fit into memory (see the sketch below).
Target encoding with smoothing.
SGD model with mini-batches.
class_weights to balance classes.
Implementation details can be found in the document DIGIX Implementation Instruction.docx.
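A minimal sketch of the chunked reading and downcasting step from the first item above; the file name, separator, and chunk size are placeholders, not the exact values used in the notebooks:

```python
import pandas as pd

CHUNK_SIZE = 5_000_000  # rows per chunk (placeholder value)

def downcast(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns to the smallest dtype that still holds their values."""
    for col in df.select_dtypes(include=["int64"]).columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")
    for col in df.select_dtypes(include=["float64"]).columns:
        df[col] = pd.to_numeric(df[col], downcast="float")
    return df

# Read the raw file piece by piece so it never has to sit in memory undowncasted.
# "train_data.csv" and sep="|" are assumptions about the raw file layout.
chunks = [downcast(chunk)
          for chunk in pd.read_csv("train_data.csv", sep="|", chunksize=CHUNK_SIZE)]
train = pd.concat(chunks, ignore_index=True)
```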
2. Target encoding with smoothing
We implemented target encoding on the columns using a custom function that smooths the standard target encoding with the global mean of each column.
We dropped the uid and pt_d columns from the train dataset.
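A minimal sketch of the smoothing idea; the label column name and the smoothing weight m are illustrative assumptions, not the values used in the notebooks:

```python
import pandas as pd

def smoothed_target_encoding(df: pd.DataFrame, col: str,
                             target: str = "label", m: float = 100.0) -> pd.Series:
    """Replace a categorical column with its smoothed mean target rate."""
    global_mean = df[target].mean()
    agg = df.groupby(col)[target].agg(["mean", "count"])
    # Blend each category's mean with the global mean: categories with few rows
    # are pulled toward the global mean, frequent categories keep their own rate.
    smooth = (agg["count"] * agg["mean"] + m * global_mean) / (agg["count"] + m)
    return df[col].map(smooth)
```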
We shuffled the dataset and split it into 40M rows for training and the rest (~2M rows) for testing.
We produced the train dataset across several notebooks due to the hard disk limitation of the Kaggle platform (only 5 GB).
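A rough sketch of the shuffle / split / chunked-write step, assuming the encoded data is already in memory as a DataFrame; the sizes and output paths are placeholders:

```python
import pandas as pd

def shuffle_split_and_dump(df: pd.DataFrame,
                           n_train: int = 40_000_000,
                           chunk: int = 5_000_000) -> None:
    """Shuffle, keep n_train rows for training and the rest as a local test
    split, then write the training part in pieces that fit the disk quota."""
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)
    df.iloc[n_train:].to_csv("test_split.csv", index=False)
    train = df.iloc[:n_train]
    for i in range(0, len(train), chunk):
        train.iloc[i:i + chunk].to_csv(f"train_part_{i // chunk}.csv", index=False)
```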
We chose scikit-learn's SGDClassifier with default parameters and fed it batches of 10K rows, since it supports out-of-core learning via partial_fit and warm_start.
For every batch, we used class weights to balance the classes.
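A minimal sketch of the mini-batch training loop; here the per-batch class balancing is expressed through sample_weight (computed with compute_sample_weight), which may differ from the exact mechanism used in the notebooks, and the file and column names are placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.utils.class_weight import compute_sample_weight

BATCH = 10_000
classes = np.array([0, 1])            # binary click / no-click target
clf = SGDClassifier(random_state=42)  # default (hinge) loss

for chunk in pd.read_csv("train_encoded.csv", chunksize=BATCH):
    y = chunk["label"].to_numpy()
    X = chunk.drop(columns=["label"]).to_numpy()
    # "Balanced" weights computed per batch, so the minority class
    # contributes as much to each update as the majority class.
    weights = compute_sample_weight("balanced", y)
    clf.partial_fit(X, y, classes=classes, sample_weight=weights)
```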
After evaluating the model on our test split (AUC of about 70%), we refit it on the whole training set (~42M rows) and exported the model.
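Continuing the sketch above, the evaluation and export step could look like this; the paths are placeholders, and decision_function is used because the default hinge loss provides no predicted probabilities:

```python
import joblib
import pandas as pd
from sklearn.metrics import roc_auc_score

test = pd.read_csv("test_split.csv")
y_true = test["label"].to_numpy()
X_test = test.drop(columns=["label"]).to_numpy()

# AUC on the held-out split, using the classifier's raw decision scores.
print("AUC:", roc_auc_score(y_true, clf.decision_function(X_test)))

joblib.dump(clf, "sgd_model.joblib")  # export the fitted model
```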
We did not use any cross-validation or hyperparameter tuning for this contest due to the computational constraints of the online platforms.
We also did not perform any feature engineering.
We also tried Decision Tree, XGBoost, CatBoost, and LightGBM with several parameter settings, but they did not work out due to memory errors.
This repo contains only the final versions. The experiments were run on the Kaggle platform. All of the notebooks, including scratch work, are listed below.
1. Reading whole dataset by chunks, downcasting and shuffling
2.
   a. Trying out Target encoding and smoothing
   b. Splitting Dataset into 4 parts-1 [deleted]
   c. Splitting Dataset into 4 parts-2 [deleted]
   d. Splitting Dataset into 4 parts-3
   e. Splitting Dataset into 4 parts-4
   f. Trying out XGBoost with batches (failed, since boosting cannot continue training across different datasets)
3.
   a. Creating test set, encoding map dictionary, and datatype dictionary for reading data by chunks
   b. Splitting train set by chunk size of 5M-1
   c. Splitting train set by chunk size of 5M-2
   d. Splitting train set by chunk size of 5M-3
   e. Splitting train set by chunk size of 5M-4
   f. Splitting train set by chunk size of 5M-5
   g. Splitting train set by chunk size of 5M-6
   h. Splitting train set by chunk size of 5M-7
   i. Splitting train set by chunk size of 5M-8
   j. Training SGD model by chunks with class_weights, testing the model, then refitting on the whole data