GS-Quantify '17, Goldman Sachs Data Science Competition
Here gc stands for Garbage Collector.
initial-Used-Memory (y-axis) vs gc-Initial-Memory (x-axis)
The plot shows a linear relationship between the two variables.
Final-Used-Memory vs gc-Final-Memory
The plot shows a linear relationship between the two variables.
initial-Used-Memory + initial-Free-Memory vs gc-Total-Memory
The plot shows a linear relationship between the two variables.
initial-Used-Memory + initial-Free-Memory vs final-Used-Memory + final-Free-Memory
The plot shows a linear relationship between the two variables. We observe three outliers in this plot, which we remove before proceeding.
initial-Used-Memory + initial-Free-Memory vs final-Used-Memory + final-Free-Memory (after removing the outliers)
The plot again shows a linear relationship between the two variables.
The following approximations were made:
We were required to print the free memory after every query is served, but the heading of that column was given as initialFreeMemory, so we treat it as finalFreeMemory.
Linear Regression
Following the plots and approximations above, we predicted (see the sketch after this list):
- gcInitialMemory using linear regression on initialUsedMemory
- finalUsedMemory using linear regression on resources + initialUsedMemory
- gcTotalMemory using linear regression on initialUsedMemory + initialFreeMemory
- finalFreeMemory using linear regression on initialFreeMemory + initialUsedMemory - finalUsedMemory
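A minimal sketch of these fits with scikit-learn (the file name train.csv and the exact column names are our assumptions, inferred from the feature descriptions above):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical file and column names, inferred from the descriptions above.
df = pd.read_csv("train.csv")

# gcInitialMemory ~ initialUsedMemory
gc_init_model = LinearRegression().fit(df[["initialUsedMemory"]], df["gcInitialMemory"])

# finalUsedMemory ~ resources + initialUsedMemory
final_used_model = LinearRegression().fit(
    df[["resources", "initialUsedMemory"]], df["finalUsedMemory"]
)

# gcTotalMemory ~ initialUsedMemory + initialFreeMemory
gc_total_model = LinearRegression().fit(
    df[["initialUsedMemory", "initialFreeMemory"]], df["gcTotalMemory"]
)

# finalFreeMemory ~ initialFreeMemory + initialUsedMemory - finalUsedMemory,
# with the combined memory balance treated as a single regressor.
balance = df["initialFreeMemory"] + df["initialUsedMemory"] - df["finalUsedMemory"]
final_free_model = LinearRegression().fit(balance.to_frame("balance"), df["finalFreeMemory"])
```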
XGBoost
XGBoost was used to predict gcRun. We supplied the following features: resources, initialUsedMemory, initialFreeMemory, and cpuTimeTaken.
We chose this model because the target was not linearly related to the features. We confirmed this by creating a cross-validation set and checking the accuracy of linear models such as logistic regression and linear SVMs (both hard-margin and soft-margin); the results were very poor. We also tried an SVM with an RBF kernel, which was not much of an improvement over the linear models.
XGBoost performed best among these models because of the nonlinear relationship between the target and the features. Being an ensemble method, it also has the added advantage of not overfitting easily while preserving accuracy.
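A minimal sketch of this step, assuming the same hypothetical train.csv and column names as above; the hyperparameters shown are illustrative, not the ones tuned for the competition:

```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("train.csv")  # hypothetical file name, as in the sketch above

features = ["resources", "initialUsedMemory", "initialFreeMemory", "cpuTimeTaken"]
X_train, X_val, y_train, y_val = train_test_split(
    df[features], df["gcRun"], test_size=0.2, random_state=0
)

# Gradient-boosted trees capture the nonlinear feature-target relationship
# that the linear baselines (logistic regression, linear SVM) missed.
clf = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
```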
To predict gcRun:
We used XGBoost to predict gcRun. For test queries, we supplied the resources feature from the value saved for that token in the training set. E.g., token_53 had resources = 0.047545312750000325, obtained from the training set.
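For instance, a token-to-resources lookup can be built once from the training data and reused at prediction time. This is a sketch: the token column name is our assumption, and aggregating by mean is our guess at how a single value per token was obtained.

```python
import pandas as pd

train_df = pd.read_csv("train.csv")  # hypothetical file names
test_df = pd.read_csv("test.csv")

# Build a token -> resources lookup from the training data.
# Mean aggregation is an assumption; the write-up only says the value
# was "obtained from the training set".
resources_by_token = train_df.groupby("token")["resources"].mean().to_dict()

# e.g. for token_53 this lookup should yield 0.047545312750000325,
# the value quoted above.
test_df["resources"] = test_df["token"].map(resources_by_token)
```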
To predict initialFreeMemory:
The predicted finalFreeMemory then becomes the output for that query under the initialFreeMemory heading, per the approximation above.
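Continuing the linear-regression sketch above (final_used_model and final_free_model are the models fit there; the memory values below are illustrative), the per-query chain looks roughly like this:

```python
import pandas as pd

# One incoming query with its known initial state (illustrative values;
# the resources value is the token_53 figure quoted above).
query = pd.DataFrame([{
    "resources": 0.047545312750000325,
    "initialUsedMemory": 1.2e9,
    "initialFreeMemory": 8.0e8,
}])

# 1. Predict finalUsedMemory from resources + initialUsedMemory.
final_used = final_used_model.predict(query[["resources", "initialUsedMemory"]])[0]

# 2. Derive the memory balance and predict finalFreeMemory from it.
balance = (query["initialFreeMemory"].iloc[0]
           + query["initialUsedMemory"].iloc[0] - final_used)
final_free = final_free_model.predict(pd.DataFrame({"balance": [balance]}))[0]

# 3. Report the predicted finalFreeMemory under the initialFreeMemory
#    heading, per the column-name approximation above.
print("initialFreeMemory (output column):", final_free)
```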
Please star the repo if you found these materials useful :)