GS-Quantify '17, Goldman Sachs Data Science Competition
Here gc stands for Garbage Collector.
initial-Used-Memory (y-axis) vs gc-Initial-Memory (x-axis)
The plot shows a linear relationship between the two variables.
Final-Used-Memory vs gc-Final-Memory
The plot shows a linear relationship between the two variables.
initial-Used-Memory + initial-Free-Memory vs gc-Total-Memory
The plot shows a linear relationship between the two variables.
initial-Used-Memory + initial-Free-Memory vs final-Used-Memory + final-Free-Memory
The plot shows a linear relationship between the two variables. We observe three outliers in this plot, which we remove before proceeding.
initial-Used-Memory + initial-Free-Memory vs final-Used-Memory + final-Free-Memory (after removing the outliers)
The plot again shows a linear relationship between the two variables.
The following approximations were made:
We were required to print the free memory after every query is served, but the heading of that column was given as initialFreeMemory, so we treat it as finalFreeMemory.
Linear Regression
Following the plots and approximations above, we predicted (see the sketch after this list):
- gcInitialMemory using linear regression on initialUsedMemory
- finalUsedMemory using linear regression on resources + initialUsedMemory
- gcTotalMemory using linear regression on initialUsedMemory + initialFreeMemory
- finalFreeMemory using linear regression on initialFreeMemory + initialUsedMemory - finalUsedMemory
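A minimal sketch of these fits with scikit-learn (the file name train.csv and the exact column names are our assumptions, inferred from the feature descriptions above):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical file and column names, inferred from the descriptions above.
df = pd.read_csv("train.csv")

# gcInitialMemory ~ initialUsedMemory
gc_init_model = LinearRegression().fit(df[["initialUsedMemory"]], df["gcInitialMemory"])

# finalUsedMemory ~ resources + initialUsedMemory
final_used_model = LinearRegression().fit(
    df[["resources", "initialUsedMemory"]], df["finalUsedMemory"]
)

# gcTotalMemory ~ initialUsedMemory + initialFreeMemory
gc_total_model = LinearRegression().fit(
    df[["initialUsedMemory", "initialFreeMemory"]], df["gcTotalMemory"]
)

# finalFreeMemory ~ initialFreeMemory + initialUsedMemory - finalUsedMemory,
# with the combined memory balance treated as a single regressor.
balance = df["initialFreeMemory"] + df["initialUsedMemory"] - df["finalUsedMemory"]
final_free_model = LinearRegression().fit(balance.to_frame("balance"), df["finalFreeMemory"])
```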
XGBoost
XGBoost was used to predict gcRun. We supplied the following features: resources, initialUsedMemory, initialFreeMemory, and cpuTimeTaken.
We chose this model because the target was not linearly related to the features. We confirmed this by creating a cross-validation set and checking the accuracy of linear models such as logistic regression and linear SVMs (both hard-margin and soft-margin); the results were very poor. We also tried an SVM with an RBF kernel, which was not much of an improvement over the linear models.
XGBoost performed best among these models because of the nonlinear relationship between the target and the features. Being an ensemble method, it also has the added advantage of not overfitting easily while preserving accuracy.
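A minimal sketch of this step, assuming the same hypothetical train.csv and column names as above; the hyperparameters shown are illustrative, not the ones tuned for the competition:

```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("train.csv")  # hypothetical file name, as in the sketch above

features = ["resources", "initialUsedMemory", "initialFreeMemory", "cpuTimeTaken"]
X_train, X_val, y_train, y_val = train_test_split(
    df[features], df["gcRun"], test_size=0.2, random_state=0
)

# Gradient-boosted trees capture the nonlinear feature-target relationship
# that the linear baselines (logistic regression, linear SVM) missed.
clf = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
```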
To predict gcRun:
We used XGBoost to predict gcRun. For test queries, we supplied the resources feature from the value saved for that token in the training set. E.g., token_53 had resources = 0.047545312750000325, obtained from the training set.
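For instance, a token-to-resources lookup can be built once from the training data and reused at prediction time. This is a sketch: the token column name is our assumption, and aggregating by mean is our guess at how a single value per token was obtained.

```python
import pandas as pd

train_df = pd.read_csv("train.csv")  # hypothetical file names
test_df = pd.read_csv("test.csv")

# Build a token -> resources lookup from the training data.
# Mean aggregation is an assumption; the write-up only says the value
# was "obtained from the training set".
resources_by_token = train_df.groupby("token")["resources"].mean().to_dict()

# e.g. for token_53 this lookup should yield 0.047545312750000325,
# the value quoted above.
test_df["resources"] = test_df["token"].map(resources_by_token)
```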
To predict initialFreeMemory:
The predicted finalFreeMemory then becomes the output for that query under the initialFreeMemory heading, per the approximation above.
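Continuing the linear-regression sketch above (final_used_model and final_free_model are the models fit there; the memory values below are illustrative), the per-query chain looks roughly like this:

```python
import pandas as pd

# One incoming query with its known initial state (illustrative values;
# the resources value is the token_53 figure quoted above).
query = pd.DataFrame([{
    "resources": 0.047545312750000325,
    "initialUsedMemory": 1.2e9,
    "initialFreeMemory": 8.0e8,
}])

# 1. Predict finalUsedMemory from resources + initialUsedMemory.
final_used = final_used_model.predict(query[["resources", "initialUsedMemory"]])[0]

# 2. Derive the memory balance and predict finalFreeMemory from it.
balance = (query["initialFreeMemory"].iloc[0]
           + query["initialUsedMemory"].iloc[0] - final_used)
final_free = final_free_model.predict(pd.DataFrame({"balance": [balance]}))[0]

# 3. Report the predicted finalFreeMemory under the initialFreeMemory
#    heading, per the column-name approximation above.
print("initialFreeMemory (output column):", final_free)
```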
Please star the repo if you found these materials useful :)