This is the first project in Udacity's Machine Learning Engineer with Microsoft Azure Nanodegree program.
In this project, I had the opportunity to build and train an Azure ML pipeline using the Azure Python SDK and a provided Scikit-learn Logistic Regression model, whose hyperparameters were optimized using Azure HyperDrive.
To compare the results with another method, an Azure AutoML run was built and optimized on the same dataset.
The following figure displays the main steps taken:
The dataset contains information about potential bank customers collected through direct marketing campaigns (phone calls). The goal of this classification task is to predict whether a customer will subscribe to a term deposit or not. We explore and compare two different approaches: a hyperparameter-optimized logistic regression model and a model built using AutoML.
The best performing model was found to be the HyperDrive-optimized logistic regression model with 95.0% accuracy (run id HD_666d2c70-9aa9-4461-b7cb-d6b43adb83ee_3). The best result using AutoML was a Voting Ensemble with an accuracy of 91.6% (run id AutoML_62082872-6598-496f-936a-4e6dcb1ea86a_24).
Explain the pipeline architecture, including data, hyperparameter tuning, and classification algorithm.
The following part is a brief description of the hyperparameter tuning process for the custom-coded model. The necessary steps are:
Parameter sampler
Tune hyperparameters by exploring the range of values defined for each hyperparameter.
I specified the parameter sampler as follows:

from azureml.train.hyperdrive import RandomParameterSampling, choice

ps = RandomParameterSampling(
    {
        '--C': choice(0.001, 0.01, 0.1, 1, 10, 20, 50, 100, 200, 500, 1000),
        '--max_iter': choice(100, 200, 300)
    }
)
I chose discrete values with choice for both parameters, C and max_iter.
C is the inverse of the regularization strength (smaller values mean stronger regularization in Scikit-learn's LogisticRegression), while max_iter is the maximum number of iterations allowed for the solver to converge.
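For context, the following is a minimal sketch of how a training script such as the provided train.py might consume these command-line arguments; the exact contents of the script are an assumption, but the argument names match the sampler above.

import argparse
from sklearn.linear_model import LogisticRegression

parser = argparse.ArgumentParser()
parser.add_argument('--C', type=float, default=1.0, help='Inverse of regularization strength')
parser.add_argument('--max_iter', type=int, default=100, help='Maximum number of solver iterations')
args = parser.parse_args()

# The sampled values arrive here and parameterize the Scikit-learn model
model = LogisticRegression(C=args.C, max_iter=args.max_iter)
# model.fit(x_train, y_train) and logging of the accuracy would follow in the real script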
Parameter Sampling
Azure Machine Learning supports the following sampling methods: random sampling, grid sampling, and Bayesian sampling.
I chose RandomParameterSampling because it is the fastest option and supports early termination of low-performance runs. In random sampling, hyperparameter values are randomly selected from the defined search space.
Further options are grid sampling (a simple grid search over all possible values) and Bayesian sampling (based on Bayesian optimization, which picks new samples based on how previous samples performed, so that new samples improve the primary metric). Both require a large enough budget to explore the hyperparameter space.
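For comparison, the two alternative samplers could be configured roughly as follows; this is a sketch only, and the value ranges shown are illustrative rather than part of this project.

from azureml.train.hyperdrive import GridParameterSampling, BayesianParameterSampling, choice, uniform

# Grid sampling: exhaustively evaluates every combination of discrete values
grid_ps = GridParameterSampling(
    {
        '--C': choice(0.01, 0.1, 1, 10),
        '--max_iter': choice(100, 200, 300)
    }
)

# Bayesian sampling: chooses new samples based on previous results;
# note that it does not support early termination policies
bayes_ps = BayesianParameterSampling(
    {
        '--C': uniform(0.01, 100),
        '--max_iter': choice(100, 200, 300)
    }
)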
Each training run is evaluated for the primary metric. The early termination policy uses the primary metric to identify low-performance runs.
The following attributes are needed for a primary metric: primary_metric_name and primary_metric_goal. In this case, I used the combination primary_metric_name='accuracy' and primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, i.e. the runs are optimized to maximize accuracy.
Automatically terminate poorly performing runs with an early termination policy; early termination improves computational efficiency. Azure Machine Learning supports the following early termination policies: Bandit policy, Median stopping policy, Truncation selection policy, and No termination policy.
I opted for the fastest option, the Bandit policy. The Bandit policy is based on a slack factor/slack amount and an evaluation interval: it terminates runs whose primary metric is not within the specified slack factor/slack amount compared to the best performing run. With the settings below, Azure ML checks the job every 5 iterations and cancels any run whose primary metric falls outside the 10% slack allowed with respect to the best performing run so far (as seen in the following code snippet).

from azureml.train.hyperdrive import BanditPolicy

policy = BanditPolicy(evaluation_interval=5, slack_factor=0.1)
Control your resource budget by specifying the maximum number of training runs.
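Putting these pieces together, the HyperDrive run could be configured and submitted roughly as follows. This is only a sketch: src is an assumed ScriptRunConfig wrapping the provided train.py, exp is an assumed Experiment object, and the budget values max_total_runs=20 and max_concurrent_runs=4 are illustrative.

from azureml.train.hyperdrive import HyperDriveConfig, PrimaryMetricGoal

hyperdrive_config = HyperDriveConfig(
    run_config=src,                      # assumed ScriptRunConfig wrapping train.py
    hyperparameter_sampling=ps,          # the RandomParameterSampling defined above
    policy=policy,                       # the BanditPolicy defined above
    primary_metric_name='accuracy',
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,                   # resource budget: maximum number of child runs
    max_concurrent_runs=4)

hypdrive_run = exp.submit(hyperdrive_config)   # yields the run object used below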
# Retrieve the child run with the best primary metric (accuracy)
hd_best_run = hypdrive_run.get_best_run_by_primary_metric()
# Metrics logged by the training script for that run
best_run_metrics = hd_best_run.get_metrics()
# Run details, including the sampled hyperparameter arguments
parameter_values = hd_best_run.get_details()
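The best model can then be persisted, for example by registering it to the workspace. This is a sketch only; the path outputs/model.joblib is an assumption about where train.py saves the fitted model.

# Assumes train.py wrote the fitted model to ./outputs/model.joblib
hd_model = hd_best_run.register_model(
    model_name='hyperdrive-logistic-regression',
    model_path='outputs/model.joblib')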
Automated machine learning, also referred to as automated ML or AutoML, is the process of automating the time-consuming, iterative tasks of machine learning model development. The following describes the model and hyperparameters generated by AutoML.
In contrast to the manual adjustments I made to the Logistic Regression model, the configuration of the AutoML job is straightforward. I defined the following AutoMLConfig:
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    compute_target=cpu_cluster,
    experiment_timeout_minutes=30,
    task="classification",
    primary_metric="accuracy",
    training_data=dataset,
    label_column_name='y',
    enable_onnx_compatible_models=True,
    n_cross_validations=2)
In there, the most important parameters are task (the type of ML problem to solve, here classification), primary_metric (the metric AutoML optimizes, here accuracy), training_data and label_column_name (the dataset and its target column y), and n_cross_validations (the number of cross-validation folds used, since no separate validation set is provided).
Azure AutoML then tries different models and algorithms during the automation and tuning process; as a user, there is no need to specify an algorithm. The task parameter (classification, regression, or forecasting) determines the list of algorithms, or models, to apply. For classification, this includes: Logistic Regression, LightGBM, Gradient Boosting, Decision Tree, K Nearest Neighbors, Linear SVC, Support Vector Classification (SVC), Random Forest, Extremely Randomized Trees, XGBoost, Averaged Perceptron Classifier, Naive Bayes and Linear SVM Classifier. That is pretty impressive given that I do not need to specify or configure any of those! See Configure automated ML experiments in Python for reference.
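For completeness, submitting the AutoML configuration and retrieving the best model could look roughly like this; the experiment name 'automl-bankmarketing' and the variable ws (the Workspace) are assumptions.

from azureml.core import Experiment

automl_exp = Experiment(ws, 'automl-bankmarketing')              # assumed experiment name
automl_run = automl_exp.submit(automl_config, show_output=True)

# get_output() returns the best child run and the corresponding fitted model
best_automl_run, fitted_model = automl_run.get_output()
print(best_automl_run.get_metrics().get('accuracy'))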
As stated in the summary, the custom-coded solution based on Scikit-learn achieved the better result. However, it required considerably more effort and time than the AutoML job, since the latter turns all the necessary knobs automatically.
Both models had the goal of maximizing the accuracy. The exact results are:
Architecturally, the two models are quite different. The two-class logistic regression predicts the probability of occurrence of an event by fitting the data to a logistic function, thus performing a binary classification. The voting ensemble, as the name implies, trains a number of individual classifiers and combines their predictions to make a final prediction.
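To illustrate the idea behind a voting ensemble (this is a conceptual Scikit-learn sketch, not the exact ensemble that AutoML produced):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, random_state=42)

# Soft voting averages the predicted class probabilities of the base classifiers
ensemble = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=200)),
                ('rf', RandomForestClassifier()),
                ('nb', GaussianNB())],
    voting='soft')
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))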
Hint: weighted accuracy weighs each class according to the number of samples that belong to that class in the dataset.
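As a toy illustration of that definition (assuming, as in Azure AutoML's weighted_accuracy, that each sample is weighted by the number of samples belonging to its true class):

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 0, 0, 0, 1])   # imbalanced toy labels
y_pred = np.array([0, 0, 0, 0, 0])   # always predicts the majority class

# Weight each sample by the size of its true class
class_counts = np.bincount(y_true)
sample_weight = class_counts[y_true]

print(accuracy_score(y_true, y_pred))                                # 0.80
print(accuracy_score(y_true, y_pred, sample_weight=sample_weight))   # ~0.94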
The basis for this work can be found under the following link:
I furthermore used the official Microsoft Azure documentation, namely: