Automated machine learning (AutoML) helps data scientists and analysts speed up the machine learning process without sacrificing accuracy. It automates building, training, and tuning machine learning models, freeing up time for more strategic tasks.
With AutoML, you can try out multiple models and algorithms with just a few clicks, making it easier to find the best fit for your problem. This is especially useful for complex problems that require multiple models to solve.
AutoML can be used for a variety of tasks, including regression, classification, clustering, and more. By automating the machine learning process, you can reduce the time and effort required to get accurate results.
AutoML Workflow
To start an AutoML workflow, you need to identify the ML problem to be solved, which can be classification, forecasting, regression, computer vision, or NLP. This will inform your data requirements.
You have two options to design and run your automated ML training experiments: a code-first experience using the Azure Machine Learning SDKv2 or CLIv2, or a no-code studio web experience using the Azure Machine Learning studio web interface.
Here are the steps to follow for a no-code experience:
- Specify the source of the labeled training data.
- Configure the automated machine learning parameters.
- Submit the training job.
- Review the results.
The training job produces a Python serialized object (.pkl file) that contains the model and data preprocessing.
Vertex AI Workflow
The Vertex AI workflow is a standard machine learning workflow that involves five stages: gathering data, preparing data, training, evaluating, and deploying and predicting. Gathering data is the first step, where you determine the data you need for training and testing your model based on the outcome you want to achieve.
To prepare your data, you need to make sure it's properly formatted and labeled. This is a crucial step, as it ensures that your model can learn from the data correctly.
Training is the next stage, where you set parameters and build your model. You can use supervised learning tasks to achieve a chosen outcome, and the specifics of the algorithm and training methods change based on the data type and use case.
There are many different subcategories of machine learning, all of which solve different problems and work within different constraints. You can train a model to recognize patterns and content in images, text, or videos, depending on the use case.
Here are the five stages of the Vertex AI workflow:
- Gather your data
- Prepare your data
- Train
- Evaluate
- Deploy and predict
Azure automated ML, on the other hand, automates the training stage itself. During training, Azure Machine Learning creates many pipelines in parallel that try different algorithms and parameters for you. The service iterates through ML algorithms paired with feature selections, where each iteration produces a model with a training score, and it stops once it hits the exit criteria defined in the experiment.
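The search loop described above can be illustrated with a minimal sketch. It is not Azure Machine Learning's implementation: it runs trials one after another instead of in parallel, and the candidate list, the made-up scores, and the names `run_search`, `max_trials`, and `target_score` are all assumptions chosen for illustration.

```python
# Hypothetical candidates: (algorithm name, hyperparameters).
CANDIDATES = [
    ("logistic_regression", {"C": 1.0}),
    ("random_forest", {"n_estimators": 100}),
    ("gradient_boosting", {"learning_rate": 0.1}),
    ("gradient_boosting", {"learning_rate": 0.3}),
]
# Made-up scores standing in for real training runs, one per candidate.
FAKE_SCORES = [0.81, 0.86, 0.91, 0.89]


def fake_train_and_score(algorithm, params):
    """Stand-in for training a pipeline and returning its score."""
    return FAKE_SCORES[CANDIDATES.index((algorithm, params))]


def run_search(candidates, train_and_score, max_trials, target_score):
    """Try candidates in order; stop at max_trials or once target_score is reached."""
    leaderboard = []
    for trial, (algorithm, params) in enumerate(candidates, start=1):
        if trial > max_trials:
            break  # exit criterion 1: trial budget used up
        score = train_and_score(algorithm, params)
        leaderboard.append((score, algorithm, params))
        if score >= target_score:
            break  # exit criterion 2: a model is good enough
    leaderboard.sort(key=lambda row: row[0], reverse=True)
    return leaderboard


leaderboard = run_search(CANDIDATES, fake_train_and_score, max_trials=10, target_score=0.90)
best_score, best_algorithm, best_params = leaderboard[0]
print(best_algorithm, best_params, best_score)
```

With these made-up scores the loop stops after the third trial, because that trial reaches the target score, so the fourth candidate is never trained.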
Ensemble
Ensemble models are enabled by default in automated machine learning, which improves machine learning results and predictive performance by combining multiple models.
Automated machine learning uses both voting and stacking ensemble methods for combining models. Voting predicts based on the weighted average of predicted class probabilities or predicted regression targets.
Stacking combines heterogeneous models and trains a meta-model based on the output from the individual models. The current default meta-models are LogisticRegression for classification tasks and ElasticNet for regression/forecasting tasks.
The Caruana ensemble selection algorithm is used to decide which models to use within the ensemble. This algorithm initializes the ensemble with up to five models with the best individual scores, and verifies that these models are within a 5% threshold of the best score.
Here's a brief overview of the ensemble methods used in automated machine learning:
- Voting: Predicts based on the weighted average of predicted class probabilities (for classification tasks) or predicted regression targets (for regression tasks).
- Stacking: Combines heterogeneous models and trains a meta-model based on the output from the individual models.
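To make the voting method concrete, here is a minimal sketch of weighted soft voting for a classification task. The function name `weighted_soft_vote`, the weights, and the probabilities are illustrative assumptions, not values or APIs taken from Azure Machine Learning. Stacking is not shown because it requires training a meta-model.

```python
def weighted_soft_vote(probabilities, weights):
    """Average per-class probabilities across models using the given weights.

    probabilities: one list of class probabilities per model.
    weights: one non-negative weight per model.
    Returns the averaged probabilities and the index of the winning class.
    """
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    predicted_class = max(range(n_classes), key=lambda c: averaged[c])
    return averaged, predicted_class


# Three hypothetical models scoring one example over two classes.
model_probabilities = [
    [0.6, 0.4],
    [0.3, 0.7],
    [0.2, 0.8],
]
model_weights = [2, 1, 1]  # the first model counts double

averaged, predicted_class = weighted_soft_vote(model_probabilities, model_weights)
print(averaged, predicted_class)
```

Even though the most heavily weighted model prefers class 0, the weighted average favors class 1 because the other two models agree with each other strongly.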
The Caruana ensemble selection algorithm also ensures that the ensemble is updated to include a new model if it improves the existing ensemble score.
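The selection procedure described above can be sketched as a greedy loop. This is a simplified illustration under stated assumptions, not the service's actual code: the toy metric (1 minus mean squared error), the `max_rounds` cap, the sample predictions, and all function names are hypothetical. Only the initialization with up to five top models within 5% of the best score, and the rule of adding a model when it improves the ensemble score, come from the description above.

```python
def mean_prediction(members, predictions):
    """Average the members' predictions; a repeated member gets extra weight."""
    n = len(predictions[members[0]])
    return [sum(predictions[m][i] for m in members) / len(members) for i in range(n)]


def score(pred, labels):
    """Toy metric (an assumption): 1 - mean squared error, so higher is better."""
    return 1.0 - sum((p - y) ** 2 for p, y in zip(pred, labels)) / len(labels)


def select_ensemble(predictions, labels, init_size=5, tolerance=0.05, max_rounds=20):
    """Greedy ensemble selection with replacement."""
    individual = {name: score(pred, labels) for name, pred in predictions.items()}
    ranked = sorted(individual, key=individual.get, reverse=True)
    best = individual[ranked[0]]
    # Start from the top models whose score is within the tolerance of the best.
    members = [m for m in ranked[:init_size] if individual[m] >= best * (1 - tolerance)]
    current = score(mean_prediction(members, predictions), labels)
    for _ in range(max_rounds):
        trials = {
            name: score(mean_prediction(members + [name], predictions), labels)
            for name in predictions
        }
        candidate = max(trials, key=trials.get)
        if trials[candidate] <= current:
            break  # no model improves the ensemble score any further
        members.append(candidate)
        current = trials[candidate]
    return members, current


# Hypothetical validation labels and per-model predicted probabilities.
labels = [1, 0, 1, 0]
predictions = {
    "a": [0.9, 0.2, 0.6, 0.4],
    "b": [0.6, 0.1, 0.9, 0.3],
    "c": [0.5, 0.5, 0.5, 0.5],
}

members, final_score = select_ensemble(predictions, labels)
print(members, round(final_score, 4))
```

Here model `c` is excluded from the initial ensemble because its individual score falls more than 5% below the best model's score.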
Data Preparation
Data Preparation is a crucial step in any machine learning project. To get started, you need to have a good understanding of your data, including its quality and relevance to your problem.
You can import data from your computer or Cloud Storage in CSV or JSON Lines format, with labels and bounding boxes (if necessary) inline. This format works whether you split the data manually or rely on the default data split.
Before training, check your data for bias and missing values, since both can degrade the quality of your model.
Here are some key points to keep in mind when preparing your data:
- 100 image examples per category/label are the bare minimum required for classification.
- Aim for at least 1000 examples per label for better model performance.
- Capture the variety and diversity of your problem space to improve model generalization.
By following these guidelines, you'll be well on your way to preparing high-quality data for your machine learning project.
Preparation
Preparation is key to a successful machine learning project. Before training, make sure your data isn't biased and doesn't contain missing or erroneous values.
Consider the quality of your data before you start training your model. If your data is biased or contains errors, it will affect the quality of the model.
To add data to Vertex AI, you can import it from your computer or Cloud Storage in a CSV or JSON Lines format with the labels and bounding boxes (if necessary) inline. This allows you to specify the splits in your data if you want to split your dataset manually.
You can also upload unlabeled images and use the Google Cloud console to apply annotations. This is useful if your data hasn't been annotated yet.
To prepare your data, you can start by considering all the data your organization collects. You may find that you're already collecting the relevant data you need to train a model.
Here are some ways to source your data:
- Obtain it manually
- Outsource it to a third-party provider
- Use data you're already collecting
Capturing the variation in your problem space is also important. This means exposing your model to a wide variety of examples in your training data. The broader the selection, the more readily it will generalize to new examples.
For example, if you're trying to classify photos of consumer electronics into categories, the wider the variety of consumer electronics your model is exposed to in training, the more likely it is to correctly categorize a novel model of tablet, phone, or laptop.
Include Enough Labeled Examples
Include enough labeled examples in each category to ensure your model learns to recognize patterns. A bare minimum of 100 image examples per category/label is required for classification. The more high-quality examples you can bring to the training process, the better your model will be.
The likelihood of successfully recognizing a label increases with the number of examples. Target at least 1000 examples per label for optimal results. This will help your model generalize to new examples and perform better in real-world scenarios.
Here are some guidelines for ensuring you have enough labeled examples:
- For classification, aim for at least 1000 examples per label.
- For object detection, ensure you have enough examples to cover a variety of scenarios, such as different angles, lighting conditions, and object sizes.
- For video classification, try to capture a variety of video shots, such as different camera angles, day and night times, and player movements.
By following these guidelines, you can ensure your model is well-trained and can accurately recognize patterns in your data.
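A quick way to check a dataset against these numbers is to count examples per label before training. The sketch below uses the 100 and 1000 figures quoted above; the function name `audit_label_counts` and the sample labels are hypothetical.

```python
from collections import Counter

MIN_EXAMPLES = 100      # bare minimum per label quoted above
TARGET_EXAMPLES = 1000  # recommended target per label quoted above


def audit_label_counts(labels, minimum=MIN_EXAMPLES, target=TARGET_EXAMPLES):
    """Group labels by whether they miss the minimum, miss the target, or meet it."""
    counts = Counter(labels)
    report = {"below_minimum": [], "below_target": [], "ok": []}
    for label, count in sorted(counts.items()):
        if count < minimum:
            report["below_minimum"].append((label, count))
        elif count < target:
            report["below_target"].append((label, count))
        else:
            report["ok"].append((label, count))
    return report


# Hypothetical label column of an image classification dataset.
labels = ["phone"] * 1200 + ["tablet"] * 350 + ["laptop"] * 40
report = audit_label_counts(labels)
for bucket, entries in report.items():
    print(bucket, entries)
```

In this made-up dataset, `laptop` would need more examples before training is worthwhile, and `tablet` would benefit from more.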
Analyze After Importing
You should review each column's variable type in your dataset after importing. Vertex AI will automatically detect the variable type based on the column's values, but it's always a good idea to double-check.
Make sure each column has the correct variable type to ensure accurate analysis. This is especially important if you have columns with missing or NULL values.
Review your dataset to determine the nullability of each column. Nullability determines whether a column can have missing or NULL values. This is crucial for accurate analysis and modeling.
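The same double-check can be done locally before uploading. The following sketch infers a coarse type and the nullability of each column from CSV text. It is a simplified stand-in, not how Vertex AI detects variable types, and the column names, sample rows, and helper names are assumptions.

```python
import csv
import io


def infer_type(value):
    """Classify a non-empty CSV string as 'numeric' or 'text'."""
    try:
        float(value)
        return "numeric"
    except ValueError:
        return "text"


def profile_columns(rows):
    """Report a coarse type and nullability for each column of dict rows."""
    seen = {}
    for row in rows:
        for name, value in row.items():
            info = seen.setdefault(name, {"types": set(), "nullable": False})
            if value is None or value.strip() == "":
                info["nullable"] = True  # an empty cell counts as missing/NULL
            else:
                info["types"].add(infer_type(value))
    profile = {}
    for name, info in seen.items():
        if len(info["types"]) == 1:
            kind = next(iter(info["types"]))
        elif info["types"]:
            kind = "mixed"  # worth a manual look
        else:
            kind = "unknown"  # column has no values at all
        profile[name] = {"type": kind, "nullable": info["nullable"]}
    return profile


# Hypothetical sample data with two missing cells.
CSV_TEXT = "age,plan,monthly_spend\n34,basic,19.99\n,premium,49.50\n51,basic,\n"
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
profile = profile_columns(rows)
for column, info in profile.items():
    print(column, info)
```

A column reported as `mixed` is a good candidate for manual review, since a few stray text values can cause a numeric column to be treated as categorical.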
To add data in Vertex AI, you can import it from your computer or Cloud Storage in the CSV or JSON Lines format. The data should be in the format specified in Preparing your training data, with labels inline.
You can also upload unlabeled text examples and use the Vertex AI console to apply labels. This is a great option if your data hasn't been labeled yet.
To prepare your dataset for video analysis, make sure the videos contain labels associated with video segments or bounding boxes. For action recognition, the video segment is a timestamp, while for classification, the segment can be a video shot, a segment, or the whole video. For object tracking, the labels are associated with bounding boxes.
Model Selection
Model selection is a critical step in the AutoML process. You want to choose the right model type for your problem, and Vertex AI can help you with that.
When deciding on a model type, consider the nature of your data. For example, if you're working with images or videos, you may want to use a computer vision model. If you don't specify a split, Vertex AI automatically divides your dataset into training, validation, and testing sets, using 80% of your data for training, 10% for validation, and 10% for testing by default.
In some cases, you may need to adjust the model parameters based on your data quality and the outcome you're looking for. For instance, if you're doing object tracking, a higher resolution is more important than the frame rate. However, using higher resolution videos may not help with improving the model performance because internally the video frames are downsampled to boost training and inference speed.
Assess Your Use Case
To assess your use case, start by identifying the outcome you want to achieve. This will help you determine the type of model you need to build. You can begin by asking yourself some key questions: What is the outcome you're trying to achieve? What kinds of categories or objects would you need to recognize to achieve this outcome?
A binary classification model is suitable for yes or no questions, such as predicting whether a customer would buy a subscription. This type of model requires less data than other model types.
To determine the type of model you need, consider the following:
- What is the outcome you're trying to achieve?
- What kinds of categories do you need to recognize to achieve this outcome?
- Is it possible for humans to recognize those categories?
- What kinds of examples would best reflect the type and range of data your system will classify?
This will help you decide between a binary classification model, a multi-class classification model, a forecasting model, or a regression model.
Understanding your use case and the type of model you need will help you create a successful machine learning project.
Capture Problem Space Variation
Capturing the variation in your problem space is crucial for building a robust model. This means ensuring that your training data includes a wide range of examples that accurately represent the diversity of your problem space.
To achieve this, try to expose your model to a broad selection of data. For a model trained to classify photos of consumer electronics, that means including many different devices rather than a handful of similar ones. The more varied the data, the better the model will generalize to new examples.
A good rule of thumb is to have a similar number of training examples for each class, for example when training a model to classify soccer videos. This helps prevent the model from being biased toward the most common classes.
Keeping class counts roughly similar also helps ensure that your model sees a diverse range of examples during training, which improves its ability to generalize to new or less common examples.
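One way to check how balanced the classes are is to compare the largest class to the smallest. In the sketch below, `max_ratio` is a hypothetical parameter you would choose yourself; the value 10 and the soccer-style labels are arbitrary illustrations, not a documented limit.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return the ratio of the largest class count to the smallest, plus the counts."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values()), counts


def is_too_imbalanced(labels, max_ratio):
    """True if the most common class outnumbers the rarest by more than max_ratio."""
    ratio, _ = imbalance_ratio(labels)
    return ratio > max_ratio


# Hypothetical labels for clips in a soccer video dataset.
labels = ["pass"] * 900 + ["foul"] * 150 + ["goal"] * 50
ratio, counts = imbalance_ratio(labels)
print(ratio, dict(counts))
print(is_too_imbalanced(labels, max_ratio=10))  # max_ratio=10 is an arbitrary choice
```

If the check fails, the usual remedies are collecting more examples of the rare classes or sampling fewer examples of the common ones.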
Creating a Custom Model
Creating a custom model involves several steps, starting with preparing your dataset. Vertex AI automatically splits your data into training, validation, and testing sets if you don't specify the splits, using 80% for training, 10% for validating, and 10% for testing.
The training set is the largest portion of your data, used to teach the model its parameters. The validation set is used to fine-tune the model's hyperparameters, ensuring it doesn't overfit to the training data.
The test set is kept separate and used to evaluate the model's performance on unseen data. This is crucial for getting an accurate idea of how the model will perform on real-world data.
You can manually split your dataset if you want more control over the process. After splitting your data, you're ready to create a machine learning model.
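If you do split manually, an 80/10/10 split can be sketched as follows. The random shuffle with a fixed seed is an assumption made here for reproducibility; it is not a description of how Vertex AI assigns items to each set, and `split_dataset` is a hypothetical helper.

```python
import random


def split_dataset(items, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items and split them into training, validation, and test lists."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder, never seen during training
    return train, validation, test


train, validation, test = split_dataset(range(1000))
print(len(train), len(validation), len(test))
```

A purely random split is not always appropriate. For time-ordered data, for instance, you would usually hold out the most recent examples for testing instead.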
How Vertex AI Uses Your Dataset
Your dataset will be split into training, validation, and testing sets by Vertex AI, with the default split depending on the type of model you're training.
The test set is not involved in the training process at all, so you can rest assured that it's an entirely new challenge for your model.
After training is complete, Vertex AI uses the test set to evaluate your model's performance, giving you a good idea of how it will perform on real-world data.
Model Selection
When selecting a model, consider the type of data you have available. Ideally, your training examples are real-world data drawn from the same source as the data you plan to classify with the model.
Data sourcing and preparation are critical steps for building a machine learning model. How much data do you have available? Are your data relevant to the questions you're trying to answer?
Find training examples that are visually similar to what you're planning to make predictions on. If you are trying to classify house images taken in snowy winter weather, you probably won't get great performance from a model trained only on house images taken in sunny weather.
The data you have available informs the kind of problems you can solve. This is why it's essential to match your data to the intended output for your model.
Video resolution, video frames per second, camera angle, and background are also factors to consider when selecting a model. For example, if all of your training videos are taken in the winter or in the evening, the lighting and color patterns in those environments will affect your model.
A score threshold is the number that determines when a given score is converted into a yes or no decision. Choose it based on your use case: in a medical application such as cancer detection, the consequences of mislabeling are much higher than when mislabeling sports videos.
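The effect of the threshold is easiest to see by counting errors at a few different values. In this sketch the scores, labels, and the helper `errors_at` are made up for illustration.

```python
def errors_at(scores, labels, threshold):
    """Turn scores into yes/no decisions and count both kinds of mistakes."""
    decisions = [s >= threshold for s in scores]
    false_positives = sum(d and not y for d, y in zip(decisions, labels))
    false_negatives = sum((not d) and y for d, y in zip(decisions, labels))
    return false_positives, false_negatives


# Hypothetical model scores and the true yes/no answers.
scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
labels = [True, True, False, True, False, False]

for threshold in (0.2, 0.5, 0.9):
    fp, fn = errors_at(scores, labels, threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

Lowering the threshold trades missed positives for false alarms. When a missed positive is costly, as in the medical example, a lower threshold is usually preferable. When false alarms are the bigger problem, a higher one is.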
When to Use: Classification, Regression, Forecasting
When it comes to selecting a model for your project, you need to consider the type of problem you're trying to solve. Classification, regression, and forecasting are three common types of models that can help you achieve your goals.
Classification models are useful when you want to predict which category a new data point falls into. For example, you can use a classification model to predict whether a customer will buy a jacket in the next year. Classification models are often used in tasks like fraud detection, handwriting recognition, and object detection.
Regression models are useful when you want to predict a continuous value. For example, you can use a regression model to predict how much a customer will spend next month. Regression models are often used in tasks like predicting revenue, inventory, and sales.
Forecasting models are useful when you want to predict a sequence of values. For example, you can use a forecasting model to predict daily demand of your products for the next 3 months. Forecasting models are often used in tasks like predicting energy demand, stock prices, and weather patterns.
To determine which type of model to use, consider the following:
- Classification: Use when you want to predict which category a new data point falls into.
- Regression: Use when you want to predict a continuous value.
- Forecasting: Use when you want to predict a sequence of values.
By considering the type of problem you're trying to solve and the characteristics of each model type, you can select the best model for your project and achieve your goals.
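The decision rule above is simple enough to write down as a small helper. This is only a sketch of the rule of thumb: `suggest_model_type` and its two boolean parameters are hypothetical, and real projects often need more nuance than two yes/no questions.

```python
def suggest_model_type(predicts_sequence_over_time, target_is_numeric):
    """Map a coarse description of the prediction target to a model type."""
    if predicts_sequence_over_time:
        return "forecasting"  # a sequence of future values
    if target_is_numeric:
        return "regression"  # a single continuous value
    return "classification"  # one of a fixed set of categories


# The three examples used in the text above.
jacket = suggest_model_type(predicts_sequence_over_time=False, target_is_numeric=False)
spend = suggest_model_type(predicts_sequence_over_time=False, target_is_numeric=True)
demand = suggest_model_type(predicts_sequence_over_time=True, target_is_numeric=True)
print(jacket, spend, demand)
```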