To create a scalable MLOps architecture on AWS, you'll want to start by setting up an S3 bucket for data storage, which can handle large amounts of data and scale with your needs.
AWS Lambda functions can automate lightweight data processing and trigger model training jobs, allowing you to focus on higher-level tasks.
A scalable architecture also requires a robust monitoring and logging system, such as Amazon CloudWatch and AWS CloudTrail, to track performance and troubleshoot issues.
By leveraging these AWS services, you can build an MLOps architecture that grows with your data and model requirements.
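As a rough illustration of the S3-plus-Lambda pattern described above, the sketch below pulls the bucket and object key out of an S3 event notification so that downstream processing can be started for each new file. The handler name and its return value are assumptions made for this example; a real function would start a processing or training job instead of returning the list.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record.get("s3", {})
        bucket = s3_info.get("bucket", {}).get("name")
        key = s3_info.get("object", {}).get("key")
        if bucket and key:
            # S3 URL-encodes object keys in event notifications.
            objects.append((bucket, unquote_plus(key)))
    return objects


def lambda_handler(event, context):
    # Hypothetical entry point: only lists the new objects it was told about.
    return {"new_objects": extract_s3_objects(event)}
```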
Architecture
The architecture of an MLOps solution on AWS is built with two primary components: the orchestrator component and the AWS CodePipeline instance. The orchestrator is created by deploying the solution's AWS CloudFormation template. The CodePipeline instance is deployed either by calling the solution's API Gateway or by committing a configuration file into an AWS CodeCommit repository.
The orchestrator can also be extended with custom pipelines.
AWS CloudFormation templates are used to implement the solution's pipelines, providing flexibility and customization options.
The solution provides two AWS CloudFormation templates: one for single account deployment and another for multi-account deployment.
These templates offer the option to use Amazon SageMaker Model Registry to deploy versioned models.
AWS CloudFormation templates provide a way to define and manage infrastructure and applications in the cloud.
By using AWS CloudFormation templates, you can create a consistent and reproducible infrastructure for your MLOps solution.
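Here is a brief overview of the two AWS CloudFormation templates provided by the solution:
- Single-account template: deploys the solution's pipelines within one AWS account
- Multi-account template: deploys the solution's pipelines across multiple AWS accounts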
The use of Amazon SageMaker Model Registry allows for the deployment of versioned models, enabling you to track changes and maintain a record of your model versions.
Components
AWS architecture is built around several key components that work together to provide a scalable and reliable infrastructure for your applications. At the core of this architecture are compute, storage, and networking services.
Compute services include EC2 instances, which provide a virtual machine environment for running your applications. Storage services like S3 and EBS offer scalable and durable storage options for your data.
Networking services like VPC and Route 53 help manage traffic and routing for your applications. Security and compliance features like IAM and Cognito provide identity and access management for your users.
Task Definitions in AWS ECS architecture specify the Docker image to use, CPU and memory requirements, networking mode, and other configurations. Clusters manage the scheduling and deployment of containers across the cluster.
Services in ECS define how many tasks to run and maintain, ensuring that the desired number of tasks are always running.
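Here are the key components of Amazon ECS architecture:
- Task definitions: specify the Docker image, CPU and memory requirements, networking mode, and other configurations
- Clusters: manage the scheduling and deployment of containers
- Services: define how many tasks to run and keep that number running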
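To make the relationship between task definitions and services concrete, here is a minimal sketch that builds the request arguments for the ECS `RegisterTaskDefinition` and `CreateService` calls. The helper names, default sizes, and image URI are assumptions for illustration. A Fargate service would additionally need `launchType` and `networkConfiguration`, which are left out here.

```python
def task_definition(family, image, cpu="256", memory="512"):
    """Build keyword arguments for the ECS RegisterTaskDefinition call."""
    return {
        "family": family,
        "networkMode": "awsvpc",
        "requiresCompatibilities": ["FARGATE"],
        "cpu": cpu,        # task-level CPU units, passed as a string
        "memory": memory,  # task-level memory in MiB, passed as a string
        "containerDefinitions": [
            {"name": family, "image": image, "essential": True}
        ],
    }


def service_definition(cluster, service_name, family, desired_count):
    """Build keyword arguments for the ECS CreateService call."""
    return {
        "cluster": cluster,
        "serviceName": service_name,
        "taskDefinition": family,
        # ECS keeps this many copies of the task running.
        "desiredCount": desired_count,
    }
```

With boto3 installed and credentials configured, these dictionaries could be passed as `ecs.register_task_definition(**task_definition(...))` and `ecs.create_service(**service_definition(...))`.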
Security and Compliance
In AWS architecture, security is a top priority. AWS Identity and Access Management (IAM) is a key service that manages user access and permissions securely.
AWS Key Management Service (KMS) helps create and control encryption keys used to encrypt data. This is crucial for protecting sensitive information.
AWS CloudTrail enables governance, compliance, and operational and risk auditing of your AWS account. This feature helps you track and analyze account activity.
By leveraging these security services, organizations can build robust and secure applications in the cloud.
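One way to picture how IAM and KMS fit together is a least-privilege policy that lets a training role read objects under a single S3 prefix and decrypt them with a single key. The sketch below only builds the policy document. The function name, bucket, prefix, and key ARN are placeholders chosen for this example.

```python
import json


def read_only_policy(bucket, prefix, kms_key_arn):
    """Build an IAM policy document for read-only access to one S3 prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
            {
                # Needed to read objects encrypted with this KMS key.
                "Effect": "Allow",
                "Action": ["kms:Decrypt"],
                "Resource": kms_key_arn,
            },
        ],
    }


def policy_json(bucket, prefix, kms_key_arn):
    """Serialise the policy the way IAM expects it (a JSON string)."""
    return json.dumps(read_only_policy(bucket, prefix, kms_key_arn))
```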
Deploy Solution
To deploy your solution, you'll need to follow these steps. First, determine the name of a target Amazon S3 distribution bucket that should be created in your account, and create the bucket in the target account.
You'll also need to set environment variables in your shell, including SOLUTION_NAME and VERSION. SOLUTION_NAME should be the name of your solution, and VERSION should be the version number of the change.
Next, upload the distributable assets to your Amazon S3 bucket in your account. You can use the AWS Console or the AWS CLI to do this. Ensure that you own the Amazon S3 bucket before uploading the assets.
Once you've uploaded the assets, you'll need to create the deployment using CloudFormation with the template link. You can do this by copying the Object URL link for the preferred deployment architecture.
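Here's a summary of the steps:
- Choose a name for the target Amazon S3 distribution bucket and create it in the target account
- Set the SOLUTION_NAME and VERSION environment variables in your shell
- Upload the distributable assets to the bucket using the AWS Console or the AWS CLI
- Create the deployment in CloudFormation using the Object URL of the preferred template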
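The last step can be sketched in Python. The example below assembles the Object URL of an uploaded template and the arguments for CloudFormation's `CreateStack` call. The key layout (`<solution>/<version>/<file>`), the template file name, and the helper names are assumptions for illustration, so adjust them to match where your assets actually land.

```python
import os


def template_url(bucket, region, solution_name, version, template_file):
    """Return the S3 object URL for an uploaded CloudFormation template."""
    # Hypothetical key layout: <solution>/<version>/<template file>.
    key = f"{solution_name}/{version}/{template_file}"
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"


def url_from_env(bucket, region, template_file, env=None):
    """Read SOLUTION_NAME and VERSION from the environment (or a dict)."""
    env = os.environ if env is None else env
    return template_url(
        bucket, region, env["SOLUTION_NAME"], env["VERSION"], template_file
    )


def create_stack_args(stack_name, url):
    """Build keyword arguments for CloudFormation's CreateStack call."""
    return {
        "StackName": stack_name,
        "TemplateURL": url,
        # Needed when the template creates named IAM resources.
        "Capabilities": ["CAPABILITY_NAMED_IAM"],
    }
```

With boto3 available, the result could be passed as `boto3.client("cloudformation").create_stack(**create_stack_args(...))`.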
Model Training and Deployment
Model training is a crucial step in the MLOps pipeline, and it's essential to validate the model before deployment. This architecture uses a Continuous Integration/Continuous Deployment (CI/CD) approach to ensure seamless, automated model deployments.
The key components of the deployment process are AWS CodePipeline, AWS CodeBuild, Amazon SageMaker Endpoints, Amazon CloudWatch, and AWS IAM, KMS, and Secrets Manager. These services work together to automate the build, test, and deployment phases, reducing manual intervention and ensuring that the latest, best-performing model is always in production.
The trained model is deployed as an API endpoint in SageMaker, allowing other applications to consume it for real-time predictions. It also supports multi-model endpoints and A/B testing, making deploying and comparing multiple models easy.
Here are the key components of the deployment pipeline:
- AWS CodePipeline: automates the build, test, and deployment phases
- AWS CodeBuild: handles building the model package or any dependencies required for deployment
- Amazon SageMaker Endpoints: deploys the trained model as an API endpoint
- Amazon CloudWatch: monitors the deployment pipeline and the health of the deployed models
- AWS IAM, KMS, and Secrets Manager: ensures secure access to the model endpoints and sensitive data
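A/B testing on a SageMaker endpoint is typically expressed as several production variants with different traffic weights. The sketch below builds the arguments for the `CreateEndpointConfig` call and computes the resulting traffic share. The helper names, instance type, and model names are assumptions made for this example.

```python
def ab_endpoint_config(config_name, variants, instance_type="ml.m5.large"):
    """Build CreateEndpointConfig arguments.

    variants is a list of (variant_name, model_name, weight) tuples.
    """
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": name,
                "ModelName": model,
                "InstanceType": instance_type,
                "InitialInstanceCount": 1,
                "InitialVariantWeight": float(weight),
            }
            for name, model, weight in variants
        ],
    }


def traffic_share(config):
    """Return the fraction of requests each variant receives."""
    variants = config["ProductionVariants"]
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}
```

Weights are relative, so a 9:1 split sends roughly 90% of requests to the first variant and 10% to the challenger.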
Orchestrating Model Training
Model training is a crucial step in the machine learning process, and it's essential to get it right. Data processing is the first key step in orchestrating model training, where data is loaded and split.
The next step is model training, where the model is trained and evaluated for mean Average Precision (mAP). If the model meets the threshold, it's registered in the SageMaker Model Registry.
mAP averages the per-class average precision, so it summarizes detection quality across all classes in a single number. Using it as a gate keeps models that fall below the threshold out of the registry.
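The evaluation gate can be sketched in a few lines. The example below averages hypothetical per-class average precision values into mAP and decides whether the model qualifies for registration. The function names and the threshold value are assumptions, and the registration call itself is left out.

```python
def mean_average_precision(ap_per_class):
    """Average the per-class average precision values into a single mAP."""
    if not ap_per_class:
        raise ValueError("at least one class is required")
    return sum(ap_per_class.values()) / len(ap_per_class)


def should_register(ap_per_class, threshold):
    """Return True when the model's mAP meets the registration threshold."""
    return mean_average_precision(ap_per_class) >= threshold
```

For example, per-class values of 0.5 and 0.75 give an mAP of 0.625, which passes a threshold of 0.6 and fails a threshold of 0.7.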
Model Deployment
Model deployment is a crucial step in the MLOps pipeline, and there are several approaches to achieve it. One approach is to use Amazon SageMaker endpoints, which offer a more straightforward deployment process.
To automate the edge deployment phase, you can export the model in ONNX format and register it in Amazon SageMaker Model Registry. This ensures portability and optimization for seamless deployment in production systems.
For real-time edge inference, you can consider using AWS IoT Greengrass, but for this example, we'll focus on Amazon SageMaker endpoints. This approach involves creating a deployment package that includes the trained model in ONNX format and a private component for inference code.
The private component handles data preparation, communication with the model, and post-processing of results to ensure effective model deployment. This approach is more straightforward and easier to manage than AWS IoT Greengrass.
Here's a summary of the key components involved in model deployment using Amazon SageMaker endpoints:
- Trained model in ONNX format
- Private component for inference code
- Amazon SageMaker Model Registry for model registration
- Amazon SageMaker endpoints for real-time inference
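The three responsibilities of the inference code (data preparation, communication with the model, and post-processing) can be sketched as follows. The JSON request and response shapes (`inputs`, `detections`, `score`) are hypothetical and depend entirely on how your model container is written. The `predict` argument stands in for the call to the endpoint.

```python
import json


def prepare_payload(pixels):
    """Scale 0-255 pixel values to 0-1 and serialise them as JSON."""
    return json.dumps({"inputs": [p / 255.0 for p in pixels]})


def postprocess(raw_body, min_score=0.5):
    """Keep only detections whose score reaches min_score."""
    detections = json.loads(raw_body)["detections"]
    return [d for d in detections if d["score"] >= min_score]


def run_inference(pixels, predict):
    """Prepare the input, call the model, and filter the results.

    predict is any callable that takes the JSON payload and returns the
    response body. Against a SageMaker endpoint it could wrap
    invoke_endpoint(EndpointName=..., ContentType="application/json",
    Body=payload)["Body"].read() from the sagemaker-runtime client.
    """
    return postprocess(predict(prepare_payload(pixels)))
```

Passing `predict` in as a callable keeps the pre- and post-processing testable without a live endpoint.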
Data Change Model Change
Automating the MLOps process is crucial for modern machine learning pipelines, ensuring that models stay relevant as new data or performance requirements change.
Amazon SageMaker is the core machine learning platform that handles model training, tuning, and deployment, and can be triggered by new data arrivals or model performance degradation.
This process involves monitoring deployed models in production for model drift, data quality issues, or bias using Amazon SageMaker Model Monitor.
Amazon SageMaker Model Monitor can detect deviations and trigger an automated model retraining process.
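Here are the key components involved in this automated MLOps setup:
- Amazon SageMaker: handles model training, tuning, and deployment
- Amazon SageMaker Model Monitor: watches deployed models for model drift, data quality issues, or bias, and triggers retraining when it detects deviations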
By leveraging this automated MLOps setup, organizations can ensure their models are always performing optimally, responding to changes in the underlying data or business requirements.
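The retraining trigger can be pictured as a small decision function over a monitoring report. The sketch below assumes the report is a dictionary with a `violations` list whose entries carry a `feature_name`, which resembles the constraint violations file that Model Monitor writes; treat the exact field names and the function names as assumptions and check them against your own monitoring output.

```python
def violated_features(report):
    """Return the sorted, de-duplicated names of features with violations."""
    return sorted({v["feature_name"] for v in report.get("violations", [])})


def should_retrain(report, max_violations=0):
    """Return True when the report has more violations than tolerated."""
    return len(report.get("violations", [])) > max_violations
```

A scheduled job or Lambda function could evaluate `should_retrain` after each monitoring run and start the training pipeline when it returns `True`.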
Data Management
Data Management is crucial in any MLOps pipeline, and AWS provides several services to ensure scalability and flexibility. We use AWS Database Migration Service (DMS) to extract and replicate data from a relational database to Amazon S3, where the data lake resides.
AWS DMS supports continuous replication, ensuring that new data in the relational database is mirrored into S3 in near real-time. Data in the lake is often partitioned by time or category, which keeps retrieval efficient.
AWS Glue Data Catalog is integrated to automatically catalog the ingested data, creating metadata models that describe its structure and relationships. This enhances data discoverability and governance, making it easier to manage and maintain our MLOps pipeline.
The pipeline ensures scalability and flexibility by using a data lake architecture with proper metadata management.
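Time-based partitioning usually shows up in the object key itself. The sketch below builds a Hive-style partitioned key (`year=/month=/day=`), a layout that AWS Glue and Amazon Athena can map to partitions. The table name, file name, and function name are placeholders for this example.

```python
from datetime import datetime, timezone


def partitioned_key(table, ts, filename):
    """Build an S3 key partitioned by year, month, and day."""
    return f"{table}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/{filename}"


# Example: a file replicated on 5 March 2024 (UTC).
example_key = partitioned_key(
    "orders", datetime(2024, 3, 5, tzinfo=timezone.utc), "part-0.parquet"
)
```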
Storage Solutions
When managing data, storage is a crucial aspect to consider. Amazon S3 is an object storage service that provides industry-leading scalability, data availability, security, and performance.
Amazon S3 is ideal for storing raw image data, as seen in our MLOps Solution Architecture. This is because it's a cost-efficient solution that can handle large amounts of data.
For applications requiring consistent performance, Amazon EBS is a good option. It provides block-level storage volumes for use with EC2 instances.
Amazon S3 Glacier is a low-cost cloud storage service for data archiving and long-term backup. It's a great solution for storing data that's not frequently accessed.
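Here are the different storage solutions offered by AWS:
- Amazon S3: object storage for large volumes of data such as raw images
- Amazon EBS: block-level storage volumes for use with EC2 instances
- Amazon S3 Glacier: low-cost archiving and long-term backup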
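S3 and Glacier are often combined through a lifecycle rule that moves older objects to archival storage. As a minimal sketch, the function below builds the `LifecycleConfiguration` argument for the S3 `PutBucketLifecycleConfiguration` call. The prefix, rule ID format, day count, and function name are assumptions chosen for illustration.

```python
def archive_rule(prefix, days):
    """Build a lifecycle configuration that archives a prefix after N days."""
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/')}",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                # Move objects to the Glacier storage class after `days`.
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }
```

With boto3, this could be applied via `s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=archive_rule("raw-images/", 90))`.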
Data Pre-Processing
Data Pre-Processing involves cleaning, transforming, and enriching raw data to make it suitable for machine learning. This crucial step ensures that the data is in a format that can be effectively used to train and deploy models.
AWS Glue is a fully managed ETL service that helps transform raw data by applying necessary filters, aggregations, and transformations. It's a game-changer for data scientists and engineers who want to focus on higher-level tasks.
Data Pre-Processing also involves running SQL queries on the data in S3 for exploratory data analysis. Amazon Athena allows data scientists and engineers to do just that, making it a valuable tool in the data management process.
Amazon SageMaker Feature Store stores engineered features and provides consistent, reusable feature sets across different models and teams. This ensures that feature management is scalable and efficient.
Here's a breakdown of the key services involved in Data Pre-Processing:
- AWS Glue: For ETL and data transformation
- AWS Lambda: For lightweight transformations or event-triggered processing
- Amazon Athena: For exploratory data analysis and SQL querying
- Amazon SageMaker Feature Store: For feature management and consistency
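For the exploratory analysis step, an Athena query is submitted with a SQL string, a database, and an S3 location for results. The sketch below builds those arguments for the `StartQueryExecution` call along with a simple data-quality query. The table, column, database, and bucket names are hypothetical, as are the helper names.

```python
def null_count_sql(table, column):
    """Build a query that counts rows where a column is NULL."""
    return f'SELECT COUNT(*) AS missing FROM "{table}" WHERE "{column}" IS NULL'


def athena_query_args(sql, database, output_location):
    """Build keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }
```

With boto3, the arguments could be passed as `athena.start_query_execution(**athena_query_args(...))`.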
Real-Time Reporting
Real-Time Reporting is a game-changer for businesses looking to stay on top of user interactions. By setting up a Lambda function that processes Snowplow events, you can gain immediate insights into user behavior.
This setup allows for real-time updates to a DynamoDB table, giving you the ability to make data-driven decisions quickly. With this setup, you can track user interactions and make adjustments on the fly.
By leveraging AWS Lambda and Snowplow, you can create a robust real-time reporting system that provides valuable insights into user behavior. This can be a huge advantage in today's fast-paced business environment.
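A minimal sketch of such a Lambda function is shown below. It assumes the events arrive as Kinesis records whose data is base64-encoded JSON with an `event_name` field. Snowplow's enriched stream is tab-separated by default, so a real handler would need a parsing step suited to your setup. The table key, the `event_count` attribute, and the function names are also assumptions.

```python
import base64
import json
from collections import Counter


def count_events(event):
    """Count events per event_name in a batch of Kinesis records."""
    counts = Counter()
    for record in event.get("Records", []):
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        counts[payload.get("event_name", "unknown")] += 1
    return dict(counts)


def write_counts(table, counts):
    """Atomically add the batch counts to a DynamoDB table.

    table is expected to behave like a boto3 DynamoDB Table resource.
    """
    for name, n in counts.items():
        table.update_item(
            Key={"event_name": name},
            UpdateExpression="ADD event_count :n",
            ExpressionAttributeValues={":n": n},
        )
```

Using `ADD` lets concurrent Lambda invocations increment the same counter without reading it first.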