Brief Guide to Amazon SageMaker Algorithms: Unsupervised Learning

Imagine uncovering hidden customer segments without prior labels, detecting fraudulent network activity in real time, or distilling a mountain of data into actionable insights, all without explicit human guidance. This is the power of unsupervised learning.
Welcome to the third installment in our series exploring Amazon SageMaker's machine-learning algorithms. In the earlier entries, we tackled natural language processing and tabular data. Now, we venture into unsupervised learning, where data "speaks for itself," revealing patterns, clusters, and anomalies that might otherwise go unnoticed.
From identifying anomalies in network traffic with IP Insights to grouping data with K-means, unsupervised learning unlocks transformative solutions for complex data challenges. This installment includes actionable code examples to get you up and running with SageMaker's Estimator objects for these algorithms, offering a practical jumpstart to your implementations.
If you're new to this series, check out Part 1 on text processing and Part 2 on tabular data. In Part 4, we tackle time series data and computer vision.
IP Insights
Amazon SageMaker IP Insights is an unsupervised learning algorithm designed to understand the usage patterns of IPv4 addresses. The algorithm learns associations between IP addresses and entities like user IDs, revealing how these entities interact with online resources. This information is crucial for various applications, including detecting unusual activities, understanding user behavior, and enhancing security measures.
- Learns usage patterns of IP addresses to understand how entities interact with online resources.
- Unsupervised learning algorithm that identifies patterns without explicit labels.
- Provides insights into IP address usage, aiding anomaly detection, user behavior analysis, and security improvements.
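Before looking at applications, it helps to know what the algorithm actually consumes: training data is a headerless CSV file in which each row pairs an entity identifier with an IPv4 address. The snippet below is a minimal sketch of preparing and uploading such a file; the sample records, bucket name, and key are placeholder assumptions.

import csv

import boto3

# Hypothetical (entity, IPv4 address) pairs, e.g., user logins observed in access logs
records = [
    ("user_1", "192.0.2.10"),
    ("user_1", "192.0.2.11"),
    ("user_2", "198.51.100.7"),
]

# IP Insights expects headerless CSV rows of the form <entity>,<ip_address>
with open("train.csv", "w", newline="") as f:
    csv.writer(f).writerows(records)

# Upload the file to the S3 prefix used later as the "train" channel (placeholder bucket/key)
boto3.client("s3").upload_file("train.csv", "your-bucket", "path/to/train/data/train.csv")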
Practical Applications
IP Insights is ideal for applications requiring anomaly detection and behavioral analysis involving IP addresses. Here are the most significant use cases:
- Fraud Detection: Identifying suspicious login attempts or unusual access patterns by analyzing associations between IP addresses and user IDs.
- Security Monitoring: Detecting unusual traffic patterns, potential bot activity, and unauthorized access to online resources to improve system security.
- Preprocessing for Anomaly Detection Pipelines: Using IP Insights to generate embeddings or insights that serve as input features for downstream supervised anomaly detection algorithms, such as XGBoost or logistic regression.
Hyperparameters
The IP Insights algorithm has two required hyperparameters:
- num_entity_vectors: Specifies the number of unique entity embeddings the model should learn. A higher value is needed for datasets with many unique entities (e.g., user IDs). (e.g., num_entity_vectors=10000 creates embeddings for 10,000 unique entities)
- vector_dim: Defines the size (dimensionality) of each entity embedding vector. Larger dimensions allow the model to capture more complex patterns but increase computation time. (e.g., vector_dim=128 creates embeddings with 128 dimensions)
It also supports additional optional hyperparameters, used to tune performance. Here are a couple:
- epochs: The number of training iterations over the dataset. Increasing this value improves embedding accuracy but may lead to overfitting. (e.g., epochs=10 runs 10 full passes over the training data)
- learning_rate: Controls the step size for updating model weights during training. A smaller value improves convergence stability, while a larger value speeds up training. (e.g., learning_rate=0.001 uses a learning rate of 0.001)
Usage
The following example demonstrates how to set up and train an IP Insights model in Amazon SageMaker to detect unusual patterns in IP address usage. The process includes initializing a SageMaker session, defining training and validation datasets stored in S3, and configuring the IP Insights algorithm with essential hyperparameters like entity embeddings, batch size, and learning rate.
Once trained, the model artifacts are stored in S3, making them ready for deployment and integration into real-world applications. Here's a code example:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri

# Initialize a SageMaker session
session = sagemaker.Session()

# Get the execution role - this is used to give SageMaker access to your AWS resources
role = get_execution_role()

# Specify the data paths in S3
training_data = "s3://your-bucket/path/to/train/data"  # presumably 80% of your data
validation_data = "s3://your-bucket/path/to/validation/data"  # presumably 20% of your data

# Get the IP Insights container image
container = get_image_uri(session.boto_region_name, "ipinsights")

# Create the IP Insights estimator
ip_insights = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,  # Number of instances to use for training
    instance_type="ml.c5.xlarge",  # Instance type for training
    output_path="s3://your-bucket/output",  # Output path for model artifacts
    sagemaker_session=session,
    # Hyperparameters for IP Insights
    hyperparameters={
        "num_entity_vectors": "10000",  # Number of entity embeddings
        "vector_dim": "128",  # Size of embeddings
        "epochs": "10",  # Number of training epochs
        "learning_rate": "0.001",  # Learning rate
    },
)

# Specify the data channels for training
ip_insights.fit({"train": training_data, "validation": validation_data})
In this example, the IP Insights algorithm processes training data stored in Amazon S3 to learn associations between IP addresses and entities. A validation dataset is used to evaluate the model's generalization performance, ensuring reliable detection of unusual patterns in network behavior.
The final model artifacts are stored in S3 and can be deployed for real-world applications such as fraud detection, security monitoring, and behavioral analysis. The resulting embeddings can also be integrated into downstream machine learning models for further anomaly detection or classification tasks.
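As a rough sketch of that deployment step, the ip_insights estimator trained above can be hosted on a real-time endpoint and queried with new entity and IP pairs; the instance type and sample pairs below are assumptions. The endpoint returns a dot_product score per pair, and unusually low scores point to associations the model has rarely or never seen:

from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# Deploy the trained IP Insights model to a real-time endpoint
predictor = ip_insights.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",  # assumption; pick an instance that fits your traffic
    serializer=CSVSerializer(),
    deserializer=JSONDeserializer(),
)

# Score new (entity, IP address) pairs (hypothetical examples)
response = predictor.predict(["user_1,192.0.2.10", "user_1,203.0.113.99"])
print(response)  # e.g., {"predictions": [{"dot_product": ...}, {"dot_product": ...}]}

# Delete the endpoint when finished to avoid ongoing charges
predictor.delete_endpoint()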
K-means
Amazon SageMaker's K-means algorithm is an unsupervised learning technique designed to identify and group data into discrete clusters. It works by organizing data points so that members of the same cluster are as similar as possible, while members of different clusters are as distinct as possible. Users can customize the attributes the algorithm uses to determine similarity, making it a flexible option for various use cases.
- The K-means algorithm is an unsupervised learning algorithm.
- The algorithm attempts to find discrete groupings within data based on similarity.
- Users define the attributes used to determine similarity.
Practical Applications
K-means is widely used for clustering and grouping data into discrete clusters, making it a powerful tool for identifying patterns, segmenting data, and preparing datasets for other machine learning tasks. Here are the most practical examples:
- Retail: Customer segmentation for personalized marketing, product recommendation engines, and identifying purchasing behavior patterns.
- Healthcare: Grouping patients based on medical records, identifying disease subtypes, and analyzing treatment response patterns.
- Preprocessing for Supervised Learning: Grouping data into clusters as a preprocessing step to improve performance in supervised learning tasks like classification or regression. For example, cluster assignments can be used as additional features for algorithms such as XGBoost or logistic regression.
Hyperparameters
The K-means algorithm in SageMaker has two required hyperparameters:
- k: Specifies the number of clusters to create. The algorithm groups data points into k clusters, where members of the same cluster are as similar as possible, and those in different clusters are distinct. (e.g., k=10 creates 10 clusters)
- feature_dim: Defines the total number of input features (dimensions) in the dataset. This must match the number of features in the input data. (e.g., feature_dim=784 assumes the dataset has 784 input features)
It also supports additional optional hyperparameters that are used to optimize performance. Here are a couple:
- mini_batch_size: The number of samples processed in each training batch. Larger batch sizes improve training speed but require more memory. (e.g., mini_batch_size=500 processes 500 samples per batch)
- init_method: Determines how the initial cluster centers are chosen. Options include random (random selection of initial points) and kmeans++ (optimized initialization). (e.g., init_method='kmeans++' uses the K-means++ initialization method for better convergence)
Usage
Clustering algorithms like K-means are unsupervised, meaning they don't rely on labeled data for training. In K-means clustering, each cluster has a defined center, and during training, the algorithm groups data points based on their distance to these cluster centers. The number of clusters, represented as k, is a key hyperparameter that users define.
For example, in the code below, we specify k = 10 to create 10 clusters:
# Get the K-means container image
container = get_image_uri(session.boto_region_name, "kmeans")

# Create the K-means estimator
kmeans = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    output_path=output_path,
    sagemaker_session=session,
    hyperparameters={
        "k": "10",  # Number of clusters
        "feature_dim": "784",  # Number of input features
        "mini_batch_size": "500",  # Samples processed per batch
        "init_method": "kmeans++",  # Cluster center initialization
    },
)
Here, only two hyperparameters are required: k (the number of clusters) and feature_dim (the number of features in the dataset). The dataset contains 784 features in this example, which the K-means model will process. The other parameters, like mini-batch size, ensure efficient training by dividing the dataset into manageable chunks.
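To go from this configuration to actual cluster assignments, a minimal sketch might look like the following; it assumes the estimator above is assigned to kmeans, that CSV training data already sits at a placeholder S3 path, and that the inference instance type is merely illustrative. The endpoint reports the closest cluster and the distance to its center for every record, and those cluster labels can then be appended as features for downstream models:

from sagemaker.inputs import TrainingInput
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# Train on CSV data staged in S3 (placeholder path)
kmeans.fit({"train": TrainingInput("s3://your-bucket/path/to/kmeans/train", content_type="text/csv")})

# Deploy the trained model to a real-time endpoint
predictor = kmeans.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",  # assumption
    serializer=CSVSerializer(),
    deserializer=JSONDeserializer(),
)

# Each record must have feature_dim=784 values; the all-zero record here is a placeholder
result = predictor.predict([0.0] * 784)
print(result)  # e.g., {"predictions": [{"closest_cluster": 3.0, "distance_to_cluster": ...}]}

predictor.delete_endpoint()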
Principal Component Analysis (PCA)
The Amazon SageMaker PCA algorithm is an unsupervised machine learning algorithm that reduces the number of features in a dataset while retaining as much information as possible. This process, known as dimensionality reduction, is widely used in machine learning pipelines where feature reduction improves model performance and speeds up computation.
By leveraging Principal Component Analysis (PCA), you can transform high-dimensional datasets into a lower-dimensional space while preserving maximum variance. SageMaker allows you to easily train and deploy PCA models for scalable dimensionality reduction.
- Improves computational efficiency: Reduces the size of input features for faster model training and inference.
- Avoids overfitting: Reducing dimensions minimizes noise and redundancy in data.
- Better visualization: Visualizing lower-dimensional data (e.g., 2D or 3D) is easier for analysis.
- Flexible input formats: Supports both numpy arrays and Amazon RecordIO protobuf files stored in S3 (see the conversion sketch below).
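As a minimal sketch of that protobuf path (the array shape, bucket, and key below are placeholder assumptions), a NumPy matrix can be serialized with the SDK's helper and uploaded to S3 for training:

import io

import boto3
import numpy as np
import sagemaker.amazon.common as smac

# Hypothetical dataset: 5,000 rows with 1,000 features each
data = np.random.rand(5000, 1000).astype("float32")

# Convert the NumPy array to RecordIO-wrapped protobuf in memory
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, data)
buf.seek(0)

# Upload the serialized data to the S3 prefix used as the training channel
boto3.resource("s3").Object("your-bucket", "pca/train/data.pbr").upload_fileobj(buf)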
Practical Applications
PCA is widely used in various fields where reducing dimensionality while retaining critical information enhances performance and interpretability. Here are some practical examples:
- Financial Services: Detecting anomalies in transaction data, optimizing risk models by reducing correlated features, and improving fraud detection systems.
- Healthcare: Simplifying high-dimensional medical imaging data (e.g., MRI scans), analyzing gene expression datasets, and predicting disease outcomes.
- Retail: Reducing redundant features to improve product recommendation engines, customer segmentation, and sales predictions.
Hyperparameters
The PCA algorithm in SageMaker requires three mandatory hyperparameters:
- num_components: Specifies the number of principal components to retain. Lower values reduce dimensionality but may result in information loss. (e.g., num_components=10 retains the top 10 principal components)
- mini_batch_size: Defines the number of samples to process in each batch during training. Larger batch sizes improve training speed but require more memory. (e.g., mini_batch_size=500 processes 500 samples per batch)
- feature_dim: Specifies the total number of input features (dimensions) in the dataset. This must match the input feature size. (e.g., feature_dim=1000 assumes the input data has 1,000 features)
Usage
Principal Component Analysis (PCA) is an unsupervised algorithm that reduces the number of input features in a dataset while retaining as much information as possible. PCA projects data onto a set of orthogonal components that explain the most variance.
The number of principal components to retain is a key hyperparameter, defined as num_components. Additionally, feature_dim specifies the number of input features, and mini_batch_size ensures efficient processing during training.
For example, in the code below, we use a P2 instance (ml.p2.xlarge) for training. PCA supports both CPU and GPU instances, with GPU instances such as P2, P3, G4dn, and G5 often being more performant for large, high-dimensional datasets:
# Get the PCA container image
container = get_image_uri(session.boto_region_name, "pca")

# Create the PCA estimator
estimator = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type="ml.p2.xlarge",
    output_path=output_path,
    sagemaker_session=session,
    hyperparameters={
        "feature_dim": "1000",  # Number of input features
        "num_components": "10",  # Principal components to retain
        "mini_batch_size": "500",  # Samples processed per batch
    },
)
In this example, the dataset contains 1,000 features (feature_dim=1000), and PCA will project the data onto the top 10 principal components (num_components=10). The mini-batch size ensures that data is processed efficiently by dividing it into manageable chunks.
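To put the reduced representation to use, a rough sketch (with placeholder S3 paths, input values, and inference instance type) trains the estimator on RecordIO protobuf data and deploys it; each record sent to the endpoint then comes back as its projection onto the 10 retained components:

from sagemaker.inputs import TrainingInput
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# Train on RecordIO protobuf data staged in S3 (placeholder path)
estimator.fit({"train": TrainingInput("s3://your-bucket/pca/train",
                                      content_type="application/x-recordio-protobuf")})

# Deploy the trained PCA model to a real-time endpoint
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",  # assumption
    serializer=CSVSerializer(),
    deserializer=JSONDeserializer(),
)

# A single record with feature_dim=1000 values (all zeros here as a placeholder)
result = predictor.predict([0.0] * 1000)
print(result)  # e.g., {"projections": [{"projection": [...10 values...]}]}

predictor.delete_endpoint()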
Random Cut Forest (RCF)
The Amazon SageMaker Random Cut Forest algorithm is an unsupervised algorithm used for detecting anomalies in datasets. It identifies data points that deviate from well-structured or patterned data, such as spikes in time series data or unclassifiable data points. The algorithm works by creating a model that learns the underlying patterns in the data and then assigns an anomaly score to each data point based on how well it fits the learned patterns.
Practical Applications
RCF is widely used for unsupervised anomaly detection, making it ideal for scenarios where identifying unusual patterns or outliers is critical. Here are its most significant applications:
- Time Series Monitoring: Detecting anomalies such as sudden spikes, drops, or irregular patterns in sequential data. RCF is particularly effective for applications in financial services, such as identifying fraudulent transactions, and in IoT systems, where it can flag sensor irregularities for predictive maintenance.
- Operational Anomaly Detection: Monitoring real-time performance of systems to identify failures, faults, or unexpected behavior. RCF is commonly used in manufacturing to detect machine performance issues and equipment faults, as well as in IT infrastructure for identifying server load anomalies or potential system outages.
- Event-Based Outlier Detection: Identifying data points that deviate from expected patterns in non-sequential data. This includes applications in cybersecurity for detecting unusual access patterns or potential intrusions, and in healthcare for identifying irregular patient records or test results that may require further investigation.
Hyperparameters
RCF has a single required hyperparameter:
- feature_dim: The total number of input features (dimensions) in the dataset. (e.g., feature_dim=100 assumes the input data has 100 features)
RCF also supports additional optional hyperparameters, used to tune performance:
- num_trees: Specifies the number of trees in the forest. More trees improve accuracy but increase computation time. (e.g., num_trees=50 builds 50 trees)
- num_samples_per_tree: Defines the number of samples each tree uses during training. Larger values capture finer patterns. (e.g., num_samples_per_tree=512 uses 512 samples per tree)
Usage
The Random Cut Forest (RCF) algorithm in SageMaker is used for unsupervised anomaly detection. It identifies data points that deviate significantly from patterns in the input data and assigns an anomaly score based on how well each point aligns with the learned structure.
For RCF, the only required hyperparameter is feature_dim, which specifies the total number of input features in the dataset. Optional hyperparameters like num_trees and num_samples_per_tree can be used to tune the model's performance.
For example, in the code below, we configure an RCF model with feature_dim=100. RCF is a CPU-based algorithm that does not benefit from GPU hardware, so a compute-optimized C5 instance is used for training:
container = get_image_uri(session.boto_region_name, "randomcutforest")
estimator = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type="ml.c5.xlarge",  # CPU instance; RCF does not use GPUs
    output_path=output_path,
    sagemaker_session=session,
    hyperparameters={
        "feature_dim": "100",  # Total number of input features
        "num_trees": "50",  # Optional: Number of trees
        "num_samples_per_tree": "512",  # Optional: Samples per tree
    },
)
In this example, the RCF model processes data with 100 input features and builds a forest of 50 trees, each trained on 512 samples.
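Continuing this sketch (the training path, sample values, and inference instance type are placeholder assumptions), the estimator can be trained on CSV data staged in S3, deployed, and queried for per-record anomaly scores, where higher scores indicate more anomalous points:

from sagemaker.inputs import TrainingInput
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# Train on CSV data staged in S3 (placeholder path)
estimator.fit({"train": TrainingInput("s3://your-bucket/path/to/rcf/train", content_type="text/csv")})

# Deploy the trained RCF model to a real-time endpoint
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",  # assumption
    serializer=CSVSerializer(),
    deserializer=JSONDeserializer(),
)

# Each record must have feature_dim=100 values; higher scores suggest anomalies
result = predictor.predict([0.0] * 100)
print(result)  # e.g., {"scores": [{"score": ...}]}

predictor.delete_endpoint()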
Further Resources
For those looking to deepen their understanding of these algorithms and their implementation in SageMaker, the official Amazon SageMaker documentation is the most up-to-date and reliable resource. It provides the guidance needed to fine-tune models, navigate SageMaker's machine learning tools, and integrate solutions seamlessly into your workflows.
While this blog serves as a practical, hands-on guide to get you started, AWS documentation remains the definitive reference for production-level implementation. Whether you're exploring unsupervised learning techniques, such as anomaly detection with IP Insights or clustering with K-means, or building more complex machine learning workflows, this resource will help you scale your SageMaker expertise effectively. Examples using the SageMaker estimator can be found here.