Today, we’re announcing the general availability of flexible Amazon SageMaker HyperPod training plans to help data scientists train large foundation models (FMs) within their timelines and budgets, saving them weeks of effort managing the training process based on compute availability.
At AWS re:Invent 2023, we introduced SageMaker HyperPod to reduce FM training time by up to 40 percent and scale across thousands of compute resources in parallel with preconfigured distributed training libraries and built-in resiliency. Most generative AI model development tasks require parallel accelerated computing resources, but customers often struggle to get timely access to those resources and to complete training within their time and budget constraints.
With today’s announcement, you can find the accelerated compute resources you need for training, create the most optimal training plans, and run training workloads across different capacity blocks based on the availability of compute resources. In just a few steps, you can specify your training completion date and budget, estimate your resource requirements, create an optimal training plan, and run fully managed training jobs without manual intervention.
SageMaker HyperPod Training Plans in Action
To get started, go to the Amazon SageMaker AI console, choose Training plans in the left navigation pane, and then choose Create a training plan.
For example, choose your preferred training dates and duration (10 days) and the instance type and count (16 ml.p5.48xlarge) for your SageMaker HyperPod cluster, and then choose Find a training plan.
SageMaker HyperPod suggests a training plan that is divided into two five-day segments, along with the total upfront cost of the plan.
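If you prefer to script this step, the following is a minimal sketch using the AWS SDK for Python (Boto3) to search for available training plan offerings. The exact request and response field names (such as DurationHours and UpfrontFee) are assumptions here; check the SageMaker AI API reference for the authoritative shapes.

```python
# Minimal sketch: search for HyperPod training plan offerings with Boto3.
# Field names are assumptions; verify against the SearchTrainingPlanOfferings
# API reference before relying on them.
from datetime import datetime, timedelta, timezone

import boto3

sagemaker = boto3.client("sagemaker", region_name="us-east-1")

now = datetime.now(timezone.utc)
response = sagemaker.search_training_plan_offerings(
    InstanceType="ml.p5.48xlarge",
    InstanceCount=16,
    StartTimeAfter=now,
    EndTimeBefore=now + timedelta(days=10),
    DurationHours=240,                     # 10 days of requested training time
    TargetResources=["hyperpod-cluster"],  # reserve capacity for a HyperPod cluster
)

for offering in response["TrainingPlanOfferings"]:
    # Each offering describes the proposed capacity segments and upfront fee.
    print(offering["TrainingPlanOfferingId"], offering.get("UpfrontFee"))
```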
If you accept this training plan, add your training details in the next step and choose Create your plan.
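If you create the plan programmatically instead, a sketch like the following purchases the selected offering, assuming the CreateTrainingPlan API and the offering object returned by the search above; the plan name is an example.

```python
# Minimal sketch: create (purchase) a training plan from a selected offering.
# The plan name is an example; "offering" comes from the search shown earlier.
plan = sagemaker.create_training_plan(
    TrainingPlanName="p5-pretraining-plan",
    TrainingPlanOfferingId=offering["TrainingPlanOfferingId"],
)
print(plan["TrainingPlanArn"])
```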
After you create a training plan, it appears in your list of training plans. You must pay the upfront price within 12 hours of creating a plan. In this example, one plan is in the Active state and already running, with all of its instances in use. The second plan is scheduled to start later, but you can already submit tasks that will run automatically when the schedule starts.
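You can also check plan status outside the console. Here is a minimal sketch, assuming the ListTrainingPlans and DescribeTrainingPlan APIs; the summary and status field names are assumptions.

```python
# Minimal sketch: list training plans and inspect their status
# (for example, Active or Scheduled). Field names are assumptions.
for summary in sagemaker.list_training_plans()["TrainingPlanSummaries"]:
    plan_detail = sagemaker.describe_training_plan(
        TrainingPlanName=summary["TrainingPlanName"]
    )
    print(
        plan_detail["TrainingPlanName"],
        plan_detail["Status"],
        plan_detail.get("InUseInstanceCount"),
        plan_detail.get("AvailableInstanceCount"),
    )
```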
While a plan is active, its compute resources are available in the SageMaker HyperPod cluster, automatically resume after any pauses in availability, and terminate at the end of the schedule. The first segment is currently running, and the next segment is queued to run after the current segment completes.
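To consume the reserved capacity from a script, a sketch like the following attaches the plan to a HyperPod cluster instance group when creating the cluster. The cluster name, lifecycle script location, and execution role are placeholders, and the TrainingPlanArn field on the instance group is an assumption to verify against the CreateCluster API reference.

```python
# Minimal sketch: create a HyperPod cluster whose instance group draws its
# capacity from the training plan. Names, the S3 lifecycle script location,
# and the IAM role are placeholders; TrainingPlanArn on the instance group
# is an assumption.
sagemaker.create_cluster(
    ClusterName="fm-training-cluster",
    InstanceGroups=[
        {
            "InstanceGroupName": "accelerated-worker-group",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 16,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://your-bucket/lifecycle-scripts/",
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodExecutionRole",
            "TrainingPlanArn": plan["TrainingPlanArn"],
        }
    ],
)
```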
This automatic resume behavior is similar to Managed Spot training in SageMaker AI, where SageMaker AI handles interrupted instances and continues training without manual intervention. To learn more, visit SageMaker HyperPod training plans in the Amazon SageMaker AI Developer Guide.
Now available
Amazon SageMaker HyperPod training plans are now available in the US East (N. Virginia), US East (Ohio), and US West (Oregon) AWS Regions and support ml.p4d.48xlarge, ml.p5.48xlarge, ml.p5e.48xlarge, ml.p5en.48xlarge, and ml.trn2.48xlarge instances. Trn2 and P5en instances are available only in the US East (Ohio) Region. To learn more, visit the SageMaker HyperPod product page and the SageMaker AI pricing page.
Try the HyperPod training plans in the Amazon SageMaker AI console and submit feedback on AWS re:Post for SageMaker AI or through your usual AWS support contacts.
— Channy