I want to use GPU capacity for deep learning models. SageMaker is great for its flexibility in starting on-demand clusters for training. However, my department wants guarantees that we won't overspend our AWS budget. Is there a way to 'cap' the costs without resorting to a dedicated machine?
Best Answer
Here are a couple of ideas:
- You can use Service Catalog on top of SageMaker to restrict how end users consume the product (for example, limit instance types and permissions).
- On SageMaker Notebooks, you can use a lifecycle configuration to automatically shut down notebooks that are idle.
- You can also create AWS Lambda functions that automate controls (e.g. shut down notebook instances at night, send a notification if big machines are used, etc.).
- On SageMaker Training, you can use Spot capacity for significant savings on training costs (up to 90% is possible). You can also cap the training duration with train_max_run (renamed max_run in version 2 of the SageMaker Python SDK).
- It's a good idea to make sure your code makes good use of the GPU. For example, on P3 instances (V100 cards), try mixed-precision training so that training is faster and cheaper. Also tune data loading, batch size, and algorithm complexity so that the GPUs have enough work to do and don't just spend their time on data reads and model updates. Scale training vertically first (a bigger machine) before scaling horizontally: multi-machine training is generally more costly, since it is harder to write and debug and carries more communication overhead. This is more on your side than on the SageMaker side, though.
- Train with the SageMaker Training API, not in a SageMaker Notebook. When you write code in a notebook, very little time is spent on training and much more on writing, debugging, and reading documentation. All this time, the GPU sits idle. SageMaker Training, on the other hand, provides ephemeral, fresh instances billed per second only for the duration of training, and comes with additional benefits such as Spot, managed metrics, logs, data I/O, and metadata management.
- You can create cost allocation tags to report costs at a custom granularity.
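
The training-duration cap mentioned above can be turned into a rough worst-case budget figure. Here is a minimal sketch, not a definitive implementation: it builds the cost-related fields of a boto3 create_training_job request (EnableManagedSpotTraining, and StoppingCondition with MaxRuntimeInSeconds and MaxWaitTimeInSeconds) and multiplies the runtime cap by an hourly price. The helper names and the 4.00 USD/hour price are hypothetical, so look up the real price for your instance type and region. The estimate covers instance time only (no storage or data transfer), and a job may run slightly past the cap while it shuts down.

```python
def capped_training_settings(max_run_seconds, max_wait_seconds=None, use_spot=True):
    """Build the cost-related fields of a CreateTrainingJob request.

    The returned dict is meant to be merged into the keyword arguments of
    boto3's sagemaker client create_training_job call. This helper is
    hypothetical and only assembles the fields; it does not call AWS.
    """
    if max_run_seconds <= 0:
        raise ValueError("max_run_seconds must be positive")

    stopping_condition = {"MaxRuntimeInSeconds": max_run_seconds}

    if use_spot:
        # With managed Spot training, the wait time also covers time spent
        # waiting for Spot capacity, so it must be at least the run cap.
        if max_wait_seconds is None:
            max_wait_seconds = max_run_seconds
        if max_wait_seconds < max_run_seconds:
            raise ValueError("max_wait_seconds must be >= max_run_seconds")
        stopping_condition["MaxWaitTimeInSeconds"] = max_wait_seconds

    return {
        "EnableManagedSpotTraining": use_spot,
        "StoppingCondition": stopping_condition,
    }


def worst_case_cost(max_run_seconds, instance_count, hourly_price_usd):
    """Rough upper bound on instance charges for one training job, in USD.

    Assumes billing is proportional to run time and ignores Spot discounts,
    so the real bill should normally be lower.
    """
    hours = max_run_seconds / 3600
    return hours * instance_count * hourly_price_usd


if __name__ == "__main__":
    # Hypothetical numbers: 2-hour cap, 1 instance, 4.00 USD per hour.
    settings = capped_training_settings(max_run_seconds=2 * 3600,
                                        max_wait_seconds=3 * 3600)
    print(settings)
    print(worst_case_cost(2 * 3600, instance_count=1, hourly_price_usd=4.00))
```

Multiplying this per-job figure by the number of jobs you allow per month gives a number you can show your department, and cost allocation tags let you check actual spend against it afterwards.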