Location: Remote
Contract type: C2C
Contact: ankita.k@hirekeyz.com
The GPU Cluster Architect is responsible for designing, provisioning, and operating AMD MI350–based GPU clusters on a cloud platform. The role ensures that this GPU infrastructure is scalable, secure, and reproducible so that it can support distributed training and other high-performance workloads.
Key Responsibilities
- Design end-to-end GPU cluster architecture covering compute, networking, storage, and control services.
- Provision and bring into operation up to nine AMD MI350 GPU clusters, subject to confirmed cloud SKU availability.
- Configure GPU compute nodes including base OS images, GPU drivers, runtime libraries, and distributed training dependencies.
- Implement automation for node imaging, bootstrapping, lifecycle management, patching, and upgrades.
- Standardize environments using reproducible builds and Infrastructure-as-Code (IaC).
- Enable workload portability through containerized environments and documented deployment patterns.
- Implement OS baseline hardening, restricted administrative access, and secure cluster access controls.
- Establish monitoring, logging, and operational runbooks to keep the clusters reliable and performant.
