GPU Infrastructure Architect – GPU Cluster Stand-up, Configuration & Operations (AMD MI350)

Remote
C2C

The GPU Infrastructure Architect is responsible for designing, provisioning, and operating AMD MI350-based GPU clusters on a cloud platform. The role ensures that this GPU infrastructure is scalable, secure, and reproducible in support of distributed training and other high-performance workloads.

Key Responsibilities

  • Design end-to-end GPU cluster architecture covering compute, networking, storage, and control services.
  • Provision and operationalize up to 9 AMD MI350 GPU clusters, subject to confirmed cloud SKU availability.
  • Configure GPU compute nodes including base OS images, GPU drivers, runtime libraries, and distributed training dependencies.
  • Implement automation for node imaging, bootstrapping, lifecycle management, patching, and upgrades.
  • Standardize environments using reproducible builds and Infrastructure-as-Code (IaC).
  • Enable workload portability through containerized environments and documented deployment patterns.
  • Implement OS baseline hardening, restricted administrative access, and secure cluster access controls.
  • Establish monitoring, logging, and operational runbooks to ensure reliability and performance.