AI Infrastructure Engineer
Job Summary
We are seeking a highly skilled AI Infrastructure Engineer to design, implement, and maintain scalable infrastructure that supports artificial intelligence and machine learning workloads. The ideal candidate will have experience with cloud platforms, high-performance computing (HPC), GPU environments, containerization, and infrastructure automation. You will play a critical role in ensuring the reliability, performance, and scalability of AI systems used across the organization.
Key Responsibilities
- Design, deploy, and manage AI/ML infrastructure environments for training and inference workloads.
- Build and maintain high-performance computing (HPC) clusters optimized for AI applications.
- Configure and manage GPU-based systems and distributed computing environments.
- Optimize storage, networking, and compute resources to maximize performance and cost efficiency.
- Implement infrastructure automation using Infrastructure as Code (IaC) tools.
- Monitor system performance, availability, and security across AI platforms.
- Collaborate with Data Scientists, ML Engineers, and Software Developers to support AI model deployment and operations.
- Troubleshoot infrastructure bottlenecks and performance issues.
- Ensure platform scalability, reliability, disaster recovery, and business continuity.
- Manage containerized workloads using Kubernetes and Docker.
- Maintain cloud-based AI environments on AWS, Azure, or Google Cloud Platform (GCP).
- Establish security best practices for AI infrastructure and sensitive data.
Required Qualifications
- Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field.
- 3+ years of experience in cloud infrastructure, DevOps, Site Reliability Engineering (SRE), or AI platform engineering.
- Strong knowledge of Linux system administration.
- Experience with cloud platforms such as AWS, Azure, or GCP.
- Hands-on experience with Docker and with container orchestration using Kubernetes.
- Understanding of GPU computing technologies such as NVIDIA CUDA, and of how GPU clusters operate.
- Experience with infrastructure automation tools like Terraform, Ansible, or CloudFormation.
- Knowledge of networking, storage systems, and distributed computing concepts.
- Familiarity with AI/ML frameworks such as TensorFlow, PyTorch, or JAX.
Preferred Qualifications
- Experience managing large-scale AI training environments.
- Knowledge of MLOps practices and tools.
- Experience with monitoring tools such as Prometheus, Grafana, ELK Stack, or Datadog.
- Familiarity with Apache Spark, Ray, or other distributed computing frameworks used for AI workloads.
- Relevant cloud certifications (AWS, Azure, or GCP).
Technical Skills
- Linux Administration
- Kubernetes & Docker
- Terraform / Ansible
- AWS / Azure / GCP
- GPU Infrastructure Management
- Python, Bash, or Go
- Networking & Storage Systems
- Monitoring & Observability
- CI/CD Pipelines
- AI/ML Platform Engineering
Salary Range
- Typical Salary: $170,000 – $230,000+ per year (depending on experience, location, and expertise)
Benefits
- Competitive compensation package
- Performance bonuses
- Health, dental, and vision insurance
- Remote or hybrid work options
- Professional development and certification support
- Access to cutting-edge AI technologies and projects
Job Features
| Job Category | AI Manager, Data Science |


