In the Software Department, you're shaping robotic solutions that redefine human-machine collaboration. You'll work with cutting-edge technology, setting industry-changing standards. Not only will you help develop our solutions, but you'll also set new trends and drive innovations forward. In an agile and interdisciplinary team, you'll engage in exciting projects. With clear Scrum processes like daily stand-ups, sprint planning, and reviews, you remain flexible and efficient. Collaborating closely with other departments allows you to create software solutions that are both technically advanced and practically effective. Here, you'll find an environment where creativity and technological excellence go hand in hand. If you're eager to turn ideas into reality and enjoy taking technology to the next level, the Software Development Team at NEURA offers the perfect challenge for you.
GPU Cluster Engineer - Large-Scale AI Training Infrastructure (Human)
Neura Robotics GmbH • Metzingen
Shape the Future of Human-Robot Collaboration
- You are the go-to expert for NEURA's GPU cluster infrastructure - a large-scale AWS HyperPod environment running cutting-edge GPU instances for foundation model training and customer fine-tuning workloads. You design the operational framework, build self-service tooling for ML teams, and work directly with AWS to influence the platform at the hyperscaler level.
- Your focus is on cluster engineering and operations — not on ML research itself, but on making sure the people doing that research have rock-solid, efficient, and accessible infrastructure under them.
- Setting up, configuring, and continuously evolving NEURA's HyperPod clusters, including HyperPod/Slurm and HyperPod/EKS orchestration models.
- Designing and implementing strategies for cluster stability: node failure detection, automated job recovery, checkpoint coordination, and fault-tolerant multi-node training workflows.
- Providing a workload priority management framework that allows multiple teams and use cases (such as foundation model pretraining, fine-tuning, and customer workloads) to share cluster capacity efficiently and fairly.
- Optimizing end-to-end GPU utilization: identifying and resolving bottlenecks across compute, GPU memory, EFA networking, and storage throughput.
- Working directly and closely with the AWS HyperPod product and solutions engineering teams, escalating operational issues, sharing learnings from one of the platform's largest deployments, and placing concrete requirements on the roadmap.
- Providing self-service tooling that allows ML researchers and engineers to launch, monitor, and manage training jobs independently, without requiring infrastructure intervention for routine operations.
- Developing onboarding documentation, training materials, and internal workshops that enable users to operate efficiently, follow best practices, and understand the cost implications of their workloads.
- Infrastructure as Code is a given for you. Every cluster configuration, every operational change, every new environment is code first.
- Owning the cost and capacity strategy: Spot instance management, Reserved Instance planning, Savings Plans, and ongoing commitment negotiations with AWS.
- 5+ years of experience in infrastructure or systems engineering, with a strong focus on GPU cluster or HPC operations.
- Deep hands-on experience with AWS GPU instances; direct prior experience with HyperPod is a strong differentiator.
- Solid understanding of both Slurm and Kubernetes as cluster orchestration layers, and the ability to evaluate their trade-offs for large-scale GPU workloads.
- Practical knowledge of distributed training - you understand what affects throughput and how to debug it.
- Experience building self-service tooling and operational documentation for technical end users. You make complex infrastructure accessible, not just functional.
- Strong understanding of cloud cost management at scale: Spot interruption handling, capacity reservations, cost attribution across teams and workloads.
- Comfort working across organizational boundaries — your primary partners are ML researchers, but you'll also work closely with product, finance, and cloud vendor teams.
- Strong English communication skills. German is a plus.
What you can look forward to
Creative Freedom and Agility
Enjoy a dynamic, self-reliant work culture with flat hierarchies, flexible hours, and 30 vacation days. Ideal for those seeking an inspiring professional setting, whether you're just starting out or are an experienced executive.
Passion for Winning
A passionate and highly skilled team of international experts aiming to redefine robot assistants.
Attractive Compensation
Enjoy a competitive salary package along with exclusive employee discounts.
One Team
Whether it's a summer party or company town hall meetings, we celebrate our successes together.
Professional Growth
Support for your personal and professional development.
Andre Jank
Our values. The cornerstones of our success.
We are a team. We strive to achieve great things by promoting the success of our colleagues and partners.
We strive for technological progress in order to give people back their valuable time for enjoyable activities.
We strive to revolutionize the world of robotics by pushing the boundaries of technology every day.
We live a high level of appreciation through open communication and transparency.
We do our best to always be two steps ahead. We achieve this through empowerment, freedom of action and personal responsibility.
People are at the center of everything we do.
Our Location
Our headquarters in Metzingen and Riederich are the heart of our company. They are home not just to our offices, but also to our production facilities, Academy, logistics, and Tech Labs, all working together to turn ideas into reality. Riederich itself is a small, peaceful town, just a kilometer away from Metzingen, a city with its own unique character. Metzingen is globally renowned as Outlet City, attracting visitors from all over the world. Here, you can enjoy exclusive designer stores in a relaxed and charming setting. The city also offers a variety of restaurants, cafés, and a down-to-earth Swabian coziness, perfect for unwinding after work.
Our application process
We ensure a transparent and efficient application process and look forward to getting to know you along the way.