Nebius

Senior HPC Cluster Engineer

Job Summary

The role involves contributing to the development of a hyperscaler platform, with a focus on hardware virtualization, performance optimization, and hardware support. The ideal candidate has extensive experience with Linux systems, server architecture, and virtualization technologies such as QEMU/KVM. Responsibilities include improving infrastructure, analyzing issues, and enhancing hardware compatibility, particularly in GPU-accelerated and HPC environments. The position offers competitive compensation, opportunities for growth, and a collaborative work setting within a company focused on the AI and ML industries.

Required Skills

Networking

Deep Learning Frameworks

MPI

Linux Systems

Hardware Virtualization

Device Emulation

GPU-accelerated computing

Performance Programming

Server Architecture

PCIe Devices

NICs

Kernel Drivers

QEMU/KVM

Hypervisor

NCCL

Benefits

Competitive Salary

Professional Growth Opportunities

Comprehensive Benefits Package

Collaborative Work Environment

Hybrid Working

Job Description

Why work at Nebius
Nebius is leading a new era in cloud computing to serve the global AI economy. We create the tools and resources our customers need to solve real-world challenges and transform industries, without massive infrastructure costs or the need to build large in-house AI/ML teams. Our employees work at the cutting edge of AI cloud infrastructure alongside some of the most experienced and innovative leaders and engineers in the field.

Where we work
Headquartered in Amsterdam and listed on Nasdaq, Nebius has a global footprint with R&D hubs across Europe, North America, and Israel. The team of over 800 employees includes more than 400 highly skilled engineers with deep expertise across hardware and software engineering, as well as an in-house AI R&D team.

The role

We’re looking for a Senior HPC Cluster Engineer to join our team and play a key role in the development of our cutting-edge hyperscaler platform. The GPU & InfiniBand team is responsible for enhancing and optimizing the core components of our Cloud platform, with a specific focus on GPU computing, InfiniBand networks, and the KVM/QEMU stack. You’ll work closely with hardware virtualization and device emulation technologies, ensuring high performance and security in multi-GPU, HPC environments. The role involves analyzing, troubleshooting, and improving infrastructure to support new hardware, fine-tuning system performance, and automating fault detection and resolution in a complex system.

In this position, you will be responsible for:

Tuning the performance of GPU clusters and InfiniBand networks to ensure optimal operation in HPC and GPU-based environments.
Analyzing and troubleshooting the root cause of issues related to GPUs and InfiniBand networks, and proposing corrective actions.
Integrating new hardware into the existing infrastructure, including support for new GPU hardware through software stacks like Kubernetes, QEMU, and KVM.
Enhancing automation systems for proactively monitoring, detecting, and resolving issues in GPU and InfiniBand environments.
Configuring and managing GPU devices and InfiniBand fabrics, ensuring efficient and reliable operation.

We expect you to have:

5+ years of professional experience in system-level software development (focused on performance optimization and low-level programming).
3+ years of hands-on experience with Linux systems (administration, troubleshooting, and performance tuning).
In-depth understanding of server architecture, including PCIe devices, NICs, Linux OS/Kernel, and high-performance computing (HPC) systems.
Strong proficiency in one or more performance-oriented programming languages (C/C++, Go, Python).

It would be a plus if you have:

Experience with GPU end-to-end testing in a cluster environment using InfiniBand networking.
Proven track record of analyzing and optimizing the performance of HPC workloads (e.g., simulations, data analysis, AI/ML workloads).
Familiarity with RDMA, RoCE, and InfiniBand protocols for high-performance communication.
Background in Software-Defined Networking (SDN) and experience with HPC cluster networking.
Understanding of QEMU/KVM virtualization and experience managing virtualized environments.
Experience with deep learning frameworks such as PyTorch and TensorFlow, and their integration with HPC systems.
Familiarity with collective communication libraries like MPI and NCCL for distributed computing.

We conduct coding interviews as part of the process.

What we offer

Competitive salary and comprehensive benefits package.
Opportunities for professional growth within Nebius.
Hybrid working arrangements.
A dynamic and collaborative work environment that values initiative and innovation.

We’re growing and expanding our products every day. If you’re up to the challenge and are as excited about AI and ML as we are, join us!


Application deadline: Open until filled

Nebius

Discover the most efficient way to build, tune and run your AI models and applications on top-notch NVIDIA® GPUs.


Date Posted: July 24th, 2025

Job Type: Full Time

Location: Prague, Czech Republic; Remote - Europe

Salary: Competitive rates

Remote opportunity (requires residency in the Czech Republic) for a Senior HPC Cluster Engineer at Nebius. Full time, with a competitive salary.


