Talent.com
HPC Performance and Validation Engineer
This job is no longer accepting applications

NorthMark Strategies • Dallas, TX, United States
Posted 30 days ago
Job type
  • Full-time
Job description

The Company

NorthMark Compute & Cloud (NMC²) is backed by dedicated leadership and investment, with a clear mission as it operates at the bleeding edge of technology. Its goal is to scale and enhance the high-performance computing (HPC) and cloud infrastructure that supports its clients' research, production, and delivery, enabling breakthroughs that shape the industries of tomorrow. Its engineers build critical infrastructure to eliminate friction in scientific research, simulations, analysis, and decision-making, accelerating discovery and driving faster innovation.

The Position

As an HPC Performance and Validation Engineer at NMC², you will take ownership of the validation and optimization of our HPC CPU and GPU calc farms. This critical role involves developing a validation and performance baselining framework that ensures system readiness for AI / ML and HPC workloads across multiple architectures. You will provide continuous performance benchmarking, real-time observability, and long-term strategic readiness. You will drive the implementation of advanced tooling and frameworks, maintaining infrastructure that is crucial to our cutting-edge research efforts. You will be accountable for providing data-driven performance metrics to support architectural design choices as we continue to scale our datacenter footprint globally. We are looking for someone with deep technical expertise in compute, storage, or networking optimization and performance engineering who can develop solutions that scale with our growing infrastructure. This role demands a forward-thinking engineer who can anticipate industry trends and adopt emerging architectures and strategies to keep NMC² at the forefront of innovation.

Responsibilities:

  • Architecting and implementing a validation framework to certify the readiness and utilization of GPU nodes across a large, distributed HPC environment
  • Defining methodologies to continually assess performance and optimise infrastructure across AI / ML workloads
  • Developing and executing comprehensive performance testing using industry-standard and customer-specific benchmarks, ensuring optimal performance across HPC compute, storage and networking
  • Contributing to research reports that describe the benchmarking findings and evaluate overall hardware performance and efficiency
  • Leading efforts to identify, debug and resolve bottlenecks in system performance
  • Building robust, scalable tools for automated validation and testing, utilising Python, Go, Kubernetes and CI / CD pipelines to streamline continuous validation and benchmarking processes
  • Implementing monitoring solutions using Prometheus, Grafana and other modern monitoring technologies to track performance metrics and real-time health of the cluster
  • Defining and implementing best practice for continuous performance validation, ensuring that the infrastructure remains reliable and efficient as new technologies emerge
  • Staying informed on industry trends and advancements to ensure long-term strategic alignment
  • Working cross-functionally with engineering, infrastructure and research teams to align validation efforts with the broader business objectives, ensuring that the platform meets evolving research demands

Requirements:

  • Accelerator performance experience, including profiling and tuning on large-scale GPU clusters
  • In-depth understanding of NVIDIA ClusterKit, Nsight and Validation Suite, MLPerf and DCGM tools for GPUs and DPUs
  • Networking and storage performance experience, including profiling and optimisation with NVIDIA ClusterKit, iPerf or equivalent across InfiniBand / RoCE network implementations
  • System benchmarking experience on Linux and familiarity with the Phoronix Test Suite or equivalent
  • Experience with HPC workloads across distributed global locations, bringing data-driven performance data to complement key architectural decisions
  • Strong proficiency in developing automation tools and microbenchmarking frameworks for validation using Python, Go, and Kubernetes in an Ubuntu Linux environment
  • Expertise with key monitoring platforms including OpenTelemetry (OTel), Prometheus, ELK and Grafana, and in defining and implementing the overall observability strategy for HPC validation and performance monitoring
  • A deep understanding of emerging technologies, architectures and strategies, with the ability to assess their potential impact on infrastructure and adopt them as part of a long-term plan
  • Proven ability to lead complex technical projects, influence decisions and engage with stakeholders across technical and research teams