Talent.com
HPC Performance and Validation Engineer

NorthMark Strategies • Dallas, TX, United States
Posted 30 days ago
Job type: Full-time

The Company

NorthMark Compute & Cloud (NMC²) is backed by dedicated leadership and investment, with a clear mission as it operates at the bleeding edge of technology. Its goal is to scale and enhance the high-performance computing (HPC) and cloud infrastructure that supports its clients' research, production, and delivery, enabling breakthroughs that shape the industries of tomorrow. Its engineers build critical infrastructure to eliminate friction in scientific research, simulations, analysis, and decision-making, accelerating discovery and driving faster innovation.

The Position

As an HPC Performance and Validation Engineer at NMC², you will own the validation and optimization of our HPC CPU and GPU calc farms. This critical role involves developing a validation and performance-baselining framework that ensures system readiness for AI/ML and HPC workloads across multiple architectures. You will provide continuous performance benchmarking, real-time observability, and long-term strategic readiness. You will drive the implementation of advanced tooling and frameworks, maintaining an infrastructure that is crucial to our cutting-edge research efforts. You will be accountable for providing data-driven performance metrics to support architectural design choices as we continue to scale our datacenter footprint globally. We are looking for someone with deep technical expertise in compute, storage, or networking optimization and performance engineering who can develop solutions that scale with our growing infrastructure. This role demands a forward-thinking engineer who can anticipate industry trends and adopt emerging architectures and strategies to keep NMC² at the forefront of innovation.

Responsibilities:

  • Architecting and implementing a validation framework to certify the readiness and utilization of GPU nodes across a large, distributed HPC environment.
  • Defining methodologies to continually assess performance and optimise infrastructure across AI/ML workloads
  • Developing and executing comprehensive performance testing using industry- and customer-specific benchmarks, ensuring optimal performance across HPC compute, storage and networking
  • Contributing to research reports that describe the benchmarking findings and evaluate overall hardware performance and efficiency
  • Leading efforts to identify, debug and resolve bottlenecks in system performance
  • Building robust, scalable tools for automated validation and testing, utilising Python, Go, Kubernetes and CI/CD pipelines to streamline continuous validation and benchmarking processes
  • Implementing monitoring solutions using Prometheus, Grafana and other modern monitoring technologies to track performance metrics and real-time health of the cluster
  • Defining and implementing best practices for continuous performance validation, ensuring that the infrastructure remains reliable and efficient as new technologies emerge
  • Staying informed on industry trends and advancements to ensure long-term strategic alignment
  • Working cross-functionally with engineering, infrastructure and research teams to align validation efforts with the broader business objectives, ensuring that the platform meets evolving research demands

Requirements:

  • Accelerator performance experience, including profiling and tuning on large-scale GPU clusters
  • In-depth understanding of NVIDIA ClusterKit, Nsight and Validation Suite, MLPerf and DCGM tools for GPUs and DPUs
  • Networking and storage performance experience, including profiling and optimisation with NVIDIA ClusterKit, iPerf or equivalent across InfiniBand/RoCE network implementations
  • System benchmarking experience on Linux and familiarity with the Phoronix Test Suite or equivalent
  • Experience with HPC workloads across globally distributed locations, bringing data-driven performance insights to complement key architectural decisions
  • Strong proficiency in developing automation tools and microbenchmarking frameworks for validation using Python, Go, and Kubernetes in an Ubuntu Linux environment
  • Expertise with key monitoring platforms, including OpenTelemetry (OTel), Prometheus, ELK and Grafana, and in defining and implementing the overall observability strategy for HPC validation and performance monitoring
  • A deep understanding of emerging technologies, architectures and strategies, with the ability to assess their potential impact on infrastructure and adopt them as part of a long-term plan
  • Proven ability to lead complex technical projects, influence decisions and engage with stakeholders across technical and research teams