About Us:
Positron.ai develops custom hardware systems that accelerate AI inference, delivering significant gains over traditional GPU-based systems in both performance per dollar and performance per watt. Positron exists to create the world's best AI inference systems.
Senior Software Engineer - Machine Learning Systems & High-Performance LLM Inference
We are seeking a Senior Software Engineer to help develop the high-performance software that powers execution of open-source large language models (LLMs) on our custom appliance. The appliance combines FPGAs and x86 CPUs to accelerate transformer-based models. The software stack is written primarily in modern C++ (C++17/20) and relies heavily on templates, SIMD optimizations, and efficient parallel computing techniques.
Key Areas of Focus & Responsibilities
- Design and implement high-performance inference software for LLMs on custom hardware.
- Develop and optimize C++-based libraries that efficiently utilize SIMD instructions, threading, and the memory hierarchy (see the illustrative sketch after this list).
- Work closely with FPGA and systems engineers to ensure efficient data movement and computational offloading between x86 CPUs and FPGAs.
- Optimize model execution via low-level optimizations, including vectorization, cache efficiency, and hardware-aware scheduling.
- Contribute to performance profiling tools and methodologies to analyze execution bottlenecks at the instruction and data flow levels.
- Apply NUMA-aware memory management techniques to optimize memory access patterns for large-scale inference workloads.
- Implement ML system-level optimizations such as token streaming, KV cache optimizations, and efficient batching for transformer execution.
- Collaborate with ML researchers and software engineers to integrate model quantization techniques, sparsity optimizations, and mixed-precision execution.
- Ensure all code contributions include unit, performance, acceptance, and regression tests as part of a continuous integration-based development process.
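To give a flavor of the SIMD and vectorization work referenced above, here is a minimal, hedged sketch (not Positron's actual code): a dot-product kernel vectorized with AVX2 intrinsics and a scalar tail. The function name is hypothetical, and AVX2 with FMA support on the host CPU is assumed.

```cpp
// Illustrative sketch only: assumes an x86 CPU with AVX2 + FMA.
#include <immintrin.h>
#include <cstddef>

// Dot product of two float arrays, processed 8 lanes at a time with AVX2,
// followed by a scalar tail for lengths not divisible by 8.
float dot_product_avx2(const float* a, const float* b, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);   // fused multiply-add per lane
    }
    // Horizontal reduction of the 8 accumulator lanes.
    alignas(32) float lanes[8];
    _mm256_store_ps(lanes, acc);
    float sum = 0.0f;
    for (float v : lanes) sum += v;
    // Scalar tail for the remaining elements.
    for (; i < n; ++i) sum += a[i] * b[i];
    return sum;
}
```

In production code of this kind, the same pattern typically extends to AVX-512, cache-blocked tensor kernels, and NUMA-aware data placement, which is the level of optimization this role focuses on.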
Required Skills & Experience
- 7+ years of professional experience in C++ software development, with a focus on performance-critical applications.
- Strong understanding of C++ templates and modern memory management.
- Hands-on experience with SIMD programming (AVX-512, SSE, or equivalent) and intrinsics-based vectorization.
- Experience in high-performance computing (HPC), numerical computing, or ML inference optimization.
- Experience with ML model execution optimizations, including efficient tensor computations and memory access patterns.
- Knowledge of multi-threading, NUMA architectures, and low-level CPU optimization.
- Proficiency with systems-level software development, profiling tools (Perfetto, VTune, Valgrind), and benchmarking.
- Experience working with hardware accelerators (FPGAs, GPUs, or custom ASICs) and designing efficient software-hardware interfaces.
Preferred Skills (Nice to Have)
- Familiarity with LLVM/Clang or GCC compiler optimizations.
- Experience in LLM quantization, sparsity optimizations, and mixed-precision computation.
- Knowledge of distributed inference techniques and networking optimizations.
- Understanding of graph partitioning and execution scheduling for large-scale ML models.
Why Join Us?
- Work on a cutting-edge ML inference platform that redefines performance and efficiency for LLMs.
- Tackle challenging low-level performance engineering problems in AI and HPC.
- Collaborate with a team of hardware, software, and ML experts building an industry-first product.
- Opportunity to contribute to and shape the future of open-source AI inference software.