Infrastructure Engineer

FAR.AI • Berkeley, California, United States

About FAR.AI

FAR.AI is a non-profit AI research institute dedicated to ensuring advanced AI is safe and beneficial for everyone. Our mission is to facilitate breakthrough AI safety research, advance global understanding of AI risks and solutions, and foster a coordinated global response.

Since our founding in July 2022, we've grown quickly to 30+ staff, produced over 40 influential academic papers, and established the leading AI safety events for research and international cooperation. Our work is recognized globally, with publications at premier venues such as NeurIPS, ICML, and ICLR, and features in the Financial Times, Nature News, and MIT Technology Review.

We drive practical change through red-teaming with frontier model developers and government institutes. Additionally, we help steer and grow the AI safety field through developing research roadmaps with renowned researchers such as Yoshua Bengio, running FAR.Labs, an AI safety-focused co-working space in Berkeley housing 40 members, and supporting the community through targeted grants to technical researchers.

About FAR.Research

Our research team likes to move fast. We explore promising research directions in AI safety and scale up only those showing a high potential for impact. Unlike other AI safety labs that take a bet on a single research direction, FAR.AI aims to pursue a diverse portfolio of projects.

Our current focus areas include:

Investigating deception in AI (e.g. lie detectors can either induce honesty or evasion).

Building a science of robustness (e.g. finding vulnerabilities in superhuman Go AIs).

Advancing model evaluation techniques (e.g. inverse scaling, codebook features, and learned planning).

We also put our research into practice through red-teaming engagements with frontier AI developers, and collaborations with government institutes.

About the Role

We’re seeking an Infrastructure Engineer to develop and manage scalable infrastructure to support our research workloads. You will own our existing Kubernetes cluster, deployed on top of bare-metal H100 cloud instances. You will oversee and enhance the cluster to support 1) new workloads, such as multi-node LoRA training; 2) new users, as we double the size of our research team in the next twelve to eighteen months; and 3) new features, such as fine-grained tracking of experiment compute usage.

You will be the point person for cluster-related work. You will work on the Foundations team alongside experienced engineers, including those who designed and built the cluster, who can provide guidance and backup. However, as our first dedicated infrastructure hire, you will need to work autonomously, design solutions to varied and complex problems, and communicate with researchers who are technically skilled but less knowledgeable about our cluster and infrastructure.

This is an opportunity to build the technical foundations of the largest independent AI safety research institute, with one of the most varied research agendas. You will be working directly with both the Foundations team and researchers across the organization to enable bleeding-edge research workloads across our research portfolio.

Responsibilities

Build and Maintain

You will deliver a scalable, easy-to-use compute cluster to support impactful research by:

Empowering the research team to solve their own day-to-day compute problems, such as debugging simple issues and streamlining recurring tasks (e.g. running batch experiments, launching an interactive devbox, etc.).

Maintaining and developing in-cluster services, such as backups, experiment tracking, and our in-house LLM-based cluster support bot.

Maintaining adequate cluster stability to avoid interfering with research workloads (currently >95% uptime outside of planned maintenance windows).

Maintaining situational awareness of the cloud GPU market and assisting leadership with vendor comparisons to ensure we are using the most effective compute platforms.

Support Security

We often collaborate with partners with stringent security requirements (e.g. governments, frontier developers) and handle sensitive information (e.g. non-public exploits, CBRN datasets). You will implement security measures towards:

Securing the cluster against insider threats (architecting it to have adequate isolation to provide data confidentiality and integrity for sensitive workloads) and external threats (through minimizing the attack surface, and ensuring security updates are promptly installed).

Making secure workflows the default, e.g. streamlining the deployment of internal web dashboards behind an OAuth reverse proxy.

Championing security across the FAR.AI team, including maintaining and extending our mobile device management (MDM) system.

Bleeding-edge Workloads

You will work with the Foundations team and specific research teams to support novel ML workloads (e.g. fine-tuning a new open-weight model release) by:

Architecting our Kubernetes cluster to flexibly support novel workloads.

Assisting projects with bespoke requirements, designing and implementing effective infrastructure solutions, and sharing your infrastructure wisdom with ML researchers.

Improving observability over cluster resources and GPU utilization to allow us to rapidly diagnose and work around hardware issues or software bugs that may only arise on novel workloads.

About You

It is essential that you

Have Kubernetes or other system administration experience.

Have a curiosity and willingness to rapidly learn the needs of a new space.

Are self-directed and comfortable with ambiguous or rapidly evolving requirements.

Are willing to be on-call during waking hours for cluster issues ahead of major deadlines (for a few weeks a quarter).

Are interested in improving our security posture through identifying, implementing and administering security policies.

It is preferable that you

Have experience supporting ML / AI workloads.

Have previously worked in research environments or startups.

Are experienced in administering compute or GPU clusters.

Are able to adopt a security mindset.

Are willing to be part of an eventual on-call rotation, if required.

Example Projects

Configure the cluster and user-space development environments to support InfiniBand nodes for high-performance multi-node training.

Improve our default devbox K8s pod template to incorporate best-practice workflows for our researchers.

Roll out a new mobile device management system to ensure corporate devices meet our security requirements.

Streamline onboarding to the cluster for new starters (possibly in different timezones), and candidates on time-limited work trials.

Be “holder of the keys”, managing permissions and access control for FAR.AI’s team members to technical systems, including streamlining / automating (e.g. via SAML, SCIM) where appropriate.

Analyze storage patterns and propose infrastructure improvements for backups, disaster recovery, and usability.

Logistics

You will be a full-time employee of FAR AI, a 501(c)(3) research non-profit.

Location: Both remote and in-person (Berkeley, CA) are possible, though 2 hours of overlap with the Berkeley time zone is required. We sponsor visas for CA in-person employees, and can also hire remotely in most countries.

Hours: Full-time (40 hours / week).

Compensation: $100,000-$175,000 / year depending on experience and location. We will also pay for work-related travel and equipment expenses. We offer catered lunch and dinner at our offices in Berkeley.

Application process: A programming assessment, a short screening call, two 1-hour interviews, and a 1-week paid work trial.

If you have any questions about the role, please reach out at talent@far.ai. If you don't have questions, the best way to ensure a proper review of your skills and qualifications is by applying directly via the application form. Please don't email us to share your resume (it won't have any impact on our decision). Thank you!
