Data Engineer

Institute Of Foundation Models • Sunnyvale, California, United States
Posted: 30 days ago
Job type: Full-time

Job description

About the Institute of Foundation Models

We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

The Role

As a Data Engineer specializing in Natural Language Processing (NLP) and large-scale data processing, you will quickly and effectively gather, curate, and prepare high-quality datasets to support cutting-edge NLP research. Your role will be instrumental in enabling researchers by delivering essential data through efficient and scalable engineering practices, including web crawling, LLM-generated content refinement, and robust data pipelines, primarily leveraging Python and related technologies.

Key Responsibilities

  • Rapidly collect, curate, and preprocess datasets based on detailed specifications provided by NLP researchers, delivering data within tight timelines (typically within 1-2 days).
  • Develop and maintain efficient web crawling solutions, APIs, and automated workflows to continuously improve data collection processes.
  • Refine and evaluate outputs from Large Language Models (LLMs) to generate structured datasets suitable for model training and benchmarking.
  • Implement scalable data pipelines, ensuring efficient data processing, storage, retrieval, and distribution to research teams.
  • Collaborate closely with researchers and engineers to ensure collected data meets specified quality and relevance criteria.
  • Document data collection methodologies, dataset characteristics, and pipeline architecture clearly and effectively.
  • Engage with peer teams and participate in technical reviews to uphold best practices and data quality standards.
  • Represent MBZUAI at industry and research forums, showcasing technical capabilities in large-scale data processing and AI data infrastructure.
  • Perform all other duties as reasonably directed by the line manager commensurate with these functional objectives.

Academic Qualifications

  • Bachelor's degree in Computer Science, Data Science, Engineering, or a related technical field required.
  • Master’s degree or equivalent experience in Computer Science, Data Engineering, or related technical fields preferred.

Professional Experience - Required

  • Extensive experience in data engineering, data processing, and automation using Python.
  • Demonstrated proficiency in designing and deploying web crawling solutions, automated data extraction, and processing pipelines.
  • Strong understanding of data structures, algorithms, databases, SQL, and performance optimization.
  • Experience working with cloud infrastructure and distributed data processing frameworks (e.g., AWS, Spark, Kafka, Kubernetes).
  • Excellent problem-solving abilities, attention to detail, and the capability to rapidly address technical challenges.
  • Strong communication and collaboration skills with cross-functional teams.

Professional Experience - Preferred

  • Proven track record of supporting NLP or AI research teams with rapid and reliable data delivery.
  • Experience with refining outputs from large-scale AI models, such as LLM-generated data.
  • Contributions to open-source projects, coding competitions, or high visibility in coding communities (e.g., GitHub, Stack Overflow).
  • Familiarity with the latest advancements in NLP data processing and large language model technologies.

$100,000 - $500,000 a year

Visa Sponsorship

This position is eligible for visa sponsorship.

Benefits Include

  • Comprehensive medical, dental, and vision benefits
  • Bonus
  • 401K Plan
  • Generous paid time off, sick leave and holidays
  • Paid Parental Leave
  • Employee Assistance Program
  • Life insurance and disability