Machine Learning Operations Engineer
Company: RAI Institute
Location: Cambridge
Posted on: April 4, 2025
Job Description:
Our Mission
Our mission is to solve the most important and
fundamental challenges in AI and Robotics to enable future
generations of intelligent machines that will help us all live
better lives.
Machine Learning Operations (MLOps) Engineers build
infrastructure that supports the entire lifecycle of Machine
Learning (ML) projects, from development through scaling to
deployment. If you have a passion for building the foundation that
enables robotics research and engineering, you will want to join
us!
What You Will Do
- Design, develop, and maintain company-wide platforms and
tooling that utilize Kubernetes infrastructure to enable machine
learning and data processing applications
- Enable self-service access to ML-compute for our on-prem and
cloud compute clusters, including support for job scheduling,
workload scalability and workload fault tolerance
- Enhance observability across ML applications through
integrations with tools and services such as Fluentd, Prometheus,
Grafana and Datadog
- Integrate ML applications with experiment tracking and
management services like Weights & Biases
- Elevate code quality and champion best practices in our
engineering processes
- Collaborate with Machine Learning Engineers, Data Engineers,
DevOps Engineers and researchers to build scalable solutions that
improve engineering and research velocity.
What You Will Bring
- BS or MS in Computer Science, Engineering, or equivalent
- 3+ years of experience in an MLOps, DevOps, ML Engineering or
software engineering role
- Strong hands-on experience deploying and managing applications
running on Kubernetes
- Experience developing MLOps platforms to manage the lifecycle
of ML experiments, including one or more of data and artifact
management, reproducibility, fault tolerance, experiment tracking
and model serving
- Experience with Docker and Python environment management tools
such as pip, poetry, uv or similar
- Proficient in software practices such as version control (Git),
CI/CD (GitHub Actions, Argo CD) and Infrastructure as
Code (Terraform).
Extra Skills We Value
- Experience with Kueue or similar job scheduling
mechanisms
- Experience with workflow orchestration tools such as Airflow,
Metaflow, Argo Workflows or similar
- Hands-on experience deploying and managing cloud infrastructure
on platforms like GCP and AWS
- Experience with hybrid-cloud compute and data environments
- Experience with Ray, PyTorch Lightning or similar scalable
AI/ML platforms
- Experience with application and system logging using tools and
services like Fluentd, Prometheus, Grafana, Datadog or
similar
- Experience with Bazel build tool or similar
- Experience with ML model serving frameworks such as TorchServe,
ONNX Runtime or similar
- Experience working with research teams in an academic or
industrial environment.
We provide equal employment opportunities to
all employees and applicants for employment and prohibit
discrimination and harassment of any type without regard to race,
color, religion, age, sex, national origin, disability status,
genetics, protected veteran status, sexual orientation, gender
identity or expression, or any other characteristic protected by
federal, state or local laws.