Job Description

The Data, Technology and Engineering (DTE) Infrastructure team is expanding its Generative AI and large language model (LLM) capabilities and is looking for a senior principal engineer who specializes in the operational aspects of designing, developing, deploying, and optimizing Gen AI solutions at scale. This engineer is responsible for the ongoing deployment, management, and optimization of LLMs on the Azure OpenAI infrastructure, ensuring that the models perform effectively, are reliable, and meet the needs of various applications and services. This enables the Data Strategy and Solutions (DSS) team and other internal teams to build LLM-powered solutions on the deployed models. We are seeking Generative AI engineers who are passionate about designing, developing, deploying, and optimizing Gen AI products at scale.

Responsibilities:
1. Model Deployment and Management
- Deploying Models:
- Manage the deployment of LLMs into production environments.
- Configure model serving infrastructures, including APIs and endpoints.
- Model Versioning and Lifecycle Management:
- Maintain version control of models, ensuring the correct models are in production.
- Plan and execute model updates and decommissioning of outdated models.
2. Performance Optimization
- Model Serving Efficiency:
- Optimize the inference performance of LLMs for low latency and high throughput.
- Implement techniques like model quantization, pruning, or distillation.
- Resource Optimization:
- Analyze and optimize compute resources required for model serving.
- Adjust configurations to balance performance and cost.
- Cost Management:
- Monitor costs associated with model training and serving.
- Identify opportunities to reduce expenses without compromising quality.
3. System Monitoring and Maintenance
- Monitoring Model Performance:
- Continuously track model accuracy, response time, and user feedback.
- Use monitoring tools to detect anomalies in model behavior.
- Maintenance Tasks:
- Schedule retraining and fine-tuning of models based on new data.
- Update models to address identified issues or improve performance.
4. Troubleshooting and Incident Response
- Resolving Technical Issues:
- Diagnose and resolve issues related to model outputs, such as incorrect or biased responses.
- Debug model-related errors in production.
- Incident Management:
- Lead the response to incidents affecting model performance or availability.
- Document incidents and implement preventive measures.
- Disaster Recovery Planning:
- Develop strategies to recover models and data in case of system failures.
5. Collaboration with Cross-Functional Teams
- Working with Data Scientists and ML Engineers:
- Collaborate to understand model requirements and provide operational support.
- Assist in experiments and evaluate new models or features.
- Establish AI model governance standards and collaborate with cross-functional teams.
- Supporting Developers:
- Help application developers integrate LLM APIs into products and services.
- Oversee cloud and AI services, ensuring robust CI/CD pipelines for continuous delivery.
- Provide technical guidance on best practices for using LLMs.
- Stakeholder Communication:
- Communicate model updates, new features, and performance metrics to stakeholders.
6. Supporting AI Ethics and Responsible AI Practices
- Ensuring Ethical AI Deployment:
- Implement policies to ensure models are used responsibly.
- Monitor for misuse of AI capabilities.
- Bias Detection and Mitigation:
- Develop and apply techniques to detect and reduce biases in model outputs.
- Collaborate on fairness and inclusivity initiatives.
- User Privacy:
- Ensure compliance with data privacy laws and regulations.
- Manage and protect sensitive information used in model training.
7. Documentation and Knowledge Sharing
- Creating Documentation:
- Document model architectures, training processes, and operational procedures.
- Maintain records of experiments and performance evaluations.
- Training and Mentoring:
- Share knowledge with team members about best practices in LLM operations.
- Provide training sessions on new tools or methodologies.
- Building Runbooks:
- Develop standard operating procedures for common tasks and incident responses.
8. Integrations and Data Pipeline Management
- Managing Data Workflows and Integrations:
- Set up and maintain data ingestion pipelines for training and fine-tuning LLMs.
- Ensure data is processed efficiently and securely.
- Set up integrations with external data sources.
- Data Preprocessing:
- Collaborate with DSS team to preprocess and clean datasets for model training.
- Implement data augmentation techniques to enhance model performance.
- Data Storage Solutions:
- Optimize storage solutions for training data.
- Implement data retention policies and archiving strategies.
9. Continuous Improvement and Learning
- Staying Current with Technology Trends:
- Keep updated on advancements in LLMs and natural language processing.
- Attend workshops, conferences, and training sessions.
- Experimentation:
- Test new algorithms or techniques to improve model capabilities.
- Feedback Integration:
- Collect and analyze user feedback to enhance model performance.
10. Compliance with Best Practices and Standards
- Adhering to AI Development Standards:
- Follow industry best practices for AI model development and deployment.
- Quality Assurance:
- Implement testing frameworks to validate model outputs pre-deployment.
- Standardization:
- Establish standards for model development, naming conventions, and versioning.
11. Supporting Off-the-Shelf GenAI/LLM Products such as Microsoft 365 Copilot
- Develop best practices for M365 Copilot:
- Follow industry best practices and document them.
- Day-to-day end-user support:
- Assist end users with driving adoption, training, and related support needs.
Knowledge and Skills:
- Strong understanding of GenAI / LLM Ops in the cloud, preferably on the Microsoft Azure OpenAI infrastructure.
- Strong problem-solving and troubleshooting skills.
- Ability to work on multiple concurrent projects and activities as both a lead and a team member.
- Able to reliably estimate the level of effort needed for assignments and work within those parameters.
- Able to work independently with minimal guidance.
- Strong verbal and written communication skills, organizational skills, and attention to detail.
- Demonstrated ability to collaborate in cross-functional teams.
Education and Experience:
- Bachelor's degree in a computer science discipline preferred, and 10+ years of professional experience in software or data engineering, including at least 3-5 years in machine learning and Generative AI/LLM technology.
- Experience at the cutting edge of AI, automation, and data, and in applying these capabilities to drive measurable impact, including proven experience designing and scaling enterprise-grade AI/ML platforms, with an emphasis on GenAI systems and workflow orchestration.
- Strong understanding of GenAI design patterns and system components, including Retrieval-Augmented Generation (RAG), vector databases, prompt orchestration, and agentic frameworks.
- Experience with LLMOps tools to implement guardrails and to track accuracy, hallucinations, bias, and other metrics in Gen AI products.
- Experience leading technical solution design and translating business requirements into technical specifications.
- Experience in the build, deployment, integration, and scaling of AI and data-focused applications.
- Experience leading agile cross-functional technical teams to execute on technology infrastructure evolutions and custom applications, as well as their ongoing maintenance.
- Proven ability to operate with a transparent mindset, communicating openly with stakeholders at various levels of the organization.
- Cloud ML/GenAI/LLM certifications preferred.
Flex Designation: Hybrid-Eligible or On-Site Eligible

Flex Eligibility Status: In this Hybrid-Eligible role, you can choose to be designated as: 1. Hybrid: work remotely up to two days per week; or 2. On-Site: work five days per week on-site with ad hoc flexibility. Note: The Flex status for this position is subject to Vertex's Policy on Flex @ Vertex Program and may be changed at any time.

Pay Range: $0 - $0
Disclosure Statement:
The range provided is based on what we believe is a reasonable estimate for the base salary pay range for this job at the time of posting. This role is eligible for an annual bonus and annual equity awards. Some roles may also be eligible for overtime pay, in accordance with federal and state requirements. Actual base salary pay will be based on a number of factors, including skills, competencies, experience, and other job-related factors permitted by law. At Vertex, our Total Rewards offerings also include inclusive market-leading benefits to meet our employees wherever they are in their career, financial, family and wellbeing journey while providing flexibility and resources to support their growth and aspirations. From medical, dental and vision benefits to generous paid time off (including a week-long company shutdown in the Summer and the Winter), educational assistance programs including student loan repayment, a generous commuting subsidy, matching charitable donations, 401(k) and so much more.
Company Information

Vertex is a global biotechnology company that invests in scientific innovation. Vertex is committed to equal employment opportunity and non-discrimination for all employees and qualified applicants without regard to a person's race, color, sex, gender identity or expression, age, religion, national origin, ancestry, ethnicity, disability, veteran status, genetic information, sexual orientation, marital status, or any characteristic protected under applicable law. Vertex is an E-Verify Employer in the United States. Vertex will make reasonable accommodations for qualified individuals with known disabilities, in accordance with applicable law. Any applicant requiring an accommodation in connection with the hiring process and/or to perform the essential functions of the position for which the applicant has applied should make a request to the recruiter or hiring manager, or contact Talent Acquisition at ApplicationAssistance@vrtx.com.