HPC & AI Workload Management Engineer (Scientist 2/3)
- Req. Number: IRC138986
- Organization: HPC-ENV/HPC Environments
- City, State: Los Alamos, New Mexico
Join the High Performance Computing Environments Group (HPC-ENV) at Los Alamos National Laboratory, where we manage and operate advanced, large-scale computing infrastructure supporting a diverse range of critical workloads. We seek experienced professionals in HPC scheduling and resource management to ensure the reliable operation of our production systems, which support:
- Large-Scale Modeling & Simulation (e.g., physics, engineering, climate)
- Artificial Intelligence & Machine Learning (AI/ML) deployments
- Emerging Architectures for next-generation computing needs
Key Responsibilities
- Manage & Optimize HPC scheduling and resource allocation for a variety of production workloads
- Collaborate with vendors and internal teams for issue resolution and system enhancements
- Deploy & Maintain HPC job scheduling software and databases on new and existing systems
- Ensure Seamless Integration of scheduling solutions with emerging HPC and AI architectures
This position will be filled at either the Scientist 2 or Scientist 3 level, depending on the skills of the selected candidate. Additional job responsibilities (outlined below) will be assigned if the candidate is hired at the higher level.
What You Need
Minimum Job Requirements:
Scientist 2 ($96,100 - $164,100)
Responsibilities:
- Configure and analyze scheduling software for production environments
- Collaborate with vendors on support tickets and system updates
- Optimize system efficiency and workflow processes
- Apply Modern Software Practices to streamline workflow automation and efficiency
- Participate in on-call rotations
Requirements:
- Strong Linux administration skills
- Programming skills (Python, Bash, C/C++)
- Familiarity with:
  - Large-scale system administration or job scheduling
  - Modeling/simulation workflows or AI/ML deployment and tooling
Scientist 3 ($119,200 - $201,100)
In addition to what was outlined at the lower level, at this level you will:
Responsibilities:
- Lead the deployment of new HPC systems and architectures
- Drive optimization for large-scale, diverse workloads (simulation, AI/ML, etc.)
- Collaborate with stakeholders and vendors to define system requirements
- Champion adoption of innovative practices in HPC management and automation
- Work hand in hand with production system administration to diagnose the most difficult problems involving applications running on HPC systems
Requirements:
- Proven experience in managing production HPC environments
- Demonstrable accomplishments as technical lead on a project
- Demonstrated experience administering an HPC or AI job scheduler or resource manager (e.g., Slurm, Moab, LSF, PBS/Torque, Grid Engine, FLUX, Dakota, Swift/T, Run:AI)
- Experience with the internals of an MPI implementation (e.g., OpenMPI, MPICH), PMIx, or a similar parallel runtime programming model
Education/Experience for the lower level: Position requires a Bachelor's degree in a STEM field from an accredited college or university and 4 years of related experience, typically with post-doctoral research experience at a university or national lab, or equivalent experience directly related to the occupation.
Education/Experience for the higher level: Position requires a Master's degree in a STEM field from an accredited college or university and 6 years of relevant experience or an equivalent combination of education and experience directly related to the occupation.
Desired Qualifications:
- Knowledge of virtualization, containerization, and orchestration
- Expertise in managing large, complex computing environments
- Ability to anticipate, experiment with, and optimize workload and workflow efficiency
- Experience working hand in hand with production system administration to diagnose the most difficult problems involving applications running on HPC systems
- Experience working with, debugging, and adding features to large code bases in languages like C and C++.
- Experience with monitoring, dashboards, and data visualization (e.g., Splunk, Grafana)
Work Location:
The work location for this position is hybrid and is located in Los Alamos, NM. Hybrid is defined as working partially onsite and partially offsite, but within a 2-hour ground commute of this location. All work locations are at the discretion of management and can change at any time with appropriate notice.
Position commitment: Regular appointment employees are required to serve a period of continuous service in their current position in order to be eligible to apply for posted jobs throughout the Laboratory. If an employee has not served the time required, they may only apply for Laboratory jobs with the documented approval of their Division Leader. The position commitment for this position is 1 year.
Note to Applicants:
We encourage applicants to include a cover letter. This should directly address the requirements and qualifications above that you meet, referencing them by number.
Due to federal restrictions contained in the current National Defense Authorization Act, citizens of the People's Republic of China (including the special administrative regions of Hong Kong and Macau), as well as citizens of the Islamic Republic of Iran, the Democratic People's Republic of Korea (North Korea), and the Russian Federation, who are not Lawful Permanent Residents ("green card" holders) are prohibited from accessing facilities that support the mission, functions, and operations of national security laboratories and nuclear weapons production facilities, which includes Los Alamos National Laboratory.
Where You Will Work
Located in beautiful northern New Mexico, Los Alamos National Laboratory (LANL) is a multidisciplinary research institution engaged in strategic science on behalf of national security. Our generous benefits package includes:
- PPO or High Deductible medical insurance with the same large nationwide network
- Dental and vision insurance
- Free basic life and disability insurance
- Paid childbirth and parental leave
- Award-winning 401(k) (6% matching plus 3.5% annually)
- Learning opportunities and tuition assistance
- Flexible schedules and time off (PTO and holidays)
- Onsite gyms and wellness programs
- Extensive relocation packages (outside a 50-mile radius)
Additional Details
Directive 206.2 - Employment with Triad requires a favorable decision by NNSA indicating employee is suitable under NNSA Supplemental Directive 206.2. Please note that this requirement applies only to citizens of the United States. Foreign nationals are subject to a similar requirement under DOE Order 142.3A.
Clearance: Q (Position will be cleared to this level). Selected applicants will be subject to a background investigation conducted by or on behalf of the Federal Government, and must meet eligibility requirements* for access to classified matter. This position requires a Q clearance, and obtaining such clearance requires US Citizenship except in extremely rare circumstances. Dependent upon the position, additional authorization to access classified information may be required, which may or may not be available to dual citizens. Receipt of a Q clearance and additional access authorization ultimately is a decision of the Federal Government and not of Triad.
*Eligibility requirements: To obtain a clearance, an individual must be at least 18 years of age; U.S. citizenship is required except in very limited circumstances. See DOE Order 472.2 for additional information.
New-Employment Drug Test: The Laboratory requires successful applicants to complete a new-employment drug test and maintains a substance abuse policy that includes random drug testing. Although New Mexico and other states have legalized the use of marijuana, use and possession of marijuana remain illegal under federal law. A positive drug test for marijuana will result in termination of employment, even if the use was pre-offer.
Regular position: Term status Laboratory employees applying for regular-status positions are converted to regular status.
Internal Applicants: Regular appointment employees who have served the required period of continuous service in their current position are eligible to apply for posted jobs throughout the Laboratory. If an employee has not served the required period of continuous service, they may only apply for Laboratory jobs with the documented approval of their Division Leader. Please refer to Policy P701 for applicant eligibility requirements.
Equal Opportunity: Los Alamos National Laboratory is an equal opportunity employer. All employment practices are based on qualification and merit, without regard to protected categories such as race, color, national origin, ancestry, religion, age, sex, gender identity, sexual orientation, marital status or spousal affiliation, physical or mental disability, medical conditions, pregnancy, status as a protected veteran, genetic information, or citizenship within the limits imposed by federal, state, and local laws and regulations. The Laboratory is also committed to making our workplace accessible to individuals with disabilities and will provide reasonable accommodations, upon request, for individuals to participate in the application and hiring process. To request such an accommodation, please send an email to applyhelp@lanl.gov or call (505)-664-6947 opt. 3.