HPC Linux Administrator for AI Infrastructure (Scientist 2/3)

What You Will Do

Join the High Performance Computing Operations Group (HPC-OPS) in operating and maintaining some of the fastest supercomputers in the world. Designing, operating and maintaining these systems requires highly skilled personnel who specialize in both the hardware and software aspects of High Performance Computing. Innovators at heart, HPC-OPS Linux Administrators work both independently and collaboratively to maintain and implement capability improvements across a complex computing environment. This team is currently building on-premises cloud-like infrastructure to support the AI/ML/LLM needs of the laboratory.

The Platforms Team is seeking to add highly knowledgeable and motivated team members to help build and deploy the AI/ML/LLM infrastructure for LANL. This person will be an expert Linux Administrator who will help design, build and run our production NVIDIA DGX/HGX pods optimized for our environment and workflow. They will run and manage both admin and user-facing services with an understanding of modern AI/ML/LLM user workflows, Kubernetes, and other common tools. The successful candidate will participate in periodic on-call responsibilities managing NVIDIA SuperPODs and Kubernetes clusters, while actively growing their technical skills and staying up to date with the latest technologies in the field. In addition, the selected candidate will have the opportunity to develop technical products such as technical documentation, presentations, technical papers, and reports, to communicate findings internally and at conferences.

The selected HPC Cluster/NVIDIA SuperPOD Linux Administrator (Scientist 2/3) will provide strategic design, testing, analysis, administration, configuration management, verification, and validation of the newly developed cloud-like infrastructure and specialized compute infrastructure for AI/ML workloads. Mentoring of students, junior staff, and peers in technical and professional growth activities is highly valued, as is maintaining state-of-the-art technical expertise and knowledge within HPC system administration and developing new skills in related disciplines. This is your chance to directly support our national security mission and continue to make LANL the best place to work as a member of a dynamic, team-oriented, and leading-edge technical capability team.

This position will be filled at either the Scientist 2 or Scientist 3 level, depending on the skills of the selected candidate. Additional job responsibilities (outlined below) will be assigned if the candidate is hired at the higher level.

What You Need

Minimum Job Requirements:

Scientist 2: ($101,700 - $168,200)
  • Advanced Linux Administration Expertise: Demonstrated knowledge of administering production Linux computer systems, including strong command line Linux operating system skills, working knowledge of or experience with hardware and software security practices, and experience scripting in Bash, Perl, Python, or similar languages.
  • Configuration Management Expertise: Demonstrated experience with configuration and automation tools and practices, such as Chef, Puppet, Ansible, Salt, CFEngine, or similar tools.
  • Troubleshooting and Technical Analysis Acumen: Significant knowledge and demonstrated experience in formulating and testing hypotheses, investigating alternative solutions, and recommending solutions to technical problems.
  • Computer Networking Expertise: Working knowledge of networking concepts and practices.
  • Communication and Teaming Skills: Demonstrated effective communication skills, both verbal and written, including the ability to communicate technical information to both technical and non-technical personnel, to provide assistance and knowledge to peers, to collaborate with Group members, other HPC Group personnel and vendor representatives, as required, and to formulate and communicate technical results and findings to technical audiences and readerships (examples can include publications, team projects, and presentations).
  • Troubleshooting Skills: Demonstrated ability to troubleshoot hardware and software errors, including prioritizing problems, assessing impact to stakeholders, and documenting problems and solutions.

Additional Job Requirements for Scientist 3 ($122,300 - $206,300):

In addition to the Job Requirements outlined above, qualification at the Scientist 3 level requires:
  • Container Orchestration Expertise: Demonstrated experience managing, administering and maintaining large production Kubernetes clusters.
  • Troubleshooting Expertise: Experience troubleshooting and debugging user workflows in a Kubernetes environment.
  • Computer Networking Expertise: Experience with high performance interconnects, preferably NVLink and InfiniBand networks.
  • Leadership: Demonstrated experience with project planning and management. Ability to develop and lead complex projects, generate formal project plans, delegate tasks, and provide routine updates to management.
  • HPC Experience: Demonstrated experience building, installing, and administering HPC systems. Experience with modern image building and provisioning tools.
  • Mentoring: Ability to mentor and lead individual junior team members and students.

Education/Experience at Scientist 2:

Position requires a Bachelor's degree in a STEM field from an accredited college or university and 4 years of relevant experience or an equivalent combination of education and experience directly related to the occupation.

Education/Experience at Scientist 3:

Position requires a Master's degree in a STEM field from an accredited college or university and 6 years of relevant experience or an equivalent combination of education and experience directly related to the occupation.

Desired Qualifications:
  • Experience running NVIDIA DGX/HGX systems or pods in a production environment
  • Experience writing and debugging Kubernetes microservices in Go
  • Knowledge of Cloud technologies
  • Experience integrating operational metrics into a monitoring system such as Splunk
  • Demonstrated effective communication skills, including demonstrated ability to work productively with customers and vendors
  • High attention to detail including excellent organizational skills, analytical thinking, observational and problem-solving skills. Proven ability to independently multi-task and adjust to the workings of a dynamic and fast paced environment.
  • Experience with Git, creating issues, branches, merge requests and using CI/CD pipelines
  • Experience modifying Unix/Linux operating systems (e.g., enabling/disabling kernel modules).
  • Practical experience with Splunk or other monitoring tools.
  • Knowledge of or demonstrated experience with parallel and distributed storage systems; knowledge of file systems such as ZFS, EXT, XFS; working knowledge of file system structures and algorithms; and/or experience with Object storage and RESTful storage interfaces. Experience administering cluster storage technologies such as Ceph.
  • Demonstrated ability to develop new methods, techniques, or approaches to address critical technical problems and/or develop new technical capabilities.
  • An Active DOE Q Clearance

Work Location:

This position will be located in Los Alamos, NM, with the potential for a hybrid work arrangement (60% onsite/40% offsite) from a location within 2 hours ground commute of this location. Reporting onsite will be required. Hybrid is at the discretion of management and can change at any time with appropriate notice.

Position commitment: Regular appointment employees are required to serve a period of continuous service in their current position in order to be eligible to apply for posted jobs throughout the Laboratory. If an employee has not served the time required, they may only apply for Laboratory jobs with the documented approval of their Division Leader. The position commitment for this position is 1 year.

Note to Applicants:

For consideration, applicants should submit a cover letter addressing how their knowledge, skills and abilities meet the minimum requirements along with a resume.

Where You Will Work

Located in beautiful northern New Mexico, Los Alamos National Laboratory (LANL) is a multidisciplinary research institution engaged in strategic science on behalf of national security. Our generous benefits package includes:

  • PPO or High Deductible medical insurance with the same large nationwide network
  • Dental and vision insurance
  • Free basic life and disability insurance
  • Paid childbirth and parental leave
  • Award-winning 401(k) (6% matching plus 3.5% annually)
  • Learning opportunities and tuition assistance
  • Flexible schedules and time off (PTO and holidays)
  • Onsite gyms and wellness programs
  • Extensive relocation packages (outside a 50 mile radius)

Additional Details

Directive 206.2 - Employment with Triad requires a favorable decision by NNSA indicating employee is suitable under NNSA Supplemental Directive 206.2. Please note that this requirement applies only to citizens of the United States. Foreign nationals are subject to a similar requirement under DOE Order 142.3A.

Clearance: Q (Position will be cleared to this level). Selected applicants will be subject to a background investigation conducted by or on behalf of the Federal Government, and must meet eligibility requirements* for access to classified matter. This position requires a Q clearance, and obtaining such clearance requires U.S. citizenship except in extremely rare circumstances. Dependent upon the position, additional authorization to access classified information may be required, which may or may not be available to dual citizens. Receipt of a Q clearance and additional access authorization ultimately is a decision of the Federal Government and not of Triad.

*Eligibility requirements: To obtain a clearance, an individual must be at least 18 years of age; U.S. citizenship is required except in very limited circumstances. See DOE Order 472.2 for additional information.

New-Employment Drug Test: The Laboratory requires successful applicants to complete a new-employment drug test and maintains a substance abuse policy that includes random drug testing. Although New Mexico and other states have legalized the use of marijuana, use and possession of marijuana remain illegal under federal law. A positive drug test for marijuana will result in termination of employment, even if the use was pre-offer.

Regular position: Term status Laboratory employees applying for regular-status positions are converted to regular status.

Internal Applicants: Regular appointment employees who have served the required period of continuous service in their current position are eligible to apply for posted jobs throughout the Laboratory. If an employee has not served the required period of continuous service, they may only apply for Laboratory jobs with the documented approval of their Division Leader. Please refer to Policy P701 for applicant eligibility requirements.

Equal Opportunity: Los Alamos National Laboratory is an equal opportunity employer and supports a diverse and inclusive workforce. All employment practices are based on qualification and merit, without regard to race, color, national origin, ancestry, religion, age, sex, gender identity, sexual orientation, marital status or spousal affiliation, physical or mental disability, medical conditions, pregnancy, status as a protected veteran, genetic information, or citizenship within the limits imposed by federal laws and regulations. The Laboratory is also committed to making our workplace accessible to individuals with disabilities and will provide reasonable accommodations, upon request, for individuals to participate in the application and hiring process. To request such an accommodation, please send an email to applyhelp@lanl.gov or call 1-505-664-6947 option 2.