Los Alamos National Laboratory Archive Storage System Administrator (Scientist 2/3) in Los Alamos, New Mexico

What You Will Do

The High Performance Computing (HPC) Division at Los Alamos National Laboratory provides scientific computing resources consisting of some of the largest HPC systems in the world, including a large (19K+ node) Cray system called Trinity, as well as numerous large commodity cluster systems. The HPC Data Storage Team within the HPC Systems Group (HPC-SYS) provides vanguard production support, research, and development for existing and future systems that feed and unleash the power of the supercomputer. The Data Storage Team designs, builds, and maintains some of the largest, fastest, and most complex data movement and storage systems in the world, including systems supporting 100 Petabytes of capacity. We provide storage systems spanning the full range of tiers from the most resilient archival systems to the pinnacle of high-speed storage, including all-flash file systems and systems supplying bandwidth that exceeds a terabyte per second to some of the largest and fastest supercomputers in the world.

Innovators and builders at heart, the Data Storage Team seeks highly motivated, productive, inquisitive, and multi-talented candidates who are equally comfortable working independently and as part of a team. There are frequent opportunities for collaborative work with scientists and staff within the group (for instance, with scientists designing and operating our high-speed networking infrastructure) or with scientists from other groups, including close collaborative research opportunities with LANL’s Ultrascale Systems Research Center (USRC), to help drive cutting-edge advances.

This role requires strong communication skills, as well as comprehensive troubleshooting and analytical skills. Team member duties include: designing, building, and maintaining world-class data movement and storage systems; evaluating and testing new technology and solutions; system administration of HPC storage infrastructure in support of compute clusters; diagnosing, solving, and implementing solutions for various system operational problems; tuning file systems to increase performance and reliability of services; process automation; interacting with vendors; and communicating and collaborating with other groups, teams, projects, and sites. Specifically, the selected candidate will support archival storage in both production and forward-looking efforts. The role will likely include database administration in addition to general storage administration.

The selected candidate will participate in a regularly scheduled rotation of on-call support of production systems, including some systems under 24x7 support. In addition, some non-standard working hours may occasionally be required. This position is full-time and is located at Los Alamos National Laboratory in Los Alamos, New Mexico.

This position will be filled at either the Scientist 2 or Scientist 3 level, depending upon the skills of the selected candidate. Additional job responsibilities (outlined below) will be assigned if the candidate is hired at the higher level.

Scientist 2 ($87,800 - $144,800)

  • Participate in periodic on-call responsibilities.

  • Work both independently and collaboratively with other members of the archive team or group after receiving initial direction and requirements from technical project leads.

  • Troubleshoot and diagnose the root cause of system failures, and isolate the components and failure scenarios while working with internal and external stakeholders.

  • Develop and publish updates on resolutions and communicate findings internally.

  • Work with team members to make modifications and additions to existing systems, code, and methods.

  • Work with team to bring up new hardware and test functionality.

  • Participate in process improvement, including deep multi-system problem isolation and resolution, often in collaboration with administrators of other HPC subsystems.

  • Work with team members to document, design, and implement new ideas and approaches for newer architectures and improve those for existing ones.

  • Present best practices, experience reports, and/or research results to managers and to peers locally or at conferences.

Scientist 3 ($96,600 - $161,300)

In addition to the duties outlined above, a successful Scientist 3 candidate will be required to:

  • Work as a technical leader/subject matter expert to propose and implement solutions to current problems and future deficiencies in our HPC archive storage environment in conjunction with junior and senior administrators and technical staff within and across teams.

  • Proactively create experiments and tooling to validate solutions and to detect and diagnose hardware health issues.

  • Analyze published research papers in the area of archive and data storage, summarize, and share implications and connections to ongoing work with team members.

  • Interact and/or collaborate with people from other teams, groups, divisions, directorates, and programs to develop, implement, and/or communicate technical solutions.

  • Enhance technical and professional expertise of other staff and students through active mentoring and training activities.

  • Contribute to peer review of the work of others across organizations or disciplines within the Laboratory.

  • Present best practices and research results to national peers at conferences, workshops, and meetings, as well as participate in national strategic partnerships.

What You Need

Minimum Job Requirements:

  • Strong interpersonal and written and oral communication skills.

  • Demonstrated ability to work within a team environment.

  • Experience with relational database administration.

  • Demonstrated knowledge of building, configuring, and administering production Linux computer/storage systems.

  • Practical experience scripting in Bash, Perl, Python, or similar languages.

  • Strong command line Linux operating system skills.

  • Ability to mentor and lead individual junior team members and students.

  • Broad knowledge of data storage administration.

  • Knowledge of storage system hardware.

  • Working knowledge of networking concepts and practices.

  • Knowledge of or experience with hardware and software security practices.

Additional Job Requirements for Scientist 3:

In addition to the Job Requirements outlined above, qualification at the Scientist 3 level requires:

  • Broad demonstrated knowledge of production HPC system management topics, including networking, programming, file systems, operating systems, and configuration management, with depth in one or more areas.

  • Demonstrated programming experience including compiled languages and advanced scripting.

  • Ability to lead and mentor teams, students, or junior team members.

  • Demonstrated ability to initiate, design, and lead projects.

  • Demonstrated ability to evaluate competing HPC subsystem technologies.

  • Ability to analyze published research papers in the area of data storage, summarize research results, and share implications and connections to ongoing work with team members.

  • Ability to present technical papers and/or technical work to peers locally or at conferences.

Desired Skills:

  • Experience managing tape storage infrastructure (Oracle, Quantum, IBM, SpectraLogic).

  • Experience with backup/archival software (Tivoli Storage Manager, Commvault, HPSS, Oracle HSM, etc.).

  • Experience deploying and managing SAN infrastructure.

  • Knowledge of parallel/distributed file systems (e.g., Lustre, GPFS, Panasas, Gluster).

  • Demonstrated experience building, configuring and managing parallel or distributed file systems.

  • Knowledge of file systems such as ZFS, EXT, XFS.

  • Experience working in a production computing environment, preferably with HPC data storage systems or at large scale.

  • Working knowledge of file system structures and algorithms.

  • Experience with object storage and RESTful storage interfaces.

  • Experience diagnosing system software problems.

  • Experience supporting a scientific user base.

  • Experience with multiple Linux distributions.

  • Experience with multiple network technologies (e.g., Ethernet, IB, OPA).

  • Experience with revision control systems such as RCS, Subversion, or Git.

  • Experience with low-level system administration tools such as perf, strace, tcpdump, and vmstat.

  • Experience managing computers in a DOE or DOD classified environment.

  • Familiarity with Cfengine, Chef, Puppet, Ansible, Salt, or similar configuration and automation tools and practices.

  • Deep knowledge of and demonstrated experience with parallel and distributed storage systems.

  • Contribution to open source or non-work-related projects.

  • Ability to acquire and maintain a DOE Q-level clearance.

Additional Details:

Clearance: Q (Position will be cleared to this level). Applicants selected will be subject to a Federal background investigation and must meet eligibility requirements* for access to classified matter.

*Eligibility requirements: To obtain a clearance, an individual must be at least 18 years of age; U.S. citizenship is required except in very limited circumstances. See DOE Order 472.2 for additional information.

New-Employment Drug Test: The Laboratory requires successful applicants to complete a new-employment drug test and maintains a substance abuse policy that includes random drug testing.

Regular position: Term-status Laboratory employees applying for regular-status positions are converted to regular status.

Equal Opportunity: Los Alamos National Laboratory is an equal opportunity employer and supports a diverse and inclusive workforce. All employment practices are based on qualification and merit, without regard to race, color, national origin, ancestry, religion, age, sex, gender identity, sexual orientation or preference, marital status or spousal affiliation, physical or mental disability, medical conditions, pregnancy, status as a protected veteran, genetic information, or citizenship within the limits imposed by federal laws and regulations. The Laboratory is also committed to making our workplace accessible to individuals with disabilities and will provide reasonable accommodations, upon request, for individuals to participate in the application and hiring process. To request such an accommodation, please send an email to applyhelp@lanl.gov or call 1-505-665-4444 option 1.

Where You Will Work

Located in northern New Mexico, Los Alamos National Laboratory (LANL) is a multidisciplinary research institution engaged in strategic science on behalf of national security. LANL enhances national security by ensuring the safety and reliability of the U.S. nuclear stockpile, developing technologies to reduce threats from weapons of mass destruction, and solving problems related to energy, environment, infrastructure, health, and global security concerns.

The High Performance Computing (HPC) Division provides production high performance computing systems services to the Laboratory. HPC Division serves all Laboratory programs requiring a world-class high performance computing capability to enable solutions to complex problems of strategic national interest. Our work starts with the early phases of acquisition, development, and production readiness of HPC platforms, and continues through the maintenance and operation of these systems and the facilities in which they are housed. HPC Division also manages the network, parallel file systems, storage, and visualization infrastructure associated with the HPC platforms. The Division directly supports the Laboratory’s HPC user base and aids, at multiple levels, in the effective use of HPC resources to generate science. Additionally, we engage in research activities that we deem important to our mission.

Location: Los Alamos, NM, US

Contact Name: Doyle, Christine Louise

Organization Name: HPC-SYS / High Performance Computing Systems

Email: cdoyle@lanl.gov

Job Title: Archive Storage System Administrator (Scientist 2/3)

Appointment Type: Regular

Req ID: IRC62917