HPC Storage Systems Administrator (DAOS) – National Lab Support
Remote
Posted 1 month ago
Myticas Consulting is seeking an experienced HPC Storage Systems Administrator on behalf of a confidential national laboratory client. This is a senior, 100% remote contract role focused on maintaining reliable object storage for demanding scientific workloads in a high-performance computing (HPC) environment.
- Location: Remote (Supporting client near Argonne, Illinois)
- Contract Type: Full-time (40 hours/week)
- Experience: 3–7 years administering Linux in production; 2+ years operating high-performance distributed storage systems at scale.
- Focus: Day-to-day operations, maintenance, and incident resolution for large-scale HPC storage clusters, specifically DAOS.
Key Responsibilities: Distributed Storage Operations and Automation
The administrator is responsible for the health, stability, and security of the client’s distributed storage systems.
- Daily Operations: Provide operations support, maintenance, and issue resolution for HPC storage clusters, with a focus on DAOS.
- Diagnostics & Vendor Coordination: Monitor system health, perform diagnostics and root-cause analysis, and coordinate with internal teams and hardware/software vendors (e.g., HPE) to resolve storage incidents.
- Maintenance: Perform upgrades, patches, and configuration changes to maintain system stability and security.
- Automation: Automate routine administration tasks using scripting (Bash, Python) and/or configuration management tools (Ansible).
- Documentation: Create or follow operational runbooks and documentation to ensure the availability and reliability of large-scale distributed object storage.
Required Skills & Expertise: Storage at Scale
Success requires experience managing petabyte-scale storage systems and hands-on skill in troubleshooting both hardware and software.
- Linux Systems: 3–7 years administering Linux systems in production environments, including command-line administration.
- Distributed Storage (2+ years): Experience with large-scale distributed object or parallel file systems (e.g., HPE DAOS, Lustre, GPFS/Spectrum Scale, Ceph) and coordinating with vendors.
- Hardware: 2+ years of hands-on experience troubleshooting and maintaining server and storage hardware.
- Automation: 1+ years of experience automating system administration tasks with scripting (Bash, Python) or configuration management tools (Ansible).
- Residency: Must reside in the United States and be authorized to work without sponsorship.
Preferred Skills:
- Hands-on production experience administering HPE DAOS.
- Experience supporting HPC clusters, supercomputing environments, or scientific workflows.
- Experience creating operational runbooks, monitoring dashboards, and documentation.
Job Features
| Job Category | Cloud Engineering, Software Engineering |