Site Reliability Engineer II

Remote

Posted 5 months ago

Pinterest is hiring a Site Reliability Engineer II (SRE) to join its Engineering team. The SRE organization is crucial for ensuring Pinterest’s overall availability and enhancing engineering teams’ ability to design, build, and operate robust systems at scale. This role focuses on building systems that ensure the reliability of large-scale distributed systems handling billions of page views and petabytes of data.

This is a regular, full-time, remote position. The role requires being in the office for in-person collaboration 1-2 times per half-year, so candidates can be located anywhere in the country.

Role Summary and Reliability Mandate

The SRE II will apply software engineering principles to infrastructure and operations problems, specializing in scaling, optimizing, and automating critical systems. You will gain a deep understanding of complex system behaviors to identify risks and implement long-term solutions that minimize operational overhead.

Key Responsibilities:

Software Development for Reliability: Develop software solutions to enable the reliability and operability of large-scale distributed systems that handle petabytes of data.
System Insight and Risk Identification: Build a deep understanding of how Pinterest’s systems behave, scale, interact, and fail, using that insight to identify risks and opportunities for remediation.
Toil Elimination and Automation: Build tools and automation to eliminate toil and reduce operational overhead. Create frameworks, processes, and best practices for use across Pinterest Engineering.
SLI/SLO Implementation: Build meaningful, insightful, and actionable SLIs (Service Level Indicators).
Process Automation: Automate critical portions of Pinterest’s engineering processes to minimize risk and maximize the speed of innovation.
Capacity Management: Manage capacity and performance to help scale the infrastructure across both public and private clouds around the world.

Required Experience and Technical Qualifications

The ideal candidate is a software-oriented SRE with experience in large-scale distributed systems, a strong background in Linux internals, and proficiency in modern programming and infrastructure technologies.

Experience (Required):
- 2+ years of experience programming using Python or Go.
- Strong knowledge of Linux/Unix/BSD internals and experience working with open source software (e.g., MySQL, Hadoop, Envoy, HAProxy, Nginx).
- Experience with technologies such as Elasticsearch, ZooKeeper, HBase, Hadoop, Memcached, and Kafka, with a focus on reliability, automation, operability, and performance.
Education: Bachelor’s degree in Computer Science or a related field, or equivalent experience.
Infrastructure Knowledge (A Plus):
- Infrastructure as Code (IaC) experience (e.g., Terraform, Puppet, Chef, Ansible, Salt, Fabric, Docker).
- Experience deploying web apps to cloud infrastructure (e.g., AWS) and working with distributed, service-oriented architectures.

Job Features

Job Category

Software Engineering
