Site Reliability Engineering (SRE) is an engineering discipline that combines software development and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. At Goldman Sachs, SRE is responsible for the availability and reliability of our firm's most critical platform services and ensures they meet the requirements of our internal and external users. We also develop and operate the observability platforms that all other engineering teams use to make their services reliable. We look for engineers who are motivated to collaborate with other engineering teams and our businesses to build and run sustainable production systems, which can evolve and adapt to changes in our fast-paced, global business and regulatory environment.
How will you fulfil your potential?
- Balance feature development velocity and reliability with well-defined SLOs.
- Run the production environment by monitoring availability and taking a holistic view of system health.
- Drive the incident management process and support a blameless post-mortem culture.
- Partner with development teams to improve services via rigorous testing and release procedures.
- Participate in system design consulting, platform management, and capacity planning.
- Create sustainable systems and services through automation and uplifts.
- Champion reliability and resilience engineering practices and knowledge across the firm.
Basic Qualifications
- Minimum of 2 years of hands-on experience in Site Reliability Engineering, with a proven track record of building and maintaining highly available, scalable, and fault-tolerant systems at an enterprise level.
- BS degree in Computer Science or a related technical field involving coding and/or systems engineering.
- Proficiency in one or more of the following: Go, Python, C, C++, Java, Perl, Ruby or shell scripting.
- Experience with product engineering practices, algorithms, data structures, and software design, and/or experience with UNIX operating system internals and/or networking.
Preferred Qualifications
- Experience developing AI tools and working with cloud platforms.
- Experience with distributed systems design, maintenance, and troubleshooting.
- Hands-on experience with debugging and optimizing code, as well as automation.
- Strong interpersonal skills, drive, and ownership.
- Coding beyond simple scripts.
- Solving novel problems from first principles.
- Experience working in highly regulated financial services firms.
- Excellent people leadership skills, either as an engineering manager or as an individual contributor.