Head of Production Services Governance, Incident & Problem Management
Role Summary
The Head of Production Services Governance, Incident & Problem Management is accountable for the enterprise governance, standards, and performance of Technology Incident Management and Problem Management (including root cause analysis) across BNY's Platforms. This leader oversees a team that sets the operating model, drives consistent execution, improves quality and speed of restoration, and strengthens auditability and regulatory credibility. The role is the senior point of accountability for:
- Firm-wide incident/problem governance and ITIL-aligned standards
- High-severity incident command and communications frameworks
- End-to-end RCA quality and timeliness, including corrective/preventive actions
- Regulatory and client-facing incident narratives and responses
- Internal oversight engagement with groups such as ORR and ERO
- Automation and AI augmentation to modernize and scale incident/problem practices
This position partners closely with engineering, SRE/operations, cyber, resiliency, risk, compliance, and business stakeholders to ensure stability, transparency, and continuous improvement of production services.
Key Objectives
- Protect service availability and client experience by ensuring rapid restoration and disciplined incident handling.
- Improve resiliency and reduce repeat incidents through high-quality problem management, robust RCAs, and effective remediation governance.
- Strengthen governance and audit defensibility by ensuring consistent process adherence, evidence capture, and clear accountability.
- Modernize production governance through automation, AIOps capabilities, and AI-assisted workflows.
- Elevate operational excellence through measurable improvements in MTTR, recurrence, SLA adherence, and control effectiveness.
Primary Responsibilities
1) Enterprise Incident Management Governance (ITIL)
- Own the Incident Management practice and ensure it is implemented consistently across Platform Production Services and aligned to ITIL principles.
- Establish and maintain incident taxonomy, severity models, prioritization rules, escalation paths, and functional/organizational RACI.
- Define the Major Incident Management (MIM) framework: incident command roles, war-room orchestration, communications cadence, stakeholder engagement, and decision rights.
- Ensure end-to-end controls: accurate incident logging, categorization, impact assessment, timeline reconstruction, evidence retention, and closure criteria.
- Drive performance through standard KPIs (e.g., MTTA/MTTR, reopen rate, SLA compliance, major incident frequency, customer-impact minutes, incident backlog health).
2) Enterprise Problem Management & RCA Excellence (ITIL)
- Own the Problem Management practice including proactive problem identification, trending, and prevention of recurrence.
- Establish RCA standards (methodologies such as 5 Whys, fishbone, fault tree, "cause-trigger-control gap" framing) and ensure consistent quality across teams.
- Govern Corrective and Preventive Action (CAPA) management: remediation backlog, prioritization, due dates, owner accountability, and validation of effectiveness.
- Maintain governance for Known Errors and Workarounds, enabling faster recovery and better knowledge reuse.
- Drive systemic improvements by connecting incidents/problems to resiliency risks, architectural weaknesses, control gaps, and engineering quality.
3) Regulatory, Client, and Executive Communications & Responses
- Serve as accountable executive for regulatory responses and supervisory requests relating to incidents, outages, recovery actions, RCA findings, and resiliency improvements.
- Lead firm readiness for time-sensitive regulatory deliverables, ensuring accuracy, consistency, and defensible evidence.
- Coordinate and quality-assure client communications for impactful incidents (internal/external statements, timelines, cause, remediation, and prevention).
- Provide clear executive narratives and materials for senior leadership, risk committees, audit committees, and business stakeholders.
4) Oversight & Partnership Model (ORR, ERO, Risk, Audit, Compliance)
- Act as the primary interface to internal oversight groups (e.g., ORR, ERO, Operational Risk, Compliance, Internal Audit, and Technology Risk Management).
- Ensure incidents/problems are appropriately mapped to relevant governance constructs (e.g., operational risk events where applicable) with clear traceability.
- Lead continuous improvement of control coverage and evidence quality to support audits and examinations.
- Partner with Resiliency teams to connect operational learning to scenario testing, dependency mapping, recovery planning, and service resiliency metrics.
5) Standardization, Quality Assurance, and Continuous Improvement
- Build and run a Quality Management System for incident/problem practices: sampling, assurance reviews, coaching, playbooks, and maturity assessments.
- Develop and maintain standard artifacts (runbooks, major incident playbooks, comms templates, RCA templates, PIR guidance).
- Run Continual Improvement programs: trend analysis, "top drivers" remediation themes, performance benchmarking, and maturity roadmaps.
- Drive adoption of consistent tooling, workflows, and data standards across platforms.
6) Automation & AI Enablement (AIOps / Intelligent Operations)
This role is expected to use AI responsibly to improve speed, quality, and scale of incident/problem management while meeting security, privacy, and model-risk expectations. Key AI and automation outcomes include:
- AI-assisted triage: classification, routing, deduplication, and severity recommendation based on history and signals.
- Correlation and probable-cause insights using telemetry, topology, and change data to identify the likely blast radius and suspect components or changes.
- Automation for repetitive tasks: stakeholder updates, timeline capture, evidence packaging, and post-incident documentation generation.
- RCA acceleration: AI-supported timeline reconstruction, log summarization, anomaly explanation, and "similar incident" retrieval.
- Knowledge management uplift: automated drafting of knowledge articles/workarounds; improvement suggestions based on recurrence patterns.
- Establish governance for AI usage: model transparency, human-in-the-loop controls, data handling, audit logs, and bias/quality monitoring.
7) Leadership & Talent Development
- Lead and develop a high-performing team of incident/problem governance professionals (e.g., problem managers, automation analysts).
- Establish role clarity, training paths, and ITIL-aligned capability development.
- Foster a culture of calm, disciplined execution during crises and a post-incident learning culture focused on prevention, not blame.
Scope & Decision Rights
- Enterprise-level authority to define and enforce incident/problem standards and minimum controls.
- Authority to convene major incident response, direct escalations, and require timely executive updates.
- Authority to gate incident/problem closure based on quality criteria (documentation, evidence, RCA completeness, CAPA commitments).
- Joint governance with engineering/production leaders to prioritize remediation work and measure effectiveness.
Key Interfaces
- Platform Production Services leaders, SRE/Operations, Engineering, Architecture
- Cybersecurity Operations, Fraud/Financial Crime Technology (as relevant)
- Enterprise Resiliency Office (ERO)
- Office of Regulatory Relations (ORR)
- Operational Risk, Compliance, Legal, Privacy
- Internal Audit, Technology Risk Management
- Business/Product leadership and client coverage teams
Required Qualifications
- 10-15+ years in technology operations, SRE/production services, service management, or resiliency roles in complex enterprises; regulated financial services strongly preferred.
- Demonstrated leadership in Major Incident Management and Problem Management/RCA at enterprise scale.
- Strong command of ITIL practices (Incident, Problem, Monitoring & Event, Service Level, Change Enablement, Continual Improvement; familiarity with CMDB/Service Configuration is a plus).
- Proven experience driving process standardization, operating model change, and measurable performance improvements (e.g., MTTR reduction, recurrence reduction).
- Experience leading regulatory/audit-facing responses with strong evidence discipline and executive communication.
Preferred Qualifications / Certifications
- ITIL 4 Managing Professional (MP) and/or ITIL Strategic Leader (SL); ITIL Foundation at minimum.
- Familiarity with ISO/IEC 20000, NIST, and resiliency/operational risk expectations in financial services (helpful but not required).
- Experience with AIOps platforms/observability tooling (e.g., event correlation, log analytics, tracing, anomaly detection).
- Experience with Agile/DevOps/SRE operating models and integrating incident/problem practices into product/platform delivery.
Core Competencies (What "Great" Looks Like)
- Crisis leadership: calm command presence, structured decision-making, clear communications under pressure.
- Governance rigor: sets standards that are pragmatic, scalable, and audit-defensible.
- Analytical excellence: uses trends and data to drive prevention, not just restoration.
- Influence without friction: partners effectively with engineering leaders to get remediation done.
- Automation mindset: removes manual steps, improves quality through workflow and tooling.
- AI fluency with controls: leverages AI safely with strong human oversight and evidence trails.
Success Metrics (Illustrative)
- Reduced major incident frequency and customer-impact minutes (YoY).
- Improved MTTR/MTTA and decreased escalations due to better routing/triage.
- Increased RCA timeliness and quality scores, fewer incomplete RCAs, and higher on-time CAPA completion.
- Reduced repeat incidents driven by top recurring causes.
- Improved audit/regulatory outcomes: fewer findings, faster response cycles, higher evidence quality.
- Increased automation coverage: % of incidents with AI-assisted classification/correlation; reduction in manual documentation hours.