OpenText
Senior Site Reliability Engineer
- Design and maintain a highly available Amazon EKS cluster with Istio service mesh across multiple AWS Availability Zones, sustaining 99.99% uptime and using elastic pod and node autoscaling to balance performance and cost.
- Lead end-to-end observability and automated incident response using New Relic and PagerDuty, reducing production incident detection time by 40% and improving on-call responsiveness across distributed services.
- Own everything-as-code Kubernetes cluster deployments using Terraform and GitLab CI/CD pipelines, eliminating manual provisioning steps and reducing deployment lead time by over 60%.
- Design and maintain multi-cloud infrastructure across AWS and Azure using Terraform, ensuring consistent, repeatable environments and reducing configuration drift across dev, staging, and production.
- Build New Relic dashboards and alerting policies that support fast, data-driven troubleshooting and surface performance degradations before they impact end users.
- Collaborate with multiple engineering teams to diagnose and resolve production incidents across multi-layered microservices; contribute to blameless postmortems and implement preventive automation to eliminate recurring issues.
- Participate in the on-call rotation, owning incident response, triage, and resolution for production platform issues and driving down MTTR through runbooks and automated remediation.
- Lead technical screening and structured onboarding programs for new hires, shortening ramp-up time and improving team capability.
- Mentor and coach junior and intermediate engineers on Kubernetes, cloud infrastructure, and DevOps best practices, fostering a culture of knowledge sharing and continuous improvement.