Senior Site Reliability Engineer – (Development)

  • Full Time
  • Colombo

Sysco LABS Sri Lanka

Job description

Sysco LABS is the captive innovation arm of Sysco Corporation (NYSE: SYY), the world’s largest foodservice company. Sysco is a Fortune 500 company and the global leader in selling, marketing, and distributing food products to restaurants, healthcare and educational facilities, lodging establishments, and other customers who prepare meals away from home. Its family of products also includes equipment and supplies for the foodservice and hospitality industries. With more than 76,000 colleagues, the company operates 334 distribution facilities worldwide and serves approximately 730,000 customer locations. For fiscal year 2024, which ended July 1, 2024, the company generated sales of more than $78.8 billion.

Operating with the agility and tenacity of a tech startup, powered by the expertise of the industry leader, Sysco LABS is perfectly poised to transform one of the world’s largest industries.

Sysco LABS’s engineering teams, based in Colombo, Sri Lanka, and in Austin and Houston, TX, innovate across the entire foodservice journey, from the enterprise-grade technology that enables Sysco’s business, to the technology that revolutionizes the way Sysco connects with restaurants, to the technology that shapes the way those restaurants connect with customers.

Sysco LABS technology is present in the sourcing of food products, merchandising, storage and warehouse operations, order placement and pricing algorithms, the delivery of food and supplies to Sysco’s global network, the in-restaurant dining experience of the end customer, and much more.

We are seeking a Senior Site Reliability Engineer to drive technical operations, reliability, and incident management for the Sysco Shop platform. This role requires expertise in Root Cause Analysis (RCA), IT Service Management (ITSM), and problem-solving to ensure high-quality service delivery and operational excellence.

As a Senior SRE, you will resolve customer issues, enhance system resilience, automate operations, and improve monitoring. You’ll work closely with Engineering, DevOps, and Product teams to optimize reliability, availability, and incident response.

The Role:

Incident Management & Reliability
• Mature the Incident & Crisis Response process by enabling faster diagnosis, mitigation, and restoration while ensuring effective customer communication.
• Drive resolution of complex production issues by performing in-depth Root Cause Analysis (RCA) and implementing long-term corrective actions.
• Conduct postmortems for major incidents, identifying recurring patterns and driving process improvements.
• Eliminate toil by automating responses for failures, scaling events, and operational changes.
• Collaborate closely with Engineering teams for RCA investigations, fostering a knowledge-sharing culture and shifting knowledge left.

Monitoring & Observability
• Design and implement observability solutions to proactively monitor system health, detect anomalies, and improve overall reliability.
• Establish Service Health Monitoring frameworks and fine-tune alerts to reduce noise and improve signal accuracy.
• Investigate system anomalies and work with engineering teams to implement resilient solutions.

Performance & SLIs/SLOs
• Define and maintain SLIs (Service Level Indicators) and SLOs (Service Level Objectives), ensuring alignment with business and technical goals.
• Collaborate with testing teams to identify performance bottlenecks and enhance system resilience through chaos engineering.

Automation & Continuous Improvement
• Continuously improve and implement automation to enhance system performance, reliability, and efficiency.
• Mentor junior SREs, fostering a culture of reliability, automation, and operational excellence.

The Profile:
• Bachelor’s or Master’s degree in Software Engineering.
• 2-4 years of experience in Site Reliability Engineering, Technical Support, or Application Reliability for enterprise-level e-commerce or web applications.
• Strong expertise in Java and JavaScript; experience with Node.js, React, and Spring Boot is a plus.
• Hands-on experience with Terraform, Jenkins, GoCD, GitHub, and automation best practices.
• Strong experience with Docker, Kubernetes, AWS ECS, and AWS Fargate for scalable deployments.
• Proficiency in tools like OpenTracing, Prometheus, Grafana, and Datadog for system health monitoring and proactive issue detection.
• Deep understanding of networking, cloud-native architectures, and scalable distributed systems.
• Strong performance and functional troubleshooting skills, with expertise in Root Cause Analysis (RCA) and postmortems.
• Experience with relational (PostgreSQL) and NoSQL (Elasticsearch, MongoDB) databases.
• Experience with GraphQL and Kafka is a plus; Bash or Python scripting knowledge is beneficial for automation tasks.
• Proven experience in leading and mentoring technical teams, fostering a culture of reliability and operational excellence.

Benefits:
• US dollar-linked compensation
• Performance-based annual bonus
• Performance rewards and recognition
• Agile Benefits – special allowances for Health, Wellness & Academic purposes
• Paid birthday leave
• Team engagement allowance
• Comprehensive Health & Life Insurance Cover – extendable to parents and in-laws
• Overseas travel opportunities and exposure to client environments
• Hybrid work arrangement

Sysco LABS is an Equal Opportunity Employer.

To apply for this job, email your details to cv@ezjobs.online
