Site Reliability Engineer
Ref. no.: 49/6/2025/WM/91379
At Antal, we have been recruiting for over 20 years. Operating across 10 specialised divisions gives us a strong understanding of current industry trends. We define the specifics of each role precisely, identifying the key skills and qualifications it requires. Our mission is not only to find candidates whose competences match the requirements of a given role, but first and foremost to find positions that meet candidates’ expectations. Employment agency registration number: 496.
📍 Kraków (Hybrid – minimum 2 days/week in the office)
💼 Employment type: B2B
Are you looking for an opportunity to join a high-impact project in a global financial institution that invests heavily in cloud, AI, and DevOps? We're building a new Site Reliability Engineering (SRE) team in Kraków to support a mission-critical Counterparty Credit Risk (CCR) platform, and we're looking for experienced engineers to join the journey.
As part of this role, you'll contribute to the stability, scalability, and observability of a high-volume, distributed platform operating on both Google Cloud Platform and on-prem infrastructure.
What you’ll do:
- Ensure the reliability and high availability of production systems used in global credit risk management.
- Monitor, detect, and troubleshoot incidents in distributed systems running in cloud and hybrid environments.
- Implement observability tools (Grafana, Prometheus, Loki, etc.) and improve monitoring and alerting strategies.
- Lead root cause analysis (RCA) and post-incident reviews to improve resilience and operational efficiency.
- Collaborate with developers, DevOps engineers, and global support teams to implement SRE best practices.
- Contribute to CI/CD automation, deployment pipelines, and security/vulnerability remediation.
What you need to succeed in this role:
- 5+ years of experience supporting or developing distributed systems (Java-based environments preferred).
- Hands-on experience with monitoring and logging tools: Grafana, Prometheus, Loki, Splunk, etc.
- Solid understanding of Unix/Linux systems, cloud infrastructure (GCP preferred), and databases (RDBMS).
- Experience with CI/CD tooling (such as Ansible, Jenkins, or GitHub Actions) and with vulnerability management.
- Familiarity with job scheduling tools (e.g., Control-M or equivalent).
- Strong communication skills and the ability to drive technical discussions with multiple support teams.
- Experience working in Agile/Scrum teams.
What we offer:
- The chance to build and shape a new SRE team supporting a critical platform for global risk management.
- Work with a modern technology stack: Java, GCP, Apache Beam, Spring Boot, DevOps tooling.
- Hybrid working model with at least 2 days/week in our Kraków office.
- Stable, long-term project with excellent opportunities for growth and learning.
📩 Interested? Apply now and take the next step in your career with a team that’s redefining reliability at a global scale.