Senior SRE Specialist
Full description
With over 20 years of market experience, we bring together technologists, creators and innovators in Europe, North and Latin America, and the Middle East. Join our international team and take on the mission of solving the advanced tech challenges of tomorrow!
What project we have for you:
We are a publicly traded FTSE 250 FinTech company that runs mobile, web and desktop platforms that help our clients trade stocks & shares, leveraged products, Futures & Options and Crypto.
Your team: The SRE Team comprises highly skilled software engineers dedicated to embedding performance and reliability into our trading platform. You’ll work with cutting-edge distributed systems handling high-throughput, low-latency trading operations that demand zero downtime.
As a Site Reliability Engineer, you’ll champion reliability patterns, improve observability, establish 24/7 operations, and drive operational excellence across our crypto trading platform infrastructure and associated applications.
What you will do:
System Reliability & 24/7 Operations:
This role excludes on-call support.
Implement comprehensive monitoring and observability using OpenTelemetry and distributed tracing
Establish and maintain 24/7 operational readiness including automated deployments, blue/green releases, and zero-downtime patching strategies
Define and track Service Level Objectives (SLOs) and Error Budgets for critical crypto trading services
Identify and eliminate single points of failure in distributed systems
Application Instrumentation & Observability:
Instrument Java applications with OpenTelemetry spans, metrics, and traces
Work hands-on with development teams to add observability to their code
Guide teams on implementing meaningful SLIs that reflect user experience
Technical Leadership & Enablement:
Partner with development teams on system design, capacity planning, and architectural reviews
Provide technical guidance and hands-on support for teams transitioning from traditional deployments to containerized infrastructure
Mentor developers on reliability patterns including circuit breakers, retry logic, and fault tolerance
Lead by example – write production code that demonstrates SRE best practices
Software Development & Automation:
Write clean, maintainable code in Java and Python following industry best practices
Build automation tools and CI/CD pipelines that embed reliability practices
Contribute to application codebases to implement instrumentation and reliability patterns
Apply software engineering discipline including version control, code reviews, and testing
What you need for this:
Java development experience – Must be able to read, write, and instrument Java code. Deep understanding of JVM internals and experience with complex distributed Java applications
Observability & Instrumentation – Hands-on experience with OpenTelemetry, distributed tracing concepts (spans, trace context propagation), and observability platforms such as Honeycomb, Datadog, Dynatrace, Splunk or Grafana. Strong understanding of OpenTelemetry Collector pipelines, including data transformation, enrichment, and labeling, use of processors (attributes, resource, transform, span, tail sampling), and propagation of custom business identifiers (e.g., customer/tenant/transaction IDs) across services to enable end-to-end trace correlation between heterogeneous systems, applications, and environments.
SLO/SLI Expertise – Proven experience defining SLOs based on SLIs, establishing error budgets, and working with development teams on reliability measurement
Reliability Patterns – Solid understanding of circuit breakers, retry logic, bulkheads, and other fault tolerance patterns
Cloud (AWS & Kubernetes Platform Engineering) – Strong hands-on experience with AWS as the primary cloud provider, including production workloads on Amazon EKS. Proven expertise in Kubernetes networking, covering ingress and egress controllers (e.g., ALB / NGINX / Envoy), service configuration and fine-tuning (requests/limits, HPA/VPA, pod disruption budgets, network policies), and traffic management. Demonstrated ability to investigate and optimize performance and reliability using metrics, logs, and traces, complemented by chaos engineering practices (fault injection, node/pod failures, network latency, dependency outages) to validate system resilience and high availability under real-world failure scenarios.
Message Brokers – Production experience with ActiveMQ, Kafka, or similar messaging systems
Containerization – Hands-on experience with container orchestration (Nomad experience is advantageous, Kubernetes acceptable)
CI/CD – Experience building and maintaining deployment pipelines, preferably with GitLab
Experience Requirements:
Track record in high-throughput production environments (financial services, trading platforms, or similar mission-critical systems preferred)
Demonstrated ability to improve system reliability and performance at scale
Experience working collaboratively with development teams to implement observability and reliability improvements
Strong troubleshooting skills in distributed systems environments
Core Competencies:
Systems thinking approach to problem-solving
Excellent communication skills for cross-functional collaboration and technical enablement
Ability to balance hands-on development work with operational responsibilities
Strong bias toward automation and eliminating manual toil
Comfortable working in a fast-paced environment with evolving requirements
What it’s like to work with us:
We are committed to being an equal opportunity employer, fostering equity, diversity, and inclusion. We welcome and celebrate the differences of all qualified applicants. Join our team for a career where your unique perspectives are not only valued but crucial to our success.