Site Reliability Engineer

Location: Sofia

Full-time
Permanent position
Remote interview

Full description

Tradu is a new multi-asset global trading platform and part of the Stratos group of companies. Built by traders for traders, Tradu gives sophisticated traders a serious platform for moving easily between asset classes such as stocks, CFDs and crypto, depending on the regulations that govern the trader’s market.

At Tradu, we believe that talent knows no borders. We are a team of 650+ multilingual people around the globe. Our commitment to diversity, inclusion, and innovation extends across continents, creating a dynamic blend of skills and experiences that drives our success.

We are seeking an experienced Site Reliability Engineer (SRE) to join our technology team. The SRE will be responsible for ensuring the reliability, scalability, and performance of our systems and services, with a strong focus on AWS, automation, and infrastructure as code. This role blends software engineering with systems engineering, driving resilience and efficiency across our production platforms.

Key Responsibilities

Design, build, and maintain reliable, scalable, and performant systems across AWS-based and on-premises environments, with a cloud-first approach.
Implement monitoring, alerting, and observability solutions to ensure visibility into system health and application performance.
Automate operational tasks, deployments, and configuration management to reduce manual intervention and improve efficiency.
Participate in incident response and postmortem processes, driving improvements to system reliability and reducing mean time to recovery (MTTR).
Collaborate with development teams to embed reliability, performance, and scalability into the software development lifecycle.
Manage capacity planning, performance tuning, and cost optimization within AWS.
Ensure security, compliance, and audit requirements are met in all infrastructure and operational practices.

Qualifications

5+ years of hands-on experience in Site Reliability Engineering, DevOps, or related roles.
Strong background in Linux systems administration.
Proficiency in at least one programming/scripting language (Python, Go, Bash, etc.).
Deep experience with AWS services (EC2, ECS/EKS, RDS, S3, IAM, networking, etc.).
Proven expertise with configuration management tools such as Puppet, Chef, or Ansible, and with Terraform for infrastructure as code.
Strong knowledge of CI/CD pipelines and deployment automation (Jenkins, GitLab, or similar).
Hands-on experience with monitoring/observability tools (Prometheus, Grafana, ELK, Datadog, etc.).
Solid understanding of networking, load balancing, and DNS fundamentals.
Excellent problem-solving skills and ability to work effectively under pressure during incidents.

Preferred Skills

Experience with Kubernetes or other container orchestration systems.
Knowledge of service-level objectives (SLOs), SLIs, and error budgeting.
Background in financial systems or other mission-critical, high-availability environments.

Working Hours: 40/week, Monday–Friday. Hybrid: 3 days in-office.

Contract type: Labor contract with Stratos Support EAD

Please submit your CV in English. Only shortlisted candidates will be contacted for an interview.
