Site Reliability Engineer
tooling — metrics, logging, tracing, and alerting (e.g., Cloud Monitoring, Datadog, or Prometheus/Grafana) Understanding...
AI pipelines using Langfuse and Datadog to debug failed agent runs, detect regressions, and maintain production health. Team... AI pipelines, and utilizing AI observability tooling (e.g., Langfuse, Datadog). You bring strong engineering skills in Python...
, Prometheus, Elastic, Datadog or similar. Oversee all planned outages, assess RCA and assist with major upgrades to ensure..., Grafana, Prometheus, Datadog, Dynatrace) to ensure system performance and reliability. Proven experience managing distributed...
, security groups PostgreSQL on Amazon RDS (~15 instances) Datadog + CloudWatch (APM, logs, alerting) Java microservices / API... or similar) Strong written English for escalation + post-incident write-ups Nice-to-have: Datadog / CloudWatch fluency AWS...
, security groups PostgreSQL on Amazon RDS (~15 instances) Datadog + CloudWatch (APM, logs, alerting) Java microservices / API...) Strong written English for escalation + post-incident write-ups Nice-to-have: Datadog / CloudWatch fluency AWS data services...
in using tools such as Datadog, Splunk, App Insights, New Relic, etc. Proven experience with CI/CD (Continuous Integration...
). Required: Strong AWS expertise; Terraform/Terragrunt/OpenTofu; Datadog for observability (logs, metrics, traces) to enable monitoring...