Bangalore, Karnataka, India
Information Technology
Full-Time
UST
Overview
Role Description
SRE, Python, Any Cloud
- Design, implement, and maintain scalable and highly reliable cloud infrastructure using cloud services such as Google Cloud's Compute Engine, Kubernetes Engine, Cloud Functions, and BigQuery.
- Write Python scripts to automate operations and deployment processes and to enhance system performance.
- Collaborate with engineering teams to improve system architecture, application deployment, and continuous integration/continuous deployment (CI/CD) pipelines.
- Develop and maintain system observability frameworks including logs, metrics, and tracing to ensure visibility into system health and performance.
- Implement and manage monitoring and alerting systems using tools like Prometheus, Grafana, or Stackdriver to ensure system reliability and uptime.
- Participate in on-call rotations to address production incidents and drive incident management and root cause analysis.
- Work on improving system performance, cost management, and security using GCP-native tools.
- Define and track SLOs (Service Level Objectives) and SLIs (Service Level Indicators) to ensure that systems meet reliability targets.
- Automate and streamline processes for system provisioning, configuration, and deployment.
- Conduct post-incident reviews to identify areas for improvement and prevent recurrence of issues.
Qualifications
- 3+ years of experience in Site Reliability Engineering (SRE), DevOps, or similar roles.
- Strong experience with Python programming, including automation, scripting, and system management tools.
- Hands-on experience with cloud services such as Compute Engine, Kubernetes Engine, Cloud Functions, and BigQuery.
- Strong understanding of containerization and orchestration tools, particularly Docker and Kubernetes.
- Proficiency in monitoring and alerting tools, such as Prometheus, Grafana, Stackdriver, or similar.
- Experience working with CI/CD tools and practices (e.g., GitLab, Jenkins).
- Solid understanding of system performance optimization, security, and cost management practices in the cloud.
- Strong knowledge of networking concepts, high-availability architectures, and system troubleshooting techniques.
- Experience with infrastructure automation and configuration management tools (e.g., Terraform, Ansible).
- Experience in production environment management, incident resolution, and on-call support.
- Good understanding of software development practices and experience collaborating with development teams to improve reliability.