Site Reliability Engineering – Azure (Certified)

C2C
Minimum of [10] years of experience in cloud operations engineering or a related field, with a strong focus on Azure cloud infrastructure, Kubernetes, and site reliability engineering.

Technical Skills:

· Proficiency in Azure cloud services and tools.

. Enable embedded SRE model across squads

. SRE reviews for all changes goal is to implement 90% or more incident free changes

. Build instrumentation to have visibility and detect any issues from changes

. Preventive problem management – Identify opportunities to fix the issues permanently and track it for closure

. Work with business ops members to get issues fixed

. Toil Reduction

. Identify opportunities for automation to pre-empt or facilitate information flow to business teams

. Demonstrate strong programming skills and thorough knowledge of systems

· Strong knowledge of Infrastructure as Code (IaC) tools such as Azure Resource Manager (ARM) templates, Terraform, or Ansible.

· Experience with monitoring and logging tools (Azure Monitor, Log Analytics, Application Insights). · Expertise in deploying and managing Kubernetes clusters, particularly within Azure Kubernetes Service (AKS). · Familiarity with containerization tools (Docker). · Expertise in incident management and response. · Strong scripting skills (PowerShell, Python, or similar). Soft Skills: · Excellent problem-solving and analytical skills. · Strong communication and collaboration abilities. · Ability to work independently and as part of a team.

Roles & Responsibilities:

· Operational Strategy Development: Design and implement strategies to optimize operational processes and improve system reliability and performance within the Azure cloud environment.

· Infrastructure Management: Oversee the management and provisioning of Azure cloud infrastructure using tools like Azure Resource Manager (ARM) templates, Terraform, or Ansible.

· Kubernetes Management: Deploy, manage, and optimize Kubernetes clusters within Azure Kubernetes Service (AKS) to ensure high availability and scalability.

· Monitoring and Incident Management: Implement monitoring solutions and establish incident management protocols to ensure high availability and reliability of Azure services and Kubernetes clusters.

· Performance Optimization: Analyze system performance and implement improvements to enhance scalability and efficiency in the Azure cloud and Kubernetes environments.

· Collaboration: Work closely with development, QA, and operations teams to ensure seamless integration and delivery of software within the Azure and Kubernetes environments.

· Security Best Practices: Implement security best practices in operational processes, Azure infrastructure management, and Kubernetes cluster configurations.

· Documentation: Create and maintain comprehensive documentation for operational processes, Azure infrastructure configurations, Kubernetes deployments, and incident management procedures.

· Training and Mentorship: Provide training and mentorship to team members on Azure operational practices, Kubernetes management, tools, and methodologies.

Key Words:

· Certifications in relevant Azure technologies (e.g., Microsoft Certified: Azure Solutions Architect Expert, Microsoft Certified: Azure DevOps Engineer Expert).

· Experience in developing and delivering training programs.

· Knowledge of CI/CD tools and practices.

Role Descriptions: Site Reliability Engineering (SRE) Essential Skills: Site Reliability Engineering (SRE) Desirable Skills: Azure data factory| Databricks| Azure Synapse Analytics | Python| Sql | Pyspark Databricks
Keyword: ~Azure data factory| Databricks| Azure Synapse Analytics | Python| Sq l |Pyspark

Scroll to Top