Technical Skills:
· Proficiency in Azure cloud services and tools.
. Enable embedded SRE model across squads
. SRE reviews for all changes goal is to implement 90% or more incident free changes
. Build instrumentation to have visibility and detect any issues from changes
. Preventive problem management – Identify opportunities to fix the issues permanently and track it for closure
. Work with business ops members to get issues fixed
. Toil Reduction
. Identify opportunities for automation to pre-empt or facilitate information flow to business teams
. Demonstrate strong programming skills and thorough knowledge of systems
· Strong knowledge of Infrastructure as Code (IaC) tools such as Azure Resource Manager (ARM) templates, Terraform, or Ansible.
· Experience with monitoring and logging tools (Azure Monitor, Log Analytics, Application Insights). · Expertise in deploying and managing Kubernetes clusters, particularly within Azure Kubernetes Service (AKS). · Familiarity with containerization tools (Docker). · Expertise in incident management and response. · Strong scripting skills (PowerShell, Python, or similar). Soft Skills: · Excellent problem-solving and analytical skills. · Strong communication and collaboration abilities. · Ability to work independently and as part of a team.
Roles & Responsibilities:
· Operational Strategy Development: Design and implement strategies to optimize operational processes and improve system reliability and performance within the Azure cloud environment.
· Infrastructure Management: Oversee the management and provisioning of Azure cloud infrastructure using tools like Azure Resource Manager (ARM) templates, Terraform, or Ansible.
· Kubernetes Management: Deploy, manage, and optimize Kubernetes clusters within Azure Kubernetes Service (AKS) to ensure high availability and scalability.
· Monitoring and Incident Management: Implement monitoring solutions and establish incident management protocols to ensure high availability and reliability of Azure services and Kubernetes clusters.
· Performance Optimization: Analyze system performance and implement improvements to enhance scalability and efficiency in the Azure cloud and Kubernetes environments.
· Collaboration: Work closely with development, QA, and operations teams to ensure seamless integration and delivery of software within the Azure and Kubernetes environments.
· Security Best Practices: Implement security best practices in operational processes, Azure infrastructure management, and Kubernetes cluster configurations.
· Documentation: Create and maintain comprehensive documentation for operational processes, Azure infrastructure configurations, Kubernetes deployments, and incident management procedures.
· Training and Mentorship: Provide training and mentorship to team members on Azure operational practices, Kubernetes management, tools, and methodologies.
Key Words:
· Certifications in relevant Azure technologies (e.g., Microsoft Certified: Azure Solutions Architect Expert, Microsoft Certified: Azure DevOps Engineer Expert).
· Experience in developing and delivering training programs.
· Knowledge of CI/CD tools and practices.
Role Descriptions: Site Reliability Engineering (SRE) Essential Skills: Site Reliability Engineering (SRE) Desirable Skills: Azure data factory| Databricks| Azure Synapse Analytics | Python| Sql | Pyspark Databricks
Keyword: ~Azure data factory| Databricks| Azure Synapse Analytics | Python| Sq l |Pyspark
