Site Reliability Engineering Resume Samples

The Guide To Resume Tailoring

Guide the recruiter to the conclusion that you are the best candidate for the site reliability engineering job. It’s actually very simple. Tailor your resume by picking relevant responsibilities from the examples below and then add your accomplishments. This way, you can position yourself in the best way to get hired.

Craft your perfect resume by picking job responsibilities written by professional recruiters

Pick from the thousands of curated job responsibilities used by the leading companies

Tailor your resume & cover letter with wording that best fits each job you apply for

Resume Builder

Create a Resume in Minutes with Professional Resume Templates

CHOOSE THE BEST TEMPLATE - Choose from 15 Leading Templates. No need to think about design details.
USE PRE-WRITTEN BULLET POINTS - Select from thousands of pre-written bullet points.
SAVE YOUR DOCUMENTS IN PDF FILES - Instantly download in PDF format or share a custom link.

Karolann Littel
40979 King Cove, Dallas, TX
Phone: +1 (555) 152 9940
Experience
Site Reliability Engineering Manager
Waelchi, Harber and Haley, Dallas, TX
  • Handle communication and provide transparency on major site issues to the executive management team and the rest of the SAP Hybris organization
  • Manage on-call rotations and provide inputs to your team and partners to sustain SLAs
  • Document root cause analysis reports and develop standard operating procedures
  • Work with the recruiting team to attract, onboard, and retain diverse top talent
  • Cultivate a culture of feedback to ensure that teams and individuals are collaborative, purposeful, and high performing
  • Troubleshoot applications and websites to resolve issues quickly per documented procedures
  • Collaborate with other leaders in the organization to ensure that Spotify is a safe, fun, and challenging place to work
Senior Manager, Site Reliability Engineering
Marvin Inc, Chicago, IL
  • Base knowledge of the W3C’s Web Content Accessibility Guidelines v2.0
  • Exposure to current tools used by people with disabilities (screen readers and other assistive technologies)
  • Participates in the management of full life cycle product development, including analysis and planning related to product development, launch and deployment. Assists peer product development organizations with launch readiness and post-launch triage and analysis
  • Conduct War Game exercises to simulate adverse conditions and situations in a controlled fashion. Collect and communicate learnings and areas of opportunity to advance Operational Stance and Preparedness
  • Demonstrates technical leadership and mentoring on the application of new technologies and systems management methodologies. Can tailor and adapt approach to build consensus and alignment across peer Operational and Engineering Groups
  • Monitors technical and engineering progress to ensure strategies, goals and objectives are met. Aligns operational plans with business objectives. Communicates changes to all affected personnel
  • Presents periodic updates to Senior Management on impairments, mitigation opportunities and progress
Director of Site Reliability Engineering
Rowe, Schroeder and Botsford, Chicago, IL (present)
  • Interface with Dev/QA/OPS teams to identify root causes and re-instrument triggers to prevent future network degradation and outages
  • Communicate effectively with other organizations on standards, services and system metrics. These organizations include, but are not limited to, Customer Care, Engineering, DevOps, OPS, Marketing, and Sales
  • Drive process and run book documentation to minimize mean-time-to-repair (MTTR) on network events, including processes on field dispatches, internal and external escalations, and vendor engagement
  • Provide leadership and managerial coaching to SRE management team
  • Manage engineers working with the engineering teams on back-end services and technologies such as Hadoop, HDFS, Memcached, Redis, Kubernetes, AWS, Java, Golang, Linux, etc
  • Troubleshoot issues across the entire stack: hardware, software, application and network
  • Represent the SRE organization in design reviews and operational readiness exercises for new and existing services
Education
Bachelor’s Degree in Computer Science
Rutgers University
Skills
  • Systematic problem solving approach, coupled with a strong sense of ownership and drive
  • Approachability
  • Knowledge of C#, ASP, SQL
  • Experience in cloud infrastructure and tooling
  • Understanding of Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way
  • Networking: knowledge and understanding of network theory, including protocols (TCP/IP, UDP, ICMP, etc), MAC addresses, IP packets, DNS, OSI layers, and load balancing
  • Experience with algorithms, data structures, complexity analysis and software design
  • Experience working in a Microsoft environment
  • Expertise in designing, analyzing and troubleshooting large-scale distributed systems
  • Experience using Agile methodologies

15 Site Reliability Engineering resume templates

1

Site Reliability Engineering Resume Examples & Samples

  • Work in engineering team to design, build, deploy, and maintain systems
  • Write scripts to monitor systems and automate tasks
  • Troubleshoot issues across the entire stack - hardware, software, application and network
  • Take part in a shared 24x7 on-call rotation
  • 4+ years of managing user facing applications at web scale
  • Familiarity with systems management tools such as Chef
  • Practical knowledge of scripting in various languages, including shell and Python
  • Experience supporting Hadoop, Scribe and Hive
  • Track record of practical problem solving, excellent communication, and documentation skills
  • Capable of leading technical teams through designs and implementations across an organization
  • Experience managing services at web scale
  • Knowledge of Postgres, Sensu, HAProxy, and S3/EMR applications
  • Ability to debug C++ and Python applications
2

Senior Manager, Site Reliability Engineering Resume Examples & Samples

  • Drive the adoption and implementation of Operational Best Practices and associated tooling to improve resiliency and reliability. Improve tooling and instrumentation to accelerate triage and remediation
  • Conduct War Game exercises to simulate adverse conditions and situations in a controlled fashion. Collect and communicate learnings and areas of opportunity to advance Operational Stance and Preparedness
  • Presents periodic updates to Senior Management on impairments, mitigation opportunities and progress
  • Evaluates various architectural solutions and implementations and supports development and deployment of solutions as determined by the SRE team
  • Identifies trends, services and/or capabilities that may be beneficial to product offerings. Manages and forecasts resource needs to meet departmental objectives. Recommends action plans or solutions
  • Ensures effective implementation of the department budget. Prepares financial statements and monthly forecasts and reports. Prepares and analyzes monthly financial performance and makes budget and new technology recommendations
3

Manager, Site Reliability Engineering Resume Examples & Samples

  • Responsible for developing and managing a team of engineers who are focused on Site and Service Reliability. This team is well versed in the application technologies and Platforms used to deliver an excellent email experience for Comcast customers
  • The SRE group is focused on improving the availability and responsiveness of internal and external components and Platforms through the application of engineering best practices, tooling and instrumentation advances and cross organizational coordination. The SRE team helps drive efforts to improve triage time and bring down MTTR (Mean Time to Repair) and provides follow-up support to provide mitigation in the future
  • This individual will manage a team which may include exempt and non-exempt employees. Employees will be a mix of local and remote workers which includes offshore outsourced resources. They will provide subject matter guidance to employees as required and serve as a point of escalation
  • Passionate and driven to improve the customer experience through solving problems which impede reliability, resiliency and responsiveness
  • Develops processes and procedures to drive departmental efficiencies, and assists in developing and meeting the departmental budget
  • Participates in the management of full life cycle product development, including analysis and planning related to product development, launch and deployment. Assists peer product development organizations with launch readiness and post-launch triage and analysis
  • Being proactive, evaluating multiple options and considering our customer's experience is key to our success
  • Effectively manage local and remote employees, including offshore resources. Create a culture of inclusiveness across all locations for direct and dependent teams
  • Document and detail areas of improvement to bolster architecture, design, technical requirements and service specifications. Present architecture, design, and technical choices to internal audiences
  • Demonstrates technical leadership and mentoring on the application of new technologies and systems management methodologies. Can tailor and adapt approach to build consensus and alignment across peer Operational and Engineering Groups
  • Monitors technical and engineering progress to ensure strategies, goals and objectives are met. Aligns operational plans with business objectives. Communicates changes to all affected personnel
  • Establishes and maintains productive relationships with peer organizations and equipment and software vendors
4

Manager, Site Reliability Engineering Resume Examples & Samples

  • Experience with Unix/Linux systems with scripting experience in Shell, Perl or Python
  • Strong knowledge of core protocols and tech such as: TCP/IP, HTTP, DNS, load balancers, distributed file systems, key-value and relational databases
  • Extensive experience with configuration management tools such as Puppet, Chef, Salt, or Ansible
  • Experience with specific software such as Hadoop, Kafka, Spark, CouchBase, and similar technologies is desirable, but the ability to quickly learn new technology is most important
  • Experience with container-based platforms such as Mesosphere, Docker, Kubernetes, and similar technologies
  • Capable of technical deep-dives into code, networking, systems, and storage with very bright, experienced engineers
  • Expertise in problem solving and analyzing global scale distributed systems
  • Identify gaps in processes, skills, tooling, technology choices and work with upper management to drive improvements within the organization
  • Generally requires 8-11 years related experience
5

Director of Site Reliability Engineering Resume Examples & Samples

  • Refine and maintain a 24x7 cloud-based production infrastructure supporting RingCentral’s SaaS platform and customers, to provide five-nines (99.999%) uptime and availability
  • Provide leadership and direction to SRE staff that are responsible for break-fix, uptime and reliability for core services, distribution, and customer access network elements and related interfaces
  • Interface with Dev/QA/OPS teams to identify root causes and re-instrument triggers to prevent future network degradation and outages
  • Communicate effectively with other organizations on standards, services and system metrics. These organizations include, but are not limited to, Customer Care, Engineering, DevOps, OPS, Marketing, and Sales
  • Drive process and run book documentation to minimize mean-time-to-repair (MTTR) on network events, including processes on field dispatches, internal and external escalations, and vendor engagement
  • Establish infrastructure and service uptime and availability metrics and devise ways to improve them
  • Provide leadership and managerial coaching to SRE management team
  • Drive appropriate metrics for measuring group and individual performance. Drive for automation of metrics and measurement systems for scalable performance management
  • Establish vendor relationships and direct activity to ensure vendor-related technical issues are appropriately escalated, managed and measured in accordance with proper guidelines
  • 10 - 15 years of experience managing operations teams (DevOps, Infrastructure, data centers, NOC, QA, DBA)
  • Excellent project management skills. Understands the differences between waterfall, agile, scrum and other project management approaches well enough to strike the right balance
  • Experience mentoring managers as well as line workers
  • Deep understanding of the software delivery process with the ability to implement and enforce that process across the organization
  • Experience with VMWare, AWS or any other relevant cloud technologies
  • Strong experience in planning, architecting and managing software development life cycle and best practices
  • Experience with complex networking technology including firewalls, VPN, routing, switching, load balancers, monitoring, security and DNS
  • Expert-level skills in Unix/Linux system and network administration and agile implementation of production systems
  • Experience architecting and managing customer facing large Big Data Hadoop platforms
  • Experience managing and coaching team of Technical Operations staff including system and network administrators
  • Ability to nurture and support a strong operations culture: customer/service focus; excellent technology; high quality implementations; self-motivated innovation and problem-solving
  • Proven experience in voice and data network management and engineering
  • Demonstrated Director-Level Management experience
  • Strong problem solving skills, demonstrated ability to make effective decisions
  • Strong oral communication and interpersonal skills
  • Proven ability to work effectively with customers, contractors, and vendors
  • Ability to manage multiple projects to completion
  • Ability to meet deadlines and work efficiently in a fast-paced environment
6

Site Reliability Engineering Lead Resume Examples & Samples

  • Extensive experience leading a team and running large critical applications
  • Strong hands-on skills in software engineering and IT operations
  • Experience supporting containers, container orchestration platforms
  • Experience operating applications on public and private cloud solutions
  • Experience with running large scale systems and meeting SLA expectations
  • 5+ years of relevant working experience and at least 2 years in a technical lead role
7

Site Reliability Engineering Manager, Search Resume Examples & Samples

  • You will have a maniacal focus on site uptime and service reliability
  • Automate and support Search services hosted on public cloud infrastructure that are reliable, efficient, and maintainable
  • Partner closely with the Search product development team using a strong devops mindset
  • Constantly improve operational processes and efficiency
  • Automate, Automate, Automate
  • Manage a team of site reliability engineers to deliver cloud infrastructure and production operations for Search services
  • Lead issue triage and resolution for complex production issues spanning the whole stack: code, OS, network, and storage
  • Automate production operations and continuous deployment processes
  • Implement comprehensive service monitoring to ensure uptime and performance, including synthetic, real user, system, application performance, volume or rate based, anomalies, dashboarding, etc
  • Define, measure, and meet key Service Level Objectives including availability, performance, incidents and chronic problems, site traffic and conversion, etc
  • Manage a team that owns 24x7x365 support for a website that never sleeps
  • Proven results-oriented leader with at least 4 years of management experience delivering significant web engineering and operations automation for large-scale, highly resilient websites
  • Experience building large scale, revenue generating web applications in a public cloud IaaS, including real world experience with at least one public cloud provider: Amazon Web Services, Google Cloud Platform, or Microsoft Azure
  • Experience building, scaling, and running large scale production operations for Search platforms including Apache Solr, Lucidworks, ElasticSearch, Endeca, and others
  • Track record of driving extremely high levels of availability for web services with resilient architecture, scalable infrastructure, technical operations automation, 360 degrees of application performance monitoring, and a highly trained operations staff
  • Experience implementing a continuous deployment tool chain for development teams including microservices architecture, full continuous integration automation, and canary and blue/green production deployments
  • Experience with automation tools such as Puppet, Chef, Ansible, SaltStack, etc
  • Experience with Application Performance Monitoring tools: New Relic, Dynatrace, AppDynamics, etc
  • Excellent communications, organization, and time management skills
  • Experience in attracting, engaging, and developing world-class technical talent
  • Able to create and drive strategic goals through accountability and execution excellence
  • Demonstrated experience in executing/delivering projects in a dynamic, fast-paced environment. Not afraid to multi-task
  • Project leadership ability
8

Site Reliability Engineering Resume Examples & Samples

  • Python preferred; other languages such as Ruby, Perl, Java, or C are also acceptable
  • Familiar with jQuery, Ajax, HTML, CSS, and JavaScript, with 2-5 years of experience in this area
  • Knowledge of Django, Bootstrap, Node.js
  • Knowledge of MySQL or another database
  • Knowledge of networking is a plus
  • Knowledge of Linux environment
  • Knowledge of Git or another source control system
  • Ability to express ideas clearly within the team and across other groups
9

Senior Site Reliability Engineering Manager Resume Examples & Samples

  • Have a maniacal focus on high availability, performance, scalability, and security of mission critical production services, 24x7x365
  • Attract, recruit, grow, and retain highly motivated Site Reliability Engineers
  • Plan and manage capex/opex budgets that improve and sustain reliability
  • Collaborate, empower, and share success with development teams using a strong devops mindset
  • Create and influence system design, standards, and processes that improve production reliability
  • 7+ years of management experience leading teams that ensure high availability for very large scale, revenue generating production environments
  • Highly collaborative. Focused on team success vs individual achievement
  • Skilled at communication to all levels of the organization: executives, customers, peers, and staff
  • Experience hiring, mentoring, and training a staff of engineers
  • Experience implementing production deployment automation with canary and blue/green capabilities
  • Experience with infrastructure as code, via immutable server architectures and/or infrastructure configuration management
10

Senior Manager, Site Reliability Engineering Resume Examples & Samples

  • Directly manage a large geographically distributed team of talented Site Reliability Engineers
  • Lead the DevOps process by working closely with Engineering Management in an agile process
  • Provide technical leadership for a hybrid private/public cloud enterprise solution
  • Drive best practices in Site Reliability Engineering and ensure a secure, scalable, performant, and highly available service
  • Collaborate with various internal teams to provide high-quality customer experience and support
  • Communicate effectively and present team progress to upper management
  • Relentlessly introduce new ways of improving & scaling the service
  • Oversee Service metrics and measurement
  • Responsible for automation, monitoring, alerting, & logging
  • Mentor, coach, and develop a high performing Site Reliability team
  • Work in a dynamic, fast-paced environment similar to a start-up company
11

IT Chapter Lead Site Reliability Engineering Resume Examples & Samples

  • Implement mandatory security standards
  • Write fully automated tests (e.g., unit-, functional-, non-functional- and integration testing)
  • Develop micro services and APIs
  • Implement user stories from backlog as agreed with the Product Owner, without spending time on work outside the backlog
  • Advise Product Owner and Infra Area Lead on resource planning
  • Provide regular feedback, coaching and mentoring to Chapter members and
  • Align individual and overall performance of the Chapter, and identify top performers with Infra Area Lead
  • Stay up to date on your topic and apply your learnings in an Infra NL context
  • Regularly organize alignment meetings on Chapter specific topics to update colleagues on new developments
  • Define standards and best practices for Chapter
  • Mastery in at least one programming language, Perl and Java are a plus
  • Scripting experience in at least one of the following: Ruby, Python, Bash, Powershell
  • Experience with virtualization environments and tools, e.g., VMware, CloudForms
12

Site Reliability Engineering Manager Resume Examples & Samples

  • Provide leadership for a team of engineers who own the reliability goals of uptime, scalability and performance
  • Build trust and alignment across teams and partners to collaborate effectively and achieve Uber’s goals
  • Dive deep into availability, performance and scalability issues/outages for services and provide technical leadership for immediate and proactive resolutions
  • Manage on-call rotations and provide inputs to your team and partners to sustain SLAs
13

Head of Site Reliability Engineering Resume Examples & Samples

  • Build and manage the team of engineers responsible for the PaaS production environment that lives and breathes automation
  • Manage the production environment for the PaaS, ensuring compliance to Service Level Agreement. Establish required metrics for the measurement of the service performance
  • Drive the onboarding of applications onto the production PaaS environment, ensuring all best practices are adhered to
  • Manage incidents related to the PaaS and coordinate incident response with platform operations and/or infrastructure operations
  • Drive continuous improvement in the management of the PaaS
  • Capacity and resilience management of the PaaS
  • Mentor and train others on the management of the PaaS
  • Experience working in an Agile environment with Cloud Native architecture and CI/CD
  • Experience designing and implementing monitoring and alerting, and metrics reporting systems
  • Ability to multi-task and prioritize effectively in a fast-paced environment
  • Strong problem solving and troubleshooting skills with the ability to exercise mature judgment
14

SRE Dev Engineer Site Reliability Engineering Resume Examples & Samples

  • Develop consumable, standardized infrastructure services (resilient, high quality, highly automated and up-to-date)
  • Support and educate DevOps teams and consumers on using the standardized infrastructure services (consumer is responsible for their own instances)
  • Manage all resources in version-controlled repositories (incl. code, scripts, configurations, artefacts, static resources)
  • Refactor and reuse existing code/modules/functionality
  • Participate rigorously in DevOps rituals (e.g. daily stand-ups, sprint planning, sprint review, retrospective, peer reviews)
  • Continuously improve yourself, your squad and the service
  • Mastery of at least one programming language; Java and/or .NET are a plus
  • Working knowledge of configuration tools like Puppet, Chef or Ansible
  • Experience with building, operating and maintaining complex and scalable systems
  • Solid foundation in Linux or Windows administration and troubleshooting
  • Proven experience with automation. Knowledge of configuration management tools like Puppet or Chef is a plus
  • Creative and not afraid to step outside of your comfort zone
15

Site Reliability Engineering Manager Alexa Resume Examples & Samples

  • Be responsible for the overall uptime and performance of critical Alexa cloud services
  • Manage departmental resources, staffing, mentoring, and enhancing and maintaining a best-of-class engineering team
  • Work with internationally distributed teams and manage 24x7 on call resources
  • Design, write, and deliver software to improve the reliability, scalability, capacity, and latency of Alexa services
  • Identify recurring problems and build the tools and processes to prevent problems from recurring
  • Identify and build monitoring and alarming solutions
  • 7+ years of experience building production software systems
  • Understanding of web services, web application development, SQL, REST/JSON
  • Knowledge and understanding of network theory and concepts such as TCP/IP, UDP, DNS, and load balancing
16

Director Site Reliability Engineering Resume Examples & Samples

  • Demonstrates a broad understanding of all technical areas through experience and demonstrated success
  • Oversight of the research and proposed solutions driving the stability and reliability of our products improving overall company quality and customer satisfaction
  • Establish and maintain relationships with peers and leaders and act as an internal resource for teams, business units and Sr. Leadership
  • Anticipates and researches industry trends and influences Executive Leadership on how they will impact the company. Contributes to the development of overall Technology strategy
  • Anticipates future direction of the organization and consults with Sr Leadership on how they may position commercial work to offer this perspective to clients and prospects
  • Attends technology-focused conferences and seminars to expand knowledge about futuristic products, services and concepts that will help to develop a point of view for the organization
  • Leads global and virtual teams. Considers regional and cultural norms and values diversity in all forms
  • Help advance the maturity and integration of DevOps tools and automation
  • Leverage business planning and production performance to architect stable solutions
  • Implement and Drive Service Improvement Plans to increase application availability
  • Ownership of P1 and P2 Production issues
  • Laser focus on the architecture at the product or enterprise level
  • End to End Product Performance
  • 10+ years of management experience
  • 10+ years of hands-on technical experience
  • 4+ years of hands-on development experience
  • 2+ years of experience as a practicing SRE or in a leadership role in an SRE organization
  • Experience running teams of 50+
  • Management experience in matrix organization
  • Prior experience in full-stack architecture
  • Experience leading matrix organizations and teams of teams
  • Ability to present technical topics at an executive level
17

Site Reliability Engineering Lead Resume Examples & Samples

  • 5+ years of experience with designing and managing services in a distributed, Internet-scale Linux environment
  • 4+ years of experience with scripting in Shell, Perl, or Python
  • 2+ years of experience with managing a team of system administrators or infrastructure engineers
  • 1+ years of experience with configuration management in Salt or an equivalent technology, including Ansible or Puppet
  • Knowledge of edge cases and risk mitigation strategies to be applied to every change
  • Ability to motivate technology and process innovations
  • Ability to quickly comprehend how code, processes, and systems fit together
  • Ability to disseminate technical details one minute and career feedback the next
  • Ability to advance multiple projects simultaneously
  • BS degree in CS or related technical field
  • CISSP, RHCE, or SaltStack Certification
18

Site Reliability Engineering Manager Resume Examples & Samples

  • Demonstrate at least 3 years of experience in operations team leadership for a SaaS, hosted or virtualised platform
  • Ability to effectively manage a team of up to 10 engineers, potentially across multiple shifts and locations
  • Experience working in a technical environment with technologies such as Python, AWS, Linux, Java
  • Show us how you set and maintain a high standard within operations teams, ensuring consistent and expedient response to events in mission critical systems
  • Expert in performance and people management with a focus on mentoring and motivating engineers
  • Ability to work with cross-functional delivery teams to improve your team's product/service
  • Be a change agent for the business
19

SRE Engineer Site Reliability Engineering Resume Examples & Samples

  • Deliver full software solutions to the consumer
  • Develop consumable, standardized software solutions (resilient, high quality, highly automated and up-to-date)
  • Develop an ecosystem of tools to provide self-service capabilities to the consumer
  • Apply continuous delivery practices
  • Build, enhance and maintain tooling and scripts to automate repetitive or error prone tasks
  • Improve the current WBS service offering
  • Get to know the services in WBS through analysis of the current level of service and the underlying causes
  • Formulate an improvement plan to gain a higher reliability level
  • Prioritise the improvement plan with the responsible DevOps teams
  • Recognise shortcomings and deliver the necessary resources and skills to DevOps teams where needed
  • Proven experience with automation
20

Site Reliability Engineering Resume Examples & Samples

  • Leadership (30%)
  • Bachelor’s degree in computer science, computer engineering or software engineering
  • 7+ years of software engineering experience
  • Advanced software development skills
  • Strong software architecture and design skills
  • Expertise in designing, analyzing and troubleshooting complex systems
  • Demonstrated systematic approach to problem solving
  • Experience working in a Microsoft environment
  • Knowledge of C#, ASP, SQL
  • Experience using Agile methodologies
  • Experience in monitoring, trending, and logging tools
  • Experience in cloud infrastructure and tooling
21

Director of Site Reliability Engineering Resume Examples & Samples

  • Manage engineers working with the engineering teams on back-end services and technologies such as Hadoop, HDFS, Memcached, Redis, Kubernetes, AWS, Java, Golang, Linux, etc
  • Troubleshoot issues across the entire stack: hardware, software, application and network
  • Understand technical architectures, capacity plans, tooling needs, automation plans, product launch plans, and other issues and create comprehensive plans for prioritizing technical and resourcing challenges
  • Partner with product management, program management, network engineering, technical support, and other related groups
  • Work closely with recruiting staff to expand the team, including sourcing candidates, interviewing candidates, participating in conferences/events, and onboarding new employees
  • Assess employee performance frequently, address under-performance, and recognize and promote excellent performance
  • Identify and drive opportunities to improve automation for the company; scope and create automation for deployment, management and visibility of our services
22

Site Reliability Engineering Manager Resume Examples & Samples

  • Set up and manage a group of highly motivated and highly skilled operations engineers (SREs) in a DevOps approach
  • Identify and hire strong candidates for the SRE jobs
  • Spending 50% of the time on hands-on work is a must
  • Guide/Mentor team members in troubleshooting application/web/system related issues
  • Support career development of your team through active coaching, mentoring and aligning opportunities with skillsets
  • Drive excellence for reliability through maintenance of SLAs, efficient process, automation development, engineering reliability back into applications and maximizing performance
  • Proactively monitor availability and performance of the SAP Hybris cloud products using the required toolset
  • Effectively respond to monitoring alerts, incident tickets, email requests, and requests from other channels coming in to the Site Reliability Engineering team
  • Escalate issues as needed to product development or service engineering team per documented procedures, while at the same time establishing a contingency plan to eliminate any intermittent service disruption
  • Handle communication and provide transparency on major site issues to the executive management team and the rest of the SAP Hybris organization
  • Document root cause analysis reports and develop standard operating procedures
  • Maintain the relationship with any relevant service providers (internal or external), keeping them accountable to the agreed SLAs
  • Fluency in English – verbal and written
  • Bachelor's Degree in Computer Science or equivalent technical experience
  • Exceptional skills as a multiplier to sustain a fast-paced environment
  • Good understanding of Unix systems fundamentals and system management tasks
  • Strong understanding of network concepts, TCP/IP stack and common Internet protocols
  • Attention to detail and accuracy and ability to spot long term trends in a production enterprise environment
  • Must be reliable and dependable with the ability to multi-task in a fast-paced environment
  • Effective team player able to work closely with peers and other operations or engineering teams
  • This role is contingent on the successful completion of a background check
  • 3 years’ experience managing a technical team of at least 10 people
  • 5 years of experience working within a Unix/Linux environment
  • Hands-on technical experience combined with strong management and communication skills
  • Experience working in a 24 x 7 cloud operations environment
  • Prior experience working with private and public IaaS providers is an advantage
23

Site Reliability Engineering Resume Examples & Samples

  • 5+ years of experience in software engineering (e.g. Java, Ruby, Node.js, and others)
  • Experience with DevOps processes and culture
  • Experience with DevOps tools (e.g. GoCD, GitHub, Ansible)
  • Comfortable writing Infrastructure as Code using the same code quality and engineering practices expected from core application developers (test-driven development, integration, security, and acceptance testing)
  • Understanding of web and database technologies, concepts and design elements of on-premises, cloud-based and hybrid architectures
  • Strong understanding of presentation tiers (Apache, Tomcat, Nginx, IIS)
  • Experience with containerized environments (Docker, DC/OS, Kubernetes)
  • Experience with at least one of the major cloud IaaS providers (AWS, Azure, Google)
  • Strong hands-on skills with server operating systems and environments (Linux, Windows)
  • Understanding of standard networking protocols and components such as: HTTP, DNS, TCP/IP, ICMP and Load Balancing
  • Knowledge of (shell) scripting languages (Python, PowerShell, Perl…)
  • Working experience with monitoring and data analysis tools (Splunk, Nagios, Prometheus, AppDynamics, ...)
24

Head of Site Reliability Engineering Resume Examples & Samples

  • Has the right mix of knowledge and skills in software (i.e. programming, data structures and algorithms), systems (i.e. operating software on internal and external infrastructure at scale), strategic thinking and most importantly, people
  • Participates in major incident resolution, driving recovery and handling executive-level communications
  • Ensures the Operations Site Reliability Organization dedicates at least 50% of its time to 'engineering for availability and efficiency'
  • Drives the Operations Site Reliability Organization to apply software and system engineering principles to positively impact uptime and operational efficiency
  • As a hands-on SRE Technical Director, you are expected to have some level of proficiency in the following
  • Relevant Work Experience of 10-15 years
  • Working globally with executive stakeholders and customers
  • Setting and executing strategic direction to transform our operations organization in APAC, ensuring alignment with our global strategy
  • Building strong teams through hiring, coaching, encouraging diversity, empowering and motivating others
  • Leading by example in technical areas, supporting your organization's career development, giving feedback and coaching
  • Extensive programming experience in an object-oriented programming language such as Java, C#, or C++
  • Systems configuration and administration: Windows or Linux
  • Analyzing and discovering how all components of a distributed system work together using a broad range of skills and tools
  • Applying an evidence-based approach to solving systems problems under pressure and in real time to provide the fastest path to service recovery
  • System and software configuration management using tools such as Puppet, Chef or Ansible
  • Cloud technologies and platforms such as AWS or Azure using API or configuration technologies
25

Site Reliability Engineering Manager Resume Examples & Samples

  • Lead a team of System Reliability Engineers responsible for supporting services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, automation, and release
  • Engage in and improve the complete lifecycle of services—from inception and design, through deployment, operation and refinement
  • Lead a team of System Reliability Engineers to scale systems sustainably through mechanisms such as automation, and evolve systems by pushing for changes that improve reliability and velocity
  • Build and grow the System Reliability team of engineers through recruitment, technical and non-technical training, internships, and exposure via community technical groups and gatherings
  • Bachelor's Degree in a related field (Computer Science, Computer Engineering, or related field) desired, or equivalent combination of education and experience
  • 3+ years of People Leadership experience that includes direct reports, coaching, and handling performance review cycles
  • Proven experience in leading technical staff with multiple disciplines in the DevOps Stack (servers, network, monitoring, elastic runtimes, load balancing, services, application)
  • 5 years of experience in developing or operating IT solutions
  • 2 years of experience in designing, delivering, and operating cloud and microservice solutions through the complete DevOps stack
  • Proven systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive
  • Travel or hospitality industry experience
  • Experience in designing, operating, and deploying IT solutions into AWS, Azure, or a similar public cloud provider