Site Reliability Engineering Resume Samples

The Guide To Resume Tailoring

Guide the recruiter to the conclusion that you are the best candidate for the site reliability engineering job. It’s actually very simple. Tailor your resume by picking relevant responsibilities from the examples below and then add your accomplishments. This way, you can position yourself in the best way to get hired.

Craft your perfect resume by picking job responsibilities written by professional recruiters

Pick from the thousands of curated job responsibilities used by the leading companies

Tailor your resume and cover letter with wording that best fits each job you apply for

Karolann Littel

40979 King Cove, Dallas

Phone: +1 (555) 152 9940

Experience

Site Reliability Engineering Manager

Waelchi, Harber and Haley

Dallas, TX

Handle communication and provide transparency on major site issues to the executive management team and the rest of the SAP Hybris organization
Manage on-call rotations and provide inputs to your team and partners to sustain SLAs
Document root cause analysis reports and develop standard operating procedures
Work with the recruiting team to attract, onboard, and retain diverse top talent
Cultivate a culture of feedback to ensure that teams and individuals are collaborative, purposeful, and high performing
Perform application and website troubleshooting to quickly resolve issues per documented procedures
Collaborate with other leaders in the organization to ensure that Spotify is a safe, fun, and challenging place to work

Senior Manager, Site Reliability Engineering

Marvin Inc

Chicago, IL

Base knowledge of the W3C’s Web Content Accessibility Guidelines v2.0
Exposure to current tools used by people with disabilities (screen readers and other assistive technologies)
Participates in the management of full life cycle product development to include analysis and planning related to product development, launch and deployment. Assist peer product development organizations with launch readiness and post-launch triage and analysis
Conduct War Game exercises to simulate adverse conditions and situations in a controlled fashion. Collect and communicate learnings and areas of opportunity to advance Operational Stance and Preparedness
Demonstrates technical leadership and mentoring on the application of new technologies and systems management methodologies. Can tailor and adapt approach to build consensus and alignment across peer Operational and Engineering Groups
Monitors technical and engineering progress to ensure strategies, goals and objectives are met. Aligns operational plans with business objectives. Communicates changes to all affected personnel
Presents periodic updates to Senior Management on impairments, mitigation opportunities and progress

Director of Site Reliability Engineering

Rowe, Schroeder and Botsford

Chicago, IL

Present

Interface with Dev/QA/OPS teams to perform root cause analysis and re-instrument triggers to prevent future network degradation and outages
Communicate effectively with other organizations on standards, services, and system metrics. Other organizations include, but are not limited to, Customer Care, Engineering, DevOps, OPS, Marketing, and Sales
Drive process and run book documentation to minimize mean-time-to-repair (MTTR) on network events, including processes on field dispatches, internal and external escalations, and vendor engagement
Provide leadership and managerial coaching to SRE management team
Manage engineers working with the engineering teams on back-end services and technologies such as Hadoop, HDFS, Memcached, Redis, Kubernetes, AWS, Java, Golang, and Linux
Troubleshoot issues across the entire stack: hardware, software, application and network
Represent the SRE organization in design reviews and operational readiness exercises for new and existing services

Education

Bachelor’s Degree in Computer Science

Rutgers University

Skills

Systematic problem solving approach, coupled with a strong sense of ownership and drive
Approachability
Knowledge of C#, ASP, SQL
Experience in cloud infrastructure and tooling
Understanding of Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way
Networking: knowledge and understanding of network theory, such as different protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing
Experience with algorithms, data structures, complexity analysis and software design
Experience working in a Microsoft environment
Expertise in designing, analyzing and troubleshooting large-scale distributed systems
Experience using Agile methodologies

Site Reliability Engineering Resume Examples & Samples

Work in engineering team to design, build, deploy, and maintain systems
Write scripts to monitor systems and automate tasks
Troubleshoot issues across the entire stack - hardware, software, application and network
Take part in a shared 24x7 on-call rotation
4+ years of managing user facing applications at web scale
Familiarity with systems management tools such as Chef
Practical knowledge of scripting in various languages, including shell and Python
Experience supporting Hadoop, Scribe and Hive
Track record of practical problem solving, excellent communication, and documentation skills
Capable of leading technical teams through designs and implementations across an organization
Experience managing services at web scale
Knowledge of Postgres, Sensu, HAProxy, and S3/EMR applications
Ability to debug C++ and Python applications

Senior Manager, Site Reliability Engineering Resume Examples & Samples

Drive the adoption and implementation of Operational Best Practices and associated tooling to improve resiliency and reliability. Improve tooling and instrumentation to accelerate triage and remediation
Conduct War Game exercises to simulate adverse conditions and situations in a controlled fashion. Collect and communicate learnings and areas of opportunity to advance Operational Stance and Preparedness
Presents periodic updates to Senior Management on impairments, mitigation opportunities and progress
Evaluates various architectural solutions and implementations and supports development and deployment of solutions as determined by the SRE team
Identifies trends, services and/or capabilities that may be beneficial to product offerings. Manages and forecasts resource needs to meet departmental objectives. Recommends action plans or solutions
Ensures effective implementation of the department budget. Prepares financial statements and monthly forecasts and reports. Prepares and analyzes monthly financial performance and makes budget and new technology recommendations

Manager, Site Reliability Engineering Resume Examples & Samples

Responsible for developing and managing a team of engineers who are focused on Site and Service Reliability. This team is well versed in the application technologies and Platforms used to deliver an excellent email experience for Comcast customers
The SRE group is focused on improving the availability and responsiveness of internal and external components and Platforms through the application of engineering best practices, tooling and instrumentation advances and cross organizational coordination. The SRE team helps drive efforts to improve triage time and bring down MTTR (Mean Time to Repair) and provides follow-up support to provide mitigation in the future
This individual will manage a team which may include exempt and non-exempt employees. Employees will be a mix of local and remote workers which includes offshore outsourced resources. They will provide subject matter guidance to employees as required and serve as a point of escalation
Passionate and driven to improve the customer experience through solving problems which impede reliability, resiliency and responsiveness
Develops processes and procedures to drive departmental efficiencies, and assists in developing and meeting the departmental budget
Participates in the management of full life cycle product development to include analysis and planning related to product development, launch and deployment. Assist peer product development organizations with launch readiness and post-launch triage and analysis
Being proactive, evaluating multiple options and considering our customer's experience is key to our success
Effectively manage local and remote employees, including offshore resources. Create a culture of inclusiveness across all locations for direct and dependent teams
Document and detail areas of improvement to bolster architecture, design, technical requirements and service specifications. Present architecture, design, and technical choices to internal audiences
Demonstrates technical leadership and mentoring on the application of new technologies and systems management methodologies. Can tailor and adapt approach to build consensus and alignment across peer Operational and Engineering Groups
Monitors technical and engineering progress to ensure strategies, goals and objectives are met. Aligns operational plans with business objectives. Communicates changes to all affected personnel
Establishes and maintains productive relationships with peer organizations and equipment and software vendors

Manager, Site Reliability Engineering Resume Examples & Samples

Experience with Unix/Linux systems with scripting experience in Shell, Perl or Python
Strong knowledge of core protocols and tech such as: TCP/IP, HTTP, DNS, load balancers, distributed file systems, key-value and relational databases
Extensive experience with configuration management tools such as Puppet, Chef, Salt, or Ansible
Experience with specific software such as Hadoop, Kafka, Spark, CouchBase, and similar technologies is desirable, but the ability to quickly learn new technology is most important
Experience with container-based platforms such as Mesosphere, Docker, Kubernetes, and similar technologies
Capable of technical deep-dives into code, networking, systems, and storage with very bright, experienced engineers
Expertise in problem solving and analyzing global scale distributed systems
Identify gaps in processes, skills, tooling, technology choices and work with upper management to drive improvements within the organization
Generally requires 8-11 years related experience

Director of Site Reliability Engineering Resume Examples & Samples

Refine and maintain a 24x7 production cloud-based infrastructure supporting RingCentral’s SaaS platform and customers, to provide five-nines (99.999%) uptime and availability
Provide leadership and direction to SRE staff who are responsible for break-fix, uptime and reliability for core services, distribution, and customer access network elements and related interfaces
Interface with Dev/QA/OPS teams to perform root cause analysis and re-instrument triggers to prevent future network degradation and outages
Communicate effectively with other organizations on standards, services, and system metrics. Other organizations include, but are not limited to, Customer Care, Engineering, DevOps, OPS, Marketing, and Sales
Drive process and run book documentation to minimize mean-time-to-repair (MTTR) on network events, including processes on field dispatches, internal and external escalations, and vendor engagement
Establish infrastructure and service uptime and availability metrics and devise ways to improve them
Provide leadership and managerial coaching to SRE management team
Drive appropriate metrics for measuring group and individual performance. Drive for automation of metrics and measurement systems for scalable performance management
Establish vendor relationships and direct activity to ensure vendor-related technical issues are appropriately escalated, managed and measured in accordance with proper guidelines
10 - 15 years of experience managing operations teams (DevOps, Infrastructure, data centers, NOC, QA, DBA)
Excellent project management skills. Understands the differences between waterfall, agile, scrum and other project management approaches, and can strike the right balance among them
Experience mentoring managers as well as line workers
Deep understanding of the software delivery process with the ability to implement and enforce that process across the organization
Experience with VMWare, AWS or any other relevant cloud technologies
Strong experience in planning, architecting and managing software development life cycle and best practices
Experience with complex networking technology including firewalls, VPN, routing, switching, load balancers, monitoring, security and DNS
Have expert level skills in Unix/Linux system and network administration and agile implementation of production systems
Experience architecting and managing customer facing large Big Data Hadoop platforms
Experience managing and coaching a team of Technical Operations staff, including system and network administrators
Ability to nurture and support a strong operations culture: customer/service focus; excellent technology; high quality implementations; self-motivated innovation and problem-solving
Proven experience in voice and data network management and engineering
Demonstrated Director-Level Management experience
Strong problem solving skills, demonstrated ability to make effective decisions
Strong oral communication and interpersonal skills
Proven ability to work effectively with customers, contractors, and vendors
Ability to manage multiple projects to completion
Ability to meet deadlines and deal efficiently with a fast-paced environment

Site Reliability Engineering Lead Resume Examples & Samples

Extensive experience leading a team and running large critical applications
Strong hands-on skills in software engineering and IT operations
Experience supporting containers, container orchestration platforms
Experience operating applications on public and private cloud solutions
Experience with running large scale systems and meeting SLA expectations
5+ years of relevant working experience and at least 2 years in a technical lead role

Site Reliability Engineering Manager, Search Resume Examples & Samples

You will have a maniacal focus on site uptime and service reliability
Automate and support Search services hosted on public cloud infrastructure that are reliable, efficient, and maintainable
Partner closely with the Search product development team using a strong devops mindset
Constantly improve operational processes and efficiency
Automate, Automate, Automate
Manage a team of site reliability engineers to deliver cloud infrastructure and production operations for Search services
Lead issue triage and resolution for complex production issues spanning the whole stack: code, OS, network, and storage
Automate production operations and continuous deployment processes
Implement comprehensive service monitoring to ensure uptime and performance, including synthetic, real user, system, application performance, volume or rate based, anomalies, dashboarding, etc
Define, measure, and meet key Service Level Objectives including availability, performance, incidents and chronic problems, site traffic and conversion, etc
Manage a team that owns 24x7x365 support for a website that never sleeps
Proven results-oriented leader with at least 4 years of management experience delivering significant web engineering and operations automation for large-scale, highly resilient web sites
Experience building large scale, revenue generating web applications in a public cloud IaaS, including real world experience with at least one public cloud provider: Amazon Web Services, Google Cloud Platform, or Microsoft Azure
Experience building, scaling, and running large scale production operations for Search platforms including Apache Solr, Lucidworks, ElasticSearch, Endeca, and others
Track record of driving extremely high levels of availability for web services with resilient architecture, scalable infrastructure, technical operations automation, 360 degrees of application performance monitoring, and a highly trained operations staff
Experience implementing a continuous deployment tool chain for development teams including microservices architecture, full continuous integration automation, and canary and blue/green production deployments
Experience with automation tools such as Puppet, Chef, Ansible, Salt Stack, etc
Experience with Application Performance Monitoring tools: New Relic, Dynatrace, Appdynamics, etc
Excellent communications, organization, and time management skills
Experience in attracting, engaging, and developing world-class technical talent
Able to create and drive strategic goals through accountability and execution excellence
Demonstrated experience in executing/delivering projects in a dynamic, fast-paced environment. Not afraid to multi-task
Project leadership ability

Site Reliability Engineering Resume Examples & Samples

Python preferred; other languages such as Ruby, Perl, Java, or C are acceptable
Familiar with jQuery, Ajax, HTML, CSS, and JavaScript, with 2-5 years of experience in this area
Knowledge of Django, Bootstrap, Node.js
Knowledge of MySQL or another database
Knowledge of networking is a plus
Knowledge of Linux environment
Knowledge of Git or another source control system
Ability to express ideas clearly within the team and across other groups

Senior Site Reliability Engineering Manager Resume Examples & Samples

Have a maniacal focus on high availability, performance, scalability, and security of mission critical production services, 24x7x365
Attract, recruit, grow, and retain highly motivated Site Reliability Engineers
Plan and manage capex/opex budgets that improve and sustain reliability
Collaborate, empower, and share success with development teams using a strong devops mindset
Create and influence system design, standards, and processes that improve production reliability
7+ years of management experience leading teams that ensure high availability for very large scale, revenue generating production environments
Highly collaborative. Focused on team success vs individual achievement
Skilled at communication to all levels of the organization: executives, customers, peers, and staff
Experience hiring, mentoring, and training a staff of engineers
Experience implementing a production deployment automation with canary and blue/green capabilities
Experience with infrastructure as code, either via immutable server architectures and/or infrastructure configuration management

Senior Manager, Site Reliability Engineering Resume Examples & Samples

Directly manage a large geographically distributed team of talented Site Reliability Engineers
Lead the DevOps process by working closely with Engineering Management in an agile process
Provide technical leadership for a hybrid private/public cloud enterprise solution
Drive best practices in Site Reliability Engineering and ensure a secure, scalable, performant, and highly available service
Collaborate with various internal teams to provide a high-quality customer experience and support
Communicate effectively and present team progress to upper management
Relentlessly introduce new ways of improving & scaling the service
Oversee Service metrics and measurement
Responsible for automation, monitoring, alerting, & logging
Mentor, coach, and develop a high performing Site Reliability team
Work in a dynamic, fast-paced environment similar to a start-up company

IT Chapter Lead Site Reliability Engineering Resume Examples & Samples

Implement mandatory security standards
Write fully automated tests (e.g., unit-, functional-, non-functional- and integration testing)
Develop micro services and APIs
Implement user stories from backlog as agreed with the Product Owner, without spending time on work outside the backlog
Advise Product Owner and Infra Area Lead on resource planning
Provide regular feedback, coaching and mentoring to Chapter members and
Align individual and overall performance of the Chapter, and identify top performers with Infra Area Lead
Stay up to date on your topic and apply your learnings in an Infra NL context
Regularly organize alignment meetings on Chapter specific topics to update colleagues on new developments
Define standards and best practices for Chapter
Mastery in at least one programming language, Perl and Java are a plus
Scripting experience in at least one of the following: Ruby, Python, Bash, Powershell
Experience with virtualization environments and tools, e.g., VMware, CloudForms

Site Reliability Engineering Manager Resume Examples & Samples

Provide leadership for a team of engineers who own the reliability goals of uptime, scalability and performance
Build trust and alignment to collaborate effectively across teams and partners to achieve Uber’s goals
Dive deep into availability, performance and scalability issues/outages for services and provide technical leadership for immediate and proactive resolutions
Manage on-call rotations and provide inputs to your team and partners to sustain SLAs

Head of Site Reliability Engineering Resume Examples & Samples

Build and manage the team of engineers responsible for the PaaS production environment that lives and breathes automation
Manage the production environment for the PaaS, ensuring compliance to Service Level Agreement. Establish required metrics for the measurement of the service performance
Drive the onboarding of applications onto the production PaaS environment, ensuring all best practices are adhered to
Manage incidents related to the PaaS and coordinate incident response with platform operations and/or infrastructure operations
Drive continuous improvement in the management of the PaaS
Capacity and resilience management of the PaaS
Mentor and train others on the management of the PaaS
Experience working in an Agile environment with Cloud Native architecture and CI/CD
Experience designing and implementing monitoring and alerting, and metrics reporting systems
Ability to multi-task and prioritize effectively in a fast-paced environment
Strong problem solving and troubleshooting skills with the ability to exercise mature judgment

SRE Dev Engineer Site Reliability Engineering Resume Examples & Samples

Develop consumable, standardized infrastructure services (resilient, high quality, highly automated and up-to-date)
Support and educate DevOps teams and consumers on using the standardized infrastructure services (consumer is responsible for their own instances)
Manage all resources in version-controlled repositories (incl., code, scripts, configurations, artefacts, static resources)
Refactor and reuse existing code/modules/functionality
Participate rigorously in DevOps rituals (e.g. daily stand-ups, sprint planning, sprint review, retrospective, peer reviews)
Continuously improve yourself, your squad and the service
Mastery of at least one programming language; Java and/or .NET are a plus
Working knowledge of configuration tools like Puppet, Chef or Ansible
Experience with building, operating and maintaining complex and scalable systems
Solid foundation in Linux or Windows administration and troubleshooting
Proven experience with automation. Knowledge of configuration management tools like Puppet or Chef is a plus
Creative and not afraid to step outside of your comfort zone

Site Reliability Engineering Manager Alexa Resume Examples & Samples

Be responsible for the overall uptime and performance of critical Alexa cloud services
Manage departmental resources, staffing, mentoring, and enhancing and maintaining a best-of-class engineering team
Work with internationally distributed teams and manage 24x7 on call resources
Design, write, and deliver software to improve the reliability, scalability, capacity, and latency of Alexa services
Identify recurring problems and build the tools and processes to prevent problems from recurring
Identify and build monitoring and alarming solutions
7+ years of experience building production software systems
Understanding of web services, web application development, SQL, REST/JSON
Knowledge and understanding of network theory and concepts such as TCP/IP, UDP, DNS, and load balancing

Director Site Reliability Engineering Resume Examples & Samples

Demonstrates a broad understanding of all technical areas through experience and demonstrated success
Oversight of the research and proposed solutions driving the stability and reliability of our products improving overall company quality and customer satisfaction
Establish and maintain relationships with peers and leaders and act as an internal resource for teams, business units and Sr. Leadership
Anticipates and researches industry trends and influences Executive Leadership on how they will impact the company. Contributes to the development of overall Technology strategy
Anticipates future direction of the organization and consults with Sr Leadership on how they may position commercial work to offer this perspective to clients and prospects
Attends technology-focused conferences and seminars to expand knowledge about futuristic products, services and concepts that will help to develop a point of view for the organization
Leads global and virtual teams. Considers regional and cultural norms and values diversity in all forms
Help advance the maturity and integration of DevOps tools and automation
Leverage business planning and production performance to architect stable solutions
Implement and Drive Service Improvement Plans to increase application availability
Ownership of P1 and P2 Production issues
Laser focus on the architecture at the product or enterprise level
End to End Product Performance
10+ years of management experience
10+ years of hands-on technical experience
4+ years of hands-on development experience
2+ years of experience as a practicing SRE or in a leadership role in an SRE organization
Experience running teams of 50+
Management experience in matrix organization
Prior experience in full-stack architecture
Experience leading matrix organizations and teams of teams
Ability to present technical topics at an executive level

Site Reliability Engineering Lead Resume Examples & Samples

5+ years of experience with designing and managing services in a distributed, Internet-scale Linux environment
4+ years of experience with scripting in Shell, Perl, or Python
2+ years of experience with managing a team of system administrators or infrastructure engineers
1+ years of experience with configuration management in Salt or an equivalent technology, including Ansible or Puppet
Knowledge of edge cases and risk mitigation strategies to be applied to every change
Ability to motivate technology and process innovations
Ability to quickly comprehend how code, processes, and systems fit together
Ability to disseminate technical details one minute and career feedback the next
Ability to advance multiple projects simultaneously
BS degree in CS or related technical field
CISSP, RHCE, or Saltstack Certification

Site Reliability Engineering Manager Resume Examples & Samples

Demonstrate at least 3 years of experience in operations team leadership for a SaaS, hosted or virtualised platform
Ability to effectively manage a team of up to 10 engineers, potentially across multiple shifts and locations
Experience working in a technical environment with technologies such as Python, AWS, Linux, Java
Show us how you set and maintain a high standard within operations teams, ensuring consistent and expedient response to events in mission critical systems
Expert in performance and people management with a focus on mentoring and motivating engineers
Ability to work with cross-functional delivery teams to improve your team's product or service
Be a change agent for the business

SRE Engineer Site Reliability Engineering Resume Examples & Samples

Deliver full software solutions to the consumer
Develop consumable, standardized software solutions (resilient, high quality, highly automated and up-to-date)
Develop an ecosystem of tools to provide self-service capabilities to the consumer
Apply continuous delivery practices
Build, enhance and maintain tooling and scripts to automate repetitive or error prone tasks
Improve the current WBS service offering
Get to know the services in WBS through analysis of the current level of service and the underlying causes
Formulate an improvement plan to gain a higher reliability level
Prioritise the improvement plan with the responsible DevOps teams
Recognise shortcomings and deliver the necessary resources and skills to DevOps teams where needed
Proven experience with automation

Site Reliability Engineering Resume Examples & Samples

Leadership (30%)
Bachelor’s degree in computer science, computer engineering or software engineering
7+ years of software engineering experience
Advanced software development skills
Strong software architecture and design skills
Expertise in designing, analyzing and troubleshooting complex systems
Demonstrated systematic approach to problem solving
Experience working in a Microsoft environment
Knowledge of C#, ASP, SQL
Experience using Agile methodologies
Experience in monitoring, trending, and logging tools
Experience in cloud infrastructure and tooling

Director of Site Reliability Engineering Resume Examples & Samples

Manage engineers working with the engineering teams on back-end services and technologies such as Hadoop, HDFS, Memcached, Redis, Kubernetes, AWS, Java, Golang, and Linux
Troubleshoot issues across the entire stack: hardware, software, application and network
Understand technical architectures, capacity plans, tooling needs, automation plans, product launch plans, and other issues and create comprehensive plans for prioritizing technical and resourcing challenges
Partner with product management, program management, network engineering, technical support, and other related groups
Work closely with recruiting staff to expand the team, including sourcing candidates, interviewing candidates, participating in conferences/events, and onboarding new employees
Assess employee performance frequently, address under-performance, and recognize and promote excellent performance
You will identify and drive opportunities to improve automation for the company; scope and create automation for deployment, management and visibility of our services

Site Reliability Engineering Manager Resume Examples & Samples

Set up and manage a group of highly motivated and highly skilled operations engineers (SREs) using a DevOps approach
Identify and hire strong candidates for the SRE jobs
50% hands on is a must
Guide/Mentor team members in troubleshooting application/web/system related issues
Support career development of your team through active coaching, mentoring and aligning opportunities with skillsets
Drive excellence for reliability through maintenance of SLAs, efficient process, automation development, engineering reliability back into applications and maximizing performance
Proactively monitor availability and performance of the SAP Hybris cloud products using the required toolset
Effectively respond to Monitoring alerts, incident tickets, email requests or other channels coming in to Site Reliability Engineering team
Escalate issues as needed to product development or service engineering team per documented procedures, while at the same time establishing a contingency plan to eliminate any intermittent service disruption
Handle communication and provide transparency on major site issues to the executive management team and the rest of the SAP Hybris organization
Document root cause analysis reports and develop standard operating procedures
Maintain the relationship with any relevant service providers (internal or external), keeping them accountable to the agreed SLAs
Fluency in English – verbal and written
Bachelor's Degree in Computer Science or equivalent technical experience
Exceptional skills as a multiplier to sustain a fast-paced environment
Good understanding of Unix systems fundamentals and system management tasks
Strong understanding of network concepts, TCP/IP stack and common Internet protocols
Attention to detail and accuracy, and ability to spot long-term trends in a production enterprise environment
Must be reliable and dependable, with ability to multi-task in a fast-paced environment
Effective team player able to work closely with peers and other operations or engineering teams
This role is contingent on the successful completion of a background check
3 years' experience managing a technical team of at least 10 people
5 years of experience working within a Unix/Linux environment
Hands-on technical experience combined with strong management and communication skills
Experience working in a 24 x 7 cloud operations environment
Prior experience working with private and public IaaS providers is an advantage

Site Reliability Engineering Resume Examples & Samples

5+ years of experience in software engineering (e.g. Java, Ruby, Node.js, and others)
Experience with DevOps processes and culture
Experience with DevOps tools (e.g. GoCD, GitHub, Ansible)
Comfortable in writing Infrastructure as Code using the same code quality & engineering practices expected from core application developers (Test driven development, Integration, Security, Acceptance testing)
Understanding of web and database technologies, concepts, and design elements of on-premises, cloud-based, and hybrid architectures
Strong understanding of presentation tiers (Apache, Tomcat, Nginx, IIS)
Experience with containerized environments (Docker, DC/OS, Kubernetes)
Experience with at least one of the major cloud IaaS providers (AWS, Azure, Google)
Strong hands-on skills with server operating systems and environments (Linux, Windows)
Understanding of standard networking protocols and components such as: HTTP, DNS, TCP/IP, ICMP and Load Balancing
Knowledge of (shell) scripting languages (Python, PowerShell, Perl…)
Working experience with monitoring and data analysis tools (Splunk, Nagios, Prometheus, AppDynamics,...)

Head of Site Reliability Engineering Resume Examples & Samples

Has the right mix of knowledge and skills in software (i.e. programming, data structures and algorithms), systems (i.e. operating software on internal and external infrastructure at scale), strategic thinking and most importantly, people
Participates in major incident resolution; driving recovery and handling executive level communications
Ensures the Operations Site Reliability Organization dedicates at least 50% of its time to 'engineering for availability and efficiency'
Drives the Operations Site Reliability Organization to apply software and system engineering principles to positively impact uptime and operational efficiency
As a hands-on SRE Technical Director, you are expected to have some level of proficiency in the following:
Relevant Work Experience of 10-15 years
Working globally with executive stakeholders and customers
Setting and executing strategic direction to transform our operations organization in APAC, ensuring alignment with our global strategy
Building strong teams through hiring, coaching, encouraging diversity, empowering and motivating others
Leading by example in technical areas, supporting your organization's career development, giving feedback and coaching
Extensive programming experience in an object-oriented programming language such as Java, C#, or C++
Systems configuration and administration: Windows or Linux
Analyzing and discovering how all components of a distributed system work together using a broad range of skills and tools
Applying an evidence-based approach to solving systems problems under pressure and in real time to provide the fastest path to service recovery
System and software configuration management using tools such as Puppet, Chef or Ansible
Cloud technologies and platforms such as AWS or Azure using API or configuration technologies

Site Reliability Engineering Manager Resume Examples & Samples

Lead a team of System Reliability Engineers responsible for supporting services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, automation, and release
Engage in and improve the complete lifecycle of services—from inception and design, through deployment, operation and refinement
Lead a team of System Reliability Engineers to scale systems sustainably through mechanisms such as automation, and evolve systems by pushing for changes that improve reliability and velocity
Build and grow the System Reliability team of engineers through recruitment, technical and non-technical training, internships, and exposure via community technical groups and gatherings
Bachelor's Degree in a related field (Computer Science, Computer Engineering, or related field) desired, or equivalent combination of education and experience
3+ years of People Leadership experience that includes direct reports, coaching, and handling performance review cycle
Proven experience in leading technical staff with multiple disciplines in the DevOps Stack (servers, network, monitoring, elastic runtimes, load balancing, services, application)
5 years of experience in developing or operating IT solutions
2 years of experience in designing, delivering, and operating cloud and microservice solutions through the complete DevOps stack
Proven systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive
Travel or hospitality industry experience
Experience in designing, operating, and deploying IT solutions into AWS, Azure, or a similar public cloud provider