Site Reliability Engineer Resume Samples

The Guide To Resume Tailoring

Guide the recruiter to the conclusion that you are the best candidate for the site reliability engineer job. It’s actually very simple. Tailor your resume by picking relevant responsibilities from the examples below and then add your accomplishments. This way, you can position yourself in the best way to get hired.

Craft your perfect resume by picking job responsibilities written by professional recruiters

Pick from the thousands of curated job responsibilities used by the leading companies

Tailor your resume & cover letter with wording that best fits each job you apply for

Resume Builder

Create a Resume in Minutes with Professional Resume Templates

CHOOSE THE BEST TEMPLATE - Choose from 15 Leading Templates. No need to think about design details.

USE PRE-WRITTEN BULLET POINTS - Select from thousands of pre-written bullet points.

SAVE YOUR DOCUMENTS IN PDF FILES - Instantly download in PDF format or share a custom link.

Create a Resume in Minutes

Mireille Mitchell

72011 Bauch Isle

Philadelphia

+1 (555) 247 3500


Experience

Site Reliability Engineer

Mueller, Hauck and Frami

Chicago, IL

Assist in the Development Priority List process, working with the Product Management group to address issues identified as part of Problem Management
Provide solutions for performance management, disaster recovery, monitoring and access management
Work with and support business users to understand issues, develop root cause analyses and work with the team to develop enhancements and fixes
Work with the team to develop, maintain, and communicate current development schedules, timelines and development status
Provide engineering design across different workloads including incident & problem management, change management, security and compliance
Improve security and performance of infrastructure by working with other teams
Work with and lead other members of the team in staying on top of key industry innovation and technology, and assist in team development growth

Senior Site Reliability Engineer

Ryan, Ziemann and Lesch

Philadelphia, PA

Keeping the ship sailing! Monitoring and supporting the IT infrastructure environment
Monitoring and diagnosis of systems for optimal performance
Generating well defined and documented standard processes for the enterprise
Queuing and data-pipeline solutions (RabbitMQ, ZeroMQ, pub/sub, SQS)
Scripting (Bash/Python)
Identifying, gathering, analyzing and automating responses to key performance metrics, logs, and alerts
Engineering solutions in the long term to make everyone’s life easier

Principal Site Reliability Engineer

Rogahn-Hickle

Chicago, IL

present

Provide architectural and practical guidance to software development to improve resiliency, efficiency, performance, and costs
Monitor and report on service level objectives for a given application's services. Work with business and product owners to establish key performance indicators
Capacity planning and management – create, use, and maintain a capacity model for on-prem and AWS hosting, based on E2E user flow profiles
Work with product operations team to resolve trouble tickets, developing and running scripts, and troubleshooting services in a hosted environment
Working knowledge of virtualized environments; VM management and provisioning
Provide technical insight on development projects
Assist with testing and validating production applications

Education

Bachelor’s Degree in Computer Science

Iowa State University


Skills

2+ years testing and supporting a highly scalable, highly available online service
2+ years supporting a highly scalable, highly available online service
Excited about continually reducing complexity, and creating systems that are easily understandable, repeatable, and observable
Strong TCP/IP understanding and ability to produce detailed documentation
Strong demonstrable scripting knowledge of PowerShell
Ability to define and document technical architecture of complex and highly scalable products
Fundamental networking knowledge of TCP/IP, ARP, iptables, routing and working knowledge of routers, switches, firewalls/VPNs
Basic networking knowledge, including TCP/UDP, how ARP and the routing table work, and some understanding of higher-level protocols like HTTP and DNS
Strong personal and professional initiative with a focus on the success of the team and organization
Good knowledge of the Windows operating system


Site Reliability Engineer Resume Examples & Samples

Manage the scalability and efficiency of the Livestream platform
Respond to and resolve platform problems
Solve tasks in a generic way that can be automated (so no task is ever done by hand twice)
Participate in capacity planning and forecasting, system performance analysis, and system tuning
Review and influence ongoing design architecture of the Livestream platform
BS degree in Computer Science or related field. In lieu of degree, relevant skills or equivalent experience
Familiarity with at least two of the following languages: Python, Perl, PHP, JavaScript, Erlang, Scala, Ruby
Ability to write scripts using Shell, awk, sed
MS degree in Computer Science or related field
Expert knowledge of Linux kernel
Understanding of video streaming protocols

Site Reliability Engineer Resume Examples & Samples

Engages with the Account Team to ensure Critical Application Services client expectations are being fulfilled
Respond to support requests and co-ordinate Customer support teams where appropriate
Attend and participate in all customer service review meetings
Identify opportunities for growth and advancement of the Service offering
Requires 3-5 years of hands-on Linux exposure within the full LAMP stack
Bachelor's degree in computer science or engineering related field or equivalent work experience preferred
Familiarity with virtualization technologies
Experience with common scripting languages (Bash, Perl, Python, Ruby)
Experience working with mission-critical applications written using Ruby, Rails and Python Ex: Oracle Web Commerce or AEM
Experience with Java is preferred
Ability to diagnose and fix complex hardware, software, and network issues
Familiarity with MySQL and Oracle database administration concepts and performance tuning
Understanding of performance monitoring software such as Cacti, SNMP, Munin, sysstat, and Nimbus
1-2 years knowledge of network concepts such as the TCP/IP stack, load balancing, firewalls, iRules, and network routing
Knowledge of RAID technology in regards to data integrity and IO performance
Knowledge of email and DNS fundamentals
Experience with performance tuning and diagnosing system bottlenecks through root cause analysis
Experience working in large scale distributed environments
Ability to work independently and operate as a member of a team
RHCE certification is strongly preferred, RHCA a plus
Excellent problem solving abilities, coupled with a desire to take on responsibilities
Must be detail-oriented and a self-starter

Site Reliability Engineer Resume Examples & Samples

Support web based applications with systems administration, configuration, troubleshooting and monitoring
Evaluate Linux systems and make recommendations to improve security, scalability, performance and availability
Production application monitoring and support
Work with multiple product development teams distributed globally and address their needs in a timely manner
Passion to learn and explore new tools and technologies that improve the current process
Qualifications/Requirements
3+ years of professional experience in systems, server hardware, virtualization and networking
Proven ability to troubleshoot system and network issues (specifically the Great Firewall of China)
Experience with and working knowledge of cloud services, namely AWS (Amazon Web Services) and Microsoft Azure, is a big plus
Experience in a 24x7 high availability production environment
Experience with a programming/scripting language (Shell, Perl or other)
Strong knowledge of Linux (CentOS/Ubuntu) and the LAMP stack
Good knowledge of the Windows operating system
Good knowledge of virtual networking, VPN and CDN (content delivery network)
Fundamental networking knowledge of TCP/IP, ARP, iptables, routing and working knowledge of routers, switches, firewalls/VPNs

Senior Site Reliability Engineer Resume Examples & Samples

Web application performance baselining, analysis, tuning, capacity planning and demand forecasting
Assist with development and implementation of DevOps SRE solutions for large scale distributed web applications across multiple tiers and data centers
Work closely with development and QA teams on new and ongoing technology projects related to performance, high availability and scalability including load-based dynamic provisioning and de-provisioning of systems, etc
Perform proactive daily system monitoring including reviewing system and application logs as well as responding to, triaging, troubleshooting and remediating incidents
Coordinate with our enterprise operations team to communicate with impacted stakeholders and clients, escalating where appropriate
Review entire environment and execute initiatives to reduce failures, defects and improve overall performance
Design, develop and execute automated tests to validate solutions and environments
Document current and future configuration, processes and policies
Availability for On-call after-hours support
Up to 10% travel may be required
5+ years experience working with large scale distributed web applications across multiple environments
Software Development background in .NET or J2EE stack
Strong working knowledge of Windows and/or Linux operating systems, their underlying components, system statistics, performance tuning, file systems and I/O
Experience in one or more of the following languages: Powershell, Bash, Python or Perl
Solid understanding of operational principles in capacity planning, monitoring and incident handling
Must have understanding of building and managing large-scale systems and application architectures
Experience with VMware, ESX, etc
Experience with syslog-ng, ELMAH, Splunk, ELK or similar log monitoring and analytics solutions
Experience supporting Windows, IIS7.x and .NET 3.5/4.x applications in a production environment
Experience in scalability, optimization and performance analysis
Experience with SNMP/MIB and REST
Experience building performance test environments a plus
Ability to work in a collaborative team oriented Agile/SCRUM environment

Site Reliability Engineer Resume Examples & Samples

Build monitoring and automation tools
Automate system capacity, uptime and other system related reports
Develop dashboards for our network and infrastructure
Gain expert level knowledge of our applications and services
Perform deep-dive analysis on application issues
Participate in a weekly on-call rotation
4+ years of experience as a Site Reliability or Operations Engineer
Experience with high-volume, low-latency systems
Experience supporting a variety of applications on Linux operating systems (CentOS is a plus; LXC is a strong plus)
Proficiency with load balancing solutions, both open-source and commercial
Self-starter with a can do attitude

Site Reliability Engineer Resume Examples & Samples

Experience operating and supporting large-scale Internet hosted applications
Hands-on system administration experience with Linux-based systems (CentOS, RHEL), storage systems (SAN and NAS), load balancers and virtualized environments (JVM, VMware, vSphere, Amazon AWS)
Experience with custom tool development, research of tools and deployment of solutions in support of internet hosted applications and environments
Familiarity with hosted application service provider environments, including remote administration of servers and devices
Excellent written and verbal communication skills, demonstrating the ability to effectively convey technical information to both technical and non-technical audiences
Excellent information management practices, such as thorough documentation, usage of wikis, blogs and other collaboration tools
Familiarity with service delivery and project management principles
Experience supporting large-scale SaaS based applications and databases
Familiarity with agile software development processes including software builds and source code control
Experience with administration of NoSQL technology is a plus (e.g. MongoDB, Cassandra, Hadoop/Hbase, etc.)

Site Reliability Engineer Resume Examples & Samples

Analyze data to deeply understand operational characteristics of our systems in production
Develop and maintain tools as appropriate and champion them through the organization in order to improve productivity
Responsible for technical troubleshooting of high-priority escalations, getting to the root cause and helping drive the permanent resolution
Work across functional teams to ensure improvements to system/unit functional specifications are pro-actively picked up
Proactive communication skills and effectively communicates difficult messages consistent with management direction
Effectively resolves conflict
Excellent knowledge and experience of Linux systems administration (preferably Red Hat/CentOS)
Designing configuration management and automation tools (Puppet/Chef/Func)
Working in large deployment environments
Good understanding and experience with scripting (shell, perl, ruby, python)
Experience with internal tools (spacewalk, Graphite, rsyslog, logstash, elasticsearch etc.)
Designing and configuring patch management systems
Good understanding of networking, including TCP and DNS
Excellent problem solving and analytical skills; experience of troubleshooting using packet captures and root cause analysis
Ability to build on and use open-source tools/projects
Thorough knowledge of the HTTP protocol
Experience in real-time traffic processing (firewalls, port forwarding, etc.)
Prior employment within a web hosting, internet service provider or Software as a Service company
Experience with Openstack or EC2
Experience of supporting Java application servers
Experience with Agile Scrum development environments
Understanding of software development methodologies and some background in Java
Experience of Virtualization technologies
Experience of configuring automated monitoring systems
Understanding of HTTP proxy servers is highly desirable

Site Reliability Engineer Resume Examples & Samples

Review and influence new and evolving design, architecture, standards, and methods for operating services and systems
Identify gaps in current tooling and areas for improvement, work towards delivering the tools and, where required, engage with other Scrum teams to ensure delivery of improvements
Identify business needs and, if needed, evaluate new technologies and their relevance to meeting those needs
Assist in the Development Priority List process, working with the Product Management group to address issues identified as part of Problem Management
Diagnose bottlenecks for the full stack and provide recommendations to overcome the bottlenecks as an interim workaround, while a long-term solution is investigated
Ensure all monitoring requirements are met and carry out periodic reviews of checks currently in place to ensure service meets or exceeds customer expectations
Proactively review and recommend changes to the live infrastructure after ensuring the right validation has been carried out
If needed prepare and deploy Support releases
Updating and maintaining Compatibility matrix, providing guidelines to Scrum Teams on Test Scenarios to be included in testing
Small feature implementation for new projects
Perform periodic on-call duty as part of a global team
Strong working knowledge of networking, packet tracing, understanding latency and throughput
Strong working knowledge of
Java or C/C++ development experience including solid scripting skills in Ruby, Perl or Python

Site Reliability Engineer Resume Examples & Samples

Experience operating and supporting large-scale Internet hosted applications
Hands-on system administration experience in Microsoft Windows (Server 2008/2012), Microsoft SQL Server (2008/2012)
Strong demonstrable scripting knowledge of PowerShell
Experience with Group Policies, Active Directory and DNS
Familiarity with storage systems (SAN and NAS), load balancers and virtualized environments (VMware, vSphere, Amazon AWS)
Experience with ESX and vSphere
Experience with Automation and Configuration management tools
Strong experience with monitoring solutions such as Splunk, Nagios, etc
Ability to prioritize tasks and work independently

Site Reliability Engineer Resume Examples & Samples

3+ years in various Operation roles, including experience administering Linux systems in a production environment
Solid understanding of system performance and monitoring
Experience with the VMware IaaS/CIM layer (vSphere, vCenter, vCloud Director, vROps and Hyperic)
Experience in ITIL Best Practices including problem, incident and change management
Experience in one or more of the following languages: Java, Shell, Python, or Ruby
BS or MS degree in Computer Science, or a related field
Good working knowledge of build automation and continuous integration/delivery processes and tools: Git, Gerrit, Maven/Gradle, Jenkins, Docker, Nexus, Artifactory, Selenium
Experience with at least one of the VMware vRealize Suite of products: vRealize Operations, vRealize Automation, vRealize Business
Experience with enterprise monitoring solutions: vRealize Operations, vRealize Hyperic, vRealize Log Insight, Nagios

Senior Site Reliability Engineer Resume Examples & Samples

Experience deploying large-scale Internet hosted applications including deployment automation concepts
Hands on system administration experience in Linux-based platforms, storage systems (SAN and NAS), load balancers and virtualized environments (VMware, Amazon AWS)
Demonstrable technology experience with administration of Mongo Database. Familiarity with basic Oracle concepts is a plus
Experience with networking concepts, protocols and technologies
Strong experience with designing, deploying and maintaining monitoring solutions such as Splunk, Nagios, Cacti, etc
Experience with one or more development or scripting languages suited for system administration and automation, such as Ruby, Python, Perl, PHP, Java/Javascript, Shell
Excellent written and verbal communication skills, demonstrating the ability to effectively convey technical information to both technical and non-technical audiences
BA/BS or higher in a technical field
Experience supporting large-scale SaaS based applications and databases
Experience with networking technologies such as TCP/IP, DHCP, TFTP, VLAN, QoS and VoIP
Experience creating and maintaining automated server deployment scripts using tools such as Chef or Puppet

HBO Site Reliability Engineer Lead Resume Examples & Samples

Troubleshoot issues across the entire stack: hardware, software, application and network. Physical hardware and cloud-based environments
Drive standardization efforts across multiple disciplines and services
Participate in a 24x7 on-call rotation
Ability to effectively communicate with all levels of management and all stakeholders

Senior Site Reliability Engineer Resume Examples & Samples

Leadership role in all Site Uptime projects from problem recognition to prioritization of work, design and implementation of solutions
Focus specifically on all externally visible issues
Focus on our Outage Response and Recovery processes and tooling
Focus on Application Error Monitoring and Reporting
Focus on Deployment and Rollback success and speed
Integrate with existing systems and tools and rip and replace where needed
Identify and address application and system performance bottlenecks in a high throughput production environment
A solid understanding of all server infrastructure technologies with production operations experience
Experience with scale issues and large infrastructures
6+ years of industry experience

Senior Site Reliability Engineer Resume Examples & Samples

Strong working knowledge of networking including packet tracing, packet captures and diagnosis, understanding latency and throughput
Strong working knowledge of Linux and Windows operating systems, their underlying components, system statistics, performance tuning, file systems and I/O
Prior experience managing production systems and assets
Experience in handling traffic spikes, production outages, root cause analysis
Experience with production deployment, monitoring and operational support for Enterprise class hardware
Experience in performance diagnostics, capacity planning, performance architecture design, performance tuning, performance monitoring
6+ years of experience supporting production infrastructure
8+ years of experience supporting production infrastructure
2+ years hands on experience with VMware vSphere v5.x and above

Senior Site Reliability Engineer Resume Examples & Samples

Network and system infrastructures
Scripting languages (Python, Bash, Perl, etc.)
Automation of infrastructure and deployment (Ansible, Chef or Puppet)
Databases (Performance Analysis, High Availability, Scalability, SQL/NoSQL)

Site Reliability Engineer Resume Examples & Samples

Designing, operating and troubleshooting large-scale, highly available distributed systems
Building tools and automation to help with provisioning, monitoring, debugging systems
Programming in JVM languages like Java/Scala, C/C++, Go, scripting in Perl, Python or Ruby
Solid OS internals including concurrency, memory management, file systems, networking, system calls
Network technologies like TCP, UDP, HTTP, DNS, ICMP
Team player, take pride in your craft, get things done

Senior Site Reliability Engineer Resume Examples & Samples

Perform performance analysis, proactive troubleshooting, continual improvement and capacity planning of production and pre-production, virtualized environments
Review entire environment and execute initiatives to reduce failures and defects and improve overall performance
Leverage automation framework to improve processes, automate deployment, and improve manageability of environment
Design, develop and execute automated tests to validate solutions and environments
Design and implement next generation internal Cloud platforms
Expand and evolve Production cloud Infrastructure in a scalable, highly available, and distributed fashion
Perform requirements gathering and analysis to determine appropriate solutions
Conceptualize, develop, and evaluate different architectures based on user and application requirements
Design, implementation and maintenance of automation and configuration management for all levels of Production cloud Infrastructure
Refine and enhance processes that allow for proper testing and validation of releases that go from development to staging to production environments
Participate in programs to deploy pre-GA VMware products in production and provide direct feedback for product improvement
Be able to keep a cool head under “fire” and take part in a shared weekly SME rotation
Effectively interact with other R&D & IT organizations in order to accomplish shared projects successfully
Work with other experienced Systems Administrators on maintenance of environment
Provide feedback and recommendations for design and implementation of solutions
Prepare detailed build / test plans to implement new technologies/configurations
Strictly follow the change management process and the incident management process when working with the production environment
Perform tasks that are often unstructured and address issues that are less defined requiring new perspectives and creative approaches
Work on other technical projects as required

Site Reliability Engineer Resume Examples & Samples

At least 6 years of professional experience
2+ years supporting a highly scalable, highly available online service
Active Directory Experience
IIS Experience
Experience with at least one of the following languages: C/C++, C# or Java
PowerShell Experience
Experience as part of a 24x7 oncall escalation path
Live Site / Customer Support Experience
Experience managing certificates SSL / TLS
Virtualization Experience
Azure Experience
Excellent problem-solving and debugging skills with a solid understanding of testing practices
Excellent cross group collaboration skills
Familiarity and passion for agile/lean development and execution methods, including Scrum and Kanban
Experience developing services using the DevOps model
Can do / positive attitude

Senior Site Reliability Engineer Resume Examples & Samples

Experience managing Linux Systems and advanced troubleshooting techniques
Experience with VMware enterprise products such as ESXi, Virtual Center and vCloud Director
Experience operating networking layer 2 and 3 routers and switches
Strong working knowledge of Linux operating systems, their underlying components, system statistics, performance tuning, file systems and I/O
Experience with production deployment, monitoring and operational support for Enterprise class applications
Experience with load balancers and firewalls
2+ years hands on experience with VMware vSphere v4.x and above
Knowledge of data storage protocols including CIFS, FC, FCoE, iSCSI, and NFS
Experience with iptables, tcpdump/Wireshark, etc
Experience with Perl/Python/PowerCLI a plus
Knowledge of Windows operating systems a plus

Site Reliability Engineer Resume Examples & Samples

Automating application configurations with Chef/Scalr
Performance tuning (HP performance Center, Open Source)
On-boarding new technologies / Integrating into automation
Platform SLA monitoring and enforcement
Cost Analysis across multiple clouds (Amazon Web Services, Google Compute Engine, Microsoft Azure, OpenStack)
4+ years experience in Linux systems administration
Experience deploying to AWS or other clouds
Experience with auto-scaling and the architecture of stateless applications
Experience with Chef or other configuration management tools

Site Reliability Engineer Resume Examples & Samples

3+ years production environment experience
Demonstrated tool building capability
Grace under fire and willingness to help troubleshoot to keep our services up and running, in a 24x7 oncall rotation
Positive attitude, and a self-directed work ethic

Site Reliability Engineer Resume Examples & Samples

Work closely with developers in supporting new features and services
Troubleshoot site issues
Develop custom tools as necessary
Document system design and procedures
Participate in light on-call rotation

Site Reliability Engineer Resume Examples & Samples

Work in the Operations team to design and maintain bulletproof systems, with security baked in from the start
Model threats, uncover and fix vulnerabilities
Automate security testing of production systems
Integrate security information into our monitoring and metrics systems
Respond to security incidents
Advise other teams on secure design of their systems

New Grad Site Reliability Engineer Resume Examples & Samples

Perform performance analysis, proactive troubleshooting, continual improvement and capacity planning for production, virtualized environment
Develop tools that support daily operations, including monitoring tools and the CMDB
Perform deployment functions to ensure releases follow proper implementation lifecycle
Refine and enhance processes that allow for proper testing and validation of releases that go from development to staging to production environments
Work with experienced System Administrators on maintenance of environment
Perform troubleshooting analysis and implement fixes to ensure availability SLAs are met
Programming experience using one or more structured languages (Python/Java)
Ability to use scripting languages to automate tasks and gather data
Understanding of development process and intermediate knowledge of product development
Familiar with CMDB discovery methods
Excellent oral and written communication skills; including documentation
Ability to perform well in a dynamic environment, with on-time delivery
Ability to follow and adhere to policies, procedures and standards relating to Systems management. May recommend process improvements
Require limited supervision and direction; drive results and set priorities independently
Able to work a 24x7 on-call rotation schedule
Hands on experience with VMware’s vCloud Suite or products
VMware Certification, VCP, VCAP
Knowledge of database technologies, MSSQL, PSQL
Experience coding to product APIs
Knowledge of data storage protocols including CIFS, FC and NFS
Knowledge of TCP/IP networking, DNS, LDAP, SMTP, Linux Account Management
Knowledge of Red Hat, CentOS, Ubuntu, and Windows operating systems
Hands on operational experience in a high-volume or critical production service environment
IP networking, including familiarity with the functionality, operation, and failure modes of networks
Proven technical troubleshooting and performance tuning experience

Site Reliability Engineer Resume Examples & Samples

Create and improve tools to aid in monitoring and control of our systems
Enhance our infrastructure to support a variety of different disaster recovery options
Build out a new data center in Dublin
Providing general Unix server administration and troubleshooting for an environment covering over 2000 hosts, both physical and Xen/VMware virtualized
Building new server images, deploying and migrating production systems, and tuning configurations to improve application performance
Converting one-off and stand-alone systems to use our standard templates and management framework
Troubleshoot and Resolve technical issues that affect the production environment
Improve security and performance of infrastructure by working with other teams
Performing all other related duties as assigned by manager
Experience managing an internal Linux distribution in a large scale environment
Experience working with Unix automation tools, such as Puppet, Chef, and/or Capistrano
Knowledge of web servers (Apache/Nginx)
Familiarity with load balancing tools and techniques
Experience working in virtualized environments, such as Xen, VMware, VirtualBox
Familiarity with IP and Ethernet networks as well as transport protocols such as Email, FTP, and HTTP
Scripting ability a plus, particularly Bash, PHP, Python or Ruby

Site Reliability Engineer Resume Examples & Samples

Engineering Rigor
Quality and standards
EDUCATION and/or EXPERIENCE
Desirable Experience on the following areas

HBO Site Reliability Engineer Resume Examples & Samples

Identify and drive opportunities to improve automation for the company
Manage timely resolution of all critical and/or complex problems meeting SLA requirements
Develop, configure and optimize service and application monitoring and telemetry

Senior Site Reliability Engineer Resume Examples & Samples

Hands-on system administration experience in Microsoft Windows (Server 2003/2008, IIS Manager), Microsoft SQL Server (2005/2008), storage systems (SAN and NAS), load balancers and virtualized environments (JVM, VMware, vSphere, Amazon AWS)
Demonstrable technology experience with administration of Microsoft SQL Server 2008 databases
Strong experience with designing, deploying and maintaining monitoring solutions such as Splunk, Nagios, Zabbix, Cacti, SCOM, etc
Familiarity with Linux based operating systems
Experience creating and maintaining automated server deployment scripts using tools such as Salt, Chef or Puppet
Base knowledge of the W3C’s Web Content Accessibility Guidelines v2.0

Site Reliability Engineer Resume Examples & Samples

Perform deep dives into both systemic and latent reliability issues; partner with software and systems engineers across the organization to produce and roll out fixes
Identify and drive opportunities to improve automation for the company; scope and create automation for deployment, management and visibility of our services
Participate in code reviews for projects primarily written in Java and Scala, built on open source libraries such as Finagle, and running on both physical and virtualized platforms
Represent the SRE organization in design reviews and operational readiness exercises for new and existing services
Practical experience in Java or Scala

Site Reliability Engineer Resume Examples & Samples

Work in engineering team to design, build, and maintain systems
Write scripts to monitor and automate processes
Take part in a 24x7 on-call rotation
Participate in code reviews for projects written in Scala built on open source libraries such as Finagle
2+ years of industry experience as a software engineer
3+ years of experience in Internet scale Unix environments
Demonstrable knowledge of TCP/IP, HTTP, web application security, and experience in multi-tier web application architectures
Hands-on experience in building event driven backend systems on JVM with Java or Scala
Ability to lead technical teams through designs and implementations across an organization
Practical knowledge of shell scripting and at least one scripting language (Python, Ruby, Perl)
Experience with existing open source projects such as Mesos, Hadoop, Scribe, Zookeeper, etc

Senior Site Reliability Engineer Resume Examples & Samples

Understand how the different core infrastructure systems come together to enable provisioning engineering at Twitter, and help keep all other infrastructure running
Interface with customers and partners in engineering to gather feedback and requirements on the overall objectives of CISS, aligned with the mission
Ensure reliability of the existing core infrastructure systems, to guarantee 99.99% uptime while maintaining SLAs to guarantee low latencies across the systems
Provide project and technical leadership to the team, keeping in mind the above responsibilities
Practical experience in C / C++ / Python / Ruby / Java / Scala
Ability to lead technical teams through design and implementation across an organization
Experience with existing open source projects such as Scribe, ZooKeeper, and Apache Mesos

Site Reliability Engineer Resume Examples & Samples

Build tools to monitor and automate operational processes
Understand the application stack and perform troubleshooting across the whole stack
Work closely with engineering and operation groups to design, build and maintain systems, application and infrastructure
Work in a team-oriented environment
Take part in a 24x7 on-call rotation
Currently pursuing a Bachelor’s, Master’s, or PhD in Computer Science or equivalent field
Extensive system administration coursework
Knowledge of Internet fundamentals (e.g. RFCs, DNS architecture, best practices, security, etc.)
Knowledge of TCP/IP, HTTP, security, storage and databases
Interest in distributed systems and high availability architecture
Programming skills in one or more of: C, Java, Python, Ruby
Internship or other experience administering large-scale UNIX installations
Practical knowledge and experience with Linux/Unix/BSD; demonstrated involvement in systems operations
Previous success in a performance-critical environment is a plus

Site Reliability Engineer Resume Examples & Samples

Work in engineering team to design, build, and maintain web and RPC services
Develop automation to improve reliability and ease of deployment
Hands-on experience with event driven backend systems on JVM with Java or Scala
B.S. in computer science or equivalent

Site Reliability Engineer Resume Examples & Samples

Build automation and tooling in Python
Perform deep dives into partner teams' infrastructure
Advanced practical knowledge of Python or Ruby
Experience with configuration management tools such as Puppet, Chef, or Ansible

Site Reliability Engineer Resume Examples & Samples

Infrastructure Automation using Chef
Multi Cloud API integration with Scalr
Work in a world where code is configuration
Analyze Cost of applications across multiple clouds / Work with Dev teams to right size applications based on those analyses
Compare costs across all Clouds, this includes both public and private clouds
Manage AWS Reserved Instance program / Work with finance team for best optimization
On-boarding new technologies
20% - Chef Development
20% - Scalr integration
10% - New Technology Evaluation
40% - Cost Optimization
10% - Documentation

Site Reliability Engineer Resume Examples & Samples

Champion site security, reliability and robustness while balancing developer requirements
Be effective in a distributed team through strong communication
Improve and maintain processes for systems infrastructure and app deployments
Be responsible for uptime, performance, and quality of service and participate in on-call schedule as first line of defence
2+ years of experience as a Site Reliability Engineer required
Passion for running great operations

Site Reliability Engineer Resume Examples & Samples

Manage site security, reliability & robustness while balancing developer requirements
Improve and maintain processes for their systems infrastructure & app deployments
Responsibility for the uptime, performance and quality of service
2+ years experience specialising in site reliability
Linux OS experience
Python, Django, PHP
Docker, LXC
Experience with monitoring & alerting systems
Previous experience in a start-up environment would be an advantage

Site Reliability Engineer Resume Examples & Samples

Combine software development, networking and systems engineering expertise to help engineering teams build and run Nordstrom.com
Help us define, transition to, and grow a Site Reliability Engineering model
Work closely with engineering teams with a demonstrated commitment to their success
Represent Nordstrom in the technical community, which can include contributing to open source
Demonstrate a passion and commitment towards advancing the use of public and hybrid cloud technologies
Participate in 24x7 on-call rotation
Investigates and recommends approaches & systems that meet quality, performance and sustainability criteria
Experience with Amazon Web Services (AWS) and APIs
Experience with Git or other source control
Experience with automation tooling such as Chef, Docker, AWS, etc
Some engineering development experience, preferably in Java, Node.js, or .NET
Understanding of software development methodologies and practices, including agile or lean development, continuous integration and continuous delivery
General awesomeness, positive attitude and passion trump all other requirements

Site Reliability Engineer Resume Examples & Samples

Own, as part of a small cross functional squad, a particular infrastructure problem space at Spotify. For example bare metal provisioning, cloud provisioning, monitoring, networking, storage, or service containerization
Design and document systems, including writing and reviewing code, to automate away problems within your squad’s domain
Undertake measured, methodical troubleshooting of complicated systems under pressure
Partake in an on-call rotation alongside the engineers who build our production backends

Site Reliability Engineer Resume Examples & Samples

Own, as part of a small cross functional squad, a particular infrastructure problem space at Spotify: ensuring Spotify has a secure, reliable, highly available perimeter and service discovery infrastructure
Design, implement and drive internal adoption of infrastructure perimeter and service discovery products
Architect, design and document systems, including writing and reviewing code
Meet availability SLAs for the services your squad owns, contributing directly to Spotify global availability

Site Reliability Engineer NYT Beta Resume Examples & Samples

Proficiency in at least one programming language, and willingness to learn Go (our primary language); experience with Ruby is a plus
Designing pragmatic systems with an eye for performance, reliability and security
Linux environment and fundamentals
Strong understanding of AWS products and services
Designing and/or operating web/mobile stacks at scale

Site Reliability Engineer Resume Examples & Samples

Management and support of consumer websites including infrastructure design and deployment on platforms such as Linux and Windows using Apache, IIS, PHP, .Net, MySQL, and SQL Server
Monitor and ensure regular backups for the consumer websites
Provide guidance and assistance to the third party development teams
Develop and maintain effective working relationships with peers, teams, and other internal or external people critical to business functionality
Management and troubleshooting of video encoding and streaming platforms including Roku
Develop scripts, utilities, and web applications to support daily business unit processes
Responsible for the resolution of incident services calls and requests providing 24/7 support
Assist with level 2 and 3 support coordinating and assisting other IT departments to resolve issues
Create process or troubleshooting documentation in the support knowledge base
Minimum 5 years technical experience
Undergraduate degree in computer science or related field or equivalent work experience
Experienced in translating business requirements into technical specifications and in developing prototypes and in launching pilot tests
Must have strong interpersonal and organization skills and be able to communicate clearly with all levels of the business including personnel, technical management, customers and/or external vendors
Strong work ethic and be able to work in a team environment and work independently
Excellent problem-solving and analytical skills including the ability to quickly identify, analyze, and resolve issues and system failures
Ability to analyze project requirements, draft project plans, milestones and delivery schedules
In-depth knowledge and experience with video encoding and streaming of live and on demand content (Expressions, Silverlight, Smooth Streaming, HLS)
Experience in Perl, PowerShell, Linux Bash, Visual Basic, and batch file scripting
Experience with .Net Framework, VB.NET, C#, XML, SOAP, SQL
Experience with HTML, ASP.NET, PHP, JavaScript, Apache and IIS Web Server
In depth knowledge of current technology
Linux and Windows server platforms
Database Fundamentals (SQL Server, MySQL)
Network Fundamentals (Network Topology, LAN/WAN, TCP/IP)
Virtualized Environments (VMware)
Windows and Apple desktop platforms

Senior Site Reliability Engineer Resume Examples & Samples

Contribute to the architecting and implementation of scalable architecture used to serve billions of video views per month
24x7 Second Tier On-Call Rotation
Mastery in debugging Linux/Unix applications (4+ yrs), networking, and relevant systems
Proficiency in two or more development languages (Bash, Clojure, Go, Java, JavaScript, Python, Ruby, etc)
Experience implementing large-scale monitoring and alerting infrastructure
2+ Years Troubleshooting and automation of AWS or comparable systems
Experience with modern infrastructure concerns such as: CDNs, Containers, GeoDNS, Service Discovery, etc
Keen understanding and appreciation for security concerns
Global-scale performance monitoring and improvement
Continuous delivery for large engineering efforts
Any distributed database or systems experience
Large-scale container deployments

Site Reliability Engineer Resume Examples & Samples

Engender reliability and availability starting with metrics and measurements
Enable scaling by providing tools, developing training or augmenting processes
2+ years in an operations role, or closely related position
Relocation assistance available for qualified applicants
Displays knowledge of, and ability to apply, process design and

Principal Site Reliability Engineer Resume Examples & Samples

Secure the system from issues, be they real, perceived or notional
Experience with configuration management tools such as Ansible, CFEngine, Chef and Puppet
Utilizes time management and project management skills to lead the
10+ years in a software development role, operations role, or closely related position
Bachelor's Degree in Computer Science or a related field, or relevant work experience
Experience with distributed version control like Git or Mercurial
Experience with enterprise monitoring solutions like AppDynamics, Graphite, Nagios, and Splunk
Familiarity with continuous integration/deployment processes and tools such as Artifactory, Gerrit, Git, Jenkins, Maven and Nexus

Senior Site Reliability Engineer Resume Examples & Samples

Lead technical activities running proof of concepts and user stories from conception to delivery
Proven design skills across database/storage or networks
Hunger to get involved in every part of our system; from the earliest stage of product architecture, design and development to deployment, troubleshooting, and performance analysis – to ensure a reliable quality product
Strong collaboration and communication skills with ability to influence across an organization from Architects, Developers to Managers
Perform RedHat Linux, WebSphere, DB2: configuration, installs, automation, recovery, backup, monitoring
Participate in periodic on-call rotation

Site Reliability Engineer Resume Examples & Samples

Manage and maintain all production systems at Vevo
Contribute to the improvement of Vevo's infrastructure tooling and monitoring
Work closely with all engineering teams to maintain and deliver large-scale applications
24x7 First Tier On-Call Rotation
Proficiency in debugging Linux/Unix applications (2+ yrs), networking, and relevant systems
Proficiency in one or more development languages (Bash, Clojure, Go, Java, JavaScript, Python, Ruby, etc)
1+ Years Troubleshooting and automation of AWS or comparable systems
Good understanding of IP Networking
Experience with key technologies used as building blocks for modern applications: DNS, HTTP, Load Balancing, etc
Understanding and excitement for modern security concerns
Experience working with continuous delivery pipelines
Docker or other containerized deployment methodologies

Site Reliability Engineer Resume Examples & Samples

Maintaining and supporting all areas of our platform's architecture, ensuring that everything we do is super reliable and super fast, all the time
Enabling Lyst’s engineering team to be agile and to make feature changes every day
Supporting a continuous integration environment and production deployment run entirely in Docker
An internal PaaS implementation
Custom Chef configuration
An entirely AWS-based environment, using multiple AWS services alongside EC2
Creating a development platform, tools and pipeline that is effective and easy to iterate with
The huge challenges that come from working with a website with over 2 million monthly users, 9 million products and a spider architecture performing 3.8 million updates a day
Fixing the interesting problems we face in the best way possible; we are not constricted to tool sets and languages. If you find a solution to a problem that will work better, we use your idea. Best idea wins
Building your own brand and skills. Lyst is a company that will encourage and support you to get involved in the wider community. Events like FOSDEM, PyCon, JS London, AWS re:Invent are regular occurrences on our calendar
Creating better solutions for everything we do, from our culture to the code. This is your company, own it
The monitoring of performance metrics such as response and error rates on the Lyst platform
Proposing and implementing performance improvements, anything from one-line fixes to rearchitecting a component into a microservice
Capacity planning for new features, and ensuring they’re written with scalability in mind
Evaluating supporting services for new features: would PostgreSQL perform better than Cassandra? Is a custom solution a better fit than using Elasticsearch?

Site Reliability Engineer Resume Examples & Samples

Vulnerability Management and Threat Intelligence
Influence feature design, architecture, standards & processes to ensure Security
Conduct advanced network security forensics
Assessment and recommendation of Web Application Security
Proven experience as a team player working with devops groups to continuously improve security posture
Working knowledge of industry standard tools and systems related to penetration testing and forensics
Able to articulate and visually present attack and mitigation strategies and concepts
7+ years’ experience providing security insight and solutions in large scale environments
Strong analytical and troubleshooting capabilities

Site Reliability Engineer Resume Examples & Samples

Works with other members of cross-functional teams, joint ventures, third party vendors and Company's Product Managers and Marketing teams to deliver quality products, in a timely fashion, that meet defined requirements. Establishes and maintains working relationships within NE&TO, Product Development teams, joint ventures, vendors and contractors
Ensures that projects are properly accepted into the engineering team, worked on in a timely and efficient manner and smoothly transitioned into Quality Assurance and Operations teams
Participates in the review of failures and provides feedback to
Strong understanding of REST web services
Experience with the twelve-factor web app methodology
Experience with Cloud based systems, deployment methods and technologies
Experience with scripting language(s) (Go, Python, Perl or others)

Site Reliability Engineer Resume Examples & Samples

Participate in on-call rotation duties
Experience managing complex projects, with significant bottom‐line impact
7+ years’ experience in large scale internet service design & implementation

Site Reliability Engineer / E-commerce Resume Examples & Samples

Build tooling to allow the self-service of AWS offerings by engineering teams in a secure, reliable, friction free manner
Develop foundational capabilities to feed near real-time business and technical metrics into a modernized Live-Site Operations Center
Future participation in 24x7 on-call rotation
Designs systems, services and components that meet required levels of quality and performance sustainability
Experience with automation languages like Ruby, PowerShell or Unix, etc

Senior Site Reliability Engineer Resume Examples & Samples

Influence feature design, architecture, standards & processes to ensure Security, Performance, Operability & Scale
Experience with Anycast, Global Load Balancing, and CDNs
Ability to manage multiple priorities, commitments & projects
Demonstrated passion for customer experience & usability, including successful delivery of customer self‐service tools & automated management/optimization of services

Site Reliability Engineer Resume Examples & Samples

Create solutions to improve performance, scalability, and reliability
3+ years of hands on systems administration experience on Linux or Windows platforms
Scripting skills in at least two languages (Bash, Python, Ruby, Perl, PowerShell, etc.)
Experience with configuration management tools, such as Puppet
Working knowledge of NoSQL data stores (MongoDB, Cassandra, Couchbase, etc.)

Senior Site Reliability Engineer Resume Examples & Samples

Maintain, build, implement and design telemetry systems and dashboards
Troubleshoot complex issues and teach others how to use toolsets
Ability to automate tasks using scripting or other programming language
Demonstrated expertise in web services, virtualization, cloud concepts, REST, JSON, YAML, XML, SQL, PHP, LDAP, & object oriented methodologies

Site Reliability Engineer Resume Examples & Samples

Participate in technical proof of concepts from conception to delivery
Be hungry to get involved in every part of our system — from the earliest stage of product architecture, design and development to deployment, troubleshooting, and performance analysis – to ensure a reliable quality product in production
Be able to collaborate and communicate clearly on status and progress
Design and build tools to manage a rapidly growing number of servers and services
Perform general OS, Web/Application server, database configuration, installs, automation,

Site Reliability Engineer Resume Examples & Samples

Work closely with product engineering teams to help design systems for performance, fault tolerance, and scalability
Develop the tools and training needed for product engineers to assume operational responsibility for their own software
Monitor and audit production application stacks for opportunities to improve performance and capacity utilization
Troubleshoot, isolate and fix production issues along with product engineers and help prevent them from happening again
Operating and debugging production systems
Designing and implementing infrastructure, deployment, monitoring, and logging tools
Linux environment and fundamentals
Designing web and mobile stacks at scale

Senior Site Reliability Engineer Resume Examples & Samples

Root-cause complex problems involving multiple parties, networks, hardware and software that relate to scaling and performance
Bachelor's Degree or Equivalent
Engineering, Computer Science
Generally requires 7-11 years related experience
Programming experience in one or more of the following languages: Go, Java, Python, Ruby, Shell
Experience with distributed version control systems like Git or Mercurial
Experience with IaaS and PaaS providers such as AWS, OpenStack, Heroku, and CloudFoundry
Experience with enterprise monitoring solutions like InfluxDB, AppDynamics, Graphite, Grafana, Nagios, and Splunk

Senior Site Reliability Engineer Resume Examples & Samples

Hands on system administration experience in Linux-based platforms, storage systems, load balancers and virtualized environments (Amazon AWS)
Demonstrable technology experience with administration of relational (MySQL) and NoSQL databases
Strong experience with designing, deploying and maintaining monitoring solutions such as Splunk, New Relic, Datadog, etc
The following requirements are not strictly required, but are preferred
Experience with TCP, CIDR, RFC 1918, sub-netting, DNS, BGP
Experience creating and maintaining automated server deployment scripts using tools such as Chef, Salt, or Puppet

Site Reliability Engineer Resume Examples & Samples

Coordinate ongoing projects between Development & Service Engineering
Drive release readiness from Operational and Infrastructure perspective
Participate in Design reviews and clearly document operational requirements
Eliminate duplicate efforts by bringing teams together and ensuring recycling of good process, technology and tooling
Build requirements for tools, develop efficient DevOps processes, & contribute to our knowledge base
Document requirements to optimize monitoring & self-healing capabilities
Experience managing complex projects with significant bottom‐line impact
Proven experience with bug tracking, reporting & project planning tools
3+ years’ experience in large scale internet service operations and service delivery

Principal Site Reliability Engineer Resume Examples & Samples

Familiarity with container technologies such as Docker, LXC, Mesos and Kubernetes
10+ years in a software development role, operations role, or closely related position
Programming experience in two or more of the following languages: Go, Java, Python, Ruby, Shell
Experience with distributed version control such as Git or Mercurial
Experience with enterprise monitoring solutions such as InfluxDB, AppDynamics, Graphite, Nagios, New Relic and Splunk

Principal Site Reliability Engineer Resume Examples & Samples

Actively engage early in the service lifecycle interfacing with software delivery teams to influence service readiness
Own serviceability, reliability and quality, measure service health and influence service KPIs
Work across disciplines to design secure and resilient service architectures across platform services and standardize topologies
Deliver and contribute to site availability, reliability and sustainability
This job includes an expectation of on call availability split between the US and EMEA teams
Partner with other engineering teams to resolve issues and defects
Work with a global footprint of services, teams, and engineers
Be willing to travel internationally a couple of times per year
A minimum of 10 years’ experience in supporting the service design, development, or managing 24 x 7 enterprise systems
BS/BA in Computer Science or related field, or equivalent work experience
Solid working knowledge of cloud computing (Azure and/or AWS)
A solid understanding of telemetry and service monitoring tools and techniques
Preferred 5+ years of coding/testing experience in a high level language
Understanding of Microsoft products, services, and platforms (i.e. SQL Server, ASP.Net, SharePoint, Azure, System Center)
Experience with OSS and 3rd Party development and platforms e.g. Java, Ruby, Python, Linux, iOS
Deep passion for shipping a high quality product that customers love

Site Reliability Engineer Resume Examples & Samples

Obsess over collecting and digesting metrics
Collaborates with project stakeholders to identify product and
Researches, writes and edits documentation and technical requirements,
6+ years in a software development role, operations role, or closely related position
Experience with enterprise monitoring solutions like InfluxDB, AppDynamics, Graphite, Racon, Grafana, Nagios, and Splunk

CIB Investor Services Site Reliability Engineer Resume Examples & Samples

Design software to improve availability, scalability and efficiency of the financing platform
Develop tools and applications to automate and support applications
Troubleshoot production issues
Participate in on-call rotation

Site Reliability Engineer Resume Examples & Samples

Solve problems and automate responses for recurrent issues
Gathering and refining requirements from stakeholders
Working with developers in other component teams to ensure consistent integration of services across teams

Site Reliability Engineer Resume Examples & Samples

Serve as primary point responsible for operations, maintenance and performance of network applications in high-volume production environment
Assist in the roll-out and deployment of new features to facilitate fast growth
Automation of build, release and operational activities
Develop and/or customize tools to rapidly deploy applications in production environment
Work closely with development teams to ensure that applications are designed with “operability” in mind
Specifying and designing non-functional components of software
5-12 years’ experience in managing Network Management applications and tools
Experience of networking and infrastructure at scale
Use and support of Infrastructure-as-a-Service
Expert troubleshooting and debugging skills
Strong scripting and automation skills using Python, Perl, JavaScript
SevOne
HP Network Automation
Splunk
ThousandEyes

Site Reliability Engineer Resume Examples & Samples

3+ years in various DevOps/SRE roles
Understanding of containers and container orchestration
Demonstrate skills in priority setting, analysis, communication, time management, scheduling, and multitasking
Experience with infrastructure configuration and automation processes and tools: Puppet, Ansible, Chef, Fabric, Terraform

Site Reliability Engineer Resume Examples & Samples

Leadership role in all projects from problem recognition to prioritization of work, design and implementation of solutions
Focus on application networking processes and automation
Work in and contribute to a shared codebase
Rapidly debug, fix and solve problems
A solid understanding of infrastructure service technologies with production operations experience
Experience with performance analysis, scale issues and large infrastructures
Experience with workflow automation

Senior Site Reliability Engineer Resume Examples & Samples

In-depth experience with Azure services like SQL Azure, Compute, IaaS, PaaS, WAAD, VNET, & Express Route
Expert in troubleshooting Live Site issues across the applications
Expertise in web services, virtualization, cloud concepts, REST, JSON, XML, SQL, LDAP, & object oriented methodologies is desired
Engage with Partners and customers effectively to mitigate/resolve/preempt issues/outages

Senior Site Reliability Engineer Resume Examples & Samples

Drive efficiencies in process: implement and enforce process for change management, emergency response and capacity planning
Solve problems relating to mission-critical services and build automation to prevent problem recurrence, with the goal of automating response to all non-exceptional service conditions
Participate in an on-call rotation and be available for escalations
A strong system and software engineering background
A solid understanding of system availability, latency and performance
Experience with enterprise messaging systems like MQ, Kafka and RabbitMQ running on Linux and Solaris
Knowledge of large-scale distributed systems in practice, including multi-tier architectures, application security, monitoring and storage systems
Strong programming skills in Java, Python or C++ and the ability to learn new languages as needed

Site Reliability Engineer Resume Examples & Samples

Hands on system administration experience in Linux-based platforms, storage systems, load balancers and virtualized environments (VMware, Amazon AWS)
Demonstrable technology experience with administration of MongoDB. Familiarity with basic database concepts is a plus
Knowledge of ITIL process framework

Courts Site Reliability Engineer Resume Examples & Samples

Work to improve the reliability and performance of front-end and back-end IT services in close collaboration with the architecture team and developers with a focus on automation, availability, performance, and efficiency
The successful candidate will be a key contributor, working closely with members of our newly formed agile teams in a fast-paced environment
The ideal candidate is energetic, innovative, and has superior technical, analytical, and problem solving skills
Writing scripts to monitor and automate processes
Troubleshooting issues across the entire stack - hardware, software, application and network
Taking part in 24x7 on-call rotation as needed
Participating in code reviews
Bachelor’s (or equivalent combination of education and experience)
Master’s preferred
Experience working with Agile Teams is preferred
Software engineering experience in Internet scale environments
Experience troubleshooting performance, availability, scalability and reliability issues in production with enterprise-grade applications and driving architectural improvements to address these issues
Knowledge of TCP/IP, HTTP, web application security, and multi-tier web application architectures
Hands-on experience developing event-driven back-end systems
Knowledge of shell scripting and at least one scripting language (Python, Ruby, Perl)
Experience with open source projects (Mesos, Hadoop, Scribe, Zookeeper, etc.)
Previous experience as a primary contributor to an architecture team in systems design, building, testing, and implementation
Assesses current or future customer needs and priorities through communicating directly with customers, conducting surveys, or other methods
Excellent written and verbal communication, interpersonal skills, and team skills
Ability to listen to the needs of our internal clients, communicate clearly, and determine ideal configurations for new development teams and projects

Site Reliability Engineer Resume Examples & Samples

5+ years of experience in Service Engineering/Engineering roles; 5+ years of coding/testing experience
Leadership skills: Sound problem resolution, judgment, negotiating and decision making skills
Develop test plans/cases, conditions and scenarios in support of ongoing applications and infrastructure
Familiar with SCVMM and operational knowledge of deployment and challenges
Demonstrated solid working knowledge of cloud computing (Azure/AAD)
Evaluate current services and drive performance, availability and supportability improvements
Constantly focus on how to enable the business and increase agility
Support a 24x7 live site support model for the services the team owns

Site Reliability Engineer Resume Examples & Samples

Process customer service requests and meet established SLAs
Identify and automate manual workarounds and process improvements
Utilize Rails console to resolve customer account issues
Monitor the availability, latency, scalability and efficiency of all services
Understand the loan life cycle, transactional logic, and framework of our finance models
Implement reports and automations for business stakeholders
Assist with major incidents from identification and troubleshooting through to service restoration
Reconcile daily bank files and monitor critical application processes
Troubleshoot issues and discrepancies with the appropriate individuals, teams, or vendors
Identify, document, and recommend solutions for bugs, processes, and feature enhancements
Update knowledge base articles with current information and communicate to your teammates
Perform periodic on-call duty as part of the SRE team
Experience with Ruby, Perl, Python, Java, or PHP
Capable of learning new technologies and concepts quickly, and continuously building upon that knowledge
Can balance and prioritize multiple projects/tasks and remain calm under pressure
Problem-solving skills: if you don’t know how to do something, you consider it a challenge to figure it out on your own
Ability to write accurate and efficient SQL queries
Strong understanding of IT infrastructure (Linux or systems administration, network technologies, relational databases, web technologies, etc.)

Site Reliability Engineer Resume Examples & Samples

Maintain and support the product and data systems: proactively monitor events, investigate issues, analyze solutions, and drive problems through to resolution
Use a wide variety of operational tools and monitoring platforms to gain in-depth knowledge, understanding, and ongoing monitoring of system availability, performance, and capacity
Define requirements and assist with development of customized tools and reporting as needed by projects and operations
Work with business partners to establish Service Level Indicators and Objectives (SLIs and SLOs)
Rationalize deployment strategies to facilitate continuous delivery
Implement alerting strategy that makes alerts actionable and unique
Operate within ITIL Problem, Incident and Change Management practices
Provide follow-through to ensure issues are resolved to satisfaction
Create and improve standard operating procedures and documentation
Drive continuous improvement and innovation within the team

Site Reliability Engineer Resume Examples & Samples

Dedicated member of an agile software/devops team
Deployment, support and maintenance of development software stacks, overseeing build frameworks
Manage and maintain enterprise infrastructure tools as the primary subject matter expert
Respond to system issues related to the infrastructure and fulfill service requests
Lead infrastructure deployments in the scrum
On-Call support for Pre-prod and Production environments

Site Reliability Engineer Resume Examples & Samples

Improve the reliability and efficiency of Twitter’s traffic management systems
Build and maintain high-performance, fault-tolerant, and scalable distributed systems in the context of Twitter's service-oriented architecture
Troubleshoot complex distributed systems problems and develop solutions that have a significant impact at Twitter’s scale
Write performant, maintainable, clear, and concise code and accompanying documentation

Site Reliability Engineer Resume Examples & Samples

App Services - our core services handling users, tweets and more
Storage infrastructure - our next-generation distributed cache and storage systems
Core Infrastructure System - our internal core infrastructure services (provision engineering stack, DNS, Puppet, LDAP, Subversion, Kerberos, etc.)
Database Engineering - our relational stores like MySQL, PostgreSQL and Vertica
Engineering Effectiveness - our tools and services related to build, test and deployment systems
Hadoop/Data Platform - our Hadoop clusters, data management services and the surrounding ecosystem (YARN, Scalding, Parquet, HBase)
M&A - help our acquired companies manage their infrastructure
Mesos/Aurora - our compute platforms that everything else at Twitter runs on top of
Platform - our API/frontend services
Traffic Engineering - our traffic management systems
You will perform deep dives into both systemic and latent reliability issues; partner with software and systems engineers across the organization to produce and roll out fixes
You will troubleshoot issues across the entire stack: hardware, software, application and network
You will drive standardization efforts across multiple disciplines and services in conjunction with embedded SREs throughout the organization
You will mentor SREs on standard methodology for everything from monitoring to troubleshooting complex code issues
You will participate in code reviews for projects primarily written in Java and Scala, built on open source libraries such as Finagle, and running on both physical and virtualized platforms
You will represent the SRE organization in design reviews and operational readiness exercises for new and existing services
Experience with existing open source projects such as Scribe, ZooKeeper and Apache Mesos

Site Reliability Engineer Resume Examples & Samples

Work in engineering team to design, build, and maintain Hadoop clusters and data services
Participate in and build tools to
2+ years of managing services in a distributed, internet-scale *nix environment
Familiarity with systems management tools (Puppet, Chef, Capistrano, etc)
Demonstrable knowledge of Linux operating system internals, filesystems, disk/storage technologies and storage protocols and networking stack
Hands-on operational experience on managing JVM services
BS or MS degree in Computer Science or Engineering, or equivalent experience
Understanding of Hadoop, YARN,
Understanding of Scalding, Parquet

Site Reliability Engineer Resume Examples & Samples

You will perform deep dives into both systemic and latent reliability issues; partner with software and systems engineers across the organization to produce and roll out fixes
You will build relationships with ISPs and other industry partners
Troubleshooting tools (e.g., tcpdump, netstat, iostat, traceroute)
Experience with iptables or other firewall solutions
Ability to work with engineering teams and minimal hand-holding
You will identify and drive opportunities to improve automation for the company; scope and create automation for deployment, management and visibility of our services
You will participate in an on-call rotation with other engineers to support your services
Practical experience in Postfix, Exim or Msys
Strong contacts at ISPs and track record of excellent deliverability

Site Reliability Engineer Resume Examples & Samples

Build automation and tooling in Python and assist teams in their software (e.g. Ruby, Python, Perl)
Perform deep dives into partner teams' infrastructure of varying types, sizes, and quality
Drive standardization efforts across multiple disciplines, systems, software, and practices
Functional knowledge of bootstrapping technologies: PXE or cloud-init
Experience with configuration management tools: Puppet, Chef, or Ansible

Site Reliability Engineer Resume Examples & Samples

Understand how the different core infrastructure systems come together to enable provisioning engineering at Twitter, and help keep all of the infrastructure running
Meet with customers and partners in engineering to gather feedback, iterate on requirements, and align with the team and company mission and objectives
Conceptualize, architect and develop systems and features for enabling core infrastructure engineering effectiveness: ease of use and ease of maintenance of these systems
Development of solutions to enable automated workflows that bind together with Twitter's core engineering principles, and enable all engineers to make use of and achieve their objectives as required of the core infrastructure systems
Perform deep dives into both systemic and latent reliability issues; partner with software, systems and security engineers across the organization to produce and roll out fixes
Practical experience in Python and Ruby
Experience with existing open source projects such as Scribe and Apache Mesos
B.S. in Computer Science or related field

Senior Site Reliability Engineer Resume Examples & Samples

5+ years of experience as a SRE or Operations or administration of customer-facing, high-availability, large scale web-based applications
5+ years of PHP, Perl, Python or other scripting language
5 years of experience in Java-based technologies
Mastery in PHP, Perl or Python Programming
Administrative experience installing, configuring, troubleshooting, monitoring, and maintaining Linux infrastructure
Experience in writing SQL and PL/SQL procedures
Experience with a log analysis tool such as Splunk or the ELK products (Elasticsearch, Logstash, Kibana)
Experience with Orchestration Tools like Ansible etc
Experience with monitoring tools like Sensu, Collectd, Grafana etc

Site Reliability Engineer Resume Examples & Samples

You will automate the server provisioning process to reduce the labor of our networking engineering and datacenter operations teams. Once we plug a new server in, it walks itself through all aspects of provisioning to join the fleet without any human involvement
You will build scalable infrastructure to manage metadata for hundreds of billions of files, hundreds of petabytes of user data, and millions of concurrent connections
You will drive the company through “Disaster Recovery Tests”, where we manually turn down pieces of infrastructure to test Dropbox’s overall resiliency to failures
You will design the system and processes that Dropbox engineers use to deploy their software into production
You will build an auto-remediation system to automatically resolve production incidents before passing them to on-call engineers

Site Reliability Engineer Resume Examples & Samples

Conduct performance analysis and monitoring of multiple products/platforms
Identify, plan and implement solutions that continually improve performance
Automation development for infrastructure efficiencies across multiple products/platforms
Automation towards unattended fix actions
Capacity management for multiple products/platforms
Code contributions in languages such as Python, C/C++, Bash, Ruby
Performance tuning for Apache web server, tcp/ip & network stack, system/kernel level tuning, NFS
Alert response for performance and capacity related thresholds
On-call rotations and Incident call handling

Site Reliability Engineer Resume Examples & Samples

Work with engineering teams to design, build, and maintain systems
Write scripts and software layers to monitor and automate processes
Identify and drive opportunities to improve automation for the team (deployment, management and visibility of our services)
Participate in an on-call schedule
Work closely with Adobe operations teams to help develop and optimize solutions
Strong comprehension of continuous integration and continuous deployment methodologies
Deep understanding of both software engineering and technical operations
Experience with programming in Python, Java, Ruby, Scala, Go, or similar programming language
Experience with existing open source projects such as Mesos, Hadoop, Spark, ZooKeeper, Kafka, Cassandra, Docker
Experience with developing frameworks, platforms, APIs
Developing, running, and/or consuming cloud technologies such as AWS, Azure, OpenStack, Google Cloud Platform

Site Reliability Engineer Resume Examples & Samples

BS /MS in Engineering, Computer Science, or equivalent with 5 years of experience
System admin experience
Experience with cloud hosting - AWS, Rackspace, CIS, OpenStack
Experience with a 24x7 support model with on-call rotation
Experience with monitoring with alerting
Good proficiency with script languages such as Python/Shell
Knowledge of distributed computing
Experienced with implementing back-end services in large / “web scale” distributed systems
Knowledge and experience with micro-services design and implementation
Knowledge and experience with “Platform as a Service” environments or other application development platforms
Strong team player, with ability to actively contribute in teams with different skill and experience levels
Knowledge about buffering, stream processing, complex event processing, and storage solutions (e.g., RabbitMQ, Kafka, Mongo, etc.)
Experience with Cluster enabling solutions with services like Apache Mesos, Kubernetes
Service discovery solutions like Consul.io, HAProxy, etc.
Understanding on how to develop within a continuous integration environment leveraging tools such as Jenkins, Hudson, Bamboo
HA/DR/Availability/Capacity
Monitoring setup (CloudWatch/Nagios/New Relic/SPM/Sensu)
Experience building highly scalable solutions
Excellent troubleshooting skills on a busy infrastructure setup

Senior Site Reliability Engineer Resume Examples & Samples

Manage the infrastructure for a cloud service that processes a billion metrics per day, and replicates tens of billions of database writes to our backup service
Design, implement, operate and troubleshoot the automation and monitoring of a service that seamlessly spans several data centers and several cloud providers
Become an expert in MongoDB performance, helping us optimize from the application level all the way through the firmware
Participate in a weekly on-call rotation, and make trips to our data centers as needed
Troubleshoot and resolve issues in multiple environments
Improve our infrastructure capabilities, optimizing for cost, simplicity, and maintainability

Site Reliability Engineer Resume Examples & Samples

2+ years of demonstrated experience managing and maintaining large scale SaaS platforms
Deep experience in at least one infrastructure component (operating systems, compute, storage, networking, data center, distributed systems, big data, cloud, etc.) and solid understanding of the rest and how they impact services
Experience building, configuring, and maintaining operational monitoring and reporting tools
Solid understanding of infrastructure and application performance metrics, including capacity planning
Proven ability to work independently, and strong problem solving skills
U.S. citizen or lawful permanent resident, in order to pass government security clearance requirements, preferred

Site Reliability Engineer Provisioning Resume Examples & Samples

Deliver the infrastructure and configurations that engineering teams need: both by provisioning systems directly, and building out tooling for getting it done faster
Improve and automate our tools for provisioning, monitoring, trending, and configuration management
Explain and document our tools and processes, so that developers can own and self-serve their own operational needs wherever possible
Communicate effectively with SRE teammates and developer “customers”
Advise engineering teams on how to configure systems for high reliability
Participate in periodic on-call rotation as part of a global team maintaining the availability and performance of the New Relic site and APIs

Site Reliability Engineer Resume Examples & Samples

Design, write and deliver software to improve the availability, scalability, latency and efficiency of Mozilla's services
Solve problems relating to mission critical services and build automation to prevent problem recurrence with the goal of automating response to all non-exceptional service conditions
Experience with algorithms, data structures, complexity analysis and software design
Experience in one or more of: Python, Go, JavaScript
Familiarity with running web services at scale, understanding of Unix systems internals and networking
Familiarity with Cloud services like AWS
Familiarity with Linux container engines like Docker
Familiarity with container scheduling systems like Kubernetes, Fleet, and/or Mesos
Understanding of Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way
Networking: knowledge and understanding of network theory, such as different protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing

Site Reliability Engineer Resume Examples & Samples

Keep the customer facing services available at top performance by maintaining constant health of the supporting systems
Incident management - Act in key support roles during major incidents e.g. Sev0, Sev1. Also participate in the technical review of the incident for problem management
Problem Management - participate in RCAs and hand them off to the Global Solutions team
Ensuring that work carried out by the Site Reliability team is executed in such a way as to comply with the company's internal compliance policy and directives
Being available to discuss and resolve technical issues and escalations with other technical staff as required
Work with and lead other members of the team in staying on top of key industry innovation and technology, and assist in team development growth
Identifying work opportunities and preparing or assisting with the preparation of technical proposals as required
Ability to operate in a high-pressure environment, troubleshoot complex issues quickly, and successfully handle multiple priorities
BS/BA Degree in Computer Science or equivalent industry experience (3-5 years in an enterprise-scale internet service engineering or support role)
Expertise in TCP/IP related technologies (networking protocols, network programming, etc.)
Expertise in CLI enterprise support of Unix variants (Linux/Solaris/BSD) as well as strong Linux/UNIX knowledge with significant exposure to Red Hat Enterprise Linux and Solaris
Strong understanding of monitoring implementations and administration
Strong communication skills (written and oral)
Past experience in Incident Management and good understanding of ITIL service operations
Experience in working in a 24/7 team managing large data centres
Exposure to Oracle and high end Storage Infrastructure (Hitachi/EMC Tier 1)
Perl/Python scripting experience
Prior Chef/Puppet or automated deployment experience
Experience in supporting and maintaining a monitoring system
Experience supporting and troubleshooting relational databases and distributed platforms
Experience in supporting and maintaining Java applications

Senior Site Reliability Engineer Resume Examples & Samples

Must take ownership and accountability for one's actions, and be known for meticulous attention to detail
Work in a highly skilled global team in a 24/7 environment
Develop debugging tools to assist engineers in diagnosing production service problems
May be required to define requirements for, and/or write/test custom tools to handle system automation tasks (installation, configuration, monitoring, etc.)
BS/BA Degree in Computer Science or equivalent industry experience
Mandatory knowledge of RHEL/CentOS 6.x or above; RHCSA certification preferred
Have scripted using Shell/Perl/Python to automate repeatable tasks such as deployment and tuning of services
Solid working knowledge of Unix systems internals, including file systems, kernel modules, packages, and networking
Knowledge of clustering/HA
Exposure to automation frameworks such as Puppet/Ansible/Rundeck
Understanding/management of code repositories such as Perforce/SVN/CVS/Git
Experience designing, deploying and maintaining bare metal provisioning and multiple server installation (Kickstart), monitoring resources, etc.
Understanding of network security (e.g., firewalls, IDS, IPS)
Knowledge of application virtualization and operating-system-level virtualization
Hardware training and/or certification in systems and storage management (Red Hat Linux, Sun Microsystems, Hitachi Storage Solutions, EMC, Dell, Solaris, etc.) desired but not mandatory
Solid networking experience - TCP/IP, administration of networking hardware (Cisco, Foundry, etc.), load balancing - Considered a PLUS
MS or PhD in Comp Sci, Mathematics, Machine Learning, Artificial Intelligence or similar
Salesforce force.com Development (Apex and Visualforce)
Visual/Interface Design skills
Familiarity with open source and 3rd party monitoring systems (Nagios, Kafka, SMARTS, etc.)
IRC Bot development
Agile development experience/understanding
Network administration background

Site Reliability Engineer Resume Examples & Samples

Partner with fellow engineers to architect and build mission critical software and systems that can stand the test of scale and availability, while limiting operational overhead
Drive efficiencies in systems and processes: capacity planning, configuration management, performance tuning, monitoring and root cause analysis
Participate in an on-call rotation and be available for escalations

Site Reliability Engineer, Senior Resume Examples & Samples

7 years of experience with leading activities in an Operations Center maintaining overall command and control
Experience with software and systems engineering
Experience with problem management and change control
Knowledge of Linux, Java, Oracle, Networking, VMware, Apache, IT Process Management, including ITSM and ITIL, and Microsoft Office
Ability to provide incident management, initial triage of events, and categorization and prioritization
Ability to direct troubleshooting activities and make technical and executive escalations
Ability to monitor several areas, including the Data Center Network, Infrastructure Bandwidth, Server and Infrastructure, applications within Marketplace environments, Agent-based Monitoring, Production Batch Processing, and Events and Alerts Escalation
Ability to perform advanced troubleshooting and escalation and drive service restoration activities
Knowledge of Remedy, Outlook e-mail, New Relic, Mixpanel, Oracle Enterprise Manager (OEM) Grid Control Management, Data Services Hub (DSH) Tool, or Tivoli
Ability to function at a high level in critical situations
Ability to ensure proper shift coverage and delegation of responsibilities
Possession of excellent critical thinking, problem-solving, troubleshooting, and process-oriented thinking skills

Site Reliability Engineer Resume Examples & Samples

Master's degree in Computer Science, Engineering, Information Technology, or related field with 1 year of experience, or Bachelor's degree in Computer Science, Engineering, Information Technology or related field with 2 years of experience
Academic training or experience must include some exposure to
Experience in security, dev-ops or Linux sysadmin role, preferably in a fast-paced web application environment
Experience in web or network security
Experience writing security policies, educating teams on security best practices, performing penetration tests, and analyzing a code base
1 year of experience programming with two of the following languages: Ruby, Bash, Java, Python, C, C++, and Perl
6 months of experience implementing information security best practices such as SAS70, SSAE16, SOC or ISO

Senior Site Reliability Engineer Resume Examples & Samples

Develop and deliver configuration and deployment automation required for improving the functionality, availability, and manageability of our microservices using Python or Ruby and configuration automation tools such as Puppet, Chef, or Ansible
Build infrastructure and application monitoring by gathering application and system metrics and implement tools for recoveries
Troubleshoot availability/performance problems and build software-based solutions to prevent recurrences
Define and evangelize cloud-related optimizations and best practices to improve reliability and performance

Senior Site Reliability Engineer Resume Examples & Samples

You will have a maniacal focus on site uptime
Engineer application infrastructure that is reliable, efficient, and maintainable
Partner closely with software engineering teams using a strong devops mindset
Act as a subject matter expert for troubleshooting and resolving complex, multi-tier web problems that span a number of different platforms
Automate production operations processes
Automate continuous integration and deployment processes
Deliver infrastructure requests for software projects on time
Define, measure, and meet key operational metrics including site performance, incidents and chronic problems, site traffic and conversion, etc
Participate in a 24x7x365 support rotation for a website that never sleeps
5+ years’ experience building and supporting large-scale, business critical systems
Public Cloud experience (AWS/ Google Cloud)
Expert knowledge of at least one web application platform: WebSphere, JBoss, Tomcat, Apache, NGINX, Varnish, Endeca
Expert knowledge of Application Performance Monitoring tools: Dynatrace, Splunk, Gomez, Coradiant, and Tealeaf
Experience with continuous integration platforms such as Jenkins
Experience with infrastructure configuration management tools such as Puppet and Chef
Mastery of at least one scripting language such as Python, Perl, Ruby, or Shell
Unix/Linux power user
Strong interpersonal skills, written and verbal communication
Strong decision-making, problem-solving skills, critical thinking, and testing skills
Ability to self-manage assigned tasks and projects
Ability to work independently with minimal direction
The knowledge, skills and abilities typically acquired through the completion of a bachelor's degree program or equivalent degree in a field of study related to the job

Site Reliability Engineer Resume Examples & Samples

Development of a reliable set of micro REST APIs to help create agile and robust infrastructure management and reporting workflows
Improve and build upon our existing automation tools for systems provisioning and management
Independently learn new technologies and master the New Relic infrastructure so that you can provide 'full stack' diagnostics, when necessary, to help to figure out the root cause of internal problems
Communicate effectively with fellow SREs and other engineering teams, and describe problems succinctly with sufficient detail that you can hand-off an ongoing problem to another team or a peer for completion
Strategize with fellow SREs and other engineering teams on complex problems, and make decisions and recommendations about systems improvements after analyzing possible courses of action
Perform periodic on-call duty as part of a global team maintaining the availability and performance of the platform that enables all of New Relic

Site Reliability Engineer Resume Examples & Samples

Min 2+ years of experience in a Cloud administration role
Min 3+ years of experience in IT
Must be familiar and comfortable with continuous integration
Ability to work well with various team members from developers to business people
Self-starter with the ability to work solo on projects and get results
Proactive and strong ability to learn new things with little guidance
QlikView/Qlik Sense experience a plus
Ability to work on computer for extended periods
Must be willing to be on-call for rotation and system emergencies

Site Reliability Engineer Resume Examples & Samples

Proactively monitor availability and performance of the Ariba Cloud using key tools
Effectively respond to Monitoring alerts, incident tickets, email requests coming in to Site Reliability Engineering
Perform application and web site troubleshooting to quickly resolve the issues per documented procedures
Ensure user tickets and monitoring alerts are handled per defined SLAs for response time, updates and closure
Escalate issues as needed to back line production operations teams or Engineering per documented procedures
Handling communication and notification on major site issues to the company and executive management team
Document resolution run books and standard operating procedures
Ensure smooth handoffs between shifts
Experience working in a 24 x 7, large-scale Internet web environment
Prior experience working with Java/J2EE applications

Senior Site Reliability Engineer Resume Examples & Samples

Building and running Global Compute platforms
Operate and deploy cloud services and related projects from development to production
Develop automation, processes, and tools designed to make this process simpler and more robust
Bridge Engineering and core shared operations services
Participate in troubleshooting, capacity planning and analysis, performance analysis activities
Advise management on service onboarding strategies and execution
Mentor team members on areas of subject-matter expertise

Site Reliability Engineer Lead Resume Examples & Samples

5+ years of experience with Linux and UNIX administration
5+ years of experience with scripting languages, including PHP, Python, Java, Node, or Ruby
2+ years of experience with maintaining high availability production systems in a Cloud environment
2+ years of experience with automation and configuration management using either Puppet, Chef, or an equivalent
2+ years of experience with continuous integration tools, including Jenkins or TravisCI
Experience with LXC and containerization, including Docker
Experience with continuous monitoring tools, including Sensu, Nagios, and Splunk
Experience with working in and maintaining PaaS environments, including Kubernetes, Mesosphere, Flynn, or Deis

Senior Site Reliability Engineer Resume Examples & Samples

Such as New Relic, database query optimization, webpage optimization
Fine tuning of web servers, use of caching layers, etc.
Optimize buffer pools, Java settings, propose & implement CDNs
Proactively Identify, troubleshoot, and resolve and/or propose solutions across ALL environments
Own capacity planning and infrastructure upgrades for computer platforms
Collaborate with Software Engineering and Product teams
Work the L2 queue, create & implement fixes in production and back-port to code-base
In creating pull-request for said code fixes, changes & improvements
Integrating infrastructure changes to optimize for speed and accuracy
Develop systems diagram aka solution architecture for special projects
Implement the said solutions across the various environments
Document the proposed solution via visual aids such as MS Visio, Lucidchart, draw.io, etc.
Create & evangelize high level plan, timelines and cost estimate for the proposed solution
Build out the solution bringing the paper solutions and/or pipelines to reality
Collaborate tightly, working in short sprint cycles with Product, Development and QA for load testing scenarios, execute & provide feedback to peer groups
Propose site reliability measures as a result of such experiments
Define, setup, and manage full stack monitoring and corrective actions
Recommend and champion software improvements to support elastic scalability
Continuous Improvements to minimize manual Ops while improving availability
Support corporate and business critical security certifications
Understand business domain quickly and be responsive to peer team needs
Document, train, and mentor peer team members as required
10+ years of progressive experience with Software Development Engineering - .Net & Java
Strong knowledge authoring, troubleshooting & optimization of SQL
Strong knowledge in Visual Studio, TFS, TeamCity and related MS Development concept & tools
Strong knowledge in Java and related development concept & tools
Experience with Continuous Integration/Deployment with tools such as Jenkins, Bamboo etc
Willingness to learn & work with
Configuration Management tools such as SaltStack (in use; previous Chef/Puppet experience is good)
Big Data solutions such as Apache Kafka (in-use), Apache Cassandra (in-use), Apache Spark (in-dev)
Distributed computing concepts such as stream processing, fault tolerance, job management, map-reduce, etc.
Android/iOS Applications - our Point-of-Sale system runs as an application on these two platforms
Solid working knowledge of VLANs, Routing, NAT/PAT, TCP/IP, and TCPDUMP
Experience dealing with firewalls, packet filters, proxy servers, traceroute, ping etc
Strong Infrastructure experience in: Cloud (AWS/Azure, etc.), VMware/Xen, etc.
Expert Systems engineering experience in: Linux and Microsoft Windows Servers
Expert in web ware such as: Apache/NGINX/IIS, DNS, NTP setup and management
Bachelor’s Degree or greater in a technical discipline, preferably Computer Science
Ability to work and make reasonably sound business decisions under minimal supervision
Strong verbal, written, and presentation skills
Support for on-call & off-hour/weekend activities relating to production support

106

Site Reliability Engineer Resume Examples & Samples

Ensure high availability with adequate monitoring and instrumentation
Experience automating Linux system provisioning
Experience with logging infrastructure and tools such as Logstash, Elasticsearch, Kibana, Splunk, and HDFS
Maintain and extend operational processes to ensure high reliability of our entire technology stack
Implement best practices to manage utilization, optimization, and monitoring of our cloud services
Handle deployments, automation, and instrumentation and conduct analysis to improve DevOps
Configuration management experience with Ansible, Puppet, Chef, or Salt is a nice to have
Participate in daytime on-call rotation to support site and production issues
Manage our computing and storage infrastructure if needed on private cloud
Experience with DevOps tools, processes, and culture
Advanced Linux SysAdmin experience
2 years minimum experience in SRE, DevOps or Production Operations role
Hands on experience scripting with Perl, Python or Ruby
Technical generalist knowledge of System Administration, Databases, Network Operations

107

Senior Princ Site Reliability Engineer Resume Examples & Samples

Work with multiple development teams to deploy code, maintain and enhance infrastructure
Develop and maintain auto-scaling monitoring and performance analysis tools in a large-scale dynamic environment
Assist with capacity planning
Communicate regularly with globally dispersed teams
Operational Automation and CI/CD experience (RunDeck, Puppet, Jenkins, Chef, Ansible, GIT)
Expertise in UNIX / LINUX / Solaris / Windows Operating Systems
Development experience with Java, Python, Ruby, Bash, Groovy, C#, C++ languages, .NET framework
NoSQL Database experience (Cassandra, Riak, MongoDB, Redis)
Clear understanding of software development lifecycle
Experience with Oracle Database query tools, including the ability to write complex SQL queries, and Oracle DB monitoring tools
Solid Networking experience; CCNA certification preferred
Proven track record of delivery and commitment to customers
Be able to isolate defects effectively and provide specific conditions for reproducibility

108

Site Reliability Engineer Resume Examples & Samples

Design, implement, and support access management solution
Provide timely and concise status to customers and team members for engagements
Engage in agile transformation of products, solutions, team, and culture
5+ years of experience in the IT industry (system administration, automation, development, etc.)
Ability to initiate, design, execute, and complete projects independently with minimal direction
Must possess excellent communication skills (written, verbal) and be able to work effectively with technical and non-technical individuals alike
Ability to solve problems independently
Development & support

109

Site Reliability Engineer Resume Examples & Samples

Use complex algorithms to develop systems & applications that deliver business functions or architectural components
Develop system architectures that improve designs & map form to function
Educate team members about integrating systems collaboratively & efficiently
Assess how the competition differs from Verizon’s current state and update Verizon operations
Bachelor’s degree in Computer Science or Information Technology
4+ years of experience in supporting Java and J2EE applications deployed in multi-platform environment
Experience and/or knowledge of CA Wily Introscope, Siteminder and other monitoring tools
Experience and/or knowledge of IBM Tealeaf, SPLUNK, Dynatrace and other
Experience and/or knowledge of DevOps – CI/CD tool chain and successful implementation
Good understanding of JVM thread dumps, GC and various system log files
Good understanding of Application Logging, Monitoring and Alerting products
Solid experience with working on UNIX platforms and good UNIX scripting skills
Good knowledge of some scripting language (perl, php, Python)
Good experience with both VM's and high-availability architecture
Good understanding of web server load balancing concepts and experience with highly available distributed environments
Experience working in DMZ environments with good understanding of firewalls, TCP/IP, hardware load-balancing (ideally NetScaler, HST.F5), and multi-tiered architectures
Troubleshooting of JEE based applications (analyze JVM logs, Trace Logs, FFDC, native logs, java core, heap dumps)
Good time management, documentation and communication skills
Enthusiasm and eagerness to learn and embrace new technologies
Teamwork & collaboration skills to work across organizations and lead cross-functional teams
Communication & stakeholder management skills

110

111

Site Reliability Engineer Resume Examples & Samples

5+ years of combined professional software development and system administration experience
A B.S. or M.S. degree in Computer Science, MIS, CIS, or a related field, or equivalent experience
Passion for providing a great customer experience
2+ years’ prior system administration experience with 24x7 SaaS products
The ability to manage SaaS architectures and operations at large scale
Experience with Amazon Web Services (AWS) or similar infrastructure as a service
DevOps and software development experience with large-scale environments
A deep understanding of Ruby, Python, or Java
Experience with shell scripting
Experience with REST APIs and standard Linux command line utilities
Experience with Chef, Puppet, Ansible, or other configuration management software
Experience using a version control system—such as Git or SVN—and code deployment methodologies
Network design and troubleshooting
SQL and NoSQL databases
Jenkins

112

Principal Site Reliability Engineer Resume Examples & Samples

Respond to critical Application Alerts
Research and investigate production application defects and provide solutions for resolution
Provide technical insight on development projects
Perform Root Cause Analysis (RCA) on system incidents
Assist with testing and validating production applications
Development experience with Java, Python, Ruby, C#, C++ languages, .NET framework
Clear understanding of software development lifecycle methodologies
Demonstrated ability to understand complex network architecture (Firewall, Load Balancer, DNS, Routing, Switching)
Understanding of Client/server/database architecture
Able to articulate strong stance for product quality and customer satisfaction, balancing business goals and commitments with quality objectives
Must be able to work effectively in globally distributed team structure
Experience with server and application monitoring tools (Nagios, Splunk, NewRelic, Sensu, Graphite, LogStash, etc)
Operational Automation experience (RunDeck, Puppet, Jenkins)
Experience working with .NET Applications
NoSQL Database experience preferred
Experience with Mercury test suite including QTP, LoadRunner and Quality Center or similar products preferred
Knowledge of Apache product suite particularly Tomcat

113

Site Reliability Engineer Resume Examples & Samples

Evaluate and contribute to product and service design and architecture, helping shape service engineering technical strategies, review specifications, and design and improve upon core tools and processes
Help build a data-driven culture by providing statistical trending and analysis using real service data to increase service health and quality
Work closely with peer engineering teams on defining and implementing improvements to service tooling, monitoring, and reporting to enhance reliability and capability
5+ years in the technology industry with broad engineering experience; 3+ of those in online services
Demonstrated understanding of Microsoft SQL, PowerShell, C#, .NET web applications and software security concepts
Experience developing and publishing applications in Azure
Working knowledge and hands-on experience in developing tools designed to run in complex, large scale online/hybrid services
Previous working experience in a DevOps engineering model a plus
Strong interpersonal, verbal, and written communication skills, with the ability to assemble, document, and present technical information to the team
BA/BS in Computer Science, Mathematics, Electrical/Computer Engineering or related degree preferred or equivalent work experience
Experience in large scale implementation (greater than 40K clients and 1K mobile devices) in one of the following technologies is a plus: System Center Configuration Manager, Altiris, Airwatch, MobileIron, Identity Management, Remote Desktop, VDI, Exchange/Office 365 or SharePoint

114

Site Reliability Engineer Resume Examples & Samples

Create and/or improve the tools that provide insight into availability and performance of our services
Solve problems related to these mission-critical services and build automation to proactively detect and prevent their recurrence while driving down time to resolution
Share in the creation of new designs and architectures for multi-region, multi-datacenter distributed systems
Collaborate with application engineering teams in solving business needs with our services
Partake in the periodic on-call duties for provided services
BS in Computer Science (or equivalent experience) plus 5-7 years of experience including experience with open source technologies, automated configuration, DevOps, or cloud automation development
Experience in one or more languages of Python, Ruby, Java, or Go
Understanding of Software Development Life Cycle, Test Driven Development, Continuous Integration and Continuous Delivery
Knowledge of open source technologies such as Docker, Elasticsearch, Kafka, Redis, Cassandra, Consul, Nginx, HAProxy, and ability to quickly learn new technologies
Configuration management experience with Puppet, Chef, Ansible, CFEngine desirable

115

Site Reliability Engineer Resume Examples & Samples

Develop automation to improve reliability, performance, and deployment speed of core services
Work on problems related to deployment, monitoring, failure handling, and traffic management
Design and implement solutions that improve availability and use of Oracle's Cloud services
Create new designs, architectures, processes, and methods for the implementation, operability, and support of large-scale distributed systems
Be a self-driven and independent thinker, act on ideas and drive them to completion
Define best practices and standardization in Cloud operations
Stay informed and relevant in new technologies
Innovate!
Prior experience in designing, implementing or supporting high performance and large-scale web applications in high scale customer facing environments
Prior software development experience in one or more of: Python, Java, Ruby, Go
Strong communication and analytical skills
Able to accurately estimate efforts and deliver on time
Experience with agile processes and good understanding of software development practices
Expertise with software development eco-system such as Git, Jenkins, Artifactory, and CI/CD practices
Knowledge of system and application security
Understanding of virtualization solutions and Cloud services
Strong knowledge of Linux-based OS internals
Strong networking knowledge: TCP/IP, UDP, ICMP, IP packets, DNS, OSI layers, and load balancing
Experience with configuration management tools
Understanding of the DevOps toolchain components, and how they fit together
Ability to define and document technical architecture of complex and highly scalable products
Ensure the quality of the products being delivered
Eagerness to automate, wherever and whenever the possibility arises. Automation is part of your DNA
Must possess a strong desire to measure application performance and act upon results
Experience with Linux containers (e.g. Docker)
Familiarity with cluster management solutions: Mesos and job schedulers
Expertise with databases and big data stores like MySQL, Memcached, PostgreSQL, and Oracle DB

116

Site Reliability Engineer Resume Examples & Samples

BS in Computer Science or related field, or equivalent employment experience
Strong sense of ownership, customer service, and integrity demonstrated through clear communication
Demonstrated ability to write programs using a high-level programming language like: C, Java, Ruby, Python, or Perl
Proclivity towards efficient programming emphasizing improvement via complexity analysis
Experience managing large numbers of diverse systems with configuration management systems like: Puppet, Chef, Ansible, or Salt
Deep understanding of the Linux Operating System, including Kernel, Memory, Process, Threads, Static / Shared Libraries, IPC, Signals
Understanding of standard networking protocols and components such as: HTTP, DNS, TCP/IP, ICMP, the OSI Model, Subnetting and Load Balancing
Fundamental understanding of distributed systems including: the CAP Theorem, Microservices, and the Twelve Factor App
Passion for eliminating repetitive manual processes using automation

117

Site Reliability Engineer Resume Examples & Samples

5+ years of managing services in a large scale *nix environment
Experience with DevOps tools, processes, and culture. Experience with Puppet, Chef or Ansible is a plus
A systematic, test-and-measure approach to continually improving service operations
Practical knowledge of shell scripting and at least one scripting language (e.g. Perl, Python, Ruby)
Good understanding of Java is a major plus
Working knowledge of Oracle Database is a plus
Understanding of cryptography is a plus

118

Site Reliability Engineer Resume Examples & Samples

Minimum of 3 years of experience supporting internet-facing production services and distributed systems
Extremely organized, detail oriented, and thorough in every undertaking
Able to balance multiple tasks and projects effectively and quickly adapt to new variables
Demonstrated problem-solving ability utilizing creative and innovative thinking but also adhering to
Self-motivated and eager to learn
Professional and open-minded attitude
Able to work closely with other team members as well as work independently

119

Senior Site Reliability Engineer Resume Examples & Samples

Practical knowledge of shell scripting and at least one scripting language (e.g. Perl, Python, PHP)
Strong hands-on knowledge in Unix/Linux environment
Experience with monitoring tools such as Nagios, Splunk highly preferred

120

Site Reliability Engineer Resume Examples & Samples

System administration in a large environment
Good understanding of TCP/IP network protocols
Fluency in one or more of: C, Ruby, Perl, Python
Superior analytical/troubleshooting skills
Understanding of data structures, algorithms, and complexity analysis
Tendency to automate repetitive tasks

121

Site Reliability Engineer Resume Examples & Samples

5+ years supporting database infrastructure in a high-volume customer-facing environment
3+ years of data modeling and database design
Strong experience in optimizing and performance tuning of large RAC clusters
Extensive Oracle Data Guard experience including fast start failover
Proficiency with SQL, shell, Perl, & Python languages
Experience with a NoSQL data store such as Voldemort, MongoDB or Couchbase
Background in building large infrastructure supporting a high volume of transactions in a mission critical environment
Strong communication skills and ability to work effectively across multiple business and technical teams
Ability to thrive in a fast-paced, tight deadline delivery timeline

122

Site Reliability Engineer Resume Examples & Samples

3+ years supporting hosted services in a high-volume customer-facing environment
Demonstrated ability to write programs using a high-level programming language like: C, C++, Objective-C, or Java
Experience with relational databases and No-SQL
Background building distributed, server-based infrastructure supporting a high volume of transactions in a mission critical environment
Demonstrated ability to deliver results on time with high quality

123

Siri Senior Site Reliability Engineer Resume Examples & Samples

Expert knowledge of the Linux operating system (OS, networking, process level)
Understanding of one or more object-oriented programming languages (Java, C++)
Fluent in at least one scripting language (Shell, Python, Ruby, etc.)

124

Site Reliability Engineer Resume Examples & Samples

Typically requires 5+ years of experience in Linux or UNIX systems administration in a large engineering or R&D environment and demonstrated skills in the following
Linux (RHEL/CentOS preferred)
NFS and NAS appliances (NetApp preferred)
Layer 2 / Layer 3 networking (Arista or Cisco preferred)
Scripting in shell, Perl, Python or Ruby
Revision control systems (SVN, git, Perforce)
Centralized configuration management (Puppet, cfengine)
Software/tool compilation and installation
FLEXlm and similar licensing systems
Monitoring systems such as Nagios, Zenoss, Groundwork
LDAP (OpenLDAP, DSEE, OpenDirectory)
IPAM with DNS (BIND) and DHCP
Must be analytical and possess strong organizational/problem-solving skills

125

Site Reliability Engineer Resume Examples & Samples

Experience automating Linux/Mac OS X system provisioning
TCP/IP layer 2 and layer 3 networking
Centralized configuration management, e.g. Puppet, Ansible, Chef, or Salt
Experience with logging infrastructure and tools such as Logstash, Elasticsearch, Kibana, Splunk, HDFS
Independently driven, proactive, accountable, reliable, team player
Additional desirable qualifications
Scripting: JavaScript, Ruby, Python, Perl, Tcl/Expect, Bash
AppleScript/UI automation

126

Site Reliability Engineer Resume Examples & Samples

Using your Linux system administration skills, monitor and manage the reliability of the systems under the responsibility of the Control Room Bridge
Develop and maintain monitoring tools used to support the HPC community within NERSC using programming languages like C, C++, Python, Java or Perl
Provide input in the design of software, workflows and processes that improve the monitoring capability of the group to ensure the high availability of the HPC services provided by NERSC and ESnet
Support in the testing and implementation of new monitoring tools, workflows and new capabilities for providing high availability for the systems in production
Assist in direct hardware support of our data clusters through managing component upgrades and replacements (dimms, hard drives, cards, cables, etc) to ensure the efficient return of nodes to production service
Maintain outage documentation through a trouble ticketing system
Help in investigating and evaluating new technologies and solutions to push the group’s capabilities forward, getting ahead of our users’ needs, convincing staff incentivized to transform, innovate and continually improve
Bachelor’s Degree in a Computer Science or similar discipline with a minimum of 3 years related experience, or the equivalent combination of education and experience
Minimum of 1-2 years of experience as a system administrator or systems engineer supporting critical systems and applications in UNIX or Linux, in a high-volume customer-facing environment managing data clusters, administering the replacement of hardware, and ensuring its continuous availability to the user community
Experience with or have taken classes in programming languages such as C, C++, Perl, Java and Python or a scripting language
Experience in or have taken classes in the areas of TCP/IP related technologies (networking protocols, network programming, e.g. TCP/IP, UDP, ICMP, etc., MAC addresses, IP packets, DNS, OSI layers, and load balancing, etc.)
Ability in or have taken the appropriate classes or certifications in areas of enterprise support of Unix variants (Linux/Solaris/BSD)
Past experience in Incident Management and good understanding of IT service management
Experience in working in a 24/7 team
Exposure to Oracle and high end Storage Infrastructure (Hitachi/EMC Tier 1) or have taken the appropriate classes in these areas
Familiarity in configuring distributed, server-based infrastructure supporting a high volume of transactions in a mission critical environment in a Linux environment
Knowledge of network security: configuring/maintaining ACLs, knowledge of firewalls
Excellent problem-solving skills with the ability to work both independently and collaboratively, contributing to an active intellectual environment
Must demonstrate good judgment and ability to schedule and lead small group projects with systematic problem solving, coupled with a strong sense of ownership and motivation

127

Site Reliability Engineer Resume Examples & Samples

Building, managing and securing Linux systems in an Enterprise environment
Deploying production developer code
Monitoring systems and applications
Creating and managing internal tools
Deep troubleshooting of production issues
Creating and maintaining tools for use within the team
Participating in on-call rotation

128

Site Reliability Engineer Resume Examples & Samples

Continuous improvement. You will make the DevOps motto yours: “There is always a better way to do things”. You will help uncover and implement these better ways
Release management. You will help deploy new products or services into production, being the liaison between Software Development and Hosting, and giving due consideration to security and compliance
Data and knowledge creation. You will help generate information about the system, from service availability statistics, capacity usage monitoring, to feature usage measurements. This will allow the business to make data-driven decisions in all circumstances

129

Senior Site Reliability Engineer Resume Examples & Samples

Up to 50% of time spent on Incident Response with the goal of driving bridge calls to the quickest possible MTTR
Integrates with select Scrum teams to determine infrastructure and environment related impacts and drive operational readiness
Become infrastructure capability Subject Matter Expert and work with Dev teams to build to standards that drive the highest levels of availability
Build and implement recovery tooling capabilities to better respond to common production incidents
Help drive best software practices in code to avoid incidents or be more resilient to environment instability
Inventory and document Application dependencies at all layers of the technology stack. Build tools and automate dependency discovery where possible
Implement and validate new environment requirements before they are needed. This includes certificates, firewall rules, websites, VM’s, containers, load balancer configuration, and all necessary infrastructure to reliably run the target application
Apply specialized knowledge of industry standards or practices to assigned initiatives
Identify complex and/or broad problems and issues and formulate recommendations
Interact with development and infrastructure partners to better understand their objectives and help them understand the technical landscape
8+ years of active engineering and/or architecture experience in a complex environment and/or comparable development experience such as
Large Scale web infrastructure experience, preferably in high transaction volume OLTP sites or the Financial Services industry
Experience supporting a 24/7 site with on-call responsibilities for production support
Broad Technical field exposure preferred, with preference to skills in one or more of the following: Infrastructure, VM, load balancing, containers, JVM’s, web servers, application debugging, queuing technologies, Caching technologies, databases, routing and switching, etc
Bachelor’s Degree in related field preferred; Relevant industry experience can substitute

130

Site Reliability Engineer Resume Examples & Samples

Design and development of a new service or customer interface
Provisioning and configuring the servers and applications behind services
Adding new monitoring and measuring key metrics to better understand a process
On-call troubleshooting or direct customer support
Improving automation and analyzing workflows
Proficiency in provisioning, administering and using modern Unix/Linux operating systems
Strong knowledge of shell scripting and at least one programming language, preferably Python
Practical knowledge of TCP/IP networking and troubleshooting connectivity issues
Excellent verbal and written communication skills, including documentation for peers and end-users
Demonstrated commitment to centralized configuration management and automation, preferably with Puppet
Experience supporting high availability solutions such as load balancing, clustering, or fail-over mechanisms
Database, web server, web applications design and administration experience, including Apache, MySQL, Postgres, Sybase, PHP
Experience using Juniper, PulseSecure, NetApp, Infoblox and other proprietary OS's and devices
Solid understanding of network protocols and ability to use diagnostic tools appropriately
Red Hat training or certification
Demonstrated experience in a production environment supporting enterprise wide essential services like DNS, DHCP, IP address management, LDAP, authentication, firewalls, file transfer, network access control, remote access, proxies and VPNs
Proven ability to provide technical leadership to colleagues and customers
Demonstrated experience in a production environment supporting essential services and 100+ servers
Demonstrated ability to solve problems in creative and effective ways

131

Senior Site Reliability Engineer Resume Examples & Samples

Fluent in systems programming and/or automation, and can leverage their experience to solve complex problems associated with running production environments at massive scale in multi-tenant environments
Implementation of proactive monitoring, alerting, trend analysis and self-healing systems
Participate in on-call rotations, driving restoration and repair of service-impacting issues
Define non-functional requirements as part of the product lifecycle to influence the new designs, standards, and methods for scalable, highly available distributed systems
5+ years of PHP, Perl, or other scripting language
Master's Degree in Computer Science or equivalent
Experience with one of the log analysis tools like Splunk or ELK products (Elasticsearch, Logstash, Kibana)

132

Site Reliability Engineer Resume Examples & Samples

4+ years of experience with Linux
Experience programming with at least one of: Python, Go, Ruby, PHP, or Bash
Experience with configuration management (including tools like Chef, Puppet, Ansible, or others)
Experience managing high-throughput, high-availability systems
Knowledge of networking protocols, including familiarity with TCP/IP, HTTP, SSL, and DNS
Knowledge of monitoring systems and strategies for writing useful alerts

133

Site Reliability Engineer Resume Examples & Samples

Strong systems scripting skills (Python, shell)
Comfortable analyzing and troubleshooting large-scale distributed systems
Fluency with Ubuntu (or other Linux distribution)
Experience with running production systems
Willingness to occasionally rack and wire servers
Experience with continuous integration/deployment with Jenkins (or similar tools)
Experience with SaltStack (or similar) for configuration management
Experience with virtualization, containerization, and system image management (Docker, KVM, LXC)
Experience with building and operating monitoring/alerting infrastructure
Familiarity with distributed storage systems (HDFS, Amazon S3)
Basic understanding of network switch configuration and network troubleshooting process

134

Principal Site Reliability Engineer Resume Examples & Samples

Critical Analysis: Analyzes the current IT architecture to identify weaknesses and opportunities for improvement based on performance, reliability and capacity metrics, existing and planned application architectures, and criticality and efficiency of business processes
Vision and Strategy: Drives development of our next generation global infrastructure strategy, design and roadmap based on deep understanding of data center strategy, private and public cloud “infrastructure as a service” (compute, storage, network, etc.) and “platform as a service” (software-defined networking, auto-scaling, automated provisioning and configuration) capabilities, as well as desktop computing and mobile device management
Expertise: Hands-on technical practitioner and individual contributor who knows from experience which infrastructure architecture patterns, tools, practices and vendors/partners will enable us to scale our various systems effectively and reliably
Solutions: Works collaboratively with enterprise architecture, information security, application & infrastructure teams to produce optimal, high-level, conceptual, and cost-effective designs. Facilitates the evaluation and selection of solutions and product standards. Acts as an arbitrator, as needed, to bring the appropriate parties together to drive discussions to decisions in a timely manner
Delivery: Leverages hands-on expertise to implement proof-of-concept and production-class designs (read: gets hands dirty) and ensures proper monitoring and alerts are in place to govern adherence to standards and requirements
Insights: Work with engineering and support teams to convey how best to improve application reliability and resiliency as input to technology roadmaps. Share real world implementation challenges and recommend new capabilities that would simplify adoption and drive greater value from use of Sephora’s infrastructure services
Minimum 10 years' experience in a medium to large IT organization running a large, complex IT infrastructure
Minimum 5 years' experience in a technical architecture role with demonstrated ability to lead via influence
Expertise in problem solving and analyzing global scale distributed systems (public/private cloud, traditional datacenter infrastructure, multiple operating systems)
Expertise with public cloud IaaS and PaaS providers including AWS, Azure
Expertise with private cloud IaaS and PaaS providers such as OpenStack, OpenShift, CloudFoundry, etc
Expertise working with and evaluating services of Managed Service Providers such as Verizon/Terremark, HCL, Rackspace, IPSoft, LiquidHub, etc
Experience building software and systems to manage infrastructure and applications through automation; i.e. fluency with configuration and automation tools such as Puppet, Chef, Ansible, SCCM
Expertise in compute virtualization such as VMware, MS Hyper-V, KVM, VirtualBox; familiarity with container technologies such as Docker and LXC is a plus
Expertise with monitoring solutions such as Nagios, Cacti, Splunk, Zenoss, AppDynamics
Fluent in Linux (various flavors) and Windows
Comfortable with Java, Perl, Python, Ruby and shell scripting languages including bash and PowerShell
Successful in building relationships with leaders, peers and business partners as well as vendors
Experience developing strategic and tactical plans to meet business objectives
Familiarity with retail and ecommerce technology is a strong plus

135

Bluemix Site Reliability Engineer Resume Examples & Samples

Involvement in every facet of the platform, from the earliest stages of product architecture, design and development to deployment, troubleshooting, and performance analysis, to ensure a reliable, quality product in production
Ability to collaborate and communicate clearly on status and progress
Take initiative to do what must be done in order to keep critical systems operating
Perform general OS, Web/Application server, database configuration, installs, automation
Participate in periodic on-call rotation in a 7X24 environment
Scripting experience in 1 or more of the following languages: Python, Ruby, Perl
Strong Linux software development, networking, and problem-solving skills
Identify best practices and tools across Bluemix DevOps, Infrastructure, CFS Conductors & Network Teams
Proven and relatable troubleshooting and triage skills across systems
German: basic
English: fluent
German: fluent

136

Site Reliability Engineer Resume Examples & Samples

Participate in service capacity planning and demand forecasting, software performance analysis, and system tuning
Identify underlying root causes and provide recommendations or solutions for long-term, permanent fixes to critical production issues
Develop effective documentation, tooling, and alerts to both identify and address reliability risks
Participate in on-call rotation with other members of the site reliability / service engineering team

137

Bluemix Site Reliability Engineer Resume Examples & Samples

Involvement in every facet of the platform
At least 1 year experience in software development or engineering
At least 1 year experience in application operation
At least 1 year experience in problem determination
Some coding skills, such as Representational State Transfer (REST)/web services, distributed systems, messaging, and knowledge of open source tools used in Cloud Foundry development (e.g. Git and Jenkins)
At least 2 years experience in software development or engineering
At least 2 years experience in application operation

138

Senior Site Reliability Engineer Resume Examples & Samples

Experience with GPOs
Live Site / Customer Support experience
Experience automating and improving deployment service
Experience automating and improving monitoring service
Knowledgeable on software security concepts
Test Design and Test Automation experience
Excellent problem-solving and debugging skills with a solid understanding of testing practices. Strong communication skills
Familiarity with and passion for agile/lean development and execution methods, including Scrum and Kanban
Experience developing services using the DevOps model

139

Site Reliability Engineer Resume Examples & Samples

At least 4 years of professional experience
Azure experience
IIS experience
Experience as part of a 24x7 on-call escalation path
Experience managing SSL/TLS certificates
Virtualization experience
Exploratory Testing experience
Performance Testing experience

140

Site Reliability Engineer Resume Examples & Samples

Design, write and deliver software to improve the reliability, scalability, latency, and efficiency of your services
Solve problems relating to mission critical services and create solutions to prevent problem recurrence; with the goal of automating response to all non-exceptional service conditions
Conduct periodic on-call duties using a follow-the-sun model (on an as-needed basis)
Ability to build and drive consensus towards common goals and priorities through advanced impact and influence skills
BS Degree in Computer Science, Electrical & Computer Engineering or Mathematics or equivalent experience
4+ years of experience and outstanding coding skills in C, C++, C#, Java, Python or similar
2+ years of Software Engineering and experience in testing, deploying and supporting large scale services on Azure, AWS or similar environments
Capable of technical deep-dives into networking, service design, operating systems and storage, yet verbally and cognitively agile enough to engage in strategy discussions with leadership team members
Experience in SDLC, distributed systems, networking, hardware, logistics and operations or capacity planning
A passion for building and participating in highly effective teams and development processes
Expertise in problem solving and analyzing global scale distributed systems and critical production service environments
Strong debugging, testing / validation and analytics/SQL skills
Fundamental understanding of TCP/IP, BGP, AnyCast and CDNs
Experience defining and measuring internal/customer facing OLA/SLAs
Statistics experience and bias for measurement and driving action with metrics

141

Site Reliability Engineer Resume Examples & Samples

Configuring, installing, and verifying new gTLDs
Configuring TLD changes, such as pricing changes, hub migrations/consolidations, and enabling premium tiers
Identifying and recommending TLD onboarding process improvements
Embracing and advancing an automation mindset, developing and executing automated tests
Implementing TLD installation process improvements
Defining and recommending key indicators of TLD system health (KPIs)
Developing and maintaining visualization tools to report real-time system health (metrics)
Developing and maintaining alerting mechanisms to report system health problems (monitoring)
Proactively capturing, investigating, and reducing operational failures
Efficiently researching issues using system logs such as Kibana logs, perfmon, eventvwr
Quantifying scope and impact of operational failures to inform prioritization decisions

142

Lead Site Reliability Engineer Resume Examples & Samples

Build production services that host millions of transactions every day and generate billions of dollars in revenue annually
Define and ensure formal Service Level Objectives for production services
Champion tools and automation to drive deployment, monitoring, and management of rapidly changing production systems
Automate, scale, and support complex production services
Implement application performance monitoring to ensure site uptime and performance
Participate in a 24x7x365 support rotation
6+ years of engineering experience working on teams that ensure high availability for very large scale, revenue generating production environments
Experience in Java, Tomcat, Database, Oracle, Kafka, Redis, MQ, JBoss
Scripting and automation experience for deployments using shell, Puppet, Git
Experience in handling production issues and incident management
Experience in performance engineering and tuning of applications
Expert with production monitoring concepts and usage
Demonstrated experience in executing/delivering projects in a dynamic, fast-paced environment
Bachelor’s degree in Computer Science or related field

143

Site Reliability Engineer Resume Examples & Samples

Core services (OpenLDAP, LDAP, SMTP)
5+ years of experience as a UNIX SA/SE (ideally Linux; should know the LAMP stack well)
Python, Bash and/or Ruby scripting
Minimum four years experience as a systems engineer; at least two years with cloud infrastructure technology
Experience building and managing authentication systems such as OpenLDAP, FreeIPA, SAML
Strong UNIX and TCP/IP fundamentals, particularly in cloud environments
Fluent in at least one software development/scripting language, ideally Python but open to Bash and Ruby
Fluent in at least one configuration management tool such as Puppet and/or Chef
Architect and build our core internal services to scale and endure security threats
Own our authentication backend: OpenLDAP + Onelogin + AD
Build our cloud-based and physical systems platforms
Assess our current infrastructure and evolve it to become cohesive and scalable
Script and automate your way out of problems
70/30 split for duties: 70% design and 30% operations. The main focus for this engineer is security and deciding how to secure the systems in terms of authentication and cloud infrastructure

144

Site Reliability Engineer Resume Examples & Samples

Core services (OpenLDAP, LDAP, SMTP)
5+ years of experience as a UNIX SA/SE (ideally Linux; should know the LAMP stack well)
Python, Bash and/or Ruby scripting

145

Senior Site Reliability Engineer Resume Examples & Samples

Influence feature design, architecture, standards & processes to ensure security, performance, operability & scale
In-depth experience with Azure services such as SQL Azure, Compute, IaaS, PaaS, WAAD, VNET, ExpressRoute
5+ years’ experience in large scale internet service design & implementation
Experience managing/troubleshooting Dynamics AX 2012/AX 7 environments is desired

146

Site Reliability Engineer Resume Examples & Samples

Refining our automated software deployment and release management processes
Responding to and resolving unexpected and potential service problems
Writing software and defining best practices to prevent the same problem happening twice

147

Site Reliability Engineer Resume Examples & Samples

Troubleshoot and analyze hardware, networks, application, and storage/DB related issues
Work cohesively within a team to manage the monitoring and stability of our multi-cloud and colo environments
Build, configure, and proactively automate systems and services
Install and support in-house, open-source, and 3rd party applications throughout the technology stack
Collaborate with our Developers and Data Scientists on design concepts to help right-size, build, scale and automate production infrastructure solutions
Provide peer support to other SREs and End Users
Experience with any of the following tools we like: AWS, Ansible, Zabbix, Spark, Kafka, Docker
Comfortable working in a dynamic startup environment

148

Site Reliability Engineer Resume Examples & Samples

Enable SPs to grow their top line revenue by delivering new, innovative services to market quickly and consistently across a variety of end point devices
Help SPs deliver new services safely and confidently by providing comprehensive security and integration across VMware and 3rd party ISV and SaaS solutions

149

Senior Site Reliability Engineer Resume Examples & Samples

Serve as a primary point responsible for the overall health, performance, and capacity of one or more of our Internet-facing services
Gain deep knowledge of our complex applications
Assist in the roll-out and deployment of new product features and installations to facilitate our rapid iteration and constant growth
Develop tools to improve our ability to rapidly deploy and effectively monitor custom applications in a large-scale UNIX environment
Work closely with development teams to ensure that platforms are designed with "operability" in mind
Function well in a fast-paced, rapidly-changing environment
Participate in a 24x7 rotation for second-tier escalations
B.S. or higher in Computer Science or other technical discipline
5+ years in a UNIX-based large-scale web operations role
UNIX/Linux systems knowledge/administration background
Trouble-shooting skills that span systems, network (TCP/IP), and code
Programming skills (we like JavaScript, shell, command line tools)
Knowledge of most of these: data structures, relational and non-relational databases, networking, Linux internals, filesystems, web architecture, and related topics
Experience deploying Node.js applications
Experience with web-based Java/J2EE architectures and JVM configuration
Exposure to MongoDB
Strong interpersonal communication skills (including listening, speaking, and writing) and ability to work well in a diverse, team-focused environment with other SREs, Engineers, Product Managers, etc

150

Site Reliability Engineer Resume Examples & Samples

Monitor and troubleshoot issues that arise on customer-facing infrastructure
Liaise cross-departmentally in the event of an incident or escalation to or from ShiftOps
Identify system data, hardware, or software components required to meet customer needs
Verify stability, interoperability, portability, security or scalability of system architecture
Analyze, install, acquire, modify and support operating systems, databases and software
Linux/UNIX administration
Scripting/data manipulation: Perl, Bash, or Python
Scalability plumbing (LVS and other load balancers)
Virtualization: Virtuozzo, Linux KVM, or OpenNebula
Experience with Configuration management and version control
OS: Solaris, CentOS, Debian, or Nexenta
Network technologies: Foundry/Cisco, HP, Procurve
Hardware: HP, DDN, Sun/Oracle, Dell
Experience with Jira
Filesystems: Ext2, Ext3, NFS, ZFS, DRBD, GlusterFS
Web Servers: Apache, Nginx
Email: Dovecot, Exim
DNS: PowerDNS
System/Network monitoring tools: Graphite, Observium, or Netflow
Automated imaging systems: System Imager, PXE, Kickstart, Cobbler
Remote management systems: iLO, ILOM, DRAC, iPEPS
Hardware and Software RAID
Databases (configuration, access control, replication, tuning, other engines): MySQL/Percona, PostgreSQL, NoSQL technologies
Working with Firewalls

151

Site Reliability Engineer Resume Examples & Samples

Work in engineering team to design, build, and maintain cache layers, key-value, relational and binary file storage systems
Diagnose and troubleshoot complex distributed systems handling petabytes of data and develop solutions that have a significant impact at our massive scale
Participate in building advanced tooling for testing, monitoring, administration, and operations of multiple clusters across data centers, primarily in Python, C and Java
Work and collaborate across teams such as Application Services, Linux Kernel, JVM, Capacity Planning, Hardware, Network, and Datacenter Operations to design next-gen storage platforms
5-7+ years of managing services in a distributed, internet-scale *nix environment
Demonstrable knowledge of TCP/IP, Linux operating system internals, filesystems, disk/storage technologies and storage protocols
Hands-on operational experience managing cache services (Memcached, Redis)

152

Site Reliability Engineer Resume Examples & Samples

Manage the availability, latency, scalability and efficiency of the Livestream platform, ensuring the platform has 100% uptime
Participate in capacity planning and forecasting, system performance analysis, and system tuning at application, database, filesystem and networking layers
Manage backups and disaster recovery, including backup monitoring and verification, and leading restoration tests and disaster recovery drills
Analyze and improve system security policies on all layers of the platform; track and handle vulnerabilities affecting it
Help manage physical datacenter infrastructure, networking and operations
Build, manage and maintain all the base OS images and system configurations
Bachelor’s Degree in Computer Science or related field. In lieu of degree, relevant skills or equivalent experience
Fluency in at least one of the following languages: C, C++, Perl, Python, Ruby
Familiarity with at least one of the following languages: Perl, Python, Ruby, Lua, JavaScript
Ability to write scripts using shell, awk, sed and other core Linux tools
Knowledge of essential Linux system calls, signals and memory management
Experience in low-level system debugging and performance measurement tools
Expert knowledge of IPv4 networking and routing protocols
Experience with automation and working with monitoring tools including cloud monitoring tools
Knowledge of SQL & NoSQL technologies

153

Site Reliability Engineer Resume Examples & Samples

Contribute to design, write and deliver software to improve the reliability, scalability, latency, and efficiency of your services
Influence and contribute to new designs, architectures, standards and methods for large-scale distributed systems
Experience in SDLC and Agile projects
Expertise in problem solving and analyzing distributed systems and critical production service environments
Debugging, testing / validation and analytics/SQL skills
Big data experience preferred (COSMOS, Hadoop)
Fundamental understanding of OSI model/stack
Firm sense of accountability, ownership and initiative for end-to-end project lifecycle with solid project management skills
Strong communication and collaboration skills to work with people from a variety of technical backgrounds
Experience defining and measuring service key performance indicators
Statistics experience and bias for measurement and data driven improvements
The ability to drive Live Site issues and repair items to resolution

154

Principal Site Reliability Engineer Resume Examples & Samples

Minimum of a HS Diploma or Associate’s Degree and/or
A minimum of 10 years of technical experience along with established leadership credentials across technologies
Exceptional knowledge and experience deploying and managing Cloud technologies

155

Senior Site Reliability Engineer Resume Examples & Samples

Bachelor's degree in Computer Science, Information Management or other technical / IT field, OR in lieu of a degree: a HS diploma/GED and minimum 4 years of IT experience or equivalent military experience/training
Minimum 3 years IT experience in enterprise-wide deployments
Demonstrated experience scripting or developing software and services for the cloud (Ruby, Python, Go, Java, Node.js, .NET, etc.)

156

Site Reliability Engineer Resume Examples & Samples

Will configure and operate RHEL, CentOS, CloudLinux, Windows Server 2008/2012 and other mainstream distributions
Web and Application Server Technologies – Understanding of the configuration and management of Apache, IIS, Nginx, Tomcat, Sendmail, Exim, ProFTPD, etc. Will implement full LAMP stack configurations as well as Windows-based stacks
Virtualization Technologies – Will work with OpenStack, KVM, and implementation of Linux / Open Source technologies on the platform
Config and Automation Management – General operational understanding of common Configuration Management technologies employed on Open Source and Proprietary platforms such as Puppet, Spacewalk, SaltStack, SCCM and Fabric
Will work with SQL and NoSQL database technologies, such as MySQL and Cassandra, with a focus on basic troubleshooting and query abilities
Performance and Reliability Management – Will troubleshoot and make recommendations for improvement, and will write basic scripts and applications in PowerShell, Python, C#, and other industry standard languages
Average 3-5 years experience in a large scale production environment (1000+ servers)
Strong analytical thinker and problem-solver
Organized, detail-oriented and able to multi-task
Exposure to LeanIT concepts, DevOps, and Agile methodologies
Experience with mainstream Open Source and Microsoft technologies
Advanced understanding of Linux and Windows operating systems
Experience with Virtualization technologies (KVM, OpenStack, Virtuozzo, VMware, Xen, etc.)
Exposure to database technologies (MySQL or MSSQL)
Basic experience with monitoring technologies and methodologies
Exposure to SAN and NAS technologies (iSCSI, FCP, CIFS, NFS, etc.)
Exposure to scripting and coding in industry standard languages (Python, PowerShell, C#, Ruby, etc.) and code management methodologies (Git, SVN, TFS, etc.)
Recommended Certifications: Red Hat RHCSA, LPIC-2, ITIL Foundation, Cisco CCNA, Network+

157

Site Reliability Engineer Resume Examples & Samples

Excellent knowledge and experience in Software Engineering, System Administration, and Operations
Experience developing in any of the following languages (Java, Javascript, C#, Python, Go)
Understanding of Unix/Linux systems from kernel to shell and beyond, including internal Unix systems and networking (DNS, TCP/IP, UDP, etc)
Experience designing and implementing tasks in Continuous Integration systems (Jenkins, Travis, CircleCI, etc.)
Strong grasp of security, privacy and monitoring concepts
Strong sense of project ownership and team responsibility
5+ years of relevant working experience and at least 3 years in a DevOps / Site Reliability Engineer role

158

Site Reliability Engineer Resume Examples & Samples

Knowledge of MapReduce, SaaS, PaaS
Knowledge of building/compiling code using GNU make
Knowledge of profiling and debugging Java/GCC applications
Knowledge of managing services on AWS
Knowledge of tools like Jenkins, Git, Nagios, PostgreSQL, JIRA, Hadoop, Kafka, Flume
Certification in database administration, RHEL administration, network administration, or any other relevant certification would be a plus

159

Site Reliability Engineer Resume Examples & Samples

Manage the scalability, performance, and availability of MediaMath platform APIs by solving for reliability against existing systems and services spanning the entire stack
Develop tools and automation to minimize delivery time and increase developer productivity
Participate in the design and development of new and evolving services, architecture, and performance standards
Support team members in the development of a SOA strategy and migration path
Participate in capacity planning and service performance analysis and tuning
Respond to and resolve emergent issues. Be on-call periodically as part of shared team
This is not an exhaustive list of responsibilities. Other duties may be assigned, as needed. MediaMath retains the right to change job duties at any time
As part of our global technology team, you may be required to work off-hours or be on-call on a rotating basis
You are considered a “security employee” and have a particularly noteworthy security aspect to your role and are required to undergo additional training annually
Administer and ensure logical security in carrying out all job duties
Support in Security Incident response and monitoring, as needed
3-5 years of relevant work experience, including experience with high-volume, production distributed systems environment
Fluency with Python, Perl, Shell, Ruby, Scala, Go, or similar
Experience managing and deploying full stack, distributed services
Experience with system automation tools such as Ansible, Chef, Puppet, SaltStack, etc
Experience with monitoring, alerting, and pipeline analysis tools such as Nagios, Sensu, Graphite, Riemann, Logstash, etc
Expertise in the use and optimization of SQL
Experience with queuing/data-pipelining solutions such as Storm, Kafka, RabbitMQ, ZeroMQ, etc
Experience with systems such as PostgreSQL, MySQL, Cassandra, CouchDB, Redis, and Memcached
Exposure to AWS and OpenStack APIs preferred

160

Principal Site Reliability Engineer Resume Examples & Samples

Takes ownership of all issues and does not out-source issues to engineering, but collaborates with Engineering as appropriate. Identifies methods and opportunities to build automation into RCA. Drives resolution activities and automated remediations to ROCC
Identifies automation enhancements and associated use cases. Constructs work plans to develop and deploy. Identifies resource needs and schedule. Applies impact analysis techniques. Leverages Agile methodologies to code required result
Leads Change Management activities in a Cloud environment by developing complex test plans and seamless deployment processes. Able to identify impact analysis and quality control validation of changes. Defines agile automation techniques to rapidly deploy changes
Primary interface to Product Engineering, Service Engineering, and SRE Development for continuous integration and delivery of product changes
Provides technical expertise in developing solutions to complex software engineering and cloud operations problems, which require frequent use of ingenuity and creativity. Provides work leadership to others. Interfaces with senior management to provide and obtain information and gain consensus regarding project direction. Understands industry tools and trends and their relevance to the Cloud industry
Collaborates well with other engineers and other engineering groups, voluntarily shares information. Submits Customer and Field Features for future product releases. Ability to probe into issues and opportunities to define requirements and scope. Links technology to business value
Exercises considerable latitude in determining technical objectives, without appreciable direction. Understands agile software development methodologies and has senior development skills. Understands the value of software and associated time savings. Offers proposed design changes/suggestions to processes and products, exerts significant latitude in determining objectives of an assignment. Accurately analyzes current environment and business requirements to define solutions. Spends 50% of time with software development. Uses proper modeling and documentation techniques
May be accountable for overall product and/or serve as a customer advocate. May represent organization as principal customer contact. Able to drive technical solutions and provide clear requirements to supporting resources
Interacts cross functionally on matters that require coordination across functional/organizational lines. Accurately scopes technical level of effort. Leverages tools and provides clear and concise updates to project leaders
Significant contributor to organizational goals and objectives. Demonstrates core values and leads by example
Writes functional detailed design specs as well as responding to requirement documents and system level test plans. Accurately estimates both level-of-effort and resource skill sets at the assignment level. Accurately documents development, operation processes, and incident remediation activities
Understands and adheres to cost/delivery/quality targets established during the program design phase. Applied understanding of process flow and how activities influence downstream processes. Uses metrics to plan and control projects

161

Site Reliability Engineer, Agent Lifecycle Resume Examples & Samples

Working with Java or Kotlin
Operationalizing different datastores
Troubleshooting in a complex environment
Foundation in systems knowledge including some of

162

Site Reliability Engineer Resume Examples & Samples

Improve the predictability and reliability of software releases with the implementation of automated build, test and deployment tools and processes
Engage with Software Engineering and Architect Teams to ensure Release Engineering best practices are implemented
Provide after-hours release and change control support based on the most current change control schedules

163

Senior Site Reliability Engineer Resume Examples & Samples

Create and deliver automation software required for improving the functionality, availability, and manageability of our Cloud collaboration micro-services using Python and Go language
Basic networking skills and familiarity with Unix/Linux systems including CLI used in checking component status and logs
"Cloud" (using IaaS and PaaS) experience desirable

164

Site Reliability Engineer Resume Examples & Samples

Providing support for the applications, scripting, automation, managing Incidents, Changes and Day-To-Day support in Production, QA and Development environments
The majority of the job responsibility will be focused on Production Application support but there will be some script development and automation work (25-50%)
The job will from time to time require work during off hours, weekends and night shifts, as this is a 24x7 environment
Maintaining strong links with the 1st Level, DCIS and the Development teams
Management of change required to support product enhancement and growth
Following departmental change management procedures in defining, planning and implementing changes in such a way that service disruption is minimized, and adherence to Service Level Agreements is ensured
Following defined procedures in order to log and track issues
Minimum 3-5 years experience with various software applications and development
Good analytical skills (performing root cause analysis and creating solutions to resolve incidents)
Strong communication skills - this role will involve communicating regularly with a wide range of people of all professional levels and cultures
Ability to code in one or more scripting languages (Python, Java etc) and open platforms like CDAP and Docker
Ability to work independently in a challenging environment to meet product schedules and deadlines
Good understanding of Networking, BigIP and DNS
Experience with scripting and automation (monitoring, configuration and deployment automation)
Experience implementing system and application monitoring (familiarity with monitoring tools like Kibana)
Experience with cloud-based systems (like OpenStack and AWS)
System Administration experience (Linux)
Experience with Big Data technology (Hadoop, Elasticsearch, HDFS etc)
Familiarity with open text-based formats like JSON, and with web services
Familiarity with Oracle Databases

165

Site Reliability Engineer Resume Examples & Samples

SREs are engineers with the right mix of knowledge and skills in software (i.e. programming, data structures and algorithms) and systems (i.e. operating software on internal and external infrastructure at scale)
We constantly evaluate products and services before and after production releases to prevent, identify and fix problems that impact service availability in deploying, configuring, releasing, monitoring, recovering, and scaling
We dedicate at least 50% of our time applying software engineering principles to resolve problems impacting service uptime or our operational efficiency
Experienced in writing clean code in at least one Object Oriented language such as: Java, C#, JavaScript, Python, Ruby
Experienced automating software build, deployment and server configuration management using tools such as Puppet, Chef and Jenkins
Expertise in Windows and Linux system administration, databases (relational and NoSQL), web servers (Apache, IIS), networking and storage technologies
Experience with Cloud technologies and platforms such as AWS or Azure

166

Senior Site Reliability Engineer Resume Examples & Samples

5 plus years experience administering production Linux environments
Demonstrated proficiency writing automation in Python, Ruby, or a similar language
Resourcefulness and independence when required to find solutions to new problems
Ability to incrementally improve legacy configuration code
Expertise in writing clearly about work scope and status
Experience working with widely distributed teams

167

Site Reliability Engineer Resume Examples & Samples

Work closely with technical writers and service engineers to ensure that our best-in-class documentation is delivered reliably and on time
Develop and maintain build systems and automation tooling
Develop and maintain internal tools to support authoring and collaboration
Maintain both integration and production infrastructure as part of an operations-focused culture
Work closely with other members of the technical content team to provide systems support, training, and solutions as needed
Collaborate with the user experience team to ensure that technical content (including documentation, contextual help, UI text, and other forms of assistance) is discoverable, usable, and pleasing to the customer
7+ years experience
Deep knowledge of at least one of the following scripting languages: Python, JavaScript, Bash
Knowledge of several of the following technologies: CI/CD systems (TeamCity/Jenkins), Docker, Linux OS (Oracle Linux/RedHat), infrastructure-as-a-service (AWS/OpenStack/Google Cloud/Azure), web servers, load balancers, log analysis tools, remote system and network debugging tools
Impeccable written English skills
Strong team player with outstanding communication, organization, and interpersonal skills
Comfort with agile, swiftly changing, dynamic software development situations
Ability to drive, follow, and evangelize cross-team processes
Knowledge of cloud infrastructure concepts and technologies
Experience using distributed source code management systems such as Git
Experience using enterprise-grade bug tracking systems, such as JIRA
Experience (and commitment to) capturing and maintaining institutional knowledge
A Bachelor's degree in a Computer Science-related field, or significant work experience in startups or fast-paced enterprise technology development environments

168

Site Reliability Engineer Resume Examples & Samples

Experience developing for and supporting specific Azure technologies (ex: Cloud Services, Azure SQL, Azure Service Bus, Azure Storage, KeyVault, Service Fabric, Azure Active Directory, etc.)
Experience developing for and supporting an on-premise Microsoft stack (ex: Server 2012 R2, Active Directory, IIS, SQL Server, Hyper-V, CRM)
Experience supporting Live Site as part of a 24x7 on-call escalation path
Strong understanding of automation practices in a DevOps space (ex: automated deployments, synthetic transactions, monitoring, etc)
A communicative team player with a can-do attitude

169

Site Reliability Engineer Resume Examples & Samples

Work within various areas of focus (e.g. monitoring, secrets management, deployment pipeline, containerization, etc.) and research, strategize, and propose solutions that meet requirements, reduce friction for product engineers, and consolidate existing solutions
Drive adoption and onboard teams to Delivery Engineering tooling and solutions
Contribute to designing, implementing and maintaining team tooling
Work closely with engineering teams to learn about needs, current process and to promote best practices
Migrate application stacks to cloud infrastructure
Solid programming and troubleshooting skills. You may be called upon to help with systems written in Go, Python, Java, Scala, PHP, Ruby, among many other programming languages. We don't expect you to know everything, but we expect you to learn quickly
A good understanding of databases. Both relational and otherwise
An understanding of cloud based deployments on Amazon Web Services or Google Cloud Platform
Strong grasp of multi-tier application architecture & concepts of networking, load balancing, monitoring and *nix OS
A passion towards automating things. We love repeatable processes and know that humans are prone to error. We'd like to automate deployments, monitoring releases and even brewing our coffee
An understanding of an application that is one of our goals to move our systems in that direction
A high degree of interest in Linux containers and smart clustering solutions like Kubernetes/Mesos/fleet, etc
A bias towards helping people. Many teams will rely upon you for help to build their systems

170

Senior Site Reliability Engineer Resume Examples & Samples

Create enterprise infrastructure and tooling
Monitoring and diagnosis of systems for optimal performance
Generating well defined and documented standard processes for the enterprise
Research and development of new technologies related to our problem set
Occasional presentations and training of integrated technologies
Full stack developer, fluent in multiple languages (Python, Perl, Bash, Go)
Experience with system automation tools (Ansible, Puppet, Chef)
Exposure to SQL and NoSQL solutions (MySQL, Postgres, Redis, Cassandra)
Strong cloud skills, AWS and GCP
Experience with containerization and virtualization technologies (Docker, Kubernetes, CoreOS)
Hands-on experience with system configuration (Nginx, Apache, Linux, Consul, etcd)
Queuing and data-pipeline solutions (RabbitMQ, ZeroMQ, pub/sub, SQS)
Effective communicator with a desire to share and guide others

171

Site Reliability Engineer Resume Examples & Samples

Support the daily build and release needs of agile teams
Coordination with the development organization on production deployments
Provide support and rollback facilities in a timely manner
Develop suggestions for enterprise-wide best practices
Post release monitoring and validation
Professional experience with CI/CD solutions
Good understanding of CI/CD principles (Jenkins, Bamboo)
Master VCS skills (git)
Good experience in a Linux environment (Multiple distros, CentOS, Debian, RedHat)
System scripting experience in multiple languages (Python, Perl, Go)
Comfortable automating builds and releases with Make or other tools
Working experience with Atlassian tools a plus (JIRA, Confluence)

172

Senior Site Reliability Engineer Resume Examples & Samples

Appropriate CS and IT technical background (Bachelor's in computer science or equivalent)
Strong knowledge of Linux operating systems (CentOS/RHEL) and configuring common core services, debugging. Knowledge of BSD a plus
Strong knowledge of L1-4 Networking, Switching/Routing, L2-7 reverse proxy and proxy load balancers, firewalls, DNS/DHCP, TCP/IP stack. Should have basic knowledge of OSPF, BGP, SNMP, and SMTP
Must demonstrate strong skills in L7 debugging and analysis. Knowledge of curl and other tools to diagnose and differentiate L1-4 issues from L7 HTTP, HTML, JS issues. Should be familiar with diagnosing HTTP header and caching issues
Strong knowledge and experience with SOA/RESTful/JSON environments running node.js, vertx.io, tomcat, apache, NGINX, varnish, memcached, redis, Ruby, Go, Python
Strong knowledge and experience with version control (Git), deployment tools (Capistrano) and Continuous Integration technologies (Jenkins, Puppet, Chef)
Experience with transactional databases (MySQL, Postgres) configured for high availability and redundancy
Ultimate self-starter
Experience in handling production outages and root cause analysis
Strong crisis management leadership ability
Strong and effective written/verbal communication skills, whether talking to individual contributors or to executive management
Skill with Ruby, Python, or Java a plus
Experience creating tools for infrastructure (IaaS and PaaS) management and automation a plus
Part of a Global TDO team responsible for overall site availability and reliability with rotating shifts and follow-the-sun global support. TDOs are not on-duty longer than 8 hrs at a time
Works with the Global Systems Operations Center staff (SOC) (Tier 1-2 support), and SRE team (T3-4 support) to prioritize issues and ensure adequate follow-up to issues
During an incident, leads efforts to triage and mitigate impact globally. After an incident, responsible for incident reviews and action items for follow-up/restoration in order to improve overall service stability
Manage real-time communications during outages with both technical and non-technical audiences
Evangelize Best Practices to the rest of the company
Develop policies and procedures that improve overall product stability and availability
Design and create tools to help manage site services, and host monitoring/alarming
Participate in Incident Reviews of outages in order to improve overall product stability
Build relationships with development teams and technology leaders across the company

173

Site Reliability Engineer Resume Examples & Samples

Experience managing Linux systems in a 24/7 production environment
Ability to program in Python, Ruby or Perl highly preferred
Working knowledge of multi-tier applications and their dependencies including load balancing, TCP/IP networking, web services, LDAP and DNS
Proficiency with web server administration including Apache and Nginx highly preferred
Knowledge of database support and administration including MySQL, Postgres & HBase
Experience with monitoring tools such as Nagios and Graphite highly preferred
Develop and maintain automation for system administration and application management
Experience with configuration managers such as Puppet, Chef or Ansible highly preferred
Excellent interpersonal and communication skills demonstrated through previous projects or assignments (work or academic related)
Network administration experience a plus

174

Site Reliability Engineer Resume Examples & Samples

Experience making hardware decisions based on workload projections
Desire to work closely with both team members and other people throughout the company
Proficiency writing clearly about work scope and status
Ability to clearly communicate technical concepts and reasoning to both technical and non-technical audiences
2 plus years experience administering production Linux environments
Proficiency writing automation in Python, Ruby, or a similar language
Ability to perform physical activities, including, but not limited to, walking, standing, lifting items up to 35 lbs. and using, handling and controlling tools

175

Site Reliability Engineer Resume Examples & Samples

Evolve our continuous deployment infrastructure. Our deployment infrastructure is the lifeline tying our development teams to the True Fit base. We rely heavily on its success, and will look to you to help provide guidance and direction for its growth
Programmatically build and administer cloud based Linux servers (Ubuntu, RHEL) on AWS. True Fit lives entirely in the cloud. We need someone who understands what this means and is comfortable navigating these seas. We need you to know what things like RDS, EC2, CloudWatch, Lambda, Route 53, and VPC mean, and to be willing to use their APIs
Architect deployment systems for our application software. True Fit utilizes the cutting edge in analytics engines and methods in our systems. We need someone willing and able to build and utilize a varied knowledge base to support what we do
Automate the world. We're a lean shop, and we work at a breakneck pace. If it needs doing more than once, it needs to be automated. Build, deployment, monitoring, testing, and infrastructure are all within our automation sphere
Analyze and troubleshoot network and infrastructure issues. True Fit's environment demands a sharp mind and rapier analytical skills. We run into issues now and again, and we rely on smart people to tease a problem apart to find elegant solutions
Monitor and measure system performance. The heart of any well-tuned system is a known system. True Fit must understand what's going on under our hood and requires someone mindful of details
Work with other departments to design and build operations-friendly software. While our operations infrastructure may provide the guts of the True Fit machine, our product & support people, engineers, and scientists, provide the heart, mind, and soul of what we do. We'll need a person who can liaise with other departments, understand their needs, and collaborate to find solutions
Participate in our agile development environment. We maintain a fast moving, but tight ship. We need a person that can pull recent changes, commit a fix, and push a working system back into our infrastructure
2+ years as a system administrator, network engineer, build engineer, or software developer. True Fit is looking for a person ready to get their hands dirty. Vast seas of knowledge are not a requirement, but we do look for a quick and inquisitive mind, and one capable of applying what is learned. A development background is preferred
Experience in an environment that develops and releases commercial software products in hosted environments. We're looking for someone that understands an agile release process and can contribute to the support of our systems release processes
Proficiency in the Open Source Ecosystem. We're looking for a person steeped in the open source tea and proficient in that ecosystem. You can expect to see and be responsible for various different systems' smooth functioning
Expert scripting skills in at least one of shell, python, perl, ruby, etc. We live and breathe by our code and processes. We need someone that can speak our language
Some datastore knowledge. True Fit's data collection is vast and varied. Ideally, you'll have relational database acumen. Postgres & Oracle are preferred. NoSQL knowledge a la Mongo or Hadoop would be a plus as well
Knowledge of configuration management / desired state frameworks. Our systems are built by our code. We're looking for a person with knowledge of Chef, Puppet, Ansible, etc
Working knowledge of application technologies including JavaScript, HTML, Scala/Java, C++. Our operations department participates in many of the company's core functions. The more you can understand of what's being said, the better
Undergraduate degree in a quantitative field (Math, Physics, Engineering, Computer Science) or relevant experience. We find a STEM background or relevant industry experience sets people up for success in this position
Strong listening and communications skills. We need to understand what's up, and for you to do the same! What's needed? Where are we going? How are we getting there? How's our progress?
Highly motivated self-starter with a can-do attitude. We need you to get your hands dirty and to go shoulder deep when required. We also need a person able to suss out where the work needs to be done

176

Site Reliability Engineer Resume Examples & Samples

Keep the customer facing services available at top performance by monitoring the health of the system
Build the tools we need to monitor our system and respond quickly
Develop systems to monitor the capacity of our applications and work to solve these capacity issues before they become a problem
Lead Pardot Engineering by identifying areas for improvement in reliability and prototyping these solutions
Collaborate with other teams to discuss and resolve technical issues and escalations
Expertise in TCP/IP related technologies (networking protocols, network programming, etc.)
Prior Chef or other configuration management experience
Experience working on large-scale systems with many moving parts

177

Site Reliability Engineer Resume Examples & Samples

At least 2 years’ experience in troubleshooting complex systems, including OS, Network, and Application code
At least 2 years’ experience in coding in at least one modern language such as Python, Ruby, NodeJS
At least 2 years’ experience with UNIX/Linux systems
Experience with DevOps, Continuous Delivery, Continuous Deployment
BS in computer science
Experience with SCM systems like Git
Basic security knowledge
Database knowledge including SQL and NoSQL

178

Site Reliability Engineer Resume Examples & Samples

DevOps Engineer
Reliability Engineer
Systems Engineer
Cloud Platform Engineer

179

Site Reliability Engineer Resume Examples & Samples

Ensure service reliability and uptime for Barracuda Cloud services
Ensure consistent application of operational standards across cloud services
2 to 4 years proficiency in Linux/Unix command line and understanding of package management on Linux systems
Demonstrated programming skills in one or more of: Bash, Python, Perl, PHP, Ruby, C, Java
Understanding of network OSI model
CS/Engineering Degree or 2+ years of job experience

180

Site Reliability Engineer Resume Examples & Samples

3+ years proficiency in Linux/Unix command line and understanding of package management on Linux systems
Demonstrated programming skills with scripting languages such as Python, PHP, Bash, Ruby, or Java
Experience with configuration management systems such as Puppet

181

Senior Site Reliability Engineer Resume Examples & Samples

#1 responsibility: solving problems with code (automating solutions to common problems, designing tools, troubleshooting production product code to find efficiencies, etc)
Collaborate with internal groups to identify, develop, and deploy manageable, scalable and robust services
Represent Cloud Engineering in design reviews and operational readiness exercises for new and existing services
Drive standardization efforts across multiple disciplines and services throughout the organization
Identify and drive opportunities to improve automation for the company; scope and create automation for deployment, management and visibility of services
5 to 7 years experience programming with languages such as: Python, PHP, Ruby, or Java
2 to 3 years proficiency in Linux/Unix command line and understanding of package management on Linux systems
1 to 2 years experience with configuration management systems such as Puppet - experience designing and implementing configuration management systems, deploying production services using config management
Experience implementing service monitoring and alerting using tools such as Nagios
Ability to find and troubleshoot issues across the entire OSI stack
Track record of successful practical problem solving, excellent written and interpersonal communication, and documentation skills
Demonstrated ability to manage projects and deadlines
Experience with database administration, replication, troubleshooting production performance problems is a plus
CS/Engineering Masters Degree and 4+ years of job experience or 8+ years of job experience

182

Site Reliability Engineer Resume Examples & Samples

Detective: SREs troubleshoot problems in live production systems, both on their own and in collaboration with systems and application engineers
Ambassador: Keep the company informed about the status of Fitbit services, the impact of known issues, and the progress of ongoing investigations
Developer: Design and refactor parts of the Fitbit backend system for stability and performance, and write tools and scripts to automate maintenance and monitoring tasks
Coach: Meet with other teams and attend architecture reviews, and offer advice on how to implement features that are efficient, highly available, and fault-tolerant

183

Site Reliability Engineer Resume Examples & Samples

Strong Linux system administration experience and production troubleshooting experience
Expert-level Java knowledge
Experience working with high traffic, scalable web applications
Experience of diagnosing and fixing complex production issues
Understanding of data structures, algorithms and framework internals
Good verbal and written English

184

Site Reliability Engineer Resume Examples & Samples

2+ years or equivalent experience, 5+ for senior engineers
A strong background in Linux software development
Experience working with Mesos is a plus!
Working experience with web-scale deployment tooling (Chef, Puppet, Ansible, etc)
Knowledge of containerization (Docker) and web services frameworks

185

Senior Site Reliability Engineer Resume Examples & Samples

Deploy and manage solutions efficiently on private or public cloud, and ensure they meet established SLAs
Maintain database performance by investigating, debugging, identifying and resolving production issues
Expand existing automation tools to streamline onboarding.pment and the Adobe Campaign platforms
Design system solutions of the product from custom requirements for customers
Working experience/knowledge of public cloud services such as Amazon Web Services (AWS) or Microsoft Azure
Automation tools (Ansible, Salt, Chef or Puppet)
Good knowledge of virtualization technologies and container technologies
Experience with Postgres (Deployment, maintenance, and scalability)
Experience with email-related technologies including authentication, accreditation, DNS, SMTP, FBLs, and MTAs

186

Senior Site Reliability Engineer Resume Examples & Samples

Own and scale EAA Cloud services
Analyze and improve the efficiency, scalability, and reliability of our backend systems
Automate new cloud service deployments
Participate in an on-call rotation backed by our engineering teams
Perform advanced troubleshooting and monitoring of our systems to ensure SLA and capacity requirements are met
Proactively monitor the EAA cloud service for potential issues
Analyze logs and events from the cloud service and provide recommendations for ensuring smooth service operation
BS with 8+ years or MS with 6+ years in Computer Science / Engineering or equivalent
6+ years' experience managing the setup, configuration and monitoring of different components of AWS infrastructure such as EC2, Route53, VPC using Boto/Boto3
3 years of shell scripting and at least one other scripting language. Python preferred
4+ years in development of automated deployment scripts using tools like Ansible, Puppet and Chef
5+ years with setup or configuring load balancers, Nginx/Apache proxies
5 years with configuration and maintenance of common infrastructure such as Postgres, Cassandra, Storm, Elasticsearch, RabbitMQ, Celery, Active Directory
4 years with monitoring/reporting technologies e.g. Sensu, Nagios, Graphite, Elasticsearch, Logstash, Kibana
4 years with performance, scalability, and reliability issues of 24x7 commercial services
3 years implementing SOC2 and PCI
2 years with AWS S3 and IAM services
5 years with troubleshooting
5 years with Linux. Ubuntu preferred
Good hands-on knowledge of Internet protocols (HTTP, DNS, TCP/IP, ICMP, DHCP)
Experience in: Configuring web servers (NGINX/Apache/IIS) and the infrastructure necessary to support enterprise web applications (load balancers, Application Delivery Controllers, SSL-VPN solutions, SQL and databases etc.)
Familiarity with Cloud computing services like VMware, OpenStack, Microsoft Azure
Excellent written and verbal communication skills

187

Senior Site Reliability Engineer Resume Examples & Samples

Extensive experience in designing, configuring, and delivering large scale application technical infrastructure
Experience as a project lead, supporting multiple simultaneous projects, in high scale environment
Strong coding and scripting ability (Java, C, C++, Python, Perl)
Strong experience with database technologies (Oracle, Mongo preferred)
Experience and knowledge applying best practices to build secure platforms
Excellent Analytical and creative problem solving skills
Must be highly collaborative and able to work with different teams
A strong sense of focus and excellent attention to detail while working in a very fast-paced environment
Ability to learn new technologies in a short time
Strong communication skills and ability to articulate complex solutions well

188

Site Reliability Engineer Resume Examples & Samples

Front office application support – basic knowledge of infrastructure helpful but not critical
Aptitude for and willingness to learn – almost all home grown (expect to train on application portfolio from ground up)
Logical problem solver (how they answer questions/process/must be a good listener)
Linux – Python / Perl / shell scripting
Monitoring – mostly home grown apps – will train + Anturis
Basic understanding of OSI stack and ITIL fundamentals – will require ITIL certification eventually
Communications important (with business users: traders)
Ability to work in high pressure environment – must be thick skinned
Will be co-located with traders – will create tickets
Basic tech screening is less important than mind set
SW engineering background helpful
Financial background helpful (derivatives)

189

Site Reliability Engineer, Self-driving Resume Examples & Samples

BS in Computer Science or a related field. A MS or equivalent is preferred
Programming skills in at least one of Python, C, C++, or Java
Strong skills in process, documentation, and change management
Excellent interpersonal and customer-facing skills

190

Site Reliability Engineer Resume Examples & Samples

Contribute to a team responsible for the availability, scalability, and performance of our enterprise platforms
Build and maintain automation systems to help us manage our rapidly growing infrastructure
Gain deep knowledge of our complex applications to develop a bird's eye view of our platform
Assist our Software Engineering teams to ensure proper monitoring and metrics are being built into the applications
Maintain and develop custom systems and tools to improve our ability to deploy, automate, and effectively monitor custom applications in a mixed Windows/Linux environment
Assist in the rollout and deployment of new product features and installations to facilitate our rapid iteration and constant growth
Lead troubleshooting of issues that occur in our production environments
Gain and use knowledge of monitoring systems and configuration management systems (AWS-specific tools, Puppet, Nagios, New Relic, etc)
Troubleshoot issues across the whole stack - hardware, software, applications and network
Partner with development teams to build the standards by which we deliver our infrastructure

191

Site Reliability Engineer Resume Examples & Samples

Strong understanding of the Linux operating system
Ability to code or script automation in at least one language (Java, Go, Python, Ruby, Perl, Bash, etc.) on Linux-based platforms
Deep experience in at least one infrastructure component (operating systems, compute, storage, networking, data center, distributed systems, big data, cloud, etc.) and solid understanding of the rest, and how they impact services
Familiarity with cluster management tools such as Mesos, Docker Swarm, Kubernetes, Marathon, Aurora
Familiarity with distributed storage and filesystems such as CEPH, HDFS, GFS, IPFS

192

Senior Site Reliability Engineer Resume Examples & Samples

Installs, supports and maintains new software infrastructure for SaaS deployment
Monitor the SaaS environment and work with QA, Developers, hosting company to identify and troubleshoot service problems
Analyses and resolves SaaS infrastructure faults and undertakes routine preventative measures, including backups and implements, maintains and monitors network security
Contributes to Security and Risk management initiatives, including business continuity planning for SaaS deployments
Ensures that failover mechanisms are in place and are working correctly
Configures and maintains hardware and software monitoring solutions to send alerts in case of service problems
Works with all other departments in the company to gather requirements and coordinate activities

193

Site Reliability Engineer Resume Examples & Samples

3+ years of professional software development in Scala
Production-level operations experience across JVM applications
Experience monitoring distributed systems application architectures

194

Senior Site Reliability Engineer Resume Examples & Samples

Experience using AWS, especially provisioning EC2 nodes, and setting up CloudFront distributions
Unix/Linux administration
Scripting (Bash/Python)
HTTP server configuration (NGINX/Apache)
Configuration management (Terraform/Puppet)
Experience deploying apps using Docker containers
Node health monitoring (DataDog or similar)
Web traffic analysis with Splunk or similar
Continuous deployment using Bamboo, Bitbucket Pipelines, or similar

195

Senior Site Reliability Engineer, Hipchat Resume Examples & Samples

Scripting and software development across multiple programming languages
Network optimization and troubleshooting: TCP/IP, UDP, ICMP, MAC addresses, DNS, OSI layers, and load balancing
Building, automating, and maintaining infrastructure in Amazon Web Services
Development and maintenance of configuration management systems responsible for thousands of hosts
Leading a team of engineers in troubleshooting service outages affecting millions of users
Implementing system and application level telemetry for large distributed cloud architectures
Diagnosing and resolving capacity problems in high-throughput web applications and network services

196

Site Reliability Engineer Resume Examples & Samples

Work with engineering teams to design and write code to create systems which are highly available and able to scale seamlessly
Plan for and eliminate any potential threats to stability, availability or security
Improve monitoring, alerting and resilience of systems
Write tools to assist work such as capacity planning or improving the ability to debug production issues over distributed systems
Contribute to a culture of learning and responsibility by writing detailed postmortem reports
Tackle live issues on production when on-call with assistance from the rest of the teams

197

Site Reliability Engineer Resume Examples & Samples

Work with product teams on design and implementation of large scale distributed systems
Automate, automate, automate and then ... automate!
Bring ideas to life to help make the engineers' lives better
Help developers with some of their hardest problems

198

Senior Site Reliability Engineer Resume Examples & Samples

Troubleshoot complex production issues in a distributed environment
Work on projects that make our network more stable and faster
Work with our other L3 engineering support teams to troubleshoot complex problems on our network for our customers
5 years of relevant experience and a Bachelor’s degree in Computer Science or its equivalent, or
3 years of relevant experience and a Master’s degree in Computer Science, or
1 year of relevant experience and a PhD in Computer Science
Education: Bachelor's Degree in Computer Science or equivalent
Minimum of 5 years of experience in troubleshooting and reading complex code using debuggers
Minimum of 5 years of experience in DevOps, SRE or similar roles
3+ years experience with AWS tools and methodologies
3+ years experience programming in Python or JavaScript
3+ years experience with MySQL and NoSQL databases
Education: Master's Degree in Computer Science or equivalent
Highly responsible, self-disciplined, self-managed, self-motivated, able to work with little or no supervision
Experience in distributed systems or security products
Passion to understand, learn, and dissect new technologies quickly on your own
Extensive experience working on multiple projects at a time in a fast paced, results oriented environment
Pluses: Security domain knowledge, Web Application Firewall

199

Site Reliability Engineer Lead-digital Resume Examples & Samples

Ensure the secure availability and 100% uptime (reliability) of all JPMC digital properties from an application delivery perspective
Be part of a team which proactively monitors all application flows into JPMC (Web and Mobile)
Recognize false positives in flow and recommend course of action
Understanding the security implications of a certain pattern and socializing it with a sense of urgency and course of action
Troubleshooting issues and quickly determining the root cause in working with other development / networking or security teams
Document root cause and solutions in a concise manner and build a knowledge base
Shadow and teach/lead colleagues with skills to ensure the availability of JPMC digital properties securely
Liaise with other organizations within JPMC to manage IT compliance/audits/security with National and International laws and regulations, as well as contractually enforced industry standards
Interface with IT Security and Risk, Audit, and privacy to coordinate related policy and procedures, and to provide for the appropriate flow of information regarding risk
4+ years of experience in developing, deploying and supporting commercial and custom software solutions with an emphasis on identity and access management framework, Security, integration and support
Expert knowledge of the HTTP protocol, response codes and behavior
Expert knowledge of Application Security (app-level firewalls)
Knowledge of ASM (F5) or equivalent application firewall and policy
Creative and inquisitive professional with excellent interpersonal and cross functional/divisional collaboration skills able to handle work smoothly under stress, managing multiple assignments concurrently, adjusting easily as business needs change, and acquiring necessary new working knowledge quickly
Highly analytical with strong research skills, able to discern key issues and information in complex situations and resolve issues quickly
Advanced communication (including group presentations), problem solving, and conflict resolution with internal and external stakeholders including senior leaders
Ideally has held positions in software development, networking, operations and other technical areas, demonstrating a well-rounded command of other technology disciplines

200

Site Reliability Engineer Resume Examples & Samples

Scripting and software development across one or more programming languages (ideally Java and/or Python)
Deep understanding of Linux systems
Hands on experience with cloud infrastructure such as AWS, Google compute, Azure, Rackspace cloud (minimum of 2 years)
Expert level troubleshooting skills across different levels of the stack

201

Lead Site Reliability Engineer Resume Examples & Samples

Create, maintain, own and operate your team’s services that support fundamental capabilities within Grubhub’s products
Tackle some of the most challenging problems you can face developing high availability services in a distributed cloud environment that needs to scale exponentially
Manage / Lead a team of 2 to 3 direct reports

202

Site Reliability Engineer Resume Examples & Samples

Experience as either a Systems Administrator with Programming experience or an application-focused DevOps Engineer
Substantial experience managing Windows and Unix/Linux infrastructure
Self-starter who is able to take ownership of technical issues and be a productive member in the on-call rotation and certain off-hours shifts
Strong troubleshooting skills that span systems, network, and applications
Strong scripting ability in at least one of the following languages: Bash, Ruby, Perl and/or Python
Experience with virtualized environments
Intermediate knowledge of networking and load-balancing concepts
Ability to write clear and thorough documentation
Prior experience in an Internet-facing technical operations role with high uptime requirements
Demonstrated ability to successfully work with Cloud architectures such as AWS, Azure, CloudStack, or OpenStack
Strong personal and professional initiative with a focus on the success of the team and organization
Expertise with configuration management systems such as Puppet
Experience with package management in multi-datacenter environments
Experience with monitoring systems, such as Nagios and Sensu
Experience collecting and aggregating log data in an ELK stack

203

Senior Site Reliability Engineer Resume Examples & Samples

Developing, engineering and operating API-driven platform services that provide the foundation for our micro services
Supporting Cloud platform services from architecture, through development, deployment and production operations
Developing automation and orchestration to improve the deployment and management of our environments
Developing and maintaining build platforms, automation engines, micro-services, and compute platforms
Maintaining automated test suites using CI/CD tools
Participating in troubleshooting, capacity planning and analysis, performance analysis activities
Advising management on service onboarding strategies and execution

204

Site Reliability Engineer, Senior Resume Examples & Samples

5+ years of experience with Microsoft systems administration tools
3+ years of experience with LoadRunner
3+ years of experience in scripting with PowerShell
Experience with Cloud Service Providers (CSP), including AWS and Azure

205

Senior Site Reliability Engineer Resume Examples & Samples

Analyze, diagnose, replicate, troubleshoot and resolve technical issues reported by customers using the Fiserv PaaS platform
Take ownership, manage and maintain status on support requests from the business unit customers
Escalate unresolved issues that require more in-depth knowledge in a timely manner
Report and submit product defects and collaborate with other engineering disciplines to triage customer/business unit issues
Create and peer review knowledgebase articles and product documentation
Willing and able to learn new technologies
4-year college degree + 5 years of experience in applicable field OR advanced degree in applicable field + 4 years of relevant experience
Software engineering skills: experience writing maintainable reusable software. You use automation to make your job more efficient. You have an obsessive need to automate. You are offended when you have to do anything manually more than once
Operations or Systems Administration experience, particularly on UNIX. You know how page cache works and feel very comfortable at the command line
Experience with configuration management. You have managed an infrastructure with hundreds or thousands of servers and dozens of technologies
Strong networking fundamentals. You understand TCP/IP, subnetting and the difference between socket and connect timeouts. NSX, Panorama experience a plus
Knowledge of distributed systems (Windows and Linux) and virtualization software (VMware)
Automation/Continuous Integration (Salt, Fabric, Jenkins, Octopus Deploy, etc.)
Version control software (Git, GitHub)
A knack for troubleshooting tough problems
Meticulous and cautious. You identify and consider all risks, and balance those with performing the task efficiently
Experience and/or interest in agile methodologies
Background in DevOps
Solid communicator
On-call experience

206

Site Reliability Engineer Resume Examples & Samples

Bachelor’s Degree preferred, Associate’s Degree with 1 or more years of experience, 3-5 years of experience in lieu of a degree
Experienced in at least one script language (Bash, Python, Perl)
Experience with configuration management systems (Chef, Puppet, Salt)
Experience configuring and supporting Jenkins or Hudson
Linux system engineering expertise
Networking Knowledge (strong VPC knowledge is a PLUS)
Experience with Artifactory (or Nexus)
Excellent written communication, problem solving, and process management skills
Experience with Application Server platforms (JBoss, EAP, EWS, Wildfly)
Experience with Cloud Computing platforms (e.g. Amazon AWS, Eucalyptus, VMware, Docker)
Experience with Java build tools such as Ant, Maven, Gant, or Gradle
Experience with agile development, continuous integration and automated testing

207

Site Reliability Engineer Resume Examples & Samples

Work closely with engineering stakeholders to define platform requirements and underlying service implementations
Rapidly iterate on existing platform features
Strong understanding of scaling systems reliably in AWS (global experience a bonus)
Exposure to containers (Docker)
Experience with automation/configuration management tools such as Ansible, Chef or Puppet
Proficiency in the use of code and script (Python, Ruby and/or Go)
BS in CS or 5+ years of comparable experience
Expertise and experience in site performance profiling and tuning
Familiarity with cutting-edge open source libraries and experience contributing to projects of personal interest a plus
Familiarity with Amazon Web Services administration a plus

208

Site Reliability Engineer Resume Examples & Samples

1 year of experience developing software for Windows (2000, 2003, XP, Vista) or UNIX/Linux (Red Hat versions 3-5) operating systems
Hadoop Distributed File System (HDFS)
JSON or BSON
RESTful services
Requirements analysis and design of at least one Object Oriented system
Developing solutions integrating and extending FOSS/COTS products

209

Senior Site Reliability Engineer Resume Examples & Samples

Applies full understanding of the business, the customer, and the solutions that a business offers to effectively design, develop, and implement operational capabilities, tools and processes that enable highly available, scalable & reliable customer experiences
Utilizes deep knowledge of operations engineering, connected services, and system administration plus knowledge of industry best practices to innovate and influence operational approaches and solutions
Maintain and improve the availability, performance, scalability and efficiency of the services by implementing monitoring, automation, redundancy, capacity and business-continuity planning
Automation and development of operations tools, application dashboards, etc
Metrics reporting on applications performance, availability, reliability, etc
Develop and implement automated deployments
Implementation and design of configuration management
Troubleshoot issues and participate in 24x7 on-call support, ensuring the stability and performance of the production environment
Review and development of performance and capacity plans (operational capacity and load requirements)
Facilitate the creation of the operational readiness documents
Coaches and mentors other application operations engineers
Coordinates technical dependencies with other teams
Technical lead for complex projects
Drive the end-to-end incident management process
Develop and automate monitoring processes
Oversee change management and configuration management operating mechanisms
Drive root cause analysis (RCA) and risk management processes
Drive ongoing improvements and efficiencies in operational practices, tools & processes BU and Intuit-wide
B.S. or higher in Computer Science or 3+ years of equivalent knowledge and experience
Passion and talent for advanced scripting and automation (e.g. Python, Ruby, Perl, Golang, Java, C, etc.)
Expert knowledge of managing services both in the Cloud (e.g. AWS) and in traditional data centers
Expertise in Linux/Unix system administration
Significant experience managing highly available services at scale
Significant experience with configuration management (e.g. Chef, Salt, Puppet, CloudFormation, etc.) and automated deployments
Experience with CI/CD process and technologies (e.g. Jenkins, Travis CI, etc.)
Strong knowledge and experience with metrics, monitoring and alerting tools (e.g. New Relic, Splunk, OMD/Nagios, Sensu, PagerDuty, etc.)
Experience in support/troubleshooting/operations of relational and NoSQL databases (e.g. MySQL, Postgres, Oracle, RDS, Cassandra, DynamoDB, etc.)
Operational mindset with ability to do incident, problem, change, and SLA management
Expert problem solving capabilities and ability to think outside the box
Works well in a fast-paced, dynamic operations environment with geographically distributed teams
Experience with Tomcat, JBoss, Mule

210

Principal Site Reliability Engineer Resume Examples & Samples

Develop principles, patterns, and tools to improve our ability to rapidly deploy and effectively monitor application services in a large-scale and complex environment
Able to multitask and deliver in a fast-paced, rapidly evolving technology landscape and participate in an on-call escalation for incident resolutions
Responsible for Lights Out Management of our services and advocating and contributing toward the best operability of the services in production
Spends approximately half the time with the core engineering teams advocating and contributing towards making our application services resilient and with “ilities”, and the remaining half with core site operations teams
End-to-end user flow profiling – create requirements and implement solutions to optimize E2E customer interactions, and model customer transaction flows for long-term analysis and short-term troubleshooting
Mentoring DevOps and Site Reliability Engineers
Communicate with Engineers, Architects, and Executives across various organizations to convey ideas and influence outcomes
BS degree or higher in Computer Science (or equivalent experience)
Significant experience with cloud hosted apps/service, AWS experience preferred, and able to translate business requirements into securely implemented capabilities in the cloud
Expert level experience at building, deploying and operating services at scale
Hands-on experience in working with distributed systems and the “ilities” of the services
Systematic problem solving approach, coupled with a strong sense of ownership and drive, leading solutions that span across CTG and partners
Internally motivated, self-starter with ability to plan, organize and establish priorities to meet goals and achieve results
Must work well under pressure, balancing multiple priorities and objectives, handles conflict well
Demonstrated technical leadership and ability to communicate in a complex cross-functional environment
Design reviews of operational approaches and solutions
Defines and contributes to Operational Standards and Requirements
Risk Analysis and root cause analysis
Technical feasibility and approach decisions

211

Site Reliability Engineer Resume Examples & Samples

Automate the server provisioning process to reduce the labor of our networking engineering and datacenter operations teams. Once we plug a new server in, it walks itself through all aspects of provisioning to join the fleet without any human involvement
Build scalable infrastructure to manage metadata for hundreds of billions of files, hundreds of petabytes of user data, and millions of concurrent connections
Drive the company through “Disaster Recovery Tests”, where we manually turn down pieces of infrastructure to test Dropbox’s overall resiliency to failures
Design the system and processes that Dropbox engineers use to deploy their software into production
Build an auto-remediation system to automatically resolve production incidents before escalating them to on-call engineers

212

Site Reliability Engineer Resume Examples & Samples

Enable scaling by providing tools, developing training and/or augmenting processes
Build tools and automation to prevent recurrence of problems in mission-critical products/services
Develop a deep understanding of the various services and applications that come together to deliver Walmart e-commerce products
Design new monitoring tools and smart alerts that help discover failures/issues in a timely fashion, and work with engineers to identify root causes and fix issues
Influence, design and create new architectures, standards and methods for large-scale enterprise systems

213

Site Reliability Engineer Resume Examples & Samples

Design customized hosted managed solutions that are performant, cost effective, and delight our customers
Implement solutions by defining database physical structure and functional capabilities, database security, data back-up, and recovery specifications
Design, implement, and maintain integrated monitoring and alerting tools, leveraging existing tools and logging. Monitoring happens at all levels of the application stack
Expand existing automation tools to streamline onboarding
Collaborate with various internal teams to provide the best service possible for our customers
Evaluate new technologies to enhance the level of customer service

214

Site Reliability Engineer Resume Examples & Samples

Drive the design, deployment and maintenance of Hortonworks HDP in multiple production environments
Architect and build redundant, multi-site monitoring toolsets and software
Automate repeatable processes and build tight full-stack integration with Linux and Hortonworks HDP
Build positive relationships with Hortonworks customers, demonstrating your leadership in the Big Data industry
Familiarity with network design and architecture principles
Development experience in Java, Scala, Python or other languages
Experience with PXE, kickstart, or Linux from Scratch
In depth understanding of Hardware and Storage technologies
Experience with streaming technologies, such as Kafka and Storm
Published Open-Source software or public contributions a great asset

215

Senior Site Reliability Engineer Resume Examples & Samples

Design, write and build tools to improve the reliability, latency, availability and scalability of Walmart e-commerce products
Augment existing instrumentation to build a cohesive picture of the characteristics of our systems with special attention to points of failure
Participate in capacity planning, demand forecasting, software performance analysis and system tuning
Perform root-cause analysis of complex problems involving multiple parties

216

Site Reliability Engineer Resume Examples & Samples

Software engineering skills: experience writing maintainable reusable software
Operations or Systems Administration experience, particularly on UNIX. (You know what a daemon is and how to restart one. When the daemon won’t start because some other process is listening on the port it needs, you can find and kill the errant process.)
Experience and/or interest in Test Driven Development (TDD) and agile methodologies
Ability to dive into a polyglot codebase and contribute while learning
On-call experience: we build production systems, and believe the best way to understand what that means is to support real systems in the wild. Our ops teams write code, and our development teams help operate their code in production

217

Site Reliability Engineer Resume Examples & Samples

You will be responsible for maintaining and scaling production services and servers across multiple data centers for complex and data-intensive cloud services
You will improve scalability, service reliability, capacity, and performance
You will write automation code for provisioning and operating infrastructure at massive scale
You will work with development teams to make sure the applications fit nicely within the infrastructure and scalability/reliability is designed and implemented from the ground up
You will work with QA on building pipelines and automation for delivering and deploying applications to production
You will roll up your sleeves to troubleshoot incidents, formulate theories and test your hypotheses, and narrow down possibilities to find the root cause
Hands on experience in building fault tolerant and scalable systems
5+ years of Unix/Linux experience, with some experience in managing 100+ nodes
Experience with AWS and AWS APIs
Experience with Configuration Management and CI/CD. Salt and Jenkins preferred
Familiar with web servers (Nginx preferred) and HAProxy
Preferred experience: Hadoop, Kafka, RabbitMQ, Spark, HBase, Elasticsearch, Containers, OpenStack

218

Senior Site Reliability Engineer Resume Examples & Samples

You will be responsible for designing, building, maintaining, and scaling production services and server farms across multiple data centers for complex and data-intensive cloud services
You will design and enhance software architecture to improve scalability, service reliability, capacity, and performance
You will write automation code for provisioning and operating infrastructure at massive scale. You are not an operator, you’re an experienced software engineer focused on operations
You will work with development teams to make sure the applications fit nicely within the infrastructure and scalability/reliability is designed and implemented from the ground up. You will work with QA on building pipelines and automation for delivering and deploying applications to production
You will participate in the occasional on-call rotation supporting the infrastructure
You will write postmortem reviews and remediation recommendations
Strong sense of architecture and design for fault tolerance, scale, and stability
Strong development/automation skills. Must be very comfortable with reading and writing Python code. Java is a plus
10+ years of Unix/Linux experience (shell/tools/kernel/networking)
Tools-first mindset. You build tools for yourself and others to increase efficiency and to make hard or repetitive tasks easy and quick
Subject matter expert in one of these areas: Big Data (Hadoop 2.x, Kafka, Spark, HBase, Elasticsearch) or Data Center Virtualization (Containers, Mesos, OpenStack, SDN)
Familiar with middleware software such as Nginx, HAProxy, RabbitMQ, and typical AWS components, as building blocks of implementing services
Knowledgeable about collecting metrics, measuring systems and interpreting data to make decisions
Organized, focused on building, improving, resolving and delivering. Good communicator in and across teams, taking the lead

219

Site Reliability Engineer Resume Examples & Samples

Time series databases such as Graphite, InfluxDB, or OpenTSDB
ELK Stack or equivalent log monitoring technology (Sumo Logic, Splunk, Graylog)
Linux CLI
SNMP

220

Site Reliability Engineer Resume Examples & Samples

Drive immediate relief and provide a sustainable resolution to issues within the ServiceNow platform
Use knowledge and experience in software development, application support, systems engineering and networking to proactively prevent issues from reoccurring
Drive internal stakeholders and partner teams to improve the reliability, scalability and performance of the infrastructure through improved system design
Drive and contribute to a culture of intolerance to manual activity, which results in an automation environment delivering repeatable and scalable response to system issues
Deep knowledge of Linux systems
Coding in any development/scripting language such as Bash, Python, C++, Java or JavaScript
Networking skills and IP addressing
MySQL database administration
Monitoring of performance/availability in systems, applications and networks
Uncompromising attention to detail
Ability to work in shifts that cover one weekend day
Ability to live and work full-time in Australia

221

Site Reliability Engineer Resume Examples & Samples

BS degree in Computer Science, or related technical field, or 3+ years industry experience
3+ years experience working with Unix/Linux systems from kernel to shell and beyond with experience working with system libraries, file systems, and client-server protocols
3+ years experience in one or more of the following languages: C, C++, Java, Scala, Python, Go, Ruby, or scripting experience in Shell and Perl
3+ years of experience in network theory e.g. TCP/IP, UDP, ICMP, etc., MAC addresses, IP packets, DNS, OSI layers, and load balancing
Proficiency with configuration management frameworks such as Puppet or Salt
Proficiency with Docker and Kubernetes
Familiarity with one or more common web frameworks
Experience working with data processing, loading, and transformation systems
Experience building high quality distributed systems or backend services
Experience with Google and Amazon cloud services
Comfortable working on Linux-based systems

222

Site Reliability Engineer Resume Examples & Samples

Design, write and deliver software to improve the availability, scalability, latency, and reliability of Edge services
Solid understanding of networking concepts
Solid scripting and automation skills (Python, Shell, Go, Perl, etc)
Configuration Management experience (SaltStack, Ansible, Chef, etc)
Experience with Continuous Integration tools (Jenkins, Buildbot, Travis, etc)
Experience troubleshooting large scale/high performance systems
Strong communication skills and enjoy working in a highly collaborative environment

223

Site Reliability Engineer Resume Examples & Samples

Design, write, and maintain software to improve the availability, scalability, latency, and efficiency of Thumbtack's services, incorporating third-party open-source tools when available
Create new designs for a growing number of distributed systems
Design and implement the tools and processes used for deployment and change management
Plan and execute configuration management
Own, maintain, and continuously improve all systems provided as a service, such as monitoring and datastores
Engage in service capacity planning and demand forecasting, anticipating performance bottlenecks
Run software performance analysis and system tuning
Plan and execute disaster recovery drills
Participate in rotating on-call duties
Fluent in one or more of: C, Python, Go
Minimum of 4 years of industry experience in engineering
Familiarity with algorithms, data structures, and complexity analysis
In-depth knowledge of operating systems (processes, threads, IPC, concurrency, locks, mutexes, semaphores, etc.)
Experience working with Unix/Linux systems from kernel to shell and beyond, with experience working with system libraries, file systems, and client-server protocols
Experience with network protocols and theory (TCP/IP, UDP, ICMP, MAC addresses, IP packets, DNS, OSI layers, and load balancing, etc.)
Experience with Puppet, or some other configuration management tool
Systematic problem solving approach
Expertise in designing, analyzing, and troubleshooting large-scale distributed systems
Experience with PostgreSQL tuning and performance

224

Site Reliability Engineer Resume Examples & Samples

BS/BA preferred or equivalent experience
5+ years of relevant experience in Linux systems administration, provisioning, configuration, troubleshooting, and monitoring (Nagios, Zenoss, SNMP)
Programming experience in one of the following: Python, C, and/or Perl
Deep understanding of the Linux OS, internals, kernel & file systems tuning, protocols and services (PXE, DNS, HTTP, NFS, CIFS, FUSE) and troubleshooting utilities (strace, sar, vmstat, mpstat, tcpdump)
Extensive experience in shell/bash scripting and automation tools (Salt, Puppet, Chef, FAI, Kickstart)
Working knowledge of virtualization technology (KVM, Xen, ESXi, OpenStack) and Dell hardware (DRACs, Lifecycle Controllers, BIOS, RAID controllers)
Must possess strong documentation skills and be able to work with rapid change, Configuration Management utilities (SaltStack, Puppet, etc), and Source Control (Git, SVN)
Prior experience with CDN, ISP, 2,000+ server environments, or production Internet-facing services for large enterprise customers is highly preferred

225

Site Reliability Engineer Resume Examples & Samples

Leads the onboarding of teams to the Gannett Cloud Platform, including writing the proper Continuous Integration automation scripts and optimizing the applications' cost and performance inside the Cloud Platform
Bachelor’s degree in Computer Science, Computer Engineering, Electrical Engineering or related technical field
Minimum of 3 years of progressive experience in Linux systems administration position
Experience must include deploying to AWS or other clouds, auto-scaling and the architecture of stateless applications, and using Chef or other configuration management tools

226

HBO Site Reliability Engineer Resume Examples & Samples

Demonstrable knowledge of cloud-based software deployments on the cloud stack of your choosing (Amazon Web Services, Google Cloud Platform, OpenStack, VMware)
A high degree of familiarity with Linux containers and container orchestration tools like Kubernetes or Mesos
Strong understanding of HTTP, REST, networking concepts and global load-balancing
A passion towards automation - we’re looking to automate anything possible
Work cross-functionally within a service team and be a core contributor in every significant engineering solution that is delivered
Debug production issues across services and levels of the stack
Participate in on-call rotations, along with every member of the engineering team
Solid understanding of system design, including the operational trade-offs of various designs
Solid programming and troubleshooting skills. You may be called upon to help with systems written in Go, Python, Node.js. You won’t be expected to know everything, but we are looking for people who can dig through a codebase for debugging and commit tactical fixes as needed
1 – 10+ years of experience. We’re looking for talented engineers at all experience levels

227

Site Reliability Engineer Resume Examples & Samples

Expert knowledge of the Oracle Enterprise Linux operating system (OS, networking, process level)
Fluent in at least two languages (Bash, Python, Ruby, Go)
Hands on experience with CloudStack or OpenStack, Mesos, Marathon
Familiarity with KVM, Qemu, and Docker
Familiar with Software Defined Networking, Open vSwitch
Able to troubleshoot issues across the entire stack
Familiar with monitoring and log aggregation tools such as Splunk
Recent experience in a large (5,000+ hosts) computing environment

228

Senior Site Reliability Engineer Resume Examples & Samples

Develop new and enhance existing features for DDE's massively distributed system
Work on performance-critical data processing system
Work on data collection, processing, storage and access subsystems
Work on projects that focus on system scalability, performance, and security
Drive feature development from idea inception through design and testing to operational deployment
Follow SW development methodology best practices, including collaboration with QA departments to successfully deploy high quality new system components
BS in Computer Science or equivalent, MS preferred
2+ years of experience developing SW on C/C++ or Java
3+ years experience with Linux/Unix environment
3+ years experience with computer networking
Knowledge of networking principles, including TCP/IP, SSH, SSL and HTTP protocols
Ability to troubleshoot complex network problems and customer issues
Proven track record of delivering large amounts of high quality, complex code
Highly responsible, motivated, able to work with little supervision
Experience with Big Data systems (Hadoop, Spark, Apache Cassandra, etc.) and principles (Map/Reduce, Stream Processing, etc.)
Experience with scripting, e.g. Perl, Python, bash, and RESTful APIs
Experience with DBMS, e.g. PostgreSQL, MySQL, etc.

229

Site Reliability Engineer Resume Examples & Samples

Make sure our systems distributed around the globe are designed and deployed securely
Make sure we are designing systems that will scale far into the future
Be paranoid and think of all the ways an attacker could compromise our systems, all the ways hardware and software could fail, and mitigate
Help design and build the systems that enable us to scale our data centers without scaling the number of people required to manage them
4+ years of C++ experience
4+ years of experience in high-performance distributed systems, software development, networks, security
2+ years of experience with at least one major IaaS provider (AWS, Google Cloud, Rackspace, Azure)
Bachelor of Science degree in Computer Science, or 4+ additional years of related industry experience
Strong skills in debugging, performance optimization and unit testing
Experience with distributed computing
Effectively worked as a team member and in large code bases
Experience designing and implementing security for live infrastructure

230

iXp Intern, Site Reliability Engineer Resume Examples & Samples

Proactively monitor availability and performance of the Ariba Cloud using key performance tools
Effectively and quickly respond to monitoring alerts, incident tickets and overall technical support for the Ariba product suite
Perform extensive application and web site troubleshooting to quickly resolve issues
Work closely with subject matter experts within various Engineering teams
Ensure user tickets and monitoring alerts are handled according to pre-defined SLAs for response time, updates and closure
Develop and automate manual tasks to improve day-to-day monitoring and scalability of time critical operations
Handle communication and notification on major site issues to executive management teams
Document standard operating procedures to effectively utilize ITIL best practices
Ensure effective shift turnovers for continuous 24/7 support
Requires candidates to currently be enrolled in an undergraduate, Masters, MBA or PhD degree program which is applicable to the position
Experience working in a Unix environment
Triage and support system applications including but not limited to Apache, DNS, Sendmail, SSH, TCP/IP, NFS and common Internet protocols
Excellent knowledge of operating system internals, file system structures and machine architectures in a Linux operating environment
Some knowledge of Oracle database administration
Interest in writing Perl and Shell scripts to automate processes and enhance productivity
Willing to work in a dynamic, fast-paced environment with well-developed practices and procedures
Outstanding interpersonal, analytical, and communication skills
Must be reliable and dependable with ability to multi-task across various functions
Candidates must be local to Silicon Valley to be considered
Must be able to work onsite in Palo Alto, CA during summer 2017

231

Senior Site Reliability Engineer Resume Examples & Samples

Experience with NoSQL databases (especially MongoDB)
Hands on experience with AWS - minimum of 1 year
Serious troubleshooting skills across different levels of the stack
Experience troubleshooting a continuous integration pipeline
Understanding of Linux systems

232

Site Reliability Engineer Resume Examples & Samples

At least a Bachelor’s degree in CS or related field
Preference for a mid-level lead with hands-on developer skills
5+ years of progressive experience in the technical support space; ideally in the retail, hospitality or consumer goods industries
3+ years of J2EE platform experience
2+ years’ experience with supporting EA/middleware technology platforms
2+ years’ experience with release management
Solid knowledge of shell scripting and at least one scripting language
Ability to actively participate in infrastructure design and implementation
Must be adaptable and able to focus on the simplest, most efficient & reliable solutions
Practical knowledge of various aspects of service design, including messaging products & behavior, caching strategies and software design practices
Experience in Linux-based shell scripting is a big plus
Experience with DevOps and release management is a plus
Experience working with Jira

233

Site Reliability Engineer Resume Examples & Samples

Demonstrated experience with data structures, and software design
Demonstrated experience programming in one or more of: C, C++, C#, Java, Python, Go
5+ years of full Software Development Life Cycle experience
3 or more years of experience working as a programmer
Demonstrated experience with PL/SQL development
Ability and willingness to assume periodic “on call” duties for issues/escalations
Experience with running web services at scale
Understanding of Unix/Linux systems from kernel to shell and beyond
Networking: knowledge and understanding of network theory, such as different protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing

234

Site Reliability Engineer Resume Examples & Samples

4+ years experience in network design and implementation
4+ years supporting high-volume customer facing networks at an L3 level
Expert level experience with multiple network equipment vendors
Expert level understanding of TCP/IP, IPv4/IPv6
Expert level BGP and OSPF
Administrator level knowledge of and comfort working in *nix environments
Proficiency in modern development languages such as Python, C++
Experience developing tools and automation to drive efficiency in operations tasks
Experience developing metrics and telemetry to measure and improve availability
Expert level troubleshooting skills
Experience with L2 technologies such as MLAG and VPC
Participation in on call rotation

235

Bluemix Site Reliability Engineer Resume Examples & Samples

At least 2 years of experience in software development or engineering
At least 2 years of experience in application operation
At least 2 years of experience in problem determination
3 years of experience in multiple coding skills, such as Representational State Transfer (REST)/web services, distributed systems, messaging, and knowledge of open source tools used in Cloud Foundry development (e.g., Git and Jenkins)
3 years of experience in software development or engineering
3 years of experience in application operation

236

Senior Site Reliability Engineer Resume Examples & Samples

Someone with a passion for learning
Someone gutsy who isn’t afraid to try new things
Someone who is comfortable failing, learning and trying again
Someone who is data obsessed
Someone with an automation mindset
Someone who hates black boxes
Security focused position working with a Security Team
Application Security exposure
Proficient in a high level scripting language
Continuous Integration Knowledge
Experience with RDBMS and NoSQL implementations
Knowledge of distributed systems design and constraints

237

Senior Site Reliability Engineer Resume Examples & Samples

Experience administering Linux systems
Experience working in a SaaS environment at scale
Fluency coding in either Python, Ruby, Java, or Go

238

Site Reliability Engineer Resume Examples & Samples

React to monitoring alerts and lead efforts to fix system issues quickly
Work with technology partners to resolve issues and push improvements in our ecosystem
Develop and contribute to internal knowledge base
Be a champion for our customers and ensure change management processes are followed
Automate Site Reliability Engineering operations by developing software applications and API Integrations to connect disparate systems
Work in a highly skilled 24/7 environment. Provide support on weekdays and also off hours as needed
3+ years in an IT environment, with preference given to operational center environments
Strong background in interacting with relational database environments, including constructing complex queries to proactively identify system trends and troubleshoot application issues
Familiarity with open source and 3rd party monitoring systems (Nagios, Kafka, etc.)
Experience using scripting tools (Perl, PowerShell, PHP) to create utilities that facilitate day-to-day workflows
Systems/Network administration background
Experience with a broad set of system tools for troubleshooting and analysis, including but not limited to SoapUI and LoadUI for API testing and simulation
Knowledge of AppDynamics, Splunk, and the Microsoft System Center suite for operational monitoring
Familiarity with Linux Operating Systems

239

Site Reliability Engineer Resume Examples & Samples

Track our cloud customer SLAs and be on-call to ensure total conformity to these customer commitments
Create and maintain complete and accurate documentation for the purpose of operational audits including security and compliance
Coordinate internal activities across company and our cloud partners, ensuring achievement of the above responsibilities
Continuously review and enhance processes and operating procedures needed to maintain the most cost effective enterprise-grade cloud infrastructure
Innovate and automate improvements to our Cloud Operations
Identify and promote best practices and patterns for the setup, configuration and management of infrastructure, including databases, servers, and network and storage systems

240

Senior Site Reliability Engineer Resume Examples & Samples

Become a subject matter expert (SME) of several internal highly-distributed systems
Provide expertise to engineering and operations team on the use of these systems
Tune systems to perform better and operate more reliably
Manage the rollout and activation of new features and platform changes
Assist Akamai engineering team and operations staff who rely on our system to perform their job
Write Bash, PHP, SQL, Perl and Python code to enhance and fix the functionality of the system
Bachelor's degree in engineering, computer science or equivalent experience
5+ years experience in the industry
3+ years experience with networking systems
Experience with alerting, monitoring and performance management systems
Experience in Business Intelligence systems/tools (analyzing data and identifying trends)
Experience working with DNS, HTTP, and SSL protocols
Experience with Web servers, Perl, Python, SQL and PHP
Experience analyzing and optimizing security systems
Excellent problem solving/troubleshooting skills
Ability to work closely with other engineers to understand problems and work towards solutions

241

Site Reliability Engineer Resume Examples & Samples

5+ years of Linux systems administrator experience
Demonstrated programming skills in two or more of: Bash, Perl, Python, Ruby, PHP, Java, C
Solid understanding of operational principles, such as capacity planning, monitoring and incident handling
Experience with Hadoop, HDFS, Kafka, Spark, Docker, Mesos, Marathon, AWS or Azure desired

242

Site Reliability Engineer Resume Examples & Samples

Primary support for outages affecting production environments
Provisioning, configuration and maintenance across all our service environments, including writing software to automate repetitive tasks
Working collaboratively with your peers to solve complex issues
Measuring and improving service quality and reliability while remaining cognizant of the need for economic and operable solutions
Participation in code reviews, willingness to take time to help others grow and succeed

243

Site Reliability Engineer Resume Examples & Samples

Ensure reliable operation of processing pipelines that analyze terabytes of scientific data
Apply industry best practices to the field of scientific programming, e.g., testing (nose), continuous integration and deployment (buildbot), and code quality best practices (prospector)
Provide hands on support to facilitate the scientific mission of users, including debugging, performance analysis, and training
Convert user-written programs into more robust automated pipelines
Maintain and operate existing applications via configuration management (Ansible) as well as implementing for new systems as needed
Respond to incidents that cause outages or threaten data integrity, while planning to implement new systems that might prevent reoccurrence
Perform periodic on call duty
Bachelor’s Degree and 3 or more years of general IT experience including
2 or more years of experience in software engineering
3 or more years of experience in UNIX/Linux system administration, technical troubleshooting, and performance tuning
Bachelor’s degree in computer science, engineering, or related field
Experience working on small cross-team technical projects with direct contact with stakeholders
Ability to work well independently and with a team, with a strong desire to learn and support the scientific community
Exposure to software development best practices including architecture, documentation, and testing
A background using technologies from most or all of these areas: UNIX/Linux systems administration (e.g., CentOS, Debian), programming languages (e.g. Python, JavaScript, shell scripts), scientific computing (e.g. Matlab, R), protocols and standards (e.g. HTTP, NFS), debugging tools (e.g. strace, firebug, pdb), web application technologies (e.g. Nginx, Tomcat, NodeJS), database technologies (e.g. MongoDB, PostgreSQL), resource management facilities (e.g. OpenPBS, Slurm)
Strong written and spoken English language skills

244

Site Reliability Engineer Resume Examples & Samples

1+ years of experience with Linux systems administration, including CentOS or RHEL
1+ years of experience in scripting with Bash, Python, or Ruby
Experience with Hadoop
Knowledge of Puppet, Chef, Ansible, or Salt
Knowledge of Docker
Experience with working in Agile or DevOps environments
Experience in working with distributed data stores, including Accumulo or HBase
Experience with continuous monitoring tools, including Kibana, SolarWinds, and Splunk
Experience with service discovery platforms, including DNS, Zookeeper, or Consul
Experience with Cloud Service Providers (CSP), including AWS, and private Cloud implementations, including OpenStack
Experience with working in and maintaining PaaS environments, including Kubernetes or Mesos
DoD 8570 Compliance Certification, including Security+ CE or CISSP

245

Site Reliability Engineer, UK Resume Examples & Samples

Migrating daemons and services to a 64-bit architecture while minimizing service disruption
Identifying local and distributed performance bottlenecks and evaluating whether they can be allayed by caching, precomputation, or other similar techniques
Adding network topology awareness to a code deployment pipeline to increase performance using less bandwidth
Improving service monitoring to include automated anomaly detection
Automating Ruby scripts within a Unix environment and building large systems out of small components that each do one job and do it well

246

Site Reliability Engineer Resume Examples & Samples

Ensure user-visible uptime and quality, providing operational and development expertise so that our systems fail rarely and are fast to fix when they do fail
Participate in architecture and design reviews to provide recommended improvements to the development teams to improve the reliability and performance of applications
Minimize manual involvement by imagining & implementing continuous improvements that create an operating environment, including the development of new tools, dynamic monitoring, alerting, & automated self-healing & recovery
Identify and/or analyze problems relating to mission critical services and implement automation to prevent problem recurrence; with the goal of automating response to all non-exceptional service conditions
Engage in application performance analysis and system tuning, and capacity planning
Perform root cause analysis to identify & implement continuous improvements
Capable of presenting analyses and recommendations to leadership or discussing the technical merits of solutions with engineers and architects
Own the day-to-day health, uptime, monitoring, and reliability of services and server infrastructure
Practice Agile and Scrum methodologies
Strong experience with SharePoint 2013 platform and/or software engineering
Working knowledge of Azure Services, especially ARM templates
Strong experience with TFS 2010+, VSTS, or similar ALM tool
Strong experience with PowerShell
Experience developing in a software development language (preferably C#/C++)
Knowledge of virtualization and its benefits for improving reliability
Strong experience with instrumentation, monitoring, alerting, and responding relative to performance and availability of applications
Capable of technical deep dives into infrastructure, databases, and application, specifically in designing, coding, operating, and supporting high-performance, highly available services and infrastructure
Experience in designing for failure, including disaster recovery and business continuity planning
Experience operating and supporting mission-critical applications (e.g. incident and outage management)
Passionate about making things better and driving action with a sense of urgency
Experience problem solving issues on globally distributed systems and critical product service environments
Brings new thinking to challenge existing technology and processes
Excellent at building relationships across teams
Firm sense of accountability and ownership
Understanding of the concepts and principles behind DevOps, Continuous Delivery, Agile, Lean, etc
Use of DevOps tools to deliver and operate end-user services a plus (e.g., Chef, New Relic, Puppet, etc.)

247

Site Reliability Engineer Resume Examples & Samples

Solid understanding of Software Engineering and Computer Science principles
Solid foundation in Linux administration and troubleshooting
Fluent in the English language both spoken and written

248

Site Reliability Engineer Resume Examples & Samples

Engage in the entire lifecycle of services—from inception through deployment, operation and continuous integration
Acquire expertise in Splunk and similar log management tools to establish causal relationships between issues and trends
Manage tasks in Jira from production support standpoint
Maintain services once they are live by measuring and monitoring availability, latency and overall system health
Scale systems sustainably through mechanisms like automation, and spearhead improvements that help to improve reliability and velocity
Build knowledge base around common production support issues
Timely communication of Release Notes to QA and other teams
Able to build prototypes
At least a Bachelor’s degree in CS or related fields
Solid knowledge of shell scripting and at least one scripting language

249

Senior Site Reliability Engineer Resume Examples & Samples

Excellent troubleshooting skills that span systems, network (TCP/IP), and software
Comfort working with senior management to allocate and prioritize engineering energy in support of the SRE mission
Incredible attention to detail – plan ahead before making changes
Comfort with multiple customers and high complexity network and software servers
Effective communication with solid writing skills
Ability to extract details from design documentation and engineering release notes, and to identify possible points of contention for instrumentation
Familiarity with ITIL service management concepts
ESB and JMS systems administration
Experience with working with geographically-distributed teams
Experience with Monitoring infrastructure tools such as Prometheus, Grafana, etc
Experience with Log Management tools such as ELK Stack, Splunk, MongoDB, etc
Experience with Big Data technology, Hadoop, Cassandra, is a plus
Experience with Zookeeper is a plus
Experience with Docker is a plus

250

Site Reliability Engineer Resume Examples & Samples

Web and Application Server Technologies – Understanding of the configuration and management of Apache, IIS, Nginx, Tomcat, Sendmail, Exim, ProFTPD, etc. Will implement full LAMP stack configurations
Will perform debugging analysis and provide an engineered, solution-based approach for resolution of issues
Will work with APIs, both consuming and providing
Experience with mainstream Open Source technologies
Experience with database technologies (MySQL preferred)
Intermediate experience with monitoring technologies and methodologies
Experience working with SAN and NAS technologies (iSCSI, FCP, CIFS, NFS, etc.)
Experience providing solutions/automation via code/development using industry standard languages (Python, PowerShell, C#, Ruby, etc.) and code management methodologies (Git, SVN, TFS, etc.)
Experience with RESTful APIs
