Site Reliability Engineer Resume Samples
The Guide To Resume Tailoring
Guide the recruiter to the conclusion that you are the best candidate for the site reliability engineer job. It’s actually very simple. Tailor your resume by picking relevant responsibilities from the examples below and then add your accomplishments. This way, you can position yourself in the best way to get hired.
Craft your perfect resume by picking job responsibilities written by professional recruiters
Pick from the thousands of curated job responsibilities used by the leading companies
Tailor your resume & cover letter with the wording that best fits each job you apply for
Mireille Mitchell
72011 Bauch Isle, Philadelphia, PA
Phone: +1 (555) 247 3500
Experience
Site Reliability Engineer
Mueller, Hauck and Frami, Chicago, IL
- Assist in the Development Priority List process, working with the Product Management group to address issues identified as part of Problem Management
- Provide solutions for performance management, disaster recovery, monitoring and access management
- Work/support business users to understand issues, develop root cause analysis and work with the team for the development of enhancements/fixes
- Work with the team to develop, maintain, and communicate current development schedules, timelines and development status
- Provide engineering design across different workloads including incident & problem management, change management, security and compliance
- Improve security and performance of infrastructure by working with other teams
- Work with and lead other members of the team in staying on top of key industry innovation and technology, and assist in team development growth
Senior Site Reliability Engineer
Ryan, Ziemann and Lesch, Philadelphia, PA
- Keeping the ship sailing! Monitoring and supporting the IT infrastructure environment
- Monitoring and diagnosis of systems for optimal performance
- Generating well defined and documented standard processes for the enterprise
- Queuing and data-pipeline solutions (RabbitMQ, ZeroMQ, pub/sub, SQS)
- Scripting (Bash/Python)
- Identifying, gathering, analyzing and automating responses to key performance metrics, logs, and alerts
- Engineering solutions in the long term to make everyone’s life easier
Principal Site Reliability Engineer
Rogahn-Hickle, Chicago, IL
Present
- Provide architectural and practical guidance to software development to improve resiliency, efficiency, performance, and costs
- Monitor and report on service level objectives for a given application's services. Work with business and product owners to establish key performance indicators
- Capacity planning and management – create, use, maintain a capacity model for on-prem and AWS hosting, based on E2E user flow profiles
- Work with product operations team to resolve trouble tickets, developing and running scripts, and troubleshooting services in a hosted environment
- Working knowledge of virtualized environments; VM management and provisioning
- Provide technical insight on development projects
- Assist with testing and validating production applications
Education
Bachelor’s Degree in Computer Science
Iowa State University
Skills
- 2+ years testing and supporting a highly scalable, highly available online service
- 2+ years supporting a highly scalable, highly available online service
- Excited about continually reducing complexity, and creating systems that are easily understandable, repeatable, and observable
- Strong TCP/IP understanding and ability to produce detailed documentation
- Strong demonstrable scripting knowledge of PowerShell
- Ability to define and document technical architecture of complex and highly scalable products
- Fundamental networking knowledge of TCP/IP, ARP, IP Tables, routing and working knowledge of routers, switches, firewalls/VPNs
- Basic networking knowledge, including TCP/UDP, how ARP and the routing table work, and some understanding of higher-level protocols like HTTP and DNS
- Strong personal and professional initiative with a focus on the success of the team and organization
- Good knowledge of the Windows operating system
15 Site Reliability Engineer resume templates
1
Site Reliability Engineer Resume Examples & Samples
- Manage the scalability and efficiency of the Livestream platform
- Respond to and resolve platform problems
- Solve tasks in a generic way that can be automated (so no task is ever done by hand twice)
- Participate in capacity planning and forecasting, system performance analysis, and system tuning
- Review and influence ongoing design architecture of the Livestream platform
- BS degree in Computer Science or related field. In lieu of degree, relevant skills or equivalent experience
- Familiarity with at least two of the following languages: Python, Perl, PHP, JavaScript, Erlang, Scala, Ruby
- Ability to write scripts using Shell, awk, sed
- MS degree in Computer Science or related field
- Expert knowledge of Linux kernel
- Understanding of video streaming protocols
2
Site Reliability Engineer Resume Examples & Samples
- Engages with the Account Team to ensure Critical Application Services client expectations are being fulfilled
- Respond to support requests and co-ordinate Customer support teams where appropriate
- Attend and participate in all customer service review meetings
- Identify opportunities for growth and advancement of the Service offering
- Requires 3-5 years of hands-on Linux exposure within the full LAMP stack
- Bachelor's degree in computer science or engineering related field or equivalent work experience preferred
- Familiarity with virtualization technologies
- Experience with common scripting languages (Bash, Perl, Python, Ruby)
- Experience working with mission-critical applications written using Ruby, Rails and Python (e.g., Oracle Web Commerce or AEM)
- Experience with Java is preferred
- Ability to diagnose and fix complex hardware, software, and network issues
- Familiarity with MySQL and Oracle database administration concepts and performance tuning
- Understanding of performance monitoring software such as Cacti, SNMP, Munin, sysstat, and Nimbus
- 1-2 years knowledge of network concepts such as the TCP/IP stack, load balancing, firewalls, iRules, and network routing
- Knowledge of RAID technology with regard to data integrity and I/O performance
- Knowledge of email and DNS fundamentals
- Experience with performance tuning and diagnosing system bottlenecks through root cause analysis
- Experience working in large scale distributed environments
- Ability to work independently and operate as a member of a team
- RHCE certification is strongly preferred, RHCA a plus
- Excellent problem solving abilities, coupled with a desire to take on responsibilities
- Must be detail oriented and a self starter
3
Site Reliability Engineer Resume Examples & Samples
- Support web based applications with systems administration, configuration, troubleshooting and monitoring
- Evaluate Linux systems and make recommendations to improve security, scalability, performance and availability
- Production application monitoring and support
- Should work with multiple globally distributed Product development teams and address their needs in a timely manner
- Passion to learn and explore new tools/technologies that improve the current process
Qualifications/Requirements
- 3+ years of professional experience in systems, server hardware, virtualization and networking
- Proven ability to troubleshoot system/network issues (specifically the Great Firewall of China)
- Experience and working knowledge of Cloud services, namely AWS (Amazon Web Services) and Microsoft Azure, is a big plus
- Experience in a 24x7 high availability production environment
- Experience with a programming/scripting language (Shell, Perl or other)
- Strong knowledge of Linux (CentOS/Ubuntu) and the LAMP stack
- Good knowledge of the Windows operating system
- Good knowledge of virtual networking, VPN and CDN (Content Delivery Network)
- Fundamental networking knowledge of TCP/IP, ARP, IP Tables, routing and working knowledge of routers, switches, firewalls/VPNs
4
Senior Site Reliability Engineer Resume Examples & Samples
- Web application performance baselining, analysis, tuning, capacity planning and demand forecasting
- Assist with development and implementation of DevOps SRE solutions for large scale distributed web applications across multiple tiers and data centers
- Work closely with development and QA teams on new and ongoing technology projects related to performance, high availability and scalability including load-based dynamic provisioning and de-provisioning of systems, etc
- Perform proactive daily system monitoring including reviewing system and application logs as well as responding to, triaging, troubleshooting and remediating incidents
- Coordinate with our enterprise operations team to communicate with impacted stakeholders and clients, escalating where appropriate
- Review entire environment and execute initiatives to reduce failures, defects and improve overall performance
- Design, develop and execute automated tests to validate solutions and environments
- Document current and future configuration, processes and policies
- Availability for On-call after-hours support
- Up to 10% travel may be required
- 5+ years experience working with large scale distributed web applications across multiple environments
- Software Development background in .NET or J2EE stack
- Strong working knowledge of Windows and/or Linux operating systems, their underlying components, system statistics, performance tuning, file systems and I/O
- Experience in one or more of the following languages: Powershell, Bash, Python or Perl
- Solid understanding of operational principles in capacity planning, monitoring and incident handling
- Must have understanding of building and managing large-scale systems and application architectures
- Experience with VMware, ESX, etc
- Experience with syslog-ng, ELMAH, Splunk, ELK or similar log monitoring and analytics solutions
- Experience supporting Windows, IIS7.x and .NET 3.5/4.x applications in a production environment
- Experience in scalability, optimization and performance analysis
- Experience with SNMP/MIB and REST
- Experience building performance test environments a plus
- Ability to work in a collaborative team oriented Agile/SCRUM environment
5
Site Reliability Engineer Resume Examples & Samples
- Build monitoring and automation tools
- Automate system capacity, uptime and other system related reports
- Develop dashboards for our network and infrastructure
- Gain expert level knowledge of our applications and services
- Perform deep-dive analysis on application issues
- Participate in a weekly on-call rotation
- 4+ years of experience as a Site Reliability or Operations Engineer
- Experience with high-volume, low-latency systems
- Experience supporting a variety of applications on Linux operating systems (CentOS is a plus; LXC is a strong plus)
- Proficiency with load balancing solutions, both open-source and commercial
- Self-starter with a can do attitude
6
Site Reliability Engineer Resume Examples & Samples
- Experience operating and supporting large-scale Internet hosted applications
- Hands on system administration experience with Linux-based systems (CentOS, RHEL), storage systems (SAN and NAS), load balancers and virtualized environments (JVM, VMware, vSphere, Amazon AWS)
- Experience with custom tool development, research of tools and deployment of solutions in support of internet hosted applications and environments
- Familiarity with hosted application service provider environments, including remote administration of servers and devices
- Excellent written and verbal communication skills, demonstrating the ability to effectively convey technical information to both technical and non-technical audiences
- Excellent information management practices, such as thorough documentation, usage of wikis, blogs and other collaboration tools
- Familiarity with service delivery and project management principles
- Experience supporting large-scale SaaS based applications and databases
- Familiarity with agile software development processes including software builds and source code control
- Experience with administration of NoSQL technology is a plus (e.g. MongoDB, Cassandra, Hadoop/Hbase, etc.)
7
Senior Site Reliability Engineer Resume Examples & Samples
- LAMP stack support
- Manage our virtual infrastructure
- Production support of public cloud
- Application management
- Dealing with developers on application issues
- On-call support
- SaltStack
- BIND
- RedHat KVM
- Jetty/JBoss
- RedHat OpenShift
8
Site Reliability Engineer Resume Examples & Samples
- Analyze data to deeply understand operational characteristics of our systems in production
- Develop and maintain tools as appropriate and champion them through the organization in order to improve productivity
- Responsible for technical troubleshooting of high priority escalations, getting to the root cause and helping drive the permanent resolution
- Work across functional teams to ensure improvements to system/unit functional specifications are pro-actively picked up
- Proactive communication skills and effectively communicates difficult messages consistent with management direction
- Effectively resolves conflict
- Excellent knowledge and experience of Linux systems administration (preferably RedHat/CentOS)
- Designing configuration management and automation tools (Puppet/Chef/Func)
- Working in large deployment environments
- Good understanding and experience with scripting (shell, perl, ruby, python)
- Experience with internal tools (spacewalk, Graphite, rsyslog, logstash, elasticsearch etc.)
- Designing and configuring patch management systems
- Good understanding of networking, including TCP and DNS
- Excellent problem solving and analytical skills; experience of troubleshooting using packet captures and root cause analysis
- Ability to build on and use open-source tools/projects
- Thorough knowledge of the HTTP protocol
- Experience in real-time traffic processing (firewalls, port forwarding, etc.)
- Prior employment within a web hosting, internet service provider or Software as a Service company
- Experience with OpenStack or EC2
- Experience of supporting Java application servers
- Experience with Agile Scrum development environments
- Understanding of software development methodologies and some background in Java
- Experience of Virtualization technologies
- Experience of configuring automated monitoring systems
- Understanding of HTTP proxy servers is highly desirable
9
Site Reliability Engineer Resume Examples & Samples
- Review and influence new and evolving design, architecture, standards, and methods for operating services and systems
- Identify gaps in current tooling and areas for improvement, working towards delivering the tools and, where required, engaging with other Scrum teams to ensure delivery of improvements
- Identify business needs and if needed look at new technologies and the relevance for the use case of the technologies to meet those business needs
- Assist in the Development Priority List process, working with the Product Management group to address issues identified as part of Problem Management
- Diagnose bottlenecks for the full stack and provide recommendations to overcome the bottlenecks as an interim work around, while a long-term solution is investigated
- Ensure all monitoring requirements are met and carry out periodic reviews of checks currently in place to ensure service meets or exceeds customer expectations
- Proactively review and recommend changes to the live infrastructure after ensuring the right validation has been carried out
- If needed prepare and deploy Support releases
- Updating and maintaining Compatibility matrix, providing guidelines to Scrum Teams on Test Scenarios to be included in testing
- Small feature implementation for new projects
- Perform periodic on-call duty as part of a global team
- Strong working knowledge of networking, packet tracing, understanding latency and throughput
- Strong working knowledge of
- Java or C/C++ development experience including solid scripting skills in Ruby, Perl or Python
10
Site Reliability Engineer Resume Examples & Samples
- Experience operating and supporting large-scale Internet hosted applications
- Hands on system administration experience in Microsoft Windows (Server 2008/2012), Microsoft SQL Server (2008/2012)
- Strong demonstrable scripting knowledge of PowerShell
- Experience with Group Policies, Active Directory and DNS
- Familiarity with storage systems (SAN and NAS), load balancers and virtualized environments (VMware, vSphere, Amazon AWS)
- Experience with ESX and vSphere
- Experience with Automation and Configuration management tools
- Strong experience with monitoring solutions such as Splunk, Nagios, etc.
- Ability to prioritize tasks and work independently
11
Site Reliability Engineer Resume Examples & Samples
- 3+ years in various Operation roles, including experience administering Linux systems in a production environment
- Solid understanding of system performance and monitoring
- Experience with the VMware IaaS/CIM layer (vSphere, vCenter, vCloud Director, vROps and Hyperic)
- Experience in ITIL Best Practices including problem, incident and change management
- Experience in one or more of the following languages: Java, Shell, Python, or Ruby
- BS or MS degree in Computer Science, or a related field
- Good working knowledge of build automation and continuous integration/delivery processes and tools: Git, Gerrit, Maven/Gradle, Jenkins, Docker, Nexus, Artifactory, Selenium
- Experience with at least one of the VMware vRealize Suite of products: vRealize Operations, vRealize Automation, vRealize Business
- Experience with enterprise monitoring solutions: vRealize Operations, vRealize Hyperic, vRealize Log Insight, Nagios
12
Senior Site Reliability Engineer Resume Examples & Samples
- Experience deploying large-scale Internet hosted applications including deployment automation concepts
- Hands on system administration experience in Linux-based platforms, storage systems (SAN and NAS), load balancers and virtualized environments (VMware, Amazon AWS)
- Demonstrable technology experience with administration of Mongo Database. Familiarity with basic Oracle concepts is a plus
- Experience with networking concepts, protocols and technologies
- Strong experience with designing, deploying and maintaining monitoring solutions such as Splunk, Nagios, Cacti, etc
- Experience with one or more development or scripting languages suited for system administration and automation, such as Ruby, Python, Perl, PHP, Java/JavaScript, Shell
- Excellent written and verbal communication skills, demonstrating the ability to effectively convey technical information to both technical and non-technical audiences
- BA/BS or higher in a technical field
- Experience supporting large-scale SaaS based applications and databases
- Experience with networking technologies such as TCP/IP, DHCP, TFTP, VLAN, QoS and VoIP
- Experience creating and maintaining automated server deployment scripts using tools such as Chef or Puppet
13
HBO Site Reliability Engineer Lead Resume Examples & Samples
- Troubleshoot issues across the entire stack: hardware, software, application and network. Physical hardware and cloud-based environments
- Drive standardization efforts across multiple disciplines and services
- Participate in a 24x7 on-call rotation
- Ability to effectively communicate with all levels of management and all stakeholders
14
Senior Site Reliability Engineer Resume Examples & Samples
- Leadership role in all Site Uptime projects from problem recognition to prioritization of work, design and implementation of solutions
- Focus specifically on all externally visible issues
- Focus on our Outage Response and Recovery processes and tooling
- Focus on Application Error Monitoring and Reporting
- Focus on Deployment and Rollback success and speed
- Integrate with existing systems and tools and rip and replace where needed
- Identify and address application and system performance bottlenecks in a high throughput production environment
- A solid understanding of all server infrastructure technologies with production operations experience
- Experience with scale issues and large infrastructures
- 6+ years of industry experience
15
Senior Site Reliability Engineer Resume Examples & Samples
- Strong working knowledge of networking including packet tracing, packet captures and diagnosis, understanding latency and throughput
- Strong working knowledge of Linux and Windows operating systems, their underlying components, system statistics, performance tuning, file systems and I/O
- Prior experience managing production systems and assets
- Experience in handling traffic spikes, production outages, root cause analysis
- Experience with production deployment, monitoring and operational support for Enterprise class hardware
- Experience in performance diagnostics, capacity planning, performance architecture design, performance tuning, performance monitoring
- 6+ years of experience supporting production infrastructure
- 8+ years of experience supporting production infrastructure
- 2+ years hands on experience with VMware vSphere v5.x and above
16
Senior Site Reliability Engineer Resume Examples & Samples
- Network and system infrastructures
- Scripting languages (Python, Bash, Perl, etc.)
- Automation of infrastructure and deployment (Ansible, Chef or Puppet)
- Databases (Performance Analysis, High Availability, Scalability, SQL/NoSQL)
17
Site Reliability Engineer Resume Examples & Samples
- Designing, operating and troubleshooting large-scale, highly available distributed systems
- Building tools and automation to help with provisioning, monitoring, debugging systems
- Programming in JVM languages like Java/Scala, C/C++, Go, scripting in Perl, Python or Ruby
- Solid OS internals including concurrency, memory management, file systems, networking, system calls
- Network technologies like TCP, UDP, HTTP, DNS, ICMP
- Team player, take pride in your craft, get things done
18
Senior Site Reliability Engineer Resume Examples & Samples
- Perform performance analysis, proactive troubleshooting, continual improvement and capacity planning of production and pre-production, virtualized environments
- Review entire environment and execute initiatives to reduce failures and defects and improve overall performance
- Leverage automation framework to improve processes, automate deployment, and improve manageability of environment
- Design, develop and execute automated tests to validate solutions and environments
- Design and implement next generation internal Cloud platforms
- Expand and evolve Production cloud Infrastructure in a scalable, highly available, and distributed fashion
- Perform requirements gathering and analysis to determine appropriate solutions
- Conceptualize, develop, and evaluate different architectures based on user and application requirements
- Design, implementation and maintenance of automation and configuration management for all levels of Production cloud Infrastructure
- Refine and enhance processes that allow for proper testing and validation of releases that go from development to staging to production environments
- Participate in programs to deploy pre-GA VMware products in production and provide direct feedback for product improvement
- Be able to keep a cool head under “fire” and take part in a shared weekly SME rotation
- Effectively interact with other R&D & IT organizations in order to accomplish shared projects successfully
- Work with other experienced Systems Administrators on maintenance of environment
- Provide feedback and recommendations for design and implementation of solutions
- Prepare detailed build / test plans to implement new technologies/configurations
- Strictly follow the change management and incident management processes when working with the production environment
- Perform tasks that are often unstructured and address issues that are less defined requiring new perspectives and creative approaches
- Work on other technical projects as required
19
Site Reliability Engineer Resume Examples & Samples
- At least 6 years of professional experience
- 2+ years supporting a highly scalable, highly available online service
- Active Directory Experience
- IIS Experience
- Experience with at least one of the following languages: C/C++, C# or Java
- PowerShell Experience
- Experience as part of a 24x7 on-call escalation path
- Live Site / Customer Support Experience
- Experience managing SSL/TLS certificates
- Virtualization Experience
- Azure Experience
- Excellent problem-solving and debugging skills with a solid understanding of testing practices
- Excellent cross group collaboration skills
- Familiarity and passion for agile/lean development and execution methods, including Scrum and Kanban
- Experience developing services using the DevOps model
- Can do / positive attitude
20
Senior Site Reliability Engineer Resume Examples & Samples
- Experience managing Linux Systems and advanced troubleshooting techniques
- Experience with VMware enterprise products such as ESXi, Virtual Center and vCloud Director
- Experience operating networking layer 2 and 3 routers and switches
- Strong working knowledge of Linux operating systems, their underlying components, system statistics, performance tuning, file systems and I/O
- Experience with production deployment, monitoring and operational support for Enterprise class applications
- Experience with load balancers and firewalls
- 2+ years hands on experience with VMware vSphere v4.x and above
- Knowledge of data storage protocols including CIFS, FC, FCoE, iSCSI, and NFS
- Experience with iptables, tcpdump/Wireshark, etc.
- Experience with Perl/Python/PowerCLI a plus
- Knowledge of Windows operating systems a plus
21
Site Reliability Engineer Resume Examples & Samples
- Automating application configurations with Chef/Scalr
- Performance tuning (HP Performance Center, open source)
- On-boarding new technologies / Integrating into automation
- Platform SLA monitoring and enforcement
- Cost Analysis across multiple clouds (Amazon Web Services, Google Compute Engine, Microsoft Azure, OpenStack)
- 4+ years experience in Linux systems administration
- Experience deploying to AWS or other clouds
- Experience with auto-scaling and the architecture of stateless applications
- Experience with Chef or other configuration management tools
22
Site Reliability Engineer Resume Examples & Samples
- 3+ years production environment experience
- Demonstrated tool building capability
- Grace under fire and willingness to help troubleshoot to keep our services up and running, in a 24x7 on-call rotation
- Positive attitude, and a self-directed work ethic
23
Site Reliability Engineer Resume Examples & Samples
- Work closely with developers in supporting new features and services
- Troubleshoot site issues
- Develop custom tools as necessary
- Document system design and procedures
- Participate in light on-call rotation
24
Site Reliability Engineer Resume Examples & Samples
- Work in the Operations team to design and maintain bulletproof systems, with security baked in from the start
- Model threats, uncover and fix vulnerabilities
- Automate security testing of production systems
- Integrate security information into our monitoring and metrics systems
- Respond to security incidents
- Advise other teams on secure design of their systems
25
New Grad Site Reliability Engineer Resume Examples & Samples
- Perform performance analysis, proactive troubleshooting, continual improvement and capacity planning for production, virtualized environment
- Develop tools that support daily operations, including monitoring tools and a CMDB
- Perform deployment functions to ensure releases follow proper implementation lifecycle
- Refine and enhance processes that allow for proper testing and validation of releases that go from development to staging to production environments
- Work with experienced System Administrators on maintenance of environment
- Perform troubleshooting analysis and implement fixes to ensure availability SLAs are met
- Programming experience using one or more structured languages (Python/Java)
- Ability to use scripting languages to automate tasks and gather data
- Understanding of development process and intermediate knowledge of product development
- Familiar with CMDB discovery methods
- Excellent oral and written communication skills; including documentation
- Ability to perform well in a dynamic environment, with on-time delivery
- Ability to follow and adhere to policies, procedures and standards relating to Systems management. May recommend process improvements
- Require limited supervision and direction; drive results and set priorities independently
- Able to work a 24x7 on-call rotation schedule
- Hands on experience with VMware’s vCloud Suite or products
- VMware Certification, VCP, VCAP
- Knowledge of database technologies, MSSQL, PSQL
- Experience coding to product APIs
- Knowledge of data storage protocols including CIFS, FC and NFS
- Knowledge of TCP/IP networking, DNS, LDAP, SMTP, Linux Account Management
- Knowledge of Red Hat, CentOS, Ubuntu, and Windows operating systems
- Hands on operational experience in a high-volume or critical production service environment
- IP networking, including familiarity with the functionality, operation, and failure modes of networks
- Proven technical troubleshooting and performance tuning experience
26
Site Reliability Engineer Resume Examples & Samples
- Create and improve tools to aid in monitoring and control of our systems
- Enhance our infrastructure to support a variety of different disaster recovery options
- Build out a new data center in Dublin
- Providing general Unix server administration and troubleshooting for an environment covering over 2000 hosts, both physical and Xen/VMware virtualized
- Building new server images, deploying and migrating production systems, and tuning configurations to improve application performance
- Converting one-off and stand-alone systems to use our standard templates and management framework
- Troubleshoot and resolve technical issues that affect the production environment
- Improve security and performance of infrastructure by working with other teams
- Performing all other related duties as assigned by manager
- Experience managing an internal Linux distribution in a large scale environment
- Experience working with Unix automation tools, such as Puppet, Chef, and/or Capistrano
- Knowledge of web servers (Apache/Nginx)
- Familiarity with load balancing tools and techniques
- Experience working in virtualized environments, such as Xen, VMware, VirtualBox
- Familiarity with IP and Ethernet networks as well as transport protocols such as Email, FTP, and HTTP
- Scripting ability a plus, particularly Bash, PHP, Python or Ruby
27
Site Reliability Engineer Resume Examples & Samples
- Engineering Rigor
- Quality and standards
- EDUCATION and/or EXPERIENCE
- Desirable Experience on the following areas
28
HBO Site Reliability Engineer Resume Examples & Samples
- Identify and drive opportunities to improve automation for the company
- Manage timely resolution of all critical and/or complex problems meeting SLA requirements
- Develop, configure and optimize service and application monitoring and telemetry
29
Senior Site Reliability Engineer Resume Examples & Samples
- Hands on system administration experience in Microsoft Windows (Server 2003/2008, IIS Manager), Microsoft SQL Server (2005/2008), storage systems (SAN and NAS), load balancers and virtualized environments (JVM, VMware, vSphere, Amazon AWS)
- Demonstrable technology experience with administration of Microsoft SQL Server 2008 databases
- Strong experience with designing, deploying and maintaining monitoring solutions such as Splunk, Nagios, Zabbix, Cacti, SCOM, etc
- Familiarity with Linux based operating systems
- Experience creating and maintaining automated server deployment scripts using tools such as Salt, Chef or Puppet
- Base knowledge of the W3C’s Web Content Accessibility Guidelines v2.0
30
Site Reliability Engineer Resume Examples & Samples
- Perform deep dives into both systemic and latent reliability issues; partner with software and systems engineers across the organization to produce and roll out fixes
- Identify and drive opportunities to improve automation for the company; scope and create automation for deployment, management and visibility of our services
- Participate in code reviews for projects primarily written in Java and Scala, built on open source libraries such as Finagle, and running on both physical and virtualized platforms
- Represent the SRE organization in design reviews and operational readiness exercises for new and existing services
- Practical experience in Java or Scala
31
Site Reliability Engineer Resume Examples & Samples
- Work in engineering team to design, build, and maintain systems
- Write scripts to monitor and automate processes
- Take part in a 24x7 on-call rotation
- Participate in code reviews for projects written in Scala built on open source libraries such as Finagle
- 2+ years industry experience as a software engineer
- 3+ years of experience in Internet scale Unix environments
- Demonstrable knowledge of TCP/IP, HTTP, web application security, and experience in multi-tier web application architectures
- Hands-on experience in building event driven backend systems on JVM with Java or Scala
- Ability to lead technical teams through designs and implementations across an organization
- Practical knowledge of shell scripting and at least one scripting language (Python, Ruby, Perl)
- Experience with existing open source projects such as Mesos, Hadoop, Scribe, Zookeeper, etc
32
Senior Site Reliability Engineer Resume Examples & Samples
- Understand how the different core infrastructure systems come together to enable provisioning engineering at Twitter, and help keep all other infrastructure running
- Interface with customers and partners in engineering to gather feedback, requirements on the overall objectives of CISS, aligned with the mission
- Ensure reliability of the existing core infrastructure systems, to guarantee 99.99% uptime while maintaining SLAs to guarantee low latencies across the systems
- Provide project and technical leadership to the team, keeping in mind the above responsibilities
- Practical experience in C / C++ / Python / Ruby / Java / Scala
- Ability to lead technical teams through design and implementation across an organization
- Experience with existing open source projects such as Scribe, ZooKeeper, and Apache Mesos
33
Site Reliability Engineer Resume Examples & Samples
- Build tools to monitor and automate operational processes
- Understand the application stack and perform troubleshooting across the whole stack
- Work closely with engineering and operation groups to design, build and maintain systems, application and infrastructure
- Work in a team-oriented environment
- Take part in a 24x7 on-call rotation
- Currently pursuing a Bachelor’s, Master’s, or PhD in Computer Science or equivalent field
- Extensive system administration coursework
- Knowledge of Internet fundamentals (e.g. RFCs, DNS architecture, best practices, security, etc.)
- Knowledge of TCP/IP, HTTP, security, storage and databases
- Interest in distributed systems and high availability architecture
- Programming skills in one or more of: C, Java, Python, Ruby
- Internship or other experience administering large-scale UNIX installations
- Practical knowledge and experience with Linux/Unix/BSD; demonstrated involvement in systems operations
- Previous success in a performance-critical environment is a plus
34
Site Reliability Engineer Resume Examples & Samples
- Work in engineering team to design, build, and maintain web and RPC services
- Develop automation to improve reliability and ease of deployment
- Hands-on experience with event driven backend systems on JVM with Java or Scala
- B.S. in computer science or equivalent
35
Site Reliability Engineer Resume Examples & Samples
- Build automation and tooling in Python
- Perform deep dives into partner teams' infrastructure
- Advanced practical knowledge of Python or Ruby
- Experience with configuration management tools such as Puppet, Chef, or Ansible
36
Site Reliability Engineer Resume Examples & Samples
- Infrastructure Automation using Chef
- Multi Cloud API integration with Scalr
- Work in a world where code is configuration
- Analyze Cost of applications across multiple clouds / Work with Dev teams to right size applications based on those analyses
- Compare costs across all Clouds, this includes both public and private clouds
- Manage AWS Reserved Instance program / Work with finance team for best optimization
- On-boarding new technologies
- 20% - Chef Development
- 20% - Scalr integration
- 10% - New Technology Evaluation
- 40% - Cost Optimization
- 10% - Documentation
37
Site Reliability Engineer Resume Examples & Samples
- Champion site security, reliability and robustness while balancing developer requirements
- Be effective in a distributed team through strong communication
- Improve and maintain processes for systems infrastructure and app deployments
- Be responsible for uptime, performance, and quality of service and participate in on-call schedule as first line of defence
- 2+ years’ experience as a Site Reliability Engineer required
- Passion for running great operations
38
Site Reliability Engineer Resume Examples & Samples
- Manage site security, reliability & robustness while balancing developer requirements
- Improve and maintain processes for their systems infrastructure & app deployments
- Responsibility for the uptime, performance and quality of service
- 2+ years experience specialising in site reliability
- Linux OS experience
- Python, Django, PHP
- Docker, LXC
- Experience with monitoring & alerting systems
- Previous experience in a start-up environment would be an advantage
39
Site Reliability Engineer Resume Examples & Samples
- Combine software development, networking and systems engineering expertise to help engineering teams build and run Nordstrom.com
- Help us define, transition to, and grow a Site Reliability Engineering model
- Work closely with engineering teams with a demonstrated commitment to their success
- Represent Nordstrom in the technical community, which can include contributing to open source
- Demonstrate a passion and commitment towards advancing the use of public and hybrid cloud technologies
- Participate in 24x7 on-call rotation
- Investigates and recommends approaches & systems that meet quality, performance and sustainability criteria
- Experience with Amazon Web Services (AWS) and APIs
- Experience with Git or other source control
- Experience with automation tooling such as Chef, Docker, AWS, etc
- Some engineering development experience, preferably in Java, Node.js, or .NET
- Understanding of software development methodologies and practices, including agile or lean development, continuous integration and continuous delivery
- General awesomeness, positive attitude and passion trump all other requirements
40
Site Reliability Engineer Resume Examples & Samples
- Own, as part of a small cross functional squad, a particular infrastructure problem space at Spotify. For example bare metal provisioning, cloud provisioning, monitoring, networking, storage, or service containerization
- Design and document systems, including writing and reviewing code, to automate away problems within your squad’s domain
- Undertake measured, methodical, troubleshooting of complicated systems under pressure
- Partake in an on-call rotation alongside the engineers who build our production backends
41
Site Reliability Engineer Resume Examples & Samples
- Own, as part of a small cross functional squad, a particular infrastructure problem space at Spotify: ensuring Spotify has a secure, reliable, highly available perimeter and service discovery infrastructure
- Design, implement and drive internal adoption of infrastructure perimeter and service discovery products
- Architect, design and document systems, including writing and reviewing code
- Meet availability SLAs for the services your squad owns, contributing directly to Spotify global availability
42
Site Reliability Engineer NYT Beta Resume Examples & Samples
- Proficiency in at least one programming language and willingness to learn Go (our primary language); experience with Ruby is a plus
- Designing pragmatic systems with an eye for performance, reliability and security
- Linux environment and fundamentals
- Strong understanding of AWS products and services
- Designing and/or operating web/mobile stacks at scale
43
Site Reliability Engineer Resume Examples & Samples
- Management and support of consumer websites including infrastructure design and deployment on platforms such as Linux and Windows using Apache, IIS, PHP, .NET, MySQL, and SQL Server
- Monitor and ensure regular backups for the consumer websites
- Provide guidance and assistance to the third party development teams
- Develop and maintain effective working relationships with peers, teams, and other internal or external people critical to business functionality
- Management and troubleshooting of video encoding and streaming platforms including Roku
- Develop scripts, utilities, and web applications to support daily business unit processes
- Responsible for the resolution of incident services calls and requests providing 24/7 support
- Assist with level 2 and 3 support coordinating and assisting other IT departments to resolve issues
- Create process or troubleshooting documentation in the support knowledge base
- Minimum 5 years technical experience
- Undergraduate degree in computer science or related field or equivalent work experience
- Experienced in translating business requirements into technical specifications and in developing prototypes and in launching pilot tests
- Must have strong interpersonal and organization skills and be able to communicate clearly with all levels of the business including personnel, technical management, customers and/or external vendors
- Strong work ethic and be able to work in a team environment and work independently
- Excellent problem-solving and analytical skills including the ability to quickly identify, analyze, and resolve issues and system failures
- Ability to analyze project requirements, draft project plans, milestones and delivery schedules
- In depth knowledge and experience with video encoding and streaming of live and on demand content (Expressions, Silverlight, Smooth Streaming, HLS)
- Experience in Perl, PowerShell, Linux Bash, Visual Basic, and batch file scripting
- Experience with .NET Framework, VB.NET, C#, XML, SOAP, SQL
- Experience with HTML, ASP.NET, PHP, JavaScript, Apache and IIS Web Server
- In depth knowledge of current technology
- Linux and Windows server platforms
- Database Fundamentals (SQL Server, MySQL)
- Network Fundamentals (Network Topology, LAN/WAN, TCP/IP)
- Virtualized Environments (VMware)
- Windows and Apple desktop platforms
44
Senior Site Reliability Engineer Resume Examples & Samples
- Contribute to the architecting and implementation of scalable architecture used to serve billions of video views per month
- 24x7 Second Tier On-Call Rotation
- Mastery in debugging Linux/Unix applications (4+ yrs), networking, and relevant systems
- Proficiency in two or more development languages (Bash, Clojure, Go, Java, JavaScript, Python, Ruby, etc)
- Experience implementing large-scale monitoring and alerting infrastructure
- 2+ Years Troubleshooting and automation of AWS or comparable systems
- Experience with modern infrastructure concerns such as: CDNs, Containers, GeoDNS, Service Discovery, etc
- Keen understanding and appreciation for security concerns
- Global-scale performance monitoring and improvement
- Continuous delivery for large engineering efforts
- Any distributed database or systems experience
- Large-scale container deployments
45
Site Reliability Engineer Resume Examples & Samples
- Engender reliability and availability starting with metrics and measurements
- Enable scaling by providing tools, developing training or augmenting processes
- 2+ years in an operations role, or closely related position
- Relocation assistance available for qualified applicants
- Displays knowledge of, and ability to apply, process design and
46
Principal Site Reliability Engineer Resume Examples & Samples
- Secure the system from issues, be they real, perceived or notional
- Experience with configuration management tools such as Ansible, CFEngine, Chef and Puppet
- Utilizes time management and project management skills to lead the
- 10+ years in a software development role, operations role, or closely related position
- Bachelor's Degree in Computer Science or a related field, or relevant work experience
- Experience with distributed version control like Git or Mercurial
- Experience with enterprise monitoring solutions like AppDynamics, Graphite, Nagios, and Splunk
- Familiarity with continuous integration/deployment processes and tools such as Artifactory, Gerrit, Git, Jenkins, Maven and Nexus
47
Senior Site Reliability Engineer Resume Examples & Samples
- Lead technical activities running proof of concepts and user stories from conception to delivery
- Proven design skills across database/storage or networks
- Hunger to get involved in every part of our system; from the earliest stage of product architecture, design and development to deployment, troubleshooting, and performance analysis – to ensure a reliable quality product
- Strong collaboration and communication skills with ability to influence across an organization from Architects, Developers to Managers
- Perform RedHat Linux, WebSphere, DB2: configuration, installs, automation, recovery, backup, monitoring
- Participate in periodic on-call rotation
48
Site Reliability Engineer Resume Examples & Samples
- Manage and maintain all production systems at Vevo
- Contribute to the improvement of Vevo's infrastructure tooling and monitoring
- Work closely with all engineering teams to maintain and deliver large-scale applications
- 24x7 First Tier On-Call Rotation
- Proficiency in debugging Linux/Unix applications (2+ yrs), networking, and relevant systems
- Proficiency in one or more development languages (Bash, Clojure, Go, Java, JavaScript, Python, Ruby, etc)
- 1+ Years Troubleshooting and automation of AWS or comparable systems
- Good understanding of IP Networking
- Experience with key technologies used as building blocks for modern applications: DNS, HTTP, Load Balancing, etc
- Understanding and excitement for modern security concerns
- Experience working with continuous delivery pipelines
- Docker or other containerized deployment methodologies
49
Site Reliability Engineer Resume Examples & Samples
- Maintaining and supporting all areas of our platform's architecture, ensuring that everything we do is super reliable and super fast, all the time
- Enabling Lyst’s engineering team to be agile and to make feature changes every day
- Supporting a continuous integration environment and production deployment run entirely in Docker
- An internal PaaS implementation
- Custom Chef configuration
- An entirely AWS-based environment, using multiple AWS services alongside EC2
- Creating a development platform, tools and pipeline that is effective and easy to iterate with
- The huge challenges that come from working with a website with over 2 million monthly users, 9 million products and a spider architecture performing 3.8 million updates a day
- Fixing the interesting problems we face in the best way possible; we are not constricted to tool sets and languages. If you find a solution to a problem that will work better, we use your idea. Best idea wins
- Building your own brand and skills. Lyst is a company that will encourage and support you to get involved in the wider community. Events like FOSDEM, PyCon, JS London, AWS re:Invent are regular occurrences on our calendar
- Creating better solutions for everything we do, from our culture to the code. This is your company, own it
- The monitoring of performance metrics such as response and error rates on the Lyst platform
- Proposing and implementing performance improvements, anything from one-line fixes to rearchitecting a component into a microservice
- Capacity planning for new features, and ensuring they’re written with scalability in mind
- Evaluating supporting services for new features: would PostgreSQL perform better than Cassandra? Is a custom solution a better fit than using Elasticsearch?
50
Site Reliability Engineer Resume Examples & Samples
- Vulnerability Management and Threat Intelligence
- Influence feature design, architecture, standards & processes to ensure Security
- Conduct advanced network security forensics
- Assessment and recommendation of Web Application Security
- Proven experience as a team player working with devops groups to continuously improve security posture
- Working knowledge of industry standard tools and systems related to penetration testing and forensics
- Able to articulate and visually present attack and mitigation strategies and concepts
- 7+ years’ experience providing security insight and solutions in large scale environments
- Strong analytical and troubleshooting capabilities
51
Site Reliability Engineer Resume Examples & Samples
- Works with other members of cross-functional teams, joint ventures, third party vendors and Company's Product Managers and Marketing teams to deliver quality products, in a timely fashion, that meet defined requirements. Establishes and maintains working relationships within NE&TO, Product Development teams, joint ventures, vendors and contractors
- Ensures that projects are properly accepted into the engineering team, worked on in a timely and efficient manner and smoothly transitioned into Quality Assurance and Operations teams
- Participates in the review of failures and provides feedback to
- Strong understanding of REST web services
- Experience with the twelve-factor web app methodology
- Experience with Cloud based systems, deployment methods and technologies
- Experience with scripting language(s) (Go, Python, Perl or others)
52
Site Reliability Engineer Resume Examples & Samples
- Participate in on-call rotation duties
- Experience managing complex projects, with significant bottom‐line impact
- 7+ years’ experience in large scale internet service design & implementation
53
Site Reliability Engineer / E-commerce Resume Examples & Samples
- Build tooling to allow the self-service of AWS offerings by engineering teams in a secure, reliable, friction free manner
- Develop foundational capabilities to feed near real-time business and technical metrics into a modernized Live-Site Operations Center
- Future participation in 24x7 on-call rotation
- Designs systems, services and components that meet required levels of quality and performance sustainability
- Experience with automation languages like Ruby, PowerShell or Unix, etc
54
Senior Site Reliability Engineer Resume Examples & Samples
- Influence feature design, architecture, standards & processes to ensure Security, Performance, Operability & Scale
- Experience with Anycast, Global Load Balancing, and CDNs
- Ability to manage multiple priorities, commitments & projects
- Demonstrated passion for customer experience & usability, including successful delivery of customer self‐service tools & automated management/optimization of services
55
Site Reliability Engineer Resume Examples & Samples
- Create solutions to improve performance, scalability, and reliability
- 3+ years of hands on systems administration experience on Linux or Windows platforms
- Scripting skills in at least two languages (Bash, Python, Ruby, Perl, PowerShell, etc.)
- Experience with configuration management tools, such as Puppet
- Working knowledge of NoSQL data stores (MongoDB, Cassandra, Couchbase, etc.)
56
Senior Site Reliability Engineer Resume Examples & Samples
- Maintain, build, implement and design telemetry systems and dashboards
- Troubleshoot complex issues and teach others how to use toolsets
- Ability to automate tasks using scripting or other programming language
- Demonstrated expertise in web services, virtualization, cloud concepts, REST, JSON, YAML, XML, SQL, PHP, LDAP, & object oriented methodologies
57
Site Reliability Engineer Resume Examples & Samples
- Participate in technical proof of concepts from conception to delivery
- Be hungry to get involved in every part of our system — from the earliest stage of product architecture, design and development to deployment, troubleshooting, and performance analysis – to ensure a reliable quality product in production
- Be able to collaborate and communicate clearly on status and progress
- Design and build tools to manage a rapidly growing number of servers and services
- Perform general OS, Web/Application server, database configuration, installs, automation,
58
Site Reliability Engineer Resume Examples & Samples
- Work closely with product engineering teams to help design systems for performance, fault tolerance, and scalability
- Develop the tools and training needed for product engineers to assume operational responsibility for their own software
- Monitor and audit production application stacks for opportunities to improve performance and capacity utilization
- Troubleshoot, isolate and fix production issues along with product engineers and help prevent them from happening again
- Operating and debugging production systems
- Designing and implementing infrastructure, deployment, monitoring, and logging tools
- Linux environment and fundamentals
- Designing web and mobile stacks at scale
59
Senior Site Reliability Engineer Resume Examples & Samples
- Root-cause complex problems involving multiple parties, networks, hardware and software that relate to scaling and performance
- Bachelor's Degree or Equivalent
- Engineering, Computer Science
- Generally requires 7-11 years related experience
- Programming experience in one or more of the following languages: Go, Java, Python, Ruby, Shell
- Experience with distributed version control systems like Git or Mercurial
- Experience with IaaS and PaaS providers such as AWS, OpenStack, Heroku, and CloudFoundry
- Experience with enterprise monitoring solutions like InfluxDB, AppDynamics, Graphite, Grafana, Nagios, and Splunk
60
Senior Site Reliability Engineer Resume Examples & Samples
- Hands on system administration experience in Linux-based platforms, storage systems, load balancers and virtualized environments (Amazon AWS)
- Demonstrable technology experience with administration of relational (MySQL) and NoSQL databases
- Strong experience with designing, deploying and maintaining monitoring solutions such as Splunk, New Relic, Datadog, etc
- The following requirements are not strictly required, but are preferred
- Experience with TCP, CIDR, RFC 1918, sub-netting, DNS, BGP
- Experience creating and maintaining automated server deployment scripts using tools such as Chef, Salt, or Puppet
61
Site Reliability Engineer Resume Examples & Samples
- Coordinate ongoing projects between Development & Service Engineering
- Drive release readiness from Operational and Infrastructure perspective
- Participate in Design reviews and clearly document operational requirements
- Eliminate duplicate efforts by bringing teams together and ensuring recycling of good process, technology and tooling
- Build requirements for tools, develop efficient DevOps processes, & contribute to our knowledge base
- Document requirements to optimize monitoring & self-healing capabilities
- Experience managing complex projects with significant bottom‐line impact
- Proven experience with bug tracking, reporting & project planning tools
- 3+ years’ experience in large scale internet service operations and service delivery
62
Principal Site Reliability Engineer Resume Examples & Samples
- Familiarity with container technologies such as Docker, LXC, Mesos and Kubernetes
- 10+ years in a software development role, operations role, or closely related position
- Programming experience in two or more of the following languages: Go, Java, Python, Ruby, Shell
- Experience with distributed version control such as Git or Mercurial
- Experience with enterprise monitoring solutions such as InfluxDB, AppDynamics, Graphite, Nagios, New Relic and Splunk
63
Principal Site Reliability Engineer Resume Examples & Samples
- Actively engage early in the service lifecycle interfacing with software delivery teams to influence service readiness
- Own serviceability, reliability and quality, measure service health and influence service KPIs
- Work across disciplines to design secure and resilient service architectures across platform services and standardize topologies
- Deliver and contribute to site availability, reliability and sustainability
- This job includes an expectation of on call availability split between the US and EMEA teams
- Partner with other engineering teams to resolve issues and defects
- Work with a global footprint of services, teams, and engineers
- Be willing to travel internationally a couple of times per year
- A minimum of 10 years’ experience in supporting the service design, development, or managing 24 x 7 enterprise systems
- BS/BA in Computer Science or related field, or equivalent work experience
- Solid working knowledge on cloud computing (Azure and/or AWS)
- A solid understanding of telemetry and service monitoring tools and techniques
- Preferred 5+ years of coding/testing experience in a high level language
- Understanding of Microsoft products, services, and platforms (e.g. SQL Server, ASP.NET, SharePoint, Azure, System Center)
- Experience with OSS and 3rd Party development and platforms e.g. Java, Ruby, Python, Linux, iOS
- Deep passion for shipping a high quality product that customers love
64
Site Reliability Engineer Resume Examples & Samples
- Obsess over collecting and digesting metrics
- Collaborates with project stakeholders to identify product and
- Researches, writes and edits documentation and technical requirements,
- 6+ years in a software development role, operations role, or closely related position
- Experience with enterprise monitoring solutions like InfluxDB, AppDynamics, Graphite, Racon, Grafana, Nagios, and Splunk
65
CIB Investor Services Site Reliability Engineer Resume Examples & Samples
- Design software to improve availability, scalability and efficiency of the financing platform
- Develop tools and applications to automate and support applications
- Troubleshoot production issues
- Participate in on-call rotation
66
Site Reliability Engineer Resume Examples & Samples
- Solve problems and automate responses for recurrent issues
- Gathering and refining requirements from stakeholders
- Working with developers in other component teams to ensure consistent integration of services across teams
67
Site Reliability Engineer Resume Examples & Samples
- Serve as primary point responsible for operations, maintenance and performance of network applications in high-volume production environment
- Assist in the roll-out and deployment of new features to facilitate fast growth
- Automation of build, release and operational activities
- Develop and/or customize tools to rapidly deploy applications in production environment
- Work closely with development teams to ensure that applications are designed with “operability” in mind
- Specifying and designing non-functional components of software
- 5-12 years’ experience in managing Network Management applications and tools
- Experience of networking and infrastructure at scale
- Use and support of Infrastructure-as-a-Service
- Expert troubleshooting and debugging skills
- Strong scripting and automation skills using Python, Perl, JavaScript
- SevOne
- HP Network Automation
- Splunk
- ThousandEyes
68
Site Reliability Engineer Resume Examples & Samples
- 3+ years in various DevOps/SRE roles
- Understanding of containers and container orchestration
- Demonstrate skills in priority setting, analysis, communication, time management, scheduling, and multitasking
- Experience with infrastructure configuration and automation processes and tools: Puppet, Ansible, Chef, Fabric, Terraform
69
Site Reliability Engineer Resume Examples & Samples
- Leadership role in all projects from problem recognition to prioritization of work, design and implementation of solutions
- Focus on application networking processes and automation
- Work in and contribute to a shared codebase
- Rapidly debug, fix and solve problems
- A solid understanding of infrastructure service technologies with production operations experience
- Experience with performance analysis, scale issues and large infrastructures
- Experience with workflow automation
70
Senior Site Reliability Engineer Resume Examples & Samples
- In-depth experience with Azure services such as SQL Azure, Compute, IaaS, PaaS, WAAD, VNET, & Express Route
- Expert in troubleshooting Live Site issues across the applications
- Expertise in web services, virtualization, cloud concepts, REST, JSON, XML, SQL, LDAP, & object oriented methodologies is desired
- Engage with Partners and customers effectively to mitigate/resolve/preempt issues/outages
71
Senior Site Reliability Engineer Resume Examples & Samples
- Drive efficiencies in process: implement and enforce process for change management, emergency response and capacity planning
- Solve problems relating to mission-critical services and build automation to prevent problem recurrence, with the goal of automating response to all non-exceptional service conditions
- Participate in an on-call rotation and be available for escalations
- A strong system and software engineering background
- A solid understanding of system availability, latency and performance
- Experience with enterprise messaging systems like MQ, Kafka and RabbitMQ running on Linux and Solaris
- Knowledge of large-scale distributed systems in practice, including multi-tier architectures, application security, monitoring and storage systems
- Strong programming skills in Java, Python or C++ and the ability to learn new languages as needed
72
Site Reliability Engineer Resume Examples & Samples
- Hands on system administration experience in Linux-based platforms, storage systems, load balancers and virtualized environments (VMware, Amazon AWS)
- Demonstrable technology experience with administration of MongoDB. Familiarity with basic database concepts is a plus
- Knowledge of ITIL process framework
73
Courts Site Reliability Engineer Resume Examples & Samples
- Work to improve the reliability and performance of front-end and back-end IT services in close collaboration with the architecture team and developers with a focus on automation, availability, performance, and efficiency
- The successful candidate will be a key contributor, working closely with members of our newly formed agile teams in a fast-paced environment
- The ideal candidate is energetic, innovative, and has superior technical, analytical, and problem solving skills
- Writing scripts to monitor and automate processes
- Troubleshooting issues across the entire stack - hardware, software, application and network
- Taking part in 24x7 on-call rotation as needed
- Participating in code reviews
- Bachelor’s (or equivalent combination of education and experience)
- Master’s preferred
- Experience working with Agile Teams is preferred
- Software engineering experience in Internet scale environments
- Experience troubleshooting performance, availability, scalability and reliability issues in production with enterprise-grade applications and driving architectural improvements to address these issues
- Knowledge of TCP/IP, HTTP, web application security, and multi-tier web application architectures
- Hands-on experience developing event-driven back-end systems
- Knowledge of shell scripting and at least one scripting language (Python, Ruby, Perl)
- Experience with open source projects (Mesos, Hadoop, Scribe, Zookeeper, etc.)
- Previous experience as a primary contributor to an architecture team in systems design, building, testing, and implementation
- Assesses current or future customer needs and priorities through communicating directly with customers, conducting surveys, or other methods
- Excellent written and verbal communication, interpersonal skills, and team skills
- Ability to listen to the needs of our internal clients, communicate clearly, and determine ideal configurations for new development teams and projects
74
Site Reliability Engineer Resume Examples & Samples
- 5+ years of experience in Service Engineering/Engineering roles; 5+ years of coding/testing experience
- Leadership skills: Sound problem resolution, judgment, negotiating and decision making skills
- Develop test plans/cases, conditions and scenarios in support of ongoing applications and infrastructure
- Familiar with SCVMM and operational knowledge of deployment and challenges
- Demonstrated solid working knowledge of cloud computing, Azure, and AAD
- Evaluate current services and drive performance, availability and supportability improvements
- Constantly focus on how to enable the business and increase agility
- Support a 24x7 live site support model for the services the team owns
75
Site Reliability Engineer Resume Examples & Samples
- Process customer service requests and meet established SLAs
- Identify and automate manual workarounds and process improvements
- Utilize Rails console to resolve customer account issues
- Monitor the availability, latency, scalability and efficiency of all services
- Understand the loan life cycle, transactional logic, and framework of our finance models
- Implement reports and automations for business stakeholders
- Assist with major incidents from identification and troubleshooting through to service restoration
- Reconcile daily bank files and monitor critical application processes
- Troubleshoot issues and discrepancies with the appropriate individuals, teams, or vendors
- Identify, document, and recommend solutions for bugs, processes, and feature enhancements
- Update knowledge base articles with current information and communicate to your teammates
- Perform periodic on-call duty as part of the SRE team
- Experience with Ruby, Perl, Python, Java, or PHP
- Capable of learning new technologies and concepts quickly, and continuously building upon that knowledge
- Can balance and prioritize multiple projects/tasks and remain calm under pressure
- Problem-solving skills: if you don’t know how to do something, you consider it a challenge to figure it out on your own
- Ability to write accurate and efficient SQL queries
- Strong understanding of IT infrastructure (Linux or systems administration, network technologies, relational databases, web technologies, etc.)
76
Site Reliability Engineer Resume Examples & Samples
- Maintain and support the product and data systems: proactively monitor events, investigate issues, analyze solutions, and drive problems through to resolution
- Use a wide variety of operational tools and monitoring platforms to gain in-depth knowledge, understanding, and ongoing monitoring of system availability, performance, and capacity
- Define requirements and assist with development of customized tools and reporting as needed by projects and operations
- Work with business partners to establish Service Level Indicators and Objectives (SLIs and SLOs)
- Rationalize deployment strategies to facilitate continuous delivery
- Implement alerting strategy that makes alerts actionable and unique
- Operate within ITIL Problem, Incident and Change Management practices
- Provide follow-through to ensure issues are resolved to satisfaction
- Create and improve standard operating procedures and documentation
- Drive continuous improvement and innovation within the team
77
Site Reliability Engineer Resume Examples & Samples
- Dedicated member of an agile software/devops team
- Deployment, support and maintenance of development software stacks, overseeing build frameworks
- Manage and maintain enterprise infrastructure tools as the primary subject matter expert
- Respond to system issues related to the infrastructure and fulfill service requests
- Lead infrastructure deployments in the scrum
- On-Call support for Pre-prod and Production environments
78
Site Reliability Engineer Resume Examples & Samples
- Improve the reliability and efficiency of Twitter’s traffic management systems
- Build and maintain high-performance, fault-tolerant, and scalable distributed systems in the context of Twitter's service-oriented architecture
- Troubleshoot complex distributed systems problems and develop solutions that have a significant impact at Twitter’s scale
- Write performant, maintainable, clear, and concise code and accompanying documentation
79
Site Reliability Engineer Resume Examples & Samples
- App Services - our core services handling users, tweets and more
- Storage infrastructure - our next-generation distributed cache and storage systems
- Core Infrastructure System - our internal core infrastructure services (provision engineering stack, DNS, Puppet, LDAP, Subversion, Kerberos, etc.)
- Database Engineering - our relational stores like MySQL, PostgreSQL and Vertica
- Engineering Effectiveness - our tools and services related to build, test and deployment systems
- Hadoop/Data Platform - our Hadoop clusters, data management services and the surrounding ecosystem (YARN, Scalding, Parquet, HBase)
- M&A - help our acquired companies manage their infrastructure
- Mesos/Aurora - our compute platforms that the rest of Twitter runs on top of
- Platform - our API/frontend services
- Traffic Engineering - our traffic management systems
- You will perform deep dives into both systemic and latent reliability issues; partner with software and systems engineers across the organization to produce and roll out fixes
- You will troubleshoot issues across the entire stack: hardware, software, application and network
- You will drive standardization efforts across multiple disciplines and services in conjunction with embedded SREs throughout the organization
- You will mentor SREs on standard methodology for everything from monitoring to troubleshooting complex code issues
- You will participate in code reviews for projects primarily written in Java and Scala, built on open source libraries such as Finagle, and running on both physical and virtualized platforms
- You will represent the SRE organization in design reviews and operational readiness exercises for new and existing services
- Experience with existing open source projects such as Scribe, ZooKeeper and Apache Mesos
80
Site Reliability Engineer Resume Examples & Samples
- Work in engineering team to design, build, and maintain Hadoop clusters and data services
- Participate in and build tools to
- 2+ years of managing services in a distributed, internet-scale *nix environment
- Familiarity with systems management tools (Puppet, Chef, Capistrano, etc)
- Demonstrable knowledge of Linux operating system internals, filesystems, disk/storage technologies and storage protocols and networking stack
- Hands-on operational experience on managing JVM services
- BS or MS degree in Computer Science or Engineering, or equivalent experience
- Understanding of Hadoop, YARN,
- Understanding of Scalding, Parquet
81
Site Reliability Engineer Resume Examples & Samples
- You would perform deep dives into both systemic and latent reliability issues; partner with software and systems engineers across the organization to produce and roll out fixes
- You will build relationships with ISPs and other industry partners
- Troubleshooting tools (e.g., tcpdump, netstat, iostat, traceroute)
- Experience with iptables or other firewall solutions
- Ability to work with engineering teams with minimal hand-holding
- You will identify and drive opportunities to improve automation for the company; scope and create automation for deployment, management and visibility of our services
- You will participate in an on-call rotation with other engineers to support your services
- Practical experience in Postfix, Exim or Msys
- Strong contacts at ISPs and track record of excellent deliverability
82
Site Reliability Engineer Resume Examples & Samples
- Build automation and tooling in Python and assist teams with their software (e.g., Ruby, Python, Perl)
- Perform deep dives into partner teams' infrastructure of varying types, sizes, and quality
- Drive standardization efforts across multiple disciplines, systems, software, and practices
- Functional knowledge of bootstrapping technologies: PXE or cloud-init
- Experience with configuration management tools: Puppet, Chef, or Ansible
83
Site Reliability Engineer Resume Examples & Samples
- Understand how the different core infrastructure systems come together to enable provisioning engineering at Twitter, and help keep all of the infrastructure running
- Meet with customers and partners in engineering to gather feedback, iterate on requirements, and align with the team and company mission and objectives
- Conceptualize, architect and develop systems and features for enabling core infrastructure engineering effectiveness: ease of use and ease of maintenance of these systems
- Development of solutions to enable automated workflows that bind together with Twitter's core engineering principles, and enable all engineers to make use of and achieve their objectives as required of the core infrastructure systems
- Perform deep dives into both systemic and latent reliability issues; partner with software, systems and security engineers across the organization to produce and roll out fixes
- Practical experience in Python and Ruby
- Experience with existing open source projects such as Scribe and Apache Mesos
- B.S. in Computer Science or related field
84
Senior Site Reliability Engineer Resume Examples & Samples
- 5+ years of experience as an SRE, or in operations or administration of customer-facing, high-availability, large-scale web-based applications
- 5+ years of PHP, Perl, Python or other scripting language
- 5 years of experience in Java-based technologies
- Mastery in PHP, Perl or Python Programming
- Administrative experience installing, configuring, troubleshooting, monitoring, and maintaining Linux infrastructure
- Experience in writing SQL and PL/SQL procedures
- Experience with a log analysis tool such as Splunk or the ELK stack (Elasticsearch, Logstash, Kibana)
- Experience with Orchestration Tools like Ansible etc
- Experience with monitoring tools like Sensu, Collectd, Grafana etc
85
Site Reliability Engineer Resume Examples & Samples
- You will automate the server provisioning process to reduce the labor of our networking engineering and datacenter operations teams. Once we plug a new server in, it walks itself through all aspects of provisioning to join the fleet without any human involvement
- You will build scalable infrastructure to manage metadata for hundreds of billions of files, hundreds of petabytes of user data, and millions of concurrent connections
- You will drive the company through “Disaster Recovery Tests”, where we manually turn down pieces of infrastructure to test Dropbox’s overall resiliency to failures
- You will design the system and processes that Dropbox engineers use to deploy their software into production
- You will build an auto-remediation system to automatically resolve production incidents before passing them to on-call engineers
86
Site Reliability Engineer Resume Examples & Samples
- Conduct performance analysis and monitoring of multiple products/platforms
- Identify, plan and implement solutions that continually improve performance
- Automation development for infrastructure efficiencies across multiple products/platforms
- Automation towards unattended fix actions
- Capacity management for multiple products/platforms
- Code contributions in languages such as Python, C/C++, Bash, Ruby
- Performance tuning for Apache web server, tcp/ip & network stack, system/kernel level tuning, NFS
- Alert response for performance and capacity related thresholds
- On-call rotations and Incident call handling
87
Site Reliability Engineer Resume Examples & Samples
- Work with engineering teams to design, build, and maintain systems
- Write scripts and software layers to monitor and automate processes
- Identify and drive opportunities to improve automation for the team (deployment, management and visibility of our services)
- Participate in an on-call schedule
- Work closely with Adobe operations teams to help develop and optimize solutions
- Strong comprehension of continuous integration and continuous deployment methodologies
- Deep understanding of both software engineering and technical operations
- Experience with programming in Python, Java, Ruby, Scala, Go, or similar programming language
- Experience with existing open source projects such as Mesos, Hadoop, Spark, ZooKeeper, Kafka, Cassandra, Docker
- Experience with developing frameworks, platforms, APIs
- Developing, running, and/or consuming cloud technologies such as AWS, Azure, OpenStack, Google Cloud Platform
88
Site Reliability Engineer Resume Examples & Samples
- BS /MS in Engineering, Computer Science, or equivalent with 5 years of experience
- System admin experience
- Experience with cloud hosting - AWS, Rackspace, CIS, OpenStack
- Experience with a 24x7 support model with on-call rotation
- Experience with monitoring and alerting
- Good proficiency with scripting languages such as Python/Shell
- Knowledge of distributed computing
- Experienced with implementing back-end services in large / “web scale” distributed systems
- Knowledge and experience with micro-services design and implementation
- Knowledge and experience with “Platform as a Service” environments or other application development platforms
- Strong team player, with ability to actively contribute in teams with different skill and experience levels
- Knowledge about buffering, stream processing, complex event processing, and storage solutions (e.g., RabbitMQ, Kafka, Mongo, etc.)
- Experience with Cluster enabling solutions with services like Apache Mesos, Kubernetes
- Service discovery solutions like Consul.io, HAProxy, etc.
- Understanding on how to develop within a continuous integration environment leveraging tools such as Jenkins, Hudson, Bamboo
- HA/DR/Availability/Capacity
- Monitoring setup (CloudWatch/Nagios/New Relic/SPM/Sensu)
- Experience building highly scalable solutions
- Excellent troubleshooting skills on a busy infrastructure setup
89
Senior Site Reliability Engineer Resume Examples & Samples
- Manage the infrastructure for a cloud service that processes a billion metrics per day, and replicates tens of billions of database writes to our backup service
- Design, implement, operate and troubleshoot the automation and monitoring of a service that seamlessly spans several data centers and several cloud providers
- Become an expert in MongoDB performance, helping us optimize from the application level all the way through the firmware
- Participate in a weekly on-call rotation, and make trips to our data centers as needed
- Troubleshoot and resolve issues in multiple environments
- Improve our infrastructure capabilities, optimizing for cost, simplicity, and maintainability
90
Site Reliability Engineer Resume Examples & Samples
- 2+ years of demonstrated experience managing and maintaining large scale SaaS platforms
- Deep experience in at least one infrastructure component (operating systems, compute, storage, networking, data center, distributed systems, big data, cloud, etc.) and solid understanding of the rest and how they impact services
- Experience building, configuring, and maintaining operational monitoring and reporting tools
- Solid understanding of infrastructure and application performance metrics, including capacity planning
- Proven ability to work independently, and strong problem solving skills
- U.S. citizen or lawful permanent resident preferred, in order to pass government security clearance requirements
91
Site Reliability Engineer Provisioning Resume Examples & Samples
- Deliver the infrastructure and configurations that engineering teams need: both by provisioning systems directly, and building out tooling for getting it done faster
- Improve and automate our tools for provisioning, monitoring, trending, and configuration management
- Explain and document our tools and processes, so that developers can own and self-serve their own operational needs wherever possible
- Communicate effectively with SRE teammates and developer “customers”
- Advise engineering teams on how to configure systems for high reliability
- Participate in periodic on-call rotation as part of a global team maintaining the availability and performance of the New Relic site and APIs
92
Site Reliability Engineer Resume Examples & Samples
- Design, write and deliver software to improve the availability, scalability, latency and efficiency of Mozilla's services
- Solve problems relating to mission critical services and build automation to prevent problem recurrence with the goal of automating response to all non-exceptional service conditions
- Experience with algorithms, data structures, complexity analysis and software design
- Experience in one or more of: Python, Go, JavaScript
- Familiarity with running web services at scale, understanding of Unix systems internals and networking
- Familiarity with Cloud services like AWS
- Familiarity with Linux container engines like Docker
- Familiarity with container scheduling systems like Kubernetes, Fleet, and/or Mesos. Understanding of Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way
- Networking: knowledge and understanding of network theory, such as different protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing
93
Site Reliability Engineer Resume Examples & Samples
- Keep the customer facing services available at top performance by maintaining constant health of the supporting systems
- Incident management - Act in key support roles during major incidents e.g. Sev0, Sev1. Also participate in the technical review of the incident for problem management
- Problem Management - participate in RCAs and hand them off to the Global Solutions team
- Ensuring that work carried out by the Site Reliability team is executed in such a way as to comply with the company's internal compliance policy and directives
- Being available to discuss and resolve technical issues and escalations with other technical staff as required
- Work with and lead other members of the team in staying on top of key industry innovation and technology, and assist in team development growth
- Identifying work opportunities and preparing or assisting with the preparation of technical proposals as required
- Ability to operate in a high-pressure environment, troubleshoot complex issues quickly, and successfully handle multiple priorities
- BS/BA Degree in Computer Science or equivalent industry experience (3-5 years in an enterprise-scale internet service engineering or support role)
- Expertise in TCP/IP related technologies (networking protocols, network programming, etc.)
- Expertise in CLI enterprise support of Unix variants (Linux/Solaris/BSD) as well as strong Linux/UNIX knowledge with significant exposure to Red Hat Enterprise Linux and Solaris
- Strong understanding of monitoring implementations and administration
- Strong communication skills (written and oral)
- Past experience in Incident Management and good understanding of ITIL service operations
- Experience in working in a 24/7 team managing large data centres
- Exposure to Oracle and high end Storage Infrastructure (Hitachi/EMC Tier 1)
- Perl/Python scripting experience
- Prior Chef/Puppet or automated deployment experience
- Experience in supporting and maintaining a monitoring system
- Experience supporting and troubleshooting relational databases and distributed platforms
- Experience in supporting and maintaining Java applications
94
Senior Site Reliability Engineer Resume Examples & Samples
- Must take ownership and accountability for one's actions, and be known for meticulous attention to detail
- Work in a highly skilled global team in a 24/7 environment
- Develop debugging tools to assist engineers in diagnosing production service problems
- May be required to define requirements for, and/or write/test custom tools to handle system automation tasks (installation, configuration, monitoring, etc.)
- BS/BA Degree in Computer Science or equivalent industry experience
- Mandatory knowledge of RHEL/CentOS 6.x or above; RHCSA certification preferred
- Have scripted using Shell/Perl/Python to automate repeatable tasks such as deployment and tuning of services
- Solid working knowledge of Unix systems internals, including file systems, kernel modules, packages, and networking
- Knowledge of clustering/HA
- Exposure to automation frameworks such as Puppet/Ansible/Rundeck
- Understanding/management of code repositories like Perforce/SVN/CVS/Git
- Experience designing, deploying and maintaining bare metal provisioning, multiple server installation (Kickstart), monitoring resources, etc.
- Understanding of network security (e.g., firewalls, IDS, IPS, etc.)
- Knowledge of application virtualization and operating system-level virtualization
- Hardware training and/or certification in systems and storage management (Red Hat Linux, Sun Microsystems, Hitachi Storage Solutions, EMC, Dell, Solaris, etc.) desired but not mandatory
- Solid networking experience - TCP/IP, administration of networking hardware (Cisco, Foundry, etc.), load balancing - Considered a PLUS
- MS or PhD in Comp Sci, Mathematics, Machine Learning, Artificial Intelligence or similar
- Salesforce force.com Development (Apex and Visualforce)
- Visual/Interface Design skills
- Familiarity with open source and 3rd party monitoring systems (Nagios, Kafka, SMARTS, etc.)
- IRC Bot development
- Agile development experience/understanding
- Network administration background
95
Site Reliability Engineer Resume Examples & Samples
- Partner with fellow engineers to architect and build mission critical software and systems that can stand the test of scale and availability, while limiting operational overhead
- Drive efficiencies in systems and processes: capacity planning, configuration management, performance tuning, monitoring and root cause analysis
- Participate in an on-call rotation and be available for escalations
96
Site Reliability Engineer, Senior Resume Examples & Samples
- 7 years of experience with leading activities in an Operations Center maintaining overall command and control
- Experience with software and systems engineering
- Experience with problem management and change control
- Knowledge of Linux, Java, Oracle, Networking, VMware, Apache, IT Process Management, including ITSM and ITIL, and Microsoft Office
- Ability to provide incident management, initial triage of events, and categorization and prioritization
- Ability to direct troubleshooting activities and make technical and executive escalations
- Ability to monitor several areas, including the Data Center Network, Infrastructure Bandwidth, Server and Infrastructure, applications within Marketplace environments, Agent-based Monitoring, Production Batch Processing, and Events and Alerts Escalation
- Ability to perform advanced troubleshooting and escalation and drive service restoration activities
- Knowledge of Remedy, Outlook e-mail, New Relic, Mixpanel, Oracle Enterprise Manager (OEM) Grid Control Management, Data Services Hub (DSH) Tool, or Tivoli
- Ability to function at a high level in critical situations
- Ability to ensure proper shift coverage and delegation of responsibilities
- Possession of excellent critical thinking, problem-solving, troubleshooting, and process-oriented thinking skills
97
Site Reliability Engineer Resume Examples & Samples
- Master's degree in Computer Science, Engineering, Information Technology, or related field with 1 year of experience, or Bachelor's degree in Computer Science, Engineering, Information Technology or related field with 2 years of experience
- Academic training or experience must include some exposure to
- Experience in security, dev-ops or Linux sysadmin role, preferably in a fast-paced web application environment
- Experience in web or network security
- Experience writing security policies, educating teams on security best practices, performing penetration tests, and analyzing a code base
- 1 year of experience programming with two of the following languages: Ruby, Bash, Java, Python, C, C++, and Perl
- 6 months of experience implementing information security best practices such as SAS70, SSAE16, SOC or ISO
98
Senior Site Reliability Engineer Resume Examples & Samples
- Develop and deliver configuration and deployment automation required for improving the functionality, availability, and manageability of our microservices using Python or Ruby and configuration automation tools such as Puppet, Chef, or Ansible
- Build infrastructure and application monitoring by gathering application and system metrics and implement tools for recoveries
- Troubleshoot availability/performance problems and build software-based solutions to prevent recurrences
- Define and evangelize cloud-related optimizations and best practices to improve reliability and performance
99
Senior Site Reliability Engineer Resume Examples & Samples
- You will have a maniacal focus on site uptime
- Engineer application infrastructure that is reliable, efficient, and maintainable
- Partner closely with software engineering teams using a strong devops mindset
- Act as a subject matter expert for troubleshooting and resolving complex, multi-tier web problems that span a number of different platforms
- Automate production operations processes
- Automate continuous integration and deployment processes
- Deliver infrastructure requests for software projects on time
- Define, measure, and meet key operational metrics including site performance, incidents and chronic problems, site traffic and conversion, etc
- Participate in a 24x7x365 support rotation for a website that never sleeps
- 5+ years’ experience building and supporting large-scale, business critical systems
- Public Cloud experience (AWS/ Google Cloud)
- Expert knowledge of at least one web application platform: WebSphere, JBoss, Tomcat, Apache, Nginx, Varnish, Endeca
- Expert knowledge of Application Performance Monitoring tools: Dynatrace, Splunk, Gomez, Coradiant, and Tealeaf
- Experience with continuous integration platforms such as Jenkins
- Experience with infrastructure configuration management tools such as Puppet and Chef
- Mastery of at least one scripting language, such as Python, Perl, Ruby, or Shell
- Unix/Linux power user
- Strong interpersonal skills, written and verbal communication
- Strong decision-making, problem-solving skills, critical thinking, and testing skills
- Ability to self-manage assigned tasks and projects
- Ability to work independently with minimal direction
- The knowledge, skills and abilities typically acquired through the completion of a bachelor's degree program or equivalent degree in a field of study related to the job
100
Site Reliability Engineer Resume Examples & Samples
- Development of a reliable set of micro REST APIs to help create agile and robust infrastructure management and reporting workflows
- Improve and build upon our existing automation tools for systems provisioning and management
- Independently learn new technologies and master the New Relic infrastructure so that you can provide 'full stack' diagnostics, when necessary, to help to figure out the root cause of internal problems
- Communicate effectively with fellow SREs and other engineering teams, and describe problems succinctly with sufficient detail that you can hand-off an ongoing problem to another team or a peer for completion
- Strategize with fellow SREs and other engineering teams on complex problems, and make decisions and recommendations about systems improvements after analyzing possible courses of conduct
- Perform periodic on-call duty as part of a global team maintaining the availability and performance of the platform that enables all of New Relic
101
Site Reliability Engineer Resume Examples & Samples
- Min 2+ years of experience in a Cloud administration role
- Min 3+ years of experience in IT
- Must be familiar and comfortable with continuous integration
- Ability to work well with various team members from developers to business people
- Self-starter with the ability to work solo on projects and get results
- Proactive and strong ability to learn new things with little guidance
- QlikView/Qlik Sense experience a plus
- Ability to work on computer for extended periods
- Must be willing to be on-call for rotation and system emergencies
102
Site Reliability Engineer Resume Examples & Samples
- Proactively monitor availability and performance of the Ariba Cloud using key tools
- Effectively respond to Monitoring alerts, incident tickets, email requests coming in to Site Reliability Engineering
- Perform application and web site troubleshooting to quickly resolve the issues per documented procedures
- Ensure user tickets and monitoring alerts are handled per defined SLAs for response time, updates and closure
- Escalate issues as needed to back line production operations teams or Engineering per documented procedures
- Handling communication and notification on major site issues to the company and executive management team
- Document resolution run books and standard operating procedures
- Ensure smooth handoffs between shifts
- Experience working in a 24 x 7, large-scale Internet web environment
- Prior experience working with Java/J2EE applications
103
Senior Site Reliability Engineer Resume Examples & Samples
- Building and running Global Compute platforms
- Operate and deploy cloud services and related projects from development to production
- Develop automation, processes, and tools designed to make this process simpler and more robust
- Bridge Engineering and core shared operations services
- Participate in troubleshooting, capacity planning and analysis, performance analysis activities
- Advise management on service onboarding strategies and execution
- Mentor team members on areas of subject-matter expertise
104
Site Reliability Engineer Lead Resume Examples & Samples
- 5+ years of experience with Linux and UNIX administration
- 5+ years of experience with scripting languages, including PHP, Python, Java, Node, or Ruby
- 2+ years of experience with maintaining high availability production systems in a Cloud environment
- 2+ years of experience with automation and configuration management using either Puppet, Chef, or an equivalent
- 2+ years of experience with continuous integration tools, including Jenkins or TravisCI
- Experience with LXC and containerization, including Docker
- Experience with continuous monitoring tools, including Sensu, Nagios, and Splunk
- Experience with working in and maintaining PaaS environments, including Kubernetes, Mesosphere, Flynn, or Deis
105
Senior Site Reliability Engineer Resume Examples & Samples
- Such as New Relic, database query optimization, webpage optimization
- Fine tuning of web servers, use of caching layers, etc.
- Optimize buffer pools and Java settings; propose & implement CDNs
- Proactively identify, troubleshoot, and resolve issues and/or propose solutions across all environments
- Own capacity planning and infrastructure upgrades for computer platforms
- Collaborate with Software Engineering and Product teams
- Work the L2 queue, create & implement fixes in production and back-port to code-base
- In creating pull-request for said code fixes, changes & improvements
- Integrating infrastructure changes to optimize for speed and accuracy
- Develop systems diagram aka solution architecture for special projects
- Implement the said solutions across the various environments
- Document the proposed solution via visual aids such as MS Visio, Lucidchart, or draw.io
- Create & evangelize high level plan, timelines and cost estimate for the proposed solution
- Build out the solution bringing the paper solutions and/or pipelines to reality
- Collaborate tightly, working in short sprint cycles with Product, Development and QA for load testing scenarios, execute & provide feedback to peer groups
- Propose site reliability measures as a result of such experiments
- Define, setup, and manage full stack monitoring and corrective actions
- Recommend and champion software improvements to support elastic scalability
- Continuous Improvements to minimize manual Ops while improving availability
- Support corporate and business critical security certifications
- Understand business domain quickly and be responsive to peer team needs
- Document, train, and mentor peer team members as required
- 10+ years of progressive experience with Software Development Engineering - .Net & Java
- Strong knowledge of authoring, troubleshooting & optimizing SQL
- Strong knowledge of Visual Studio, TFS, TeamCity and related MS development concepts & tools
- Strong knowledge of Java and related development concepts & tools
- Experience with Continuous Integration/Deployment with tools such as Jenkins, Bamboo, etc.
- Willingness to learn & work with
- Configuration Management tools such as SaltStack (in use; previous Chef/Puppet experience is good)
- Big Data solutions such as Apache Kafka (in use), Apache Cassandra (in use), Apache Spark (in development)
- Distributed computing concepts such as stream processing, fault tolerance, job management, map-reduce, etc.
- Android/iOS Applications - our Point-of-Sale system runs as an application on these two platforms
- Solid working knowledge of VLANs, Routing, NAT/PAT, TCP/IP, and TCPDUMP
- Experience dealing with firewalls, packet filters, proxy servers, traceroute, ping, etc.
- Strong Infrastructure experience in: Cloud (AWS/Azure, etc.), VMware/Xen, etc.
- Expert Systems engineering experience in: Linux and Microsoft Windows Servers
- Expert in web ware such as: Apache/NGINX/IIS, DNS, NTP setup and management
- Bachelor’s degree or higher in a technical discipline, preferably Computer Science
- Ability to work and make reasonably sound business decisions under minimal supervision
- Strong verbal, written, and presentation skills
- Support for on-call & off-hour/weekend activities relating to production support
106
Site Reliability Engineer Resume Examples & Samples
- Ensure high availability with adequate monitoring and instrumentation
- Experience automating Linux system provisioning
- Experience with logging infrastructure and tools such as Logstash, Elasticsearch, Kibana, Splunk, and HDFS
- Maintain and extend operational processes to ensure high reliability of our entire technology stack
- Implement best practices to manage utilization, optimization, and monitoring of our cloud services
- Handle deployments, automation, and instrumentation and conduct analysis to improve DevOps
- Configuration management experience with Ansible, Puppet, Chef, or Salt is a nice to have
- Participate in daytime on-call rotation to support site and production issues
- Manage our computing and storage infrastructure if needed on private cloud
- Experience with DevOps tools, processes, and culture
- Advanced Linux SysAdmin experience
- 2 years minimum experience in SRE, DevOps or Production Operations role
- Hands on experience scripting with Perl, Python or Ruby
- Technical generalist knowledge of System Administration, Databases, Network Operations
107
Senior Principal Site Reliability Engineer Resume Examples & Samples
- Work with multiple development teams to deploy code, maintain and enhance infrastructure
- Develop and maintain auto-scaling monitoring and performance analysis tools in a large-scale dynamic environment
- Assist with capacity planning
- Communicate regularly with globally dispersed teams
- Operational Automation and CI/CD experience (RunDeck, Puppet, Jenkins, Chef, Ansible, GIT)
- Expertise in UNIX / LINUX / Solaris / Windows Operating Systems
- Development experience with Java, Python, Ruby, Bash, Groovy, C#, C++ languages, .NET framework
- NoSQL Database experience (Cassandra, Riak, MongoDB, Redis)
- Clear understanding of software development lifecycle
- Experience with Oracle Database query tools, including the ability to write complex SQL queries, and Oracle DB monitoring tools
- Solid Networking experience; CCNA certification preferred
- Proven track record of delivery and commitment to customers
- Be able to isolate defects effectively and provide specific conditions for reproducibility
108
Site Reliability Engineer Resume Examples & Samples
- Design, implement, and support access management solution
- Provide timely and concise status to customers and team members for engagements
- Engage in agile transformation of products, solutions, team, and culture
- 5+ years of experience in the IT industry (system administration, automation, development, etc.)
- Ability to initiate, design, execute, and complete projects independently with minimal direction
- Must possess excellent communication skills (written, verbal) and be able to work effectively with technical and non-technical individuals alike
- Ability to solve problems independently
- Development & support
109
Site Reliability Engineer Resume Examples & Samples
- Use complex algorithms to develop systems & applications that deliver business functions or architectural components
- Develop system architecture that improves designs & maps form to function
- Educate team members about integrating systems collaboratively & efficiently
- Assess how the competition differs from Verizon’s current state and update Verizon operations
- Bachelor’s degree in Computer Science or Information Technology
- 4+ years of experience in supporting Java and J2EE applications deployed in multi-platform environment
- Experience and/or knowledge of CA Wily Introscope, Siteminder and other monitoring tools
- Experience and/or knowledge of IBM Tealeaf, SPLUNK, Dynatrace and other
- Experience and/or knowledge of DevOps – CI/CD tool chain and successful implementation
- Good understanding of JVM Thread Dump, GCC and various system log files
- Good understanding of Application Logging, Monitoring and Alerting products
- Solid experience with working on UNIX platforms and good UNIX scripting skills
- Good knowledge of some scripting language (Perl, PHP, Python)
- Good experience with both VMs and high-availability architecture
- Good understanding of web server load balancing concepts and experience with highly available distributed environments
- Experience working in DMZ environments with good understanding of firewalls, TCP/IP, hardware load balancing (ideally NetScaler, HST.F5), and multi-tiered architectures
- Troubleshooting of JEE based applications (analyze JVM logs, Trace Logs, FFDC, native logs, java core, heap dumps)
- Good time management, documentation and communication skills
- Enthusiasm and eagerness to learn and embrace new technologies
- Teamwork & collaboration skills to work across organizations and lead cross-functional teams
- Communication & stakeholder management skills
110
Site Reliability Engineer Resume Examples & Samples
- Being embedded within a Software Engineering team
- Writing testable code in Python (more than just scripts)
- Highly available architecture
- Continuous delivery principles and practices
- Monitoring best practices
- Service discovery
- NoSQL
- Load testing
- Public Cloud (AWS, Google or Azure)
- Linux (understand how things work under the hood)
111
Site Reliability Engineer Resume Examples & Samples
- 5+ years of combined professional software development and system administration experience
- A B.S. or M.S. degree in Computer Science, MIS, CIS, or a related field, or equivalent experience
- Passion for providing a great customer experience
- 2+ years’ prior system administration experience with 24x7 SaaS products
- The ability to manage SaaS architectures and operations at large scale
- Experience with Amazon Web Services (AWS) or similar infrastructure as a service
- DevOps and software development experience with large-scale environments
- A deep understanding of Ruby, Python, or Java
- Experience with shell scripting
- Experience with REST APIs and standard Linux command line utilities
- Experience with Chef, Puppet, Ansible, or other configuration management software
- Experience using a version control system—such as Git or SVN—and code deployment methodologies
- Network design and troubleshooting
- SQL and NoSQL databases
- Jenkins
112
Principal Site Reliability Engineer Resume Examples & Samples
- Respond to critical Application Alerts
- Research and investigate production application defects and provide solutions for resolution
- Provide technical insight on development projects
- Perform Root Cause Analysis (RCA) on system incidents
- Assist with testing and validating production applications
- Development experience with Java, Python, Ruby, C#, C++ languages, .NET framework
- Clear understanding of software development lifecycle methodologies
- Demonstrated ability to understand complex network architecture (Firewall, Load Balancer, DNS, Routing, Switching)
- Understanding of Client/server/database architecture
- Able to articulate strong stance for product quality and customer satisfaction, balancing business goals and commitments with quality objectives
- Must be able to work effectively in globally distributed team structure
- Experience with server and application monitoring tools (Nagios, Splunk, NewRelic, Sensu, Graphite, LogStash, etc)
- Operational Automation experience (RunDeck, Puppet, Jenkins)
- Experience working with .NET Applications
- NoSQL Database experience preferred
- Experience with Mercury test suite including QTP, LoadRunner and Quality Center or similar products preferred
- Knowledge of Apache product suite particularly Tomcat
113
Site Reliability Engineer Resume Examples & Samples
- Evaluate and contribute to product and service design and architecture, helping shape service engineering technical strategies, review specifications, and design and improve upon core tools and processes
- Help build a data-driven culture by providing statistical trending and analysis using real service data to increase service health and quality
- Work closely with peer engineering teams on defining and implementing improvements to service tooling, monitoring, and reporting to enhance reliability and capability
- 5+ years in the technology industry with broad engineering experience; 3+ of those in online services
- Demonstrated understanding of Microsoft SQL, PowerShell, C#, .NET web applications and software security concepts
- Experience developing and publishing applications in Azure
- Working knowledge and hands-on experience in developing tools designed to run in complex, large scale online/hybrid services
- Previous working experience in a DevOps engineering model a plus
- Strong interpersonal, verbal, and written communication skills, with the ability to assemble, document, and present technical information to the team
- BA/BS in Computer Science, Mathematics, Electrical/Computer Engineering or related degree preferred or equivalent work experience
- Experience in large scale implementation (greater than 40K clients and 1K mobile devices) in one of the following technologies is a plus: System Center Configuration Manager, Altiris, Airwatch, MobileIron, Identity Management, Remote Desktop, VDI, Exchange/Office 365 or SharePoint
114
Site Reliability Engineer Resume Examples & Samples
- Create and/or improve the tools that provide insight into availability and performance of our services
- Solve problems related to these mission-critical services and build automation to proactively detect and prevent their recurrence, along with driving down time to resolution
- Share in the creation of new designs and architectures for multi-region, multi-datacenter distributed systems
- Collaborate with application engineering teams in solving business needs with our services
- Partake in the periodic on-call duties for provided services
- BS in Computer Science (or equivalent experience) plus 5-7 years of experience including experience with open source technologies, automated configuration, DevOps, or cloud automation development
- Experience in one or more languages of Python, Ruby, Java, or Go
- Understanding of Software Development Life Cycle, Test Driven Development, Continuous Integration and Continuous Delivery
- Knowledge of open source technologies such as Docker, Elasticsearch, Kafka, Redis, Cassandra, Consul, Nginx, HAProxy, and ability to quickly learn new technologies
- Configuration management experience with Puppet, Chef, Ansible, CFEngine desirable
115
Site Reliability Engineer Resume Examples & Samples
- Develop automation to improve reliability, performance, and deployment speed of core services
- Work on problems related to deployment, monitoring, failure handling, and traffic management
- Design and implement solutions that improve availability and use of Oracle's Cloud services
- Create new designs, architectures, processes, and methods for the implementation, operability, and support of large-scale distributed systems
- Be a self-driven and independent thinker, act on ideas and drive them to completion
- Define best practices and standardization in Cloud operations
- Stay informed and relevant in new technologies
- Innovate!
- Prior experience in designing, implementing or supporting high performance and large-scale web applications in high scale customer facing environments
- Prior software development experience in one or more of: Python, Java, Ruby, Go
- Strong communication and analytical skills
- Able to accurately estimate efforts and deliver on time
- Experience with agile processes and good understanding of software development practices
- Expertise with software development eco-system such as Git, Jenkins, Artifactory, and CI/CD practices
- Knowledge of system and application security
- Understanding of virtualization solutions and Cloud services
- Strong knowledge of Linux-based OS internals
- Strong networking knowledge: TCP/IP, UDP, ICMP, IP packets, DNS, OSI layers, and load balancing
- Experience with configuration management tools
- Understanding of the DevOps toolchain components, and how they fit together
- Ability to define and document technical architecture of complex and highly scalable products
- Ensure the quality of the products being delivered
- Eagerness to automate, wherever and whenever the possibility arises. Automation is part of your DNA
- Must possess a strong desire to measure application performance and act upon results
- Experience with Linux containers (e.g. Docker)
- Familiarity with cluster management solutions: Mesos and job schedulers
- Expertise with databases and big data stores like MySQL, Memcached, PostgreSQL, and Oracle DB
116
Site Reliability Engineer Resume Examples & Samples
- BS in Computer Science or related field, or equivalent employment experience
- Strong sense of ownership, customer service, and integrity demonstrated through clear communication
- Demonstrated ability to write programs using a high-level programming language like: C, Java, Ruby, Python, or Perl
- Proclivity towards efficient programming emphasizing improvement via complexity analysis
- Experience managing large numbers of diverse systems with configuration management systems like: Puppet, Chef, Ansible, or Salt
- Deep understanding of the Linux Operating System, including Kernel, Memory, Process, Threads, Static / Shared Libraries, IPC, Signals
- Understanding of standard networking protocols and components such as: HTTP, DNS, TCP/IP, ICMP, the OSI Model, Subnetting and Load Balancing
- Fundamental understanding of distributed systems including: the CAP Theorem, Microservices, and the Twelve Factor App
- Passion for eliminating repetitive manual processes using automation
117
Site Reliability Engineer Resume Examples & Samples
- 5+ years of managing services in a large scale *nix environment
- Experience with DevOps tools, processes, and culture. Experience with Puppet, Chef or Ansible is a plus
- A systematic, test-and-measure approach to continually improving service operations
- Practical knowledge of shell scripting and at least one scripting language (e.g. Perl, Python, Ruby)
- Good understanding of Java is a major plus
- Working knowledge of Oracle Database is a plus
- Understanding of cryptography is a plus
118
Site Reliability Engineer Resume Examples & Samples
- Minimum of 3 years of experience supporting internet-facing production services and distributed systems
- Extremely organized, detail oriented, and thorough in every undertaking
- Able to balance multiple tasks and projects effectively and quickly adapt to new variables
- Demonstrated problem-solving ability utilizing creative and innovative thinking but also adhering to
- Self-motivated and eager to learn
- Professional and open-minded attitude
- Able to work closely with other team members as well as work independently
119
Senior Site Reliability Engineer Resume Examples & Samples
- Practical knowledge of shell scripting and at least one scripting language (e.g. Perl, Python, PHP)
- Strong hands-on knowledge in Unix/Linux environment
- Experience with monitoring tools such as Nagios, Splunk highly preferred
120
Site Reliability Engineer Resume Examples & Samples
- System administration in a large environment
- Good understanding of TCP/IP network protocols
- Fluency in one or more of: C, Ruby, Perl, Python
- Superior analytical/troubleshooting skills
- Understanding of data structures, algorithms, and complexity analysis
- Tendency to automate repetitive tasks
121
Site Reliability Engineer Resume Examples & Samples
- 5+ years supporting database infrastructure in a high-volume customer-facing environment
- 3+ years of data modeling and database design
- Strong experience in optimizing and performance tuning of large RAC clusters
- Extensive Oracle Data Guard experience including fast start failover
- Proficiency with SQL, shell, Perl, & Python languages
- Experience with any NoSQL data store such as Voldemort, MongoDB or Couchbase
- Background in building of a large infrastructure supporting a high volume of transactions in a mission critical environment
- Strong communication skills and ability to work effectively across multiple business and technical teams
- Ability to thrive in a fast-paced, tight deadline delivery timeline
122
Site Reliability Engineer Resume Examples & Samples
- 3+ years supporting hosted services in a high-volume customer-facing environment
- Demonstrated ability to write programs using a high-level programming language like: C, C++, Objective-C, or Java
- Experience with relational databases and NoSQL
- Background building distributed, server-based infrastructure supporting a high volume of transactions in a mission critical environment
- Demonstrated ability to deliver results on time with high quality
123
Siri Senior Site Reliability Engineer Resume Examples & Samples
- Expert knowledge of the Linux operating system (OS, networking, process level)
- Understanding of one or more object-oriented programming languages (Java, C++)
- Fluent in at least one scripting language (Shell, Python, Ruby, etc.)
124
Site Reliability Engineer Resume Examples & Samples
- Typically requires at least 5 years of experience in Linux or UNIX systems administration in a large engineering or R&D environment and demonstrated skills in the following
- Linux (RHEL/CentOS preferred)
- NFS and NAS appliances (NetApp preferred)
- Layer 2 / Layer 3 networking (Arista or Cisco preferred)
- Scripting in shell, Perl, Python or Ruby
- Revision control systems (SVN, git, Perforce)
- Centralized configuration management (Puppet, cfengine)
- Software/tool compilation and installation
- FlexLM and similar licensing systems
- Monitoring systems such as Nagios, Zenoss, Groundwork
- LDAP (OpenLDAP, DSEE, OpenDirectory)
- IPAM with DNS (BIND) and DHCP
- Must be analytical and possess strong organizational/problem-solving skills
125
Site Reliability Engineer Resume Examples & Samples
- Experience automating Linux/Mac OS X system provisioning
- TCP/IP layer 2 and layer 3 networking
- Centralized configuration management, e.g. Puppet, Ansible, Chef, or Salt
- Experience with logging infrastructure and tools such as Logstash, Elasticsearch, Kibana, Splunk, HDFS
- Independently driven, proactive, accountable, reliable, team player
- Additional desirable qualifications
- Scripting: JavaScript, Ruby, Python, Perl, Tcl/Expect, Bash
- AppleScript/UI automation
126
Site Reliability Engineer Resume Examples & Samples
- Using your Linux system administration skills, monitor and manage the reliability of the systems under the responsibility of the Control Room Bridge
- Develop and maintain monitoring tools used to support the HPC community within NERSC using programming languages like C, C++, Python, Java or Perl
- Provide input in the design of software, workflows and processes that improve the monitoring capability of the group to ensure the high availability of the HPC services provided by NERSC and ESnet
- Support in the testing and implementation of new monitoring tools, workflows and new capabilities for providing high availability for the systems in production
- Assist in direct hardware support of our data clusters through managing component upgrades and replacements (DIMMs, hard drives, cards, cables, etc.) to ensure the efficient return of nodes to production service
- Maintain outage documentation through a trouble ticketing system
- Help in investigating and evaluating new technologies and solutions to push the group’s capabilities forward, getting ahead of our users’ needs, convincing staff incentivized to transform, innovate and continually improve
- Bachelor’s Degree in a Computer Science or similar discipline with a minimum of 3 years related experience, or the equivalent combination of education and experience
- Minimum of 1-2 years of experience as a system administrator or systems engineer supporting critical systems and applications in UNIX or Linux, in a high-volume customer-facing environment, managing data clusters, administering the replacement of hardware, and ensuring its continuous availability to the user community
- Experience with or have taken classes in programming languages such as C, C++, Perl, Java and Python or a scripting language
- Experience in or have taken classes in the areas of TCP/IP related technologies (networking protocols, network programming, e.g. TCP/IP, UDP, ICMP, etc., MAC addresses, IP packets, DNS, OSI layers, and load balancing, etc.)
- Ability in or have taken the appropriate classes or certifications in areas of enterprise support of Unix variants (Linux/Solaris/BSD)
- Past experience in Incident Management and good understanding of IT service management
- Experience in working in a 24/7 team
- Exposure to Oracle and high end Storage Infrastructure (Hitachi/EMC Tier 1) or have taken the appropriate classes in these areas
- Familiarity in configuring distributed, server-based infrastructure supporting a high volume of transactions in a mission critical environment in a Linux environment
- Knowledge of network security: configuring/maintaining ACLs, knowledge of firewalls
- Excellent problem-solving skills with the ability to work both independently and collaboratively, contributing to an active intellectual environment
- Must demonstrate good judgment and ability to schedule and lead small group projects with systematic problem solving, coupled with a strong sense of ownership and motivation
127
Site Reliability Engineer Resume Examples & Samples
- Building, managing and securing Linux systems in an Enterprise environment
- Deploying production developer code
- Monitoring systems and applications
- Creating and managing internal tools
- Deep troubleshooting of production issues
- Creating and maintaining tools for use within the team
- Participating in on-call rotation
128
Site Reliability Engineer Resume Examples & Samples
- Continuous improvement. You will make the DevOps motto yours: “There is always a better way to do things”. You will help uncover and implement these better ways
- Release management. You will help deploy new products or services into production, being the liaison between Software Development and Hosting, and giving due consideration to security and compliance
- Data and knowledge creation. You will help generate information about the system, from service availability statistics and capacity usage monitoring to feature usage measurements. This will allow the business to make data-driven decisions in all circumstances
129
Senior Site Reliability Engineer Resume Examples & Samples
- Up to 50% of time spent on Incident Response with the goal of driving bridge calls to the quickest possible MTTR
- Integrates with select Scrum teams to determine infrastructure and environment related impacts and drive operational readiness
- Become infrastructure capability Subject Matter Expert and work with Dev teams to build to standards that drive the highest levels of availability
- Build and implement recovery tooling capabilities to better respond to common production incidents
- Help drive best software practices in code to avoid incidents or be more resilient to environment instability
- Inventory and document Application dependencies at all layers of the technology stack. Build tools and automate dependency discovery where possible
- Implement and validate new environment requirements before they are needed. This includes certificates, firewall rules, websites, VMs, containers, load balancer configuration, and all necessary infrastructure to reliably run the target application
- Apply specialized knowledge of industry standards or practices to assigned initiatives
- Identify complex and/or broad problems and issues and formulate recommendations
- Interact with development and infrastructure partners to better understand their objectives and help them understand the technical landscape
- 8+ years of active engineering and/or architecture experience in a complex environment and/or comparable development experience such as
- Large Scale web infrastructure experience, preferably in high transaction volume OLTP sites or the Financial Services industry
- Experience supporting a 24/7 site with on-call responsibilities for production support
- Broad technical field exposure preferred, with preference to skills in one or more of the following: infrastructure, VMs, load balancing, containers, JVMs, web servers, application debugging, queuing technologies, caching technologies, databases, routing and switching, etc.
- Bachelor’s Degree in related field preferred; Relevant industry experience can substitute
130
Site Reliability Engineer Resume Examples & Samples
- Design and development of a new service or customer interface
- Provisioning and configuring the servers and applications behind services
- Adding new monitoring and measuring key metrics to better understand a process
- On-call troubleshooting or direct customer support
- Improving automation and analyzing workflows
- Proficiency in provisioning, administering and using modern Unix/Linux operating systems
- Strong knowledge of shell scripting and at least one programming language, preferably Python
- Practical knowledge of TCP/IP networking and troubleshooting connectivity issues
- Excellent verbal and written communication skills, including documentation for peers and end-users
- Demonstrated commitment to centralized configuration management and automation, preferably with Puppet
- Experience supporting high availability solutions such as load balancing, clustering, or fail-over mechanisms
- Database, web server, web applications design and administration experience, including Apache, MySQL, Postgres, Sybase, PHP
- Experience using Juniper, PulseSecure, NetApp, Infoblox and other proprietary operating systems and devices
- Solid understanding of network protocols and ability to use diagnostic tools appropriately
- Red Hat training or certification
- Demonstrated experience in a production environment supporting enterprise wide essential services like DNS, DHCP, IP address management, LDAP, authentication, firewalls, file transfer, network access control, remote access, proxies and VPNs
- Proven ability to provide technical leadership to colleagues and customers
- Demonstrated experience in a production environment supporting essential services and 100+ servers
- Demonstrated ability to solve problems in creative and effective ways
131
Senior Site Reliability Engineer Resume Examples & Samples
- Fluent in systems programming and/or automation, and can leverage their experience to solve complex problems associated with running production environments at massive scale in multi-tenant environments
- Implementation of proactive monitoring, alerting, trend analysis and self-healing systems
- Participate in on-call rotations, driving restoration and repair of service-impacting issues
- Define non-functional requirements as part of the product lifecycle to influence the new designs, standards, and methods for scalable, highly available distributed systems
- 5+ years of PHP, Perl, or other scripting language
- Master's Degree in Computer Science or equivalent
- Experience with one of the log analysis tools like Splunk or ELK products (Elasticsearch, Logstash, Kibana)
132
Site Reliability Engineer Resume Examples & Samples
- 4+ years of experience with Linux
- Experience programming with at least one of: Python, Go, Ruby, PHP, or Bash
- Experience with configuration management (including tools like Chef, Puppet, Ansible, or others)
- Experience managing high-throughput, high-availability systems
- Knowledge of networking protocols, including familiarity with TCP/IP, HTTP, SSL, and DNS
- Knowledge of monitoring systems and strategies for writing useful alerts
133
Site Reliability Engineer Resume Examples & Samples
- Strong systems scripting skills (Python, shell)
- Comfortable analyzing and troubleshooting large-scale distributed systems
- Fluency with Ubuntu (or other Linux distribution)
- Experience with running production systems
- Willingness to occasionally rack and wire servers
- Experience with continuous integration/deployment with Jenkins (or similar tools)
- Experience with SaltStack (or similar) for configuration management
- Experience with virtualization, containerization, and system image management (Docker, KVM, LXC)
- Experience with building and operating monitoring/alerting infrastructure
- Familiarity with distributed storage systems (HDFS, Amazon S3)
- Basic understanding of network switch configuration and network troubleshooting process
134
Principal Site Reliability Engineer Resume Examples & Samples
- Critical Analysis: Analyzes the current IT architecture to identify weaknesses and opportunities for improvement based on performance, reliability and capacity metrics, existing and planned application architectures, and criticality and efficiency of business processes
- Vision and Strategy: Drives development of our next generation global infrastructure strategy, design and roadmap based on deep understanding of data center strategy, private and public cloud “infrastructure as a service” (compute, storage, network, etc.) and “platform as a service” (software-defined networking, auto-scaling, automated provisioning and configuration) capabilities, as well as desktop computing and mobile device management
- Expertise: Hands-on technical practitioner and individual contributor who knows from experience which infrastructure architecture patterns, tools, practices and vendors/partners will enable us to effectively and reliably scale our various systems
- Solutions: Works collaboratively with enterprise architecture, information security, application & infrastructure teams to produce optimal, high-level, conceptual, and cost-effective designs. Facilitates the evaluation and selection of solutions and product standards. Acts as an arbitrator, as needed, to bring the appropriate parties together to drive discussions to decisions in a timely manner
- Delivery: Leverages hands-on expertise to implement proof-of-concept and production-class designs (read: gets hands dirty) and ensures proper monitoring and alerts are in place to govern adherence to standards and requirements
- Insights: Work with engineering and support teams to convey how best to improve application reliability and resiliency as input to technology roadmaps. Share real world implementation challenges and recommend new capabilities that would simplify adoption and drive greater value from use of Sephora’s infrastructure services
- Minimum 10 years' experience in a medium to large IT organization running a large, complex IT infrastructure
- Minimum 5 years' experience in a technical architecture role with demonstrated ability to lead via influence
- Expertise in problem solving and analyzing global scale distributed systems (public/private cloud, traditional datacenter infrastructure, multiple operating systems)
- Expertise with public cloud IaaS and PaaS providers including AWS, Azure
- Expertise with private cloud IaaS and PaaS providers such as OpenStack, OpenShift, CloudFoundry, etc
- Expertise working with and evaluating services of Managed Service Providers such as Verizon/Terremark, HCL, Rackspace, IPSoft, LiquidHub, etc
- Experience building software and systems to manage infrastructure and applications through automation; i.e. fluency with configuration and automation tools such as Puppet, Chef, Ansible, SCCM
- Expertise in compute virtualization such as VMware, MS Hyper-V, KVM, VirtualBox; familiarity with container technologies such as Docker and LXC is a plus
- Expertise with monitoring solutions such as Nagios, Cacti, Splunk, Zenoss, AppDynamics
- Fluent in Linux (various flavors) and Windows
- Comfortable with Java, Perl, Python, Ruby and shell scripting languages including bash and PowerShell
- Successful in building relationships with leaders, peers and business partners as well as vendors
- Experience developing strategic and tactical plans to meet business objectives
- Familiarity with retail and ecommerce technology is a strong plus
135
Bluemix Site Reliability Engineer Resume Examples & Samples
- Involvement in every facet of platform — from the earliest stage of product architecture, design and development to deployment, troubleshooting, and performance analysis – to ensure a reliable quality product in production
- Ability to collaborate and communicate clearly on status and progress
- Take initiative to do what must be done in order to keep critical systems operating
- Perform general OS, Web/Application server, database configuration, installs, automation
- Participate in periodic on-call rotation in a 7X24 environment
- Scripting experience in 1 or more of the following languages: Python, Ruby, Perl
- Strong Linux software development, networking, and problem solving skills
- Identify best practices and tools across Bluemix DevOps, Infrastructure, CFS Conductors & Network Teams
- Proven and relatable troubleshooting and triage skills across systems
- German: basic
- English: fluent
- German: fluent
136
Site Reliability Engineer Resume Examples & Samples
- Participate in service capacity planning and demand forecasting, software performance analysis, and system tuning
- Identify underlying root causes and provide recommendations or solutions for long term permanent fixes to critical production issues
- Develop effective documentation, tooling, and alerts to both identify and address reliability risks
- Participate in on-call rotation with other members of the site reliability / service engineering team
137
Bluemix Site Reliability Engineer Resume Examples & Samples
- Involvement in every facet of the platform
- At least 1 year experience in software development or engineering
- At least 1 year experience in application operation
- At least 1 year experience in problem determination
- Some coding skills, such as: Representational State Transfer (REST)/web services, distributed systems, messaging, knowledge of open source tools used in Cloud Foundry development (e.g. Git and Jenkins)
- At least 2 years experience in software development or engineering
- At least 2 years experience in application operation
138
Senior Site Reliability Engineer Resume Examples & Samples
- Experience with GPOs
- Live Site / Customer Support experience
- Experience automating and improving deployment service
- Experience automating and improving monitoring service
- Knowledgeable on software security concepts
- Test Design and Test Automation experience
- Excellent problem-solving and debugging skills with a solid understanding of testing practices. Strong communication skills
- Familiarity with and passion for agile/lean development and execution methods, including Scrum and Kanban
- Experience developing services using the DevOps model
139
Site Reliability Engineer Resume Examples & Samples
- At least 4 years of professional experience
- Azure experience
- IIS experience
- Experience as part of a 24x7 on-call escalation path
- Experience managing SSL/TLS certificates
- Virtualization experience
- Exploratory Testing experience
- Performance Testing experience
140
Site Reliability Engineer Resume Examples & Samples
- Design, write and deliver software to improve the reliability, scalability, latency, and efficiency of your services
- Solve problems relating to mission critical services and create solutions to prevent problem recurrence; with the goal of automating response to all non-exceptional service conditions
- Conduct periodic on call duties using a follow-the-sun model (on an as needed basis)
- Ability to build and drive consensus towards common goals and priorities through advanced impact and influence skills
- BS Degree in Computer Science, Electrical & Computer Engineering or Mathematics or equivalent experience
- 4+ years of experience and outstanding coding skills in C, C++, C#, Java, Python or similar
- 2+ years of Software Engineering and experience in testing, deploying and supporting large scale services on Azure, AWS or similar environments
- Capable of technical deep-dives into networking, service design, operating systems and storage, yet verbally and cognitively agile enough to engage in strategy discussions with leadership team members
- Experience in SDLC, distributed systems, networking, hardware, logistics and operations or capacity planning
- A passion for building and participating in highly effective teams and development processes
- Expertise in problem solving and analyzing global scale distributed systems and critical production service environments
- Strong debugging, testing / validation and analytics/SQL skills
- Fundamental understanding of TCP/IP, BGP, AnyCast and CDNs
- Experience defining and measuring internal/customer facing OLA/SLAs
- Statistics experience and bias for measurement and driving action with metrics
141
Site Reliability Engineer Resume Examples & Samples
- Configuring, installing, and verifying new gTLDs
- Configuring TLD changes, such as pricing changes, hub migrations/consolidations, and enabling premium tiers
- Identifying and recommending TLD onboarding process improvements
- Embracing and advancing an automation mindset, developing and executing automated tests
- Implementing TLD installation process improvements
- Defining and recommending key indicators of TLD system health (KPIs)
- Developing and maintaining visualization tools to report real-time system health (metrics)
- Developing and maintaining alerting mechanisms to report system health problems (monitoring)
- Proactively capturing, investigating, and reducing operational failures
- Efficiently researching issues using system logs such as Kibana logs, perfmon, eventvwr
- Quantifying scope and impact of operational failures to inform prioritization decisions
142
Lead Site Reliability Engineer Resume Examples & Samples
- Build production services that host millions of transactions every day and generate billions of dollars in revenue annually
- Define and ensure formal Service Level Objectives for production services
- Champion tools and automation to drive deployment, monitoring, and management of rapidly changing production systems
- Automate, scale, and support complex production services
- Implement application performance monitoring to ensure site uptime and performance
- Participate in a 24x7x365 support rotation
- 6+ years of engineering experience working on teams that ensure high availability for very large scale, revenue generating production environments
- Experience in Java, Tomcat, Database, Oracle, Kafka, REDIS, MQ, JBOSS
- Scripting and automation experience for deployments using shell, Puppet, GIT
- Experience in handling production issues and incident management
- Experience in performance engineering and tuning of applications
- Expert with production monitoring concepts and usage
- Demonstrated experience in executing/delivering projects in a dynamic, fast-paced environment
- Bachelor’s degree in Computer Science or related field
143
Site Reliability Engineer Resume Examples & Samples
- Core services (OpenLDAP, LDAP, SMTP)
- 5+ years of experience as a UNIX SA/SE (ideally Linux; should know the LAMP stack well)
- Python, Bash and/or Ruby scripting
- Minimum four years experience as a systems engineer; at least two years with cloud infrastructure technology
- Experience building and managing authentication systems such as OpenLDAP, FreeIPA, SAML
- Strong UNIX and TCP/IP fundamentals, particularly in cloud environments
- Fluent in at least one software development/scripting language, ideally Python but open to Bash and Ruby
- Fluent in at least one configuration management tool such as Puppet and/or Chef
- Architect and build our core internal services to scale and endure security threats
- Own our authentication backend: OpenLDAP + Onelogin + AD
- Build our cloud-based and physical systems platforms
- Assess our current infrastructure and evolve it to become cohesive and scalable
- Script and automate your way out of problems
- 70/30 split for duties: 70% design and 30% operations. The main focus for this engineer is security and deciding how to secure the systems in terms of authentication and cloud infrastructure
144
Site Reliability Engineer Resume Examples & Samples
- Core services (OpenLDAP, LDAP, SMTP)
- 5+ years of experience as a UNIX SA/SE (ideally Linux; should know the LAMP stack well)
- Python, Bash and/or Ruby scripting
145
Senior Site Reliability Engineer Resume Examples & Samples
- Influence feature design, architecture, standards & processes to ensure security, performance, operability & scale
- In-depth experience with Azure services such as SQL Azure, Compute, IaaS, PaaS, WAAD, VNET, ExpressRoute
- 5+ years’ experience in large scale internet service design & implementation
- Experience managing/troubleshooting Dynamics AX 2012/AX 7 environments is desired
146
Site Reliability Engineer Resume Examples & Samples
- Refining our automated software deployment and release management processes
- Responding to and resolving unexpected and potential service problems
- Writing software and defining best practices to prevent the same problem happening twice
147
Site Reliability Engineer Resume Examples & Samples
- Troubleshoot and analyze hardware, networks, application, and storage/DB related issues
- Cohesively work within a team to manage the monitoring and stability of our multi-cloud+Colo environments
- Build, configure, and proactively automate systems and services
- Install and support in-house, open-source, and 3rd party applications throughout the technology stack
- Collaborate with our Developers and Data Scientists on design concepts to help right-size, build, scale and automate production infrastructure solutions
- Provide peer support to other SREs and End Users
- Experience with any of the following tools we like: AWS, Ansible, Zabbix, Spark, Kafka, Docker
- Comfortable working in a dynamic startup environment
148
Site Reliability Engineer Resume Examples & Samples
- Enable SPs to grow their top line revenue by delivering new, innovative services to market quickly and consistently across a variety of end point devices
- Help SPs deliver new services safely and confidently by providing comprehensive security and integration across VMware and 3rd party ISV and SaaS solutions
149
Senior Site Reliability Engineer Resume Examples & Samples
- Serve as a primary point responsible for the overall health, performance, and capacity of one or more of our Internet-facing services
- Gain deep knowledge of our complex applications
- Assist in the roll-out and deployment of new product features and installations to facilitate our rapid iteration and constant growth
- Develop tools to improve our ability to rapidly deploy and effectively monitor custom applications in a large-scale UNIX environment
- Work closely with development teams to ensure that platforms are designed with "operability" in mind
- Function well in a fast-paced, rapidly-changing environment
- Participate in a 24x7 rotation for second-tier escalations
- B.S. or higher in Computer Science or other technical discipline
- 5+ years in a UNIX-based large-scale web operations role
- UNIX/Linux systems knowledge/administration background
- Trouble-shooting skills that span systems, network (TCP/IP), and code
- Programming skills (we like Javascript, shell, command line tools)
- Knowledge of most of these: data structures, relational and non-relational databases, networking, Linux internals, filesystems, web architecture, and related topics
- Experience deploying Node.js applications
- Experience with web-based Java/J2EE architectures and JVM configuration
- Exposure to MongoDB
- Strong interpersonal communication skills (including listening, speaking, and writing) and ability to work well in a diverse, team-focused environment with other SREs, Engineers, Product Managers, etc
150
Site Reliability Engineer Resume Examples & Samples
- Monitor and troubleshoot issues that arise on customer-facing infrastructure
- Liaise cross-departmentally in the event of an incident or escalation to or from ShiftOps
- Identify system data, hardware, or software components required to meet customer needs
- Verify stability, interoperability, portability, security or scalability of system architecture
- Analyze, install, acquire, modify and support operating systems, databases and software
- Linux/UNIX administration
- Scripting/data manipulation: Perl, Bash, or Python
- Scalability plumbing (LVS and other load balancers)
- Virtualization: Virtuozzo, Linux KVM, or OpenNebula
- Experience with Configuration management and version control
- OS: Solaris, CentOS, Debian, or Nexenta
- Network technologies: Foundry/Cisco, HP, Procurve
- Hardware: HP, DDN, Sun/Oracle, Dell
- Experience with Jira
- Filesystems: Ext2, Ext3, NFS, ZFS, DRBD, GlusterFS
- Web Servers: Apache, Nginx
- Email: Dovecot, Exim
- DNS: PowerDNS
- System/Network monitoring tools: Graphite, Observium, or Netflow
- Automated imaging systems: System Imager, PXE, Kickstart, Cobbler
- Remote management systems: iLO, ILOM, DRAC, iPEPS
- Hardware and Software RAID
- Databases (Configuration, access control, replication, tuning, other engines) MySQL/Percona, PostgreSQL, NoSQL technologies
- Working with Firewalls
151
Site Reliability Engineer Resume Examples & Samples
- Work in engineering team to design, build, and maintain cache layers, key-value, relational and binary file storage systems
- Diagnose and troubleshoot complex distributed systems handling petabytes of data and develop solutions that have a significant impact at our massive scale
- Participate in building advanced tooling for testing, monitoring, administration, and operations of multiple clusters across data centers, primarily in Python, C and Java
- Work and collaborate across teams such as Application Services, Linux Kernel, JVM and Capacity Planning, Hardware, Network, and Datacenter Operations to design next-gen storage platforms
- 5-7+ years of managing services in a distributed, internet-scale *nix environment
- Demonstrable knowledge of TCP/IP, Linux operating system internals, filesystems, disk/storage technologies and storage protocols
- Hands-on operational experience on managing cache services (memcache, redis)
152
Site Reliability Engineer Resume Examples & Samples
- Manage the availability, latency, scalability and efficiency of the Livestream platform, ensuring the platform has 100% uptime
- Participate in capacity planning and forecasting, system performance analysis, and system tuning at application, database, filesystem and networking layers
- Manage backups and disaster recovery, including backup monitoring and verification, and leading restoration tests and disaster recovery drills
- Analyze and improve system security policies on all layers of the platform; track and handle vulnerabilities affecting it
- Help manage physical datacenter infrastructure, networking and operations
- Build, manage and maintain all the base OS images and system configurations
- Bachelor’s Degree in Computer Science or related field. In lieu of degree, relevant skills or equivalent experience
- Fluency in at least one of the following languages: C, C++, Perl, Python, Ruby
- Familiarity with at least one of the following languages: Perl, Python, Ruby, Lua, JavaScript
- Ability to write scripts using shell, awk, sed and other core Linux tools
- Knowledge of essential Linux system calls, signals and memory management
- Experience in low-level system debugging and performance measurement tools
- Expert knowledge of IPv4 networking and routing protocols
- Experience with automation and working with monitoring tools including cloud monitoring tools
- Knowledge of SQL & NoSQL technologies
153
Site Reliability Engineer Resume Examples & Samples
- Contribute to design, write and deliver software to improve the reliability, scalability, latency, and efficiency of your services
- Influence and contribute to new designs, architectures, standards and methods for large-scale distributed systems
- Experience in SDLC and Agile projects
- Expertise in problem solving and analyzing distributed systems and critical production service environments
- Debugging, testing / validation and analytics/SQL skills
- Big data experience preferred (COSMOS, Hadoop)
- Fundamental understanding of OSI model/stack
- Firm sense of accountability, ownership and initiative for end-to-end project lifecycle with solid project management skills
- Strong communication and collaboration skills to work with people from a variety of technical backgrounds
- Experience defining and measuring service key performance indicators
- Statistics experience and bias for measurement and data driven improvements
- The ability to drive Live Site issues and repair items to resolution
154
Principal Site Reliability Engineer Resume Examples & Samples
- Minimum of a HS Diploma or Associate’s Degree and/or
- A minimum of 10 years of technical experience along with established leadership credentials across technologies
- Exceptional knowledge and experience deploying and managing Cloud technologies
155
Senior Site Reliability Engineer Resume Examples & Samples
- Bachelor's degree in Computer Science, Information Management or other technical / IT field, OR in lieu of a degree: a HS diploma/GED and minimum 4 years of IT experience or equivalent military experience/training
- Minimum 3 years IT experience in enterprise-wide deployments
- Demonstrated experience scripting or developing software and services for the cloud (Ruby, Python, Go, Java, Node.js, .NET, etc.)
156
Site Reliability Engineer Resume Examples & Samples
- Will configure and operate RHEL, CentOS, CloudLinux, Windows Server 2008/2012 and other mainstream distributions
- Web and Application Server Technologies – Understanding of the configuration and management of Apache, IIS, Nginx, Tomcat, Sendmail, Exim, ProFTPD, etc. Will implement full LAMP stack configurations as well as Windows-based stacks
- Virtualization Technologies – Will work with OpenStack, KVM, and implementation of Linux / Open Source technologies on the platform
- Config and Automation Management – General operational understanding of common Configuration Management technologies employed on Open Source and Proprietary platforms such as Puppet, SpaceWalk, SaltStack, SCCM and Fabric
- Will work with SQL and NoSQL database technologies, such as MySQL and Cassandra, with a focus on basic troubleshooting and query abilities
- Performance and Reliability Management – Will troubleshoot and make recommendations for improvement, and will write basic scripts and applications in PowerShell, Python, C#, and other industry standard languages
- Average 3-5 years experience in a large scale production environment (1000+ servers)
- Strong analytical thinker and problem-solver
- Organized, detail-oriented and able to multi-task
- Exposure to LeanIT concepts, DevOps, and Agile methodologies
- Experience with mainstream Open Source and Microsoft technologies
- Advanced understanding of Linux and Windows operating systems
- Experience with virtualization technologies (KVM, OpenStack, Virtuozzo, VMware, Xen, etc.)
- Exposure to database technologies (MySQL or MSSQL)
- Basic experience with monitoring technologies and methodologies
- Exposure to SAN and NAS technologies (iSCSI, FCP, CIFS, NFS, etc.)
- Exposure to scripting and coding in industry standard languages (Python, PowerShell, C#, Ruby, etc.) and code management methodologies (Git, SVN, TFS, etc.)
- Recommended Certifications: Red Hat RHCSA, LPIC-2, ITIL Foundations, Cisco CCNA, Network+
157
Site Reliability Engineer Resume Examples & Samples
- Excellent knowledge and experience in Software Engineering, System Administration, and Operations
- Experience developing in any of the following languages (Java, Javascript, C#, Python, Go)
- Understanding of Unix/Linux systems from kernel to shell and beyond, including internal Unix systems and networking (DNS, TCP/IP, UDP, etc)
- Experience designing and implementing tasks in Continuous Integration systems (Jenkins, Travis, CircleCI, etc.)
- Strong grasp of security, privacy and monitoring concepts
- Strong sense of project ownership and team responsibility
- 5+ years of relevant working experience and at least 3 years in a DevOps / Site Reliability Engineer role
158
Site Reliability Engineer Resume Examples & Samples
- Knowledge of MapReduce, SaaS, PaaS
- Knowledge of building/compiling code using GNU make
- Knowledge of profiling and debugging Java/GCC applications
- Knowledge of managing services on AWS
- Knowledge of tools like Jenkins, Git, Nagios, PostgreSQL, JIRA, Hadoop, Kafka, Flume
- Certification in database administration, RHEL administration, network administration or any other valuable certification would be an added plus
159
Site Reliability Engineer Resume Examples & Samples
- Manage the scalability, performance, and availability of MediaMath platform APIs by solving for reliability against existing systems and services spanning the entire stack
- Develop tools and automation to minimize delivery time and increase developer productivity
- Participate in the design and development of new and evolving services, architecture, and performance standards
- Support team members in the development of a SOA strategy and migration path
- Participate in capacity planning and service performance analysis and tuning
- Respond to and resolve emergent issues. Be on-call periodically as part of shared team
- This is not an exhaustive list of responsibilities. Other duties may be assigned, as needed. MediaMath retains the right to change job duties at any time
- As part of our global technology team, you may be required to work off-hours or be on-call on a rotating basis
- You are considered a “security employee” and have a particularly noteworthy security aspect to your role and are required to undergo additional training annually
- Administer and ensure logical security in carrying out all job duties
- Support in Security Incident response and monitoring, as needed
- 3-5 years of relevant work experience, including experience with high-volume, production distributed systems environment
- Fluency with Python, Perl, Shell, Ruby, Scala, Go, or similar
- Experience managing and deploying full stack, distributed services
- Experience with system automation tools such as Ansible, Chef, Puppet, Salt Stack, etc
- Experience with monitoring, alerting, and pipeline analysis tools such as Nagios, Sensu, Graphite, Riemann, Logstash, etc
- Expertise in the use and optimization of SQL
- Experience with queuing/data-pipelining solutions such as Storm, Kafka, RabbitMQ, ZeroMQ, etc
- Experience with systems such as PostgreSQL, MySQL, Cassandra, CouchDB, Redis, and Memcached
- Exposure to AWS and OpenStack APIs preferred
160
Principal Site Reliability Engineer Resume Examples & Samples
- Takes ownership of all issues and does not out-source issues to engineering, but collaborates with Engineering as appropriate. Identifies methods and opportunities to build automation into RCA. Drives resolution activities and automated remediations to ROCC
- Identifies automation enhancements and associated use cases. Constructs work plans to develop and deploy. Identifies resource needs and schedule. Applies impact analysis techniques. Leverages Agile methodologies to code required result
- Leads Change Management activities in a Cloud environment by developing complex test plans and seamless deployment processes. Able to identify impact analysis and quality control validation of changes. Defines agile automation techniques to rapidly deploy changes
- Primary interface to Product Engineering, Service Engineering, and SRE Development for continuous integration and delivery of product changes
- Provides technical expertise in developing solutions to complex software engineering and cloud operations problems, which require frequent use of ingenuity and creativity. Provides work leadership to others. Interfaces with senior management to provide and obtain information and gain consensus regarding project direction. Understands industry tools and trends and their relevance to the Cloud industry
- Collaborates well with other engineers and other engineering groups, voluntarily shares information. Submits Customer and Field Features for future product releases. Ability to probe into issues and opportunities to define requirements and scope. Links technology to business value
- Exercises considerable latitude in determining technical objectives, without appreciable direction. Understands agile software development methodologies and has senior development skills. Understands the value of software and associated time savings. Offers proposed design changes/suggestions to processes and products, exerts significant latitude in determining objectives of an assignment. Accurately analyzes current environment and business requirements to define solutions. Spends 50% of time with software development. Uses proper modeling and documentation techniques
- May be accountable for overall product and/or serve as a customer advocate. May represent organization as principal customer contact. Able to drive technical solutions and provide clear requirements to supporting resources
- Interacts cross functionally on matters that require coordination across functional/organizational lines. Accurately scopes technical level of effort. Leverages tools and provides clear and concise updates to project leaders
- Significant contributor to organizational goals and objectives. Demonstrates core values and leads by example
- Writes functional detailed design specs as well as responding to requirement documents and system level test plans. Accurately estimates both level-of-effort and resource skill sets at the assignment level. Accurately documents development, operation processes, and incident remediation activities
- Understands and adheres to cost/delivery/quality targets established during the program design phase. Applied understanding of process flow and how activities influence downstream processes. Uses metrics to plan and control projects
161
Site Reliability Engineer, Agent Lifecycle Resume Examples & Samples
- Working with Java or Kotlin
- Operationalizing different datastores
- Troubleshooting in a complex environment
- Foundation in systems knowledge including some of
162
Site Reliability Engineer Resume Examples & Samples
- Improve the predictability and reliability of software releases with the implementation of automated build, test and deployment tools and processes
- Engage with Software Engineering and Architect Teams to ensure Release Engineering best practices are implemented
- Provide afterhours release and change control support based on the most current change control schedules
163
Senior Site Reliability Engineer Resume Examples & Samples
- Create and deliver automation software required for improving the functionality, availability, and manageability of our Cloud collaboration micro-services using Python and Go language
- Basic networking skills and familiarity with Unix/Linux systems including CLI used in checking component status and logs
- "Cloud" (using IaaS and PaaS) experience desirable
164
Site Reliability Engineer Resume Examples & Samples
- Providing support for the applications, scripting, automation, managing Incidents, Changes and Day-To-Day support in Production, QA and Development environments
- The majority of the job responsibility will be focused on Production Application support but there will be some script development and automation work (25-50%)
- The job will from time to time require work during off hours, weekends and night shifts, as this is a 24x7 environment
- Maintaining strong links with the 1st Level, DCIS and the Development teams
- Management of change required to support product enhancement and growth
- Following departmental change management procedures in defining, planning and implementing changes in such a way that service disruption is minimized, and adherence to Service Level Agreements is ensured
- Following defined procedures in order to log and track issues
- Minimum 3-5 years experience with various software applications and development
- Good analytical skills (performing root cause analysis and creating solutions to resolve incidents)
- Strong communication skills - this role will involve communicating regularly with a wide range of people of all professional levels and cultures
- Ability to code in one or more scripting languages (Python, Java etc) and open platforms like CDAP and Docker
- Ability to work independently in a challenging environment to meet product schedules and deadlines
- Good understanding of Networking, BigIP and DNS
- Experience with scripting and automation (monitoring, configuration and deployment automation)
- Experience implementing system and application monitoring (familiarity with monitoring tools like Kibana)
- Experience with cloud-based systems (like OpenStack and AWS)
- System Administration experience (Linux)
- Experience with Big Data technology (Hadoop, Elasticsearch, HDFS, etc.)
- Familiarity with open text-based formats like JSON, and with web services
- Familiarity with Oracle Databases
165
Site Reliability Engineer Resume Examples & Samples
- SREs are engineers with the right mix of knowledge and skills in software (i.e. programming, data structures and algorithms) and systems (i.e. operating software on internal and external infrastructure at scale)
- We constantly evaluate products and services before and after production releases to prevent, identify and fix problems that impact service availability in deploying, configuring, releasing, monitoring, recovering, and scaling
- We dedicate at least 50% of our time applying software engineering principles to resolve problems impacting service uptime or our operational efficiency
- Experienced in writing clean code in at least one Object Oriented language such as: Java, C#, JavaScript, Python, Ruby
- Experienced automating software build, deployment and server configuration management using tools such as Puppet, Chef and Jenkins
- Expertise in Windows and Linux system administration, databases (relational and NoSQL), web servers (Apache, IIS), networking and storage technologies
- Experience with Cloud technologies and platforms such as AWS or Azure
166
Senior Site Reliability Engineer Resume Examples & Samples
- 5 plus years experience administering production Linux environments
- Demonstrated proficiency writing automation in Python, Ruby, or a similar language
- Resourcefulness and independence when required to find solutions to new problems
- Ability to incrementally improve legacy configuration code
- Expertise in writing clearly about work scope and status
- Experience working with widely distributed teams
167
Site Reliability Engineer Resume Examples & Samples
- Work closely with technical writers and service engineers to ensure that our best-in-class documentation is delivered reliably and on time
- Develop and maintain build systems and automation tooling
- Develop and maintain internal tools to support authoring and collaboration
- Maintain both integration and production infrastructure as part of an operations-focused culture
- Work closely with other members of the technical content team to provide systems support, training, and solutions as needed
- Collaborate with the user experience team to ensure that technical content (including documentation, contextual help, UI text, and other forms of assistance) is discoverable, usable, and pleasing to the customer
- 7+ years experience
- Deep knowledge of at least one of the following scripting languages: Python, JavaScript, Bash
- Knowledge of several of the following technologies: CI/CD systems (TeamCity/Jenkins), Docker, Linux OS (Oracle Linux/RedHat), infrastructure-as-a-service (AWS/OpenStack/Google Cloud/Azure), web servers, load balancers, log analysis tools, remote system and network debugging tools
- Impeccable written English skills
- Strong team player with outstanding communication, organization, and interpersonal skills
- Comfort with agile, swiftly changing, dynamic software development situations
- Ability to drive, follow, and evangelize cross-team processes
- Knowledge of cloud infrastructure concepts and technologies
- Experience using distributed source code management systems such as Git
- Experience using enterprise-grade bug tracking systems, such as JIRA
- Experience with (and commitment to) capturing and maintaining institutional knowledge
- A Bachelor's degree in a Computer Science-related field, or significant work experience in startups or fast-paced enterprise technology development environments
168
Site Reliability Engineer Resume Examples & Samples
- Experience developing for and supporting specific Azure technologies (ex: Cloud Services, Azure SQL, Azure Service Bus, Azure Storage, KeyVault, Service Fabric, Azure Active Directory, etc.)
- Experience developing for and supporting an on-premise Microsoft stack (ex: Server 2012 R2, Active Directory, IIS, SQL Server, Hyper-V, CRM)
- Experience supporting Live Site as part of a 24x7 on-call escalation path
- Strong understanding of automation practices in a DevOps space (ex: automated deployments, synthetic transactions, monitoring, etc)
- A communicative team player with a can-do attitude
169
Site Reliability Engineer Resume Examples & Samples
- Work within various areas of focus (e.g. monitoring, secrets management, deployment pipeline, containerization, etc.) and research, strategize, and propose solutions that meet requirements, reduce friction for product engineers, and consolidate existing solutions
- Drive adoption and onboard teams to Delivery Engineering tooling and solutions
- Contribute to designing, implementing and maintaining team tooling
- Work closely with engineering teams to learn about needs, current process and to promote best practices
- Migrate application stacks to cloud infrastructure
- Solid programming and troubleshooting skills. You may be called upon to help with systems written in Go, Python, Java, Scala, PHP, Ruby, among many other programming languages. We don't expect you to know everything but we expect you to learn quickly
- A good understanding of databases. Both relational and otherwise
- An understanding of cloud based deployments on Amazon Web Services or Google Cloud Platform
- Strong grasp of multi-tier application architecture & concepts of networking, load balancing, monitoring and *nix OS
- A passion towards automating things. We love repeatable processes and know that humans are prone to error. We'd like to automate deployments, monitoring releases and even brewing our coffee
- An understanding of an application that is one of our goals to move our systems in that direction
- A high degree of interest in Linux containers and smart clustering solutions like Kubernetes/Mesos/fleet, etc
- A bias towards helping people. Many teams will rely upon you for help to build their systems
170
Senior Site Reliability Engineer Resume Examples & Samples
- Create enterprise infrastructure and tooling
- Monitoring and diagnosis of systems for optimal performance
- Generating well defined and documented standard processes for the enterprise
- Research and development of new technologies related to our problem set
- Occasional presentations and training of integrated technologies
- Full stack developer, fluent in multiple languages (Python, Perl, Bash, Go)
- Experience with system automation tools (Ansible, Puppet, Chef)
- Exposure to SQL and NoSQL solutions (MySQL, Postgres, Redis, Cassandra)
- Strong cloud skills, AWS and GCP
- Experience with containerization and virtualization technologies (Docker, Kubernetes, CoreOS)
- Hands-on experience with system configuration (Nginx, Apache, Linux, consul, etcd)
- Queuing and data-pipeline solutions (RabbitMQ, ZeroMQ, pub/sub, SQS)
- Effective communicator with a desire to share and guide others
171
Site Reliability Engineer Resume Examples & Samples
- Support the daily build and release needs of agile teams
- Coordination with the development organization on production deployments
- Provide support and rollback facilities in a timely manner
- Develop suggestions for enterprise-wide best practices
- Post release monitoring and validation
- Professional experience with CI/CD solutions
- Good understanding of CI/CD principles (Jenkins, Bamboo)
- Master VCS skills (git)
- Good experience in a Linux environment (Multiple distros, CentOS, Debian, RedHat)
- System scripting experience in multiple languages (Python, Perl, Go)
- Comfortable automating builds and releases with Make or other tools
- Working experience with Atlassian tools a plus (JIRA, Confluence)
172
Senior Site Reliability Engineer Resume Examples & Samples
- Appropriate CS and IT technology technical background (Bachelor in computer science or equivalent)
- Strong knowledge of Linux operating systems (CentOS/RHEL) and configuring common core services, debugging. Knowledge of BSD a plus
- Strong knowledge of L1-4 Networking, Switching/Routing, L2-7 reverse proxy and proxy load balancers, firewalls, DNS/DHCP, TCP/IP stack. Should have basic knowledge of OSPF, BGP, SNMP, and SMTP
- Must demonstrate strong skills in L7 debugging and analysis. Knowledge of curl and other tools to diagnose and differentiate L1-4 issues from L7 HTTP, HTML, JS issues. Should be familiar with diagnosing HTTP header and caching issues
- Strong knowledge and experience with SOA/RESTful/JSON environments running Node.js, vertx.io, Tomcat, Apache, NGINX, Varnish, memcached, Redis, Ruby, Go, Python
- Strong knowledge and experience with version control (Git), deployment tools (Capistrano) and Continuous Integration technologies (Jenkins, Puppet, Chef)
- Experience with transactional databases (MySQL, Postgres) configured for high availability and redundancy
- Ultimate self-starter
- Experience in handling production outages and root cause analysis
- Strong crisis management leadership ability
- Strong and effective written/verbal communication skills, whether talking to individual contributors or to executive management
- Skill with Ruby, Python, or Java a plus
- Experience creating tools for infrastructure (IaaS and PaaS) management and automation a plus
- Part of a Global TDO team responsible for overall site availability and reliability with rotating shifts and follow-the-sun global support. TDOs are not on-duty longer than 8 hrs at a time
- Works with the Global Systems Operations Center staff (SOC) (Tier 1-2 support), and SRE team (T3-4 support) to prioritize issues and ensure adequate follow-up to issues
- During an incident, leads efforts to triage and mitigate impact globally. After an incident, responsible for incident reviews and action items for follow-up/restoration in order to improve overall service stability
- Manage real-time communications during outages with both technical and non-technical audiences
- Evangelize Best Practices to the rest of the company
- Develop policies and procedures that improve overall product stability and availability
- Design and create tools to help manage site services, and host monitoring/alarming
- Participate in Incident Reviews of outages in order to improve overall product stability
- Build relationships with development teams and technology leaders across the company
173
Site Reliability Engineer Resume Examples & Samples
- Experience managing Linux systems in a 24/7 production environment
- Ability to program in Python, Ruby or Perl highly preferred
- Working knowledge of multi-tier applications and their dependencies including load balancing, TCP/IP networking, web services, LDAP and DNS
- Proficiency with web server administration including Apache and Nginx highly preferred
- Knowledge of database support and administration including MySQL, Postgres & HBase
- Experience with monitoring tools such as Nagios and Graphite highly preferred
- Develop and maintain automation for system administration and application management
- Experience with configuration managers such as Puppet, Chef or Ansible highly preferred
- Excellent interpersonal and communication skills demonstrated through previous projects or assignments (work or academic related)
- Network administration experience a plus
174
Site Reliability Engineer Resume Examples & Samples
- Experience making hardware decisions based on workload projections
- Desire to work closely with both team members and other people throughout the company
- Proficiency writing clearly about work scope and status
- Ability to clearly communicate technical concepts and reasoning to both technical and non-technical audiences
- 2 plus years experience administering production Linux environments
- Proficiency writing automation in Python, Ruby, or a similar language
- Ability to perform physical activities, including, but not limited to, walking, standing, lifting items up to 35 lbs. and using, handling and controlling tools
175
Site Reliability Engineer Resume Examples & Samples
- Evolve our continuous deployment infrastructure. Our deployment infrastructure is the lifeline tying our development teams to the True Fit base. We rely heavily on its success, and will look to you to help provide guidance and direction for its growth
- Programmatically build and administer cloud-based Linux servers (Ubuntu, RHEL) on AWS. True Fit lives entirely in the cloud. We need someone who understands what this means and is comfortable navigating these seas. We need you to know what things like RDS, EC2, CloudWatch, Lambda, Route53, and VPC mean, and to be willing to use their APIs
- Architect deployment systems for our application software. True Fit utilizes the cutting edge in analytics engines and methods in our systems. We need someone willing and able to build and utilize a varied knowledge base to support what we do
- Automate the world. We're a lean shop, and we work at a breakneck pace. If it needs doing more than once, it needs to be automated. Build, deployment, monitoring, testing, and infrastructure are all within our automation sphere
- Analyze and troubleshoot network and infrastructure issues. True Fit's environment demands a sharp mind and rapier analytical skills. We run into issues now and again, and we rely on smart people to tease a problem apart to find elegant solutions
- Monitor and measure system performance. The heart of any well-tuned system is a known system. True Fit must understand what's going on under our hood and requires someone mindful of details
- Work with other departments to design and build operations-friendly software. While our operations infrastructure may provide the guts of the True Fit machine, our product & support people, engineers, and scientists, provide the heart, mind, and soul of what we do. We'll need a person who can liaise with other departments, understand their needs, and collaborate to find solutions
- Participate in our agile development environment. We maintain a fast moving, but tight ship. We need a person that can pull recent changes, commit a fix, and push a working system back into our infrastructure
- 2+ years as a system administrator, network engineer, build engineer, or software developer. True Fit is looking for a person ready to get their hands dirty. Vast seas of knowledge are not a requirement, but we do look for a quick and inquisitive mind, and one capable of applying what is learned. A development background is preferred
- Experience in an environment that develops and releases commercial software products in hosted environments. We're looking for someone that understands an agile release process and can contribute to the support of our systems release processes
- Proficiency in the Open Source Ecosystem. We're looking for a person steeped in the open source tea and proficient in that ecosystem. You can expect to see and be responsible for various different systems' smooth functioning
- Expert scripting skills in at least one of shell, python, perl, ruby, etc. We live and breathe by our code and processes. We need someone that can speak our language
- Some datastore knowledge. True Fit's data collection is vast and varied. Ideally, you'll have relational database acumen. Postgres & Oracle are preferred. NoSQL knowledge a la Mongo or Hadoop would be a plus as well
- Knowledge of configuration management / desired state frameworks. Our systems are built by our code. We're looking for a person with knowledge of Chef, Puppet, Ansible, etc
- Working knowledge of application technologies including JavaScript, HTML, Scala/Java, C++. Our operations department participates in many of the company's core functions. The more you can understand of what's being said, the better
- Undergraduate degree in a quantitative field (Math, Physics, Engineering, Computer Science) or relevant experience. We find a STEM background or relevant industry experience sets people up for success in this position
- Strong listening and communications skills. We need to understand what's up, and for you to do the same! What's needed? Where are we going? How are we getting there? How's our progress?
- Highly motivated self-starter with a can-do attitude. We need you to get your hands dirty and to go shoulder deep when required. We also need a person able to suss out where the work needs to be done
176
Site Reliability Engineer Resume Examples & Samples
- Keep the customer facing services available at top performance by monitoring the health of the system
- Build the tools we need to monitor our system and respond quickly
- Develop systems to monitor the capacity of our applications and work to solve these capacity issues before they become a problem
- Lead Pardot Engineering by identifying areas for improvement in reliability and prototyping these solutions
- Collaborate with other teams to discuss and resolve technical issues and escalations
- Expertise in TCP/IP related technologies (networking protocols, network programming, etc.)
- Prior Chef or other configuration management experience
- Experience working on large-scale systems with many moving parts
177
Site Reliability Engineer Resume Examples & Samples
- At least 2 years’ experience in troubleshooting complex systems, including OS, Network, and Application code
- At least 2 years’ experience in coding in at least one modern language such as Python, Ruby, NodeJS
- At least 2 years’ experience with UNIX/Linux systems
- Experience with DevOps, Continuous Delivery, Continuous Deployment
- BS in computer science
- Experience with SCM systems like Git
- Basic security knowledge
- Database knowledge including SQL and NoSQL
178
Site Reliability Engineer Resume Examples & Samples
- DevOps Engineer
- Reliability Engineer
- Systems Engineer
- Cloud Platform Engineer
179
Site Reliability Engineer Resume Examples & Samples
- Ensure service reliability and uptime for Barracuda Cloud services
- Ensure consistent application of operational standards across cloud services
- 2 to 4 years proficiency in Linux/Unix command line and understanding of package management on Linux systems
- Demonstrated programming skills in one or more of: Bash, Python, Perl, PHP, Ruby, C, Java
- Understanding of network OSI model
- CS/Engineering Degree or 2+ years of job experience
180
Site Reliability Engineer Resume Examples & Samples
- 3+ years proficiency in Linux/Unix command line and understanding of package management on Linux systems
- Demonstrated programming skills with scripting languages such as Python, PHP, Bash, Ruby, or Java
- Experience with configuration management systems such as Puppet
181
Senior Site Reliability Engineer Resume Examples & Samples
- #1 responsibility: solving problems with code (automating solutions to common problems, designing tools, troubleshooting production product code to find efficiencies, etc)
- Collaborate with internal groups to identify, develop, and deploy manageable, scalable and robust services
- Represent Cloud Engineering in design reviews and operational readiness exercises for new and existing services
- Drive standardization efforts across multiple disciplines and services throughout the organization
- Identify and drive opportunities to improve automation for the company; scope and create automation for deployment, management and visibility of services
- 5 to 7 years experience programming with languages such as: Python, PHP, Ruby, or Java
- 2 to 3 years proficiency in Linux/Unix command line and understanding of package management on Linux systems
- 1 to 2 years experience with configuration management systems such as Puppet - experience designing and implementing configuration management systems, deploying production services using config management
- Experience implementing service monitoring and alerting using tools such as Nagios
- Ability to find and troubleshoot issues across the entire OSI stack
- Track record of successful practical problem solving, excellent written and interpersonal communication, and documentation skills
- Demonstrated ability to manage projects and deadlines
- Experience with database administration, replication, troubleshooting production performance problems is a plus
- CS/Engineering Masters Degree and 4+ years of job experience or 8+ years of job experience
182
Site Reliability Engineer Resume Examples & Samples
- Detective: SREs troubleshoot problems in live production systems, both on their own and in collaboration with systems and application engineers
- Ambassador: Keep the company informed about the status of Fitbit services, the impact of known issues, and the progress of ongoing investigations
- Developer: Design and refactor parts of the Fitbit backend system for stability and performance, and write tools and scripts to automate maintenance and monitoring tasks
- Coach: Meet with other teams and attend architecture reviews, and offer advice on how to implement features that are efficient, highly available, and fault-tolerant
183
Site Reliability Engineer Resume Examples & Samples
- Strong Linux system administration experience and production troubleshooting experience
- Expert-level Java knowledge
- Experience working with high traffic, scalable web applications
- Experience of diagnosing and fixing complex production issues
- Understanding of data structures, algorithms and framework internals
- Good verbal and written English
184
Site Reliability Engineer Resume Examples & Samples
- 2+ years or equivalent experience, 5+ for senior engineers
- A strong background in Linux software development
- Experience working with Mesos is a plus!
- Working experience with web-scale deployment tooling (Chef, Puppet, Ansible, etc)
- Knowledge of containerization (Docker) and web services frameworks
185
Senior Site Reliability Engineer Resume Examples & Samples
- Deploy and manage solutions efficiently on private or public cloud, and ensure they meet established SLAs
- Maintain database performance by investigating, debugging, identifying and resolving production issues
- Expand existing automation tools to streamline onboarding.pment and the Adobe Campaign platforms
- Design system solutions of the product from custom requirements for customers
- Working experience/knowledge of public cloud services such as Amazon Web Services (AWS) or Microsoft Azure
- Automation tools (Ansible, Salt, Chef or Puppet)
- Good knowledge of virtualization technologies and container technologies
- Experience with Postgres (Deployment, maintenance, and scalability)
- Experience with email-related technologies including authentication, accreditation, DNS, SMTP, FBLs, and MTAs
186
Senior Site Reliability Engineer Resume Examples & Samples
- Own and scale EAA Cloud services
- Analyze and improve the efficiency, scalability, and reliability of our backend systems
- Automate new cloud service deployments
- Participate in an on-call rotation backed by our engineering teams
- Perform advanced troubleshooting and monitoring of our systems to ensure adequate SLA and capacity requirements are met
- Proactively monitor the EAA cloud service for potential issues
- Analyze logs and events from the cloud service and provide recommendations for ensuring smooth service operation
- BS with 8+ years or MS with 6+ years in Computer Science / Engineering or equivalent
- 6+ years’ experience managing the setup, configuration and monitoring of different components of AWS infrastructure systems like EC2, Route53, VPC using Boto/Boto3
- 3 years of shell scripting and at least one other scripting language. Python preferred
- 4+ years in development of automated deployment scripts using tools like Ansible, Puppet and Chef
- 5+ years with setup or configuring load balancers, Nginx/Apache proxies
- 5 years with configuration and maintenance of common infrastructure such as Postgres, Cassandra, Storm, Elasticsearch, Rabbitmq, Celery, Active Directory
- 4 years with monitoring/reporting technologies e.g. Sensu, Nagios, Graphite, Elasticsearch, Logstash, Kibana
- 4 years with performance, scalability, and reliability issues of 24x7 commercial services
- 3 years implementing SOC2 and PCI
- 2 years with AWS S3 and IAM services
- 5 years with troubleshooting
- 5 years with Linux. Ubuntu preferred
- Good hands-on knowledge of Internet protocols (HTTP, DNS, TCP/IP, ICMP, DHCP)
- Experience in: Configuring web servers (NGINX/Apache/IIS) and the infrastructure necessary to support enterprise web applications (load balancers, Application Delivery Controllers, SSL-VPN solutions, SQL and databases etc.)
- Familiarity with Cloud computing services like VMware, OpenStack, Microsoft Azure
- Excellent written and verbal communication skills
187
Senior Site Reliability Engineer Resume Examples & Samples
- Extensive experience in designing, configuring, and delivering large scale application technical infrastructure
- Experience as a project lead, supporting multiple simultaneous projects, in high scale environment
- Strong coding and scripting ability (Java, C, C++, Python, Perl)
- Strong experience with database technologies (Oracle, Mongo preferred)
- Experience and knowledge applying best practices to build secure platforms
- Excellent analytical and creative problem solving skills
- Must be highly collaborative and able to work with different teams
- A strong sense of focus and excellent attention to detail while working in a very fast-paced environment
- Ability to learn new technologies in a short time
- Strong communication skills and ability to articulate complex solutions well
188
Site Reliability Engineer Resume Examples & Samples
- Front office application support – basic knowledge of infrastructure helpful but not critical
- Aptitude for and willingness to learn – almost all home grown (expect to train on application portfolio from ground up)
- Logical problem solver (how they answer questions/process/must be a good listener)
- Linux – Python / Perl / shell scripting
- Monitoring – mostly home grown apps – will train + Anturis
- Basic understanding of OSI stack and ITIL fundamentals – will require ITIL certification eventually
- Communications important (with business users: traders)
- Ability to work in high pressure environment – must be thick skinned
- Will be co-located with traders – will create tickets
- Basic tech screening is less important than mind set
- SW engineering background helpful
- Financial background helpful (derivatives)
189
Site Reliability Engineer, Self-driving Resume Examples & Samples
- BS in Computer Science or a related field. An MS or equivalent is preferred
- Programming skills in at least one of Python, C, C++, or Java
- Strong skills in process, documentation, and change management
- Excellent interpersonal and customer-facing skills
190
Site Reliability Engineer Resume Examples & Samples
- Contribute to a team responsible for the availability, scalability, and performance of our enterprise platforms
- Build and maintain automation systems to help us manage our rapidly growing infrastructure
- Gain deep knowledge of our complex applications to develop a bird's eye view of our platform
- Assist our Software Engineering teams to ensure proper monitoring and metrics are being built into the applications
- Maintain and develop custom systems and tools to improve our ability to deploy, automate, and effectively monitor custom applications in a mixed Windows/Linux environment
- Assist in the rollout and deployment of new product features and installations to facilitate our rapid iteration and constant growth
- Lead troubleshooting of issues that occur in our production environments
- Gain and use knowledge of monitoring systems and configuration management systems (AWS-specific tools, Puppet, Nagios, New Relic, etc)
- Troubleshoot issues across the whole stack - hardware, software, applications and network
- Partner with development teams to build the standards by which we deliver our infrastructure
191
Site Reliability Engineer Resume Examples & Samples
- Strong understanding of the Linux operating system
- Ability to code or script automation in at least one language (Java, Go, Python, Ruby, Perl, Bash, etc.) on Linux-based platforms
- Deep experience in at least one infrastructure component (operating systems, compute, storage, networking, data center, distributed systems, big data, cloud, etc.) and solid understanding of the rest, and how they impact services
- Familiarity with cluster management tools such as Mesos, Docker Swarm, Kubernetes, Marathon, Aurora
- Familiarity with distributed storage and filesystems such as CEPH, HDFS, GFS, IPFS
192
Senior Site Reliability Engineer Resume Examples & Samples
- Installs, supports and maintains new software infrastructure for SaaS deployment
- Monitor the SaaS environment and work with QA, Developers, hosting company to identify and troubleshoot service problems
- Analyses and resolves SaaS infrastructure faults and undertakes routine preventative measures, including backups; implements, maintains and monitors network security
- Contributes to Security and Risk management initiatives, including business continuity planning for SaaS deployments
- Ensures that failover mechanisms are in place and are working correctly
- Configures and maintains hardware and software monitoring solutions to send alerts in case of service problems
- Works with all other departments in the company to gather requirements and coordinate activities
193
Site Reliability Engineer Resume Examples & Samples
- 3+ years of professional software development in Scala
- Production-level operations experience across JVM applications
- Experience monitoring distributed systems application architectures
194
Senior Site Reliability Engineer Resume Examples & Samples
- Experience using AWS, especially provisioning EC2 nodes, and setting up CloudFront distributions
- Unix/Linux administration
- Scripting (Bash/Python)
- HTTP server configuration (NGINX/Apache)
- Configuration management (Terraform/Puppet)
- Experience deploying apps using Docker containers
- Node health monitoring (DataDog or similar)
- Web traffic analysis with Splunk or similar
- Continuous deployment using Bamboo, Bitbucket Pipelines, or similar
195
Senior Site Reliability Engineer, Hipchat Resume Examples & Samples
- Scripting and software development across multiple programming languages
- Network optimization and troubleshooting: TCP/IP, UDP, ICMP, MAC addresses, DNS, OSI layers, and load balancing
- Building, automating, and maintaining infrastructure in Amazon Web Services
- Development and maintenance of configuration management systems responsible for thousands of hosts
- Leading a team of engineers in troubleshooting service outages affecting millions of users
- Implementing system and application level telemetry for large distributed cloud architectures
- Diagnosing and resolving capacity problems in high-throughput web applications and network services
196
Site Reliability Engineer Resume Examples & Samples
- Work with engineering teams to design and write code to create systems which are highly available and able to scale seamlessly
- Plan for and eliminate any potential threats to stability, availability or security
- Improve monitoring, alerting and resilience of systems
- Write tools to assist work such as capacity planning or improving the ability to debug production issues over distributed systems
- Contribute to a culture of learning and responsibility by writing detailed postmortem reports
- Tackle live issues on production when on-call with assistance from the rest of the teams
197
Site Reliability Engineer Resume Examples & Samples
- Work with product teams on design and implementation of large scale distributed systems
- Automate, automate, automate and then …. automate!
- Bring ideas to life to help make the engineers' lives better
- Help developers with some of their hardest problems
198
Senior Site Reliability Engineer Resume Examples & Samples
- Troubleshoot complex production issues in a distributed environment
- Work on projects that make our network more stable and faster
- Work with our other L3 engineering support teams to troubleshoot complex problems on our network for our customers
- 5 years of relevant experience and a Bachelor’s degree in Computer Science or its equivalent, or
- 3 years of relevant experience and a Master’s degree in Computer Science, or
- 1 year of relevant experience and a PhD in Computer Science
- Education: Bachelor's Degree in Computer Science or equivalent
- Minimum of 5 years of experience in troubleshooting and reading complex code using debuggers
- Minimum of 5 years of experience in DevOps, SRE or similar roles
- 3+ years experience with AWS tools and methodologies
- 3+ years experience programming in Python or JavaScript
- 3+ years experience with MySQL and NoSQL databases
- Education: Master's Degree in Computer Science or equivalent
- Highly responsible, self-disciplined, self-managed, self-motivated, able to work with little or no supervision
- Experience in distributed systems or security products
- Passion to understand, learn, and dissect new technologies quickly on your own
- Extensive experience working on multiple projects at a time in a fast-paced, results-oriented environment
- Pluses: Security domain knowledge, Web Application Firewall
199
Site Reliability Engineer Lead-digital Resume Examples & Samples
- Ensure the secure availability and 100% uptime (reliability) of all JPMC digital properties from an application delivery perspective
- Be part of a team which proactively monitors all application flows into JPMC (Web and Mobile)
- Recognize false positives in flow and recommend course of action
- Understanding the security implications of a certain pattern and socializing it with a sense of urgency and course of action
- Troubleshooting issues and quickly determining the root cause in working with other development / networking or security teams
- Document root causes and solutions in a concise manner and build a knowledge base
- Shadow and teach/lead colleagues with skills to ensure the availability of JPMC digital properties securely
- Liaise with other organizations within JPMC to manage IT compliance/audits/security with National and International laws and regulations, as well as contractually enforced industry standards
- Interface with IT Security and Risk, Audit, and privacy to coordinate related policy and procedures, and to provide for the appropriate flow of information regarding risk
- 4+ years of experience in developing, deploying and supporting commercial and custom software solutions with an emphasis on identity and access management framework, Security, integration and support
- Expert knowledge of the HTTP protocol, response codes and behavior
- Expert knowledge of Application Security (App Level Firewalls)
- Knowledge of ASM (F5) or equivalent application firewall and policy
- Creative and inquisitive professional with excellent interpersonal and cross functional/divisional collaboration skills able to handle work smoothly under stress, managing multiple assignments concurrently, adjusting easily as business needs change, and acquiring necessary new working knowledge quickly
- Highly analytical with strong research skills, able to discern key issues and information in complex situations and resolve issues quickly
- Advanced communication (including group presentations), problem solving, and conflict resolution with internal and external stakeholders including senior leaders
- Ideally has held positions in software development, networking, operations and other technical areas, demonstrating a well-rounded command of other technology disciplines
200
Site Reliability Engineer Resume Examples & Samples
- Scripting and software development across one or more programming languages (ideally Java and/or Python)
- Deep understanding of Linux systems
- Hands on experience with cloud infrastructure such as AWS, Google compute, Azure, Rackspace cloud (minimum of 2 years)
- Expert level troubleshooting skills across different levels of the stack
201
Lead Site Reliability Engineer Resume Examples & Samples
- Create, maintain, own and operate your team’s services that support fundamental capabilities within Grubhub’s products
- Tackle some of the most challenging problems you can face developing high availability services in a distributed cloud environment that needs to scale exponentially
- Manage / Lead a team of 2 to 3 direct reports
202
Site Reliability Engineer Resume Examples & Samples
- Experience as either a Systems Administrator with Programming experience or an application-focused DevOps Engineer
- Substantial experience managing Windows and Unix/Linux infrastructure
- Self-starter who is able to take ownership of technical issues and be a productive member in the on-call rotation and certain off-hours shifts
- Strong troubleshooting skills that span systems, network, and applications
- Strong scripting ability in at least one of the following languages: Bash, Ruby, Perl and/or Python
- Experience with virtualized environments
- Intermediate knowledge of networking and load-balancing concepts
- Ability to write clear and thorough documentation
- Prior experience in an Internet-facing technical operations role with high uptime requirements
- Demonstrated ability to successfully work with Cloud architectures such as AWS, Azure, CloudStack, or OpenStack
- Strong personal and professional initiative with a focus on the success of the team and organization
- Expertise with configuration management systems such as Puppet
- Experience with package management in multi-datacenter environments
- Experience with monitoring systems, such as Nagios and Sensu
- Experience collecting and aggregating log data in an ELK stack
203
Senior Site Reliability Engineer Resume Examples & Samples
- Developing, engineering and operating API-driven platform services that provide the foundation for our micro services
- Supporting Cloud platform services from architecture, through development, deployment and production operations
- Developing automation and orchestration to improve the deployment and management of our environments
- Developing and maintaining build platforms, automation engines, micro-services, and compute platforms
- Maintaining automated test suites using CI/CD tools
- Participating in troubleshooting, capacity planning and analysis, performance analysis activities
- Advising management on service onboarding strategies and execution
204
Site Reliability Engineer, Senior Resume Examples & Samples
- 5+ years of experience with Microsoft systems administration tools
- 3+ years of experience with LoadRunner
- 3+ years of experience in scripting with PowerShell
- Experience with Cloud Service Providers (CSP), including AWS and Azure
205
Senior Site Reliability Engineer Resume Examples & Samples
- Analyze, diagnose, replicate, troubleshoot and resolve technical issues reported by customers using the Fiserv PaaS platform
- Take ownership, manage and maintain status on support requests from the business unit customers
- Escalate unresolved issues that require more in-depth knowledge in a timely manner
- Report and submit product defects and collaborate with other engineering disciplines to triage customer/business unit issues
- Create and peer review knowledgebase articles and product documentation
- Willing and able to learn new technologies
- 4-year college degree + 5 years of experience in applicable field OR advanced degree in applicable field + 4 years of relevant experience
- Software engineering skills: experience writing maintainable reusable software. You use automation to make your job more efficient. You have an obsessive need to automate. You are offended when you have to do anything manually more than once
- Operations or Systems Administration experience, particularly on UNIX. You know how page cache works and feel very comfortable at the command line
- Experience with configuration management. You have managed an infrastructure with hundreds or thousands of servers and dozens of technologies
- Strong networking fundamentals. You understand TCP/IP, subnetting and the difference between socket and connect timeouts. NSX, Panorama experience a plus
- Knowledge of distributed systems (Windows and Linux) and virtualization software (VMware)
- Automation/Continuous Integration (Salt, Fabric, Jenkins, Octopus Deploy, etc.)
- Version control software (Git, GitHub)
- A knack for troubleshooting tough problems
- Meticulous and cautious. You identify and consider all risks, and balance those with performing the task efficiently
- Experience and/or interest in agile methodologies
- Background in DevOps
- Solid communicator
- On-call experience
206
Site Reliability Engineer Resume Examples & Samples
- Bachelor’s Degree preferred, Associate’s Degree with 1 or more years of experience, 3-5 years of experience in lieu of a degree
- Experienced in at least one scripting language (Bash, Python, Perl)
- Experience with configuration management systems (Chef, Puppet, Salt)
- Experience configuring and supporting Jenkins or Hudson
- Linux system engineering expertise
- Networking Knowledge (strong VPC knowledge is a PLUS)
- Experience with Artifactory (or Nexus)
- Excellent written communication, problem solving, and process management skills
- Experience with Application Server platforms (JBoss, EAP, EWS, Wildfly)
- Experience with Cloud Computing platforms (e.g. Amazon AWS, Eucalyptus, VMware, Docker)
- Experience with Java build tools such as Ant, Maven, Gant, or Gradle
- Experience with agile development, continuous integration and automated testing
207
Site Reliability Engineer Resume Examples & Samples
- Work closely with engineering stakeholders to define platform requirements and underlying service implementations
- Rapidly iterate on existing platform features
- Strong understanding of scaling systems reliably in AWS (global experience a bonus)
- Exposure to containers (Docker)
- Experience with automation/configuration management tools such as Ansible, Chef or Puppet
- Proficiency in the use of code and script (Python, Ruby and/or Go)
- BS in CS or 5+ years of comparable experience
- Expertise and experience in site performance profiling and tuning
- Familiarity with cutting-edge open source libraries and experience contributing to projects of personal interest a plus
- Familiarity with Amazon Web Services administration a plus
208
Site Reliability Engineer Resume Examples & Samples
- 1 year of experience developing software for Windows (2000, 2003, XP, Vista) or UNIX/Linux (Red Hat versions 3-5) operating systems
- Hadoop Distributed File System (HDFS)
- JSON or BSON
- RESTful services
- Requirements analysis and design of at least one Object Oriented system
- Developing solutions integrating and extending FOSS/COTS products
209
Senior Site Reliability Engineer Resume Examples & Samples
- Applies full understanding of the business, the customer, and the solutions that a business offers to effectively design, develop, and implement operational capabilities, tools and processes that enable highly available, scalable & reliable customer experiences
- Utilizes deep knowledge of operations engineering, connected services, and system administration plus knowledge of industry best practices to innovate and influence operational approaches and solutions
- Maintain and improve the availability, performance, scalability and efficiency of the services by implementing monitoring, automation, redundancy, capacity and business-continuity planning
- Automation and development of operations tools, application dashboards, etc
- Metrics reporting on applications performance, availability, reliability, etc
- Develop and implement automated deployments
- Implementation and design of configuration management
- Troubleshoot issues and participate in 24x7 on-call support, ensuring the stability and performance of the production environment
- Review and development of performance and capacity plans (operational capacity and load requirements)
- Facilitate the creation of the operational readiness documents
- Coaches and mentors other application operations engineers
- Coordinates technical dependencies with other teams
- Technical lead for complex projects
- Drive the end-to-end incident management process
- Develop and automate monitoring processes
- Oversee change management and configuration management operating mechanisms
- Drive root cause analysis (RCA) and risk management processes
- Drive ongoing improvements and efficiencies in operational practices, tools & processes BU and Intuit-wide
- B.S. or higher in Computer Science or 3+ years of equivalent knowledge and experience
- Passion and talent for advanced scripting and automation (e.g. Python, Ruby, Perl, Golang, Java, C, etc.)
- Expert knowledge of managing services both in the Cloud (e.g. AWS) and in traditional data centers
- Expertise in Linux/Unix system administration
- Significant experience managing highly available services at scale
- Significant experience with configuration management (e.g. Chef, Salt, Puppet, CloudFormation, etc.) and automated deployments
- Experience with CI/CD process and technologies (e.g. Jenkins, Travis CI, etc.)
- Strong knowledge and experience with metrics, monitoring and alerting tools (e.g. New Relic, Splunk, OMD/Nagios, Sensu, PagerDuty, etc.)
- Experience in support/troubleshooting/operations of relational and NoSQL databases (e.g. MySQL, Postgres, Oracle, RDS, Cassandra, DynamoDB, etc.)
- Operational mindset with ability to do incident, problem, change, and SLA management
- Expert problem solving capabilities and ability to think outside the box
- Works well in a fast-paced, dynamic operations environment with geographically distributed teams
- Experience with Tomcat, JBoss, Mule
210
Principal Site Reliability Engineer Resume Examples & Samples
- Develop principles, patterns, and tools to improve our ability to rapidly deploy and effectively monitor application services in a large-scale and complex environment
- Being able to multitask and deliver in a fast-paced, rapidly evolving technology landscape and participate in an on-call escalation for incident resolutions
- Responsible for Lights Out Management of our services and advocating and contributing toward the best operability of the services in production
- Spends approximately half the time with the core engineering teams advocating and contributing towards making our application services resilient and with “ilities”, and the remaining half with core site operations teams
- End-to-end user flow profiling – create requirements and implement solutions to optimize E2E customer interactions, and model customer transaction flows for long-term analysis and short-term troubleshooting
- Mentoring DevOps and Site Reliability Engineers
- Communicate with Engineers, Architects, and Executives across various organizations to convey ideas and influence outcomes
- BS degree or higher in Computer Science (or equivalent experience)
- Significant experience with cloud hosted apps/service, AWS experience preferred, and able to translate business requirements into securely implemented capabilities in the cloud
- Expert level experience at building, deploying and operating services at scale
- Hands-on experience in working with distributed systems and the “ilities” of the services
- Systematic problem solving approach, coupled with a strong sense of ownership and drive, leading solutions that span across CTG and partners
- Internally motivated, self-starter with ability to plan, organize and establish priorities to meet goals and achieve results
- Must work well under pressure, balancing multiple priorities and objectives, handles conflict well
- Demonstrated technical leadership and ability to communicate in a complex cross-functional environment
- Design reviews of operational approaches and solutions
- Defines and contributes to Operational Standards and Requirements
- Risk Analysis and root cause analysis
- Technical feasibility and approach decisions
211
Site Reliability Engineer Resume Examples & Samples
- Automate the server provisioning process to reduce the labor of our networking engineering and datacenter operations teams. Once we plug a new server in, it walks itself through all aspects of provisioning to join the fleet without any human involvement
- Build scalable infrastructure to manage metadata for hundreds of billions of files, hundreds of petabytes of user data, and millions of concurrent connections
- Drive the company through “Disaster Recovery Tests”, where we manually turn down pieces of infrastructure to test Dropbox’s overall resiliency to failures
- Design the system and processes that Dropbox engineers use to deploy their software into production
- Build an auto-remediation system to automatically resolve production incidents before escalating them to on-call engineers
212
Site Reliability Engineer Resume Examples & Samples
- Enable scaling by providing tools, developing training and/or augmenting processes
- Build tools and automation to prevent recurrence of problems in mission-critical products/services
- Develop a deep understanding of the various services and applications that come together to deliver Walmart e-commerce products
- Design new monitoring tools and smart alerts that help discover failures/issues in a timely fashion, and work with engineers to identify root causes and fix issues
- Influence, design and create new architectures, standards and methods for large-scale enterprise systems
213
Site Reliability Engineer Resume Examples & Samples
- Design customized hosted managed solutions that are performant, cost effective, and delight our customers
- Implement solutions by defining database physical structure and functional capabilities, database security, data back-up, and recovery specifications
- Design, implement, and maintain integrated monitoring and alerting tools, leveraging existing tools and logging. Monitoring happens at all levels of the application stack
- Expand existing automation tools to streamline onboarding
- Collaborate with various internal teams to provide the best service possible for our customers
- Evaluate new technologies to enhance the level of customer service
214
Site Reliability Engineer Resume Examples & Samples
- Drive the design, deployment and maintenance of Hortonworks HDP in multiple production environments
- Architect and build redundant, multi-site monitoring toolsets and software
- Automate repeatable processes and build tight full-stack integration with Linux and Hortonworks HDP
- Build positive relationships with Hortonworks customers, demonstrating your leadership in the Big Data industry
- Familiarity with network design and architecture principles
- Development experience in Java, Scala, Python or other languages
- Experience with PXE, kickstart, or Linux from Scratch
- In depth understanding of Hardware and Storage technologies
- Experience with streaming technologies, such as Kafka and Storm
- Published Open-Source software or public contributions a great asset
215
Senior Site Reliability Engineer Resume Examples & Samples
- Design, write and build tools to improve the reliability, latency, availability and scalability of Walmart e-commerce products
- Augment existing instrumentation to build a cohesive picture of the characteristics of our systems with special attention to points of failure
- Participate in capacity planning, demand forecasting, software performance analysis and system tuning
- Root-cause analysis of complex problems involving multiple parties
216
Site Reliability Engineer Resume Examples & Samples
- Software engineering skills: experience writing maintainable reusable software
- Operations or Systems Administration experience, particularly on UNIX. (You know what a daemon is and how to restart one. When the daemon won’t start because some other process is listening on the port it needs, you can find and kill the errant process.)
- Experience and/or interest in Test Driven Development (TDD) and agile methodologies
- Ability to dive into a polyglot codebase and contribute while learning
- On-call experience: we build production systems, and believe the best way to understand what that means is to support real systems in the wild. Our ops teams write code, and our development teams help operate their code in production
217
Site Reliability Engineer Resume Examples & Samples
- You will be responsible for maintaining and scaling production services and servers across multiple data centers for complex and data-intensive cloud services
- You will improve scalability, service reliability, capacity, and performance
- You will write automation code for provisioning and operating infrastructure at massive scale
- You will work with development teams to make sure the applications fit nicely within the infrastructure and scalability/reliability is designed and implemented from the ground up
- You will work with QA on building pipelines and automation for delivering and deploying applications to production
- You will roll up your sleeves to troubleshoot incidents, formulate theories and test your hypotheses, and narrow down possibilities to find the root cause
- Hands on experience in building fault tolerant and scalable systems
- 5+ years of Unix/Linux experience, with some experience in managing 100+ nodes
- Experience with AWS and AWS APIs
- Experience with Configuration Management and CI/CD. Salt and Jenkins preferred
- Familiar with web servers (Nginx preferred) and HAProxy
- Preferred experience: Hadoop, Kafka, RabbitMQ, Spark, HBase, Elasticsearch, Containers, OpenStack
218
Senior Site Reliability Engineer Resume Examples & Samples
- You will be responsible for designing, building, maintaining, and scaling production services and server farms across multiple data centers for complex and data-intensive cloud services
- You will design and enhance software architecture to improve scalability, service reliability, capacity, and performance
- You will write automation code for provisioning and operating infrastructure at massive scale. You are not an operator, you’re an experienced software engineer focused on operations
- You will work with development teams to make sure the applications fit nicely within the infrastructure and scalability/reliability is designed and implemented from the ground up. You will work with QA on building pipelines and automation for delivering and deploying applications to production
- You will participate in the occasional on-call rotation supporting the infrastructure
- You write postmortem reviews and remediation recommendations
- Strong sense of architecture and design for fault tolerance, scale, and stability
- Strong development/automation skills. Must be very comfortable with reading and writing Python code. Java is a plus
- 10+ years of Unix/Linux experience (shell/tools/kernel/networking)
- Tools-first mindset. You build tools for yourself and others to increase efficiency and to make hard or repetitive tasks easy and quick
- Subject matter expert in one of these areas: Big Data (Hadoop 2.x, Kafka, Spark, HBase, Elasticsearch); Data Center Virtualization (Containers, Mesos, OpenStack, SDN)
- Familiar with middleware software such as Nginx, HAProxy, RabbitMQ, and typical AWS components, as building blocks of implementing services
- Knowledgeable about collecting metrics, measuring systems and interpreting data to make decisions
- Organized, focused on building, improving, resolving and delivering. Good communicator in and across teams, taking the lead
219
Site Reliability Engineer Resume Examples & Samples
- Time series databases such as Graphite, InfluxDB, or OpenTSDB
- ELK Stack or equivalent log monitoring technology (Sumo Logic, Splunk, Graylog)
- Linux CLI
- SNMP
220
Site Reliability Engineer Resume Examples & Samples
- Drive immediate relief and provide a sustainable resolution to issues within the ServiceNow platform
- Use knowledge and experience in software development, application support, systems engineering and networking to proactively prevent issues from reoccurring
- Drive internal stakeholders and partner teams to improve the reliability, scalability and performance of the infrastructure through improved system design
- Drive and contribute to a culture of intolerance to manual activity, which results in an automation environment delivering repeatable and scalable response to system issues
- Deep knowledge of Linux systems
- Coding in any development/scripting languages like Bash, Python, C++, Java or JavaScript
- Networking skills and IP addressing
- MySQL database administration
- Monitoring of performance/availability in systems, applications and networks
- Uncompromising attention to detail
- Ability to work in shifts that cover one weekend day
- Ability to live and work full-time in Australia
221
Site Reliability Engineer Resume Examples & Samples
- BS degree in Computer Science, or related technical field, or 3+ years industry experience
- 3+ years experience working with Unix/Linux systems from kernel to shell and beyond with experience working with system libraries, file systems, and client-server protocols
- 3+ years experience in one or more of the following languages: C, C++, Java, Scala, Python, Go, Ruby, or scripting experience in Shell and Perl
- 3+ years of experience in network theory (e.g. TCP/IP, UDP, ICMP, MAC addresses, IP packets, DNS, OSI layers, and load balancing)
- Proficiency with configuration management frameworks such as Puppet or Salt
- Proficiency with Docker and Kubernetes
- Familiarity with one or more common web frameworks
- Experience working with data processing, loading, and transformation systems
- Experience building high quality distributed systems or backend services
- Experience with Google and Amazon cloud services
- Comfortable working on Linux-based systems
222
Site Reliability Engineer Resume Examples & Samples
- Design, write and deliver software to improve the availability, scalability, latency, and reliability of Edge services
- Solid understanding of networking concepts
- Solid scripting and automation skills (Python, Shell, Go, Perl, etc)
- Configuration Management experience (SaltStack, Ansible, Chef, etc)
- Experience with Continuous Integration tools (Jenkins, Buildbot, Travis, etc)
- Experience troubleshooting large scale/high performance systems
- Strong communication skills and enjoy working in a highly collaborative environment
223
Site Reliability Engineer Resume Examples & Samples
- Design, write, and maintain software to improve the availability, scalability, latency, and efficiency of Thumbtack's services, incorporating third-party open-source tools when available
- Create new designs for a growing number of distributed systems
- Design and implement the tools and processes used for deployment and change management
- Plan and execute configuration management
- Own, maintain, and continuously improve all systems provided as a service, such as monitoring and datastores
- Engage in service capacity planning and demand forecasting, anticipating performance bottlenecks
- Run software performance analysis and system tuning
- Plan and execute disaster recovery drills
- Participate in rotating on-call duties
- Fluent in one or more of: C, Python, Go
- Minimum of 4 years of industry experience in engineering
- Familiarity with algorithms, data structures, and complexity analysis
- In-depth knowledge of operating systems (processes, threads, IPC, concurrency, locks, mutexes, semaphores, etc.)
- Experience working with Unix/Linux systems from kernel to shell and beyond, with experience working with system libraries, file systems, and client-server protocols
- Experience with network protocols and theory (TCP/IP, UDP, ICMP, MAC addresses, IP packets, DNS, OSI layers, and load balancing, etc.)
- Experience with Puppet, or some other configuration management tool
- Systematic problem solving approach
- Expertise in designing, analyzing, and troubleshooting large-scale distributed systems
- Experience with PostgreSQL tuning and performance
224
Site Reliability Engineer Resume Examples & Samples
- BS/BA preferred or equivalent experience
- 5+ years of relevant experience in Linux systems administration, provisioning, configuration, troubleshooting, and monitoring (Nagios, Zenoss, SNMP)
- Programming experience in one of the following: Python, C, and/or Perl
- Deep understanding of the Linux OS, internals, kernel & file systems tuning, protocols and services (PXE, DNS, HTTP, NFS, CIFS, FUSE) and troubleshooting utilities (strace, sar, vmstat, mpstat, tcpdump)
- Extensive experience in shell/bash scripting and automation tools (Salt, Puppet, Chef, FAI, Kickstart)
- Working knowledge of virtualization technology (KVM, Xen, ESXi, OpenStack) and Dell hardware (DRACs, Lifecycle Controllers, BIOS, RAID controllers)
- Must possess strong documentation skills and be able to work with rapid change, Configuration Management utilities (SaltStack, Puppet, etc), and Source Control (Git, SVN)
- Prior experience with CDN, ISP, 2,000+ server environments, or production Internet-facing services for large enterprise customers is highly preferred
225
Site Reliability Engineer Resume Examples & Samples
- Leads the onboarding of teams to the Gannett Cloud Platform, including writing the proper Continuous Integration Automation Scripts and optimizing the applications' cost and performance inside of the Cloud Platform
- Bachelor’s degree in Computer Science, Computer Engineering, Electrical Engineering or related technical field
- Minimum of 3 years of progressive experience in Linux systems administration position
- Experience must include deploying to AWS or other clouds, auto-scaling and the architecture of stateless applications, and using Chef or other configuration management tools
226
HBO Site Reliability Engineer Resume Examples & Samples
- Demonstrable knowledge of cloud-based software deployments on the cloud stack of your choosing (Amazon Web Services, Google Cloud Platform, OpenStack, VMware)
- A high degree of familiarity with Linux containers and container orchestration tools like Kubernetes or Mesos
- Strong understanding of HTTP, REST, networking concepts and global load-balancing
- A passion for automation - we’re looking to automate anything possible
- Work cross-functionally within a service team and be a core contributor in every significant engineering solution that is delivered
- Debug production issues across services and levels of the stack
- Participate in on-call rotations, along with every member of the engineering team
- Solid understanding of system design, including the operational trade-offs of various designs
- Solid programming and troubleshooting skills. You may be called upon to help with systems written in Go, Python, Node.js. You won’t be expected to know everything, but we are looking for people who can dig through a codebase for debugging and commit tactical fixes as needed
- 1 – 10+ years of experience. We’re looking for talented engineers at all experience levels
227
Site Reliability Engineer Resume Examples & Samples
- Expert knowledge of the Oracle Enterprise Linux operating system (OS, networking, process level)
- Fluent in at least two languages (Bash, Python, Ruby, golang)
- Hands on experience with CloudStack or OpenStack, Mesos, Marathon
- Familiarity with KVM, Qemu, and Docker
- Familiar with Software Defined Networking, Open vSwitch
- Able to troubleshoot issues across the entire stack
- Familiar with monitoring and log aggregation tools such as Splunk
- Recent experience in a large (5,000+ hosts) computing environment
228
Senior Site Reliability Engineer Resume Examples & Samples
- Develop new and enhance existing features for DDE's massively distributed system
- Work on performance-critical data processing system
- Work on data collection, processing, storage and access subsystems
- Work on projects that focus on system scalability, performance, and security
- Drive feature development from idea inception through design and testing to operational deployment
- Follow SW development methodology best practices, including collaboration with QA departments to successfully deploy high quality new system components
- BS in Computer Science or equivalent, MS preferred
- 2+ years of experience developing SW on C/C++ or Java
- 3+ years experience with Linux/Unix environment
- 3+ years experience with computer networking
- Knowledge of networking principles, including TCP/IP, SSH, SSL and HTTP protocols
- Ability to troubleshoot complex network problems and customer issues
- Proven track record of delivering large amounts of high quality, complex code
- Highly responsible, motivated, able to work with little supervision
- Experience with Big Data systems (Hadoop, Spark, Apache Cassandra, etc.) and principles (Map/Reduce, Stream Processing, etc.)
- Experience with scripting, e.g. Perl, Python, Bash, and RESTful APIs
- Experience with DBMS, e.g. PostgreSQL, MySQL, etc.
229
Site Reliability Engineer Resume Examples & Samples
- Make sure our systems distributed around the globe are designed and deployed securely
- Make sure we are designing systems that will scale far into the future
- Be paranoid and think of all the ways an attacker could compromise our systems, all the ways hardware and software could fail, and mitigate
- Help design and build the systems that enable us to scale our data centers without scaling the number of people required to manage them
- 4+ years of C++ experience
- 4+ years of experience in high-performance distributed systems, software development, networks, security
- 2+ years of experience with at least one major IaaS provider (AWS, Google Cloud, Rackspace, Azure)
- Bachelor of Science degree in Computer Science, or 4+ additional years of related industry experience
- Strong skills in debugging, performance optimization and unit testing
- Experience with distributed computing
- Experience working effectively as a team member and in large code bases
- Experience designing and implementing security for live infrastructure
230
iXp Intern, Site Reliability Engineer Resume Examples & Samples
- Proactively monitor availability and performance of the Ariba Cloud using key performance tools
- Effectively and quickly respond to monitoring alerts, incident tickets and overall technical support for the Ariba product suite
- Perform extensive application and web site troubleshooting to quickly resolve issues
- Work closely with subject matter experts within various Engineering teams
- Ensure user tickets and monitoring alerts are handled according to pre-defined SLAs for response time, updates and closure
- Develop and automate manual tasks to improve day-to-day monitoring and scalability of time critical operations
- Handle communication and notification on major site issues to executive management teams
- Document standard operating procedures to effectively utilize ITIL best practices
- Ensure effective shift turnovers for continuous 24/7 support
- Requires candidates to currently be enrolled in an undergraduate, Masters, MBA or PhD degree program which is applicable to the position
- Experience working in a Unix environment
- Triage and support system applications including but not limited to Apache, DNS, Sendmail, SSH, TCP/IP, NFS and common Internet protocols
- Excellent knowledge of operating system internals, file system structures and machine architectures in a Linux operating environment
- Some knowledge of Oracle database administration
- Interest in writing Perl and Shell scripts to automate processes and enhance productivity
- Willing to work in a dynamic, fast-paced environment with well-developed practices and procedures
- Outstanding interpersonal, analytical, and communication skills
- Must be reliable and dependable with ability to multi-task across various functions
- Candidates must be local to Silicon Valley to be considered
- Must be able to work onsite in Palo Alto, CA during summer 2017
231
Senior Site Reliability Engineer Resume Examples & Samples
- Experience with NoSQL databases (especially MongoDB)
- Hands on experience with AWS - minimum of 1 year
- Serious troubleshooting skills across different levels of the stack
- Experience troubleshooting a continuous integration pipeline
- Understanding of Linux systems
232
Site Reliability Engineer Resume Examples & Samples
- At least a Bachelor’s degree in CS or related field
- Preference for a mid-level lead with hands-on developer skills
- 5+ years of progressive experience in the technical support space; ideally in the retail, hospitality or consumer goods industries
- 3+ years of J2EE platform experience
- 2+ years’ experience with supporting EA/middleware technology platforms
- 2+ years’ experience with release management
- Solid knowledge of shell scripting and at least one scripting language
- Ability to actively participate in infrastructure design and implementation
- Must be adaptable and able to focus on the simplest, most efficient & reliable solutions
- Practical knowledge of various aspects of service design, including messaging products & behavior, caching strategies and software design practices
- Experience in Linux-based shell scripting a BIG plus
- Experience with DevOps and release management is a plus
- Experience working with Jira
233
Site Reliability Engineer Resume Examples & Samples
- Demonstrated experience with data structures, and software design
- Demonstrated experience programming in one or more of: C, C++, C#, Java, Python, Go
- 5+ years of full Software Development Life Cycle experience
- 3 or more years of experience working as a programmer
- Demonstrated experience with PL/SQL development
- Ability and willingness to assume periodic “on call” duties for issues/escalations
- Experience with running web services at scale
- Understanding of Unix/Linux systems from kernel to shell and beyond
- Networking: knowledge and understanding of network theory, such as different protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing
234
Site Reliability Engineer Resume Examples & Samples
- 4+ years experience in network design and implementation
- 4+ years supporting high-volume customer facing networks at an L3 level
- Expert level experience with multiple network equipment vendors
- Expert level understanding of TCP/IP, IPv4/IPv6
- Expert level BGP and OSPF
- Administrator level knowledge of and comfort working in *nix environments
- Proficiency in modern development languages such as Python, C++
- Experience developing tools and automation to drive efficiency in operations tasks
- Experience developing metrics and telemetry to measure and improve availability
- Expert level troubleshooting skills
- Experience with L2 technologies such as MLAG and VPC
- Participation in on call rotation
235
Bluemix Site Reliability Engineer Resume Examples & Samples
- At least 2 years of experience in software development or engineering
- At least 2 years of experience in application operation
- At least 2 years of experience in problem determination
- 3 years of experience in multiple coding skills such as Representational State Transfer (REST)/web services, distributed systems, messaging, and knowledge of open source tools used in Cloud Foundry development (e.g. Git and Jenkins)
- 3 years of experience in software development or engineering
- 3 years of experience in application operation
236
Senior Site Reliability Engineer Resume Examples & Samples
- Someone with a passion for learning
- Someone gutsy who isn’t afraid to try new things
- Someone who is comfortable failing, learning and trying again
- Someone who is data obsessed
- Someone with an automation mindset
- Someone who hates black boxes
- Security focused position working with a Security Team
- Application Security exposure
- Proficient in a high level scripting language
- Continuous Integration Knowledge
- Experience with RDBMS and NoSQL implementations
- Knowledge of distributed systems design and constraints
237
Senior Site Reliability Engineer Resume Examples & Samples
- Experience administering Linux systems
- Experience working in a SaaS environment at scale
- Fluency coding in either Python, Ruby, Java, or Go
238
Site Reliability Engineer Resume Examples & Samples
- React to monitoring alerts and lead efforts to fix system issues in a quick manner
- Work with technology partners to resolve issues and push improvements in our ecosystem
- Develop and contribute to internal knowledge base
- Be a champion for our customers and ensure change management processes are followed
- Automate Site Reliability Engineering operations by developing software applications and API Integrations to connect disparate systems
- Work in a highly skilled 24/7 environment. Provide support on weekdays and also off hours as needed
- At least 3 years in an IT environment with preference given to operational center environments
- Strong background in interacting with relational database environments, including constructing complex queries to proactively identify system trends and troubleshoot application issues
- Familiarity with open source and 3rd party Monitoring Systems (Nagios, Kafka, etc.)
- Experience using scripting tools (Perl, PowerShell, PHP) to create utilities that streamline day-to-day workflows
- Systems/Network administration background
- Experience with a broad set of system tools for troubleshooting and analysis, including but not limited to SoapUI and LoadUI for API testing and simulation
- Knowledge of AppDynamics, Splunk, Microsoft System Center suite for operational monitoring
- Familiarity with Linux Operating Systems
239
Site Reliability Engineer Resume Examples & Samples
- Track our cloud customer SLAs and be on-call to ensure total conformity to these customer commitments
- Create and maintain complete and accurate documentation for the purpose of operational audits including security and compliance
- Coordinate internal activities across company and our cloud partners, ensuring achievement of the above responsibilities
- Continuously review and enhance processes and operating procedures needed to maintain the most cost effective enterprise-grade cloud infrastructure
- Innovate and automate improvements to our Cloud Operations
- Identify and promote best practices and patterns for the setup, configuration and management of databases, servers, and network and storage systems
240
Senior Site Reliability Engineer Resume Examples & Samples
- Become a subject matter expert (SME) of several internal highly-distributed systems
- Provide expertise to engineering and operations team on the use of these systems
- Tune systems to perform better and operate more reliably
- Manage the rollout and activation of new features and platform changes
- Assist Akamai engineering team and operations staff who rely on our system to perform their job
- Write Bash, PHP, SQL, Perl and Python code to enhance and fix the functionality of the system
- Bachelor's degree in engineering, computer science or equivalent experience
- 5+ years experience in the industry
- 3+ years experience with networking systems
- Experience with alerting, monitoring and performance management systems
- Experience in Business Intelligence systems/tools (analyzing data and identifying trends)
- Experience working with DNS, HTTP, and SSL protocols
- Experience with Web servers, Perl, Python, SQL and PHP
- Experience analyzing and optimizing security systems
- Excellent problem solving/troubleshooting skills
- Ability to work closely with other engineers to understand problems and work towards solutions
241
Site Reliability Engineer Resume Examples & Samples
- 5+ years of Linux systems administrator experience
- Demonstrated programming skills in two or more of: Bash, Perl, Python, Ruby, PHP, Java, C
- Solid understanding of operational principles, such as capacity planning, monitoring and incident handling
- Experience with Hadoop, HDFS, Kafka, Spark, Docker, Mesos, Marathon, AWS or Azure desired
242
Site Reliability Engineer Resume Examples & Samples
- Primary support for outages affecting production environments
- Provisioning, configuration and maintenance across all our service environments, including writing software to automate repetitive tasks
- Working collaboratively with your peers to solve complex issues
- Measuring and improving service quality and reliability while remaining cognizant of the need for economic and operable solutions
- Participation in code reviews, willingness to take time to help others grow and succeed
243
Site Reliability Engineer Resume Examples & Samples
- Ensure reliable operation of processing pipelines that analyze terabytes of scientific data
- Apply industry best practices to the field of scientific programming, e.g., testing (nose), continuous integration and deployment (buildbot), and code quality best practices (prospector)
- Provide hands on support to facilitate the scientific mission of users, including debugging, performance analysis, and training
- Convert user-written programs into more robust automated pipelines
- Maintain and operate existing applications via configuration management (Ansible) as well as implementing for new systems as needed
- Respond to incidents that cause outages or threaten data integrity, while planning to implement new systems that might prevent reoccurrence
- Perform periodic on call duty
- Bachelor’s Degree and 3 or more years of general IT experience including
- 2 or more years of experience in software engineering
- 3 or more years of experience in UNIX/Linux system administration, technical troubleshooting, and performance tuning
- Bachelor’s degree in computer science, engineering, or related field
- Experience working on small cross-team technical projects with direct contact with stakeholders
- Ability to work well independently and with a team, with a strong desire to learn and support the scientific community
- Exposure to software development best practices including architecture, documentation, and testing
- A background using technologies from most or all of these areas: UNIX/Linux systems administration (e.g., CentOS, Debian), programming languages (e.g. Python, JavaScript, shell scripts), scientific computing (e.g. Matlab, R), protocols and standards (e.g. HTTP, NFS), debugging tools (e.g. strace, firebug, pdb), web application technologies (e.g. Nginx, Tomcat, NodeJS), database technologies (e.g. MongoDB, PostgreSQL), resource management facilities (e.g. OpenPBS, Slurm)
- Strong written and spoken English language skills
244
Site Reliability Engineer Resume Examples & Samples
- 1+ years of experience with Linux systems administration, including CentOS or RHEL
- 1+ years of experience in scripting with Bash, Python, or Ruby
- Experience with Hadoop
- Knowledge of Puppet, Chef, Ansible, or Salt
- Knowledge of Docker
- Experience with working in Agile or DevOps environments
- Experience in working with distributed data stores, including Accumulo or HBase
- Experience with continuous monitoring tools, including Kibana, SolarWinds, and Splunk
- Experience with service discovery platforms, including DNS, Zookeeper, or Consul
- Experience with Cloud Service Providers (CSP), including AWS, and private Cloud implementations, including OpenStack
- Experience with working in and maintaining PaaS environments, including Kubernetes or Mesos
- DoD 8570 Compliance Certification, including Security+ CE or CISSP
245
Site Reliability Engineer, UK Resume Examples & Samples
- Migrating daemons and services to a 64-bit architecture while minimizing service disruption
- Identifying local and distributed performance bottlenecks and evaluating whether they can be allayed by caching, precomputation, or other similar techniques
- Adding network topology awareness on a code deployment pipeline to increase performance using less bandwidth
- Improving service monitoring to include automated anomaly detection
- Automating Ruby scripts within a Unix environment and building large systems out of small components that each do one job and do it well
246
Site Reliability Engineer Resume Examples & Samples
- Ensure user-visible uptime and quality, providing operational and development expertise so that our systems fail rarely and are fast to fix when they do fail
- Participate in architecture and design reviews to provide recommended improvements to the development teams to improve the reliability and performance of applications
- Minimize manual involvement by imagining & implementing continuous improvements that create an operating environment, including the development of new tools, dynamic monitoring, alerting, & automated self-healing & recovery
- Identify and/or analyze problems relating to mission critical services and implement automation to prevent problem recurrence; with the goal of automating response to all non-exceptional service conditions
- Engage in application performance analysis and system tuning, and capacity planning
- Perform root cause analysis to identify & implement continuous improvements
- Capable of presenting analyses and recommendations to leadership or discussing the technical merits of solutions with engineers and architects
- Own the day-to-day health, uptime, monitoring, and reliability of services and server infrastructure
- Practice Agile and Scrum methodologies
- Strong experience with SharePoint 2013 platform and/or software engineering
- Working knowledge of Azure Services, especially ARM templates
- Strong experience with TFS 2010+, VSTS, or similar ALM tool
- Strong experience with PowerShell
- Experience developing in a software development language (preferably C#/C++)
- Knowledge of virtualization and its benefits for improving reliability
- Strong experience with instrumentation, monitoring, alerting, and responding relative to performance and availability of applications
- Capable of technical deep dives into infrastructure, databases, and application, specifically in designing, coding, operating, and supporting high-performance, highly available services and infrastructure
- Experience in designing for failure, including disaster recovery and business continuity planning
- Experience operating and supporting mission-critical applications (e.g. incident and outage management)
- Passionate about making things better and driving action with a sense of urgency
- Experience problem solving issues on globally distributed systems and critical product service environments
- Brings new thinking to challenge existing technology and processes
- Excellent at building relationships across teams
- Firm sense of accountability and ownership
- Understanding of the concepts and principles behind DevOps, Continuous Delivery, Agile, Lean, etc
- Use of DevOps tools to deliver and operate end-user services a plus (e.g., Chef, New Relic, Puppet, etc.)
247
Site Reliability Engineer Resume Examples & Samples
- Solid understanding of Software Engineering and Computer Science principles
- Solid foundation in Linux administration and troubleshooting
- Fluent in the English language both spoken and written
248
Site Reliability Engineer Resume Examples & Samples
- Engage in the entire lifecycle of services—from inception through deployment, operation and continuous integration
- Acquire expertise in Splunk and similar log management tools to establish causal relationships between issues and trends
- Manage tasks in Jira from production support standpoint
- Maintain services once they are live by measuring and monitoring availability, latency and overall system health
- Scale systems sustainably through mechanisms like automation, and spearhead improvements that help to improve reliability and velocity
- Build knowledge base around common production support issues
- Timely communication of Release Notes to QA and other teams
- Able to build prototypes
- At least a Bachelor’s degree in CS or related fields
- Solid knowledge of shell scripting and at least one scripting language
249
Senior Site Reliability Engineer Resume Examples & Samples
- Excellent troubleshooting skills that span systems, network (TCP/IP), and software
- Comfort working with senior management to allocate and prioritize engineering energy in support of the SRE mission
- Incredible attention to detail – plan ahead before making changes
- Comfort with multiple customers and high complexity network and software servers
- Effective communication with solid writing skills
- Ability to extract details from design documentation and engineering release notes and identify possible points of contention for instrumentation
- Familiarity with ITIL service management concepts
- ESB and JMS systems administration
- Experience with working with geographically-distributed teams
- Experience with Monitoring infrastructure tools such as Prometheus, Grafana, etc
- Experience with Log Management tools such as ELK Stack, Splunk, MongoDB, etc
- Experience with Big Data technology, Hadoop, Cassandra, is a plus
- Experience with Zookeeper is a plus
- Experience with Docker is a plus
250
Site Reliability Engineer Resume Examples & Samples
- Web and Application Server Technologies – Understanding of the configuration and management of Apache, IIS, Nginx, Tomcat, Sendmail, Exim, ProFTPD, etc. Will implement full LAMP stack configurations
- Will perform debugging analysis and provide an engineered solution based approach for resolution of issues
- Will work with APIs, both consuming and providing
- Experience with mainstream Open Source technologies
- Experience with database technologies (MySQL preferred)
- Intermediate experience with monitoring technologies and methodologies
- Experience working with SAN and NAS technologies (iSCSI, FCP, CIFS, NFS, etc.)
- Experience providing solutions/automation via code/development using industry standard languages (Python, PowerShell, C#, Ruby, etc.) and code management methodologies (Git, SVN, TFS, etc.)
- Experience with RESTful APIs