Hewlett Packard Enterprise
Site Reliability Engineer
Neural analysis suggests this role is
optimal for Mid+ candidates.
“Site Reliability Engineer at Hewlett Packard Enterprise. Skills: Site Reliability Engineering, AWS, Infrastructure Automation, Kubernetes, Distributed Systems, Observability, Python, Golang. Engage in and improve the whole lifecycle of services - from inception and design, through to deployment, operation, and refinement. Support development of services from planning phase before they go live through activities such as system design consulting, developing software platforms and frameworks, capa”
Industry & Context.
Problem-solving and debugging skills with a high sense of ownership; troubleshooting skills across network, application, and distributed services layers.
Be on an on-call rotation to respond to incidents that impact platform availability.
What They're Looking For.
Must Have
- Experience building and running reliable and fault-tolerant production cloud systems at scale on AWS.
- Coding infrastructure automation with Terraform, Terragrunt, Packer, CI/CD, and knowing how to use configuration management systems like Ansible.
- Hands-on experience with Linux/Unix operating systems internals, file systems, system tuning, administration, and networking.
- Deep experience in microservice technologies, container orchestration, and continuous deployment (Kubernetes, Docker, Helm, GitOps with Flux).
- Experience in designing, building, maintaining production services, and troubleshooting large-scale distributed systems.
- Experience with technologies like Apache Kafka, Apache Storm, Apache Flink, Apache Airflow and Spark, Postgres, Redis, Elasticsearch, Arango, Cassandra.
- Experience with observability tools and methodology (monitoring, logging, tracing, SLOs/SLIs) for detecting and diagnosing issues in advance before causing service impact or performance degradation.
- Possess programming skills in Shell, Python, Golang and/or Ruby.
- Deliver efficiently and effectively.
- Problem-solving and debugging skills with a high sense of ownership.
Nice to Have
- 10+ years of engineering or systems experience.
- Experience leveraging cloud architecture, applying site reliability principles, and/or demonstrating sensitivity to operational concerns.
- Understanding of network design and architecture.
- Scaling and managing distributed systems.
- Significant experience with monitoring and observability platforms.
- Demonstrated ability to debug, fix, and optimize code.
- Troubleshooting skills across network, application, and distributed services layers.
- The ability to learn quickly and adapt to new technologies is essential.
- Excellent communication skills, both verbal and written.
What You'll Do.
Engage in and improve the whole lifecycle of services - from inception and design, through to deployment, operation, and refinement.
Support development of services from planning phase before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
Maintain services once they are live by measuring and monitoring availability and overall system health.
Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
Plan capacity for the growth of cloud infrastructure.
Improve operational processes such as deployments and upgrades.
Manage execution of project priorities, deadlines, and deliverables.
Be on an on-call rotation to respond to incidents that impact platform availability.
Use your on-call shift to prevent incidents from happening.
Experience in incident response, including conducting post-mortems and implementing lessons learned, enhances system reliability.
How You'll Work.
Team & Collaboration
Provide technical leadership and guidance to other team members on managing availability and performance of mission critical services, on building automation to prevent problem recurrence, and building automated responses for non-exceptional service conditions.
Communication Scope
Excellent communication skills, both verbal and written.
Process & Methodology
Manage execution of project priorities, deadlines, and deliverables.
Full Job Description
Site Reliability Engineer

This role has been designated as ‘Remote/Teleworker’, which means you will primarily work from home.

**Who We Are:**
Hewlett Packard Enterprise is the global edge-to-cloud company advancing the way people live and work. We help companies connect, protect, analyze, and act on their data and applications wherever they live, from edge to cloud, so they can turn insights into outcomes at the speed required to thrive in today’s complex world. Our culture thrives on finding new and better ways to accelerate what’s next. We know varied backgrounds are valued and succeed here. We have the flexibility to manage our work and personal needs. We make bold moves, together, and are a force for good. If you are looking to stretch and grow your career our culture will embrace you. Open up opportunities with HPE.

**Job Description:**
We are looking for a highly motivated, self-driven, and dedicated Site Reliability Engineer possessing hands-on experience with:
- Experience building and running reliable and fault-tolerant production cloud systems at scale on AWS.
- Coding infrastructure automation with Terraform, Terragrunt, Packer, CI/CD, and knowing how to use configuration management systems like Ansible.
- Hands-on experience with Linux/Unix operating systems internals, file systems, system tuning, administration, and networking.
- Deep experience in microservice technologies, container orchestration, and continuous deployment (Kubernetes, Docker, Helm, GitOps with Flux).
- Experience in designing, building, maintaining production services, and troubleshooting large-scale distributed systems.
- Experience with technologies like Apache Kafka, Apache Storm, Apache Flink, Apache Airflow and Spark, Postgres, Redis, Elasticsearch, Arango, Cassandra.
- Experience with observability tools and methodology (monitoring, logging, tracing, SLOs/SLIs) for detecting and diagnosing issues in advance before causing service impact or performance degradation.
- Possess st
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.