Site Reliability Engineer, Cloud Platform
Posted on: August 16, 2019
Description:
We are seeking a highly motivated and talented Site
Reliability Engineer to work on Qualys' Cloud Platform & Middleware
technologies. Working with a team of engineers and architects, you
will combine software development and systems engineering skills to
build and run scalable, distributed and fault-tolerant systems.
The ideal candidate will write software to optimize day-to-day work
through better automation, monitoring, alerting, testing and
- Co-develop and participate in the full lifecycle development of
cloud platform services from inception and design through deployment,
operation and improvement by applying scientific principles.
- Increase the effectiveness, reliability and performance of
cloud platform technologies by identifying and measuring key
indicators, making changes to the production systems in an
automated way and evaluating the results.
- Support the cloud platform team before technologies are pushed
to production release through activities such as system design,
capacity planning, automation of key deployments, building a
strategy for production monitoring and alerting, and participating
in the testing/verification process.
- Ensure that the cloud platform technologies are maintained
properly by measuring and monitoring availability, latency,
performance and system health.
- Advise the cloud platform team on how to improve the reliability of
the systems in production and scale them based on need.
- Participate in the development process by supporting new
features, services and releases, and hold an ownership mindset for
the cloud platform technologies.
- Develop tools and automate the process for achieving
large-scale provisioning and deployment of cloud platform
- Participate in the on-call rotation for cloud platform
technologies. During incidents, lead the incident response and help
write detailed, blameless postmortem analysis reports that are
brutally honest.
- Propose improvements and drive efficiencies in systems and
processes related to capacity planning, configuration management,
scaling services, performance tuning, monitoring, alerting and root
- 2+ years of relevant experience in running distributed systems
at scale in production.
- Expertise in one of the following programming languages: Java, Python or
- Proficient in writing Bash scripts
- Good understanding of SQL and NoSQL systems
- Good understanding of systems programming (network stack, file
system, OS services)
- Understanding of network elements such as firewalls, load
balancers, DNS, NAT, TLS/SSL, VLANs, etc.
- Skilled in identifying performance bottlenecks, identifying
anomalous system behavior, and determining the root cause of
- Knowledge of JVM concepts like garbage collection, heap, stack,
profiling, class loading, etc.
- Knowledge of best practices related to security, performance,
high-availability, and disaster recovery.
- Proven record of handling production issues,
planning escalation procedures, conducting post-mortems, impact
analysis, risk assessments and other related procedures.
- Able to drive results and set priorities independently
- BS/MS degree in Computer Science, Applied Math or related
Bonus Points if you have:
- Experience with managing large-scale deployments of search
engines like Elasticsearch
- Experience with managing large-scale deployments of
message-oriented middleware such as Kafka
- Experience with managing large-scale deployments of RDBMS
systems such as Oracle
- Experience with managing large-scale deployments of NoSQL
databases such as Cassandra
- Experience with managing large-scale deployments of in-memory
caching using Redis, Memcached, etc.
- Experience with container and orchestration technologies such
as Docker, Kubernetes, etc.
- Experience with monitoring tools such as Graphite, Grafana and
- Experience with HashiCorp technologies such as Consul, Vault,
Terraform and Vagrant
- Experience with configuration management tools such as Chef,
Puppet or Ansible
- In-depth experience with continuous integration and continuous
- Exposure to Maven, Ant or Gradle for builds
Keywords: Qualys, Raleigh, Site Reliability Engineer, Cloud Platform, Professions, Raleigh, North Carolina