ABOUT THE ROLE
Reporting to the Director of Cloud Operations, this position is responsible for developing and maintaining automation, tools, and configurations, and for system and application service uptime in a high-availability, customer-facing, business-critical 24x7 SaaS environment that requires immediate response to service-impacting issues. You will have, or will develop, skills in assessing tradeoffs in the installation, configuration, and diagnostics of open source Linux systems in a large-scale DevOps environment. The right candidate will have excellent verbal and written communication skills and a demonstrated ability to work across departments toward a common goal. A passion for implementing open source tools, system, network, and application diagnostics frameworks, and CI/CD environments for a SaaS enterprise is required, along with a structured approach to achieving high-quality, sustainable production operations. The candidate will also have knowledge of deploying applications built with Java, Node.js, or other typical enterprise application frameworks and languages.
RESPONSIBILITIES
- Identify, diagnose, and resolve complex technical issues quickly in a live production environment, and use those events to improve current technology and processes so that such issues do not recur.
- Work closely with the Engineering teams to escalate and/or triage issues to resolution.
- Review tickets and diagnostics in post-mortems to identify trends and chronic issues.
- Hands-on implementation & upgrade of tools for monitoring, trending & diagnostics.
- Audit proactive monitoring of all systems to detect and resolve problems to ensure uninterrupted operation of all infrastructure systems.
- Update corresponding documentation on installation process & configurations
- Use and modify small SQL and NoSQL queries.
- Modify and fix issues in configuration management and continuous integration
- Modify existing log parsing configurations and create new ones
- Write new scripts/tools to automate common tasks
- Provision servers
- Make modifications to monitoring and alerting systems and create new monitors and alerts
- Set up environments
- Participate actively in releases
- Consider security concerns in all work
- Automate, automate, automate everything.
SKILLS AND REQUIREMENTS
- Requires at least a four-year bachelor’s degree in a computer-related field, including but not limited to Computer Science, Information Technology, Electronics Engineering, or Computer Information Systems.
- Understanding of cloud systems
- Working knowledge of an IaaS provider such as AWS
- Basic working knowledge of, and curiosity about, Unix, networking, load balancers, and similar technologies required in the cloud.
- Solid knowledge of Bash scripting and a higher-level scripting language such as Python.
- Some SQL and NoSQL knowledge.
- Understanding of configuration management systems
- Understanding of monitoring and logging systems
- Skilled in source code control systems
- Knowledge of software engineering best practices (DRY, etc.)
- Proclivity for troubleshooting and triage of incidents, bringing issues to rapid resolution.
- Strong verbal and written communication skills, with the ability to work effectively across organizations
- Excellent problem-solving skills with the ability to analyze situations, identify existing or potential problems and recommend solutions
- Ability to join the on-call escalation rotation and coordinate work in production-critical situations is essential.
Extensive working knowledge of as many of the following technologies and areas as possible:
- Systems – Linux, Unix, Docker, OpenShift & open source software
- Automation using Ansible in a cloud environment
- Working knowledge of databases
- Good networking fundamentals, including protocols, load balancers, VPN, switches/routers/firewalls, LDAP, SNMP, SMTP
- Good understanding of filesystem technologies, enough to build filesystems and troubleshoot filesystem issues
- Virtualization/Cloud technologies – Strong working knowledge of AWS with a good understanding of other technologies like OpenStack, OpenShift, Google Cloud
- Web servers/reverse proxies such as Apache, nginx, and HAProxy
- Web application frameworks in Node.js, Python, etc.
- Monitoring, trending & diagnostics tools
- Logging tools such as Splunk, ELK stack, etc.
- Using source code control systems such as Git (or similar)
- Work/defect tracking & Wiki systems such as JIRA / Confluence
- Knowledge of the use and maintenance of continuous integration and continuous deployment systems.
- Ability to prioritize and balance activity between projects with longer-term impact and immediate production-critical requirements, with a customer focus.
- Be a self-starter and require minimal guidance.
Bluescape is an equal opportunity employer. In keeping with the values of Haworth, we make all employment decisions, including hiring, evaluation, termination, and promotional and training opportunities, without regard to race, religion, color, sex, age, national origin, ancestry, sexual orientation, physical handicap, mental disability, medical condition, disability, gender identity or expression, pregnancy or pregnancy-related condition, marital status, height and/or weight.