Job Title : InfraOps Reliability Administrator
Location : Hybrid
Regular / Temporary : Regular
Full / Part Time : Full-Time
Job ID : 60506
Department
This position is within FSU’s Department of Information Technology Services (ITS).
Responsibilities
The FSU College of Medicine Infrastructure and Operations team designs, builds, and manages infrastructure and servers to support other IT teams, faculty, staff, researchers, and students within the college. The team leverages the latest in automation and observability solutions to make complex work easier to accomplish. Design, build, automate, and optimize infrastructure using modern tools and site reliability engineering practices. Manage primarily Windows servers in a hybrid cloud environment, with a focus on reliability, observability, security, and continuous improvement. Collaborate across teams and leverage automation, scripting, data-informed decision-making, and self-directed professional development to deliver secure, scalable, and customer-focused solutions.
Infrastructure and configuration as code : Use tools such as Terraform, Azure DevOps, Visual Studio Code, and scripting languages like PowerShell and Bash to manage infrastructure as code (IaC) and configuration as code (CaC), ensuring consistency, repeatability, and auditability of systems. Use observability solutions, such as Elastic, to monitor deployments and support data-informed decisions and rapid experiments that drive continuous improvement. Work with CI/CD pipelines to automate deployment, validation, and testing processes, ensuring systems are secure by design, mitigate vulnerabilities, and are compliant with security policies and standards. Follow secure coding practices, adhere to coding standards, and leverage version control, automated testing, and test-driven development to produce high-quality, secure, and maintainable code. Use AI-assisted tools to accelerate development, validation, and troubleshooting. Participate in pair programming sessions as appropriate to write code and resolve deployment issues.
Provision and manage server infrastructure : Deploy and manage Windows and Linux servers across a hybrid environment that includes Microsoft Azure and over a dozen geographically dispersed on-premises locations. This includes ensuring that all systems are secure by design, follow zero trust principles, and are scalable, observable, and aligned with business needs. Provision infrastructure with reliability, maintainability, and consistency in mind, and implement observability prior to production to support proactive monitoring and data-informed decisions. Collaborate with cross-functional teams and stakeholders throughout the infrastructure lifecycle to ensure solutions align with customer needs; prioritize high-value work, assess feasibility, and conduct security reviews of new systems and applications; deliver exceptional customer service and maintain clear communication to support successful outcomes.
Automation : Automation is not just a task; it is a mindset and a strategic enabler of reliability, consistency, and scalability. Design and implement solutions that make work easier, reduce manual effort, improve system reliability, and streamline operations across provisioning, configuration, monitoring, and remediation. Use AI, scripting, workflow automation, or robotic process automation (RPA) tools to reduce operational overhead and accelerate delivery. Use observability tools to monitor automation performance, ensure reliability, and identify data-informed opportunities for continuous improvement. Collaborate with peers and stakeholders to prioritize high-value automation opportunities and ensure that solutions are effective, secure, and aligned with business needs.
Network administration : Manage and troubleshoot enterprise-grade network infrastructure, including wireless access points, switches, routers, load balancers, and next-generation firewalls. Diagnose and resolve network issues using packet captures, OS command outputs, diagnostic consoles, logs, or other tools. Leverage network observability tools to make data-informed decisions and identify opportunities for improvement. Implement and maintain security measures to protect data, systems, and network availability. Collaborate with network and security teams to validate new systems and configurations, expand observability, reduce exploitable vulnerabilities, implement security controls, and enhance system resilience and usability for customers.
Documentation and process improvement : Create and maintain clear, concise documentation for knowledge sharing, process repeatability, and operational continuity. Develop system diagrams, deployment guides, and standard operating procedures (SOPs) that support usability, compliance, and reliability. Continuously refine documentation and processes as systems evolve, incorporating feedback and lessons learned. Ensure all procedures align with FSU ITS Security Policies and Standards. Participate in peer reviews to validate documentation for accuracy, clarity, and usability.
Support and incident response : Respond to system alerts, outages, and support requests in accordance with established incident management procedures, collaborating with peers and stakeholders to ensure rapid resolution. Use observability tools to support rapid diagnosis and resolution, and create new monitoring as needed to improve visibility. Participate in post-incident reviews, highlighting key data points and observability insights to identify root causes and opportunities for system or process improvements. Implement improvements to prevent the recurrence of issues and to enhance system reliability. Participate in an on-call rotation, typically one week per month, which includes after-hours support for deployments, changes, or incidents, including on holidays and weekends. Actively work to reduce the need for after-hours assistance by leveraging automated deployment solutions, improving system reliability, and lowering the risk and complexity of changes. Assist with IT security investigations as needed. Ensure incident response processes align with the expectations of IT management, technical teams, and customers.
Professional development : Continuous learning and technical curiosity are key expectations of this role. Complete both assigned and self-directed professional development to stay current with evolving technologies, tools, and practices. Explore technical subjects that interest you, even beyond current projects. Use provided learning platforms, such as LinkedIn Learning. Participate in the ITS Professional Development Bonus Plan by completing manager-approved certifications. Pursue relevant training, certifications, and conferences aligned with team goals, subject to approval. Approved training resources will be paid for by the organization. Research and validate emerging tools, including AI, automation, observability, and other innovations, to assess their value for our organization. Apply a mindset of rapid experimentation using data to guide decisions, improvements, and the next experiment. Participation in knowledge-sharing sessions, communities of practice, and collaborative learning opportunities is encouraged.
Qualifications
Bachelor's degree in Computer Science, MIS, or other appropriate degree and two years of experience, or a high school diploma or equivalent and six years of experience. (Note: or a combination of appropriate post-high school education and experience equal to six years.)
Preferred Qualifications
Helpful
Why is this role important to the organization and its mission?
This is a high-impact position where your work supports clinicians caring for patients today, researchers working to improve health outcomes in the future, and students preparing to serve in clinical settings after graduation. You will have the opportunity to shape infrastructure strategy, contribute to architectural decisions, and lead automation efforts that improve reliability and efficiency. Whether you are already working in an InfraOps or SRE role, or you are a systems administrator ready to take the next step, we encourage you to apply.
Who is an ideal candidate for this position?
The ideal candidate is someone who is genuinely passionate about IT, someone who has probably built a home lab (or cloud lab) just for fun, and gets excited about scripting, automation, and making systems run better. They love learning new technologies and finding better ways to get things done, whether that is writing a PowerShell script, setting up workflow automation to schedule a calendar appointment, or integrating AI into their workflow to save time and reduce toil. They are the kind of person who starts the day looking forward to pushing code to the git repo, watching it flow through the deployment pipeline, and seeing it make a real impact on clinical, research, and education efforts.
This is a role for someone who loves solving puzzles, thrives in both independent and collaborative settings, and gets excited about working with modern tools to make infrastructure smarter, faster, and more reliable. They know how to communicate clearly, whether it is writing documentation, contributing in team meetings, or breaking down complex technical ideas for someone who just needs the bottom line. They understand how to capture and interpret network traffic when things go sideways, and they know how to turn that data into action. They are curious, self-driven, and motivated by the idea that their work helps support something bigger than just infrastructure: it enables clinicians caring for patients today, researchers working to improve health outcomes in the future, and students who will one day serve in clinical settings after graduation.
What is a typical day in this position?
A typical day begins with hands-on work such as writing scripts, deploying infrastructure, or addressing tasks that have not yet been automated. You may manually configure a Windows Server 2016 system as part of legacy support, or deploy a new Windows Server 2025 instance using the latest automation tools. After completing any manual work, you will often begin designing an automated solution to replace it, write the necessary scripts, and submit a pull request to improve future efficiency for the team.
You may participate in a pair programming session focused on scripting or automation, helping to refine logic, improve readability, or troubleshoot unexpected behavior. Throughout the day, you might respond to an automated alert or assist another IT team with a technical issue. When incidents arise, you will work alongside the rest of the team to investigate and resolve the issue, often using tools like Elastic to gather context and identify root causes. Each incident triggers a post-incident review, where the team reflects on what did not go as planned and then takes action over the following days to prevent similar issues from occurring again.
Toward the end of the day, you will join a team meeting to share updates, align on progress, and discuss priorities. You will also take time to plan and prioritize your work for the next day. Once a week, you will meet one on one with your supervisor to check in, share feedback, and work on removing any roadblocks that may be slowing progress.
What can I expect in the first 60-90 days?
In your first few weeks, you will work through onboarding tasks, get familiar with our environment, and shadow team members to learn how we approach infrastructure, automation, and reliability. You will begin writing scripts in your first week and gradually take on more responsibility as you gain context. By the end of your first month, you will be contributing to deployments, writing automation, and participating in troubleshooting efforts using modern tools and practices.
You will be integrated into broader initiatives and ongoing projects early on, contributing where your skills align and learning from the team along the way. While our project list is always evolving, one constant is the hands-on work of provisioning and configuring new or replacement infrastructure; we always use the latest supported Windows Server or Linux systems, whether on-premises or in Azure. You can expect to be involved in meaningful, high-impact work from the start, with opportunities to shape and automate the infrastructure that powers our organization.
We value the unique experience and perspective each team member brings, and we look forward to learning from you just as much as you learn from us. Our collaborative and innovative culture encourages finding better ways to accomplish work and achieve goals. You will play a key role in helping us improve our processes and remove barriers to success. As you build confidence and context in your first few months, you will be fully supported as you prepare to join our on-call rotation, where your contributions will directly support our reliability goals. We are committed to fostering a goal-oriented, easy-going environment where you can thrive.
University Information
One of the nation's elite research universities, Florida State University preserves, expands, and disseminates knowledge in the sciences, technology, arts, humanities, and professions, while embracing a philosophy of learning strongly rooted in the traditions of the liberal arts and critical thinking. Founded in 1851, Florida State University is the oldest continuous site of higher education in Florida. FSU is a community steeped in tradition that fosters research and encourages creativity. At FSU, there’s the excitement of being part of a vibrant academic and professional community, surrounded by people whose ideas are shaping tomorrow’s news!
Learn more about our university and campuses.
FSU Total Rewards
FSU offers a robust Total Rewards package. Visit our website to learn more about our Compensation, Benefits, Wellness, Recognition, and Employee Development programs.
Use our interactive tool to calculate Total Compensation options based on potential salary, benefits and retirement contributions, earned leave, and other employment-related perks.
How To Apply
If qualified and interested in a specific job opening as advertised, apply to Florida State University at https://jobs.fsu.edu. If you are a current FSU employee, apply via myFSU > Self Service.
Applicants are required to complete the online application with all applicable information. Applications must include all work history up to ten years, and education details even if attaching a resume.
Considerations
This is an A&P position.
This position requires successful completion of a criminal history background check.
This position is open until filled.
This position has been designated as eligible for primarily remote work based on the position functions. Employees are required to live in the Tallahassee area and report to campus as needed.
Equal Employment Opportunity
FSU is an Equal Employment Opportunity Employer.