DevOps Journeys - Guest Blog with Tom Haynes | Linux Recruit

DevOps Journeys - Guest Blog with Tom Haynes


“The transformation from traditional infrastructure management to self-service platforms represents one of the most significant shifts in modern DevOps practices.”

This perspective shapes our latest guest blog, where Tom Haynes delves into how this evolution has redefined the role of Platform Engineering, enabling developers to take control of their application infrastructure while maintaining governance and security.

Tom highlights how self-service platforms, like ITV’s Internal Developer Platform, have empowered developers to manage workflows and observability via simple configuration, increasing agility and reducing time-to-market. He also explores the challenges of balancing developer autonomy with operational excellence, emphasising the importance of “making it easy to do the right thing” while fostering sustainable practices and innovation.

DevOps has transformed the way organisations build, deploy, and manage software. In your experience, how has the DevOps landscape evolved in recent years, and what do you see as the most significant shifts driving its growth and adoption today?

A common phrase describing the DevOps approach is "You build it, you run it", with developer teams actively building and deploying their applications. I have found that over the years the definition of what "it" is has been expanding. Initially, for many teams this meant compiling and packaging their application code and deploying it via provided CI/CD pipelines; the underlying infrastructure was still the domain of dedicated DevOps or Platform Engineers who were embedded in developer teams. As supporting PaaS and SaaS solutions have improved, it has become simpler for developers to own more and more of the infrastructure supporting their applications, and to define their application workflow and observability functionality via simple configuration files.

To this end, "self-service" has been a focus for our Platform Engineering teams over the last five years. We aim to reduce dependencies on DevOps Engineers via an Internal Developer Platform (IDP) that makes it easy for developers to do the right thing, speeding up time to market without reducing operational excellence. Infrastructure remains conceptually complex, however, despite the improvements in supporting tooling. I see the role of our Platform Engineers becoming less about directly building and configuring infrastructure for development teams, and more about developing the IDP that enables developers to do this themselves. They serve to guide developers on architectural choices and to debug difficult issues.

These days most developer teams do not have embedded Platform Engineers: they are able to build and run most of their application infrastructure without them, and can lean on a central Platform Engineering team for more nuanced requirements.

How are you leveraging platform engineering to enhance developer experience and streamline operations? What challenges and benefits have you encountered in implementing this approach?

At ITV we have followed Platform Engineering paradigms for 10 years. Over this time the platform has evolved from a fairly static estate provisioned by Platform Engineers into a rapidly scalable, secure, easy-to-change environment that developers are able to control directly. The latest version of our platform, released at the end of last year, focusses on developer self-service via an Internal Developer Platform (IDP). We enable developers to quickly create and manage their application infrastructure, including workflow and observability functionality, via a catalogue of Reference Architectures that can be instantiated through a standardised set of CI/CD pipelines. These run within RBAC- and ABAC-based IAM permissions boundaries that grant least-privilege access.

The majority of infrastructure change is now made by non-Platform Engineers, and we have seen significant improvements in metrics such as frequency of infrastructure change (a 5x increase). A key challenge for Platform Engineering has always been the need to cater for a wide range of developer appetites and abilities. They are rarer these days, but we still get developers who don't want to leave the comfort of their IDE; we also have developers who want root access to manage everything. Accordingly, an IDP should aim to make it as easy as possible to manage common application use-cases, while allowing "power users" to go beyond them (and requiring that they accept accordingly greater responsibilities).

With the increasing emphasis on developer productivity and autonomous teams, how do you balance self-service models with governance and security requirements?

Balancing autonomous self-service with operational excellence is a central tenet of a good IDP: at a high level it should "make it easy to do the right thing". Our catalogue of Reference Architectures provides a set of approved, security-scanned implementations that can be quickly self-served by developers. Infrastructure change processes are mediated by standardised CI/CD workflows that bake in policy-as-code security scanning and require peer review from an approved list of reviewers.
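To make the policy-as-code idea concrete, here is a minimal sketch of the kind of check such a CI/CD workflow might run against a Terraform plan rendered to JSON (via `terraform show -json`). The specific rules, resource types, and field names are illustrative assumptions, not ITV's actual policies.

```python
# Minimal policy-as-code check of the kind a CI pipeline might run against a
# Terraform plan rendered to JSON. Rules here are illustrative, not ITV's.

def violations(plan: dict) -> list[str]:
    """Return a list of policy violations found in a Terraform plan."""
    found = []
    for rc in plan.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after") or {}
        # Rule 1: S3 buckets must not be publicly readable.
        if rc.get("type") == "aws_s3_bucket" and after.get("acl") == "public-read":
            found.append(f"{rc['address']}: public-read ACL is not allowed")
        # Rule 2: security group rules must not open SSH to the world.
        if rc.get("type") == "aws_security_group_rule":
            if after.get("from_port") == 22 and "0.0.0.0/0" in (after.get("cidr_blocks") or []):
                found.append(f"{rc['address']}: SSH open to 0.0.0.0/0")
    return found

# A tiny, hand-built plan fragment containing one violating bucket:
plan = {
    "resource_changes": [
        {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
         "change": {"after": {"acl": "public-read"}}},
    ]
}
print(violations(plan))  # one violation reported for the public bucket
```

In a real pipeline a non-empty violation list would fail the job before peer review, so reviewers only ever see changes that have already passed the baseline policy.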

In practice, ensuring that security controls do not interfere with delivery requires being judicious about where and how those controls are implemented. Our platform gradually enforces more controls the closer a developer comes to production: from root access in structured sandbox accounts, to deploying from branches in development, to deploying only peer-reviewed, security-scanned, released versions into production. We aim for a "fast start, consistent finish".
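The "fast start, consistent finish" progression can be sketched as a simple policy table that a deployment gate consults. The environment names and the exact rules are illustrative assumptions; the point is only that requirements tighten monotonically towards production.

```python
# Sketch of progressive deployment controls: requirements tighten as the
# target environment approaches production. Names and rules are illustrative.
CONTROLS = {
    "sandbox":    {"release_only": False, "peer_review": False, "security_scan": False},
    "dev":        {"release_only": False, "peer_review": False, "security_scan": True},
    "staging":    {"release_only": True,  "peer_review": True,  "security_scan": True},
    "production": {"release_only": True,  "peer_review": True,  "security_scan": True},
}

def may_deploy(env: str, source: str, peer_reviewed: bool, scanned: bool) -> bool:
    """Gate a deployment against the controls for its target environment."""
    rules = CONTROLS[env]
    if rules["release_only"] and source != "release":
        return False
    if rules["peer_review"] and not peer_reviewed:
        return False
    if rules["security_scan"] and not scanned:
        return False
    return True

print(may_deploy("dev", "branch", False, True))        # True: branch deploys fine in dev
print(may_deploy("production", "branch", True, True))  # False: releases only
```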

In the era of GitOps and Infrastructure as Code (IaC), how do you ensure best practices for version control, automation, and consistency across environments?

We try to follow the "everything as code" principle - we control the configuration of our IaC repositories using IaC! The safe-settings project is a great tool for enforcing GitHub policy-as-code and repository configuration. Our environments are built from the ground up to be as similar as possible, using the same Terraform modules throughout, and deployment workflows are designed to use the same promotion logic whether we are deploying infrastructure or application changes.

We also check environment consistency when attempting to promote an application version into production, ensuring that the testing it has been through is representative, and we include validation to confirm that these standardised approaches are followed.
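A promotion-time consistency check of this kind can be as simple as diffing the environments on the dimensions that make testing representative. The keys and versions below are illustrative assumptions, not ITV's actual configuration.

```python
# Sketch of a promotion-time environment consistency check: before a version
# is promoted, confirm the environment it was tested in matches production on
# the dimensions that matter. Keys and values are illustrative assumptions.

def consistency_drift(tested_env: dict, prod_env: dict, keys: list[str]) -> list[str]:
    """Return the keys on which the tested environment differs from production."""
    return [k for k in keys if tested_env.get(k) != prod_env.get(k)]

staging = {"terraform_modules": "v4.2.0", "k8s_version": "1.29", "replicas": 2}
prod    = {"terraform_modules": "v4.2.0", "k8s_version": "1.28", "replicas": 6}

# Replica counts may legitimately differ; platform versions must not.
drift = consistency_drift(staging, prod, ["terraform_modules", "k8s_version"])
print(drift)  # the k8s_version mismatch would block or flag the promotion
```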

As sustainability becomes increasingly important in tech, how are you integrating GreenOps practices into your DevOps workflows? What strategies are you using to optimise energy efficiency and reduce environmental impact while maintaining high performance?

There is a convenient symbiosis between GreenOps and the pay-as-you-go costs of public cloud providers: as we strive to reduce the cost of our estate, we also reduce its environmental impact. Recently, considerable work has gone into optimising our Kubernetes usage, focussing on a few areas. Our clusters are now managed by Karpenter, allowing us to use more spare AWS Spot capacity, even in production. It has also enabled us to migrate our support workloads onto Graviton instances, which are considerably more efficient. Finally, we have introduced Goldilocks to automatically right-size our support workloads.

We have a dedicated Cloud Financial Management (CFM) function that drives optimisation across the platform. They gather usage-efficiency data, including the estimated slack cost, and surface this information to developer teams to incentivise right-sizing of application workloads. CFM are also introducing anomaly alerting to pick up unexpected spikes in cost as soon as possible.
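"Slack cost" here means the cost of capacity that was requested but never used. A back-of-the-envelope version of the calculation looks like this; the per-vCPU price and the workload figures are illustrative assumptions, not real ITV numbers.

```python
# Sketch of a "slack cost" estimate: the monthly cost of CPU that a workload
# reserved but left idle. Prices and usage figures are illustrative.

def slack_cost(requested_cpu: float, used_cpu: float,
               cost_per_cpu_hour: float, hours: float = 730) -> float:
    """Monthly cost of vCPU that was requested but sat unused."""
    return max(requested_cpu - used_cpu, 0.0) * cost_per_cpu_hour * hours

# A workload requesting 4 vCPU but averaging 1 vCPU of real usage,
# at an assumed $0.04 per vCPU-hour:
print(round(slack_cost(4.0, 1.0, 0.04), 2))  # 87.6 per month of pure slack
```

Surfacing a number like this per workload is what makes right-sizing an easy conversation: the savings from tightening a resource request become visible before anyone touches a manifest.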

We have introduced Infracost into our infrastructure pull-request workflows. This tool estimates the expected cost of a given change, effectively "shifting left" efficiency considerations by highlighting the impact to developers. We plan also to use it to alert our CFM team to proposed expensive changes.

What are some of the most common reasons DevOps initiatives fail, and what lessons have you learned from these failures that could help others avoid similar pitfalls?

A key ingredient in the successful adoption of large DevOps initiatives is ensuring that the correct Business Change processes are in place. In the past, excellent work has failed to realise its potential because developer teams were unaware either of its existence or of its value. Recently we rolled out the new generation of our internal platform with support from a Business Change team, who used the Prosci ADKAR model to drive the initiative through a successful rollout. Another important lesson we've learned during the evolution of our platform is that interfaces have to be drawn in the correct places.

For example, developer teams all want to configure their build, test, and deploy pipelines slightly differently. Initially elegant attempts to allow self-service CI/CD pipeline configuration grew bloated with configuration options, and became cumbersome, confusing, and hard to maintain.

Now, we instead look to identify areas where standardisation is either required (e.g. for compliance reasons) or beneficial (e.g. serving common use-cases). In the above example we now allow development teams the freedom to configure their build and test pipelines as they like (with GitHub Actions), and instead set specifications that the build artefacts must meet and mandate security-scanning checks. The deploy stage of the pipelines remains more opinionated, ensuring consistency in how write actions are made to our environments.
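Drawing the interface at the artefact, rather than at the pipeline, can be sketched as a simple specification check at the boundary between build and deploy. The required fields below are illustrative assumptions about what such a specification might demand.

```python
# Sketch of "freedom in build, specification at the interface": teams build
# however they like, but the artefact handed to the deploy stage must meet a
# spec. The required fields are illustrative assumptions.
REQUIRED_FIELDS = {"image", "version", "sbom", "scan_passed"}

def meets_spec(artefact: dict) -> bool:
    """An artefact is deployable only if it carries all required metadata
    and has passed security scanning."""
    return REQUIRED_FIELDS <= artefact.keys() and artefact["scan_passed"] is True

ok = {"image": "registry/app", "version": "1.4.2",
      "sbom": "app-1.4.2.spdx.json", "scan_passed": True}
print(meets_spec(ok))                         # True
print(meets_spec({"image": "registry/app"}))  # False: missing metadata
```

The deploy stage then only ever sees artefacts that satisfy the contract, which is what lets it stay opinionated without constraining how teams build and test.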

Looking ahead, how do you envision the future of DevOps evolving over the next few years? What emerging technologies, practices, or cultural shifts do you believe will have the biggest impact on the way we approach DevOps?

It is hard to see past the current surge in AI tooling when looking to the future. For us its usage to date has largely been assistant-tech, such as coding assistants, which are clearly able to significantly increase the output of our engineers. Such a huge new technology domain unsurprisingly comes with a lot of hype, and a lot of potential snake-oil salesmen! That said, I expect the AIOps marketplace to mature over the next few years, with tooling that could revolutionise infrastructure management. An interesting ongoing debate is that between SaaS and open-source approaches to Platform Engineering. The market for SaaS solutions has grown to enormous proportions, and with Platform Engineering concerns being so disconnected from direct business value, it can be very difficult to judge whether these solutions provide value for money.

We continue to believe there is space for both of these approaches within a cloud platform solution, each having pitfalls and advantages. While massively simplifying operational overhead, SaaS tooling can lock teams in, and costs can spiral. Conversely, we have seen examples of brilliant open-source tools that "just work" for years with very little overhead, as well as sensitive, complicated services that require significant investment from teams to operate. The strength of the community behind a tool is often the most important metric to assess.

The incredible success of Kubernetes shows that open standards can bridge the gaps between SaaS and open source - it will be interesting to see whether OpenTelemetry will continue to drive widespread standardisation across observability domains. We are aligning our open-source observability stacks to OTel standards, and have an aspiration to offer a "tiered" solution, with both "Silver" open-source and "Gold" SaaS options available, depending on customer requirements.

If you could automate one aspect of your job or daily routine - no matter how mundane or outrageous - what would it be and why?

Meetings! In the post-Covid world I still have a very busy calendar. I look forward to AI Assistants that can attend those meetings that are only semi-relevant, report back to me anything that I need to know, and speak up when someone there has forgotten that subnets are still a thing.

To hear Tom's insights compared with 7 other industry leaders, download DevOps Journeys 4.0 today. DevOps Journeys provides a roadmap to navigate evolving challenges and stay ahead of the curve.

Whether you’re advancing your DevOps skills or initiating digital transformation, this resource is invaluable for every DevOps enthusiast.
