The latest webinar from RIS tackled the volatile relationship between tech innovation and stability, and how to unlock the secrets to IT flexibility.
Tim Denman: Welcome everyone to today’s “How adidas’ IT Resilience Fuels Its Digital Growth,” which is hosted by RIS and presented in partnership with Infosys. I'm Tim Denman, editor-in-chief of RIS.
Innovative forward-facing brands like adidas have experienced enormous digital growth both before and during the ongoing health crisis, which necessitated a reimagining of tech stacks to handle the massive influx of e-commerce traffic and orders. These industry leaders continue to invest in next-gen technology throughout the organization to ensure consumers continue to enjoy a seamless, memorable, and uninterrupted omnichannel experience, regardless of how they choose to shop.
With us today to explore how leading manufacturers and retailers can advance their omnichannel strategy by utilizing nimble IT infrastructure, and continue to provide a consistent shopping experience amid meteoric growth, are adidas's senior director of IT, Vikalp Yadav, and Infosys' industry principal, Vishal Vithlani. Gentlemen, thank you for joining us today. Why don't you each take a few moments to introduce yourselves and describe your role at your respective companies?
Vikalp Yadav: Hi, this is Vikalp Yadav here and I'm an engineer at heart. I currently spearhead the digital tech organization within adidas, and come with experience in driving transformational change across multi-geography engineering tech setups for various clients. For the last 22-plus years, I've been working in the retail and CPG logistics segments. I'm glad to be here today.
Vishal Vithlani: My name is Vishal Vithlani and I'm an industry principal at Infosys. I'm based in Germany, working closely with customers in continental Europe. In my role, I support our customers in a variety of topics, including DevOps, agile transformation, and digital operational excellence. My focus has been to help customers transform IT organizations, processes, and tools toward high-performing teams and enable their respective business partners. Happy to be joining this webinar.
Denman: Great to have both of you. The theme of today's webinar is IT resilience. This is a very broad subject that involves elements such as security, reliability, stability — the list goes on and on. Given today's e-commerce landscape, is there a need for us to re-look, reimagine, and ultimately, reinvest in these key areas?
Yadav: That's an interesting question especially during these COVID times. What we have seen is the macroeconomic situation has transformed from previous years.
Now, given that consumers spend a limited amount of time on your website, you have the responsibility to deliver a seamless experience so that you don't lose them. With lost time or lost consumers, you lose revenue. It's important for organizations to rethink how we ensure the consumer sticks with us. How do we make sure to deliver the most seamless experience and ensure that we have a share of their wallet?
Vithlani: I was reading through the Cisco research, which stated that by next year, 66% of the global population will have access to the internet, and networked devices per capita will increase to 3.6 from 2.4 in 2018. That means more consumers for digital services and more options. To give more options, organizations are under pressure to make sure they're providing resilient infrastructure to consumers and end customers.
The Google State of DevOps 2021 report states that, with a mature stability and resilience initiative such as site reliability engineering (SRE), the elite groups recover from incidents 6,570 times faster than organizations that are doing little toward resilience. This extra recovery time is leaving money on the table, driving away customers as Vikalp mentioned, and stretching your developers and organization.
Denman: Let's spend a little time getting deeper into this idea of a new IT structure. Can you help us define what site reliability engineering is, what it does, and how it fits in a traditional IT approach?
Yadav: This is a term that was coined by Benjamin Treynor from Google. It has been a practice for quite some time: site reliability engineering. Traditionally, this is similar to when an engineer thinks from the mindset of an operations guy.
How do you make the best use of it, depending on the maturity level you are at? For example, if some of the best practices of operations are further supplemented with best-in-class tools and technologies, how do we do that? How do we move more and more toward automation? We can talk in more detail about the pillars, but this is where you see site reliability engineering coming into play.
Denman: A follow-up question on the value of SRE: Is there a significant value-add that SRE brings even to legacy monolith systems?
Vithlani: Be it a modern application landscape or a monolith, if the operational KPIs are showing all green and the KPIs that the business cares about are showing a different picture, you can apply the SRE principles. If the business is looking at a revenue bleed happening in the system, or at NPS, CSAT, and time to market, and those KPIs are in the red while the organization is showing that everything looks great from the stability and resilience perspective, that's the first indicator that the SRE principles are needed. It's an application of software engineering to the operations problem, measuring the KPIs that matter to the business.
If I look at two or three key elements of the SREs, one important element is shared responsibilities.
Now, are we able to apply the same modern cloud SRE principle in the monolith? Probably not. We don't need that, but a lot of other principles can be relevant to the monolith organization set of teams.
Yadav: In some of my previous conversations, this question gets asked often, and it's a myth that the SRE principles or the concepts of site reliability engineering are only applicable to a microservice-based architecture. That's a myth. Ultimately, what matters is under what agreements and what service levels you are interacting with the end consumer, whether internal or external.
For example, whether my architecture is a monolith based completely on an SAP-based ecosystem or Apple-based ecosystem, or a microservice-based architecture, the concepts are applicable across the board.
Denman: We talked a lot about the idea of having strong IT fundamentals being vital to the organization's bottom line. How can retailers best match this idea of having resilient IT initiatives while also dovetailing that with the idea that a business initiative is business first? How do the IT and business departments work together under this idea of resilience?
Yadav: Typically this also connects with the thought process. In some of the conversations where I met a few people, they said: I'm in the process of transforming my operations organization, uplifting the value of how I contribute to my business. The challenge I'm facing is, how do I speak the same language as my business?
The most important thing is to identify a strong sponsor. The sponsor needs to understand the reason behind why it’s needed. An example of a traditional “why” that was worked with in the past is that the tech organization and business organization are totally disconnected. It's an over-the-fence setup with the operational organization, so the business delivers or the engineering organization delivers, and operations just support it.
They don't talk to each other. That's the most immature state of setup.
A clear set of KPIs needs to be identified in order to evolve, which would drive them together. In a prior journey, we evolved from putting the five nines to work for us, with that thought process of telling the business, “This month my platform was up 99.9%.”
The answer will be: What’s next? That's how it evolves. From there, we work toward the next set of KPIs and try to speak the same language as the business. We evolve toward the mindset of: how do you connect to the exact revenue bleed that the business associates it toward?
It was a journey of, in the interim, following best practices for time to detection and also golden signals such as latency. Nonetheless, those KPIs are how we comprehend the business and make objective decisions.
For example: I need to invest in additional things to build resilience, such as buying tools and licenses. What will the cost-versus-benefit ratio be if I invest that much?
Denman: Vishal, you have a lot of experience as a consultant, working with a variety of clients throughout your career. How do some of your clients view operations over this time span, and has this view remained the same, or have there been any drastic changes in the perspectives across the industry?
Vithlani: The keyword here is drastic changes. There have been many changes from 2015 onward, from the end consumers to the internal stakeholders. Now, the expectations of a new digital solution are very high. Consumers want highly personalized solutions, frictionless experiences, and highly reliable and stable capabilities.
Look at some of the bigger retailers or organizations. What they have been doing since 2015 or 2016 is carving out dedicated digital organizations by going through two core transformations at the same time. The first is becoming a product-led organization, and the second is shifting toward a cloud-enabled microservices architecture at various stages.
If you fast forward two or three years, this resulted in fast-paced feature teams focusing on a specific product, but we saw that there was a need and opportunity to ensure these products are operating in a frictionless manner.
A lot of Lego blocks were being built; however, as they're brought together, there are still friction points from the end-consumer perspective.
Ironically, applying the traditional ITIL service transition methodology was slowing everything down because there were heavier transition-to-operations processes, which created and elevated tension across the organization. That was the inflection point of 2016-2017, because we were already saying the traditional operations model would no longer work for a bigger organization going through the digital transformation.
In general, some retailers saw the early symptoms in the resilience data. Resilience data relying on the traditional availability KPI became questionable. We can't necessarily say that web server availability reflects the customer experience. Maybe servers are up, but the number of successful transactions that customers experienced tells a different story.
This was a trigger point of the resilience transformation journey on the customer side at the same time because we started keying in based on the principle of the service team. That's a major shift.
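The gap Vithlani describes between server availability and what customers actually experience can be sketched in a few lines of Python. This is a minimal illustration only: the `WindowStats` counters, the function names, and the sample numbers are hypothetical and do not come from adidas or Infosys.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Hypothetical counters collected over one measurement window."""
    health_checks_total: int
    health_checks_ok: int
    checkouts_attempted: int
    checkouts_succeeded: int


def ratio(good: int, total: int) -> float:
    """Fraction of good events; an empty window counts as fully healthy."""
    return 1.0 if total == 0 else good / total


def server_availability(stats: WindowStats) -> float:
    """Traditional KPI: share of infrastructure health checks that passed."""
    return ratio(stats.health_checks_ok, stats.health_checks_total)


def checkout_success_rate(stats: WindowStats) -> float:
    """Customer-facing indicator: share of checkout attempts that succeeded."""
    return ratio(stats.checkouts_succeeded, stats.checkouts_attempted)


if __name__ == "__main__":
    # Made-up numbers: every health check passed, yet 8% of checkouts failed.
    stats = WindowStats(1440, 1440, 10_000, 9_200)
    print(f"Server availability:   {server_availability(stats):.1%}")
    print(f"Checkout success rate: {checkout_success_rate(stats):.1%}")
```

With these sample numbers the availability KPI reads as fully green while the transaction-based indicator does not, which is the mismatch that makes the traditional KPI questionable.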
Denman: Let's get back to the core topic. Before we get into all the details, if you had a 5,000-foot view of IT resilience, what would be the key factors that comprise it? How would you do it? What would those factors be? On a high level, looking down IT resilience, what are the key factors that make it work?
Yadav: You touched on a very important point: what are those key pillars that help? From a resilience point of view, one of the most important things for an SRE is accepting that there will be failures in the system and the setup. The first thing I like to know is: When am I failing? How am I failing?
Observability as a pillar becomes a critical element. How fast am I able to detect, measure it, and make sure that point in time is automated? Then, subsequently, are AI-driven operations driving those elements of observability?
For example, evolving from a retrospective picture of payments that have failed in a payment gateway to identifying, from a pattern of failures in a particular service or payment provider, whether a system could go down based on the reversal rates being observed. That observability is one of the key pillars.
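As a rough illustration of moving from a retrospective view to spotting a failure pattern as it develops, the Python sketch below tracks the most recent payment outcomes for one provider and flags it once the failure rate in that window crosses a threshold. The class name, window size, and thresholds are assumptions chosen for the example, and a simple failure rate stands in for the reversal rates mentioned above.

```python
from collections import deque


class FailureRateMonitor:
    """Sliding-window failure-rate check for a single payment provider."""

    def __init__(self, window_size: int = 100, threshold: float = 0.2,
                 min_samples: int = 20) -> None:
        # Only the most recent `window_size` outcomes are kept.
        self.outcomes = deque(maxlen=window_size)
        self.threshold = threshold
        self.min_samples = min_samples

    def failure_rate(self) -> float:
        """Share of failed payments in the current window."""
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def is_degraded(self) -> bool:
        """True once enough samples exist and the rate exceeds the threshold."""
        enough = len(self.outcomes) >= self.min_samples
        return bool(enough and self.failure_rate() > self.threshold)

    def record(self, succeeded: bool) -> bool:
        """Store one payment outcome and return the current degraded state."""
        self.outcomes.append(succeeded)
        return self.is_degraded()


if __name__ == "__main__":
    monitor = FailureRateMonitor(window_size=10, threshold=0.3, min_samples=5)
    for ok in [True] * 5 + [False] * 3:
        degraded = monitor.record(ok)
    print(f"failure rate={monitor.failure_rate():.2f}, degraded={degraded}")
```

A production setup would feed this from real telemetry and probably use a trained model instead of a fixed threshold; the point here is only the shift from reporting past failures to alerting on a pattern while it forms.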
The second aspect was that as traditional operational organizations, we are driven through hard governance norms. If an important event or deliverable is being reached, I stop releasing. I stop making changes so that my system stays stable, but that has impacted the business’ speed of flow with newer feature changes.
Another element is that as SREs, we are tuned to think along the lines of what happens when things fail. We go in taking for granted that systems will fail in such a complex environment. My boss always says, “When you have a system that you assume will fail, now what would you do?” That's the mindset we go in with, from a site reliability perspective, where we say, “If something fails, how do we have a graceful degradation mechanism for how the service can look?”
That's the third pillar: having a strong graceful-degradation setup in place.
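One common way to express graceful degradation in code is a fallback: if a non-critical dependency fails, serve a simpler default instead of an error. The Python sketch below assumes a hypothetical personalized-recommendations call and a static bestseller list; neither is taken from the speakers' actual systems.

```python
from typing import Callable, List, Tuple

# Hypothetical static list served when personalization is unavailable.
BESTSELLERS: List[str] = ["running-shoe", "track-jacket", "gym-bag"]


def recommend(fetch_personalized: Callable[[], List[str]]) -> Tuple[List[str], bool]:
    """Return (items, degraded). Falls back to BESTSELLERS if the call fails."""
    try:
        return fetch_personalized(), False
    except Exception:
        # Swallowing the error keeps the page usable; a real system would
        # also log it and emit a metric so the degradation stays observable.
        return BESTSELLERS, True


def broken_service() -> List[str]:
    """Stand-in for a dependency that is currently failing."""
    raise TimeoutError("recommendation service did not respond")


if __name__ == "__main__":
    items, degraded = recommend(broken_service)
    print(items, "(degraded)" if degraded else "")
```

Returning the `degraded` flag alongside the items is a design choice for this sketch: it lets the caller record that the fallback was used, which ties graceful degradation back to observability.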
Of course, there are elements around how to bring in significant security controls, especially in the cyber world or e-commerce business, where there is a constant need to be on top of some of these threats. That becomes the fourth crucial pillar.
Last, but not least, is basic operational excellence in terms of the hygiene of blameless postmortems. How do you circle back and feed them into your future roadmaps?
- Observability
- Release excellence
- The graceful degradations
- Security elements
- Operational excellence across all these parameters.
— Vikalp Yadav, Senior Director of IT, adidas
Denman: Great, thank you for those pillars, they'll be very useful to the audience. I'd like to dive a little bit deeper into a couple of the concepts. Can you explain the role that observability plays in the digital world? Is observability synonymous with customer insight?
Vithlani: They definitely have a lot of analogies that will evolve in the coming days. Edge computing and similar technologies in the current landscape are bringing solutions closer to the customers. The earlier world of monitoring, where you are just monitoring technology or the web infrastructure, is no longer sufficient.
That means that with edge computing you should look at the behavior at the point where the solution reaches the hands of the end consumer. This is further increasing the surface area for observability across the globe. It may also overlap with some of the standard analytics capabilities that the organization develops.
Additional value chain. Look at it from the product-led organization: we see that the additional value chain now consists of the various products. I took the example of the various Lego blocks coming together. While this transformation has happened, organizations are at various maturities of resilience, using SRE or any other principle. Resilience has been gaining attention at the product level, but friction continues to exist in the interaction between products at the value chain level.
Every day on my LinkedIn page, I see that different organizations have gotten into the metaverse. Look at the value streams going beyond the organization's regular technology stack.
Intra-value stream frictions are going to become the norm in the distributed agile technology product stack. To solve this situation, the whole setup needs a complete look, which is going to come down to three or four key pillars. Some of them may already overlap with the core pillars. One is business alignment.
Apart from the core observability focus in the respective product for the product owners, we need to look at a value stream level and a larger perspective. The KPI alignment will happen; you'll see it already aligning with some of the consumer insight and analytics, because that's what the SRE should look at to make sure the respective value chain is successful.
Historically, in the organization, there would already be an analytics initiative aggregating a lot of data, which is put in front of the business decision-makers. The need has become immediate and real-time, meaning all the digital components should start creating that aggregation of the data for immediate response or alerting at the complete value chain level. That's where another difference comes up: observability in the resilience context is becoming real-time.
The third important element is to invest in the AI/ML model. You may have an AI and ML model running, which is going to give you a lot of interesting data to consume and make a business decision. However, the decision-making happens only after two, five, or seven days.
Yadav: A very interesting point. While you talk about observability from a technical complexity point-of-view, in a value stream, how would you set that observability up? One of the dimensions of observability is how to make sure that the end customer, end consumer, is aware of how this observability is impacting him or her. How do they realize this value?
On that point, the consumer dimension of this observability revolves around these questions: How do you have the right tools in place? How do you have the right touchpoints in your e-commerce consumer journey where they can feed in inputs about what works, what doesn't work, or how the experience is?
How do you leverage those inputs? There are several social listening tools available that you can connect directly into the ML operations model that you are highlighting. How do you use those tools, connect them into your data setup, and then provide a better experience to your end consumer?
Denman: You both touched on this, but what are your views on the future observability, and how do you think the customer journey will evolve to take it to the next level, specifically in respect to observability?
Yadav: We have been intensely discussing that recently.
In one of my experiences, we were doing some statistics around these topics in a semi-mature organization, along the lines of the DORA report from Google that Vishal mentioned, and 70% of the issues you detect are bleeds that are already happening.
These are the issues you're detecting more than 150 minutes after they have happened. The faster you detect, the lower the bleed is. In a future state scenario, how are we leveraging these machine learning models, and training them to predict these golden signals, which would help derive or foresee some of these forthcoming interruptions in the consumer value stream?
Vithlani: Vikalp already covered some of the elements. To detect a deep-rooted service interruption at an integration point in a fraction of a second, or within seconds: that's the future, and organizations are already on that journey.
This is very different from the observability and monitoring phase. Identifying that my server has gone down or that my endpoints are not working is far from that. We are talking about a fraction of consumers being impacted across the globe in some specific geography subset. Today I find out about that situation only after many days; that's the reality. The systems are getting extremely complex.
As your systems become more interconnected, you need to be there to support those changes.
Denman: Well said. Obviously, IT stability is very important to retail. You can't have the systems falling apart, but innovation is just as important, both for what we're talking about now and what we're doing in the future. Both stability and innovation are very important and somewhat contradictory to each other. How can organizations handle the trade-off between the two?
Yadav: Innovation is something that helps you drive fast-paced adoption, but there are some watch points and callouts for that. Let's take an example of one of the toughest concepts that we have seen in the past to implement: the error budget concept. How do you innovatively drive the thought process, the mindset of the error budget? What is the error budget?
I'm oversimplifying for the sake of this discussion, but out of the amount of downtime apportioned to you, how much are you already using up through service interruptions? For example, 99.99% is what I've committed to as part of my service-level objectives. That remaining 0.01%, how many minutes does that amount to? How do I keep reducing that further?
If I'm better than what I've committed, how do I leverage that remaining amount? These are some of the topics that have triggered thoughts on how to use the error budget. One of the elements is: why not leverage the remaining time to innovate and bring in chaos engineering practices? How can I be prepared for failing in the future?
This also drives the team to think about how to make those tests in the release change much more automated. These are some of the potential areas where we explore innovation and bring the community of practice together to help as well as innovate.
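To make the error budget arithmetic from Yadav's example concrete, here is a minimal Python sketch. It assumes a 30-day measurement window and treats the budget purely as allowed downtime; the function names and the 1.5 minutes of recorded downtime are illustrative, not figures from the webinar.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Total downtime an availability SLO allows over the window, in minutes."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Unspent error budget in minutes; a negative value means the SLO is breached."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


if __name__ == "__main__":
    for slo in (99.9, 99.99):
        allowed = error_budget_minutes(slo)
        print(f"{slo}% over 30 days allows {allowed:.2f} minutes of downtime")
    # Hypothetical month with 1.5 minutes of recorded downtime.
    print(f"Remaining at 99.99%: {budget_remaining(99.99, 1.5):.2f} minutes")
```

Over 30 days, a 99.99% objective leaves roughly 4.3 minutes of budget and a 99.9% objective roughly 43 minutes. Whatever remains unspent is the room a team has for riskier releases or chaos engineering experiments.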
Denman: Vishal, sticking with the topic of release excellence, in your career you've been the center of a large number of releases. In your experience, can you tell us what leading organizations have done to optimize these releases and achieve excellence?
Vithlani: Again, I will share some of my experiences of how organizations were transforming themselves into a product-led and cloud-enabled microservices environment, which created further value acceleration. During the course of the journey, a couple of elements have gone into the release maturity level. Of course, one is making sure the OLS pipelines are automated and testing is integrated.
Even now, security elements are part of development and technology operations, which is becoming the norm. What is also the norm is to have the strong mindset of first trying a subset of the releases with a specific market group or geography-specific segment. That's becoming a norm in mature organizations. There's no discussion of how anymore, just of when and what.
Similarly, blue-green deployments are also becoming normal in mature environments. Some of the customers are still in the process, so they have completed the standard automation into their legions, and are trying to bring up the excellence around QA elements and integrating. Those will also become normal in the next 12-18 months.
Organizations still need to strengthen further. Although they're moving toward faster deployment, that requires a faster rollback procedure when things go wrong.
That also goes back to the innovation element. If a company is ready to innovate, does it have the right observability and the right rollback procedure if something goes wrong? It's not a question of when to do that, but how we do that.
Reproducible builds and immutable infrastructure for a quicker recovery of systems when things go wrong: these should be there for your rescue. Make sure that there is no planned downtime; it should immediately come off the plate as part of the 2022 or 2023 strategy.
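The combination Vithlani describes, releasing to a subset of markets first and having a fast rollback ready, can be sketched as follows in Python. The `Rollout` class, the market codes, the version labels, and the 2% error threshold are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Rollout:
    """Serves a candidate release to selected markets, with a simple rollback rule."""
    stable: str
    candidate: str
    canary_markets: Set[str] = field(default_factory=set)
    max_error_rate: float = 0.02  # assumed tolerance for the candidate release

    def version_for(self, market: str) -> str:
        """Canary markets get the candidate; everyone else stays on stable."""
        return self.candidate if market in self.canary_markets else self.stable

    def evaluate(self, observed_error_rate: float) -> bool:
        """Roll back (clear the canary set) if errors exceed the tolerance.

        Returns True when a rollback was triggered.
        """
        if observed_error_rate > self.max_error_rate:
            self.canary_markets.clear()
            return True
        return False


if __name__ == "__main__":
    rollout = Rollout(stable="v1", candidate="v2", canary_markets={"NL"})
    print(rollout.version_for("NL"), rollout.version_for("DE"))
    rolled_back = rollout.evaluate(observed_error_rate=0.05)
    print(rolled_back, rollout.version_for("NL"))
```

The rollback here is only a routing change back to the stable version, which is what keeps it fast; it presumes the stable release is still deployed and that observability supplies a trustworthy error rate.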
Denman: What challenges do you think the e-commerce industry will face as more and more businesses across the world adapt and embrace the digital marketplace? What does the future hold for SRE? We probably could break that into two, let’s start with the first one: challenges that the e-commerce industry faces.
Yadav: In 2022, the biggest challenge is that the supply chain is going to be tremendously constrained because of where most of the organizations are focused, or leveraging the inventory. How do you circumvent these inventory challenges?
The consumer has a limited amount of money to spend on buying things, but as the supply chain is constrained, the cost of it would subsequently be associated with the supply chain element. If I'm able to buy four shoes for €100, I may only be able to buy three shoes because of the cost associated with calling.
The second challenge is that many organizations are evolving at a very fast pace. How do you ensure the amount of traffic that you have is the right share of that market? These are a couple of the challenges. Through a best-in-class, uninterrupted, and premium experience for the end consumer, we can drive a strong e-commerce business in 2022 and beyond.
Denman: So, what does the future hold for SREs, and are there any benchmarks from the path that you're on going forward that you'll be able to use to track success?
Yadav: The future of SRE. As things evolve, as the maturity of how a product- or a project-led organization is evolving, there's a shift more toward how the site reliability engineers are getting closer and associated within the team setup. How do they drive value for a product or a project? These are some of the elements that gain importance in the coming year, especially from a value stream perspective. How would they play a crucial role in serving and uplifting the value stream? These are some of the elements that are getting recruited now.
In the future, that fair balance between a T-shaped organization — what would that look like? What kind of depth of a skill would you need?
Denman: We are almost out of time, but I would like to give each of you the opportunity for a closing statement. Vishal, anything in the way of a wrap-up or last thoughts?
Vithlani: The systems are going to become more interconnected. Value streams are going to go beyond the organization's boundaries. To complete a transaction, organizations will hop in and out from their technology systems to SaaS solutions and open source solutions, and come back. We are seeing commerce begin to integrate with Instagram, TikTok, Shopify, and Pinterest, just to use examples from a retailer's world. They're trying to sell things on the metaverse already.
The whole landscape complexity will increase multifold in the coming days and years, so the skill requirement to have the T-shaped organization will become more important. The other element that will converge into this space is the data and privacy, which will also start converging back into the standard SRE capabilities. These are the two important takeaways that I would give to everyone.
Denman: Thank you. Vikalp, any closing thoughts?
Yadav: The only takeaway for me is that these transformation journeys are constant. They will keep continuing year after year after year. Everything is interconnected with a complex micro-system. Start small and with the right level of MVP, then get buy-in and a sponsor, and then expand. That's my key takeaway as a practitioner: that, and speak the business language.
Denman: Well, unfortunately we are out of time. It was a great conversation. Thank you, Vikalp and Vishal. I appreciate you joining us, and of course, thank Infosys for the continued partnership.