WeTransfer's journey to cut costs, not corners

EPISODE 1 49 mins Nov 25, 2024


About this episode

Join Simon Elisha and Dr. Werner Vogels, as they kick off a new mini-series, “The Frugal Architect,” with Dan Conti, whose journey from the constrained world of embedded systems to WeTransfer’s cloud architecture is a masterclass in frugal innovation. In his early career, Dan wrestled with MP3 players where 48K of RAM was a luxury. This experience cultivated a deep appreciation for efficiency that would prove crucial at WeTransfer. As CTO, Dan applied this resourceful mindset to cloud computing, where easy scaling often masked hidden waste. His team’s efforts to optimize storage, improve observability, and align technology with environmental values showcased that in the cloud, frugality isn’t about penny-pinching—it’s about squeezing every ounce of value from your architecture – and doing it sustainably.

HOSTS

Dr. Werner Vogels — CTO, Amazon

Simon Elisha — GM, AWS Podcasts

GUEST

Dan Conti — Former CTO, WeTransfer

Episode Transcript

This transcript was generated automatically and may contain minor errors.

Simon Elisha: Good day everyone, welcome to the official AWS podcast. I’m your host here, and I’m happy to bring you a special series with Werner Vogels. Here’s Werner to tell you all about it.

Werner Vogels: Thanks Simon. Welcome to the Frugal Architect podcast where we dive into the journeys of technology leaders building cost-aware, sustainable and modern architectures. These are longer form conversations where we explore these topics in depth, and I hope you enjoy them.

SE: And coming to us from a very glamorous remote island location that he may or may not reveal during the episode, Dan Conti, who’s the former CTO at WeTransfer. Good day, Dan, how are you going?

Dan Conti: Howdy, Simon, thanks for having me.

SE: Thank you for joining us. Now we are talking about frugality, one of the topics near and dear to all Amazonians’ hearts as one of our leadership principles, but more importantly, something really important to architects, and Werner, of course, you introduced this concept at re:Invent back in 2023. And we continue to dive deep on this topic, and the goal of this series is to really talk with some of our most amazing customers who have done this and want to share the lessons. As we go into this conversation with Dan, what’s the mental model that you want listeners to have in their minds as they’re thinking about what we’re gonna be talking about today?

WV: Well, I think most importantly is that, we hear the stories of people that have, that have the scars. I mean, before cloud, and I think Dan can also talk about that, is, our constraints breed a lot of creativity. Now, we always had to live within constraints. Cloud was amazing in that it sort of removed all those constraints and suddenly moving fast became way more important and innovating really fast and doing all the things that you couldn’t do when you were constrained. But that sort of left the art of being creative within constraints off the table. And I think in the past year since, since 2020, 2022, we clearly saw that a lot of our customers started to scratch their head and say, is this really the amount of money that I should be paying for this architecture. Are we architected in the right way? And that’s why we sort of took a wee step back and sort of try to lay out a bunch of principles around what we then called the frugal architect. Now, frugal doesn’t mean cheap. Frugal means absolute value for your money. Yeah, and I think that’s sort of really starting to think also as architects that we should have the idea of cost and with cost, I also think as a proxy for sustainability in mind. So. That was what set this all off and uh one of, one of the documents that we highlighted on stage was WeTransfer and as being one of the oldest uh European companies on AWS, they had a long history and great stories to tell.

SE: Absolutely, and it’s a great call out about the, I guess longevity of of WeTransfer. I was thinking myself back to when was the first time I ever used WeTransfer, it’s one of those services that, I think all of us in IT have been like, darn it, I’ve gotta send a really large file to someone. I have no way to do it. What’s available? And then this WeTransfer thing exists, and you sort of look at it and go, wow, this is alarmingly easy for me to use and I can just do it and it’s OK. Dan, tell us the origin story of WeTransfer, for, for me the origin story is I found it on the web and I used it many years ago, but there’s a lot more to it than that. Tell us a bit about it. Give us the, the context of, of where WeTransfer came from.

DC: Yeah, for certain, so WeTransfer definitely predates my time at the company, it started about 2009 and back then, if you’re a creative studio, you had large files you needed to send to another studio or to a client, the easiest thing to do is to put it on a drive and send it by courier uh or send it next day FedEx. It was fast, reliable, everything got there, but certainly for a lot of the studios in the time in Amsterdam, like there is an idea that maybe there should be a way to just, upload files and have people download at a later date. And so this particular studio built it as sort of a utility for themselves first, hey, can we just distribute files to our clients, can we make it easy to upload, can we make it easy for them to download at a later time? And the natural virality of all of these customers downloading kind of led to this growth over time or very quickly like people were trying it out using WeTransfer as a utility to move files. So from that early beginning, a little bit of, press happened, a little bit of, explosion of virality, and suddenly it goes from a simple utility that they were using internally to kind of a full company and a product and the way it was designed where it was so simple to actually distribute a file and upload or download and so much of the canvas of the product is actually left for advertising and made it a very attractive product for advertisers too. So free, easy to use, beautiful high quality ads, and just a really steady growth rate starting from 2009 all the way forward, so.

WV: So in the early days, it was also really driven by the creatives, much more than by the technologists. How did it become a technology company?

DC: Yeah, actually it’s a great call out because it started within a creative studio, like the, I think the first developer that worked on it was, the part-time IT person building something in Flash. It actually became more of a technology company out of necessity. So as you start to scale, you start to get more users, some of the choices you make initially just to get something working, the scaling choices break down. One of the ones that was made was, hey, like there’s, there’s never going to be more than 4 billion files on this platform, so we’ll use, integer indices and that’s all going to be fine. And then after a few years, like things start taking off, there’s no database migration strategy, how do you update the schema, like no one’s ever dealt with that. So this kind of situation where, a creative company that’s built this tool that’s actually growing and getting a lot of use, sort of necessitated some engineering internally to go build up, tooling and systems to go scale. So, 2014, 2015, like the investment in engineering where they started taking off and starting to build a culture of how to build, a scalable web application on the cloud.
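
The "4 billion files" figure matches the ceiling of a 32-bit unsigned integer key. As a rough illustration only, the sketch below estimates how long such a key space lasts; the upload rates and the function name are hypothetical and do not come from the episode.

```python
# Largest values a 32-bit integer primary key can hold.
UINT32_MAX = 2**32 - 1  # 4,294,967,295 (unsigned)
INT32_MAX = 2**31 - 1   # 2,147,483,647 (signed)


def years_until_exhausted(max_id: int, files_per_day: int) -> float:
    """Years until an auto-incrementing ID column runs out of values."""
    return max_id / files_per_day / 365


# Hypothetical upload rates, for illustration only.
for rate in (100_000, 1_000_000, 10_000_000):
    years = years_until_exhausted(UINT32_MAX, rate)
    print(f"{rate:>10,} files/day -> {years:6.1f} years")
```

The point of the arithmetic is that a limit which looks unreachable at launch can fall within a few years once usage grows by one or two orders of magnitude.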

SE: How hard could it be is always the question, but as you say, You make certain assumptions at the start and uh whilst we all want application to be successful or useful, we can’t always predict the level of that success. And I, I think what’s interesting is that you came into WeTransfer with a very distinct frame of reference in your own career that I’d love us to unpack a little bit because you came from that most constrained environment. As Werner talked about the fact that we know with cloud, we’re kind of unconstrained developers. But there is a world where you have very finite amounts of memory and processor, etc. and that’s what you started in, so maybe tell us a bit about that cos I think that helped inform your mental models that you were able to use at WeTransfer.

DC: Yeah, for sure. My first jobs straight out of college were all in embedded systems and this was, I’m gonna date myself here, this is 99, 2000, I’m dealing with systems where there’s, 48K of RAM, no virtual memory, you have single threads, single core processors with no adjustable clock rates, so everything is very fixed. If you go over the budget you have for memory or CPU, there’s no safety net, that’s, that’s it. And so that, that forced, and I think kind of Werner talked about this a little bit before, it forces this mindset of how do you creatively solve within the constraints you have. There were products they shipped where I knew every byte of memory and how it was being used because I had to use every byte to try and, get performance over here and make those trade-offs between compute and space. So that was kind of the very first work that I did out of school for 4 years or so and then even after that, working on, desktop operating systems and working on Windows application development, you still have constraints and trade-offs. You can allocate as much memory as you want, but then, at some point you’re going to get into the page pool, the system slows down, you can run a more intense workload, but if you’re on a laptop, it’s going to get hot, battery life suffers. So there’s always these trade-offs and constraints that force you to think about how you’re using the resources that you have.

SE: And that, that clearly changes the way you approach the problem domain, and I, I guess from a frugal architect perspective really that it starts with the thought process, not so much the tools, doesn’t it? Like I think, I think Dan’s story is a really great example of how, how constrained you can be but still build stuff.

WV: And I also think that constraints — let’s say a true constraint can be a negative on you, but, to be able to drive creativity, we should, can also impose on ourselves sort of virtual constraints just as an exercise for your mind, and, and I think also as, as an engineer, one of the bigger things I think is that We have this tension between what the business wants and what we can deliver as technologists and so where we need to expose these kind of topics as constraints or what do we want to give to the business to make sure that they understand the decisions that they make instead of always just accepting oh the business wants this new feature, let’s go get it out as quickly as possible and not thinking about the cost. Now where, especially when things start to scale, where the business needs to understand, decisions that are being made, whether it’s with respect to performance, whether it’s with respect to reliability, actually come back as a cost. And so that’s kind of, I think it’s not just a matter of that we as engineers are more constrained and and enjoy sort of building things as cheaply as possible or as efficient as possible. It is also making sure that we give the business the tools to actually make the decisions for us, where we as engineers shouldn’t make the decisions, but the business should be the one making the decisions. We need to make clear what the constraints are for the business.

SE: Dan, I was gonna come to you on that one, is it’s, how do you explain to the business what the constraints are? I, I don’t think in your case they were really thinking about the meaningfulness one way or another of the integer choice for a counter, for example, that has a business outcome.

WV: Well, let, let’s, let’s start. The business actually is pretty good in, in sort of uh when we think about sort of regulatory constraints, kind of things you’re allowed to do and which you can’t do. And many of those things need to find a way back into the domain specific systems that you build. And so the business has known about constraints. If you really think about, say, a web page on Amazon: does the bestseller list absolutely need to be 4 9s available? Because if it needs to be 4 9s available, it comes at a certain cost. And so those kind of trade-offs, where reliability is probably one of the easiest ones to think about: does everything need to be close to 100% available? If you say all that to the business the first time, they’ll say, yes, of course, everything needs to be 100% available, always under every circumstance. And then you have to tell them, and that’s going to cost you this much, under these conditions. And so then maybe if you decompose and create smaller building blocks and reason differently about those building blocks, then suddenly the business can actually make decisions about uh under these failure conditions, that maybe, maybe that one can have a bit lower priority and I’m willing to spend less on it.
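
One way to make the availability trade-off concrete for a business audience is to translate each target into the downtime it permits per year. This is a minimal sketch of that arithmetic; the function name and the list of targets are illustrative assumptions, not something described in the episode.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return (1.0 - availability) * HOURS_PER_YEAR * 60


targets = [
    ("two nines", 0.99),
    ("three nines", 0.999),
    ("four nines", 0.9999),
]
for label, target in targets:
    minutes = allowed_downtime_minutes(target)
    print(f"{label:>11}: {minutes:8.1f} minutes of downtime per year")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why the question of whether a given page needs it is worth putting to the business.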

SE: What about from your perspective?

DC: Yeah, I was gonna say I think there’s also this view that the ability to, to spend to go solve the problem. I think it’s become the norm in cloud development in some ways rather than a tool that you can exercise should you need to. So to go back to your earlier point on frugality, there are certainly situations that that we reached where it was an option, like, do we need to go spend more to buy more capacity or go up in instance size, or can we make an investment on the engineering side to be more efficient or to reduce kind of this digital waste, the sort of unused capacity that we were consuming in some way. We hit this in situations where, the easiest thing would be to, consume more S3 storage, but as we were looking at that, we also found places where, it came to be sort of like an OE versus COGS discussion. We were looking and saying, hey, we could spend a little bit of time over here and clean up some of these cases where we’ve leaked storage over the years, that’s, 2 petabytes, 4 petabytes, and just kind of remove some of that waste. And so I, I do think that there’s definitely an aspect where it’s with the business, like, hey, do we need to go invest more money to get the outcomes that we want. But for me I found that those discussions got so much easier when we were also demonstrating that we were pursuing a reduction of waste, when we were trying to be frugal about the times that we use money and how we use money, cause people knew when I went to go ask for money, it was because like, hey like we actually really need it for this problem and we’ve pursued these other avenues to go clean up waste and reduce costs, so.

SE: Well and what, what did you find, like you’ve come into WeTransfer, you, you’ve got sort of fresh eyes and, As in all roles, you only get one chance to have fresh eyes and then you, then you’re part of it. But you’ve come from this embedded system background into a world of, highly distributed, high capacity, using lots of stuff. What did you discover from an architectural standpoint? Tell us a little bit about it and, and how did you start to apply some of that reasoning to it?

DC: Yeah, for sure, I, I think one thing that’s key to note, I joined in February of 2021 and I actually joined as an IC. My intention was to come in and just build for a year. I just wanted the experience of, meeting everyone, building with a team and like learning the customers and the products, that gave me a sense of just who everyone is and how are the systems created and how is engineering working, but it also put me in the team at a time where everything was scaling up because, February 2021, like usage of WeTransfer had gone up dramatically during 2020 and the pandemic. And so the, the company was growing, there was a lot of hiring going on, the business was growing, and so it’s sort of a time of change. To your point, like what I learned is that I think this is very common across, any company that’s gone through scale up, the problems that really had to be solved had been worked on and solved. And then all of the other problems that hadn’t yet become big enough problems yet were kind of left, and it’s intentional. We always focus and prioritize on the things that we have to go address, we can leave this kind of trail of other things that hopefully don’t catch up to us later on, but because the company had run in such a lean way for so long, because, it was always in this place where it’s really like a creative and a brand place and the technology was always kind of coming in second, it had run very lean and there were certain areas where the investment just hadn’t caught up with where it needed to be. And especially as adoption grew in the product and a lot of new customers were coming in and there was a lot of pressure to build, build new products and build new features. 
We really had accrued some debt that was actually causing some challenges and so when I moved into the management position in 2022, like there was a period of time over like two months where it was major security issue, significant reliability issues where the site was going down for hours and hours at a time. We also had a cost overrun that was going on where we were just leaking storage, on a month over month basis and couldn’t trace it down. So a lot of the, the debt of like, hey, where did we not have observability, where did we not have tests, where didn’t we have the reliability culture kind of all came due at the same time. I think the other part that happened as part of that is there was always this part of the company that was focused on what is our role in society, what is our role with people, like how do we impact the planet. The company became a B corporation in 2020 and so a lot of focus on people, planet, and how to do that and still be a profitable company. And so there was always this undertone of how can we be frugal in what we’re doing, but the challenge is that when you go look at an architecture and you build a system and you’re thinking from this, this perspective of how do I do this in the most efficient way, you have to go test the theory and see how it actually works in practice. You see how customers use it, you see how your theories and your ideas and your assumptions play out, and then you have to iterate and move from there. And so as we were going through this growth phase and we were rolling out systems like how we manage storage, we started to learn which things didn’t work the way that we expected and which systems where we thought we were doing the most efficient, frugal thing. 
It turned out we were actually generating a lot more complexity and a lot more problems and so, so certainly like I, I kind of came into this in a time of transition and I think that opened the door for Us to go figure out what’s the mindset we want to have at the company moving forward, how we want to approach, architecture moving forward.

WV: So how did you prioritize what to address first?

DC: In the security reliability cost overrun discussion, it was unfortunately, it was, it was security because the, the issue that we had was, it was existential to the business at that time, uh, so we had to take that one first and then we had to do reliability and then we had to do cost, which feels very icky, especially as we’re having a discussion about frugality, but it was one of those where again the, the money one is one that like it could kind of slush for, 4 or 6 weeks if we needed to while we were sorting out these kind of more existential problems. As we got to like why were costs growing on storage so much, it really highlighted just places where we didn’t have enough data, enough observability, we didn’t understand the patterns of how the storage was being consumed and the interrelationships between systems and what the failure points were, relatively simple issues like the idea would be we need to go delete a set of files after 7 days because, transfers expire and so the files you upload after 7 days, they kind of Self-destruct so that they’re no longer available. Just the amount of data that we had to delete was far beyond what any of the systems were designed to handle, so we were running out of memory, querying databases to go delete blocks and files. Those types of issues came up and just sort of inopportune timing, but yeah, it’s definitely security, then reliability and then, and then cost at that time, so.
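
The failure mode described here, running out of memory while querying for everything that needs deleting, is commonly avoided by working in bounded batches. The sketch below is an assumption-laden illustration, not WeTransfer's implementation: the `blocks` table, its columns, the batch size, and the `delete_from_storage` callback are all hypothetical, and SQLite stands in for the real database.

```python
import sqlite3


def delete_expired_blocks(conn, now, delete_from_storage, batch_size=1000):
    """Delete expired blocks in bounded batches instead of loading them all."""
    total = 0
    while True:
        rows = conn.execute(
            "SELECT key FROM blocks WHERE expires_at <= ? LIMIT ?",
            (now, batch_size),
        ).fetchall()
        if not rows:
            return total  # report progress so a silent failure is visible
        keys = [row[0] for row in rows]
        delete_from_storage(keys)  # stand-in for a bulk object-storage delete
        conn.executemany("DELETE FROM blocks WHERE key = ?", [(k,) for k in keys])
        conn.commit()
        total += len(keys)


# Usage with an in-memory database and a fake storage backend.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blocks (key TEXT PRIMARY KEY, expires_at INTEGER)")
conn.executemany(
    "INSERT INTO blocks VALUES (?, ?)",
    [(f"block-{i}", i % 10) for i in range(5000)],
)
deleted_batches = []
n = delete_expired_blocks(conn, now=4, delete_from_storage=deleted_batches.append)
print(n, "blocks deleted in", len(deleted_batches), "batches")
```

Memory use is bounded by the batch size rather than by the backlog, and the returned count gives a number that can be logged or alerted on.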

SE: And and that’s an interesting one where you talked about the fact that there were practices and processes in place. But In the case of sort of the, the cleaning up type situation, it was kind of failing silently, and this kind of ties into that world of observability and, and understanding all, all the assumptions you make. Help us unpack that a little bit more cos I think the thing that you, you guys have that’s fascinating is the just the sheer scale that you’re operating at and the fact that this stuff just has to work and running very lean, what that meant from an operational standpoint and then how you went to diagnose and solve that problem.

DC: Yeah, for sure, and so certainly like, as things scale up, you start to see new usage patterns, new behaviors emerging. We also had kind of a confluence of other factors, so we had implemented a new solution for how we manage storage pretty recently as the company was hiring certain people were seeing the company change and so there was a fair amount of attrition. And we ended up in this place in May, June of 2022 where we had a relatively new storage solution in place. Everyone who had actually worked on building that storage service on top of S3 had left the company, so we had a 100% turnover and then we were seeing on a month over month basis that our costs were starting to increase at a pretty steady clip and first month we looked at it and it’s like maybe that’s some seasonal stuff, maybe that’s the, usage is changing a little bit. 2nd month, hey, that’s, that’s a real problem, and we started to dig in and by the time we got to the 3rd month, we were talking, real money in terms of the, the impact of this. What had happened is that in the original design of our new storage service, there is this intention around being as efficient as possible with our use of S3, and so, as you upload files and a transfer, so you might be uploading 1000 photographs that you’ve taken. We recognize that some of those may then be sent in another transfer to someone else and then another transfer to someone else, so we’d end up with 3, 4 or 5 different copies of the same file being stored. So we implemented a deduplication checking layer where we would do reference counts to files and only store them once. In addition, there was an idea that because when you upload a file and you basically don’t upload it as a single stream, you would split it into these 10 megabyte chunks and try and run them in parallel, that we could potentially just store these chunks directly on S3 and then deduplicate across those. 
So that original assumption and intention, which was very much from this mindset about how we can be as efficient as possible with storage, what turned out to happen is that as we were scaling up and seeing more usage, the reference counting overhead and managing all of that became very intense and just the volume of 10 megabyte chunks as we also saw people sending larger and larger files became a huge challenge and so we were trying to, like the systems that we built were trying to stay on top of all of these new creations of files as people upload and then deleting them as a transfer expired after 7 days. And literally getting to a place where we would run out of memory querying for all of the chunks of files that we needed to go delete, and this was just a pattern that we’d get stuck in and then periodically system load would drop and we’d be able to delete some of them, but after 3 months we ended up in a place where it’s always a little bit embarrassing, but it was something like 2 billion blocks that we had leaked in storage that we thought should be deleted that we were still storing and paying for in S3. And effectively wasting and this is a, imagine a new team that’s just coming in trying to understand how the system works and how the thinking was and, you know, having to go push like a big red delete button on all of the storage

SE: because everyone’s super calm when it comes to deleting customer data, like it’s really not a problem.

DC: Totally, I mean what could go wrong, right? But there’s one particular dev after 4 months, on the job, like this was his task and sweating, right? Takes like 3 days to delete everything and kind of clean up databases, but so we had situations like that, and again it wasn’t that there was the wrong idea in place, like maybe a little bit more iteration on it, we could have seen where that would or wouldn’t work. But it was certainly a case where we didn’t have the observability to see that these tasks were crashing in the background, running out of memory, failing to delete. We didn’t have the data to tell us were these assumptions that were made in the original architecture for the storage solution, right? Like are we actually saving, like we measured later on and found that we saved 7% of our storage costs due to the file deduplication and something like 0.03% of the storage costs due to the block deduplication. And it made it really clear like, hey, like this, this is a lot of overhead for something that’s not saving a lot of actual storage and probably costing us more in terms of database because we had to bump up instance sizes and those types of things.
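
Reference-counted deduplication of the kind described above can be sketched in a few lines. This is a toy, in-memory illustration under stated assumptions: the `DedupStore` class, its method names, and the use of SHA-256 as the content key are hypothetical, and a dictionary stands in for S3. It also shows the measurement that was missing at first, namely how much storage deduplication actually saves.

```python
import hashlib
from collections import Counter


class DedupStore:
    """Toy content-addressed store: identical content is stored once."""

    def __init__(self):
        self.blobs = {}              # content hash -> bytes (stand-in for S3)
        self.refcounts = Counter()   # content hash -> number of references
        self.logical_bytes = 0       # bytes uploaded, counting duplicates

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blobs:
            self.blobs[key] = data
        self.refcounts[key] += 1
        self.logical_bytes += len(data)
        return key

    def release(self, key: str) -> None:
        """Drop one reference; delete the blob when nothing points at it."""
        self.logical_bytes -= len(self.blobs[key])
        self.refcounts[key] -= 1
        if self.refcounts[key] == 0:
            del self.refcounts[key]
            del self.blobs[key]

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())

    def savings(self) -> float:
        """Fraction of logical bytes not stored thanks to deduplication."""
        if self.logical_bytes == 0:
            return 0.0
        return 1.0 - self.stored_bytes() / self.logical_bytes


store = DedupStore()
photo = b"x" * 1000
keys = [store.put(photo) for _ in range(3)]   # same file in three transfers
other = store.put(b"y" * 1000)                # one unique file
print(f"stored {store.stored_bytes()} of {store.logical_bytes} bytes, "
      f"savings {store.savings():.0%}")
```

Every `put` and `release` touches the reference-count table, so the bookkeeping cost scales with the number of objects even when the measured savings are small, which is the trade-off the 7% and 0.03% figures exposed.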

SE: And just dev time and complexity and being able to understand the system, the juice wasn’t worth the squeeze.

DC: That’s right, and it, it’s one of those things where um like we kind of felt it both on COGS and OE, so. I guess to, to bubble up from that, for me in that moment in time, it just became really clear like, yeah, there’s the intention around like how do we build the most efficient system we can, but then there’s the maintenance about observing and learning from how that works in practice, seeing what the patterns are and then seeing how you then go iterate on that to adapt to what’s coming in. I think about it in the context of S3 too: when S3 launched, it was relatively straightforward, right? But now you look at it, there’s tiered storage, you have caching, like, it’s because as you look at the usage patterns built on top of this, you understand which capabilities you need to go meet those.

WV: So, well, I mean, the biggest example I think probably is when we launched, we launched with eventual consistency for a long time, long time. And we saw many of our customers building strongly consistent layers on top of it. Just because they, creating a bucket and then from another process trying to store in it, but the bucket wasn’t there yet. Sort of was not a model that many of our customers could, could really deal with. And yeah, at some moments we decided to drop that inside S3, but there was a major overhaul of all the different pieces in there because it’s not just the fact that you need to add the functionality, you need to prove to yourself that you have covered all the edge cases. And if by that time you’re at 250 microservices, there’s a lot of edge cases to cover.

DC: Oh yeah.

SE: Absolutely, absolutely, I, I think, I think one of the interesting things here too is that we talk about frugality, uh, but the, the flip side of that is, is the speed and the rapidity that you can operate in, when you’re not applying the frugal lens at the right time, so you can, Deploy something, see if it has uh value, use, etc. and then as it starts to grow, you can start looking at how to do that, that better, and I think, as, as software engineers are often told, don’t optimize too early. It’s so tempting to do it, and I think your discussion about the block deduplication is a great example of it, like, It feels like the right thing to do. It feels like the right design decision, etc. but the data tells us a different story, and I think what’s interesting as well here is that the cost that you were seeing not match your business growth was a proxy for some degree of, of waste or, or lack of attentiveness to what was going on there. Which is again really useful because in the old world, you didn’t know, like someone bought you a storage array, you just happened to fill it up quicker than you thought you were going to. So you just thought, well, my capacity planning’s wrong, not maybe I’m not cleaning up what I should be cleaning up. There’s also assumptions that are well placed, and I think this is also relevant in terms of sort of the evolution of our own services that we go through, where you desire something, something is, is the case, and then the case changes, and, and I think one of the great things about the work that you and the team did, Dan, is, is you dove deep into the code, like, into the code, like the lines that make the difference. And you found stuff, and you found stuff that made sense at clearly at the time that didn’t hold true anymore. Tell us a little bit about, there’s a story I think you share about these janitor type jobs and how long they run for, etc. but also when they get run and what impact that had on the cost side of things.

DC: Yeah, for sure, I think certainly some of the things that you do that impact costs are like major large architectural decisions and some of them are subtle smaller things and over time the costs kind of add up and catch up with you. Early in the days of WeTransfer, there was this strong concern about losing customer data, like you never want to lose someone’s files accidentally, like, and we would get, customer complaints, my transfer expired just before I was going to download it, how can you get these files back for me, like those, those things actually happened and Certainly as people were learning like, hey, like the when you get the email that there’s a transfer there that starts a seven-day timer you really need to download in that window, those type of situations happened and so early in the days of WeTransfer there was this idea of, well, maybe we’ll hold on to the files. A little bit longer than 7 days and I had heard some rumor of this and I was like, well, that’s interesting, but you know, whatever.

SE: It’s almost like an apocryphal story.

DC: Yeah, yeah, it’s one of those things, you’re learning like, where did the name come from or like what it was back before. So, so as, as we’re going and building, more observability and just understanding how the systems work and a new team is ramping up. One of the engineering managers working on storage started to track like how long does it take to actually go delete a transfer after it’s been sent and what we found is, most transfers the expiration is set for 7 days, they should delete after 7 days and he was finding that, most files lived on the platform, 8.5 to 9 days. It didn’t make sense at the surface like, hey, like this should be gone after 7 days, ready to go. But it turns out that that legend of like holding onto files afterwards had made it down into, a line of code to basically hold files for 36 hours after they expired in the in the storage layer, and this was, independent of the products buried way down deep. No one really knew about it, the team had changed and And when we did the math on it at the scale that we were operating at, this is something like 15% of our storage costs, and it was actually pretty substantial just because, instead of something being on S3 for 7 days, it was now there are 8.5, 9 days like stretching out. And the other thing that had happened is that no one in any of the product teams was aware that we were doing this. No one in customer support knew that. So even if a customer had an issue, they were still being told like, oh, I’m sorry, your files are gone, even though they were still somehow there, maybe recoverable. 
And so it’s one of those things where every engineering team has had this, where you kind of get to the root cause of something and you kind of scratch your head and look at your shoes and everyone’s like, well, and then you go fix it and move on. But we really did have that moment where it was a substantial change in terms of our overall costs and the steady-state storage that we had, and it really just came down to an idea or an assumption or a scenario that was started long ago and had been lost track of, no longer used.
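
[Editor's note: the arithmetic behind that figure can be sketched roughly as follows. This is a back-of-the-envelope illustration, not WeTransfer's actual calculation. It assumes a constant upload rate, that every transfer uses the 7-day expiry, and that storage cost scales with how long bytes stay stored; the function name is hypothetical.]

```python
def grace_period_share(nominal_days: float, grace_hours: float) -> float:
    """Fraction of steady-state stored bytes that exist only because of the grace period.

    Assumes a constant ingest rate, so bytes at rest are proportional to
    how long each file lives (ingest rate x lifetime).
    """
    grace_days = grace_hours / 24
    return grace_days / (nominal_days + grace_days)


# 7-day expiry plus the 36-hour hold mentioned in the episode
share = grace_period_share(nominal_days=7, grace_hours=36)
print(f"{share:.1%} of transfer bytes at rest come from the hold")  # 17.6%
```

[Under these assumptions the hold accounts for a bit under a fifth of the transfer bytes at rest, which is in the same range as the roughly 15% of storage costs quoted, given that not all storage spend is transfer data.]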

SE: And it was well-intentioned. I mean, you’ve got one line of code that did a whole lot of good for the business, for customers, and was no doubt the right call at the time. But it had to be revisited.

DC: Yeah, and I think that’s really an important part. I think that as engineers we’re all wired to try and do the right thing and build the best things, and so I look at all these decisions and none of it’s incriminating. Basically the team was doing the best possible thing at the time to solve the customer problem, to enable the business to grow, to work through issues that were happening, and it’s really about how the landscape changes around that over the years, what happens as things start to scale up. We talked about not optimizing too early, and part of that is that maybe you make the wrong choices; you need to learn which things actually need that energy and effort and where to go focus that. But it’s also just the reality that as the problem changes over time, you may find certain things phase in, certain things phase out, and you need to adapt. So in this case, the understanding of WeTransfer as a product had changed. No one really hit this issue of, I need my files back. The value of that had deteriorated, had gone away, but no one was aware that it was still there.

SE: And obviously a big part of this journey for yourself at WeTransfer was that observability piece. The classic question of if you did it over again, or if you’re building something new: how would you think about instrumentation? How would you think about, again, not over-instrumenting, but what’s the quote unquote right amount to go with so that you can detect some of these things maybe a little bit earlier, or just as part of day-to-day operation?

DC: Yeah, so the fascinating part with that is we had a whole observability stack in 2017, 2018; there was a whole solution in place. It’s just that the adoption within the teams was relatively low, and I think it comes down to how much do you need to be watching, how much needs to be in place, what’s the minimal level of sufficiency. Among the set of things that you don’t solve as a business because you’re so focused on whatever your highest priority is, there’s some minimum level of observability that you have to have in place, otherwise you’ll lose track of just how your service operates. I think that we had that minimum level at a platform level, but in individual service teams we were lacking that. And so for me, one of the big pushes that we made over the past few years was to establish what that minimum bar is: if you have a service in production, or you’re building a new feature, of course we have to be able to see this type of data, we have to have these metrics in place, we need to have this level of observability. We also went and found a few scenarios where we would go make kind of the best-in-class example. We were having issues where, for super large file transfers, sometimes the reliability would drop off, and we realized that’s such a critical part of our customer base that we were going to go get an A on observability for that, we’re going to go all in. And the value of that is we now had a system where the minimum bar was clear for everyone, but we also had a very clear understanding of what it looks like to get an A in observability for a customer scenario or problem space or a set of services. I think those two powers, the minimum and then the exceptional, and being able to bring those in from the start, that’s important, and like I said, I think engineers are all wired to want to do the right things.
If you have an environment where you say, here’s the requirement and here’s what like, kind of the A looks like, I think people will gravitate to what the right balance is between those for the, the scenarios they’re working on.

WV: It’s interesting to see how you can do this, let’s say, for new services and new products that you’re building, but how did you go about, let’s say, refactoring or adding observability to the existing code base?

DC: Yeah, I think about this one a lot, because certainly if you’re starting something brand new, starting a new company, a new product, it’s a completely different landscape than it was 10, 15 years ago. For us, we started with centralizing on a single tooling solution. We had kind of a makeshift observability stack, and we moved over to Datadog. We built a reliability function; the company didn’t have an SRE team until 2022. So we established a reliability team that owned the tooling, that would be champions, that would assist different teams and help them ramp up. We identified a few different teams to go demonstrate how to build high quality SLIs and SLOs, how to build dashboards. It was a fairly substantial effort, but it really came down to a combination of a single tooling solution that everyone could go deep on and felt there was durable value in really understanding; a center of excellence within the organization that would assist with education, training, ramping up; and then the one or two halo example teams that would get that kind of best in class and demonstrate what was achievable and what the value of that would be. I’ve been in 3 different situations where, just kind of broadly, I had engineering teams that weren’t deep on the use of data to inform how they operated. And in all three cases, the most important thing was having the tooling, the education, and then that halo effect where someone was like, wow, I could have that value in my system. Oh my goodness, I need to go make this investment. I could be seeing this type of information or understanding what failures are happening. That was really the shift that we started in kind of mid 2022.

SE: Mm, mm. It’s interesting, that mindset shift, and maybe Werner, can you talk a bit more about, I guess, culture and developer culture? It’s easy to be unkind as developers and go, well, why are you wasting that, or what a boneheaded decision it was to have this value set this way, etc., and certainly younger Simon behaved that way, cos younger Simon wasn’t as nice as today’s Simon, cos younger Simon hadn’t learnt the hard lessons that he’s since gone through. But there’s a sense of, I guess, humility and celebration of optimization and learning, etc. How do people come to see that finding and discovering and removing waste is not bad, it’s actually good? Help us decode how folks should be thinking about that.

WV: We all understand the evolution of systems, yeah, and you mentioned premature optimization. It’s often the case also that you have no idea how your customers are going to use your product or your feature. And until you know that, it’s very hard to build a cycle around that. You may have a first version that is extremely inefficient, for example because you’re experimenting with a new UI and you’re using Ruby on Rails for that, which is not necessarily the most efficient or the most scalable platform, but it is a great platform to experiment on. After a while, when you know how your customers are going to use your service, you may start to transition to a more scalable or a more efficient approach. But until you know how your customers are using your systems, there is no fault in any of that. And often I think it is actually more important to get things in the hands of your customers really quickly than to have already optimized to the minimum up front, because often you will have made the wrong decisions there. I do think also there needs to be this culture of: mistakes are OK, they happen; what’s more important is how you learn from your mistakes. Do mistakes become a badge of honor, or are they going to hamper your career within the company? Now, if it’s going to hamper your career within the company, you’re probably going to shut up about it. But if it’s a badge of honor, because there’s a learning associated with it that everybody else can learn from, that I think is the way to go. Actually, one of our customers in Italy now has an internal TV station, and they have this program there that is called My Biggest Failure, where engineers and program managers come on to talk about the things that went wrong, whether it’s at the public level or at an engineering level. And everybody wants to go on that program, because everybody has a story to tell that everybody can learn from. So having a no-blame culture is crucial, I think, in any engineering organization. After all, we’re building new products; we’re not doing the same thing over and over again. We’re highly creative people as engineers, and as such, sometimes, especially if you build new things, you don’t know how your customers are going to use them.

SE: So true. Dan, how did you find that pervading through uh WeTransfer in terms of that, that concept of yes there was waste, it’s OK, we’re, we’ve, we’ve done good here.

DC: Yeah, I think the way that I look at it, it’s very similar in terms of, thinking about no blame culture, and I really think about how a healthy reliability function works. If you’re operating at scale, like you have reliability teams, you have postmortems, everyone’s trying to learn from the issues that happened and figure out how do they build and prevent against this in the future and uh you get this mindset that’s really not about blame or, finger pointing, but it’s really about how do we learn and go improve as engineers based on that. And I think for costs like initially, I kind of joke about it, but people were a little like, oh my goodness, I can’t believe we, we wasted this much money and certainly like the company was profitable, it wasn’t, disaster making or anything like that, but it was definitely something that people were, were conscious about. What I realized were really like two things, so, back in the very beginning we talked about how adoption of the cloud is more of a business discussion and it comes to a conversation with your finance department. It turns out like my finance department was really excited that we cared about costs and that we were making investments and how to reduce and manage those. So as much as individual engineers would be like, oh my goodness, we did this thing and it and it cost 1000 over the past three months or whatever, the fact that we found it and then we talked about it and we fixed it and we were open about that actually bought a lot of credibility and trust in those relationships. So I, I think that was the first part of it. The second part of it is within the engineering team, especially because frugality is, this is not about like, setting your priorities for the year, it’s about what’s the culture you want to create. 
As these things were found and we made progress, people realized, hey, we’re saving these costs but we’re also consuming less, and when we consume less that reduces our environmental impact and improves our sustainability. It actually became this very empowering thing where engineers would say, hey, I could do this thing that actually directly impacts this goal that I care about. A lot of the engineering team came to WeTransfer specifically because it’s a B corporation, specifically because they loved the role of the company in the world and the things that we did for creators, and that we were conscious of the environment. And all of a sudden, some of the efforts we were making made that very tractable and directly impactful for them in their day-to-day job as an engineer. We really got quickly out of the, oh my goodness, this was a mistake. I think this happened in the first 6 months of finding these issues, and then it got into a place where people were like, hey, I think I found some waste over here where we’re storing a little bit that we don’t need to, or, hey, I think I can make this system a little bit more efficient. If you think about the spectrum, on one end you have growth at all costs and, don’t worry about it, and on the other you have deep cost optimization, like we’re not growing and we’re just gonna try and drive everything down. We really found ourselves spending a lot more time trying to strike that balance in the middle: how do we reduce waste, and where do those boundaries exist so we don’t slide into one camp or the other. I found that we got past that concern about how this is going to be perceived very quickly and it got into a very empowering place.

WV: Actually, there’s another interesting story, where one of our larger enterprise customers, after the whole Frugal Architect thing, now has not bug hunts but cost hunts, where basically gamification is being used for: let’s go through our code base and see where we’re wasting money. And not from the idea that that was bad in the past. I mean, the systems all work and do their job perfectly fine, but can we look at it with a fresh set of eyes, where there is a set of prizes at the end for who saves the most money? It’s just a realization also that we’ve gone through a phase where growth and innovation were more important than really keeping the purse closed. And it’s not about closing the purse, it’s just making sure that you don’t waste anything. I think that’s the realization: that we’ve done some of that in the past for different reasons, mostly for speed of execution, and now we’re just taking a step back and taking a look at it.

SE: That’s the beauty of the elasticity of the cloud, and, and I, and I guess um as we come to the end of our time here, Dan, when, when you’re thinking about sort of the balance between short-term fixes and long-term solutions when thinking about sort of architecture from a frugal lens, how do you balance? Do you have a, a percentage ratio? Is it time of the business, like how do you like to sort of rationalize about that?

DC: I think that’s such an interesting challenge. In the climate that we were in, we were heavily focused on building new products and features. It made the pace at which we could go undertake large architectural investments a little bit slower than I would have liked, just to be candid. But we were trying to take on kind of 1 to 2 major investments a year. In 2022, we moved to tiered storage and we started using Cluster Autoscaler, and those were huge lifts in terms of reducing waste and our consumption. This year we were working to go deprecate image previewing services, moving over to on-demand previewing through Lambda. And we were also completely removing all of the last traces of block storage and reference counting that we were doing, and simplifying the storage solution. So I think that we were trying to do one or two major bets, just because we didn’t want to exhaust all the dependent teams and go through these major rearchitectures all the time, and then to pepper that, and I really love the cost hunting kind of exercise, with opportunities we found here and there that were small ones, not huge investments of time, not huge investments of energy, but, hey, one person, one week can go make a difference that’s meaningful in some way. With some of those small ones, someone would spend a week on something and it was enough to cover hiring another developer; it was substantial enough that we could ground it in those terms. That balance, where for any given year we had one big push, maybe two big pushes if we had the capacity, and then a lot of smaller ones, we found was starting to work really well for us.

SE: That’s really interesting. Werner, are you seeing that as a, as a trend amongst customers, that ratio of, of big to little, or how do you see it play out, or again, is it, is it a factor of the life cycle of the organization?

WV: Well, I think this works differently for every company, of course, but indeed, especially for a younger business or a smaller business like WeTransfer, your engineering resources are limited, and where you apply them, their energy is limited as well, indeed when going through major changes in the background that do not have anything to do with building new features or new products. We’ve seen the same thing in the early days when I joined Amazon. Observability in the early days of Amazon wasn’t that great either. And so we established a whole new culture around how to measure, and what measurement means, also in the understanding of not only engineers but everybody that’s looking at the numbers. P50 latency of a web page doesn’t mean anything. It means half the customers are getting a worse experience somewhere. Just the whole culture around it. We also did a whole year on removing all single points of failure. But that was next to the teams also working on building new features and building new services and things like that. We didn’t do all of those things at the same time. We also did a whole year on efficiency, and that went nowhere, mostly because Amazon engineers are very much customer focused. They love building things that are customer facing. And efficiency, definitely in those days, was much more a bottom-line kind of thing. I mean, after all, this was retail; margins are razor thin, and any impact on the bottom line that we had with capacity was immediately hurting the business. But nobody could get really enthusiastic about working on bottom-line kind of things. And I think it wasn’t until later on that we actually got to a point where we were much better at that, thinking much more about decomposition, smaller building blocks, tiers, what cost is for which, and things like that.
But it wasn’t until we got the visibility into that and the architecture to go with it that we could do these kinds of things. We would have a search service, we had 32 different search services or something like that, where some of them were twice the cost of another service, with nobody really knowing exactly why. Oh yeah, that was actually still a 32-bit box lying around somewhere, and moving it to a 64-bit box actually reduced the cost by, was it 50% or more? Things like that, simple things, and there are a lot of these kinds of simple things. And next to that, yeah, there are big projects, but you can’t overload everybody with big projects, because you also need to make sure that you complete them to the end. I mean, starting on removing all single points of failure doesn’t help anything if you stop halfway through.
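
[Editor's note: the point about P50 latency can be illustrated with a small sketch. The latency samples below are invented for illustration and the helper is a simple nearest-rank percentile, not any particular monitoring tool's definition.]

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical page-load latencies in milliseconds for 100 requests
latencies_ms = [80] * 50 + [120] * 40 + [900] * 9 + [4000]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

# Share of requests that were slower than the median
slower_than_p50 = sum(1 for x in latencies_ms if x > p50) / len(latencies_ms)

print(p50, p99, slower_than_p50)
```

[In this made-up sample the median looks healthy at 80 ms, yet half of the requests were slower than that and the slowest tail is more than ten times worse, which is why a P50 on its own says little about what customers experience.]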

SE: Yeah, we’ve, we’ve removed half the single points of failure.

WV: So for example, we introduced game days where we would take out, or pretend to take out, a data center. But yeah, those things are big events and big projects that you cannot necessarily always have in the way of new feature development or things like that. But it’s crucial enough to the business to get all noses in the same direction.

SE: Dan, thanks so much for spending so much time with us, really, uh, diving under the covers and giving us a different perspective on, on frugality and how it can be applied. We really appreciate it.

DC: Uh, thank you, it’s great.

SE: And Werner, it’s always a pleasure to sit together and hear from our customers. It’s a, it’s a fun thing. We have customers all around the world, so we get to experience different perspectives.

WV: Thank you. Thanks, Dan for uh for talking to us, it’s great stories.

SE: And of course we do love to get your feedback. That’s our own version of observability. awspodcast@amazon.com is the place to do it. And until next time, keep on building.
