If you weren't living under a rock lately, you probably already know that this Tuesday AWS had a major outage in its us-east-1 region.
As usual, at least in the past decade, such downtime has massive repercussions on the modern internet as we know it. A bunch of popular websites and services experienced severe issues for hours, including – but not limited to – Netflix, Snapchat, Reddit, Venmo, DoorDash, banks like Lloyds, games like Fortnite and Roblox, not to mention several IoT devices[^1]. Platforms like Vercel, built on top of AWS infrastructure, had issues as well.
If you've ever done my job for more than 2 weeks, you probably know how much DNS can fuck up everything. Memes aside, it's quite true that a lot of infrastructure issues can be caused by DNS. Even with ISPs, it often happens that the actual fiber and infrastructure are working, but you can't browse the internet just because their DNS servers are down[^2].
Now, what's really bad about DNS is that it's particularly difficult to debug in complicated infrastructures, and once you fix it, it might take a while to see the actual resolution take place due to TTLs, caches, and more broadly the time it takes for DNS changes to propagate. So I can't really complain or be harsh with the AWS folks for the half a day it took them to actually fix the thing.
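If you want to see those TTLs for yourself, `dig` makes them visible – a quick sketch, assuming you have it installed (the domain here is just an example):

```shell
# The second column of the answer is the remaining TTL in seconds;
# run the query twice against a caching resolver and watch it count down.
dig +noall +answer example.com A

# Compare against a public resolver directly:
dig +noall +answer @1.1.1.1 example.com A
```

Until that counter expires on every resolver sitting between clients and the authoritative servers, a DNS fix simply isn't visible to everyone – which is why "we fixed it" and "it works again for you" can be hours apart.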
What I can complain about, though, is the fact that it was not DNS by itself, but rather a race condition in the AWS software managing the DNS that caused the issue, and the absurd amount of cascading failures such a thing caused, within AWS itself and everything built on top of it. Does it really make sense that something causing issues mainly with a single AWS product – specifically DynamoDB – can almost take down an entire region – which, we should all remember, is made up of several datacenters – and almost everything running in that region? Not really. So why did all of this happen?
Because of abstractions. Abstractions over abstractions over abstractions. Definitely too many abstractions for my taste. You see, the thing about AWS – and big cloud providers in general, like GCP and Azure – is that they use their own products to run other products and – as a consequence – your workload as a customer. While this might make sense in a company like Amazon for a bunch of reasons, at the same time, if for any reason, at any point in time, any of the abstraction layers has an issue, the blast radius of cascading failures becomes massive. Which is quite absurd: the compute hardware, the fiber, and all the infrastructure are there, but they cannot be used because the software layer above all of that is broken[^3].
Even more absurd: since AWS hosts ~30% of the main internet we all use, the blast radius can now go completely out of control. How did we get to this point?
In fact, the internet was never designed to be centralised into a few hops. It was designed to be a network of computers, and nobody ever said something like
> but those computers are not all the same; you see, we have some computers that deliver content, and some computers that consume content. We'll call the latter clients and the former servers, and we're gonna concentrate all the servers into 3 spots or companies.
But that's more or less where we ended up. And I find it particularly frustrating nowadays, given the amount of compute and connectivity the vast majority of people on the planet now have access to for such a small price. The CPU and RAM your laptop has right now are more than sufficient to serve probably 90% of all the web shit we have out there. Your home internet connection probably has enough bandwidth to serve hundreds of people simultaneously.
Does that mean you can serve millions of requests per second from your home server? Probably not. And even if you could, it's probably not a good idea. But here we're talking about a totally different scale, and how many applications out there actually need it? Do you really need thousands of lines of Infrastructure as Code from day zero to serve your NextJS application? Do we really need the amount of complexity Amazon reached to justify the existence of AWS in all of our software?
When, exactly, did launching a VPS and running a few commands in a terminal become harder than using AWS?
And here we come to the influencer part. Theo Browne made one of his videos about this, and – oh boy – I can't really quantify the amount of silly things I had to hear.
Now, I've written about Theo before, and not in a particularly positive way. I don't really like reacting to other people's content, but this time it's really, really hard for me not to. Before watching the video I was like "bro, that's my shit, so you'd better be cool". But he wasn't. And even if nobody reads this, I feel someone has to speak up here, because the guy has ~500k followers, and the amount of people he can influence is huge.
The whole video narrative is set as a reaction to a tweet from Elizabeth Warren, a US senator from Massachusetts, who suggested breaking up AWS as it's too big. And this is the only thing I will agree with Theo on[^4]: it's fucking stupid.
And while this was quite a unique opportunity to inform such a vast audience, spreading knowledge and explaining how things work and break, he instead chose to discredit and insult anybody using a different set of technologies from the one he likes – something that happens a bit too often for my taste. The result? An absurd amount of nonsensical, dumb and useless takes from him, which might be even worse than a stupid tweet. And it's worse because all of this is intentional: Elizabeth Warren simply has no fucking idea what she's talking about, while Theo should know better.
You can immediately tell when the nonsense starts:
> But that leads us to a question. Does using the same servers as Amazon increase or decrease risk?
which is a very bad and silly question to "ask". It clearly shows a complete lack of knowledge about risk assessment. That's not how risk works. First of all, what risk? Downtime is never a risk per se; the risk should be calculated on the consequences of being down for a certain amount of time. And at that point you will have several different risks: economic, reputational, etc. Using provider A or provider B – unless one of them is actually some random dude's basement – won't change a damn thing about risk. Using the biggest provider out there, or the most famous one, won't decrease any risk your company might incur in case of downtime of the infrastructure you rely on.
But he goes all in for it, of course:
> The fact that people are currently unironically saying that it is riskier to use Amazon servers than it is to use some random company hosting one-off servers [...] It just no. It makes no sense. If this was the obvious correct path, you would have companies like Netflix doing it. Why would Netflix, a multi-billion dollar company, do this wrong? If you genuinely believe it is easier and safer to host VPSs, why is it only random hobbyists and people who have a service with a 100 users talking about it ever?
> I've never seen real production apps talking about this outside of a little bit of stuff that's slowly dying from the DHH camp [...] It's just kind of silly that none of the people building real software are building this way. I've never seen someone proudly bragging that all of their stuff is on Hetzner or is on some random modern VPS solution that doesn't have endless problems or zero users. Like it's just always the case [...] I just I cannot take anyone seriously who is sitting here saying that AWS, GCP, and Azure are bad bets because you could just use a server. None of these people are building real software.
The amount of ignorance – served with that special arrogance topping, of course – in these few sentences is just unbearable. There are very few things in this world that can sustain the words always and never. So I put a massive red flag on whoever has such takes, whoever expresses that level of certainty about anything. There's not a single obvious correct path that covers every possible service out there, not for software, nor for infrastructure.
There are plenty of cases in which using a single VPS or managing several racks of servers makes perfect sense. Cases in which using AWS, GCP, Azure or whatever wrapper on top of them will be the worse option: cost-wise, operationally, compliance-wise. There are 5 companies like Netflix out there. You should not design your infrastructure based on them, as you will never, ever have the same requirements, needs, number of customers and capital.
And let me provide some examples, just from my personal experience – which I guess Theo himself wouldn't give a shit about, given his consideration for DHH's software, but (guess what?) I don't care.
I've been a Hetzner customer since 2006. Yes, you got that right. Almost 20 years. I was 16 when I got my first server with them. You can take a look at their website from that time and see that the prices were amazing even then. I've hosted a bunch of stuff on Hetzner servers[^5], and not just personal stuff.
At the company where I had my first proper job[^6] – between 2015 and 2018 – we ran everything on Hetzner servers.
> Like, oh, cool. So, you're going to go reinvent everything yourself. Let's see how that goes for you. Let me know when you can actually work on your software again.
We did. We built real software – a fucking CRM is not a joke – and we had tens of thousands of customers. We were in the top 20 of the business category of the fucking Apple App Store. To be honest, that feels more real than a web chat interface acting as a wrapper around some API calls to some LLMs[^7].
Up until our exit in 2018, we had 1 issue with our servers. 1 disk broke. We got it replaced in 15 minutes, with no issues at all in the meantime – because, you know, RAID exists. We had 0 downtime caused by infrastructure in 3 fucking years, running a bunch of servers ourselves. I can list a whole bunch of issues I've had on AWS in a shorter period of time – including downtime.
And why did we do that? First of all: it was fun. Developing the software wasn't just about shipping code somewhere, but also about knowing how to run that code in production. Many of the good decisions we made in designing the software came from knowing the infrastructure and how the code would perform on it. There was no one else to blame if the thing was slow! But also, we were a small company, and the amount of investment in Italy is nuts compared to the US. So with the money we saved by not running on AWS, we did a bunch of stuff, including hiring talented people. It totally made sense for us in that specific context.
But we've also run stuff on Hetzner at almost every other company I've worked with. Sentry is one big exception, but I can assure you we have some workloads that would definitely make more sense on bare metal servers than on GCP, where they run today. There are very few cases in which we can actually take advantage of the scaling of providers like GCP and AWS. Running directly on our own servers would simplify so many things.
So when Theo says:
> This is not a time for us to be like, "Well, maybe we should move to servers." No, that's just fucking delusion. It's cope. It's people who don't know what they're talking about. It's a bunch of web devs hosting shitty PHP that pretend they know how services work. I'm sorry. just not the real world.
I would rather suggest he's the one who doesn't know what he's talking about.
This is my main argument. If you spend the vast majority of your time streaming, writing Typescript, tabbing on Cursor and promoting solutions that abstract the infrastructure away, you shouldn't spit out your opinions on infrastructure to half a million people, especially if you sell those opinions as the only truth about the world – and to those ready to throw the freedom-of-speech card at me: shut the fuck up, you're being moronic.
Because to anyone who actually does work on this shit, everything that comes after just sounds silly.
> Sure, you can sit here and shit on me all you want and say, "Well, Theo, you could put T3 chat on a VPS", a really big one, but yeah, maybe. And the moment we have slightly too many users, we're screwed or we have to start putting a Kubernetes layer in front to distribute it. Fun.
Yes, you could, and yes, it would be fun. You can literally launch a Kubernetes cluster with one curl command. You can provision a PostgreSQL instance, or even a cluster, with a single helm command. It's not that hard! I'm pretty sure you already do that locally with Docker for your dev environment. And it's just a chat: how much compute would you need to serve even thousands of people? It's all I/O where you wait for Anthropic and OpenAI API calls to send data back, c'mon.
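To give an idea of what "one curl command" actually means, here's a sketch – assuming a fresh Linux box, the k3s distribution, and the Bitnami PostgreSQL chart (other distributions and charts work just as well):

```shell
# k3s is a single-binary Kubernetes distribution; this is the whole install
curl -sfL https://get.k3s.io | sh -

# a PostgreSQL instance via Helm (Bitnami chart, one common choice)
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install my-postgres bitnami/postgresql

# or a primary + replicas setup, still a single command
helm install my-postgres bitnami/postgresql --set architecture=replication
```

That's roughly the entire gap between "a VPS" and "a small cluster with a replicated database".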
We should be hoping for more people to do this, not the opposite.
> Personally, I think it's pretty fucking cool that some random vibe coding kid has access to the exact same infrastructure as Netflix [...] it's pretty cool that anybody can use the exact same infra as the biggest companies in the fucking world. That I have the same level of reliability, of scalability [...] as companies like Amazon do
Why is that cool? Wouldn't it be cooler to have kids actually learning how this shit works, forming their experience not on marketing BS but on knowing what it takes to run their code? We should all be pushing for more knowledge to spread, not cheering for a single company to run the whole internet while glorifying capitalism – also because that's the worst possible implementation of capitalism.
And, once again, these takes scream ignorance all the way, because that's not how you should think about reliability and scalability. Using AWS doesn't magically make your code reliable. Nor does it make it scalable. As a matter of fact, it will probably push you in the opposite direction: when you don't have to think about infrastructure at all, chances are higher that your code will be less performant, more failure-prone, and less efficient. You achieve reliability and scalability with good software, not by using a specific provider for the infrastructure you run that software on.
Claiming that you will experience downtime no matter which provider you choose is purely moronic:
> even a company like Cloudflare, which is basically essential [...] if you're hosting on a VPS, you fucking need DDOS protection [...] if you have [that] as the layer in front, there's a very good chance that layer is going to go down at some point because it has in the past due to the fact that they're using GCP still for the storage for things like KV
because, guess what, during the specific incident you're referring to, all of my shit running on servers behind Cloudflare was working perfectly. You would know that if you were able, for a single moment, to learn how these things are designed: the KV storage in Cloudflare plays zero role in their proxy layer. Cloudflare Workers were affected, but guess what: that's another fucking abstraction layer you don't necessarily need to use!
But the next take is the most absurd of all:
> If anything, I would argue that it is incredibly impressive that there is an outage that can occur at the DNS level that only hits and affects one region with AWS because this region has multiple data centers. If one data center fails, things happen. If they all fail, worse things happen. But if you have DNS routing between those things and that fails, but it only fails in East one, that's almost a fucking miracle. The fact that only one piece of AWS went down for this is genuinely impressive.
What the fuck are you talking about? It is everything but impressive. You could've waited for the AWS article I shared above to come out before recording this video, but of course you didn't – you had to keep your view count high, didn't you? Yes, you can say it's impressive that a single race condition in a piece of software was able to tear down an entire fucking region composed of multiple availability zones and several datacenters, but in a bad way. Especially given that they designed one component to be zonal, but the other parts of the system to be regional. A better – and proper – design would've been zonal in all its components: if they had done that, this outage would've been way less impactful, as it would have affected just a single zone. The fact that other regions were fine is... fucking normal to anybody who designs and works with infrastructure. Like, seriously, what are you talking about?
And amid all this puzzle of dumb takes, you also have the arrogance to talk shit about someone like DHH?
That's it. You might have whatever opinion you want about the person, what he does, what he says and what he writes, but in the specific context of his decision to "leave the cloud", he was absolutely right. And that's true from the cost perspective alone.
Again, this doesn't mean everybody out there should do the same; but it doesn't mean DHH was wrong or stupid either. He did a ton of evaluation before making that decision, and within his own context it made perfect sense, as the promises of AWS were broken:
The truth is, AWS is not designed for the vast majority of services out there. It is designed to run Amazon. The vast majority of workloads out there won't ever need the scaling AWS can provide – and, again, most of that software is not designed to scale anyway, regardless of the amount of infrastructure you can spawn. The vast majority of startups out there will just spend a shit ton of money without any particular benefit for the entirety of their lifespan. And if and when you reach that stage, you will probably be sipping a mojito looking at a tropical landscape, having already sold the thing to someone else.
You won't necessarily incur downtime running your own infrastructure:
> but if your service is not vulnerable to something like this, your service isn't real or you're just wrong. Period. Point blank. End of story. Everyone is vulnerable to something in their chain failing or they don't have a real fucking chain.
because, guess what, you don't necessarily need a hundred-piece chain to operate your software. If you run your own hardware, the chain is composed of just electricity and fiber availability. So yes, if you actually do proper risk assessment, the chances of incurring disruptive events are lower when your dependency chain is shorter. And yes, that means in specific cases it's better to run off AWS. Just think about this little fact: it's so much easier to hot-swap a disk in a server than to deal with network storage errors on a cloud provider.
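And that hot-swap really is just a handful of commands – a sketch assuming a Linux software RAID1 array at /dev/md0, with hypothetical device names (/dev/sda healthy, /dev/sdb failing):

```shell
# mark the dying disk as failed and pull it out of the array
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md0 --remove /dev/sdb1

# physically swap the disk, then copy the partition layout from the healthy one
sfdisk -d /dev/sda | sfdisk /dev/sdb

# re-add the new disk; the array rebuilds in the background while serving I/O
mdadm --manage /dev/md0 --add /dev/sdb1
cat /proc/mdstat   # watch the rebuild progress
```

The service keeps running the whole time, which is exactly what happened with that one broken disk back then.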
And just because you've spent your entire life in the worst dependency ecosystem humans have ever created – yes, I'm talking about you, Javascript – that doesn't mean the whole world works the same way:
> I have seen people sharing this and a bunch of memes in a similar format claiming that companies like Vercel are in the end just AWS wrappers [...] Everything's a fucking wrapper. I am so tired of this goddamn argument [...] Welcome to software
Infrastructure and software are not the same thing. Approaching infrastructure the same way you approach your stupid React web application is dumb. It's nonsensical. Again, you don't need those wrappers. And the fact that 30% of the internet can go down because of a fucking wrapper is a bad thing. We should tackle these problems, discuss them properly, and stop being slaves to AWS representatives.
We should advocate for more competition, more plurality in the cloud provider scene. We should try our best to spread knowledge about infrastructure, not cheer for a single company that runs everything. People should go back to learning stuff, and we should stop abstracting everything away from everybody all the time. That's not progress. It's quite the opposite of progress.
An internet hosted by 3 companies sucks. We should aim for a more distributed internet, an internet shaped more like its creators intended it to be. Reliability comes from distribution, not concentration.
If you hope for the opposite, to use Theo's own words, you're just wrong. Period. Point blank. End of story.
[^1]: it was particularly funny to read about people waking up at 2AM because their expensive water-cooled mattress from Eightsleep was warming up without any clear way to stop it, as the app wasn't working. As if internet access should be required at all to set the temperature of a mattress. Oh boy.
[^2]: fun fact: while I moved to Vienna almost 2 years ago, I kept my Italian mobile number (mainly because it's cheaper, even in roaming), and this morning while I was commuting, despite having a full-bar 5G signal, I wasn't able to browse the internet. After enabling my VPN and reading some Italian news, it turned out the Italian provider's DNS servers were down.
[^3]: quite recently at Sentry we were hit by a major outage of the us-central1 region in GCP, as they deployed a broken version of the software that manages quotas in GCP, practically making it impossible for anything to scale. Kind of the same story: the infrastructure is there and available, but due to abstractions it's not operational.
[^4]: actually, there's another point on which I agree with Theo here: the tweet from Hetzner was just bad taste. Several other companies do the same on too many occasions, and even if I can understand the humor, it's just a poor thing to do.
[^5]: guess where this blog runs :)
[^6]: remember, I'm Italian. We had, and still have, some really bad habits in terms of job contracts, especially for young people.
[^7]: Theo, if you ever read this, I'm sorry, but you know it's true.