>> I'd say the biggest cost is probably hiring a decent Engineering+Product+Test team
Part of it is cost, but a lot is culture and leadership. Streaming (especially live) is one of the toughest areas in which to maintain a good user experience. I've led Streaming Product teams for years. Product teams almost always need to deliver growth, which comes in the form of new features, monetization, and other changes. But the user cares most about the core experience: did the video start playing without a delay? Were there buffering issues? Was audio playback out of sync? Issues are very noticeable, and sometimes very difficult to proactively test for. Product needs to find this balance, and cannot go 100% all-in on growth while neglecting the unsexy stuff. If the whole Product/Engineering/Test org is not aligned on stability/QoE being a top priority, a streaming app can degrade very quickly after a few releases.
Having recently listened to TSMC as well as Costco, LVMH, and Hermes, there's something available to every member of the HN audience in every episode, no matter how "unrelated" you may think it is.
On the other side of the table, as a former sponsor of a season or two: it was the best marketing we (Crusoe) could have done. We had more high quality inbound from our audience than any other channel, by a long shot. They deeply understand their audience and target appropriately. Strongly recommend to anyone trying to reach a decision making audience (customer and investor).
How did you measure that? Podcast ads seem almost like old-school TV in that a lot of listening is offline, with no call to action or method of measurement, despite reaching and resonating deeply with an audience.
1. I believe they require that you have an Acquired-specific landing page (e.g. crusoe.ai/acquired), which we saw directly as conversions through to our waitlist. The volume was lower than other channels, but the signal was much, much higher.
2. To @scarface_74's comment, "you talk to customer and you can ask them where they heard about you from" is mostly why I made the claim. In particular, when fundraising, basically everyone I talked to said, "I heard about you on Acquired, really interesting business model that I hadn't considered, let's talk more about it."
Sponsorship isn't cheap (I would go so far as to call it expensive), but relative to spending an equivalent amount on search or display ads/billboards on 101/etc., I think it was the right choice at the time.
My only analogy for this is what I call "cruise missile marketing" where you're investing a lot of time/money in building something that is very specifically targeted at high value buyers. It works really well for large, infrequent transactions (raising capital, selling GPU clusters, etc.) and less well for commodity SaaS or B2C products where volume >> everything.
It’s not for everyone. Relative to the length, the level of detail is fairly glossy and the narratives about the companies aren’t complex or surprising. It’s frustrating already knowing a fair amount if not in the mood for entertaining banter.
If they had separate behind-the-scenes episodes with B-side detail and research discussion, they’d be the total package.
Not saying it isn’t well done! And it’s clearly landing with an educated audience. I know many fans.
Yeah, I was disappointed with their RenTech episode, despite it being hyped up on HN. I think I learned more from the Jim Simons Numberphile (I think) interview.
This is rough napkin math, no need to downvote if anyone knows the real number and this is way off :)
Meta 2023 ad revenue was $131 billion. To make it easy, let's assume an even spread for # of users and ad revenue generation per hour/minute of the day and day of the year (which I'm sure is not the case).
This would be:
$358 million per day
$15 million per hour
$249k per minute
This also assumes a minute down won't be somewhat or totally offset by a spike in users when it comes back online.
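The arithmetic above can be sanity-checked in a few lines (same assumptions as stated: $131B total, spread evenly over the year):

```python
# Napkin math: Meta's 2023 ad revenue, assuming an even spread across the year.
annual_revenue = 131e9  # $131 billion

per_day = annual_revenue / 365
per_hour = per_day / 24
per_minute = per_hour / 60

print(f"per day:    ${per_day / 1e6:.0f}M")    # ~$359M
print(f"per hour:   ${per_hour / 1e6:.1f}M")   # ~$15.0M
print(f"per minute: ${per_minute / 1e3:.0f}k") # ~$249k
```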
There are some quite harsh comments below. You can't plan for every possible failure point; who knows which part of their enormous system/infra went down and triggered this behavior. Some things you just can't catch or predict, especially in systems as huge as theirs. I would expect people here to understand this and not call people names over something like it. We all know things seem simple and clear from the outside, but debugging and fixing something like this takes quite some effort.
This is a company with one of the largest digital infrastructures in the world. An outage is understandable, inability to tell they're having an outage and inform users appropriately is not. Stop making excuses for people who are literally awash in resources.
> Stop making excuses for people who are literally awash in resources.
This is a pretty weird outlook to have. Look at any group awash with resources, whether governments or other companies, and you can clearly see that even with those resources, failures still happen.
You can jump up and down and pretend that this is solvable, or you can look at reality, look at all the evidence of this happening over and over to almost everyone, and conclude with some humility that these things just happen to everyone.
(Looking this reality in the face is one of the things motivating my beliefs around e.g. AI safety, climate change, etc.)
It is always better for the company's reputation for the issue to have been on your end. Admitting fault comes with potential liability. It's gaslighting written as an SLA.
You can't plan for every contingency, but you can reserve potentially scary messages for situations where you know they are correct. An unexpected error state should NOT result in an "invalid credentials" error.
Pushing people to unnecessarily reset credentials increases risk. Not only does it increase acute risk, but it also decreases the value of the signal by crying wolf.
The argument here is the kind of nonsense cargo cult security that pervades the industry.
- In general, if the system is broken enough to give false negatives on valid credentials, it's broken enough that there isn't much planning to be done here, because the system isn't supposed to break that way. And if they give me "Sorry, backend offline" instead of "invalid credentials," they've turned their system into an oracle I can scan for queries-of-death. That's useful to an attacker.
- in the specifics of this situation, (a) credential reset was offline too so nobody could immediately rotate them anyway and (b) as a cohort, Facebook users could stand to rotate their credentials more often than the "never" that they tend to rotate them, so if this outage shook their faith enough that they changed their passwords after system health was restored... Good? I think "accidentally making everyone wonder if their Facebook password is secure enough" was a net-positive side-effect of this outage.
So your approach to security is to never admit that an application had an error to a user, but to instead gaslight that user with incorrect error messages that blame them?
This is security by obscurity of the worst kind, the kind that actively harms users and makes software worse.
No. My approach to security is to never admit that an application had an error to an unauthenticated user.
That information is accessible to two cohorts:
- authenticated users (sometimes; not even authenticated users get access to errors as low-level as "The app's BigTable quota was exceeded because the developers fucked up" if it's closed source cloud software)
- admins, who have an audit log somewhere of actual system errors, monitoring on system health, etc.
Unfortunately, I can't tell if the third cohort (unauthenticated users) is my customers or actively-hostile parties trying to make the operation of my system worse for my customers, so my best course of action is to refrain from providing them information they can use to hurt my customers. That means, among other things, I 403 their requests to missing resources instead of 404ing them, I intentionally obfuscate the amount of time it takes to process their credentials so they can't use timing attacks to guess whether they're on the right track, I never tell them if I couldn't auth them because I don't recognize their email address (because now I've given them an oracle to find the email addresses of customers), and if my auth engine flounders I give them the same answer as if their credentials were bad (and I fix it fast, because that's impacting my real users too).
To be clear: I say all this as a UX guy who hates all this. UX on auth systems is the worst and a constant foil to system usability. But I understand why.
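The practices described above (one generic failure message for every failure mode, timing obfuscation) can be sketched in a few lines. Everything here is hypothetical and illustrative; real systems use proper password hashing (bcrypt/argon2), not bare SHA-256:

```python
import hashlib
import hmac

# Hypothetical user store: email -> salted hash (illustration only).
_USERS = {"alice@example.com": hashlib.sha256(b"salt" + b"hunter2").digest()}

# One message for every failure mode: wrong password, unknown email,
# or a floundering backend. No oracle for attackers.
GENERIC_FAILURE = "Invalid email or password."

def login(email: str, password: str) -> str:
    # Hash the supplied password even for unknown emails, so response
    # timing doesn't reveal whether the account exists.
    supplied = hashlib.sha256(b"salt" + password.encode()).digest()
    stored = _USERS.get(email, b"\x00" * 32)
    try:
        ok = hmac.compare_digest(supplied, stored)  # constant-time comparison
    except Exception:
        # Auth engine floundered: same answer as bad credentials.
        # Alert the on-call through monitoring, not through the login page.
        ok = False
    return "Welcome!" if ok else GENERIC_FAILURE
```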
You are absolutely correct. That would be a much better experience.
That said, getting there strikes me as pretty challenging. Automatically detecting a down state is difficult and any detection is inevitably both error-prone and only works for things people have thought of to check for. The more complex the systems in question, the greater the odds of things going haywire. At Meta's scale, that is likely to be nearly a daily event.
The obvious way to avoid those issues is a manual process. Problem there tends to be that the same service disruptions also tend to disrupt manual processes.
So you're right, but also I strongly suspect it's a much more difficult problem than it sounds like on the surface.
> That said, getting there strikes me as pretty challenging. Automatically detecting a down state is difficult and any detection is inevitably both error-prone and only works for things people have thought of to check for. The more complex the systems in question, the greater the odds of things going haywire. At Meta's scale, that is likely to be nearly a daily event.
Well, in principle, the frontend just has to distinguish between HTTP status 500 (something broke in the backend; not the user's fault) and the 4xx status codes (the user did something wrong).
The "your username/password is wrong" message came back in a timely manner, so somewhere, something transformed "some unforeseen error" into a clear but wrong error message.
And this caused a lot of extra trouble on top of the incident.
But there's something off here. I wouldn't expect to be shown as logged out when the services are down. I'd expect calls to fail with something like a 500 and an error saying "something went wrong on our side", not all the apps going haywire.
At the scale of Meta, "down" is a nuanced concept. You are very unlikely to get every piece of functionality seizing up at once. What you are likely to get is some services ceasing to function and other services doing error-handling.
For example, if the service that authenticates a user stops working but the service that shows the login form works, then you get a complex interaction. The resulting messaging - and thus user experience - depend entirely on how the login page service was coded to handle whatever failure the authentication service offered up. If that happens to be indistinguishable from a failure to authenticate due to incorrect credentials from the perspective of the login form service, well, here we are.
At Meta's scale, there are likely quite a few underlying services. Which means we could be seeing something a dozen or more complex interactions away from wherever the failures are actually happening.
Isn't this just the standard problem of reporting useful error messages? Like, yes, there are academic situations where you can't distinguish between two possible error sources, but the vast majority of insufficiently informative error messages in the real world arise because low effort was applied to doing so.
Yes, with the additions of sheer scale, a vast number of services, multiple layers, and the difficulty of defining "down" added in. I think the difficulty of reporting useful error messages is proportional to the number of places an error can reasonably happen and the number of connections it can happen over, and by any metric Meta's got a lot of those.
No, in that detecting when you should be reporting a useful error message is itself a complex problem. If a service you call gives you a nonsense response, what do you surface to the user? If a service times out, what do you report? How do you do all this without confusing, intimidating, and terrifying users to whom the phrase "service timeout" is technobabble?
> If a service you call gives you a nonsense response, what do you surface to the user?
If this occurred during the authentication process, I think I would tell the user "Sorry, the authentication process isn't working. Try again later." rather than "Invalid credentials". And you could include a "[technical details]" button that the user could click if they were curious or were in the process of troubleshooting.
> If that happens to be indistinguishable from a failure to authenticate due to incorrect credentials from the perspective of the login form service, well, here we are.
If you can't distinguish those, then that is bad software design.
Come on, use a little imagination. The DNS entry for the DB holding the shard with the user credentials disappears. The code isn't expecting this and throws a generic 4xx because security instead of a generic 5xx (plenty of people writing auth code will take the stance all failures are presented the same as a bad password or non-existing username); the caller interprets this as a login failure.
The same auth system is used to validate logins to the bastions that have access to DNS. Voilà.
> plenty of people writing auth code will take the stance all failures are presented the same as a bad password or non-existing username
Those people would be wrong. You can take all unexpected errors and stick them behind a generic error message like "something went wrong" but you should not lie to your users with your error message.
If you have different messages for invalid username vs invalid password, you can exploit that to determine if a user has an account at a particular service.
"Invalid credentials" for either case solves this problem.
But sure, let's report infra failures differently, as "unexpected error".
Now, what happens if the unexpected error is only when checking passwords, but not usernames?
Do you report "invalid credentials" when given an invalid username, but "unexpected error" when given a valid name but invalid password?
If so, you're leaking information again and I can determine valid usernames.
So, safe approach is to report "invalid credentials" for either invalid data or partial unexpected errors.
Only time you could safely report "unexpected error" is if both username check and password check are failing, which is so rare that it's almost not worth handling. Esp. at the risk of doing wrong and leaking info again.
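The partial-failure leak above can be made concrete. A sketch (the plaintext user store and `RuntimeError` are purely illustrative; the point is that the error branch and the bad-password branch must return the same message):

```python
def check_login(username: str, password: str, users: dict) -> str:
    """users maps username -> password (plaintext only for illustration)."""
    if username not in users:
        return "invalid credentials"
    try:
        ok = verify_password(password, users[username])  # may hit a broken backend
    except RuntimeError:
        # If we returned "unexpected error" here but "invalid credentials"
        # for unknown usernames, an attacker could enumerate valid accounts
        # whenever the password-check path is down.
        return "invalid credentials"
    return "ok" if ok else "invalid credentials"

def verify_password(supplied: str, stored: str) -> bool:
    return supplied == stored
```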
If you really want to hide whether a username is in use, then you also have to obscure the actual duration of the authentication process among other things. The amount of hoops you need to jump through to properly hide username usage are sufficient that you need to actually consider if this is a requirement or not. Otherwise, it is just a cargo cult security practice like password character requirements or mandated password reset periods.
In this case, Facebook does not treat hiding username usage as a requirement. Their password reset mechanism not only exposes username/phone-number usage, but ties it to a name and picture. So yes, Facebook returning an error that says credentials are incorrect when it has infrastructure problems is absolutely a defect.
What if, when one service doesn't respond at all, or responds with something that doesn't fit the expected format, the whole thing just says "sorry, we had an error, try again later"? If it has to check both at the same time and can't check them independently, wouldn't that solve the vulnerability? Or am I missing something? Totally understandable if I am; I just want to learn. /gen
My prediction is that to be competitive, companies will eventually need to rely on AI-produced code to some extent or risk being slower and less efficient than competitors. It would be like not using email or messaging and only using snail mail for all written communication.
But AI is nowhere close to perfect now, and will have flaws for a long time. Having AI write code is like having a so-so junior engineer, who can complete the task, but makes mistakes, so needs their code reviewed closely. And is unable to architect anything complex, that still needs to be done by the leads/managers/senior folks.
So more and more of the simple, low-complexity coding tasks will be done by AI, while the importance of senior engineers will be as high as ever, since they need to oversee the AI's outputs.
What I wonder is how junior engineers, who will start their careers as more expensive or weaker coders than AI, will get the experience necessary to become the senior engineers who need to guide and review the AI's work.
There are an unsettling number of NA startups that seem to aggressively hire remote workers or contractors in Asia/Africa plus a couple of mid-to-senior people in NA timezones.
They build their product and then just seem to fizzle out. My guess is that technical debt and lack of talent retention kills them.
Technical people are more valuable than the code they produce. The good ones have domain expertise and can guide product development in a way that takes advantages of emerging tech, long before it becomes mainstream, and positions the company to capitalize on market movements sooner.
Businesses that don't have technical leaders in their senior leadership aren't tech companies. They will fall behind quickly because they're busy chasing what was hot yesterday instead of what's going to be hot soon.
10 years ago, I worked for an F500 company that fired the research team that was working on generative AI (and had made solid progress) because senior leadership was all about investing in blockchain. Remember blockchain? I'm willing to bet those same leaders are all about the "AI future" now that it's in the magazines. But the problem is they are competing with companies who saw the value of generative AI years ago, before it was mainstream. Lucky for them, the company has enough money to buy the startup competition for a few billion.
Agreed. Probably could be a much better UX for handling a mass outage like this. Graceful, clear error messaging that FB login is down would be better than the current UI.
Triggering millions of people to unnecessarily reset their passwords and still be unable to log in is not a great UX. This seems like one of those cases that's high impact when it does happen, unlikely on any given day, but certain to happen at some point; probably just not much focus was put on handling a case like this.
From a process/QA perspective I doubt this can ever be properly tested.
Sure, you can set up a UX that shows the auth server is somehow down and discourages users from trying to log in or reset passwords, but when shit hits the fan you never actually know the precise error that gets thrown to the client, because it could be any layer between the backend and the client that failed...
Roadmaps come in many different flavors depending on the company/culture/processes/etc. For an org where roadmaps are more formal/solidified - a healthy level of "product discovery" should be happening before items are put on the product roadmap. We do it this way at my current company (am a Product Director, leading multiple Product teams). When items are put on the roadmap, scope may not be 100% final, but there is a relatively high level of certainty for what's in/out of scope based on customer & business value, as well as clear prioritization based on value/effort. And for larger scope/high priority items, they've already been aligned to a good extent across key stakeholders & partner product/engineering teams before formally being put on the roadmap.
Thank you for suggesting this. One of the most interesting podcast episodes I've ever listened to. The background of Chang/TSMC is fascinating. Lots of lessons to be learned regardless of industry. While Chang is a technical genius, my main takeaway was that his smartest move might have been seeing the overall strategic picture / value chain of the industry and seeing the long term opportunity. A good reminder that it's always critical to deeply understand your current (or potential) customers and their needs and problems.
This has been my go-to weather site since I heard of it last year. I've yet to find another website or app with the variety of detailed radar views that Windy has, and it seems extremely accurate.
I have several years of experience dealing with supply chain from the retailer side, and your comment is spot on.
For anything expected to be in demand during the Q4/Xmas season and manufactured in Asia (electronics, etc.), demand planning and orders happen a minimum of 3-6 months in advance.
Even if a retailer wants more supply, many manufacturers are extremely hesitant to overproduce.
Here's some rough napkin math to help illustrate the cost of being overstocked (caveat: I don't know the actual costs of video games; this is a best guess). It applies to both retailers and manufacturers:
$60 MSRP Switch game, cost to retailer: $48 ($12 profit/unit)
Let's say a retailer orders 100k units of a game in Q2 for the upcoming holiday season, but sells only 70k units during Q4. 70k * $12 = $840k profit. Meanwhile, 30k * $48 = $1.4M of inventory is sitting there; that cash is tied up so the company can't buy inventory of new games releasing in the future, and there's some non-trivial cost for each subsequent month that merchandise sits unsold in the warehouse. That's for under-selling the forecast by 30%... now imagine being off by 50% or more on any individual game.
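The scenario above in code form (same assumed numbers: $60 MSRP, $48 cost to retailer, $12 margin per unit):

```python
# Napkin math for overstock risk on a single game (all figures assumed).
msrp, cost = 60, 48
margin = msrp - cost                 # $12 profit per unit sold

ordered, sold = 100_000, 70_000
profit = sold * margin               # profit on units that sold
tied_up = (ordered - sold) * cost    # cash stuck in unsold inventory

print(f"profit: ${profit:,}, cash tied up: ${tied_up:,}")
```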