They need to fix the addresses. In Stockholm, all of the companies are placed in the old town. At Hopsworks, we are in Sodermalm (hipster) - we are not old school money.
I gave a talk at PyData Berlin on how to build your own TikTok recommendation algorithm. The TikTok personalized recommendation engine is the world's most valuable AI. It's TikTok's differentiation. It updates recommendations within 1 second of you clicking - at human perceivable latency. If your AI recommender has poor feature freshness, it will be perceived as slow, not intelligent - no matter how good the recommendations are.
TikTok's recommender is partly built on European technology (Apache Flink for real-time feature computation), along with Kafka and distributed model training infrastructure. The Monolith paper is misleading in suggesting that 'online training' is key. It is not. What matters is that your clicks are made available as features for predictions in less than 1 second. You need a per-event stream processing architecture for this (like Flink - Feldera would be my modern choice as an incremental streaming engine).
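To make the feature-freshness point concrete, here is a minimal sketch of per-event feature updates. It is a toy in-memory stand-in, not Flink, Feldera, or anything TikTok runs; the class name `ClickFeatureStore`, the window size, and the category labels are all made up for illustration.

```python
from collections import defaultdict, deque


class ClickFeatureStore:
    """Toy in-memory store: every click updates the user's features immediately."""

    def __init__(self, window=5):
        # Keep only the last `window` clicked categories per user (assumed feature).
        self.recent = defaultdict(lambda: deque(maxlen=window))

    def on_click(self, user_id, category):
        # Per-event update: the feature is fresh as soon as this call returns.
        self.recent[user_id].append(category)

    def features(self, user_id):
        # Read at prediction time: category counts over the recent window.
        counts = {}
        for category in self.recent[user_id]:
            counts[category] = counts.get(category, 0) + 1
        return counts


store = ClickFeatureStore(window=3)
store.on_click("u1", "dogs")
store.on_click("u1", "dogs")
store.on_click("u1", "cooking")
print(store.features("u1"))  # {'dogs': 2, 'cooking': 1}
```

A batch pipeline would compute the same counts, but only after the next scheduled run; the per-event version makes the click visible to the very next prediction request.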
I have to say, it is _extremely_ impressive when a tiktok I watched reminds me of some other tiktok, so I go and search for a very loose description of the tiktok, and the first result is 95% of the time what I wanted to find.
I don't think any single other platform has as good a search feature as TikTok does.
oh wow, you're really lucky. around my friend groups who use tiktok, the main complaint is how bad the search is. unfortunately for us, getting a specific video is almost impossible =(
That's super interesting (I deleted TikTok because it was too addicting!). A common complaint about Instagram is that it feels impossible to find a reel based on keywords.
I noticed YouTube Shorts also seems to update the feed based on how long you watched the last video. If you're scrolling quickly and then stop to watch a dog video long enough, the next one is likely to be another dog video.
It creates a weird feedback loop: after I watch video A, it recommends a similar video B, and if I make the mistake of watching that too, it then recommends video C on the same topic. Suddenly my feed is nothing but Stranger Things shorts for two whole days (literally not a single video about anything else). Skipping or disliking didn't help, then somehow it went back to normal after two days.
I’ve noticed the same thing and this creates such a negative user experience. Every short is a reaction test and if I fail, I get slop. Makes the whole experience very jarring (for better or for worse).
For better or worse with regards to my addiction, my subscriptions are all either science channels or high effort / high production comedy skits (e.g. DropoutTV). I still get slop, but I never subscribe and it mostly remains background noise
That’s the point though. It may seem as if you’re not in control when scrolling, but you can adjust your behavior to get the content you’re looking for almost intuitively. That’s actually something good in my honest opinion.
Why is it good that you need self control to not get slop? It's much better if you can just turn that off and relax rather than having to stay alert to avoid certain content that it tries to trick you to serve you more slop.
Distancing yourself from temptations is an effective and proven way to get rid of addictions, the programs constantly trying to get you to relapse is not a good feature. Like imagine a fridge that constantly puts in beer, that would be very bad for alcoholics and people would just say "just don't drink the beer?" even though this is a real problem with an easy fix.
Basically, I want to set boundaries in a healthy frame of mind, and have that default respected when my self control is lower because I’m tired, depressed, bored, etc.
It’s because content curation is inherently impossible to reach the same level of relevance as direct feedback from user behavior. You mix in all kinds of biases, commercial interests, ideology of the curator, etc, and you inevitably get irrelevant slop. The algorithm puts you in control a little bit more.
> The algorithm puts you in control a little bit more.
Why not let you choose to get a less addictive algorithm? Older algorithms were less addictive, so it's not at all impossible to do this, and many users would want it.
And that is why these algorithms need to be regulated. People don't want to pick the algorithm that makes them spend the most time possible on their phones; many would want an algorithm that optimizes for quality rather than quantity on the app so they get more time to do other things. But corporations don't want to provide that because they don't earn anything from it.
I have YouTube Premium. They should be doing the opposite. Getting me off the platform as quickly as possible so they get to keep a bigger cut of my fixed payment.
I just don’t think that the addiction is exclusively due to the algorithm. There’s really a lack of affordable varied options for learning trade and entertainment. We say in Portuguese: You shouldn’t throw the baby away along with the water you used to bathe.
I don't see any harm that could come from saying "a less addictive algorithm needs to be available to users"? For example, let's say there is an option to only recommend videos from channels you subscribe to. That would be much less addictive, so why isn't that an option? A regulation that forces these companies to add such a feature would only make the world a better place.
>I don't see any harm that could come from saying "a less addictive algorithm needs to be available to users"?
consider air travel in the present day. ticketing at essentially all airlines breaks down as: premium tickets that are dramatically expensive but offer comfortable seats, and economy tickets that are cramped and seem to impose new indignities every new season. what could be the harm from legislation that would change that menu?
the harm would be fewer people able to travel, fewer young people taking their first trip to experience the other side of the world, fewer families visiting grandma, etc.
As much as people hate the air travel experience, the tickets get snapped up, and most of them strictly on the basis of price, and next most taking into account nonstops. This gives us a gauge as to how much people hate air travel: they don't.
this doesn't mean airlines should have no regulation, it doesn't mean monopoly practices are not harmful to happiness, it doesn't mean that addictions don't drive people to make bad choices, it doesn't mean a lot of things.
I'm just trying to get you to see that subtle but significant harm to human thriving can easily come from regulations.
I agree, but what would be the actual mechanism that would allow that? I believe we’re out of ideas. TikTok’s crime was just being highly successful because of good engineering. There’s no evil sauce apart from promotional content and occasional manipulation, which has nothing to do with the algorithm per se.
And about whitelisting, I honestly don’t think you’re comparing apples to apples. The point of the algorithm is dynamically recommending new content. It’s about discovery.
> I agree, but what would be the actual mechanism that would allow that?
Governments saying "if you are a social content platform with more than XX million users you have to provide these options on recommendation algorithms: X Y Z". It is that easy.
> And about whitelisting, I honestly don’t think you’re comparing apples to apples. The point of the algorithm is dynamically recommending new content. It’s about discovery.
And some people want to turn off that pushed discovery and just get recommended videos from a set of channels that they subscribed to. They still want to watch some tiktok videos, they just don't want the algorithm to try to push bad content on them.
You are right that you can't avoid such algorithm when searching for new content, but I don't see why it has to be there in content it pushes onto you without you asking for new content.
I don’t agree tbh. This is part of how people wind up down extremist rabbit holes. If you’re just lazily scrolling it can easily trap you in its gravity well.
But you can get into extremist rabbit holes independently of control surface. Remember 4chan? Dangerous content is a matter of moderation regardless of interfacing.
4chan has a lot less extremism than people imagine, especially compared to platforms like Instagram or Facebook. It's mostly concentrated on certain boards. The reputation of being extremist did more 'in favour' of its extremism than the original userbase and design ever did.
4chan is only outdone by 8chan. “It’s only concentrated on certain boards” is the same lame excuse Reddit used to ignore /r/thedonald and now /r/conservative.
4chan doesn't use algorithms to push users to certain boards afaik, makes it better than the others in its design. I'm not arguing 4chan is great but it's not nearly as impactful as Facebook, Twitter or TikTok in creating extremism.
Facebook and Twitter are far worse sources of extremism. There are entire groups dedicated to genetic comparisons between races, 'who would you do' groups that do nothing but photos of young women in bikinis farmed FROM facebook/ig.
4chan is where you go too far. 4chan users typically don't foster extremism, they are the extreme. They don't post pictures of young women, they post addresses and walkthroughs of their apartments.
Yes, I feel like it's far less harmful than the other sites for this reason. These bad parts of 4chan aren't the majority of the site either, a large minority maybe, but the site in general is much smaller. Users are also attracted to the image of 'extremism', 4chan in the far past didn't have this as its main audience of newcomers in its early stages.
It's easy to control for governments compared to facebook/reddit/... because it's just some boards, way better than massive amounts of posts creating a personal zone for everyone.
>I'm not arguing 4chan is great but it's not nearly as impactful as Facebook, Twitter or TikTok in creating extremism.
4chan has /pol/. 4chan inspired Gamergate, Pizzagate, QAnon and numerous incidents of extremist violence. Those other platforms mostly just spread and accelerate the toxic culture that originated on 4chan.
I'm not sure if most of 4chan was actually so on board with the whole gamergate thing and all the things which followed. pre-/pol/ 4chan was a whole different thing. It was outsiders joining 4chan which did most of the posting, twitter and facebook were the ones which allowed this to happen.
Internet starting with a 1000 4chans wouldn't create what we have today (you'll just get lots of small fringe groups), internet starting with a 1000 facebooks/twitters/... will always end in extremism of a big portion of the population.
And — this is really shocking — Jeffrey Epstein caused /pol/ to exist, which makes him indirectly responsible for almost all stupid internet politics of the last decade.
I try to react as “violently” as possible to any slop and low-quality crap (e.g. stupid “life hacks” purposely bad to ragebait the comments). On YouTube it’s called “Don’t recommend this channel” and on Facebook it’s multiple taps but you can “Hide All From…”
Basically, I don’t trust that thumbs down is sufficient. It is of course silly, since there are no doubt millions of bad channels and I probably can’t mute them all.
At the risk of going off on a tangent about that maxim; I feel like it's just misusing the word "purpose".
Maybe it would be cleaner to state that a system has no purpose (at least not until it is sentient); instead it has behaviors. Then one can observe that the purpose of the designers or maintainers of a system simply happens to be at odds with (or, as AI safety researchers would say, "out of alignment with") the behavior of the system.
That all of course presupposes that one can accurately deduce the purposes of the designers/maintainers. In the case of TikTok, I'd bet that we are all in agreement that their purpose is nothing more nor less than maximal value-extraction from people wishing to express themselves with videos multiplied against an audience of people who wish to view videos multiplied again against advertisers who want to insert propaganda into eyeballs.
The right way to look at these networks is that people are being trained by the algorithm, not the other way around. The ultimate goal is to elicit behaviors in humans, normally to spend more time and spend more money in the platform, but also for other goals that may be designed by the owners of the network.
On amazon.ie I'm convinced they are running only two ads, because all I'm ever seeing are ads for grime brushes and window squeegees. Literally nothing else.
One of my gripes with youtube at the moment is that they break my adblock filters to remove shorts more often than they break the filters stopping the actual ads.
It comes back. It acts like it executed shortiness+=1 every day, and "show fewer shorts" does shortiness-=10 or thereabouts. The shorts position on the home screen is based on this hidden shortiness variable. It always bubbles back to the top unless you keep pressing "show fewer shorts" whenever you see it.
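The guessed behaviour can be written down as a tiny simulation. This is only the commenter's mental model, not anything known about YouTube's ranking; the function name and the default values (`+1` per day, `-10` per press) are taken from the guess above and are otherwise hypothetical.

```python
def shortiness_after(days, press_days, daily_gain=1, press_penalty=10):
    """Hypothetical hidden score after `days` days.

    The score grows by `daily_gain` every day and drops by `press_penalty`
    on each day listed in `press_days` ("show fewer shorts" was pressed).
    """
    score = 0
    for day in range(1, days + 1):
        score += daily_gain
        if day in press_days:
            score -= press_penalty
    return score


print(shortiness_after(10, set()))  # 10: never pressed
print(shortiness_after(10, {10}))   # 0: one press cancels ten days of growth
print(shortiness_after(20, {10}))   # 10: ten days later it has bubbled back
```

Under this model a single press only buys about ten days, which matches the experience of having to keep pressing the button.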
youtube's algorithm seems to be "oh you watched this video? now here's every other video by this creator, pretty much without a break, until you downvote it"
It never reliably gives me videos similar but not exactly the same, i.e. things I might be interested in.
For me it's the same exact 5 videos on repeat, over and over and over again. I've gotten in a loop a lot of times, where it'll autoplay the same video I just watched, it's absolute madness
If by features you mean tracking state per user, that stuff can be tracked without Flink insanely fast with Redis as well.
If you're saying they don't have to load data to update the state, I don't see how these states are massive enough to require in-memory updates, and if they are, you could just do in-memory updates without Flink.
Similarly, any consumer will have to deal with batches of users and pipelining.
Flink is just a bottleneck.
If they actually use Flink for this, it's not the moat.
Yea, the Monolith paper by Bytedance uses Flink but they only say it's in use for their B2B ecommerce optimization system. Maybe this is intentional ambiguity, but I'd believe that they wouldn't rely on something like Flink for their core TikTok infrastructure.
My hunch is we start to learn a lot more about the core internals as Oracle tries to market to B2B customers, as Oracle is wont to do!
Flink is not really a performance choice, it's bloat to throw software as fast as possible at problems. I don't think there's any benchmark demonstrating insane capabilities per machine. I definitely couldn't get it to any numbers I liked, given other stream processing / state processing engines that exist (if compute and inmemory state management is the goal). Pretty sure any pathway that touches RocksDB slows everything down to 1-10k events per second, if not less.
The problem of finding out which video is next, by immediately taking into account the recent user context (and other user context) is completely unrelated to what Flink does -- exactly-once state consistency, distributed checkpoints, recovery, event-time semantics, large keyed state. I would even say you don't want a solution to any of the problems Flink solves, you want to avoid having these problems.
I'm not a TikTok user, but I'm assuming the recommendation engine is there to keep eyeballs on more ads for longer. Maybe we should be regulating how often and how many ads can be shown on social media, especially to teens and kids.
If it was only network effect, then how did TikTok grow in a space where Instagram and Youtube were already much bigger players? How did they gain that user base?
Network effect helps, but it only explains why they stay big, not how they got big
There really isn't that much making TikTok unique. Yes, their app is well designed. Yes, stitches and video replies make for great social/parasocial features because creators are actually interacting with each other and the community, almost like tumblr. But in my opinion those are reasons number three and two why TikTok is successful. Their recommendation algorithm is number one, by a wide margin.
It also provides different opportunities for growth compared to other social media. A video that gets over half a million views on TikTok may not get 5 thousand on Youtube, or even 10 views on Instagram or Facebook.
It's not just different videos, TikTok is much better at recommending videos by very small creators and people with no followers. On Instagram or Facebook, if you don't already have a large following you most likely won't get any views at all, no matter how well your video matches the platform. YouTube often pushes big creators that already made it big while TikTok allows me to discover new and niche ones.
I'm sorry to point out the obvious here, but who is going to perceive their recommended feed as slow or unfresh if it doesn't learn from exactly the last video you clicked on within 1 second? The bar simply is not that high. The special sauce of TikTok is how it chooses the videos, not the speed it does it at. I'm sure the speed helps to give it that "spookily intelligent" feeling, but that's a cherry on the recommendation cake, a cake which is already twice as good as the nearest competitor. I'm sure your talk goes deeper than this, but if this is the main focus, then you've missed the point.
Speed completely changes the game in a few ways. The first is identifying interests. Imagine every possible interest in a tree structure. Let's say you're into kumiko. There are so many levels of the tree to traverse to find kumiko; perhaps Skilled crafts -> Woodworking -> Japanese -> Construction without use of fasteners -> Panels and decorative elements -> Kumiko. The more iterations you can get through, the better you can match people's interests. If someone has 10 interests and each one requires many questions to determine, it can take forever to find exact interests with a system that only narrows down your interests every X videos vs. after each video.
The second is matching current moods. Let's say you just broke up with your girlfriend, or your pet fish died, or you're on vacation in Spain. A rapidly-updating system can capture those trends and get right to the heart of them in time for them to matter. A slow system might only get through a few iterations and capture a vague interest in Spain; a fast-updating one can get through countless iterations of guessing. Spain? What city? Tourist or moving there? What type of tourist? Foodie? What type of food? How fancy? Bam, you're watching the perfect video about an upscale seafood restaurant in Barcelona.
The third is type and flavor of content. Even inside of a small niche you will find many flavors of content. Super-short or long form, fast paced or slow, funny or serious, intellectual, irreverent, political leanings, background music, et cetera. Maybe you like slow long-form woodworking content but like fast-paced travel guides. Maybe you hate background music except when it's in skateboarding videos. To determine this requires an incredible amount of "questioning" of the user.
Now, of course, an algorithm that updates once daily can also make inferences about your interests and preferences. It can certainly learn, with enough time, what you are into and how you like to consume it. But the key thing is that these inferences only enable _predetermined_ changes. Imagine you are a human showing someone TikToks. Imagine that you can ask them any questions about their preferences right as they watch a video. You may not ask a question after every video, but you will ask countless questions over the hours of scrolling that day, and you will get good data. Now imagine a new restriction: you must decide your questions once a day in advance. You will manage far fewer questions; and to follow up on them you must wait yet another day.
Now, why do I partly agree? Well, I don't think speed is everything; I think TikTok has another sort of je ne sais quoi to it. I think it has a unique culture and community. It has a better UI and better features than Instagram. It has a young and cool reputation, far from the Millennial taint of Instagram or Facebook. And I suspect that they are good at identifying _who_ you are and acting on that information. But in my eyes, the speed could very well be the most important part of the puzzle.
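The tree-narrowing argument in this comment can be put in rough numbers. This is a back-of-the-envelope sketch under strong assumptions that are mine, not the commenter's: the recommender descends exactly one level of the interest tree per model update, and every signal is read correctly. The function name and the 50-video batch size are made up for illustration.

```python
def videos_to_reach_leaf(depth, videos_per_update):
    """Videos watched before `depth` tree levels have been narrowed down,
    assuming one level is resolved per update (a simplifying assumption)."""
    return depth * videos_per_update


# The kumiko path in the comment has six levels:
# Skilled crafts -> Woodworking -> Japanese -> Construction without use of
# fasteners -> Panels and decorative elements -> Kumiko
KUMIKO_DEPTH = 6

print(videos_to_reach_leaf(KUMIKO_DEPTH, videos_per_update=1))   # 6
print(videos_to_reach_leaf(KUMIKO_DEPTH, videos_per_update=50))  # 300
```

Even in this idealised model, updating after every video reaches the niche in a handful of videos, while a recommender that only updates every 50 videos needs hundreds, and that is for a single interest.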
I would assume it's time spent on video and they start to build a profile of which users like what kind of content. X liked video 934934 so Y probably also likes that kind of video. Group people in buckets.
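The "X liked it, so Y probably will too" idea is roughly user-based collaborative filtering. A minimal sketch, assuming likes are stored as plain sets and similarity is Jaccard overlap; this is a textbook technique chosen for illustration, not a claim about what TikTok does, and the user names and video IDs other than 934934 are invented.

```python
def jaccard(a, b):
    """Overlap between two sets of liked videos, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, likes, k=1):
    """Suggest videos liked by the k most similar users that `target` has not liked."""
    others = [user for user in likes if user != target]
    others.sort(key=lambda user: jaccard(likes[target], likes[user]), reverse=True)
    suggestions = set()
    for user in others[:k]:
        suggestions |= likes[user] - likes[target]
    return suggestions


likes = {
    "X": {934934, 111, 222},
    "Y": {111, 222},
    "Z": {555},
}
# X is Y's nearest neighbour, so X's extra like is suggested to Y.
print(recommend("Y", likes))  # {934934}
```

Grouping people into buckets would replace the nearest-neighbour lookup with a clustering step, but the underlying signal (shared likes) is the same.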
I'm sure this is part of it, but I suspect it goes deeper than that. I'd guess they probably have some kind of automated categorisation algorithm that can extract features from the videos.
I got turned off in the first paragraph with the misuse of the term "back pressure". "back pressure" is a term from data engineering to specifically indicate a feedback signal that indicates a service is overloaded and that clients should adapt their behavior.
Backpressure != feedback (the more general term). And in the agentic world, we use the term 'context' to describe information used to help LLMs make decisions, where the context data is not part of the LLM's training data. Then, we have verifiable tasks (what he is really talking about), where RL is used in post-training in a harness environment to use feedback signals to learn about type systems, programming language syntax/semantics, etc.
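For readers unfamiliar with the data-engineering sense of the term, here is a minimal sketch of back pressure using a bounded queue from the Python standard library. The `Service` class and `OverloadedError` exception are hypothetical names for illustration; real systems signal overload in many ways (HTTP 429, blocking writes, credit-based flow control).

```python
import queue


class OverloadedError(Exception):
    """Signal sent back to a client when the service cannot accept more work."""


class Service:
    def __init__(self, capacity):
        # A bounded inbox: once full, new work is refused instead of buffered.
        self.inbox = queue.Queue(maxsize=capacity)

    def submit(self, request):
        try:
            self.inbox.put_nowait(request)
        except queue.Full:
            # Back pressure: tell the caller to slow down or retry later.
            raise OverloadedError("inbox full, retry later") from None

    def process_one(self):
        # Draining the inbox frees capacity for new requests.
        return self.inbox.get_nowait()


service = Service(capacity=2)
service.submit("a")
service.submit("b")
try:
    service.submit("c")
except OverloadedError:
    print("client backs off")
```

The key property is that the overload signal flows from the service back to its clients, which is narrower than feedback in general.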
The term back pressure actually comes from mechanical engineering in the context of steam engines.
It first appeared in a dictionary 160 years ago.
Words are just words. Mathematicians very well understand that words mean nothing, what matters are definitions and the author provides one.
E.g. natural numbers may or may not contain the number 0, but that's irrelevant, because what mathematicians care about are definitions, so they will state that the natural numbers are a given set of positive whole numbers (with or without the number 0) and avoid arguing about labels. You can call them funky numbers or neat numbers, doesn't matter.
Same applies here. Your comment is pointless because the author does provide a definition for back pressure in the context of his blog post and what matters is discussing the concept he labels in the context of LLMs.
We all live in our own various small circles, in which many terms get misused. "Isomorphic" in front-end circles means something completely different from any other use, for example. This is how languages evolve.
I'm not trying to discount any attempt to correct people, especially when it gets confusing (like here, I was also confused honestly), but we could formulate it nicer IMHO.
It is perhaps more generally known in the plumbing sense of pressure causing resistance to the desired direction of flow, but yeah, a poor word choice...at least it isn't AI written though.
If they are exploding categorical variables using OHE and storing the columns - that is the wrong thing to do. You should only ever store untransformed feature data in tables. You apply the feature transformations, like OHE, on reading from the tables, as those transformations are parameterized by the data you read (the training data subset you select).
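A minimal sketch of what "transform on read" means for one-hot encoding (OHE), using plain Python so it does not depend on any feature store API. The table contents, column names, and helper functions are hypothetical; the point is only that the encoder's vocabulary comes from the training subset you select, not from what is stored.

```python
def fit_one_hot(values):
    """Learn the category vocabulary from the selected training subset only."""
    return sorted(set(values))


def one_hot(value, vocabulary):
    """Encode one raw value; categories unseen at fit time map to all zeros."""
    return [1 if value == category else 0 for category in vocabulary]


# Raw, untransformed rows as they would sit in the table (made-up data).
table = [
    {"day": 1, "country": "SE"},
    {"day": 2, "country": "DE"},
    {"day": 3, "country": "SE"},
    {"day": 4, "country": "FR"},
]

# Select a training subset at read time, then fit and apply the encoder.
train = [row for row in table if row["day"] <= 3]
vocab = fit_one_hot(row["country"] for row in train)
encoded = [one_hot(row["country"], vocab) for row in train]

print(vocab)    # ['DE', 'SE']
print(encoded)  # [[0, 1], [1, 0], [0, 1]]
```

Had the one-hot columns been stored in the table, a different training subset (say, days 1 to 4) would need a different set of columns, which is why storing the raw value is the safer choice.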
Subsidiarity has been a key building block of the EU and has failed the EU for unexpected reasons. Subsidiarity was pursued for accountability and to make the EU less centralized - decisions should be made at the lowest, most local level possible, with central authorities only stepping in when a task cannot be effectively handled locally.
However, it means that here in Sweden govt bodies are all individually moving to Azure, because each one makes that local decision in their best interest. The same thing has happened all over the EU - and very few govt bodies would ever take the risk of investing in using EU cloud or data platforms. We need public procurement to help kickstart life into the Eurostack.
They control Europe's digital infrastructure and are able to increase rent to usurious levels (tariffs!) because Europe is dependent on their digital services. Without digital sovereignty, Europe has no sovereignty and will quickly become a modern colony from which wealth will be extracted.
The reason the US is able to raise rents (tariffs) has nothing to do with Europe buying US digital services.
The tariffs are on European exports. The problem is Europe has a weak domestic consumer market and is dependent on selling stuff to the US, not buying from them.
The EU has a services deficit compared to the US, the US has a goods deficit compared to Europe. Together, they are almost in balance, the difference is just 3% of total trade [1]. Put differently, the US and the EU need each other. This is why Trump is using footguns.
The problem is really that Europe has a few dozen weak consumer markets. If there really was a proper single market, I suspect the EU would be much more competitive in digital services.
Unfortunately despite their best efforts this isn't something Eurocrats can simply will into existence. The most important prerequisite is a common language, and there is zero political will to do the only sensible thing and establish English as the official common language of the EU.
Nonsense. Unilateral tariffs are not how trade agreements work. This is pure extractive rent.
The reason the US is not able to extract the same rents from China is that they have digital sovereignty and the US cannot just pull the cloud plug from them.
> Nonsense. Unilateral tariffs are not how trade agreements work. This is pure extractive rent.
What do you mean by "unilateral tariffs"?
> The reason the US is not able to extract the same rents from China is that they have digital sovereignty and the US cannot just pull the cloud plug from them.
The US has higher tariffs against Chinese imports than European imports.
I agree that this is an anti-pattern for training.
In training, you are often I/O bound over S3 - high b/w networking alone doesn't fix it (.safetensors files are typically 4GB in size). You need NVMe and high b/w networking along with a distributed file system.
We do this with tiered storage over S3 using HopsFS, which has an HDFS API with a FUSE client, so training can just read data (from a HopsFS datanode's NVMe cache) as if it is local, but it is pulled from NVMe disks over the network.
In contrast, writes go straight to S3 via the HopsFS write-through NVMe cache.
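The read and write paths described above follow the general read-through / write-through cache pattern. The sketch below is a toy model with dictionaries standing in for S3 and the NVMe cache; it is not the HopsFS API, and the class and attribute names are invented for illustration.

```python
class TieredStore:
    """Toy model: a fast cache (standing in for NVMe) in front of a slow object store."""

    def __init__(self):
        self.object_store = {}       # stands in for S3
        self.cache = {}              # stands in for the NVMe cache
        self.object_store_reads = 0  # counts slow-path reads

    def write(self, key, data):
        # Write-through: the object store is updated on every write,
        # and the cache keeps a copy for later reads.
        self.object_store[key] = data
        self.cache[key] = data

    def read(self, key):
        # Read-through: serve from the cache, fall back to the object store on a miss.
        if key not in self.cache:
            self.object_store_reads += 1
            self.cache[key] = self.object_store[key]
        return self.cache[key]


store = TieredStore()
store.object_store["shard-0"] = b"weights"  # data that already lives in the object store
store.read("shard-0")                       # miss: fetched from the object store
store.read("shard-0")                       # hit: served from the cache
print(store.object_store_reads)             # 1
```

Repeated epochs over the same training shards hit the cache after the first pass, which is where the I/O win comes from in this simplified picture.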