Diffusion is more parameter-efficient and you quickly saturate the target fidelity, especially with some refiner cascade. It's a solved problem. You do not need more than maybe 4B total. Images are far more redundant than text.
In fact, most interesting papers since Imagen show that you get more mileage out of scaling the text encoder part, which is, of course, a Transformer. This is what drives accuracy, text rendering, compositionality, parsing edge cases. In SD 1.5 the text encoder part (CLIP ViT-L/14) takes a measly 123M parameters.[1] In Imagen, it was T5-XXL with 4.6B [2]. I am interested in someone trying to use a really strong encoder baseline – maybe from a UL2-20B – to push this tactic further.
Seeing as you can throw out diffusion altogether and synthesize images with transformers [3], there is no reason to prioritize the diffusion part as such.
When you are hiring in volume, you are hiring for additive value, not for transformative value (nor multiplicative value). For additive value, conformance is essential.
Most of the semi-successful companies don't need to be "the innovation machine".
I disagree; the burden is almost exclusively maintaining fast implementations of primitive operators for all hardware. These ML libraries are collections of pure functions with minimal interfaces. There's very little code interdependence, and it's not particularly difficult to implement modern algorithms to train networks.
For compiler people reading this, a lot of common compiler terms have been entirely reinvented in the context of machine learning frameworks. An ML "graph" refers almost exactly to the dataflow graph (DFG) of a program. TensorFlow 1.0 only exposed a DFG, which is well known to be far simpler to apply optimizations to (assuming you have a linear algebra compiler).
PyTorch is integrated with Python (an interpreted language) and does not expose an underlying DFG. This is labeled "eager" and means that compiling PyTorch requires optimizing over both the control flow graph (CFG) and the DFG. Python by default exposes neither of these in a standard way. Some ML workloads simplify easily to a DFG (torch FX can handle this), but the general case does not. Although TorchScript (a subset of Python) tackled the CFG in 1.0, the team is now taking it further and compiling Python bytecode itself (with torchdynamo), which means you don't need to change any code and still get compilation speed-ups. That's why 2.0 is significant.
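To make the FX-vs-dynamo distinction concrete, here's a minimal sketch (assuming PyTorch >= 2.0; the function is just a toy, not anything from an actual codebase):

    import torch
    import torch.fx

    def f(x, w):
        # Pure tensor ops with no data-dependent control flow: FX can trace this to a DFG.
        return torch.relu(x @ w) + 1.0

    gm = torch.fx.symbolic_trace(f)
    print(gm.graph)  # the captured dataflow graph: matmul -> relu -> add

    # In 2.0, torchdynamo hooks the Python bytecode, so even code with branches gets
    # captured (in pieces, with graph breaks) without rewriting anything.
    f_compiled = torch.compile(f)
    out = f_compiled(torch.randn(4, 8), torch.randn(8, 16))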
Of course, all of this requires a linear algebra compiler to actually do the optimizations which is why things like AITemplate (for inference) and TorchInductor (which calls into a bunch of other compilers for training) exist for PyTorch. TensorFlow's linear algebra compiler is XLA.
Find a company whose open source projects you are interested in. Dive in and start fixing things. Then, if you really like it after a couple of weeks, start nudging around for a job. If you do good work they'll just give it to you, no bullshit funnel required.
I like this method because you aren't just doing l33t coding exercises to end up working on some sight-unseen codebase that makes you suicidal and throws you into an existential crisis.
A neat algorithm to detect whether a point is inside a polygon is to cast a ray from the point in an arbitrary direction and count how many polygon edges it crosses; if the count is odd, the point is inside.
I use a lot of operations with basic geometric primitives, and whenever I use stackoverflow it takes ages to iron out all the special cases in which the top stackoverflow answer fails.
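For reference, here's a minimal sketch of that ray-casting (even-odd) test; it handles the common cases, but as noted above, the degenerate ones (point exactly on an edge, ray passing through a vertex) still need care:

    def point_in_polygon(px, py, poly):
        # poly is a list of (x, y) vertices; cast a horizontal ray in the +x direction
        # from (px, py) and count how many edges it crosses.
        inside = False
        n = len(poly)
        for i in range(n):
            x1, y1 = poly[i]
            x2, y2 = poly[(i + 1) % n]
            # Does this edge straddle the horizontal line y = py?
            if (y1 > py) != (y2 > py):
                # x coordinate where the edge crosses that horizontal line
                x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
                if x_cross > px:
                    inside = not inside
        return inside

    # e.g. a unit square
    print(point_in_polygon(0.5, 0.5, [(0, 0), (1, 0), (1, 1), (0, 1)]))  # True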
Just breaking the attention matrix multiply into parts allows a significant reduction of memory consumption at minimal cost. There are variants out there that do that and more.
Short version: Attention works as a matrix multiply that looks like this: s(QK^T)V, where QK^T is a large (N x N) matrix but Q, K, V and the result are all small. You can break Q into horizontal strips. Then the result is the vertical concatenation of:
s(Q1*K^T)*V
s(Q2*K^T)*V
s(Q3*K^T)*V
...
s(QN*K^T)*V
Since you reuse the memory for each block's score matrix, you never materialize the full N x N matrix at once and can get away with much less simultaneous RAM use.
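A minimal NumPy sketch of that row-blocking idea (FlashAttention-style kernels go further and also tile K/V with an online softmax, but this alone caps the live score matrix at chunk x N):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention_chunked(Q, K, V, chunk=128):
        # Process Q in row blocks so only a (chunk, N) score matrix is live at a time,
        # instead of the full (N, N) one.
        out = np.empty((Q.shape[0], V.shape[1]))
        for i in range(0, Q.shape[0], chunk):
            scores = Q[i:i + chunk] @ K.T           # (chunk, N)
            out[i:i + chunk] = softmax(scores) @ V  # (chunk, d)
        return out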
The fact that there's a claimed generational gap here speaks more to the values of the generations re: passive vs active entertainment. I could also ask: "Why do all these 60-Somethings have the TV on as background noise while doing other tasks?"
To me it seems closed-caption usage is correlated with actually paying attention to television consumption.
People who have their TV on at all hours, as background entertainment to support their lifestyle, tend to not use CC. Why should they? The words literally don't matter, it's just an aesthetic.
People who actually desire an immersive experience, who deliberately pay attention to the shows they watch, tend to care about CC since it complements the audio & visual nicely. I don't have any evidence for this but I'd wager that plot synthesis and comprehension of television shows is greatly improved by CC. Or maybe it's that people who use CC tend to value and perform better at synthesis and comprehension? Regardless of causality, CC seems related to an individual's desire for more active entertainment.
Traffic cameras are publicly available in London. The first thing I used to do when coming to the office each morning was to look myself up in the traffic cameras along my journey (historic footage is also publicly available).
With all the great progress in large language models lately, and them being excellent text compressors, I've started to wonder if you couldn't just replace a search engine with something like a 100 MB file of weights that lets you query essentially Google-scale results, except all locally.
Another optimization in the same vein: make sure the first KB contains the <meta charset> element, to avoid false starts in the HTML parser. In fact, the spec actually makes it an _error_ for the charset declaration to appear after the first 1024 bytes!
It's mentioned in passing on the Lighthouse docs for the charset audit[1], but here's a great breakdown by hsivonen[2].
Of course, declaring it in the HTTP headers is even better. Point is, try not to have too much cruft before the charset.
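A toy sketch of the header version, using Python's stdlib server purely for illustration (the charset rides along in Content-Type, so the parser never has to guess or restart):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = ("<!doctype html><html><head><meta charset='utf-8'>"
                    "<title>hi</title></head><body>ok</body></html>").encode("utf-8")
            self.send_response(200)
            # Charset declared in the header: no in-document sniffing needed at all.
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(("", 8000), Handler).serve_forever()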
I personally feel like that lawsuit happens the moment someone builds the version of this that works on music. In my experience arguing before the Copyright Office at the Library of Congress, the party that tends to be most omnipresent is the RIAA, and when someone releases an AI-generated piece of music that sort of sounds like some recent Taylor Swift song but uses that infamous sample from Under Pressure / Ice Ice Baby, the lawsuit will be filed within days.
I'm already using this timestamp technique on my website and so far no bot operator has bothered trying to work around it. But even if a bot operator were to specifically target a site using this technique and try to decrease the timestamp, I believe you could still force the bot to wait by switching to something like a cryptographically signed nonce that includes the timestamp, instead of a plain timestamp that can be read and rewritten easily.
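A minimal sketch of one way to do that, with an HMAC over the issue time (all names here are made up; the point is only that the client can't forge an older timestamp, so it genuinely has to wait):

    import hmac, hashlib, time

    SECRET = b"server-side secret"  # hypothetical; never sent to the client

    def issue_token():
        ts = str(int(time.time()))
        sig = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
        return f"{ts}.{sig}"  # embedded in the form/page

    def check_token(token, min_age=5):
        try:
            ts, sig = token.split(".", 1)
            expected = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
            age = time.time() - int(ts)
        except (ValueError, AttributeError):
            return False
        # The signature check stops the bot from rolling the timestamp back.
        return hmac.compare_digest(sig, expected) and age >= min_age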
It's annoying for sure. I deal with abuse at a large scale.
I'd recommend:
- Rate-limit everything, absolutely everything. Set sane limits. (A minimal sketch of one way to do this follows the list.)
- Rate-limit POST requests harder. Preferably dynamically based on geoip.
- Rate-limit login and comment POST requests even harder. Ban IPs that exceed the amount.
- Require TLS. Drop TLSv1.0 and TLSv1.1. Bots certainly break.
- Require SNI. Do not reply without SNI (nginx has the 444 return code for that). Ban IPs on first hit that connect without it. There's no legitimate use and you'll also disappear from places like Shodan.
- If you can, require HTTP/2.0. Bots break.
- Ban IPs listed on StopForumSpam, and ban destination e-mail addresses listed there. If possible also contribute back to SFS and AbuseIPDB.
- Collect JA3 hashes, figure out malicious ones, ban IPs that use those hashes. This blocks a lot of shit trivially because targeting tools instead of behaviour is accurate.
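For the rate-limiting items, a minimal in-process sliding-window sketch (real deployments usually do this at the proxy or in something like Redis; this is just to show the shape of it):

    import time
    from collections import defaultdict, deque

    class SlidingWindowLimiter:
        """Allow at most `limit` hits per `window` seconds per key (e.g. client IP)."""

        def __init__(self, limit, window):
            self.limit, self.window = limit, window
            self.hits = defaultdict(deque)

        def allow(self, key):
            now = time.monotonic()
            q = self.hits[key]
            while q and now - q[0] > self.window:
                q.popleft()       # drop hits that fell out of the window
            if len(q) >= self.limit:
                return False      # over the limit: reject, tarpit or ban
            q.append(now)
            return True

    # e.g. login POSTs: 5 attempts per minute per IP
    login_limiter = SlidingWindowLimiter(limit=5, window=60)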
The post describes essentially a spectral method for solving ODEs in the Taylor (monomial) basis. Haskell definitely makes it nice to play around with this, but the monomial basis is terribly ill-conditioned and not a good choice for this application. The Chebyshev basis is much better. To anyone who finds this neat, I recommend checking out the chebfun family of packages. They use the same idea but with much more rigorous numerics.
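You can see the conditioning gap with a few lines of NumPy: build the interpolation matrices for both bases at Chebyshev points and compare their condition numbers (degree 30 is already enough to make the point):

    import numpy as np

    deg = 30
    # Chebyshev points (extrema of T_deg) on [-1, 1]
    x = np.cos(np.pi * np.arange(deg + 1) / deg)

    V_mono = np.polynomial.polynomial.polyvander(x, deg)  # basis 1, x, x^2, ...
    V_cheb = np.polynomial.chebyshev.chebvander(x, deg)   # basis T_0, T_1, T_2, ...

    print(np.linalg.cond(V_mono))  # grows exponentially with the degree
    print(np.linalg.cond(V_cheb))  # stays small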
A vaguely related notion is that naive analysis of big-O complexity in typical CS texts glosses over the increasing latency/cost of data access as the data size grows. This can't be ignored, no matter how much we would like to hand-wave it away, because physics gets in the way.
A way to think about it is that a CPU core is like a central point with "rings" of data arranged around it in a more-or-less flat plane. The L1 cache is a tiny ring, then L2 is a bit further out physically and has a larger area, then L3 is even bigger and further away, etc... all the way out to permanent storage that's potentially across the building somewhere in a disk array.
In essence, as data size 'n' grows, the random access time grows as sqrt(n), because that's the radius of the growing circle with area 'n'.
Hence, a lot of algorithms that on paper have identical performance don't perform identically in reality, because one of them may have an extra sqrt(n) factor hiding in its memory accesses.
This is why streaming and array-based data structures and algorithms tend to be faster than random-access, even if the latter is theoretically more efficient. So for example merge join is faster than nested loop join, even though they have the same performance in theory.
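A quick way to feel this on your own machine (numbers vary wildly by hardware, but the gap is usually large): do the same O(n) amount of summing, once over memory in order and once through a random permutation of indices.

    import time
    import numpy as np

    n = 10_000_000
    a = np.arange(n, dtype=np.int64)
    idx = np.random.permutation(n)

    t0 = time.perf_counter()
    s_seq = a.sum()        # sequential: prefetcher-friendly streaming
    t1 = time.perf_counter()
    s_rand = a[idx].sum()  # gather at random addresses: cache/TLB misses on most elements
    t2 = time.perf_counter()

    print(f"sequential: {t1 - t0:.3f}s, random gather: {t2 - t1:.3f}s")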
1. https://forums.fast.ai/t/stable-diffusion-parameter-budget-a...
2. https://arxiv.org/abs/2205.11487
3. https://arxiv.org/abs/2301.00704