Gaussian splatting is a way to record 3-dimensional video. You capture a scene from many angles simultaneously and then combine all of those into a single representation. Ideally, that representation is good enough that you can then, in post-production, simulate camera angles you didn't originally record.
For example, the camera orbits around the performers in this music video would be difficult to achieve in real space. Even if you could pull it off using robotic motion control arms, it would require that the entire choreography be fixed in place before filming. This video clearly takes advantage of being able to direct whatever camera motion the artist wanted in the 3d virtual space of the final composed scene.
To do this, the representation needs to estimate the radiance field, i.e. the amount and color of light visible at every point in your 3d volume, viewed from every angle. It's not practical to do this at high resolution by breaking that space up into voxels; voxel grids scale badly, O(n^3) in resolution. You could attempt to guess at some mesh geometry and paint textures onto it that are compatible with the camera views, but that's difficult to automate.
Gaussian splatting estimates these radiance fields by assuming that the radiance is built from millions of fuzzy, colored balls positioned, stretched, and rotated in space. These are the Gaussian splats.
Once you have that representation, constructing a novel camera angle is as simple as positioning and angling your virtual camera and then recording the colors and positions of all the splats that are visible.
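For the curious, the standard formulation in the Gaussian splatting literature looks roughly like this (a sketch only; real systems also make the color view-dependent via spherical harmonics):

```latex
% Each splat i is an anisotropic 3D Gaussian with center \mu_i and a
% covariance built from a rotation R_i and per-axis scales S_i:
G_i(x) = \exp\!\left(-\tfrac{1}{2}(x - \mu_i)^\top \Sigma_i^{-1} (x - \mu_i)\right),
\qquad \Sigma_i = R_i S_i S_i^\top R_i^\top

% A pixel's color is the front-to-back alpha composite of the splats that
% project onto it, each contributing color c_i with opacity \alpha_i:
C = \sum_{i} c_i \, \alpha_i \prod_{j < i} (1 - \alpha_j)
```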
It turns out that this approach is pretty amenable to techniques similar to modern deep learning. You basically train the positions/shapes/rotations of the splats via gradient descent. It's mostly been explored in research labs, but lately production-oriented tools have been built for popular 3d motion graphics packages like Houdini, making it more accessible.
Thanks for the explanation! It makes a lot of sense that voxels would scale as badly as they do, especially if you want to increase resolution.
Am I right in assuming that the reason this scales a lot better is that the Gaussian splats, once there's enough "resolution" of them, can estimate how light works reasonably well at most distances? What I'm getting at is: can I see Gaussian splats vs voxels similarly to pixels vs vector graphics in images?
I think yes: with greater splat density, and critically with more and better inputs to train on (others have stated that these performances were captured with 56 RealSense D455fs), splats will more accurately estimate light at more angles and distances. I think it's likely that during capture they had to make some choices about lighting and bake those in, so you might still run into issues matching lighting to your shots, but still.
That said, I don't think splats are to voxels as pixels are to vector graphics. A closer analogy might be that pixels are to vectors as voxels are to 3d mesh modeling. You might imagine a sophisticated character being modeled and then animated using motion capture techniques.
But notice where these things fall apart, too. SVG shines when it's not just estimating the true form, but literally is it (fonts, simplified graphics made from simple strokes). If you try to estimate a photo using SVG it tends to get messy. Similar problems arise when reconstructing a 3d mesh from real-world data.
I agree that splats are a bit like pixels, though. They're samples of color and light in 3d space, the way pixels are samples in 2d. They represent the source more faithfully when they're more densely sampled.
The difference is that a splat is sampled irregularly, just where it's needed within the scene. That makes it more efficient at representing most useful 3d scenes (i.e., ones where there are a few subjects and objects in mostly empty space). It just uses data where that data has an impact.
Are meshes not used instead of gaussian splats only due to robustness reasons? I.e., if there were a piece of software that could reliably turn a colored point cloud into a textured mesh, would that be preferable?
Photogrammetry has been around for a long time now. It uses pretty much the same inputs to create meshes from a collection of images of a scene.
It works well for what it does. But, it's mostly only effective for opaque, diffuse, solid surfaces. It can't handle transparency, reflection or "fuzz". Capturing material response is possible, but requires expensive setups.
It’s not only for robustness. Splats are volumetric and don’t have topology constraints, and both of those things are desirable. The volume capability is sometimes used for volume effects like fog and clouds, but it also gives splats a very graceful way to handle high frequency geometry - higher frequency detail than the capture resolution - that mesh photogrammetry can’t handle (hair, fur, grass, foliage, cloth, etc.). It depends entirely on the resolution of the capture, of course. I’m not saying meshes can’t be used to model hair or other fine details, they can obviously, but in practice you will never get a decent mesh out of, say, iPhone headshots, while splats will work and capture hair pretty well. There are hair-specific capture methods that are decent, but no general mesh capture methods that’ll do hair and foliage and helicopters and buildings.
BTW I believe there is software that can turn point clouds into textured meshes reliably; multiple techniques even, depending on what your goals are.
Not everything can be represented by the textured meshes used in traditional photogrammetry (think Google Street View).
This includes sparse geometry like fences and vegetation, but more importantly any material properties like reflections, specularity, opacity, etc.
I've recently begun replacing Markdown with Gemini's .gmi/gemtext format. It is Markdown with fewer features. I appreciate the simplicity and it's tremendously easy for custom tools to parse.
It has no inline formatting, only 3 levels of ATX headers (without trailing #s), and one level of bullet points delimited only by asterisks, not dashes. It does not merge adjacent non-blank lines (so it expects one line per paragraph), and it supports only triple-backtick fenced preformatted blocks that simply toggle on and off.
Maybe the biggest change is that links necessarily sit on their own line, preceded by a `=>` and optionally followed by alt-text.
My gemtext parser is maybe 70 lines, and the format is arguably 95% of what one needs from Markdown.
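To give a sense of how little there is to parse, here's a minimal sketch of a gemtext line classifier in Haskell (this is not my actual parser, the type names are made up, and it skips some edge cases in the spec):

```haskell
import Data.List (stripPrefix, isPrefixOf)

data GemLine
  = Heading Int String         -- #, ##, ### headers, no trailing #s
  | Bullet String              -- "* item"
  | Link String (Maybe String) -- "=> url optional alt text"
  | Pre String                 -- a line inside a fenced block, kept verbatim
  | Text String                -- anything else: one line = one paragraph
  deriving Show

parseGemtext :: String -> [GemLine]
parseGemtext = go False . lines
  where
    go _ [] = []
    go inPre (l : ls)
      | "```" `isPrefixOf` l = go (not inPre) ls   -- fences just flip on/off
      | inPre                = Pre l : go True ls
      | otherwise            = classify l : go False ls

classify :: String -> GemLine
classify l
  | Just r <- stripPrefix "###" l = Heading 3 (dropWhile (== ' ') r)
  | Just r <- stripPrefix "##"  l = Heading 2 (dropWhile (== ' ') r)
  | Just r <- stripPrefix "#"   l = Heading 1 (dropWhile (== ' ') r)
  | Just r <- stripPrefix "* "  l = Bullet r
  | Just r <- stripPrefix "=>"  l =
      case words r of
        []          -> Link "" Nothing
        (url : alt) -> Link url (if null alt then Nothing else Just (unwords alt))
  | otherwise = Text l
```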
I thought I was the only one. It is great for just having some text with headers and links, without any distractions.
Only complaint is that it handles line-breaks the way some Markdown variants do, with each line being one paragraph of text. I much prefer line-breaks to be just treated as whitespace and using double breaks to end paragraphs, like e.g. Pandoc's Markdown format (one reason I always use that when I render Markdown).
Definitely not common! Nice to hear I'm not alone either.
And yeah, I agree. Practically, it's the thing that annoys me the most day-to-day. I've mostly got wrapping set up to handle it now, but it remains a little finicky.
SDFs still scale with geometry complexity, though. It costs instructions to evaluate each SDF component. You could still use something like a BVH (or Matt Keeter's interval arithmetic trick) to speed things up.
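To make "costs instructions per component" concrete, a tiny sketch (Haskell just for illustration; the helper names are made up):

```haskell
-- A signed distance field is just a function from a point to a distance.
type Point = (Double, Double, Double)
type SDF   = Point -> Double

sphere :: Double -> SDF
sphere r (x, y, z) = sqrt (x*x + y*y + z*z) - r

translate :: Point -> SDF -> SDF
translate (dx, dy, dz) f (x, y, z) = f (x - dx, y - dy, z - dz)

-- Naive union: every sample pays for every primitive, so evaluation cost
-- grows with geometry complexity. A BVH, or interval arithmetic over whole
-- regions, lets you skip primitives that can't affect the local result.
union :: [SDF] -> SDF
union fs p = minimum (map ($ p) fs)
```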
Genuine question, how does SPIR-V compare with CUDA? Why is SPIR-V in a trench coat less desirable? What is it about Metal that makes it SPIR-V in a trench coat (assuming that's what you meant)?
There might be a subset of people, such as yourself, who treat CUDA as a hard requirement when buying a GPU. But I think it's fair to say that Vulkan/SPIR-V has a _lot_ of investment and momentum currently outside of the US AI bubble.
Valve is spending a lot of resources on it, and AFAIK so are the AI companies in the Asian market.
There are plenty of people who want an open-source alternative that breaks the monopoly Nvidia has with CUDA.
I think the new AMD R9700 looks pretty exciting for the price. Basically a power tweaked RX 9070 with 32gb vram and pro drivers. Wish it was an option 6-7 months ago when I put my new desktop together.
Great. I’m with you there. There is no way that’s describing Apple though.
They’re not open source, for sure. But even setting that aside, they don’t offer anything like CUDA for their system. Nobody is taking an honest stab at this.
If you're familiar with Zorn's Lemma, the construction is straightforward: order the linearly independent subsets by inclusion and consider chains of them. Each chain is upper-bounded by the union of its members (which preserves linear independence, since any finite subset already sits inside some member of the chain). By Zorn's Lemma there is therefore a maximal linearly independent set, and if any element existed outside that set's span, adjoining it would contradict maximality.
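Spelled out slightly more formally (the standard argument, for reference):

```latex
Let $V$ be a vector space and let $\mathcal{I}$ be the collection of linearly
independent subsets of $V$, partially ordered by inclusion. If
$\mathcal{C} \subseteq \mathcal{I}$ is a chain, then $\bigcup \mathcal{C}$ is
still linearly independent (any finite subset of it lies in a single member of
the chain), so every chain has an upper bound in $\mathcal{I}$. By Zorn's
Lemma, $\mathcal{I}$ has a maximal element $B$. If some $v \in V$ lay outside
$\operatorname{span}(B)$, then $B \cup \{v\}$ would still be linearly
independent, contradicting maximality. Hence $\operatorname{span}(B) = V$ and
$B$ is a basis.
```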
Yeah, that's correct. You also often see it as having that for any method `X -> T<Y>` there's a corresponding method `T<X> -> T<Y>`. Or you can have that for any two arrows `X -> T<Y>` and `Y -> T<Z>` there's a composed arrow `X -> T<Z>`. All are equivalent.
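A quick Haskell sketch of why those formulations are interchangeable (the names here are just illustrative):

```haskell
import Control.Monad ((>=>))

-- "Every X -> T<Y> gives a T<X> -> T<Y>" is bind with its arguments flipped:
lifted :: Monad t => (x -> t y) -> (t x -> t y)
lifted f = (>>= f)

-- Kleisli composition ("compose X -> T<Y> and Y -> T<Z>") built from bind:
composed :: Monad t => (x -> t y) -> (y -> t z) -> (x -> t z)
composed f g = \x -> f x >>= g

-- And bind recovered from Kleisli composition plus id, closing the loop:
bind' :: Monad t => t x -> (x -> t y) -> t y
bind' m f = (id >=> f) m
```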
Thanks for confirming; I wasn’t 100% sure what I wrote was equivalent to the `X -> T<Y>` definition, but I could see how you’d get one from the other using the unit/counit.
Let's start with function composition. We know that for any two types A and B we can consider functions from A to B, written A -> B. We can also compose them, the heart of sequentiality. If f: A -> B and g: B -> C then we might write (f;g) or (g . f) as two different, equivalent syntaxes for doing one thing and then the other, f and then g.
I'll posit this is an extremely fundamental idea of "sequence". Sure something like [a, b, c] is also a sequence, but (f;g) really shows us the idea of piping, of one operation following the first. This is because of how composition is only defined for things with compatible input and output types. It's a little implicit promise that we're feeding the output of f into g, not just putting them side-by-side on the shelf to admire.
Anyway, we characterize composition in two ways. First, we want to be clear that composition only cares about the order in which the pipes are plugged together, not how you group the assembly. Specifically, for three functions f: A->B, g: B->C, h: C->D, we have (f;g);h = f;(g;h). The parentheses don't matter.
Second, we know that for any type A there's the "do nothing" identity function id_A: A->A. This doesn't have to exist, but it does and it's useful. It helps us characterize composition again by saying that f;id = id;f = f. If you're playing along by metaphor to lists, id is the empty list.
Together, composition and identity, with the rule of associativity (parentheses don't matter) and the rule that identity can be dropped, capture what the idea of "sequences of pipes" means. This is a super popular structure (technically, a category), and whenever you see it you can build a strong intuition that some kind of sequencing might be happening.
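In Haskell notation, with a few arbitrary example functions (a minimal sketch):

```haskell
f :: Int -> String
f = show

g :: String -> Int
g = length

h :: Int -> Bool
h = even

-- Associativity: grouping doesn't matter, only the order of the pipes.
assocL, assocR :: Int -> Bool
assocL = (h . g) . f
assocR = h . (g . f)    -- the same function as assocL

-- Identity is the do-nothing pipe: both of these are just f.
sameAsF1, sameAsF2 :: Int -> String
sameAsF1 = f . id
sameAsF2 = id . f
```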
Now, let's consider a slightly different sort of function. Given any two types A and B, what about functions A -> F B, for some fixed type constructor F? F here exists to somehow "modulate" B, to annotate it with additional meaning. Having a value of F B is kind of like having a value of type B, but maybe seen through some kind of lens.
Presumably, we care about that particular sort of lens and you can go look up dozens of useful choices of F later, but for now we can just focus on how functions A -> F B sort of still look like little machines that we might want to pipe together. Maybe we'd like there to be composition and identity here as well.
It should be obvious that we can't use identity or composition from normal function spaces. They don't type-check (id_A: A -> A, not A -> F A) and they don't semantically make sense (we don't offhand have a way to get Bs out of an F B, which would be the obvious way to "pipe" the result onward in composition).
But let's say that for some type constructors F, they did make sense. We'd have for any type A a function pure_A: A -> F A as well as a kind of composition such that f: A -> F B and g: B -> F C become f >=> g : A -> F C. These operations might only exist for some kinds of F, but whenever they do exist we'd again capture this very primal form of sequencing that we had with functions above.
We'd again capture the idea of little A -> F B machines which can be plugged into one another as long as their input and output types align and built into larger and larger sequences of piped machines. It's a very pleasant kind of structure, easy to work with.
And those F which support these operations (and follow the associativity and identity rules) are exactly the things we call monads. They're type constructors which allow for sequential piping very similar to how we can compose normal functions.
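To make that concrete in Haskell, here's a minimal sketch with F = Maybe (the function names are made up for illustration):

```haskell
import Control.Monad ((>=>))
import Text.Read (readMaybe)

-- Two little "A -> F B" machines, with F = Maybe.
parseInt :: String -> Maybe Int
parseInt = readMaybe

hundredOver :: Int -> Maybe Int
hundredOver 0 = Nothing
hundredOver n = Just (100 `div` n)

-- pure_A is `pure` (a.k.a. `return`); the piping composition is (>=>).
pipeline :: String -> Maybe Int
pipeline = parseInt >=> hundredOver

-- pipeline "4"    == Just 25
-- pipeline "0"    == Nothing
-- pipeline "four" == Nothing
```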
Thank you so very much. This is the first time monads have made sense to me, and now it's clear why. People who try to explain them usually end up adding all kinds of Haskell minutiae (the top post is another example) rather than actually explaining the concept and why we need it. Your comment is the first time I actually understand what it is and why it might be useful.
The more constrained your theory is, the fewer models you have of it and also the more structure you can exploit.
Monads, I think, offer enough structure in that we can exploit things like monad composition (as fraught as it is), monadic do/for syntax, and abstracting out "traversals" (over data structures most concretely, but also other sorts of traversals) with monadic accumulators.
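For instance, the "traversal with a monadic accumulator" bit looks like this in Haskell (a tiny sketch; `halveAll` is a made-up example):

```haskell
-- traverse runs an effectful function over a structure and collects results.
halveAll :: [Int] -> Maybe [Int]
halveAll = traverse halve
  where
    halve n
      | even n    = Just (n `div` 2)
      | otherwise = Nothing

-- halveAll [2, 4, 6] == Just [1, 2, 3]
-- halveAll [2, 3]    == Nothing   (one failure fails the whole traversal)
```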
There's at least one other practical advantage as well, that of "chunking".
A chess master is much quicker at memorizing realistic board states than an amateur (but no better at memorizing randomized ones). When we have a grasp of relevant, powerful structures underlying our world, we can "chunk" along them to reason more quickly. People familiar with monads can often hand-wave away a set of unknowns in a problem by recognizing it as a monad-shaped problem that can be independently solved later.
> There's at least one other practical advantage as well, that of "chunking".
> When we have a grasp of relevant, powerful structures underlying our world, we can "chunk" along them to reason more quickly.
This is one thing I've observed about Haskell vs. other languages: it more readily gives names and abstractions to even the minutest and most trivial patterns in software, so that seemingly novel problems can be quickly pattern matched and catalogued against a structure that has almost certainly been seen before.
One example: I want to run two (monadic) computations, and then somehow combine their results with some binary operation. It's such a trivial and fundamental mode of composition, yet it seems to lack a name in almost every other programming language. Haskell has a name for this mode of composition, and it's called liftM2.
Never again will you have to re-write this pattern for yourself, leaving yourself open to error, now that you have this new concept in your vocabulary. Other languages will happily let you reinvent the wheel for the umpteenth time, or invent idiosyncratic patterns and structures without realizing that they are just particular applications of an already well-studied and well-worn concept.
Every monad is also an applicative, and liftA2 does/is the same thing as liftM2. The only reason they both exist is that Monad was popularized in Haskell earlier than Applicative, and so didn't have it as a superclass until the Functor-Applicative-Monad Proposal in 2014. The change was obviously correct, but it was a major breaking change that also got pork-barreled a bit, so it took a while to land.
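For the record, a tiny example of both, using Maybe as the context (a sketch; the names are arbitrary):

```haskell
import Control.Applicative (liftA2)
import Control.Monad (liftM2)

-- "Run two computations in a context and combine their results."
viaApplicative :: Maybe Int
viaApplicative = liftA2 (+) (Just 2) (Just 40)   -- Just 42

-- Same behavior; liftM2 is just the older, Monad-constrained spelling.
viaMonad :: Maybe Int
viaMonad = liftM2 (+) (Just 2) (Just 40)         -- Just 42
```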
Yes you're absolutely right. I had a bit of a brain fart moment here. If they were Applicative operations then you would not be able to use `liftM2`, not the other way around.
This is my fear when I think about doing actual work in a team using a functional language: that an imbalance in understanding between participants turns every discussion about problem X into a pattern-matching exercise over concept Y. Like "is this liftM2 or liftA2?"
I only had a couple of months of experience working with Scala before the team switched to Java. The reasons were many, but one of them was that the external consultant who was most knowledgeable in "thinking with functions" was kind of a dick, making onboarding a horror show of "go look up YouTube video x before I can talk about this functionality" delivered in a condescending tone. So within a month he was let go, and then no one in the remaining team really had the courage to keep pushing it further. Some thought they could maybe maintain the current functionality, but the solution was only about half complete. (In the consultant's mind it was complete, because it was so generic you only needed to add a couple of lines in the right place to implement each coming feature.)
That said, I would love to work in a hardcore Haskell project with a real team, one with a couple of other "regular" coders that just help each other out when solving the actual problems at hand.
Well, I can't speak to your experience, but in the case of liftM2 vs liftA2, I have never even seen liftM2 get used. It's more of a historical oddity that it's still available.
I’m not a huge fan of these, but this time I noticed that the best ones feel a lot like naturality arguments. As in, moving structural bits in a way that makes it clear that we’re not touching anything that ought to be universally quantifiable.
I still don’t love this sort of thing being presented as “proof”, but I thought that idea is interesting. Is there a way to formalize naturality into technical diagrams? Probably!