That line means this study is much less interesting than it sounds at first. Basically, it sounds like they used the fMRI voxel data to build a classifier that predicted which clips from the video library best matched what the subject was watching, and then composited those clips, each weighted by its probability. In other words, wtf.
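If I'm reading it right, the compositing step is mechanically pretty simple. Here's a rough sketch of what I imagine it looks like (the names, array shapes, and scoring are all my own guesses, not the paper's actual pipeline):

    import numpy as np

    def reconstruct_second(candidate_frames, scores, top_k=100):
        """Blend the top-k best-matching library frames, weighted by score.

        candidate_frames: (n_clips, height, width, 3) frames from the clip library
        scores: (n_clips,) how well each clip matches the observed fMRI activity
        """
        top = np.argsort(scores)[-top_k:]                    # best-matching clips
        weights = scores[top] - scores[top].min() + 1e-12    # shift to non-negative
        weights /= weights.sum()                             # normalize to sum to 1
        # Weighted average of the chosen frames -> one blurry "reconstruction"
        return np.average(candidate_frames[top], axis=0, weights=weights)

That averaging is presumably why the published reconstructions look so smeared.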
This link will decay in the future, but http://gallantlab.org/ is the original source. Alongside a similar clip (of a bird), it has this text: "The left clip is a segment of the movie that the subject viewed while in the magnet. The right clip shows the reconstruction of this movie from brain activity measured using fMRI. The reconstruction was obtained using only each subject's brain activity and a library of 18 million seconds of random YouTube video that did not include the movies used as stimuli. Brain activity was sampled every one second, and each one-second section of the viewed movie was reconstructed separately." There's also a useful video.
It seems valid to me. There's no reason to ask them to somehow extract this visual data in an "unbiased" way, without bootstrapping off of video clips like that.
Actually, I'd commend that link to anybody posting complaints here; it covers everything people are saying as of this writing.
> It seems valid to me. There's no reason to ask them to somehow extract this visual data in an "unbiased" way, without bootstrapping off of video clips like that.
I agree that their approach seems valid. There is a reason to ask them to extract the visual data in an even more unbiased fashion, though: if we understand how the brain is wired, then it should be "trivial" to back out the image from the patterns of activation.
Of course, the previous sentence makes a few assumptions that I don't think are anywhere close to valid: 1) "the brain" implies a single, nearly completely conserved architecture shared from one person to another; 2) I think you'd need to measure the activity at much higher resolution than fMRI can give you; 3) the stimulus <--> response mapping is moderately close to bijective, so a given input produces essentially one pattern of activity, and vice versa. Still, this study is an interesting first step on what will, no doubt, be a very long journey to improve the technology.
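To make assumption 3 concrete: if the stimulus-to-response mapping really were linear and well-conditioned, "backing out" the image would literally be a least-squares inversion. A toy sketch (entirely hypothetical numbers; real fMRI is nothing like this clean):

    import numpy as np

    rng = np.random.default_rng(0)
    n_voxels, n_pixels = 500, 100                  # toy sizes, nowhere near real data
    W = rng.normal(size=(n_voxels, n_pixels))      # hypothetical linear encoding model
    stimulus = rng.normal(size=n_pixels)           # the "image" we'd like to recover
    response = W @ stimulus                        # voxel activity under the toy model

    # If the mapping is (close to) bijective on the stimulus space, recovery is
    # just least squares -- this is the sense in which it would be "trivial".
    recovered, *_ = np.linalg.lstsq(W, response, rcond=None)
    print(np.allclose(recovered, stimulus))        # True in this idealized setting

The whole trouble, as you say, is that none of those assumptions hold, which is why the library-matching trick is a sensible workaround.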
While I think you're right, this is still pretty astonishing.
It's important to note that they're generalizing from a few hours of training data to millions of videos. So the classifier has to be picking up on something deep for it to be re-applied in such a flexible way.
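Right, and that seems to be the key trick: nothing is memorized. The model learns a feature-to-voxel mapping on the training movies, and that mapping can then score arbitrary library clips it has never seen. A hand-wavy two-stage sketch (the feature extraction, the regression model, and the file names are all placeholders, not the actual Gallant lab method):

    import numpy as np
    from sklearn.linear_model import Ridge

    # Stage 1: fit features -> voxels on a few hours of training clips.
    train_features = np.load("train_clip_features.npy")   # (n_train, n_features), placeholder
    train_voxels = np.load("train_voxel_responses.npy")   # (n_train, n_voxels), placeholder
    encoder = Ridge(alpha=1.0).fit(train_features, train_voxels)

    # Stage 2: score millions of *novel* library clips against one observed response.
    library_features = np.load("library_clip_features.npy")   # (n_library, n_features)
    observed = np.load("one_second_of_fmri.npy")              # (n_voxels,)

    predicted = encoder.predict(library_features)             # response each clip *would* evoke
    scores = -np.sum((predicted - observed) ** 2, axis=1)     # closer prediction = higher score
    best_clips = np.argsort(scores)[-100:]                    # feed these into the composite above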
I sort of imagine this approach as being akin to the way Bumblebee (the yellow VW Beetle in the first Transformers) lost his voice, but was able to communicate by switching between radio stations. As that recomposition process gets richer and richer, it starts to approximate the real signal...