
I'd do this using both streams of information: audio and video.

I'd segment the audio semantically based on the topic of discussion, and I'd segment the video based on editing, subjects in scene, etc. We could start simply and just have a "timestamp": [subjects, in, frame] key-value mapping.
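For concreteness, here's a rough sketch of what those two per-stream outputs might look like; the Segment structure and the example labels are assumptions for illustration, not output from any particular library.

    # Rough sketch of the two per-stream outputs. The labels are made-up
    # placeholders for whatever the topic/subject detectors would emit.
    from dataclasses import dataclass

    @dataclass
    class Segment:
        start: float        # seconds into the recording
        end: float
        labels: list[str]   # topics (audio) or subjects in frame (video)

    # Audio: semantic segments keyed by topic of discussion.
    audio_segments = [
        Segment(0.0, 45.0, ["intro"]),
        Segment(45.0, 180.0, ["main topic"]),
    ]

    # Video: segments from cuts / subjects in scene, i.e. the simple
    # "timestamp": [subjects, in, frame] mapping.
    video_segments = [
        Segment(0.0, 30.0, ["host"]),
        Segment(30.0, 110.0, ["host", "slides"]),
        Segment(110.0, 180.0, ["guest"]),
    ]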

It'd take some fiddling to sort out how to mesh these two streams of data back together. The first thing I'd try is segmenting by time chunks (the resolution of which would depend on the min/max segment lengths in the video and audio streams) and then clumping the time chunks together based on audio+video content.
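Continuing the sketch above, a minimal version of that chunk-and-clump pass could look like the below; the 10-second chunk length and the "merge if any labels overlap" rule are both assumptions you'd want to tune against the actual segment lengths.

    # Minimal sketch of the time-chunk merge, reusing the Segment lists
    # from the sketch above. Chunk length and the overlap rule are
    # placeholders, not a tested recipe.

    def labels_in_window(segments, start, end):
        """Collect labels from any segment overlapping [start, end)."""
        found = set()
        for seg in segments:
            if seg.start < end and seg.end > start:
                found.update(seg.labels)
        return found

    def chunk_and_clump(audio_segments, video_segments, chunk_len=10.0):
        total = max(s.end for s in audio_segments + video_segments)

        # Slice the timeline into fixed-size chunks, tagging each chunk
        # with the combined audio+video labels that overlap it.
        chunks = []
        t = 0.0
        while t < total:
            labels = (labels_in_window(audio_segments, t, t + chunk_len)
                      | labels_in_window(video_segments, t, t + chunk_len))
            chunks.append((t, min(t + chunk_len, total), labels))
            t += chunk_len

        if not chunks:
            return []

        # Clump adjacent chunks whose combined content overlaps.
        clumps = [chunks[0]]
        for start, end, labels in chunks[1:]:
            prev_start, prev_end, prev_labels = clumps[-1]
            if labels & prev_labels:
                clumps[-1] = (prev_start, end, prev_labels | labels)
            else:
                clumps.append((start, end, labels))
        return clumps

    print(chunk_and_clump(audio_segments, video_segments))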


