Yep, makes sense - converting to text and then aligning the text with the audio is a very reasonable way to handle large volumes of speech data. For bioacoustics, we tend to have a loooooot of variation for which there is no real notation, and the recordings may come from areas where we haven't seen much training data, or from taxa that don't have lots of scientists working on them (e.g., insects). So working with the raw audio embeddings tends to be best.