I would love to work on something like that; I have a couple of ideas about how to implement it.
However, I don't know if I'll have the time, as there are so many concepts I'm trying to channel through code. Hopefully, as I build more computational tools, I'll increase my bandwidth.
Providing visual tools for phonetic assistance is definitely something I've had in mind for a while, and solving the dataset-building problem should take care of that along the way.
As of now, you can:
1) set the amount_sphere_tube slider to 1
2) decrease the playback_rate to 0.1 (in the GUI on the right)
3) on the media controller, go to options (the naming might vary depending on the browser) and set the playback rate to normal or x1.
This should give you maximal temporal resolution.
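For anyone curious, the programmatic side of the playback-rate steps is just the standard media-element control. A minimal sketch, where audioEl is a placeholder name rather than the actual code:

    // Stretch time 10x for finer temporal inspection.
    // `audioEl` is a stand-in for whatever element the page plays through.
    const audioEl = document.querySelector('audio') as HTMLAudioElement;
    audioEl.playbackRate = 0.1;
    // Browsers pitch-correct by default; turning that off keeps the
    // spectral content honest at slow rates (support varies by browser).
    if ('preservesPitch' in audioEl) {
      audioEl.preservesPitch = false;
    }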
One way to highlight features further would be to generate embeddings from magenta ddsp[0]. Speech sounds are fairly complex though, so I don't know how well models built on music data would generalize to them.
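To make that concrete: DDSP models condition on frame-wise f0 and loudness, and you can approximate those directly without the model. A rough sketch with a naive autocorrelation pitch tracker (the names are mine, not magenta's API):

    // Frame-wise loudness (RMS in dB) and f0 (autocorrelation peak),
    // i.e. the kind of conditioning features DDSP works from. Sketch only.
    function frameFeatures(samples: Float32Array, sampleRate: number,
                           frameSize = 1024, hop = 256) {
      const f0: number[] = [];
      const loudness: number[] = [];
      for (let start = 0; start + frameSize <= samples.length; start += hop) {
        const frame = samples.subarray(start, start + frameSize);
        let sumSq = 0;
        for (const s of frame) sumSq += s * s;
        loudness.push(20 * Math.log10(Math.sqrt(sumSq / frameSize) + 1e-10));
        // Search lags covering ~60-500 Hz, the rough speech f0 range.
        const minLag = Math.floor(sampleRate / 500);
        const maxLag = Math.min(Math.floor(sampleRate / 60), frameSize - 1);
        let bestLag = 0, bestCorr = 0;
        for (let lag = minLag; lag <= maxLag; lag++) {
          let corr = 0;
          for (let i = 0; i + lag < frameSize; i++) corr += frame[i] * frame[i + lag];
          if (corr > bestCorr) { bestCorr = corr; bestLag = lag; }
        }
        f0.push(bestLag ? sampleRate / bestLag : 0); // 0 = unvoiced / no peak
      }
      return { f0, loudness };
    }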
I think the tech to do this is there, but for the time being it seems to be scattered across fairly siloed fields.
I also tried using live voice recordings, but there are latency issues with the subset of the Web Audio API I'm currently using. There is definitely value, though, in live, spatial feedback on pronunciation.
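If I revisit the live-input path, AudioWorklet is probably the way around the latency, since it runs on the audio rendering thread in 128-sample blocks instead of going through the main thread. An untested sketch (the module and processor names are placeholders):

    // main thread -- tap the microphone through a worklet node.
    async function startLiveInput() {
      const ctx = new AudioContext({ latencyHint: 'interactive' });
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
      await ctx.audioWorklet.addModule('recorder.js'); // placeholder file name
      const src = ctx.createMediaStreamSource(stream);
      const tap = new AudioWorkletNode(ctx, 'recorder-processor');
      tap.port.onmessage = (e) => {
        // e.data: Float32Array of 128 samples (~2.9 ms at 44.1 kHz),
        // ready to feed the visualization.
      };
      src.connect(tap);
    }

    // recorder.js -- posts each 128-sample block back to the main thread.
    class RecorderProcessor extends AudioWorkletProcessor {
      process(inputs: Float32Array[][]) {
        if (inputs[0][0]) this.port.postMessage(inputs[0][0].slice());
        return true;
      }
    }
    registerProcessor('recorder-processor', RecorderProcessor);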
[0] https://github.com/magenta/magenta-js/tree/master/music#ddsp