I've been wrestling with embeddings for speaker identification recently, and it's not an area I know well. To improve my own understanding, and help anyone else who needs to get up to speed on the practical details, I've put together a Python notebook on Colab. It walks you through how the embeddings work, with inline examples that build into a simple "is this audio from the same person or not?" system. I hope it's helpful to a few of you out there! https://lnkd.in/gE8qWATB
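For anyone who wants the gist before opening the notebook, the core "same person or not?" check usually boils down to comparing two embedding vectors with cosine similarity. Here's a minimal sketch of that idea, using random vectors as stand-ins for real speaker embeddings (this is my own toy illustration, not code from the notebook):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 0.75) -> bool:
    # Embeddings from the same speaker cluster together, so a high
    # cosine similarity suggests the two clips share a speaker.
    # The 0.75 threshold is a placeholder; you'd tune it on labeled pairs.
    return cosine_similarity(emb_a, emb_b) >= threshold

# Toy example: fake "embeddings" for two clips of one speaker plus one other.
rng = np.random.default_rng(0)
speaker = rng.normal(size=256)
clip_a = speaker + 0.05 * rng.normal(size=256)  # noisy clip of the "speaker"
clip_b = speaker + 0.05 * rng.normal(size=256)  # another clip, same "speaker"
other = rng.normal(size=256)                    # a different "speaker"

print(same_speaker(clip_a, clip_b))  # expected: True (clips share a speaker)
print(same_speaker(clip_a, other))   # expected: False (different speakers)
```

In practice the embeddings would come from a pretrained speaker-encoder model rather than random vectors, but the comparison step looks the same.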
Sorta related: recently I've been looking into S2T models specifically trained on children's voices. Children are among those who could benefit the most from AI-enabled HCI, and yet I'm finding it's rare for these models to recognize children's voices.
I enjoyed this simple yet useful workshop. The part where you used a histogram to determine the threshold is my favorite.
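For anyone curious about that trick: if you have similarity scores for labeled same-speaker and different-speaker pairs, you can histogram both distributions and pick the threshold in the gap between them. A sketch with synthetic scores (my own illustration, not the workshop's code):

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic similarity scores standing in for real labeled pairs:
same_scores = rng.normal(loc=0.8, scale=0.05, size=500)  # same-speaker pairs
diff_scores = rng.normal(loc=0.3, scale=0.10, size=500)  # different-speaker pairs

def accuracy(t: float) -> float:
    # Balanced accuracy: accept same-speaker pairs, reject the rest.
    tp = np.mean(same_scores >= t)
    tn = np.mean(diff_scores < t)
    return (tp + tn) / 2

# Sweep candidate thresholds and keep the one where the two
# histograms overlap least (i.e. best balanced accuracy).
candidates = np.linspace(0.0, 1.0, 201)
best = max(candidates, key=accuracy)
print(f"chosen threshold: {best:.2f}")
```

Eyeballing the histograms gets you roughly the same answer; the sweep just makes the choice reproducible.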
Nice write-up. One small correction: preprocessing is needed for non-16 kHz audio, and I would add support for non-WAV files (MP3 and FLAC).
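On the 16 kHz point, a minimal resampling step might look like the sketch below, using SciPy's polyphase resampler on a synthetic signal (for actually decoding MP3/FLAC you'd typically reach for a library such as librosa or soundfile; this only covers the sample-rate conversion):

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

def to_16k(audio: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Resample a mono signal to 16 kHz before feeding it to the embedder."""
    if orig_sr == target_sr:
        return audio
    g = gcd(target_sr, orig_sr)
    # Polyphase resampling includes an anti-aliasing filter.
    return resample_poly(audio, target_sr // g, orig_sr // g)

# One second of a 440 Hz tone at 44.1 kHz, resampled down to 16 kHz:
sr = 44100
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
resampled = to_16k(tone, sr)
print(len(resampled))  # 16000 samples for one second of audio
```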
dude this kinda thing is so fun. Thanks for posting
Hey Pete Warden, creator of pyannote here! We should talk!
Avi Tuschman - wondering if this might be useful for Crickit
Diarization?
Highly appreciated!
A former PhD student of mine and I recently completed some work on something similar. There is a (typically) unused part of the spectrum that carries information about speaker identity, speaker gender, and phonemic identity. I bet it would improve your embeddings, or could even be used to create parallel embeddings (and then concatenate them into a global embedding). Let me know if you'd like to chat about it. (It also turned out to very reliably discriminate between natural speech and computer-generated speech, e.g., AI-cloned voices.)