Introducing The Talkboy Ultra! An AI Powered Voice Cloning Toy
Note: This blog post is based on a YouTube video I made, which you can watch in the embed, but it also has some stuff that’s not in the video, so read the stuff I wrote below!
So while I was learning about AI voice cloning technology to make myself sing better (and to make Kanye sing stupid songs), I got an idea into my head that I couldn’t let go of. I was reminiscing about the original Talkboy Deluxe from the movie Home Alone 2 (and the subsequent commercial that played on kids’ TV for years after), and how it was supposed to change your voice but really it just kinda slowed it down. I realized that I could make the dream of the Talkboy come true, and do it with real-life hardware.
So here’s the story of how I made the Talkboy Ultra, which it turns out is a pretty fun toy.
Building The Dream
I knew that I wanted the Talkboy to be a real device, not just some app on a phone. Phone apps are cool, but there’s just something about single-purpose devices that hits different. They’re more fun, and there’s something satisfying about pressing a button that’s designed for just one job.
I was hoping that I could run everything locally on a device, but the voice AI inference requires too much memory, so I went with a client-server approach. I started with a Raspberry Pi Zero W as the base, and added a USB audio card with input and output.
I connected a lavalier microphone and tested recording and playback. The recording worked fine but when I plugged in an unamplified speaker, the audio was way too quiet. I decided that instead of figuring out how to amplify an audio signal, I would just go with a USB powered speaker.
With this setup, I could successfully record and play back my voice (after writing some Python code to do that).
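A minimal sketch of what record-and-playback code could look like on the Pi, assuming the standard ALSA command-line tools `arecord` and `aplay` are installed. The device name `plughw:1,0` is a guess at where the USB audio card shows up (check `arecord -l`), and the helper names are made up for illustration, not taken from the actual project.

```python
import subprocess

# Assumption: the USB audio card is ALSA card 1, device 0.
AUDIO_DEVICE = "plughw:1,0"


def record_command(path, device=AUDIO_DEVICE, rate=44100):
    """Build an arecord command that captures 16-bit mono audio to a WAV file."""
    return ["arecord", "-D", device, "-f", "S16_LE", "-r", str(rate), "-c", "1", path]


def playback_command(path, device=AUDIO_DEVICE):
    """Build an aplay command that plays a WAV file through the same card."""
    return ["aplay", "-D", device, path]


def start_recording(path):
    """Start capturing in the background; arecord runs until it is stopped."""
    return subprocess.Popen(record_command(path))


def stop_recording(process):
    """Ask arecord to stop and wait for it to exit."""
    process.terminate()
    process.wait()


def play(path):
    """Play a file and block until playback finishes."""
    subprocess.run(playback_command(path), check=True)
```

Starting the recording as a background process fits a press-and-hold button: call `start_recording` on press and `stop_recording` on release.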
From this foundation, I added a few push buttons for the recording and playback controls. I also wanted the device to work without having to plug it into a power source (for portability), so I added a 1000 mAh lithium-ion battery pack and a TP4056 charging module with a micro USB input to charge the battery. I also added an MT3608 DC-DC step-up boost converter so I could power the Raspberry Pi with it (since the battery only puts out 3.7 V and the Pi needs 5 V).
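For a rough feel of how long a battery like this might last, here is a back-of-the-envelope estimate. The 85% converter efficiency and the 1 W average load are placeholder assumptions, not measurements from the Talkboy Ultra, so treat the result as a ballpark only.

```python
def estimated_runtime_hours(capacity_mah, battery_volts, boost_efficiency, load_watts):
    """Estimate runtime from the battery's energy and the average power draw."""
    energy_wh = capacity_mah / 1000 * battery_volts  # mAh -> Ah, then Ah * V = Wh
    usable_wh = energy_wh * boost_efficiency         # losses in the step-up converter
    return usable_wh / load_watts


# 1000 mAh at 3.7 V is 3.7 Wh of stored energy.
# The efficiency (0.85) and load (1.0 W) below are assumed values.
hours = estimated_runtime_hours(1000, 3.7, boost_efficiency=0.85, load_watts=1.0)
```

With those placeholder numbers the estimate comes out to roughly 3 hours; a heavier load or a less efficient converter shortens it proportionally.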
Finally, I added a 1602 LCD display to show the state of the device, either the currently selected voice or the processing state. And to be able to control the power supply to the toy, I added an on/off switch. I soldered the components together and amazingly they worked the first time!
I couldn’t just let the device sit in the cardboard box that I was using for prototyping, so I also designed and 3D printed a case for it. I started with a blocky box just to test how the components like the buttons could fit in the holes, then I refined the case to have some smooth, rounded edges. I also added a microphone enclosure thing and a handle so I could hold it with one hand, just like the original!
I used Tinkercad for this, which works pretty well once you learn the controls.
Of course, I worked on some software in order to make the toy actually do stuff. I wrote everything in Python since it’s pretty portable and runs well on both my laptop and my Raspberry Pi.
On the Raspberry Pi, I wrote some code to handle button presses for recording, playback and voice selection. I also had to write some code to make HTTP requests between the Raspberry Pi and my laptop (web server).
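As an illustration of the HTTP side, here is one way the Pi could package a recording for the server using only the standard library. The server address, the `/convert` path, and the idea of sending the WAV as the raw request body with the voice name in a `voice` query parameter are all assumptions for this sketch; the real project may have done it differently.

```python
import urllib.parse
import urllib.request

# Hypothetical address of the laptop running the web server.
SERVER_URL = "http://laptop.local:5000/convert"


def build_convert_request(wav_bytes, voice, server_url=SERVER_URL):
    """Build a POST request carrying the recording and the chosen voice."""
    query = urllib.parse.urlencode({"voice": voice})
    return urllib.request.Request(
        f"{server_url}?{query}",
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )


def convert_voice(wav_bytes, voice, server_url=SERVER_URL, timeout=60):
    """Send the recording to the server and return the converted audio bytes."""
    request = build_convert_request(wav_bytes, voice, server_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Keeping the request building separate from the network call makes it easy to check the URL and payload without a server running.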
The basic workflow is: I press and hold the record button, and say whatever it is I want to hear in a different voice. When I release the button, the recording stops and the Talkboy sends the audio file to my web server along with the voice model to use. The LCD switches from the currently selected voice to “Processing….” The web server receives the audio file and runs the inference on it, then sends the response with the changed voice file back to the Talkboy. The LCD switches back to the voice display to show that it’s done processing. Then when I hit the green playback button, I can hear the changed voice.
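The workflow above can be sketched as a small hardware-independent class. Everything here is illustrative: the class and method names are invented, `convert` stands in for whatever sends the audio to the server, and `show` stands in for whatever writes text to the LCD.

```python
class Talkboy:
    """Tracks the selected voice, the LCD text, and the last converted clip."""

    def __init__(self, voices, convert, show):
        self.voices = list(voices)
        self.convert = convert  # callable(wav_bytes, voice) -> converted bytes
        self.show = show        # callable(text) -> updates the display
        self.voice_index = 0
        self.converted = None
        self.show(self.current_voice)

    @property
    def current_voice(self):
        return self.voices[self.voice_index]

    def next_voice(self):
        """Voice-select button: cycle to the next voice and display it."""
        self.voice_index = (self.voice_index + 1) % len(self.voices)
        self.show(self.current_voice)

    def recording_finished(self, wav_bytes):
        """Record button released: convert the clip, then restore the display."""
        self.show("Processing...")
        try:
            self.converted = self.convert(wav_bytes, self.current_voice)
        finally:
            # Go back to the voice name even if the request failed.
            self.show(self.current_voice)

    def playback_clip(self):
        """Playback button: return the last converted clip, or None if none yet."""
        return self.converted
```

Passing `convert` and `show` in as plain callables means the same logic can be exercised on a laptop with fakes before wiring it to real buttons and a real display.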
I already had some voice models and the script to infer audio from one voice to another set up since I used it in my previous project on AI voice cloning. I just had to write a simple Flask web server to respond to the HTTP requests, run the inference script, and then send the file back to the Talkboy (so simple).
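To show the shape of such a server without extra dependencies, here is a stand-in built on Python’s standard `http.server` module rather than Flask. The `/convert` path, the `voice` query parameter, and the raw WAV request body are assumptions for this sketch, and `run_inference` is a placeholder that just echoes the audio back where the real inference script would run.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import parse_qs, urlparse


def run_inference(wav_bytes, voice):
    """Placeholder for the real voice conversion; returns the input unchanged."""
    return wav_bytes


class ConvertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        url = urlparse(self.path)
        voice = parse_qs(url.query).get("voice", [""])[0]
        if url.path != "/convert" or not voice:
            self.send_error(400, "expected POST /convert?voice=NAME")
            return
        # Read exactly the number of bytes the client said it was sending.
        length = int(self.headers.get("Content-Length", 0))
        wav_bytes = self.rfile.read(length)
        converted = run_inference(wav_bytes, voice)
        self.send_response(200)
        self.send_header("Content-Type", "audio/wav")
        self.send_header("Content-Length", str(len(converted)))
        self.end_headers()
        self.wfile.write(converted)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def make_server(host="0.0.0.0", port=5000):
    """Create the server; call serve_forever() on the result to run it."""
    return ThreadingHTTPServer((host, port), ConvertHandler)
```

Running `make_server().serve_forever()` on the laptop would listen on port 5000 (an arbitrary choice here) for recordings from the Pi.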
The delay between finishing the recording and being ready to play back the voice depends on how long the recording is. I would say that the total latency is probably around 5-15 seconds for a typical short recording, which isn’t too bad. I could probably optimize the speed by streaming audio to and from the web server but that’s more complicated and I’m not really willing to do that just for a stupid gag project like this.
So I guess I should include some examples of the voice changer here. My previous blog post had some audio examples of me singing some songs, but here are some before and after clips of the audio I used in the YouTube video.
Arnold’s quote from Kindergarten Cop:
Picard ordering Taco Bell:
Homer Simpson having a midnight snack:
Just as an aside, from the above models, I trained the Arnold Schwarzenegger and Patrick Stewart voices (both from audio that I found in audiobook clips). The Simpsons ones that I used in the video were downloaded. I didn’t train my models for a super large number of steps, so that might explain why the downloaded Simpsons ones sound very close to the real deal, whereas mine are just okay.
I had a lot of fun with this project. It really ended up combining a bunch of my interests, including hardware hacking, software (both on an embedded device and on a server), and utilizing some state-of-the-art AI models too! I also got to 3D print a case that’s much more complicated than anything else I’ve made myself.
It was fun coming up with this challenge and then solving all of the problems that came with it, including fitting everything in the case, and lining up all of the components like the screen and buttons.
While I think it would be really interesting to see this device hit the mainstream market, I don’t think it will happen any time soon. For one thing, the licensing of real voices would probably be an issue. Plus the hardware to run the inference on a device doesn’t exist as far as I know. I think you might be able to do it on an iPhone if you shrank the model, but that wouldn’t be nearly as fun as using a dedicated device.
I’m hoping that I’m wrong though, and that a 30th anniversary Talkboy makes its way to the toy stores, complete with the latest in AI tech!