Hacking Google’s Text To Speech “API”

When I was at my previous job, one task I had was localizing a large set of phrases to multiple languages, both in text and audio files. I did this by using the awesome Google Translate API.

The Google Translate website has features for translating text and playing audio of it in the translated language. There’s no official API for getting audio, though. Luckily, I’ve never let a lack of an official API stop me before.

I had read a few old blog posts about how Google’s undocumented TTS API could be used, albeit with a 100 character limit. Going over 100 characters would result in a truncated audio file, and some of the text I needed to output to audio was longer than that. It turns out that with a little time in Chrome’s web inspector, I could replicate the functionality of the Google Translate site.

The first thing to check out is the URL scheme of the audio files, which looks like this:

Breaking down the parameters, “ie” is the text’s encoding, “q” is the text to convert to audio, “tl” is the text language, “total” is the total number of chunks (more on that later), “idx” is which chunk we’re on, “textlen” is the length of the text in that chunk and “prev” is not really important.
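To make the parameter list concrete, here is a minimal Python 3 sketch that assembles a request URL for one chunk. The endpoint path, the zero-based `idx`, and the helper name `build_tts_url` are my assumptions rather than anything confirmed by Google, and `prev` is left out since it does not seem to matter.

```python
from urllib.parse import urlencode

# Assumed endpoint for the undocumented TTS service; it may change or disappear.
TTS_ENDPOINT = "http://translate.google.com/translate_tts"


def build_tts_url(chunk, lang, idx, total):
    """Return the request URL for one chunk of text (hypothetical helper)."""
    params = [
        ("ie", "UTF-8"),          # encoding of the text
        ("q", chunk),             # the text to convert to audio
        ("tl", lang),             # language of the text
        ("total", total),         # total number of chunks
        ("idx", idx),             # which chunk this is (assumed zero-based)
        ("textlen", len(chunk)),  # length of the text in this chunk
    ]
    # urlencode percent-escapes the values, so spaces and punctuation are safe.
    return TTS_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_tts_url("civil war", "en", 0, 1))
```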

The Google Translate site itself gets around its own character limit by breaking big blocks of text into “chunks”. It seems to try to break along punctuation, but for super long sentences it will also break in the middle of a sentence, which ends up sounding pretty weird. Using the Gettysburg Address as an example, Google makes a separate request for just the chunk “civil war”.
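A rough approximation of that chunking behavior, as a Python 3 sketch: prefer a break after the last punctuation mark that fits under the limit, fall back to the last space, and hard-split only as a last resort. This is my guess at the strategy, not Google's actual algorithm, and `split_text` and `PUNCTUATION` are hypothetical names.

```python
MAX_CHARS = 100          # character limit reported for the undocumented endpoint
PUNCTUATION = ".,;:!?"   # marks treated as natural break points (an assumption)


def split_text(text, max_chars=MAX_CHARS):
    """Split text into chunks of at most max_chars characters."""
    chunks = []
    remaining = " ".join(text.split())  # collapse runs of whitespace
    while len(remaining) > max_chars:
        window = remaining[:max_chars]
        # Prefer to cut just after the last punctuation mark in the window.
        cut = max(window.rfind(mark) for mark in PUNCTUATION) + 1
        if cut == 0:
            # No punctuation found: cut at the last space instead.
            cut = window.rfind(" ")
        if cut <= 0:
            # No space either (one very long word): hard split.
            cut = max_chars
        chunks.append(remaining[:cut].strip())
        remaining = remaining[cut:].strip()
    if remaining:
        chunks.append(remaining)
    return chunks


if __name__ == "__main__":
    opening = ("Four score and seven years ago our fathers brought forth on "
               "this continent, a new nation, conceived in Liberty, and "
               "dedicated to the proposition that all men are created equal.")
    for chunk in split_text(opening):
        print(len(chunk), chunk)
```

Cutting at the last punctuation mark keeps chunks as long as possible, which means fewer requests; the space fallback is what produces the odd mid-sentence breaks.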

Gettysburg Address

In order to download audio for longer blocks of text, I wrote a Python script that broke the text down into chunks and made a separate request to Google for each one. The script would append the audio from every chunk to a single output file, and somehow, it worked! Just to be safe, I also set my script up to use Google’s Flash player as the referer (sic) and set the user agent to a version of Firefox.
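The download-and-append loop could look something like the following Python 3 sketch. It is not the original script: the endpoint, the Referer value, the User-Agent string, and the names `fetch_chunk` and `save_speech` are all placeholders I picked for illustration, and since the endpoint is undocumented it may reject or change these requests at any time.

```python
import urllib.request
from urllib.parse import urlencode

# Assumed endpoint for the undocumented TTS service.
TTS_ENDPOINT = "http://translate.google.com/translate_tts"

# Placeholder headers: the original script used Google's Flash player as the
# referer and a Firefox user agent; these exact values are illustrative only.
HEADERS = {
    "Referer": "http://translate.google.com/",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) "
                  "Gecko/20100101 Firefox/115.0",
}


def fetch_chunk(url):
    """Download the audio bytes for one chunk URL."""
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request) as response:
        return response.read()


def save_speech(chunks, lang, out_file, fetch=fetch_chunk):
    """Request audio for each chunk and append it to out_file in order."""
    total = len(chunks)
    for idx, chunk in enumerate(chunks):
        query = urlencode({
            "ie": "UTF-8",
            "q": chunk,
            "tl": lang,
            "total": total,
            "idx": idx,            # assumed zero-based
            "textlen": len(chunk),
        })
        out_file.write(fetch(TTS_ENDPOINT + "?" + query))
```

Usage would be along the lines of `with open("speech.mp3", "wb") as f: save_speech(chunks, "en", f)`. The `fetch` parameter is there so the loop can be exercised with a stand-in function instead of hitting the network.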

At the time, I didn’t want to release the code as it was being used for some uber top secret stuff. But since I’m not working on that project anymore, I refactored the original code into a command line Python script. Along the way I had to learn how to use Python’s argparse, which is a pretty neat way of parsing command line arguments.
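For anyone who has not used argparse before, here is a small Python 3 sketch of what a command line interface for a script like this might look like. The flag names and defaults are hypothetical and are not necessarily the ones the actual script uses.

```python
import argparse


def parse_args(argv=None):
    """Parse command line options (hypothetical flags for illustration)."""
    parser = argparse.ArgumentParser(
        description="Convert text to speech, one chunk at a time.")
    parser.add_argument("-l", "--language", default="en",
                        help="language code of the text (default: en)")
    parser.add_argument("-o", "--output", default="out.mp3",
                        help="file to write the audio to (default: out.mp3)")
    # Exactly one source of text must be given: a file or a literal string.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-f", "--file", help="path to a text file to read")
    source.add_argument("-s", "--string", help="text to speak")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(args.language, args.output, args.file, args.string)
```

The mutually exclusive group means argparse itself reports an error if both or neither of `--file` and `--string` are passed, so the script does not need to validate that by hand.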

The project is available on GitHub right now, so go grab it and try it out. If you’re curious what the output sounds like, here’s a recording of female Abraham Lincoln reciting the Gettysburg Address (yes, she mispronounces some words). One fun thing to try is using clashing input and output languages. Here’s Female Japanese Abraham Lincoln reciting the same speech (she just seems to be spelling words, slacker).

If you enjoyed this hack, let me know and I could post some other ones I’ve been working on. And if you find a way to improve the code (probably not difficult at all), go ahead and submit a pull request on GitHub. And if you’re from Google, please don’t shut down my Gmail and AdSense accounts.

16 thoughts on “Hacking Google’s Text To Speech “API”

  1. This will be enormously helpful for people who want to talk instead of type, for example children and senior citizens who find pecking keys frustrating. I have several ideas to help non-writers write, which could be the next big thing in publishing: people telling/writing stories instead of sending text messages.

  2. I’m getting the following error in Python 3.3 on Windows:

    “Traceback (most recent call last):
    File “GoogleTTSv2.py”, line 6, in
    import urllib, urllib2
    ImportError: No module named ‘urllib2′”

    I cannot find urllib2 for Python 3.3 anywhere.

    1. Oops. I think I replied via email but I’ll also reply here:

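      I wrote this for Python 2.7, so you’ll have to change things around for 3.3. In Python 3, urllib2 no longer exists; its functionality moved into urllib.request and urllib.error, and urlencode now lives in urllib.parse, so the imports and some of the calls will need updating :)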

  3. This is really helpful! I’m also interested in choosing gender too, though. You had a female voice reading the Gettysburg Address, but the script produces male output only. Can we change this?

  4. Werd, you slapped something together, glad to see some other python hacker wanted to get something like this going! I was messing around on something similar that uses the Natural Language Toolkit ( http://nltk.org ) to do fully proper multilingual sentence tokenization, but the project stalled cause I got more important work to do. Will let you know if I ever get back to it.

  5. Thanks to you, Python, and OpenCV, I am able to help a blind guy see his Magic cards without revealing them to his opponent (for the first time in years). I am also setting my project up in my store so people can identify cards for sale. I’ve spent a lot of time on the CV part, and thanks to you I have avoided a lengthy delve into the very API you used for voice!! I have a working project now, soon to be up on GitHub once I get the knack of that. (I’m not a professional software guy, but I’m getting up to speed in a few areas.) I also considered ‘speeks’ for doing the voice synthesis locally, but this web-based approach actually ends up with less overhead, and it sounds better too. I didn’t expect latency to be so low. Open source FTW!

      1. I still don’t have the GitHub repo up, but I did manage to get a rigged-up installer made along with some other improvements. I am afraid it will seem amateurish to any pros, but it works pretty well once you get it installed. Check the README for directions. Since your text-to-speech thing is such a cool addition to my project, I’ll put the Dropbox folder link here for you or whoever delves this deeply into the comments of your blog.

        https://www.dropbox.com/sh/lxl3a3rlaaxl5mo/Xn1-jDYJUa

        You don’t have to download all 3.5 gigs of card images unless you want the complete Magic: The Gathering catalog. Just copy the whole HydraMatic directory (minus card sets you don’t want in ‘pics’ folder) and from there run ‘python crabby.py’

  6. I discovered this myself a while ago and ever since then I had it as a search engine entry in Google Chrome.

    Right click on address bar (aka omnibox) > Edit search engines…

    Name: Voice
    Keyword: voice
    URL: http://translate.google.com/translate_tts?&tl=en-US&ie=UTF-8&q=%s

    Useful for whenever you wanna blast some message across the house through your loudspeakers – plus it’s in a pleasantly robotic voice instead.

    Don’t miss trying to emulate another language by typing weird stuff in English to make it phonetically similar!
