I’m very proud to be married to the amazing modern dance choreographer and educator Renay Aumiller. From time to time, Renay and I collaborate to produce works of modern dance that are accompanied by original music. Our most recent piece, By Chance, involves feeding suggestions from the audience into a web application that I wrote, which randomly generates a series of short dance vignettes.
This piece was originally called Out of the Blue because we first performed it in a room in a warehouse space with dark blue walls. In a previous blog post, I talked about my process as I was creating the music, and I shared some excerpts.
An interesting aspect of the piece that I haven’t touched on yet is that it features a text-to-speech narrator. Renay’s idea was that she would converse with a robotic-sounding narrator while she performed, which would both highlight the relationship between humans and computers and add a little bit of humor to the performance. To make the pace of the conversation sound more natural, I would cue each portion of dialogue from my laptop.
I had used the `say` command in the past to generate text-to-speech audio. The `say` that comes with macOS is great. But I’ve been developing in an Ubuntu environment for several years now, and I was curious to explore the text-to-speech options available for Linux.
A while back, I installed some audio package (I forget which one), and unbeknownst to me, one of the package’s dependencies was GNUstep, a software bundle that provides a `say` command, among other things.

I gave GNUstep’s `say` a try, and the results were underwhelming compared to the macOS version. It just didn’t sound natural enough. I think it’s wonderful that the Free Software Foundation implemented an OSS replacement for the macOS `say`, but in an era where children grow up talking to natural-sounding TTS personalities like Siri and Alexa, GNUstep’s `say` can’t help but sound robotic and corny by comparison.
As I was looking around for other options, I saw that Google Cloud has a Text-to-Speech product. The speech quality is excellent, but it is a paid product that requires a Google Cloud account if you want to use the Text-to-Speech API directly.
It turns out, however, that Google Translate uses the same Text-to-Speech engine (albeit with fewer options), and you can use it out of the box with a lot less ceremony (and for free, to boot).
There are multiple command-line tools available that implement this method of using Google Text-to-Speech via Google Translate. I ended up using `google_speech`.
I collected the chunks of narrated text in a bunch of simple, plain text files in a directory:
Each file contains anywhere from 1 to 119 lines of narration, depending on how much the narrator has to say in that section. Each line of the file is either empty or it contains an isolated sentence or phrase. I noticed that by inserting empty lines, I could introduce pauses. This was useful because, as Renay and I rehearsed her dialogue with the TTS narrator, we often found that we needed to adjust the amount of pause time between the narrator’s utterances, and doing so was as easy as adding or removing empty lines.
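For a concrete (invented) example, a narration file might look like this, with the blank lines producing pauses:

```
Welcome to By Chance.

Every performance of this piece is different.


Let's begin.
```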
I put together a quick Bash script to narrate every line of every text file, in order, by piping each line through `google_speech`.
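The script was roughly along these lines (a sketch: the `narration/` directory name, the one-second pause, and passing each line to `google_speech` as an argument are all assumptions, not the original script):

```shell
#!/usr/bin/env bash
# Speak every line of every narration file, in order.
narrate_file() {
    local line
    while IFS= read -r line; do
        if [ -z "$line" ]; then
            sleep 1                 # empty line: insert a pause
        else
            google_speech "$line"   # speak one sentence or phrase
        fi
    done < "$1"
}

for f in narration/*.txt; do
    if [ -e "$f" ]; then
        narrate_file "$f"
    fi
done
```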
This worked great, but there was another requirement: we needed to be able to pause for a variable amount of time in between narration files, and continue only when I sent a cue.
That’s where the FIFO comes in.
There is a neat feature of Unix called named pipes, which are also commonly known as FIFOs because of their first in, first out behavior (not to mention the fact that you can create them by using a command called `mkfifo`). A named pipe is a special type of file that can be used like a pipe: you can put things into it and take things out of it in a first in, first out fashion.
The best way to understand how named pipes work is to try them yourself. You’ll need two terminals for this. In the first terminal, create a FIFO and use `cat` to print the first data that comes out of the pipe:
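Assuming we call the pipe `my_fifo` (the name is arbitrary), that looks like:

```
$ mkfifo my_fifo
$ cat my_fifo
```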
You’ll notice that the terminal “hangs” after the `cat` command. That’s because there is nothing available to print just yet.
Now, in another terminal, put some data into the pipe:
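For example, using `echo`:

```
$ echo "hello, fifo" > my_fifo
```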
You’ll see `hello, fifo` printed in the first terminal. Nice job, you just moved data from one terminal to another!
When you’re done, you can remove the FIFO with `rm`, just like any other file.
In addition to shoveling data from one process to another, you can also use a FIFO to send signals. One process receives continuously from the FIFO, and it does work of some kind whenever it receives a signal.
“Bang” is a term that I took from the multimedia signal processing language Pure Data. A “bang” is a signal with no content that is used to control timing.
The FIFO narrator
It occurred to me that I could use this type of signaling to cue the TTS narration blurbs in By Chance. I wrote a little `fifo-narrator` script that receives “bangs” from a FIFO in a loop, and each time it receives a “bang,” it narrates the next file in the narration directory.
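A sketch of what such a script can look like (the FIFO path, directory name, and the single `google_speech` call per file are assumptions; the real script spoke each file line by line):

```shell
#!/usr/bin/env bash
# fifo-narrator sketch: block on a FIFO before narrating each file,
# so each section waits for a cue.
FIFO="${FIFO:-/tmp/narration_fifo}"
NARRATION_DIR="${NARRATION_DIR:-narration}"

run_narrator() {
    local f
    for f in "$NARRATION_DIR"/*.txt; do
        [ -e "$f" ] || continue
        read -r _ < "$FIFO"           # reading from a FIFO blocks until a bang arrives
        google_speech "$(cat "$f")"   # narrate the next file
    done
}

# Only start the loop if the FIFO already exists.
if [ -p "$FIFO" ]; then
    run_narrator
fi
```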
To send the signals, I set up an i3 keybinding that puts the word “bang” into the FIFO whenever I press the mod key and `;` at the same time:
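The binding looked something like this (the FIFO path here is an assumption):

```
bindsym $mod+semicolon exec --no-startup-id sh -c 'echo bang > /tmp/narration_fifo'
```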
And with that, I was able to control the length of the pauses, allowing the narration to proceed by pressing mod + `;`.
Our first performance of By Chance (at the time called Out of the Blue) in January 2019 was a smash success! We performed in a warehouse space where we set up the lighting and sound system ourselves. I ran the sound from my laptop out of the headphone jack and into a small bass amplifier, which worked flawlessly. We were the only performers in that particular part of the warehouse, which meant that we were able to set everything up for the tech rehearsals, leave it in place for the three performances that weekend, and double-check right before each performance that everything was still working. All three performances went off without a hitch! Renay danced wonderfully, and the crowds were engaged with the idea and really seemed to enjoy it.
We performed the piece again in November, but unfortunately, it was marred by technical difficulties. Partway through the introductory text-to-speech narration, the narrator stopped unexpectedly. In a panic, I flipped over to the terminal window where my `fifo-narrator` script was running, and there I was presented with a long Python stacktrace. Someone in the crowd went “hey, that’s Python!” Being already in the middle of a performance, Renay had to improvise and stumble onward without having a narrator to converse with. Afterward, she was furious with me.
I hadn’t thought about the fact that you need to have access to the Internet in order to use Google Translate’s text-to-speech API.
Recording the output
Incidentally, it turns out that the `google_speech` CLI tool works offline by caching the audio synthesized from text input that you’ve handed to it before. Performance #1 had (luckily) gone off without a hitch despite my laptop not being connected to the internet at the time. The text-to-speech audio was being played back from the cache from when we had rehearsed at home.
During performance #2, I think the cache failed, somehow. (Maybe it was a TTL cache and it just happened to expire at the worst possible time? Who knows?)
It became clear that our performance should not depend on the reliability of a network connection or the availability of the Google Text-to-Speech service. A better approach would be to record the text-to-speech narration ahead of time and simply play it back during the performance. I could use the same FIFO setup to cue moving from each audio file to the next.
To start, I did a rough calculation of the average length of the pauses between the lines of text being narrated. Then I wrote a script that generated silent wav files with a duration of average pause length × number of consecutive empty lines, in addition to using `google_speech` to output mp3s of each non-empty line being narrated, and then I stitched everything together into a single mp3 per narration text file.
I was able to do all of this audio file Frankensteining with the help of a very handy command-line tool called SoX. SoX describes itself as “the Swiss Army knife of sound processing programs,” which is an appropriate description. SoX can read and write audio files, convert to and from various audio formats, and even apply some sound effects.
After spending some time with the man page and a little bit of trial and error, I was able to put together the commands to do what I needed to do:
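The commands were along these lines (a sketch, assuming SoX is installed; here synthesized tones stand in for the real `google_speech` mp3s, and the sample rate and durations are assumptions; note that the inputs to a SoX concatenation must share the same sample rate and channel count):

```shell
# Generate 1.5 s of silence from SoX's null input device.
sox -n -r 24000 -c 1 silence.wav trim 0.0 1.5
# Stand-ins for two narrated lines (the real files came from google_speech).
sox -n -r 24000 -c 1 line1.wav synth 0.5 sine 440
sox -n -r 24000 -c 1 line2.wav synth 0.5 sine 660
# Concatenate speech, pause, speech into one file per narration text file.
sox line1.wav silence.wav line2.wav section.wav
```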
SoX also provides `play` and `rec` commands, which allow you to play or record an audio file from the command line. Playing an mp3 file is as simple as running `play some_file.mp3`.
I ended up with two scripts:

- One uses `sox` to generate a bunch of mp3 files, each containing a portion of the Text-to-Speech narration for the performance.
- `fifo_narration` waits for “bang” signals to arrive on the FIFO, and each time it receives one, it runs `play "$filename"`, where `$filename` is the next file in the narration output directory.
In January 2020, we performed By Chance for the third time, and with the above setup in place, the performance went much more smoothly! This time, we were more confident, knowing that our Text-to-Speech narrator wouldn’t let us down in the heat of the moment.
There are a variety of good command-line tools available that can generate text-to-speech audio. The ones that leverage the Google Translate text-to-speech API are particularly nice.
When you’re doing something important like a live performance, you should never rely on the availability of an Internet connection, if you can help it!
Named pipes (a.k.a. FIFOs) can be a wonderful tool when you’re writing Bash scripts and you need a simple queuing mechanism.
SoX (the Swiss Army knife of audio manipulation) is handy when you want to stitch together audio files at the command line.
Writing reliable software isn’t just a good practice; in rare situations, it might just score you some bonus points with your spouse!
Reply to this tweet with any comments, questions, etc.!