About a month ago, I started looking for a lip-syncing solution that I wouldn't have to break the bank over. I had cooked up a script for Unity that would read a text document with lip-sync timings, and then automatically animate a character's mouth based on pre-defined functions. But the process of actually figuring out the lip-sync timings was taking too bloody long. So I searched once again for an affordable solution.
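For context, the Unity side of that boiled down to something like the sketch below. It's rough: the one-cue-per-line timing format and the SetMouthShape helper are placeholders, not my actual code.

```csharp
using System.Collections.Generic;
using System.Globalization;
using UnityEngine;

// Rough sketch only: the timing file format and SetMouthShape
// placeholder are illustrative, not the real script.
public class LipSyncPlayer : MonoBehaviour
{
    // Each line of the timing file: "<startSeconds> <phoneme>", e.g. "0.250 AA"
    public TextAsset timingFile;

    private struct Cue { public float time; public string phoneme; }
    private readonly List<Cue> cues = new List<Cue>();
    private int next;
    private float clock;

    void Start()
    {
        foreach (string line in timingFile.text.Split('\n'))
        {
            string[] parts = line.Trim().Split(' ');
            if (parts.Length < 2) continue;
            cues.Add(new Cue
            {
                time = float.Parse(parts[0], CultureInfo.InvariantCulture),
                phoneme = parts[1]
            });
        }
    }

    void Update()
    {
        clock += Time.deltaTime;
        // Fire every cue whose start time has passed since the last frame.
        while (next < cues.Count && cues[next].time <= clock)
        {
            SetMouthShape(cues[next].phoneme);
            next++;
        }
    }

    private void SetMouthShape(string phoneme)
    {
        // Placeholder: swap a mouth sprite, drive a blend shape, etc.
        Debug.Log("Mouth shape: " + phoneme);
    }
}
```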
At the end of the day, I couldn't really find an affordable solution. One of the cheaper alternatives was the lip-sync tool in the Source engine. But that program is just designed to work with Source. Annosoft has a decent tool, and I was able to test the demo for it. But that software costs $500 for one license, and I really didn't feel like dropping that much money. I actually tried using the Windows SAPI libraries to cook up my own solution, but the basic version doesn't give you access to individual phonemes, and isn't terribly accurate besides.
Eventually, I settled on a bit of a compromise. I found that the Annosoft SAPI Command-Line tool worked as I needed it to, but the usability was awful (it is a command-line program, after all). So I decided to cook up a GUI front-end for this command-line program. After a week or two, I have a beta ready.
LipSync GUI program
I would really appreciate it if anyone would give it a quick look-see and confirm that it works.
Replies
I'll post an update when I'm closer to finishing, and I'll probably have an example of the results being put to practical use. Hopefully someone else will find this tool useful.
But this might be VERY useful to me.
I do a lot of lip sync, about 30-60 minutes per game, and we do two games a year. The sync doesn't take that much effort (we do it by hand), but any time we can shave off of animating the face we can spend on animating the body.
We used Dio-Matic's Voice O-Matic for a while after trying out various tools; Annosoft's tool was the best third-party software we found. I liked it, but it was a separate app and the pipeline was too convoluted to be useful. I liked that VOM was integrated right into Max and had similar functionality, processing raw sound files (which is hit-and-miss, more miss than hit actually...).
You could input text, but there was no control for timing and spacing, so it botched it pretty good when you used text... We scrapped VOM and do it by hand because cleaning it up took just as long as doing it right by hand.
I wish VOM were able to read the timing that you are spitting out, or that I had a way to map it directly to the morph target tracks in Max... hmm... I might have to carve out some time to see if I can get that hooked up...
What are the numbers it is spitting out next to the phonemes: frames, ticks, milliseconds? It would be helpful if there were an option to set the FPS or ticks/frames, etc...
I'm glad to hear that it is working for you. The results you get from this program when using just the audio are a bit hit-and-miss, but I've found that the results from using a text transcription are almost spot-on (which is why I wanted to use it).
The numbers it is spitting out are the beginning times for when the specified phoneme shape ought to be displayed. The command-line program provides beginning and end times, the label for the phoneme, and a blending value that I believe is sometimes used for shape keys. The beginning and end times are output in milliseconds; I converted them to float values representing seconds, with three decimal places to preserve the milliseconds.
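The conversion itself is nearly a one-liner. Roughly this (a sketch only; I'm assuming a "startMs endMs phoneme blend" column layout here, which isn't necessarily the tool's exact output format):

```csharp
using System.Globalization;

static class PhonemeLineConverter
{
    // Assumes a column layout of "startMs endMs phoneme blend"; the real
    // tool's delimiters and column order may differ (illustration only).
    public static string ToSeconds(string line)
    {
        string[] cols = line.Split(' ');
        float startSec = int.Parse(cols[0]) / 1000f;
        float endSec   = int.Parse(cols[1]) / 1000f;

        // "0.000" keeps three decimal places, i.e. full millisecond precision.
        return string.Format(CultureInfo.InvariantCulture,
            "{0:0.000} {1:0.000} {2}", startSec, endSec, cols[2]);
    }
}
```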
The GUI program I wrote is essentially calling the command-line program, feeding it the Wave file and text transcription, and then receiving and parsing the results. As such, I can parse the results in any way I want to. If you had a specific format you wanted, I could add that as an export option. I could add frame-based exporting as well; it's not that hard.
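Under the hood it's just a standard process launch with redirected output, something like this (a sketch; the executable name and argument order are placeholders, not the tool's real command line):

```csharp
using System.Diagnostics;

static class LipSyncRunner
{
    // The executable name and argument layout below are placeholders;
    // this is not the tool's actual command-line interface.
    public static string RunTool(string wavPath, string transcriptPath)
    {
        var psi = new ProcessStartInfo
        {
            FileName = "sapi_lipsync.exe",  // hypothetical executable name
            Arguments = "\"" + wavPath + "\" \"" + transcriptPath + "\"",
            RedirectStandardOutput = true,
            UseShellExecute = false,        // required for output redirection
            CreateNoWindow = true
        };

        using (var proc = Process.Start(psi))
        {
            string output = proc.StandardOutput.ReadToEnd(); // raw phoneme timings
            proc.WaitForExit();
            return output; // ready to be parsed into whatever export format
        }
    }
}
```

Frame-based export would then be one more conversion on the parsed times, e.g. frame = Math.Round(ms * fps / 1000.0) for whatever frame rate you pick.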
Maybe checkboxes for what you want exported and in what format? Start? End? Milliseconds, or 15 fps, 24 fps, 30 fps?
What would be kind of cool is if it read the WAV and did some speech-to-text (I think Microsoft has some free stuff available?) and populated the text input field with what it thought the WAV said; then you could clean it up if needed.
I work off of a script so I can copy/paste it, but even that gets to be a chore, and it's one of the reasons we went with Voice O Matic (convenience).
And I'm just throwing this stuff out because I think it might be useful; I'm not expecting you to bust your balls to get any of it added. If you do, that's cool; if not, I'm fine with that too.
Thanks for the suggestions, Mark.
Just as a heads-up, I put in some more work on the program and it is up to beta version 0.7.
LipSync beta version 0.7
Unfortunately, I'm going to hold off on the speech-to-text suggestion you made, Mark. It is actually possible; I experimented with that approach initially, before settling on the Annosoft command-line program, and I even got some examples working. The biggest problem is that the recognition isn't very accurate out of the box: you have to use it over and over until it starts to "learn" your voice. I was planning on using this program for animated web series that will probably feature multiple voices, and I can't take the time to train the computer on each voice.
I am thinking of adding the different export options to the program. I think it would fit well under the settings menu, as a pop-up window with checkboxes for each of the data options you want to export. I could make these settings persistent as well, so that you don't have to go in and customize them each time you run the program.
One of the big additions to the new beta is the batch processing button. I got folder selection working properly and added support for processing the entire contents of a directory instead of just one file at a time.
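The batch pass itself is nothing clever, roughly this (a sketch; the WAV-to-transcript pairing rule and the processFile callback are just illustrations of the idea):

```csharp
using System.IO;

static class BatchRunner
{
    // Sketch of the batch pass: pair each WAV in the folder with a matching
    // .txt transcript and feed both to the single-file routine. The pairing
    // rule and the "_phonemes" suffix are placeholders.
    public static void Run(string inputDir, string outputDir,
                           System.Action<string, string, string> processFile)
    {
        foreach (string wavPath in Directory.GetFiles(inputDir, "*.wav"))
        {
            string txtPath = Path.ChangeExtension(wavPath, ".txt");
            if (!File.Exists(txtPath)) continue; // skip WAVs without transcripts

            string outPath = Path.Combine(outputDir,
                Path.GetFileNameWithoutExtension(wavPath) + "_phonemes.txt");
            processFile(wavPath, txtPath, outPath);
        }
    }
}
```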
Remembered settings would be nice.
That works great! But I do have some suggestions for the UI.
1) Add a 3rd field for the export path? Maybe just above the processing buttons? Then, instead of being prompted to specify it each time, if the path looks right I can click process and walk away. That would only be useful if it remembers the path...
2) Maybe have a single processing button? To do this, add another field underneath "Choose WAV" for "Choose Folder", then process whatever is in both fields.
Or maybe leave it as one field and allow the user to pick a folder or a file (if that is even possible from one select pop-up menu)? That might not be obvious at first, so maybe put a little label under the button: "or specify a folder"?
I actually already added a feature that might help you out with the first issue you described. Check under "Settings" in the top-bar menu. There you will find default directory options for both regular and batch processing. Using these settings, you can specify what directory you want the various fields to point to by default. All of these settings are persistent and will be saved for later sessions.
I actually had a 3rd field for the export path originally, but it occurred to me that it was somewhat redundant, especially since you can already specify a default path.
The default directory settings options are especially useful for batch processing. If you have already set the default batch processing paths to where you want them to go, you can essentially just click "OK" when the batch processing button prompts you to choose your directories.
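For the curious, the persistence is nothing fancy either; conceptually it's along these lines (a sketch; the config file name and key=value format are illustrative, not necessarily what the program actually writes):

```csharp
using System.Collections.Generic;
using System.IO;

static class GuiSettings
{
    // Sketch of one way to persist default-directory settings between
    // sessions: a plain "key=value" file next to the executable.
    const string FileName = "lipsync_gui.cfg"; // hypothetical file name

    public static Dictionary<string, string> Load()
    {
        var map = new Dictionary<string, string>();
        if (!File.Exists(FileName)) return map;
        foreach (string line in File.ReadAllLines(FileName))
        {
            int eq = line.IndexOf('=');
            if (eq > 0) map[line.Substring(0, eq)] = line.Substring(eq + 1);
        }
        return map;
    }

    public static void Save(Dictionary<string, string> map)
    {
        var lines = new List<string>();
        foreach (var kv in map) lines.Add(kv.Key + "=" + kv.Value);
        File.WriteAllLines(FileName, lines.ToArray());
    }
}
```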
Best Regards
Thanks, guys, for the cool application and the useful information. Which tools/applications can import this output text file and apply the phoneme shapes to 3D characters?
Thanks.