About a month ago, I started looking for a lip-syncing solution that I wouldn't have to break the bank over. I had cooked up a script for Unity that would read a text document with lip-sync timings, and then automatically animate a character's mouth based on pre-defined functions. But the process of actually figuring out the lip-sync timings was taking too bloody long. So I searched once again for an affordable solution.
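For context, the Unity side of that boiled down to something like the sketch below. It's rough: the one-cue-per-line timing format and the SetMouthShape helper are placeholders, not my actual code.

```csharp
using System.Collections.Generic;
using System.Globalization;
using UnityEngine;

// Rough sketch only: the timing file format and SetMouthShape
// placeholder are illustrative, not the real script.
public class LipSyncPlayer : MonoBehaviour
{
    // Each line of the timing file: "<startSeconds> <phoneme>", e.g. "0.250 AA"
    public TextAsset timingFile;

    private struct Cue { public float time; public string phoneme; }
    private readonly List<Cue> cues = new List<Cue>();
    private int next;
    private float clock;

    void Start()
    {
        foreach (string line in timingFile.text.Split('\n'))
        {
            string[] parts = line.Trim().Split(' ');
            if (parts.Length < 2) continue;
            cues.Add(new Cue
            {
                time = float.Parse(parts[0], CultureInfo.InvariantCulture),
                phoneme = parts[1]
            });
        }
    }

    void Update()
    {
        clock += Time.deltaTime;
        // Fire every cue whose start time has passed since the last frame.
        while (next < cues.Count && cues[next].time <= clock)
        {
            SetMouthShape(cues[next].phoneme);
            next++;
        }
    }

    private void SetMouthShape(string phoneme)
    {
        // Placeholder: swap a mouth sprite, drive a blend shape, etc.
        Debug.Log("Mouth shape: " + phoneme);
    }
}
```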
At the end of the day, I couldn't really find an affordable solution. One of the cheaper alternatives was the lip-sync tool in the Source engine. But that program is just designed to work with Source. Annosoft has a decent tool, and I was able to test the demo for it. But that software costs $500 for one license, and I really didn't feel like dropping that much money. I actually tried using the Windows SAPI libraries to cook up my own solution, but the basic version doesn't give you access to individual phonemes, and isn't terribly accurate besides.
Eventually, I settled on a bit of a compromise. I found that the Annosoft SAPI Command-Line tool worked as I needed it to, but the usability was awful (it is a command-line program, after all). So I decided to cook up a GUI front-end for this command-line program. After a week or two, I have a beta ready.
LipSync GUI program
I would really appreciate it if anyone would give it a quick look-see and confirm that it works.
Replies
I'll post an update when I'm closer to finishing, and I'll probably have an example of the results being put to practical use. Hopefully someone else will find this tool useful.
But this might be VERY useful to me.
I do a lot of lip sync, about 30-60 minutes per game, and we do two games a year. The sync doesn't take that much effort (we do it by hand), but any time we can shave off of animating the face we can spend on animating the body.
We used Dio-Matic's Voice O-Matic for a while after trying out various tools; Annosoft's tool was the best third-party software we found. I liked it, but it was a separate app and the pipeline was too convoluted to be useful. I liked that VOM was integrated right into Max and had similar functionality, processing raw sound files (which is hit-and-miss, more miss than hit actually...).
You could input text, but there was no control for timing and spacing, so it botched it pretty good when you used text... We scrapped VOM and do it by hand because cleaning it up took just as long as doing it right by hand.
I wish VOM were able to read the timing that you are spitting out, or that I had a way to map it directly to the morph target tracks in Max... hmm... I might have to carve out some time to see if I can get that hooked up...
What are the numbers it is spitting out next to the phonemes: frames, ticks, milliseconds? It would be helpful if there were an option to set the FPS or ticks/frames, etc...
I'm glad to hear that it is working for you. The results you get from this program when using just the audio are a bit hit-and-miss, but I've found that the results from using a text transcription are almost spot-on (which is why I wanted to use it).
The numbers it is spitting out are the beginning times for when the specified phoneme shape ought to be displayed. The command-line program provides beginning and end times, the label for the phoneme, and a blending value that I believe is sometimes used for shape keys. The beginning and end times are output in milliseconds; I converted them to float values representing seconds, with three decimal places to preserve the milliseconds.
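The conversion itself is nearly a one-liner. Roughly this (a sketch only; I'm assuming a "startMs endMs phoneme blend" column layout here, which isn't necessarily the tool's exact output format):

```csharp
using System.Globalization;

static class PhonemeLineConverter
{
    // Assumes a column layout of "startMs endMs phoneme blend"; the real
    // tool's delimiters and column order may differ (illustration only).
    public static string ToSeconds(string line)
    {
        string[] cols = line.Split(' ');
        float startSec = int.Parse(cols[0]) / 1000f;
        float endSec   = int.Parse(cols[1]) / 1000f;

        // "0.000" keeps three decimal places, i.e. full millisecond precision.
        return string.Format(CultureInfo.InvariantCulture,
            "{0:0.000} {1:0.000} {2}", startSec, endSec, cols[2]);
    }
}
```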
The GUI program I wrote is essentially calling the command-line program, feeding it the Wave file and text transcription, and then receiving and parsing the results. As such, I can parse the results in any way I want to. If you had a specific format you wanted, I could add that as an export option. I could add frame-based exporting as well; it's not that hard.
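Under the hood it's just a standard process launch with redirected output, something like this (a sketch; the executable name and argument order are placeholders, not the tool's real command line):

```csharp
using System.Diagnostics;

static class LipSyncRunner
{
    // The executable name and argument layout below are placeholders;
    // this is not the tool's actual command-line interface.
    public static string RunTool(string wavPath, string transcriptPath)
    {
        var psi = new ProcessStartInfo
        {
            FileName = "sapi_lipsync.exe",  // hypothetical executable name
            Arguments = "\"" + wavPath + "\" \"" + transcriptPath + "\"",
            RedirectStandardOutput = true,
            UseShellExecute = false,        // required for output redirection
            CreateNoWindow = true
        };

        using (var proc = Process.Start(psi))
        {
            string output = proc.StandardOutput.ReadToEnd(); // raw phoneme timings
            proc.WaitForExit();
            return output; // ready to be parsed into whatever export format
        }
    }
}
```

Frame-based export would then be one more conversion on the parsed times, e.g. frame = Math.Round(ms * fps / 1000.0) for whatever frame rate you pick.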
Maybe checkboxes for what you want exported and in what format? Start? End? Milliseconds, or 15 fps, 24 fps, 30 fps?
What would be kind of cool is if it read the WAV and did some speech-to-text (I think Microsoft has some free stuff available?) and populated the text input field with what it thought the WAV said; then you could clean it up if needed.
I work off of a script so I can copy/paste it, but even that gets to be a chore, and it's one of the reasons we went with Voice O Matic (convenience).
And I'm just throwing this stuff out because I think it might be useful; I'm not expecting you to bust your balls to get any of it added. If you do, that's cool; if not, I'm fine with that too.
Thanks for the suggestions, Mark.
Just as a heads-up, I put in some more work on the program and it is up to beta version 0.7.
LipSync beta version 0.7
Unfortunately, I'm going to hold off on the speech-to-text suggestion you made, Mark. It is actually possible; I experimented with that approach initially, before settling on the Annosoft command-line program, and I even got some examples working. The biggest problem is that the recognition isn't very accurate out of the box: you have to use it over and over until it starts to "learn" your voice. I was planning on using this program for animated web series that will probably feature multiple voices, and I can't take the time to train the computer on each voice.
I am thinking of adding the different export options to the program. I think it would fit well under the settings menu, as a pop-up window with checkboxes for each of the data options you want to export. I could make these settings persistent as well, so that you don't have to go in and customize them each time you run the program.
One of the big additions to the new beta is the batch processing button. I got folder selection working properly and added support for processing the entire contents of a directory instead of just one file at a time.
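The batch pass itself is nothing clever, roughly this (a sketch; the WAV-to-transcript pairing rule and the processFile callback are just illustrations of the idea):

```csharp
using System.IO;

static class BatchRunner
{
    // Sketch of the batch pass: pair each WAV in the folder with a matching
    // .txt transcript and feed both to the single-file routine. The pairing
    // rule and the "_phonemes" suffix are placeholders.
    public static void Run(string inputDir, string outputDir,
                           System.Action<string, string, string> processFile)
    {
        foreach (string wavPath in Directory.GetFiles(inputDir, "*.wav"))
        {
            string txtPath = Path.ChangeExtension(wavPath, ".txt");
            if (!File.Exists(txtPath)) continue; // skip WAVs without transcripts

            string outPath = Path.Combine(outputDir,
                Path.GetFileNameWithoutExtension(wavPath) + "_phonemes.txt");
            processFile(wavPath, txtPath, outPath);
        }
    }
}
```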
Remembered settings would be nice.
That works great! But I do have some suggestions for the UI.
1) Add a 3rd field for the export path? Maybe just above the processing buttons? Then, instead of being prompted to specify it each time, if the path looks right I can click process and walk away. That would only be useful if it remembers the path...
2) Maybe have a single processing button? To do this, add another field underneath "Choose WAV" for "Choose Folder", then process whatever is in both fields.
Or maybe leave it as one field and allow the user to pick a folder or a file (if that is even possible from one select pop-up menu)? That might not be obvious at first, so maybe put a little label under the button: "or specify a folder"?
I actually already added a feature that might help you out with the first issue you described. Check under "Settings" in the top-bar menu. There you will find default directory options for both regular and batch processing. Using these settings, you can specify what directory you want the various fields to point to by default. All of these settings are persistent and will be saved for later sessions.
I actually had a 3rd field for the export path originally, but it occurred to me that it was somewhat redundant, especially since you can already specify a default path.
The default directory settings options are especially useful for batch processing. If you have already set the default batch processing paths to where you want them to go, you can essentially just click "OK" when the batch processing button prompts you to choose your directories.
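For the curious, the persistence is nothing fancy either; conceptually it's along these lines (a sketch; the config file name and key=value format are illustrative, not necessarily what the program actually writes):

```csharp
using System.Collections.Generic;
using System.IO;

static class GuiSettings
{
    // Sketch of one way to persist default-directory settings between
    // sessions: a plain "key=value" file next to the executable.
    const string FileName = "lipsync_gui.cfg"; // hypothetical file name

    public static Dictionary<string, string> Load()
    {
        var map = new Dictionary<string, string>();
        if (!File.Exists(FileName)) return map;
        foreach (string line in File.ReadAllLines(FileName))
        {
            int eq = line.IndexOf('=');
            if (eq > 0) map[line.Substring(0, eq)] = line.Substring(eq + 1);
        }
        return map;
    }

    public static void Save(Dictionary<string, string> map)
    {
        var lines = new List<string>();
        foreach (var kv in map) lines.Add(kv.Key + "=" + kv.Value);
        File.WriteAllLines(FileName, lines.ToArray());
    }
}
```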
Best Regards
Thanks, guys, for the cool application and the useful information. Which tools/applications can import this output text file and apply the phoneme shapes to 3D characters?
Thanks.