(This is sort of dated, not sure why I never published it. Hope it helps someone?)
code: SpeechRecognitionController.rar
bin x86: SpeechRecognitionController.x86.rar
Voice recognition
So I took a break from my normal nefarious deeds today to play with voice recognition in Windows 7.
A nice feature of this software is that you can activate or deactivate it using a voice command on the fly. This is useful because it prevents you from accidentally formatting your hard drive, or more realistically, closing out some windows when you want to have a chat with a human sitting next to you.
Great interface, but a cheesy problem
Something that was bugging me about the interface is the fact that the only way to get it to start listening to you again is to say “start listening.” There seems to be limitless customization that you can do once you’ve got the shared interface turned on, but this “start listening” thing is annoying. I want to say “hello computer”. Or maybe, “sup bitch.”
I searched far and wide and was disappointed to find that there’s no way to do this out of the box. Nor could I find anyone who had found a simple, clean solution. There is this paid app that might solve the problem, but fuck that. So I set out on a path to figure out how it could be done. Sure enough, I found one.
Why Interop, why now!?
.NET/C# is pretty much the only way I fly. I haven’t delved into interop very much before, but while googling the vast labyrinths of the interwebs I stumbled across a few clues that led me down this rabbit hole. Firstly, this guy, who was using the API to figure out the current state of the shared adapter. Looking at the code on his site was like Greek to me. I have no idea what he’s doing with SAPI, what TSF is, or why he is using this interface, but my mind started to bubble up some ancient memories of using interop to communicate with unmanaged Windows stuff before.
He was nice enough to include a link to the MSDN on SAPI on his page, and after trawling through it for a while and doing lots more googling it seemed the general consensus was that I would need to interface with the ISpRecognizer3 if I wanted to make my computer respond to “bitch”. I was lucky enough to stumble across a great CodeProject demo that succeeded in confusing the hell out of me until I found a writeup that gave me the proper nudge.
TlbImp.exe (Type Library Importer)
This nifty tool, included with the everyday installation of Visual Studio (.NET Framework Tools), will auto-magically build us an interface to communicate with the COM objects. All we have to do is point and shoot. Point and shoot at what though?
The sapi binary (sapi.dll), because that’s the guy who controls all the magic-sauce behind this sweet Voice Recognition software. I had a piss of a time finding it though, until I finally broke down and used Process Explorer to sniff out the location.
By the way, if you don’t have Process Explorer, go and download it right now.
Ahh, the old System32 path. Who’d have thunk it? So apparently by running one command with TlbImp I should be able to generate a dll I can use to communicate with SAPI from my .NET app:
TlbImp "C:\Windows\System32\Speech\Common\sapi.dll" /out:Interop.SpeechLib.dll
This throws a ton of warnings for some reason, but they don’t seem to cause any issues, and it gives me my Interop binary so I can go play with Voice Recog.
.NET’s SpeechRecognitionEngine
I’m not going to go too far into this, let me just say that Microsoft has made it super-simple to throw together an app that handles speech recognition. Check out this short and sweet article from Prodigy Productions that taught me just about everything I needed to know about the SpeechRecognitionEngine. It goes something like this:
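    // Needs a reference to System.Speech, plus:
    // using System.Globalization; using System.Speech.Recognition;
    // engine is a SpeechRecognitionEngine field on the form, so it outlives this method.
    this.engine = new SpeechRecognitionEngine(new CultureInfo("en-US"));
    this.engine.SetInputToDefaultAudioDevice();
    this.engine.SpeechRecognized += this.Engine_SpeechRecognized;

    // The grammar holds a single phrase: whatever is typed into txTest.
    Choices choices = new Choices();
    choices.Add(this.txTest.Text);
    GrammarBuilder grammarBuilder = new GrammarBuilder(choices);
    Grammar grammar = new Grammar(grammarBuilder);
    this.engine.LoadGrammar(grammar);

    // Keep listening after each recognition instead of stopping at the first one.
    this.engine.RecognizeAsync(RecognizeMode.Multiple);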
This is literally all of the code it takes to spin up the engine…
Okay, I’m lying. You have to define your SpeechRecognized handler too:
    private void Engine_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
        // Append each recognized word to the output textbox, separated by spaces.
        foreach (RecognizedWordUnit word in e.Result.Words)
        {
            this.txOutput.Text += word.Text;
            this.txOutput.Text += " ";
        }
    }
Now, assuming you have a Windows Form with a txOutput textbox on it to show your output, and a txTest textbox to define what you want to say, when you run the app you’ll be able to speak to it!
Awesome!
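If you'd rather poke at this without building a form first, here is a rough console sketch of the same idea. It is not the code from my app. The class name, the hard-coded "hello computer" phrase, the IsWakePhrase helper and the 0.7 confidence cutoff are all made up for illustration, and the cutoff in particular is just a guess you would want to tune.

```csharp
using System;
using System.Globalization;
using System.Speech.Recognition; // add a reference to System.Speech

static class WakePhraseDemo
{
    // Hypothetical values, chosen only for this sketch.
    const string WakePhrase = "hello computer";
    const float MinConfidence = 0.7f;

    // Pure helper: true when the recognized text is the wake phrase
    // (ignoring case) and the engine was reasonably confident about it.
    public static bool IsWakePhrase(string text, float confidence)
    {
        return confidence >= MinConfidence
            && string.Equals(text, WakePhrase, StringComparison.OrdinalIgnoreCase);
    }

    static void Main()
    {
        using (var engine = new SpeechRecognitionEngine(new CultureInfo("en-US")))
        {
            engine.SetInputToDefaultAudioDevice();

            // Match the grammar's culture to the engine's so LoadGrammar
            // does not complain on machines with a different default culture.
            var builder = new GrammarBuilder(WakePhrase)
            {
                Culture = engine.RecognizerInfo.Culture
            };
            engine.LoadGrammar(new Grammar(builder));

            engine.SpeechRecognized += (sender, e) =>
            {
                if (IsWakePhrase(e.Result.Text, e.Result.Confidence))
                {
                    Console.WriteLine("Heard the wake phrase.");
                }
            };

            // Keep recognizing until we are told to stop.
            engine.RecognizeAsync(RecognizeMode.Multiple);
            Console.WriteLine("Listening. Press Enter to quit.");
            Console.ReadLine();
            engine.RecognizeAsyncStop();
        }
    }
}
```

The confidence check is there because a one-phrase grammar will happily match things that only sort of sound like the phrase, so it seems worth filtering out the weak matches.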
Now, to get our cheesy app to override Windows’ native “start listening” trigger, we have one more puzzle piece: the interop.
The Magic Sauce
Just drop the nifty magic Interop binary we previously generated into our references and execute the below function to take control of the voice recog:
    private void LaunchTheThing()
    {
        // Both types come from the Interop assembly that TlbImp generated.
        ISpRecognizer3 pRecog = new SpSharedRecognizerClass();
        pRecog.SetActiveCategory(null);
    }
Voila, now we can wake him up to anything we want!
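For the curious, this is roughly how the two pieces could be wired together: recognize a custom phrase with the .NET engine, then make the same two interop calls as LaunchTheThing. Treat it as a sketch under assumptions, not a copy of my app. The Interop.SpeechLib namespace is my guess at what TlbImp produces for the /out name used earlier (adjust it if yours differs), and the class name, method name and "hello computer" phrase are made up for the example.

```csharp
using System;
using System.Globalization;
using System.Speech.Recognition; // add a reference to System.Speech
using Interop.SpeechLib;         // assumed namespace of the TlbImp output

static class WakeWindowsSpeech
{
    static void Main()
    {
        using (var engine = new SpeechRecognitionEngine(new CultureInfo("en-US")))
        {
            engine.SetInputToDefaultAudioDevice();

            // Hypothetical wake phrase; use whatever you like.
            var builder = new GrammarBuilder("hello computer")
            {
                Culture = engine.RecognizerInfo.Culture
            };
            engine.LoadGrammar(new Grammar(builder));

            // Whenever our phrase is heard, poke the shared recognizer.
            engine.SpeechRecognized += (sender, e) => WakeSharedRecognizer();

            engine.RecognizeAsync(RecognizeMode.Multiple);
            Console.WriteLine("Listening. Press Enter to quit.");
            Console.ReadLine();
            engine.RecognizeAsyncStop();
        }
    }

    // Same two calls as LaunchTheThing, just under a more descriptive name.
    static void WakeSharedRecognizer()
    {
        ISpRecognizer3 recognizer = new SpSharedRecognizerClass();
        recognizer.SetActiveCategory(null);
    }
}
```

This only builds once the generated Interop binary is in your references, and it does nothing about the dead state that needs ISpRecognizer instead.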
These are the basics, of course. There’s also a totally dead “off” state that the process sometimes goes into on its own, which requires prodding the ISpRecognizer instead of the ISpRecognizer3. I stuck these aspects into an app, and even a separate binary for the interop calls if you’d like to use them in your own app. If there’s a lot of curiosity around it I’ll go into the nuances, but for now I’m going to work on another writeup involving turning this guy into a tray-icon because, although I’ve been developing for years, I still don’t know how to do that.
Back soon!