Azure Cognitive Services and Unity

I have been working a lot with Azure Cognitive Services lately.  Everything seems to need them in one form or another.  Whether people are building bots with Bot Framework, connecting them to their data sets for data analysis, image analysis, feature detection in videos, natural language, or speech recognition… AI is rapidly being embedded everywhere.

Then, over a holiday break, I wanted to play around with some of the services from Unity.  I wasn’t sure why, but my thinking was that if I built it, I’d see what was possible.  And I wanted to make it Unity-specific, not just calling REST APIs with strings of text.

Vision API

I started with the Vision API.  The main reason is I wanted the Vision API to tell me what it saw on screen.  This required a little bit of encoding, but it was a lot easier than expected.

The hardest part is getting the PNG data in the right format, but even that is pretty straightforward.
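To make "the right format" concrete, here is a minimal sketch of the request shape, written in Python rather than Unity C# so the pieces are easy to see. It assumes the PNG bytes already exist (in Unity they would typically come from something like Texture2D.EncodeToPNG()), and the region, API version, and key in it are placeholders of mine, not values from the repository. The function only builds the request; it does not send it.

```python
import urllib.request

# Assumption: region ("westus") and version ("v1.0") are placeholders;
# use whatever endpoint your own subscription gives you.
ENDPOINT = "https://westus.api.cognitive.microsoft.com/vision/v1.0/analyze"


def build_vision_request(png_bytes, subscription_key, features="Description"):
    """Build (but do not send) a POST that carries raw PNG bytes."""
    url = f"{ENDPOINT}?visualFeatures={features}"
    return urllib.request.Request(
        url,
        data=png_bytes,  # the raw image bytes, not base64 and not JSON
        method="POST",
        headers={
            # Binary upload, so the body is an octet stream.
            "Content-Type": "application/octet-stream",
            # Hypothetical key value supplied by the caller.
            "Ocp-Apim-Subscription-Key": subscription_key,
        },
    )
```

The point of the sketch is that the image goes up as the raw request body with an octet-stream content type; there is no text encoding step in between.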

vision_api_diagram.png

It’s quite interesting to see what the Vision API sees.  One thing it does is act as a form of reality check.  For what we believe is a realistically rendered scene in a game, the API comes back with some interesting insights, and with errors caused by things like bloom, filtering, texture glitches and tearing.  Overproduction makes us believe the world is shinier than it is, and the Vision API is not falling for it.

Very quickly I had some Vision API code working.  It was a lot easier than I expected, so I decided to dive into the Bing Speech APIs next.

Bing Speech API

The Bing Speech API is a bit more complex than the Vision API in regards to getting everything in shape.  The main issue is the audio recording.

This version only does audio samples, and does not do streaming audio analysis.  I’ll save that battle for later.

To record an audio clip in Unity, you start recording into a buffer for X seconds, and can stop it at any time, or when the buffer is full.  This gives you raw audio in the format preferred by Unity, but not in the format required for the Speech API.

speech_api_diagram.png

To transcode the audio into the format required by the BingSpeechAPI, I stumbled on the awesome SavWav samples on GitHub.  They have all the code for converting Unity AudioClip data into PCM wave files for saving to disk.  I converted SavWav to the WavData class, which allows us to get access to the PCM format byte data.

Once transcoded, we send the data and, again, voila, it works.  SavWav had a few things that needed changing.  Primarily, there were some precision errors with floats that, when submitting files to the BingSpeechAPI, caused it to time out expecting additional content in the stream.  This was pretty hard to figure out, since the audio files would still play if saved to disk.
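For anyone curious what the transcoding boils down to, here is a rough sketch in Python (not the actual WavData code, and not Unity C#). It assumes mono 16 kHz 16-bit PCM and the canonical 44-byte WAV header, and the function names are my own. The one idea worth copying is that every size in the header is computed from the final integer byte buffer, never from float arithmetic on clip length, so the header cannot promise more bytes than are actually sent.

```python
import struct


def floats_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    out = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clamp so out-of-range values cannot overflow
        out += struct.pack("<h", int(round(s * 32767)))
    return bytes(out)


def wav_bytes(samples, sample_rate=16000, channels=1):
    """Wrap PCM data in a canonical 44-byte RIFF/WAVE header."""
    pcm = floats_to_pcm16(samples)
    block_align = channels * 2              # bytes per sample frame
    byte_rate = sample_rate * block_align
    header = b"RIFF" + struct.pack("<I", 36 + len(pcm)) + b"WAVE"
    header += b"fmt " + struct.pack(
        "<IHHIIHH", 16, 1, channels, sample_rate, byte_rate, block_align, 16
    )
    # Data size is taken from the real buffer length, in whole bytes.
    header += b"data" + struct.pack("<I", len(pcm))
    return header + pcm
```

Whether a mismatched length was exactly the precision problem in SavWav I can't say for certain from memory, but a header that declares more data than follows it is one way to end up with a server waiting on a stream that never completes.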

Ensure your audio data is perfect.  The API is trusting that you know what you are doing.  Bad data will give undesired results.

The audio APIs were incredibly accurate.  I’d even try to trick them, and it was only when I threw in rare names or mumbled that they failed.  The locale was also set to en-US while I speak en-AU, a mismatch that usually fails more often than it did here.

Troubleshooting

At the moment there are a few ways to check each step for audio recording-encoding-transcoding-transmitting.

The API

Load a pre-recorded audio clip from a WAV file.  Best files to use are the ones that come with the BingSpeechAPI SDK as you know they should work.  You can also test them in the sample apps that come with the SDK to triple check the APIs and keys are working.

Recording

API looks to be working, but when you use recorded audio it isn’t?  One trick here is to save the WAV file to disk in addition to sending it to the BingSpeechAPI, so you know what the SpeechAPI heard.  You can open the saved file in an audio application and verify the quality.  At one point my mic wasn’t working in Unity and I had no way of knowing otherwise.
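Listening to the saved file checks the microphone, but audio players are forgiving about headers in a way the service may not be. As a complement, here is a small Python sketch that checks whether the sizes declared in a WAV header match the bytes actually present. It assumes the canonical 44-byte PCM header layout, and check_wav is a name I made up for illustration.

```python
import struct


def check_wav(data):
    """Return a list of problems found in a canonical 44-byte-header PCM WAV."""
    if len(data) < 44 or data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        return ["not a RIFF/WAVE file"]
    problems = []
    # The RIFF size field should equal the total length minus the first 8 bytes.
    riff_size = struct.unpack("<I", data[4:8])[0]
    if riff_size != len(data) - 8:
        problems.append(f"RIFF size says {riff_size}, actual is {len(data) - 8}")
    if data[36:40] != b"data":
        problems.append("data chunk not at offset 36 (non-canonical header)")
        return problems
    # The data size field should equal the number of bytes after the header.
    data_size = struct.unpack("<I", data[40:44])[0]
    if data_size != len(data) - 44:
        problems.append(f"data size says {data_size}, actual is {len(data) - 44}")
    return problems
```

Run it over the same bytes you are about to send; an empty list means the header and the payload agree with each other, which is worth knowing before blaming the network or the key.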

Accuracy

Make sure you set the locale, because it does make a big difference.  Even if you’re getting 90% accuracy on the wrong locale, why not try for 95%?
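As a rough illustration of where the locale goes, here is a Python sketch that builds a recognition URL with the language as a query parameter. The endpoint path and parameter names are my assumption of the REST shape and may not match the version of the service or the code in the repository, so treat them as placeholders and check the current documentation.

```python
from urllib.parse import urlencode

# Assumption: this endpoint path is illustrative and may differ for your
# service version or region.
SPEECH_ENDPOINT = (
    "https://speech.platform.bing.com/speech/recognition/"
    "interactive/cognitiveservices/v1"
)


def speech_url(locale="en-US", result_format="simple"):
    """Build the recognition URL with an explicit locale."""
    query = urlencode({"language": locale, "format": result_format})
    return f"{SPEECH_ENDPOINT}?{query}"


# Example: an Australian English speaker would pass their own locale.
url_for_me = speech_url("en-AU")
```

Keeping the locale as an explicit argument, rather than a constant buried in the URL, makes it harder to forget that it is a setting at all.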

Other APIs

There are many other Azure Cognitive Services APIs.  Support so far covers only Vision and Speech, and even that is minimal and more of a proof of concept.  So if you are keen, please feel free to contribute.

For now, head on over to the AzureCognitiveServicesForUnity repository on GitHub.
