Learning Microsoft Cognitive Services

Getting Started with Microsoft Cognitive Services

You have just started on the road to learning about Microsoft Cognitive Services. This chapter will serve as a gentle introduction to the services. The end goal is to understand a bit more about what these cognitive APIs can do for you. By the end of this chapter, we will have created an easy-to-use project template. You will have learned how to detect faces in images and have the number of faces spoken back to you.

Throughout this chapter, we will cover the following topics:

Learning about some applications already using Microsoft Cognitive Services
Creating a template project
Detecting faces in images using Face API
Discovering what Microsoft Cognitive Services can offer
Doing text-to-speech conversion using Bing Speech API

Setting up boilerplate code

Before we start diving into the action, we will go through some setup. More to the point, we will set up some boilerplate code which we will utilize throughout this book.

To get started, you will need to install a version of Visual Studio, preferably Visual Studio 2015 or higher. The Community Edition will work fine for this purpose. You do not need anything more than what the default installation offers.

You can find Visual Studio 2017 at https://www.microsoft.com/en-us/download/details.aspx?id=48146.

Throughout this book, we will use the different APIs to build a smart-house application. The application is meant to show what a futuristic house might look like. If you have seen the Iron Man movies, you can think of the application as resembling Jarvis in some ways.

In addition, we will be doing smaller sample applications using the cognitive APIs. Doing so will allow us to cover each API, even those that did not make it to the final application.

What all the applications we build have in common is that they are Windows Presentation Foundation (WPF) applications. WPF is a well-established framework that allows us to build applications using the Model View ViewModel (MVVM) pattern. One of the advantages of taking this road is that we will be able to see the API usage quite clearly. It also separates the code so that you can bring the API logic to other applications with ease.

The following steps describe the process of creating a new WPF project:

Open Visual Studio and select File | New | Project.
In the dialog, select the WPF Application option from Templates | Visual C#, as shown in the following screenshot:

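Delete the MainWindow.xaml file and create files and folders matching the following image (the rest of this chapter refers to View/MainView.xaml, ViewModel/MainViewModel.cs, ObservableObject.cs, and DelegateCommand.cs):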

We will not go through the MVVM pattern in detail, as this is outside the scope of this book. The key takeaway from the image is that we have separated the View from what becomes the logic. We then rely on the ViewModel to connect the pieces.

If you want to learn more about MVVM, I recommend reading an article from CodeProject at http://www.codeproject.com/Articles/100175/Model-View-ViewModel-MVVM-Explained.

To be able to run this, we do, however, need to cover some of the details in the project:

Open the App.xaml file and make sure StartupUri is set to the correct View, as shown in the following code (class name and namespace may vary based on the name of your application):

        <Application x:Class="Chapter1.App"
            xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
            xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
            xmlns:local="clr-namespace:Chapter1"
            StartupUri="View/MainView.xaml">

Open the MainViewModel.cs file and make it inherit from the ObservableObject class.

Open the MainView.xaml file and set MainViewModel as its datacontext, as shown in the following code (namespace and class names may vary based on the name of your application):

        <Window x:Class="Chapter1.View.MainView"
            xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
            xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
            xmlns:d="http://schemas.microsoft.com/expression/blend/2008"
            xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
            xmlns:local="clr-namespace:Chapter1.View"
            xmlns:viewmodel="clr-namespace:Chapter1.ViewModel" mc:Ignorable="d"
            Title="Chapter 1" Height="300" Width="300">
            <Window.DataContext>
                <viewmodel:MainViewModel />
            </Window.DataContext>

Following this, we need to fill in the content of the ObservableObject.cs file. We start off by having it implement the INotifyPropertyChanged interface as follows:

        public class ObservableObject : INotifyPropertyChanged

This is a rather small class, which should contain the following:

        public event PropertyChangedEventHandler PropertyChanged; 
        protected void RaisePropertyChangedEvent(string propertyName) 
        { 
            PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(propertyName)); 
        }

We declare a property changed event and create a function to raise the event. This will allow the User Interface (UI) to update its values when a given property has changed.
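To see the event at work without any UI, the following minimal console sketch subscribes to PropertyChanged by hand, much as a WPF binding would. The Person class, its Name property, and the ObservableObjectDemo class are hypothetical and exist only for this illustration:

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;

public class ObservableObject : INotifyPropertyChanged
{
    public event PropertyChangedEventHandler PropertyChanged;

    protected void RaisePropertyChangedEvent(string propertyName)
    {
        PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(propertyName));
    }
}

// Hypothetical class used only for this illustration.
public class Person : ObservableObject
{
    private string _name;
    public string Name
    {
        get { return _name; }
        set
        {
            _name = value;
            RaisePropertyChangedEvent("Name");
        }
    }
}

public static class ObservableObjectDemo
{
    // Sets Name once and returns the property names reported by the event.
    public static List<string> Run()
    {
        var changed = new List<string>();
        var person = new Person();

        // Manual subscription, standing in for what a binding would do.
        person.PropertyChanged += (sender, e) => changed.Add(e.PropertyName);

        person.Name = "Ada";
        return changed;
    }

    public static void Main()
    {
        foreach (var name in Run())
            Console.WriteLine("Changed: " + name);
    }
}
```

The subscriber receives only the property name, which is why the string passed to RaisePropertyChangedEvent has to match the name of the property exactly.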

We also need to be able to execute actions when buttons are pressed. This can be achieved when we put some content into the DelegateCommand.cs file. Start by making the class implement the ICommand interface, and declare the following two variables:

        public class DelegateCommand : ICommand 
        {
            private readonly Predicate<object> _canExecute; 
            private readonly Action<object> _execute;

The two variables we have created will be set in the constructor. As you will notice, the canExecute parameter is optional, and you will see why in a bit:

            public DelegateCommand(Action<object> execute, Predicate<object> canExecute = null) 
            { 
                _execute = execute; 
                _canExecute = canExecute; 
            }

To complete the class, we add two public functions and one public event as follows:

        public bool CanExecute(object parameter) 
        { 
            if (_canExecute == null) return true; 
            return _canExecute(parameter); 
        } 
 
        public void Execute(object parameter) 
        { 
            _execute(parameter); 
        } 
    
        public event EventHandler CanExecuteChanged 
        { 
            add 
            { 
                CommandManager.RequerySuggested += value; 
            } 
            remove 
            {
                CommandManager.RequerySuggested -= value; 
            } 
        } 
    }

The CanExecute and Execute functions invoke the predicate and the action passed to the constructor. We will supply these from our ViewModels: the action performs the work, and the predicate tells the application whether the command can currently be executed. If a button is disabled (the CanExecute function returns false) and the state of the CanExecute function changes, the CanExecuteChanged event lets the button know.
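The interplay between the action and the predicate can be tried out in a console application. The following is a sketch under one stated assumption: SimpleDelegateCommand is a hypothetical variant that raises CanExecuteChanged by hand rather than through the CommandManager class, so that it runs without WPF (the ICommand interface itself is available outside WPF in recent .NET versions):

```csharp
using System;
using System.Windows.Input;

// Hypothetical, simplified variant of DelegateCommand for illustration:
// CanExecuteChanged is raised manually instead of via CommandManager.
public class SimpleDelegateCommand : ICommand
{
    private readonly Predicate<object> _canExecute;
    private readonly Action<object> _execute;

    public SimpleDelegateCommand(Action<object> execute, Predicate<object> canExecute = null)
    {
        _execute = execute;
        _canExecute = canExecute;
    }

    public event EventHandler CanExecuteChanged;

    public bool CanExecute(object parameter)
    {
        // No predicate means the command is always available.
        return _canExecute == null || _canExecute(parameter);
    }

    public void Execute(object parameter)
    {
        _execute(parameter);
    }

    public void RaiseCanExecuteChanged()
    {
        CanExecuteChanged?.Invoke(this, EventArgs.Empty);
    }
}

public static class DelegateCommandDemo
{
    public static void Main()
    {
        bool imageLoaded = false;

        var browse = new SimpleDelegateCommand(_ => Console.WriteLine("Browsing..."));
        var detect = new SimpleDelegateCommand(
            _ => Console.WriteLine("Detecting..."),
            _ => imageLoaded);

        // A bound button would listen to this event and query CanExecute again.
        detect.CanExecuteChanged += (sender, e) =>
            Console.WriteLine("Detect enabled: " + detect.CanExecute(null));

        Console.WriteLine("Browse enabled: " + browse.CanExecute(null));
        Console.WriteLine("Detect enabled: " + detect.CanExecute(null));

        imageLoaded = true;
        detect.RaiseCanExecuteChanged();

        if (detect.CanExecute(null)) detect.Execute(null);
    }
}
```

In the WPF version shown earlier, hooking into CommandManager.RequerySuggested means the framework re-evaluates the predicate for us, so no manual call such as RaiseCanExecuteChanged is needed.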

With that in place, you should be able to compile and run the application, so go on and try that. You will notice that the application does not actually do anything or present any data yet, but we have an excellent starting point.

Before we do anything else with the code, we are going to export the project as a template. This is so that we do not have to redo all these steps for each small sample project we create:

Replace namespace names with substitute parameters:

1. In all the .cs files, replace the namespace name with $safeprojectname$ .

2. In all the .xaml files, replace the project name with $safeprojectname$ where applicable (typically class name and namespace declarations).

Navigate to File | Export Template. This will open the Export Template wizard, as shown in the following screenshot:

Click on the Project Template button. Select the project we just created and click on the Next button.
Just leave the icon and preview image empty. Enter a recognizable name and description. Click on the Finish button:

The template is now exported to a zip file and stored in the specified location.

By default, the template will be imported into Visual Studio again. We are going to test that it works immediately by creating a project for this chapter. So go ahead and create a new project, selecting the template we just created. The template should be listed in the Visual C# section of the installed templates list. Call the project Chapter1 or something else if you prefer. Make sure it compiles and you are able to run it before we move to the next step.
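For reference, after the namespace substitution step described earlier, a file stored in the template might look like the following sketch. It is a template fragment rather than compilable code, because Visual Studio only replaces $safeprojectname$ when a new project is created from the template. The ViewModel folder is taken from the namespace used in MainView.xaml; the using directive needed for ObservableObject depends on where you placed that file and is left out here:

```csharp
// MainViewModel.cs as stored in the exported template (illustrative fragment).
// $safeprojectname$ becomes, for example, Chapter1 in a project of that name.
namespace $safeprojectname$.ViewModel
{
    public class MainViewModel : ObservableObject
    {
    }
}
```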

Detecting faces with the Face API

With the newly created project, we will now try our first API, the Face API. We will not be doing a whole lot, but we will see how simple it is to detect faces in images.

The steps we need to cover to do this are as follows:

Register for a Face API preview subscription at Microsoft Azure.
Add the necessary NuGet packages to our project.
Add some UI to the application.
Detect faces on command.

Head over to https://portal.azure.com to start the process of registering for a free subscription to the Face API. You will be taken to a login page. Log on with your Microsoft account, or if you do not have one, register for one.

Once logged in, you will need to add a new resource by clicking on + New in the left-hand side menu. Search for Face API and select the first entry:

Enter a name and select the subscription, location, and pricing tier. At the time of writing, there are two pricing options, one free and one paid:

Once created, you can go into the newly created resource. You will need one of the two available API keys. These can be found in the Keys option of the resource menu:

This is where we will be creating all of our API resources throughout this book. You can choose to create everything now or when we come to the respective chapters.

Some of the APIs that we will cover have their own NuGet packages. Whenever this is the case, we will utilize those packages to do the operations we want to do. What all the APIs have in common is that they are REST APIs, which means that in practice you can use them with whichever language you want. For those APIs that do not have their own NuGet package, we call the APIs directly through HTTP.
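To give an idea of what calling an API directly through HTTP involves, the following sketch builds (but does not send) a face detection request using HttpClient types. It assumes the West Europe root URI used as an example later in this chapter, the detect path and query parameters of the Face API REST interface, and the Ocp-Apim-Subscription-Key header for the API key. The FaceRestSketch class and BuildDetectRequest function are hypothetical names:

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;

public static class FaceRestSketch
{
    // Builds a POST request for face detection; sending it is left to the caller.
    public static HttpRequestMessage BuildDetectRequest(string rootUri, string apiKey, byte[] imageBytes)
    {
        string url = rootUri.TrimEnd('/') + "/detect?returnFaceId=true&returnFaceAttributes=age";

        var request = new HttpRequestMessage(HttpMethod.Post, url);

        // The API key travels in a request header.
        request.Headers.Add("Ocp-Apim-Subscription-Key", apiKey);

        // The image is sent as raw bytes in the request body.
        request.Content = new ByteArrayContent(imageBytes);
        request.Content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");

        return request;
    }

    public static void Main()
    {
        HttpRequestMessage request = BuildDetectRequest(
            "https://westeurope.api.cognitive.microsoft.com/face/v1.0",
            "YOUR_API_KEY_HERE",
            new byte[0]);

        Console.WriteLine(request.Method + " " + request.RequestUri);

        // To call the service, pass the request to HttpClient.SendAsync
        // and parse the JSON in the response body.
    }
}
```

The NuGet packages wrap this kind of request construction and the parsing of the JSON response, which is why we prefer them where they exist.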

For the Face API we are using now, a NuGet package does exist, so we need to add that to our project. Head over to the NuGet Package Manager option for the project we created earlier. In the Browse tab, search for the Microsoft.ProjectOxford.Face package and install the package from Microsoft:

As you will notice, another package will also be installed. This is the Newtonsoft.Json package, which is required by the Face API.

The next step is to add some UI to our application. We will be adding this in the MainView.xaml file. Open this file; it should already contain the template code we created earlier. This means that we have a datacontext and can make bindings for our elements, which we will define now.

First we add a grid and define some rows for the grid as follows:

    <Grid> 
        <Grid.RowDefinitions> 
            <RowDefinition Height="*" /> 
            <RowDefinition Height="20" /> 
            <RowDefinition Height="30" /> 
        </Grid.RowDefinitions>

Three rows are defined. The first is a row where we will have an image. The second is a line for the status message, and the last is where we will place some buttons.

Next we add our image element as follows:

        <Image x:Name="FaceImage" Stretch="Uniform" Source=
            "{Binding ImageSource}" Grid.Row="0" />

We have given it a unique name. By setting the Stretch parameter to Uniform, we ensure that the image keeps its aspect ratio. Further on, we place this element in the first row. Last, we bind the image source to a BitmapImage in the ViewModel, which we will look at in a bit.

The next row will contain a text block with some status text. The Text property will be bound to a string property in the ViewModel as follows:

        <TextBlock x:Name="StatusTextBlock"
            Text="{Binding StatusText}" Grid.Row="1" />

The last row will contain one button to browse for an image and one button to be able to detect faces. The command properties of both buttons will be bound to the DelegateCommand properties in the ViewModel as follows:

        <Button x:Name="BrowseButton" 
                  Content="Browse" Height="20" Width="140" 
                  HorizontalAlignment="Left" 
                  Command="{Binding BrowseButtonCommand}" 
                  Margin="5, 0, 0, 5" Grid.Row="2" /> 
 
        <Button x:Name="DetectFaceButton" 
                  Content="Detect face" Height="20" Width="140" 
                  HorizontalAlignment="Right" 
                  Command="{Binding DetectFaceCommand}" 
                  Margin="0, 0, 5, 5" Grid.Row="2" />

With the View in place, make sure the code compiles and run it. This should present you with the following UI:

The last part is to create the binding properties in our ViewModel and make the buttons execute something. Open the MainViewModel.cs file. The class should already inherit from the ObservableObject class. First we define two variables as follows:

    private string _filePath; 
    private IFaceServiceClient _faceServiceClient;

The string variable will hold the path to our image, while the IFaceServiceClient variable is our interface to the Face API. Next, we define two properties as follows:

    private BitmapImage _imageSource; 
    public BitmapImage ImageSource 
    { 
        get { return _imageSource; } 
        set 
        { 
            _imageSource = value; 
            RaisePropertyChangedEvent("ImageSource"); 
        } 
    } 
 
    private string _statusText; 
    public string StatusText 
    { 
        get { return _statusText; } 
        set 
        { 
           _statusText = value; 
           RaisePropertyChangedEvent("StatusText"); 
        } 
    }

What we have here is a property for the BitmapImage, mapped to the Image element in the View. We also have a string property for the status text, mapped to the text block element in the View. As you also may notice, when either of the properties is set, we call the RaisePropertyChangedEvent function. This will ensure that the UI updates when either property has new values.
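Passing the property name as a string works, but a typo in that string fails silently. As an optional alternative, which is not the approach used in this book, the following sketch lets the compiler supply the name through the CallerMemberName attribute and skips the event when the value has not changed. ObservableObjectEx, SetProperty, and StatusViewModel are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Runtime.CompilerServices;

public class ObservableObjectEx : INotifyPropertyChanged
{
    public event PropertyChangedEventHandler PropertyChanged;

    // Hypothetical helper: stores the value and raises the event only on a real change.
    // The compiler fills in propertyName with the name of the calling property.
    protected bool SetProperty<T>(ref T field, T value, [CallerMemberName] string propertyName = null)
    {
        if (EqualityComparer<T>.Default.Equals(field, value)) return false;

        field = value;
        PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(propertyName));
        return true;
    }
}

public class StatusViewModel : ObservableObjectEx
{
    private string _statusText;
    public string StatusText
    {
        get { return _statusText; }
        set { SetProperty(ref _statusText, value); }
    }
}

public static class SetPropertyDemo
{
    // Returns the property names for which the event was raised.
    public static List<string> Run()
    {
        var raised = new List<string>();
        var viewModel = new StatusViewModel();
        viewModel.PropertyChanged += (sender, e) => raised.Add(e.PropertyName);

        viewModel.StatusText = "Status: Waiting for image...";
        viewModel.StatusText = "Status: Waiting for image..."; // same value, no event
        viewModel.StatusText = "Status: Image loaded...";

        return raised;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Run()));
    }
}
```

The explicit string version used in the rest of this chapter behaves the same from the point of view of the UI; the difference is only in how the name is supplied.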

Next we define our two DelegateCommand objects, and do some initialization through the constructor:

    public ICommand BrowseButtonCommand { get; private set; } 
    public ICommand DetectFaceCommand { get; private set; } 
 
    public MainViewModel() 
    { 
        StatusText = "Status: Waiting for image..."; 
 
        _faceServiceClient = new FaceServiceClient("YOUR_API_KEY_HERE", "ROOT_URI"); 
 
        BrowseButtonCommand = new DelegateCommand(Browse); 
        DetectFaceCommand = new DelegateCommand(DetectFace, CanDetectFace); 
    }

The properties for the commands are both public to get but private to set. This means we can only set them from within the ViewModel. In our constructor, we start off by setting the status text. Next we create an object of the Face API, which needs to be created with the API key we got earlier. In addition, it needs to specify the root URI, pointing at the location of the service. It can, for instance, be https://westeurope.api.cognitive.microsoft.com/face/v1.0 if the service is located in West Europe. If the service is located in the West US, you would replace westeurope with westus. The root URI can be found in the following place in the Azure Portal:

Finally, we create the DelegateCommand objects for our command properties. Notice how the browse command does not specify a predicate. This means it will always be possible to click on the corresponding button. To make this compile, we need to create the functions specified in the DelegateCommand constructors: the Browse, DetectFace, and CanDetectFace functions.

We start the Browse function by creating an OpenFileDialog object. This dialog is assigned a filter for JPEG images and then opened. When the dialog is closed, we check the result. If the dialog was cancelled, we simply stop further execution:

    private void Browse(object obj) 
    { 
        var openDialog = new Microsoft.Win32.OpenFileDialog(); 
 
        openDialog.Filter = "JPEG Image(*.jpg)|*.jpg"; 
        bool? result = openDialog.ShowDialog(); 
 
        if (result != true) return;

With the dialog closed, we grab the filename of the file selected and create a new URI from it:

        _filePath = openDialog.FileName; 
        Uri fileUri = new Uri(_filePath);

With the newly created URI, we want to create a new BitmapImage. We specify that it should use no cache, and we set its URI source to the URI we created. These properties only take effect when set between the BeginInit and EndInit calls:

        BitmapImage image = new BitmapImage(); 
 
        image.BeginInit(); 
        image.CacheOption = BitmapCacheOption.None; 
        image.UriSource = fileUri; 
        image.EndInit();

The last step we take is to assign the bitmap image to our BitmapImage property, so the image is shown in the UI. We also update the status text to let the user know the image has been loaded:

        ImageSource = image; 
        StatusText = "Status: Image loaded..."; 
    }

The CanDetectFace function checks whether or not the DetectFaceButton button should be enabled. In this case, it checks whether our image property actually has a URI. If it does, that means we have an image, and we should be able to detect faces:

    private bool CanDetectFace(object obj) 
    { 
        return !string.IsNullOrEmpty(ImageSource?.UriSource?.ToString()); 
    }

Our DetectFace method calls an async method to upload and detect faces. The return value is an array of FaceRectangle objects. This array contains the rectangle area for all face positions in the given image. We will look into the function that we are going to call in a bit.

After the call has finished executing, we print a line with the number of faces to the debug console window as follows:

    private async void DetectFace(object obj) 
    { 
        FaceRectangle[] faceRects = await UploadAndDetectFacesAsync(); 
 
        string textToSpeak = "No faces detected"; 
 
        if (faceRects.Length == 1) 
            textToSpeak = "1 face detected"; 
        else if (faceRects.Length > 1) 
            textToSpeak = $"{faceRects.Length} faces detected"; 
 
        Debug.WriteLine(textToSpeak); 
    }
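Since the same message will later be spoken back to the user, it can be convenient to keep the wording in one small function that is easy to check on its own. The following sketch does this; the FaceCountText class and Describe function are hypothetical names, and the strings are the ones used in DetectFace:

```csharp
using System;

public static class FaceCountText
{
    // Maps a number of detected faces to the message used in DetectFace.
    public static string Describe(int faceCount)
    {
        if (faceCount == 1) return "1 face detected";
        if (faceCount > 1) return $"{faceCount} faces detected";
        return "No faces detected";
    }

    public static void Main()
    {
        Console.WriteLine(Describe(0));
        Console.WriteLine(Describe(1));
        Console.WriteLine(Describe(3));
    }
}
```

With such a helper, DetectFace would only need to pass faceRects.Length to it.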

In the UploadAndDetectFacesAsync function, we create a Stream from the image. This stream will be used as input for the actual call to the Face API service:

    private async Task<FaceRectangle[]> UploadAndDetectFacesAsync() 
    { 
        StatusText = "Status: Detecting faces..."; 
 
        try 
        { 
            using (Stream imageFileStream = File.OpenRead(_filePath))
            {

The following line is the actual call to the detection endpoint for the Face API:

            Face[] faces = await _faceServiceClient.DetectAsync(imageFileStream, true, true, new List<FaceAttributeType>() { FaceAttributeType.Age });

The first parameter is the file stream we created in the previous step. The rest of the parameters are all optional. The second parameter should be true if you want to get a face ID. The next parameter specifies whether you want to receive face landmarks or not. The last parameter takes a list of facial attributes you may want to receive. In our case, we want the age attribute to be returned, so we need to specify that.

The return type of this function call is an array of faces, with all the parameters you have specified:

            List<double> ages = faces.Select(face => face.FaceAttributes.Age).ToList(); 
            FaceRectangle[] faceRects = faces.Select(face => face.FaceRectangle).ToArray(); 
 
            StatusText = "Status: Finished detecting faces..."; 
 
            foreach(var age in ages) { 
                Console.WriteLine(age); 
            } 
            return faceRects; 
        } 
    }

The first line iterates over all faces and retrieves the approximate age of all faces. This is later printed to the debug console window, in the foreach loop.

The second line iterates over all faces and retrieves the face rectangle, with the rectangular location of all faces. This is the data we return to the calling function.

Add a catch clause to finish the method. If an exception is thrown in our API call, we catch it, show the error message, and return an empty FaceRectangle array.
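The shape of such a catch clause can be sketched independently of the Face API. In the following minimal example, the detection call is replaced by a hypothetical delegate and the status text by a callback, so the SafeDetectSketch class and the DetectOrEmptyAsync function are illustrative names only. Catching the general Exception type is a simplification; in the real method you may prefer the exception type thrown by the Face API client:

```csharp
using System;
using System.Threading.Tasks;

public static class SafeDetectSketch
{
    // Runs the detection delegate; on failure, reports the message
    // and returns an empty array instead of letting the exception escape.
    public static async Task<T[]> DetectOrEmptyAsync<T>(Func<Task<T[]>> detect, Action<string> showStatus)
    {
        try
        {
            return await detect();
        }
        catch (Exception ex)
        {
            showStatus($"Status: {ex.Message}");
            return new T[0];
        }
    }

    public static void Main()
    {
        string status = null;

        // Stand-in for a failing service call.
        Func<Task<int[]>> failing = () => { throw new InvalidOperationException("Service unavailable"); };

        int[] result = DetectOrEmptyAsync(failing, text => status = text).GetAwaiter().GetResult();

        Console.WriteLine(result.Length);
        Console.WriteLine(status);
    }
}
```

Returning an empty array means the calling code can treat a failure like an image without faces, while the status text tells the user what went wrong.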

With that code in place, you should now be able to run the full example. The end result will look like the following screenshot:

The resulting debug console window will print the following text:

    1 face detected 
    23,7

An overview of what we are dealing with

Now that you have seen a basic example of how to detect faces, it is time to learn a bit about what else Cognitive Services can do for you. When using Cognitive Services, you have 29 different APIs at hand. These are, in turn, separated into five top-level domains according to what they do. They are vision, speech, language, knowledge, and search. Let's learn more about them in the following sections.

Vision

APIs under the Vision flag allow your apps to understand images and video content. They allow you to retrieve information about faces, feelings, and other visual content. You can stabilize videos and recognize celebrities. You can read text in images and generate thumbnails from videos and images.

There are seven APIs contained in the Vision area, which we will look at now.

Computer Vision

Using the Computer Vision API, you can retrieve actionable information from images. This means that you can identify content (such as image format, image size, colors, faces, and more). You can detect whether or not an image is adult/racy. This API can recognize text in images and extract it to machine-readable words. It can detect celebrities from a variety of areas. Lastly it can generate storage-efficient thumbnails with smart cropping functionality.

We will look into Computer Vision in Chapter 2, Analyzing Images to Recognize a Face.

Emotion

The Emotion API allows you to recognize emotions, both in images and in videos. This can allow for more personalized experiences in applications. Emotions detected are cross-cultural emotions: anger, contempt, disgust, fear, happiness, neutral, sadness, and surprise.

We will cover Emotion API over two chapters: Chapter 2, Analyzing Images to Recognize a Face, for image-based emotions, and Chapter 3, Analyzing Videos, for video-based emotions.

Face

We have already seen a very basic example of what the Face API can do. The rest of the API revolves around detecting, identifying, organizing, and tagging faces in photos. Apart from face detection, you can see how likely it is that two faces belong to the same person. You can identify faces and also find similar-looking faces.

We will dive further into Face API in Chapter 2, Analyzing Images to Recognize a Face.

Video

The Video API is about the analyzing, editing, and processing of videos in your app. If you have a video that is shaky, the API allows you to stabilize it. You can detect and track faces in videos. If a video contains a stationary background, you can detect motion. The API lets you generate thumbnail summaries for videos, which allows users to see previews or snapshots quickly.

Video will be covered in Chapter 3, Analyzing Videos.

Video Indexer

Using the Video Indexer API, one can start indexing videos immediately upon upload. This means you can get video insights without using experts or custom code. Content discovery can be improved, utilizing the powerful artificial intelligence of this API. This allows you to make your content more discoverable.

Video indexer will be covered in Chapter 3, Analyzing Videos.

Content Moderator

The Content Moderator API utilizes machine learning to automatically moderate content. It can detect potentially offensive and unwanted images, videos, and text for over 100 languages. In addition, it allows you to review detected material to improve the service.

Content Moderator will be covered in Chapter 2, Analyzing Images to Recognize a Face.

Custom Vision Service

Custom Vision Service allows you to upload your own labeled images to a vision service. This means that you can add images that are specific to your domain to allow recognition using the Computer Vision API.

Custom Vision Service is not covered in this book.

Speech

Adding one of the Speech APIs allows your application to hear and speak to your users. The APIs can filter noise and identify speakers. Based on the recognized intent, they can drive further actions in your application.

Speech contains four APIs, which are discussed as follows.

Bing Speech

Adding the Bing Speech API to your application allows you to convert speech to text and vice versa. You can convert spoken audio to text, either by utilizing a microphone or other sources in real time or by converting audio from files. The API also offers speech intent recognition, which is trained by Language Understanding Intelligent Service (LUIS) to understand the intent.

Speaker Recognition

The Speaker Recognition API gives your application the ability to know who is talking. By using this API, you can verify that the person speaking is who they claim to be. You can also determine who an unknown speaker is based on a group of selected speakers.

Custom Recognition

To improve speech recognition, you can use the Custom Recognition API. This allows you to fine-tune speech recognition operations for anyone, anywhere. By using this API, the speech recognition model can be tailored to the vocabulary and speaking style of the user. In addition, the model can be customized to match the expected environment of the application.

Translator Speech API

The Translator Speech API is a cloud-based automatic translation service for spoken audio. Using this API, you can add end-to-end translation across web apps, mobile apps, and desktop applications. Depending on your use cases, it can provide you with partial translations, full translations, and transcripts of the translations.

We will cover all speech-related APIs in Chapter 5, Speak with Your Application.

Language

APIs related to language allow your application to process natural language and learn how to recognize what users want. You can add textual and linguistic analysis to your application, as well as natural language understanding.

The following six APIs can be found in the Language area.

Bing Spell Check

The Bing Spell Check API allows you to add advanced spell checking to your application.

This API will be covered in Chapter 6, Understanding Text.

Language Understanding Intelligent Service (LUIS)

LUIS is an API that can help your application understand commands from your users. Using this API, you can create language models that understand intents. By using models from Bing and Cortana, you can make these models recognize common requests and entities (such as places, times, and numbers). You can add conversational intelligence to your applications.

LUIS will be covered in Chapter 4, Let Applications Understand Commands.

Linguistic Analysis

The Linguistic Analysis API lets you parse complex text to explore the structure of text. By using this API, you can find nouns, verbs, and more in text, which allows your application to understand who is doing what to whom.

We will see more of Linguistic Analysis in Chapter 6, Understanding Text.

Text Analysis

The Text Analysis API will help you in extracting information from text. You can find the sentiment of a text (whether the text is positive or negative). You will be able to detect language, topic, and key phrases used throughout the text.

We will also cover Text Analysis in Chapter 6, Understanding Text.

Web Language Model

By using the Web Language Model (WebLM) API, you are able to leverage the power of language models trained on web-scale data. You can use this API to predict which words or sequences follow a given sequence or word.

Web Language Model API will be covered in Chapter 6, Understanding Text.

Translator Text API

By adding the Translator Text API, you can get textual translations for over 60 languages. It can detect languages automatically, and you can customize the API to your needs. In addition, you can improve translations by creating user groups, utilizing the power of crowd-sourcing.

Translator Text API will not be covered in this book.

Knowledge

When talking about Knowledge APIs, we are talking about APIs that allow you to tap into rich knowledge. This may be knowledge from the web. It may be from academia or it may be your own data. Using these APIs, you will be able to explore different nuances of knowledge.

The following six APIs are contained in the Knowledge API area.

Academic

Using the Academic API, you can explore relationships among academic papers, journals, and authors. This API allows you to interpret natural language user query strings, which allow your application to anticipate what the user is typing. It will evaluate said expression and return academic knowledge entities.

This API will be covered more in Chapter 8, Query Structured Data in a Natural Way.

Entity Linking

Entity Linking is the API you would use to extend knowledge of people, places, and events based on the context. As you may know, a single word may be used differently based on the context. Using this API allows you to recognize and identify each separate entity within a paragraph, based on the context.

We will go through Entity Linking API in Chapter 7, Extending Knowledge Based on Context.

Knowledge Exploration

The Knowledge Exploration API will let you add the possibility of using interactive search for structured data in your projects. It interprets natural language queries and offers auto-completions to minimize user effort. Based on the query expression received, it will retrieve detailed information about matching objects.

Details on this API will be covered in Chapter 8, Query Structured Data in a Natural Way.

Recommendations

The Recommendations API allows you to provide personalized product recommendations for your customers. You can use this API to add a frequently bought together functionality to your application. Another feature you can add is item-to-item recommendations, which allows customers to see what other customers like. This API will also allow you to add recommendations based on the prior activity of the customer.

We will go through this API in Chapter 7, Extending Knowledge Based on Context.

QnA Maker

The QnA Maker is a service to distill information for Frequently Asked Questions (FAQ). Using existing FAQs, either online or per document, you can create question and answer pairs. Pairs can be edited, removed, and modified, and you can add several similar questions to match a given pair.

We will cover QnA Maker in Chapter 8, Query Structured Data in a Natural Way.

Custom Decision Service

Custom Decision Service is a service designed to use reinforcement learning to personalize content. The service takes the context into account and can provide context-based content.

This book does not cover Custom Decision Service.

Search

Search APIs give you the ability to make your applications more intelligent with the power of Bing. Using these APIs, you can use a single call to access data from billions of web pages, images, videos, and news.

The following six APIs are in the search domain.

Bing Web Search

With Bing Web Search, you can search for details in billions of web documents indexed by Bing. All the results can be arranged and ordered according to a layout you specify, and the results are customized to the location of the end user.

Bing Image Search

Using the Bing Image Search API, you can add an advanced image and metadata search to your application. Results include URLs to images, thumbnails, and metadata. You will also be able to get machine-generated captions, similar images, and more. This API allows you to filter the results based on image type, layout, freshness (how new the image is), and license.

Bing Video Search

Bing Video Search will allow you to search for videos and return rich results. The results contain metadata from the videos, static or motion-based thumbnails, and the video itself. You can add filters to the result based on freshness, video length, resolution, and price.

Bing News Search

If you add Bing News Search to your application, you can search for news articles. Results can include authoritative images, related news and categories, information on the provider, URL, and more. To be more specific, you can filter news based on topics.

Bing Autosuggest

The Bing Autosuggest API is a small, but powerful one. It will allow your users to search faster with search suggestions, allowing you to connect a powerful search to your apps.

The preceding Search APIs will be covered in Chapter 9, Adding Specialized Search.

Bing Entity Search

Using the Bing Entity Search API, you can enhance your searches. The API will find the most relevant entity based on your search terms. It will find entities such as famous people, places, movies, and more.

We will not cover Bing Entity Search in this book.

Getting feedback on detected faces

Now that we have seen what else Microsoft Cognitive Services can offer, we are going to add an API to our face detection application. Through this part, we will add the Bing Speech API to make the application say the number of faces out loud.

This feature of the API is not provided in the NuGet package, and as such we are going to use the REST API.

To reach our end goal, we are going to add two new classes, TextToSpeak and Authentication. The first class will be in charge of generating correct headers and making the calls to our service endpoint. The latter class will be in charge of generating an authentication token. This will be tied together in our ViewModel, where we will make the application speak back to us.

We need to get our hands on an API key first. Head over to the Microsoft Azure Portal. Create a new service for Bing Speech.

To be able to call the Bing Speech API, we need to have an authorization token. Go back to Visual Studio and create a new file called Authentication.cs. Place this in the Model folder.

We need to add two new references to the project. Find the System.Runtime.Serialization and System.Web assemblies in the Assemblies tab of the Add References window and add them.

In our Authentication class, define four private variables, one private constant, and one public property as follows:

    private string _clientSecret; 
    private string _requestDetails; 
    private string _token; 
    private Timer _tokenRenewer; 
 
    // Minutes to wait before the token is fetched again.
    private const int TokenRefreshInterval = 9; 
 
    public string Token { get { return _token; } }

The constructor should accept one string parameter, clientSecret. The clientSecret parameter is the API key you signed up for.

In the constructor, assign the _clientSecret variable as follows:

    _clientSecret = clientSecret;

Create a new function called Initialize as follows:

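    public async Task Initialize()
    {
        // GetToken returns a Task<string>, so it must be awaited.
        _token = await GetToken(); 
 
        // One-shot timer: the period of -1 milliseconds disables periodic signaling.
        _tokenRenewer = new Timer(new TimerCallback(OnTokenExpiredCallback), this, 
            TimeSpan.FromMinutes(TokenRefreshInterval), 
            TimeSpan.FromMilliseconds(-1));
    }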

We first fetch the access token, in a method we will create shortly.

Finally, we create our timer, which will call the callback function in 9 minutes. The callback function will need to fetch the access token again and assign it to the _token variable. It also needs to ensure that the timer runs again in 9 minutes.
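The callback itself is not listed in the text, so the following is only a minimal sketch of how such a renewal loop could look. The class name RenewingTokenSketch, the injected fetchToken delegate (standing in for GetToken), and the configurable interval are assumptions made to keep the example self-contained; they are not part of the chapter's Authentication class.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the Authentication class; only the renewal logic is shown.
public class RenewingTokenSketch : IDisposable
{
    private readonly Func<Task<string>> _fetchToken; // assumption: wraps GetToken
    private readonly TimeSpan _interval;             // 9 minutes in the chapter
    private Timer _tokenRenewer;
    private string _token;

    public string Token { get { return _token; } }

    public RenewingTokenSketch(Func<Task<string>> fetchToken, TimeSpan interval)
    {
        _fetchToken = fetchToken;
        _interval = interval;
    }

    public async Task Initialize()
    {
        _token = await _fetchToken();

        // One-shot timer: an infinite period means it fires once and then stops.
        _tokenRenewer = new Timer(OnTokenExpiredCallback, this,
            _interval, Timeout.InfiniteTimeSpan);
    }

    private async void OnTokenExpiredCallback(object state)
    {
        try
        {
            _token = await _fetchToken();
        }
        catch (Exception e)
        {
            // Keep the old token and try again on the next tick.
            Console.WriteLine("Token renewal failed: " + e.Message);
        }

        try
        {
            // Re-arm the one-shot timer so the token is renewed again later.
            _tokenRenewer?.Change(_interval, Timeout.InfiniteTimeSpan);
        }
        catch (ObjectDisposedException)
        {
            // The owner was disposed while the renewal was in flight.
        }
    }

    public void Dispose()
    {
        _tokenRenewer?.Dispose();
    }
}
```

Re-arming a one-shot timer, rather than using a periodic one, means a slow token request can never overlap with the next renewal.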

Next, we need to create the GetToken method. This method should return a Task<string> object, and it should be declared as private and marked as async.

In the method, we start by creating an HttpClient object, pointing to an endpoint that will generate our token. We specify the root endpoint and add the token issue path as follows:

    using(var client = new HttpClient())
    {
        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", _clientSecret);
        UriBuilder uriBuilder = new UriBuilder("https://api.cognitive.microsoft.com/sts/v1.0");
        // Append to the root path; assigning Path directly would replace /sts/v1.0.
        uriBuilder.Path += "/issueToken";

We then go on to make a POST call to generate a token as follows:

        var result = await client.PostAsync(uriBuilder.Uri.AbsoluteUri, null);

When the request has been sent, we expect there to be a response. We want to read this response and return the response string:

        return await result.Content.ReadAsStringAsync();
    }
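As a hedged variation on the steps above, the token request can also be built as an explicit HttpRequestMessage, which makes the header and the endpoint easy to inspect. The names TokenRequestSketch, BuildTokenRequest, and GetTokenAsync are hypothetical, and the call to EnsureSuccessStatusCode is an addition that the chapter's code does not make; it is included on the assumption that you would rather see an exception than treat an error body as a token.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper; not part of the chapter's Authentication class.
public static class TokenRequestSketch
{
    private const string IssueTokenUri =
        "https://api.cognitive.microsoft.com/sts/v1.0/issueToken";

    // Builds the POST request carrying the subscription key header.
    public static HttpRequestMessage BuildTokenRequest(string clientSecret)
    {
        var request = new HttpRequestMessage(HttpMethod.Post, IssueTokenUri);
        request.Headers.Add("Ocp-Apim-Subscription-Key", clientSecret);
        return request;
    }

    // Sends the request and returns the response body, which is the token.
    public static async Task<string> GetTokenAsync(HttpClient client, string clientSecret)
    {
        using (var request = BuildTokenRequest(clientSecret))
        using (var response = await client.SendAsync(request))
        {
            // Assumption: fail loudly on a non-success status, for example an invalid key.
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

Passing the HttpClient in, instead of creating one per call, also lets the caller reuse a single client across token renewals.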

Add a new file called TextToSpeak.cs if you have not already done so. Put this file in the Model folder.

Beneath the newly created class (but inside the namespace), we want to add two event argument classes. These will be used to handle audio events, which we will see later.

The AudioEventArgs class simply takes a generic stream, and you can imagine it being used to send the audio stream to our application:

    public class AudioEventArgs : EventArgs 
    { 
        public AudioEventArgs(Stream eventData) 
        { 
            EventData = eventData; 
        } 
 
        public Stream EventData { get; private set; }  
    }

The next class allows us to send an event with a specific error message:

    public class AudioErrorEventArgs : EventArgs 
    { 
        public AudioErrorEventArgs(string message) 
        { 
            ErrorMessage = message; 
        } 
 
        public string ErrorMessage { get; private set; } 
    }

We move on to start on the TextToSpeak class, where we start off by declaring some events and class members as follows:

    public class TextToSpeak 
    { 
        public event EventHandler<AudioEventArgs> OnAudioAvailable; 
        public event EventHandler<AudioErrorEventArgs> OnError; 
 
        private string _gender; 
        private string _voiceName; 
        private string _outputFormat; 
        private string _authorizationToken; 
        // Holds the raw token string read from Authentication.Token.
        private string _token;  
 
        private List<KeyValuePair<string, string>> _headers = new List<KeyValuePair<string, string>>();

The first two lines in the class are events using the event argument classes we created earlier. These events will be triggered if a call to the API finishes, and returns some audio, or if anything fails. The next few lines are string variables, which we will use as input parameters. We have one line to contain our access token information. The last line creates a new list, which we will use to hold our request headers.

We add two constant strings to our class as follows:

    private const string RequestUri = "https://speech.platform.bing.com/synthesize"; 

    // A regular C# string literal cannot span lines, so the template is built from adjacent literals.
    private const string SsmlTemplate = 
        "<speak version='1.0' xml:lang='en-US'>" + 
        "<voice xml:lang='en-US' xml:gender='{0}' name='{1}'>{2}</voice>" + 
        "</speak>";

The first string contains the request URI. That is the REST API endpoint we need to call to execute our request. Next we have a string defining our Speech Synthesis Markup Language (SSML) template. This is where we will specify what the speech service should say, and a bit on how it should say it.

Next we create our constructor as follows:

        public TextToSpeak() 
        { 
            _gender = "Female"; 
            _outputFormat = "riff-16khz-16bit-mono-pcm"; 
            _voiceName = "Microsoft Server Speech Text to Speech Voice (en-US, ZiraRUS)"; 
        }

Here we are just initializing some of our variables declared earlier. As you may see, we are defining the voice to be female and we define it to use a specific voice. In terms of gender, naturally it can be either female or male. In terms of voice name, it can be one of a long list of options. We will look more into the details of that list when we go through this API in a later chapter.

The _outputFormat variable specifies the output format of the audio. This defines the format and codec used by the resulting audio stream. Again, there are a number of options, which we will look into in a later chapter.

Following the constructor, there are three public methods we will create. These will generate an authentication token, generate some HTTP headers, and finally execute our call to the API. Before we create these, you should add two helper methods to be able to raise our events. Call them the RaiseOnAudioAvailable and RaiseOnError methods. They should accept AudioEventArgs and AudioErrorEventArgs as parameters.
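The two helper methods are not listed in the text. A minimal sketch of how they might look is shown here; the class name TextToSpeakEventsSketch is hypothetical, the two event argument classes are repeated only so the example compiles on its own, and the helpers are marked internal (rather than private, which would be the natural choice inside TextToSpeak) so that they can be exercised from outside the class.

```csharp
using System;
using System.IO;

public class AudioEventArgs : EventArgs
{
    public AudioEventArgs(Stream eventData) { EventData = eventData; }
    public Stream EventData { get; private set; }
}

public class AudioErrorEventArgs : EventArgs
{
    public AudioErrorEventArgs(string message) { ErrorMessage = message; }
    public string ErrorMessage { get; private set; }
}

// Hypothetical class showing only the event plumbing of TextToSpeak.
public class TextToSpeakEventsSketch
{
    public event EventHandler<AudioEventArgs> OnAudioAvailable;
    public event EventHandler<AudioErrorEventArgs> OnError;

    internal void RaiseOnAudioAvailable(AudioEventArgs e)
    {
        // The null-conditional call does nothing when there are no subscribers.
        OnAudioAvailable?.Invoke(this, e);
    }

    internal void RaiseOnError(AudioErrorEventArgs e)
    {
        OnError?.Invoke(this, e);
    }
}
```

Raising the events through helpers keeps the null check in one place, so the rest of the class can report audio or errors without caring whether anyone is listening.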

Next, add a new method called the GenerateHeaders method as follows:

        public void GenerateHeaders() 
        { 
            _headers.Add(new KeyValuePair<string, string>("Content-Type", "application/ssml+xml")); 
            _headers.Add(new KeyValuePair<string, string>("X-Microsoft-OutputFormat", _outputFormat)); 
            _headers.Add(new KeyValuePair<string, string>("Authorization", _authorizationToken)); 
            _headers.Add(new KeyValuePair<string, string>("X-Search-AppId", Guid.NewGuid().ToString("N"))); 
            _headers.Add(new KeyValuePair<string, string>("X-Search-ClientID", Guid.NewGuid().ToString("N"))); 
            _headers.Add(new KeyValuePair<string, string>("User-Agent", "Chapter1")); 
        }

Here we add the HTTP headers to our previously created list. These headers are required for the service to respond, and if any is missing, it will yield an HTTP/400 response. What we are using as headers is something we will cover in more detail later. For now, just make sure they are present.

Following this, we want to add a new method called GenerateAuthenticationToken as follows:

        public async Task<bool> GenerateAuthenticationToken(string clientSecret) 
        { 
            Authentication auth = new Authentication(clientSecret);

This method accepts one string parameter, the client secret (your API key). First, we create a new object of the Authentication class, which we looked at earlier. Because the token is fetched asynchronously, the method is marked as async and returns a Task<bool>. We then initialize the authentication object and read the token as follows:

            try 
            { 
                // Fetch the first token and start the renewal timer.
                await auth.Initialize(); 
                _token = auth.Token; 
 
                if (_token != null) 
                { 
                    _authorizationToken = $"Bearer {_token}"; 
 
                    return true; 
                } 
                else 
                { 
                    RaiseOnError(new AudioErrorEventArgs("Failed to generate authentication token.")); 
                    return false; 
                } 
            }

We use the authentication object to retrieve an access token. This token is used in our authorization token string, which, as we saw earlier, is being passed on in our headers. If the application for some reason fails to generate the access token, we trigger an error event.

Finish this method by adding the associated catch clause. If any exceptions occur, we want to raise a new error event.

The last method that we need to create in this class is going to be called the SpeakAsync method. This method will be actually performing the request to the Speech API:

        public Task SpeakAsync(string textToSpeak, CancellationToken cancellationToken) 
        { 
            var cookieContainer = new CookieContainer(); 
            var handler = new HttpClientHandler() { 
                CookieContainer = cookieContainer 
            }; 
            var client = new HttpClient(handler);

The method takes two parameters. The first is the string containing the text we want spoken. The second is a cancellation token, which can be used to signal that the given operation should be cancelled.

When entering the method, we create three objects which we will use to execute the request. These are classes from the .NET library, and we will not be going through them in any more detail.

We generated some headers earlier and we need to add these to our HTTP client. We do this by adding the headers in the following foreach loop, which goes through the entire list:

            foreach(var header in _headers) 
            { 
                client.DefaultRequestHeaders.TryAddWithoutValidation(header.Key, header.Value); 
            }

Next we create an HTTP Request Message, specifying that we will send data through the POST method and specifying the request URI. We also specify the content using the SSML template we created earlier and adding the correct parameters (gender, voice name, and the text we want to be spoken):

            var request = new HttpRequestMessage(HttpMethod.Post, RequestUri) 
            { 
                Content = new StringContent(string.Format(SsmlTemplate, _gender, _voiceName, textToSpeak)) 
            };
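Because the text is inserted into an XML document, characters such as & and < in the text to be spoken would produce invalid SSML. The following sketch shows one way to guard against that, assuming you are free to escape the text with SecurityElement.Escape before formatting. The SsmlSketch class and its Build method are hypothetical names, and the template is repeated only to keep the example self-contained.

```csharp
using System;
using System.Security;

// Hypothetical helper; the chapter formats the template inline instead.
public static class SsmlSketch
{
    private const string SsmlTemplate =
        "<speak version='1.0' xml:lang='en-US'>" +
        "<voice xml:lang='en-US' xml:gender='{0}' name='{1}'>{2}</voice>" +
        "</speak>";

    public static string Build(string gender, string voiceName, string text)
    {
        // Escape &, <, >, and quotes so the spoken text cannot break the XML.
        string safeText = SecurityElement.Escape(text);
        return string.Format(SsmlTemplate, gender, voiceName, safeText);
    }
}
```

For example, SsmlSketch.Build("Female", _voiceName, "Tom & Jerry") would place Tom &amp; Jerry inside the voice element. For the face count spoken in this chapter the text is plain, so escaping only matters once user-supplied text is passed in.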

We use the HTTP client to send the HTTP request asynchronously as follows:

            var httpTask = client.SendAsync(request, HttpCompletionOption.ResponseHeadersRead, cancellationToken);

The following code is a continuation of the asynchronous send call we made previously. This will run asynchronously as well and check the status of the response. If the response is successful, it will read the response message as a stream and trigger the audio event. If everything succeeded, then that stream should contain our text in spoken words:

    var saveTask = httpTask.ContinueWith(async (responseMessage, token) => 
    { 
        try 
        { 
            if (responseMessage.IsCompleted && 
                responseMessage.Result != null &&   
                responseMessage.Result.IsSuccessStatusCode) { 
                // Read the audio as a stream and hand it to the subscribers.
                var httpStream = await responseMessage.Result.Content.ReadAsStreamAsync().ConfigureAwait(false); 
                RaiseOnAudioAvailable(new AudioEventArgs(httpStream)); 
            } else { 
                RaiseOnError(new AudioErrorEventArgs($"Service returned {responseMessage.Result.StatusCode}")); 
            } 
        } 
        catch(Exception e) 
        { 
            RaiseOnError(new AudioErrorEventArgs(e.GetBaseException().Message)); 
        } 

If the response indicates anything other than success, we will raise the error event.

The catch clause raises an error if an exception is caught. We also want to add a finally clause, in which we dispose of all the objects used.

The final code we need is to specify that the continuation task is attached to the parent task. Also, we need to add the cancellation token to this task as well. Go on to add the following code to finish off the method:

    }, TaskContinuationOptions.AttachedToParent, cancellationToken); 
    return saveTask; 
}

With that in place, we are now able to utilize this class in our application, and we are going to do that now. Open the MainViewModel.cs file and declare a new class variable as follows:

        private TextToSpeak _textToSpeak;

Add the following code in the constructor to initialize the newly added object. We also need to call a function to generate the authentication token as follows:

            _textToSpeak = new TextToSpeak(); 
            _textToSpeak.OnAudioAvailable +=  _textToSpeak_OnAudioAvailable; 
            _textToSpeak.OnError += _textToSpeak_OnError; 
 
            GenerateToken();

After we have created the object, we hook up the two events to event handlers. Then we generate an authentication token by creating a function GenerateToken with the following content:

public async void GenerateToken()
{
    if (await _textToSpeak.GenerateAuthenticationToken("BING_SPEECH_API_KEY_HERE"))
        _textToSpeak.GenerateHeaders(); 
}

Here we specify the API key for the Bing Speech API. If the call succeeds, we generate the required HTTP headers.

We need to add the event handlers, so create the method called _textToSpeak_OnError first as follows:

            private void _textToSpeak_OnError(object sender, AudioErrorEventArgs e) 
            { 
                StatusText = $"Status: Audio service failed -  {e.ErrorMessage}"; 
            }

This one is rather simple: it outputs the error message to the user in the status text field.

Next, we need to create a _textToSpeak_OnAudioAvailable method as follows:

        private void _textToSpeak_OnAudioAvailable(object sender, AudioEventArgs e) 
        { 
            SoundPlayer player = new SoundPlayer(e.EventData); 
            player.Play(); 
            e.EventData.Dispose(); 
        }

Here we utilize the SoundPlayer class from the .NET framework. This allows us to add the stream data directly and simply play the message.

The last part we need for everything to work is to make the call to the SpeakAsync method. We can make that by adding the following at the end of our DetectFace method:

    await _textToSpeak.SpeakAsync(textToSpeak, CancellationToken.None);

With that in place, you should now be able to compile and run the application. By loading a photo and clicking on Detect face, you should be able to get the number of faces spoken back to you. Just remember to have audio on!