Challenges and gaps in conversational UI
Conversational UI is still new and, as such, has challenges and gaps that keep it from reaching its full potential. Technology has improved greatly over the years, yet we are still far from HAL 9000, the thinking, feeling computer in the movie 2001: A Space Odyssey that converses freely with the ship's astronaut crew and controls the systems of the Discovery One spacecraft. Then again, we should keep in mind that even HAL had some malfunctions. In this section, I will list the five main challenges that technology and bot designers will have to address in the next few years.
NLU is an AI-hard problem
The more sophisticated, natural, and humanized human-machine interaction becomes, the harder it is to build. While any developer can create a simple text-based command-line interface, a high-quality UI in the form of a chatbot or voicebot requires many experts, including chat and voice designers and NLU specialists, all of whom are very hard to find.
Natural language understanding (NLU) is the attempt to have a machine mimic reading comprehension. It is a subtopic of AI and, as mentioned earlier, it is an AI-hard (or AI-complete) problem. An AI-hard problem is equivalent to solving the central AI problem: making computers as intelligent as people (https://en.wikipedia.org/wiki/AI-complete). Why is it so difficult? As discussed above, the input to a conversational UI can contain an unbounded number of unknown and unexpected features, and there is an equally unbounded number of syntactic and semantic schemes that could be applied to it. This means that when we chat or talk to a bot, just as when we talk to another person, we are unlimited in what we can say. We are not restricted to a specific GUI path: we are free to ask about anything and everything.
One way to tackle the NLU AI-hard issue is to focus and limit the computer's understanding to a specific theme, subject, or use case. When I go to the doctor, I'm probably not going to ask about the return I can expect from investing in the NY stock exchange. When I visit the doctor, I am within a specific context: I don't feel well, I need a new prescription for a medication, and so on. In fact, even within the doctor scenario, there are so many use cases to predefine that it makes sense to break them down into sub-use cases, to help improve our NLU in sub-domain contexts (pediatrics, gynecology, oncology, and so on).
If we go back to our travel example, we can train the NLU layer of our bot to be able to respond to everything related to the booking of flights. In this case, we mimic a possible conversation between the user and a travel agent. While a human travel agent can help us with additional tasks, such as finding a hotel, planning our trip, and more, in this use case we will stay within the context of booking flights to maximize the experience and the responses.
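The idea of keeping the bot inside a single use case can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the FLIGHT_TERMS vocabulary, the function names, and the reply texts are hypothetical, and a real NLU layer would rely on a trained model rather than on keyword overlap.

```python
import re

# Hypothetical vocabulary for the flight-booking use case (an assumption,
# chosen only for illustration; a real bot would use a trained NLU model).
FLIGHT_TERMS = {"flight", "flights", "fly", "ticket", "tickets", "airport"}


def tokenize(utterance):
    """Lowercase the utterance and return its words as a set."""
    return set(re.findall(r"[a-z]+", utterance.lower()))


def in_scope(utterance):
    """Return True if the utterance mentions at least one flight-related term."""
    return bool(tokenize(utterance) & FLIGHT_TERMS)


def respond(utterance):
    """Answer in-scope requests and decline everything else explicitly."""
    if in_scope(utterance):
        return "Sure, let's look at flights."
    return "Sorry, I can only help with booking flights."
```

With this guard in place, a request such as "Find me a hotel in Paris" gets an explicit out-of-scope reply instead of a wrong answer, which mirrors the decision to stay within the context of booking flights.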
A major derivative of the NLU problem is the accuracy level of the conversation. Even when we limit our bot to a specific use case, the need to cover every possible request, in every possible phrasing, makes it very hard to create a good user experience (UX). One 2017 report put the failure rate of Facebook's chatbots at 70% (https://www.fool.com/investing/2017/02/28/facebook-incs-chatbots-hit-a-70-failure-rate.aspx). While users are willing to try an automated system in order to address their needs quickly, they are unforgiving once the system fails to serve them.
How accurately a bot understands its users depends on the number of preconfigured samples it holds. These samples are sentences that users might say to express a request, or intent, which the bot then translates into actions. For every request, there are hundreds of such sentences. For complex requests that also involve many parameters (such as our flight-booking bot example), there are thousands, if not tens of thousands. This remains an unsolved problem and, as a result, many bots today work only within very narrow boundaries and offer their users a poor experience.
The transition from GUI to conversational UI (CUI), and with it to conversational user experience (CUX) and voice user experience (VUX), is a paradigm shift that introduces many challenges. Beyond the unlimited input options discussed above as part of the AI-hard NLU problem, building a conversational UI, and especially a voice UI and UX, raises the challenge of exposing the user to your offering in a screenless environment.
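To make the dependence on sample sentences concrete, here is a minimal sketch of sample-based intent matching. It assumes a tiny hand-written SAMPLES table and uses plain word overlap (Jaccard similarity) as the matching rule. Both are stand-ins chosen for illustration, not how any particular bot platform works.

```python
import re

# Hypothetical sample sentences per intent (assumed for illustration only).
SAMPLES = {
    "book_flight": [
        "book a flight to new york",
        "i want to fly to london",
        "get me a ticket to paris",
    ],
    "cancel_flight": [
        "cancel my flight",
        "i need to cancel my booking",
    ],
}


def tokens(text):
    """Lowercase the text and return its words as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Share of words the two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_intent(utterance, threshold=0.5):
    """Return the intent of the closest sample, or None if nothing is close enough."""
    words = tokens(utterance)
    best_intent, best_score = None, 0.0
    for intent, sentences in SAMPLES.items():
        for sentence in sentences:
            score = jaccard(words, tokens(sentence))
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent if best_score >= threshold else None
```

A request that resembles a stored sample, such as "book a flight to boston", is matched to book_flight. A perfectly reasonable request phrased differently, such as "I'd like to visit my sister in Rome next week", matches nothing and returns None, which is why so many samples are needed.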
When I go to the store, I can see all the items I can choose from and purchase, and I can ask the salesperson for more help. A good salesperson will help me and recommend items in the store that they think I should be made aware of. When I shop online, I can view all the items that are available for me to purchase, search for something specific, and browse through the results. Here, as well, I can get recommendations, sometimes based on my previous purchases, in different graphical forms such as pop-ups or newsletters. Exposing the user to your offering within a text or voice conversational UI is extremely difficult. Just as a conversational UI is limited in nature (focusing on specific use cases, within a certain context), the ways to expose users to what you offer, or to how you can help them, are limited as well.
Many chatbots offer a menu-based interaction, providing options to choose from. This way, the conversation is limited to a specific flow (backed by a state machine), but the added value is that the user can be exposed to additional information. The problem with this solution is that it carries the GUI experience over into the CUI and very often offers very little value.
In the case of voicebots, we often see a "help" section, which provides the user with a list of actions they can perform when talking to the bot. This usually takes the form of an introduction to the application, offering a few examples of what the user can ask. Going back to our flight example, imagine that a user says, "Ok Google, open travel bot." The first response can be, "Welcome to Travel Bot! How can I help you? You can ask me: what is the next flight to NYC from SF?" In addition, voice-enabled devices, such as Amazon Alexa and Google Home, come with an instruction card that gives some examples of questions. The companies also send out a weekly newsletter with new capabilities.
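A menu-based flow of this kind can be sketched as a small state machine. The states, prompts, and option labels below are hypothetical and exist only to illustrate the pattern.

```python
# Hypothetical menu flow: each state has a prompt and numbered options,
# and each option is a (label, next_state) pair. All names are assumptions.
MENU = {
    "start": {
        "prompt": "What would you like to do?",
        "options": {
            "1": ("Book a flight", "book"),
            "2": ("Cancel a booking", "cancel"),
        },
    },
    "book": {"prompt": "Where would you like to fly to?", "options": {}},
    "cancel": {"prompt": "What is your booking reference?", "options": {}},
}


def render(state):
    """Build the message shown to the user for the given state."""
    node = MENU[state]
    lines = [node["prompt"]]
    lines += [f"{key}) {label}" for key, (label, _) in node["options"].items()]
    return "\n".join(lines)


def advance(state, choice):
    """Move to the next state, or stay put if the choice is not on the menu."""
    option = MENU[state]["options"].get(choice.strip())
    return option[1] if option else state
```

Note how anything outside the listed options, such as typing "find a hotel", simply leaves the user in the same state. The flow is predictable, but it behaves like a GUI menu rather than a conversation.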
Non-implicit contextual conversation
I have mentioned a couple of times the need to build contextual conversational UI and UX, and I will dedicate a full chapter (Chapter 3, Building a Killer Conversational App) to this in the book. Because it is a major challenge in today's conversational UI development, I believe that it deserves one more mention in this section.
We expect bots to replace humans, not computers. The conversational UI mimics my interaction with a human, whether through text or voice. Even when we limit the interaction to a specific use case and include all possible sample sentences that could prompt a question, there is one thing that is very difficult to predict within a contextual conversation: non-implicit requests, that is, requests that the user never states explicitly.
If I call my travel agent and excitedly tell her that my daughter's 6th birthday is coming up, she might "do the math" and understand that we are planning a family trip to Disneyland. She will then extract all the parameters needed to complete my request:
- Dates
- Number of people/adults/kids
- Flights
- Hotels for the dates
- Car rental
- Allergies and more…
Even though I haven't explicitly requested her help to plan a trip to Disneyland, the travel agent will be able to connect the dots and respond to my request. Training a machine to do that, that is, to react to non-implicit requests, remains a huge challenge in today's technology stack. However, the good news is that AI technologies and, more specifically, machine learning and deep learning, will become very useful in the next couple of years for tackling this challenge.
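The parameters listed above can be pictured as a set of slots that the bot has to fill before it can act. The sketch below models only that bookkeeping; the TripRequest class and its field names are hypothetical. Inferring the slots from a remark about a birthday is the hard part, and the code does not attempt it.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class TripRequest:
    """Slots a travel agent would need to fill (hypothetical field names)."""
    dates: Optional[str] = None
    travelers: Optional[str] = None
    flights: Optional[str] = None
    hotel: Optional[str] = None
    car_rental: Optional[str] = None
    allergies: Optional[str] = None


def missing_slots(request):
    """Return the names of the slots the bot still has to ask about."""
    return [f.name for f in fields(request) if getattr(request, f.name) is None]
```

For example, if the bot has somehow worked out the dates and the travelers, missing_slots tells it that flights, hotel, car rental, and allergies are still open questions to raise with the user.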
One very controversial aspect of chatbots and voicebots is security and, more specifically, privacy. Today's chatbot and voicebot platforms are controlled by some of the leading corporations, and our data and information become their assets. Google, Amazon, and Facebook have been collecting private data for quite a while (whenever we search the web, purchase items on Amazon, or post something on Facebook), but now those companies "listen" to us outside of the web/app environment as well: they are in our homes and in every private message. In a widely reported 2018 incident, an Amazon Alexa device recorded a couple's private conversation at home and sent it to one of their contacts without their consent.
The "constantly listening" functionality reminds many of George Orwell's 1984 and the telescreen, the Party's monitoring device that was designed to simultaneously broadcast entertainment and listen in on people's conversations to detect dissent. Although Orwell's telescreen was used by a tyranny to control its people, whereas today's solutions are owned by commercial corporations, one cannot help but wonder what the implications of using such devices will be in the future.
Conversational channels controlled by the above corporations have also become a challenge for businesses, which are forced to run their customer interactions through third-party channels. Five years ago, businesses were reluctant to shift their data centers to the cloud. Today, that reluctance has lost much of its meaning, since data is being transferred through additional channels anyway.
This is important for us to understand when we design our chatbots and voicebots. Above all, we should protect our customers' data and, where needed, comply with the relevant country's or state's regulations. We should make sure that we do not ask for sensitive data, such as Social Security numbers (SSNs) or credit card numbers, within the conversation and, for the time being, use complementary ways to collect it, such as rerouting the user to a secure site to complete registration.
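As one possible safeguard, a bot can also mask sensitive data that users type into the conversation unprompted before the message is logged or passed on. The sketch below assumes the US SSN format (NNN-NN-NNNN) and uses the Luhn checksum to spot likely card numbers. The patterns and placeholder texts are illustrative assumptions and are far from exhaustive.

```python
import re

# Illustrative patterns only (assumptions): US-style SSNs and 13 to 19 digit
# card numbers, optionally separated by spaces or hyphens.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number):
    """Check the digits of a candidate card number with the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(message):
    """Mask SSN-like and card-like numbers before the message is stored."""
    message = SSN_PATTERN.sub("[REDACTED SSN]", message)

    def mask_card(match):
        return "[REDACTED CARD]" if luhn_valid(match.group()) else match.group()

    return CARD_PATTERN.sub(mask_card, message)
```

Redaction of this kind complements, rather than replaces, the approach of sending the user to a secure site for anything that involves payment or identity details.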