
From Voice Applications for Alexa and Google Assistant by Dustin Coates

This excerpt from Voice Applications for Alexa and Google Assistant discusses the fundamentals of voice UI design.


Save 37% off Voice Applications for Alexa and Google Assistant. Just enter fcccoates into the discount code box at checkout at manning.com.


What the Computer Says

In his talk at Google I/O 2017, James Giangola discussed the cooperative principle. This principle, introduced by Paul Grice, highlights what makes a successful conversation and can be boiled down to four maxims: quality, quantity, relation, and manner.

  • Quality: say the truth
  • Quantity: say enough, but not too much
  • Relation: say something relevant to the conversation
  • Manner: say it clearly

If our skills should be liberal in what they accept, they should be conservative in what they say. When a conversation partner breaks these maxims, the other party may have a difficult time understanding. The maxims can even be flouted to make a point or introduce humor. “I’d like to get a puppy,” the girl says to her father. “I’d like a million dollars,” he responds. His reply clearly isn’t relevant to the conversation, but he says it to make a point to his daughter. Be aware that flouting the maxims is an expert-level move, and one that assumes a close bond between the people in the conversation. Such a bond is unlikely to develop between computer and human outside of science fiction. A voice interface can still violate the maxims, though, without being as obvious about it as the sarcastic father.

When a device says, “I’m sorry, but that song isn’t available in your region” even though you listened to the same track on your phone, you can see a failure of the maxim of quality. Meanwhile, voice interfaces break the maxim of quantity regularly. They rarely do this by being overly taciturn. Rather, saying too much is the most common issue. “What would you like to do? You can turn on the lights, turn off the lights, set a timer, remove a timer, find more information about your lighting system or quit.” The user quit listening around “turn off the lights.”
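
One way to respect the maxim of quantity in a menu prompt is to name only a couple of options and offer the rest on request. The following is a minimal sketch, not code from the book: the helper name buildMenuPrompt, the maxSpoken parameter, and the follow-up wording are all assumptions made for illustration.

```javascript
// Hypothetical helper: keep a menu prompt short by naming only a few options.
// `maxSpoken` (an assumed parameter) caps how many options are read aloud.
function buildMenuPrompt (options, maxSpoken = 2) {
  const spoken = options.slice(0, maxSpoken);
  const hasMore = options.length > spoken.length;
  // Only hint at further options when some were left out.
  const tail = hasMore ? ' You can also ask what else I can do.' : '';

  return `What would you like to do? You can ${spoken.join(' or ')}.${tail}`;
}

// Illustrative option list, ordered so the most likely requests come first.
const lightingOptions = [
  'turn on the lights',
  'set a timer',
  'turn off the lights',
  'remove a timer',
  'find more information about your lighting system'
];

console.log(buildMenuPrompt(lightingOptions));
// → What would you like to do? You can turn on the lights or set a timer. You can also ask what else I can do.
```

Which options deserve the spoken slots is a design decision; ordering them by how often users ask for them is one reasonable starting point.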

Failures of the maxim of relation are common, too, and no less jarring. When someone asks to play a video on the television and the device provides the weather instead, a frustrated person stands where a hopeful soon-to-be video viewer stood. The maxim of manner, finally, says that our computers should speak to be understood. Asking, “Does Paris have more people than Berlin?” shouldn’t receive an answer of “The population is two million two-hundred forty-four thousand.”

Giangola also pointed out that we assume conversations adhere to these maxims. This assumption causes issues when the maxims aren’t adhered to, but provides great benefits when they are. When the outdoors enthusiast asks, “Is it cold outside?” the response “You should consider wearing a jacket today” is assumed to follow the maxims. The user assumes this is advice based on the question, and not an unwanted comment on how to hide a growing waistline. In this way, the maxims can assist each other. For the weather inquiry, quality and relation reduce the necessary quantity.

Imagine that we have created a voice-activated sleep tracker. The sleep quality intent replies with certain phrases depending on the reported sleep quality. If the quality was identifiably good, the skill responds with “Let’s keep the great sleep going!” A poor night’s sleep gets the response “I hope tonight’s better for you.” Both responses fit the four maxims: they are truthful, the right length, relevant to what was said, and clear.

There’s a third response, too: “I’ve got a good feeling about your sleep tonight.” This ambiguous response follows an ambiguous statement of quality. Imprecision is acceptable in this situation. I call this the fortune cookie approach. When you must handle expressions that you can’t know up front, craft a statement that can apply to nearly any that might arise. Because people assume the maxims are always in place, they interpret a vague phrase in a way that conforms to their expectations.
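
The three sleep tracker responses can be sketched as a lookup with a fortune cookie fallback. This is an illustration under assumptions rather than the book's implementation: the keys good and poor, and the names sleepQualityReplies, fortuneCookieReply, and replyForSleepQuality, are hypothetical stand-ins for however the skill actually classifies what the user said.

```javascript
// Hypothetical mapping from a recognized sleep quality to a reply.
// The keys "good" and "poor" are assumed labels, not the skill's real slot values.
const sleepQualityReplies = {
  good: "Let's keep the great sleep going!",
  poor: "I hope tonight's better for you."
};

// The "fortune cookie" reply: vague enough to fit any quality we did not anticipate.
const fortuneCookieReply = "I've got a good feeling about your sleep tonight.";

function replyForSleepQuality (quality) {
  // Normalize the input so "Good" and " good " are treated alike.
  const key = String(quality || '').trim().toLowerCase();
  const known = Object.prototype.hasOwnProperty.call(sleepQualityReplies, key);

  return known ? sleepQualityReplies[key] : fortuneCookieReply;
}

console.log(replyForSleepQuality('good'));
console.log(replyForSleepQuality('kind of strange'));
```

The fallback keeps the skill from rejecting input it merely failed to anticipate, and leans on the listener to read the vague reply charitably.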

Building Your VUI Persona

The persona of a VUI also influences how the conversation is interpreted by the listener. Although some anecdotal evidence hints that current voice experiences are training people to speak to computers differently than they do to humans, other research suggests that the brain doesn’t make such a distinction. As a result, all of our attendant preconceived notions about different kinds of voices come along.[1] All of these ideas help shape people’s ideas of what personality a voice represents. Changing a voice, or even qualities of a voice, can influence how people react to what that voice is saying.

When building for a voice-first platform, you don’t have as much control over the personality as you would if you were building an experience for a phone system, mobile app, or website. Your skill is only a small part of the overall platform, and users expect you to adhere to that platform’s personality. You can, however, record your own audio to use in the response, or run the responses through another text-to-speech provider and then stream the result back to the user.

Hiring and recording voice talent for the skill can create a richer experience, but it comes with the downside that it’s expensive and out of budget for all but the most deep-pocketed. It also reduces the potential responses significantly: speech can no longer be assembled as needed but must be re-recorded whenever a change in the skill calls for it. That limitation opens the door to using another text-to-speech provider for a different voice. A new voice can help set your skill apart from others when done well, but it can also confuse users if there’s no good reason for it. Creating and streaming the speech can also introduce extra monetary and latency costs.
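
On Alexa, pre-recorded or externally generated speech is typically delivered through the SSML audio tag. The sketch below only builds the SSML string, combining a recording with text left for the platform voice to speak. The helper names escapeXml and recordedThenSpoken are hypothetical, the URL is a placeholder, and the hosted file is assumed to meet the platform's hosting and format requirements.

```javascript
// Escape characters that would otherwise break the SSML markup.
function escapeXml (text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Hypothetical helper: play a hosted recording, then let the platform's own
// voice speak the part of the response that changes from request to request.
function recordedThenSpoken (audioUrl, spokenText) {
  return `<audio src="${escapeXml(audioUrl)}" /> ${escapeXml(spokenText)}`;
}

// Placeholder URL for illustration only.
const speech = recordedThenSpoken(
  'https://example.com/audio/great-sleep.mp3',
  'You logged eight hours.'
);

console.log(speech);
// → <audio src="https://example.com/audio/great-sleep.mp3" /> You logged eight hours.
```

Mixing the two voices is itself a persona decision. The sketch shows the mechanics only; whether the switch between voices sounds natural has to be judged by ear.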

One aspect of the personality that you have a lot of control over is what’s said. A surefire way to have a boring personality is to always say the same response. Repetition is an easy way to take a user out of the moment and have them remember that they’re, indeed, speaking with a computer. It’s also what the sleep tracker skill is doing right now.

To make the skill sound more conversational, we should develop multiple ways of responding to the user. In truth, we’re already responding with varied phrases, but each phrase maps to an input. A good night’s sleep is always greeted with “Let’s keep the great sleep going!” Surely there are different ways to answer. Replying the same way each time sounds wooden and stilted. By creating a list of responses and choosing randomly, we can make our VUI sound more fluid and even human.

Listing 1. index.js

 
 // Return one element of the array, chosen at random
 function pluck (arr) {
   const randIndex = Math.floor(Math.random() * arr.length);

   return arr[randIndex];
 }

 // Candidate phrases, grouped by how much sleep the user reports
 const WellRestedPhrases = {
   tooMuch: [
     "I think you may sleep too much and swing back to tired.",
     "Whoa, that's a lot of sleep. You'll wake up rested for sure."
   ],
   justRight: [
     "You should wake up refreshed.",
     "Rest is important and you're getting enough.",
     "With that much sleep, you're ready to face the world.",
     "You'll wake up invigorated."
   ],
   justUnder: [
     "You may get by, but watch out for a mid-day crash.",
     "You'll be alright, but would be better off with a bit more time.",
     "You might be a little tired tomorrow."
   ],
   tooLittle: [
     "You'll be dragging tomorrow. Get the coffee ready!",
     "Long night or early morning? Either way, tomorrow's going to be rough."
   ]
 };

 const handlers = {
   WellRestedIntent () {
     const slotValue = this.event.request.intent.slots.NumberOfHours.value;
     // Always pass the radix so the value is parsed as a decimal number
     const numOfHours = parseInt(slotValue, 10);

     if (Number.isInteger(numOfHours)) {
       let speech;
       // Pick the phrase group for the reported hours, then a phrase within it
       if (numOfHours > 12) {
         speech = pluck(WellRestedPhrases.tooMuch);
       } else if (numOfHours > 8) {
         speech = pluck(WellRestedPhrases.justRight);
       } else if (numOfHours > 6) {
         speech = pluck(WellRestedPhrases.justUnder);
       } else {
         speech = pluck(WellRestedPhrases.tooLittle);
       }

       this.emit(':tell', speech);
     } else {
       // Log the unparsable value to help with debugging
       console.log(`Slot value: ${slotValue}`);

       const prompt = "I'm sorry, I heard something that doesn't seem like" +
                      " a number. How many hours of sleep do you want?";
       const reprompt = 'Tell me how many hours you plan to sleep.';
       this.emit(':ask', prompt, reprompt);
     }
   },
   // ...
 };
 

  A helper function to return a random value from an array

  All of the possible responses organized in a single object

  Grabbing a random value, based on the number of hours slept

In this code, responses are set as arrays inside an object. The contingencies don’t need the same number of phrases; more common occurrences should naturally have more responses. More options should be available for between six and twelve hours, as these are the lengths most people sleep during the night. The pluck function returns a random response on each call.
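
A purely random pick can still return the same phrase twice in a row, which brings back the repetition we are trying to avoid. One possible refinement, sketched here as an assumption rather than as part of the listing, is to exclude the phrase used last time. The name pluckWithoutRepeat is hypothetical, and a real skill would need to store the last phrase somewhere that survives between requests.

```javascript
// Hypothetical variant of pluck: pick a random phrase, avoiding `last` if possible.
// `random` is injectable only so the behavior can be checked deterministically.
function pluckWithoutRepeat (arr, last, random = Math.random) {
  const candidates = arr.filter(phrase => phrase !== last);
  // Fall back to the full list when excluding `last` leaves nothing to choose.
  const pool = candidates.length > 0 ? candidates : arr;

  return pool[Math.floor(random() * pool.length)];
}

const justUnderPhrases = [
  'You may get by, but watch out for a mid-day crash.',
  "You'll be alright, but would be better off with a bit more time.",
  'You might be a little tired tomorrow.'
];

// For illustration the last phrase is kept in a local variable.
let lastPhrase;
for (let i = 0; i < 3; i++) {
  lastPhrase = pluckWithoutRepeat(justUnderPhrases, lastPhrase);
  console.log(lastPhrase);
}
```

The trade-off is a small amount of stored state per user in exchange for never hearing the same line on consecutive uses.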

Rotating through responses might seem small, but it can go a long way toward making your skill more enjoyable to use. Beyond adding variety, you’ll want to tailor your responses to the context and the mood. Go ahead, give Alexa some irreverence for a skill that reports baseball scores. Baseball is a game; no one will get too upset (but maybe do me a favor and be sensitive if the Astros lose in a gut-wrenching fashion). If someone is checking their bank account balance, you may want to lean toward being more sedate. Possibly your user discovered a surprise extra hundred, but it’s just as likely that the number isn’t as high as hoped and a light mood won’t go over well.
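
Tone can be layered on top of rotation by grouping response templates by mood and letting each skill choose the group it draws from. The sketch below is illustrative only: the tone names, the template wording, and the render function are all assumptions, not code from the book.

```javascript
// Hypothetical templates grouped by tone. The wording is illustrative only.
const templatesByTone = {
  playful: [
    (subject, value) => `Drumroll, please. ${subject}: ${value}!`,
    (subject, value) => `Here it comes. ${subject}: ${value}!`
  ],
  sedate: [
    (subject, value) => `${subject}: ${value}.`
  ]
};

// Pick a random template for the tone, defaulting to the sedate group
// when the requested tone is unknown.
function render (tone, subject, value, random = Math.random) {
  const known = Object.prototype.hasOwnProperty.call(templatesByTone, tone);
  const templates = known ? templatesByTone[tone] : templatesByTone.sedate;
  const template = templates[Math.floor(random() * templates.length)];

  return template(subject, value);
}

console.log(render('playful', 'Final score', 'five to three'));
console.log(render('sedate', 'Your balance', 'one hundred twenty dollars'));
```

Defaulting to the sedate group is a deliberate choice in this sketch: a flat answer is rarely offensive, while a joke at the wrong moment can be.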

This is voice, and there’s no way to inject personality with color schemes or typeface choices. What you say is what users remember. Follow Grice’s maxims. Make your skill’s speech true, brief, relevant, and clear.


That’s all for this article. If your interest in voice applications is piqued, have a look at it on liveBook here and see this slide deck.


 

[1] Clifford Nass and Scott Brave, Wired for Speech: How Voice Activates and Advances the Human-Computer Relationship (MIT Press, 2005)