How to design a voice experience

Designing for voice: a brand new challenge

Voice devices are growing in popularity - so much so that we now have an entire department dedicated to making voice experiences here at the BBC. So how do our designers in BBC Voice + AI go about making experiences that work through smart speakers?

Inspired by our work on the BBC Kids skill - a voice experience for three- to seven-year-olds - we've come up with 12 design principles for voice.

In this guide, we talk about how to:

write good content
use language and tone
handle errors
use sound effects
choose between the device voice and a human voice-over
test your designs

We believe that by understanding how to design a voice experience for children, you'll learn how to design a voice experience for anyone.

Asking questions

1. If you're asking a question, end the sentence with it and listen for an answer immediately.

When children have something to say to a smart speaker, they won't wait nicely for their turn to speak. As soon as you ask a question they will want to answer it, so you should be ready to listen.

Whilst we can't change how impulsive children are or make smart speakers listen when they're speaking, we can write dialogue that prompts children to speak at the right time.

Bad practice: Asking a question mid-sentence

Smart speaker: "Who do you want to play with? Andy or Justin?"

In early usability testing sessions, we observed children trying to answer after the first question: who do you want to play with? But at this point the smart speaker is still talking, and isn't listening to what they're saying. The child is also talking over information that they need to actually answer the question.

So by writing dialogue this way, there's a high likelihood that the interaction will be unsuccessful, either because the smart speaker hasn't listened or the child doesn't know how or when to answer. This can be frustrating for the child.

Best practice: Ending a sentence with a question

Smart speaker: "Andy and Justin are here. Who do you want to play with?"
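A script can be checked for this rule before it is ever recorded. The following is a minimal sketch in Python, not part of the skill itself: the function name `question_issues` and its messages are hypothetical, and it assumes each prompt is held as plain text with ordinary punctuation.

```python
def question_issues(prompt: str) -> list[str]:
    """Return problems with where questions sit in a spoken prompt."""
    # Ignore surrounding whitespace and quotation marks.
    text = prompt.strip().strip("\"'")
    issues = []
    # A question mark before the final character means the speaker
    # keeps talking after asking something.
    if "?" in text[:-1]:
        issues.append("question asked before the end of the prompt")
    # If the prompt contains a question, it should be the last thing said.
    if "?" in text and not text.endswith("?"):
        issues.append("prompt continues after its last question")
    return issues


bad = "Who do you want to play with? Andy or Justin?"
good = "Andy and Justin are here. Who do you want to play with?"

print(question_issues(bad))   # one issue reported
print(question_issues(good))  # no issues
```

A check like this only looks at punctuation, so it is a prompt for a human review rather than a replacement for testing the dialogue with children.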

2. Don't tell children to 'say this' or 'say that', simply ask the question.

Given that conversation is the interface of a voice experience, it can feel patronising to tell your users exactly what to say. A good voice experience should aspire to feel like a natural conversation, not an automated phone system from the 1980s. As demonstrated in the example below, telling your users exactly what to say can make the experience lengthier than it needs to be.

Bad practice: Telling the user what to say

Smart speaker: If you'd like to play a game, say 'game'. Or, if you'd like to hear a story, say 'story'.

Read all 19 of these words aloud, and it's easy to see how this is both insulting to the intelligence of the user and disrespectful of their time.

Best practice: Asking the user a question

Smart speaker: Would you like a game or a story?

Reduced to eight words, this approach is far more succinct and respectful. In our user testing, three year olds were able to answer this question with ease.

3. Don't ask rhetorical questions; children will answer them.

As we mentioned in our first principle, children are impulsive and will immediately try and answer a question. This is especially true of rhetorical questions.

It can be very easy when writing dialogue for children to try and engage them with rhetorical questions, such as "wasn't that nice?" as part of a longer sentence. Children won't nod along and continue listening when stood by a smart speaker. They will try and answer.

Bad practice: Using a rhetorical question

Smart speaker: "I wonder who's here to play?"

User: "I am!"

Smart speaker: "Go Jetters"

Smart speaker: "Hey Duggee"

Smart speaker: "Who do you want to play with?"

Here, as a way of priming the user for the options that are about to be presented, the rhetorical question "I wonder who's here to play" is used. The child, not realising this is a rhetorical question, answers immediately and speaks over the available options. When the 'real' question is finally asked, they don't know how to answer.

Best practice: Avoiding rhetorical questions

Smart speaker: "Let's listen and find out who's here to play."

Smart speaker: "Go Jetters"

Smart speaker: "Hey Duggee"

Smart speaker: "Who do you want to play with?"

User: "Duggee!"

By using an instructional command "Let's listen and find out who's here to play", the user is prompted to listen rather than answer.

4. Ask questions that have distinctive, easy-to-say answers.

During early rounds of usability testing, we observed just how often children were misunderstood by smart speakers. Since the launch of our skill, our analytics have shown that roughly 40% of everything children say to our skill is misunderstood.

So if children aren't good at speaking and smart speakers aren't good at listening, then what can we do to help?

We've found that asking for simple, distinct utterances reduces the margin for error. Ideally, pick one- or two-word phrases that are clearly different from the other options being offered. Following these guidelines makes it easier for children to understand, remember and say these utterances successfully. They're also easier for a smart speaker to understand.

Bad practice: Offering similar sounding choices

Smart speaker: "Wait or Walk?"

These two terms are short, simple and easy to say. But they're not distinct enough from one another.

Whilst the alliteration of "wait or walk" is easy to grasp, memorable and satisfying to say, the similarity in sound of the two terms can make it tricky for a smart speaker to differentiate between them. This can lead to misinterpretations and, in the case of this example, to 'Waffle the Wonder Dog' walking when he should wait and waking Mrs Hobbs up. Disaster!

Best practice: Offering distinctive sounding choices

Smart speaker: "Hide or Walk?"

By replacing 'wait' with 'hide', the two options offered are now just as easy to say and remember, but far more distinct. They are much easier for a smart speaker to differentiate between.
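One rough way to spot risky pairs while writing is to compare the options as text. The sketch below uses Python's standard `difflib.SequenceMatcher`; the function names and the 0.5 threshold are assumptions made for illustration. Spelling is only a crude stand-in for sound, so this can flag candidates for a closer look but cannot replace testing on a real device.

```python
from difflib import SequenceMatcher
from itertools import combinations


def spelling_similarity(a: str, b: str) -> float:
    """Score from 0.0 (nothing shared) to 1.0 (identical spelling)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def confusable_pairs(options, threshold=0.5):
    """Return option pairs whose spelling similarity meets the threshold.

    The threshold is an arbitrary starting point, not a measured value.
    """
    return [
        (a, b)
        for a, b in combinations(options, 2)
        if spelling_similarity(a, b) >= threshold
    ]


print(confusable_pairs(["wait", "walk"]))  # the pair is flagged
print(confusable_pairs(["hide", "walk"]))  # nothing is flagged
```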

Listening for answers

5. When offering a choice, provide no more than three options.

Another early insight taught us that children, when presented with options, struggle to retain any more than three. Research from other Voice + AI projects also highlighted this as a limitation for adults following along with a recipe in a voice experience.

More than three options are difficult for users to remember. This can mean users are busy thinking about what they've forgotten when they should be making a decision. Decision-making involves weighing up the available options, and that's very difficult to do if you can't remember all of them.

Bad practice: Providing too many options

Smart speaker: "We've got five games. Go Jetters, Hey Duggee, Waffle the Wonder Dog, Move It and Justin. Which game would you like?"

In this example, five games are presented to the user at once. This many options can feel overwhelming. Few people would be able to remember all five of the options they heard. Consequently, the question at the end can make the user feel at fault for forgetting this information.

Best practice: Providing three or fewer options

Smart speaker: "We've got five games. How about 'Go Jetters' or 'Waffle the Wonder Dog'. Choose one or ask for more."

In this example, by contrast, the five games are split into smaller chunks and offered two at a time. Users are able to process the information more easily, weighing up two options and deciding if they like what they've heard or if they want to hear more.
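Splitting a longer list into small groups is simple to automate. This is a minimal sketch in Python under stated assumptions: `chunk_options` and `MAX_OPTIONS` are hypothetical names, and the order in which games are offered is arbitrary here, whereas a real experience might order them deliberately.

```python
MAX_OPTIONS = 3  # the limit from principle 5


def chunk_options(options, size=2):
    """Split a list of options into consecutive groups of at most `size`."""
    if not 1 <= size <= MAX_OPTIONS:
        raise ValueError(f"size must be between 1 and {MAX_OPTIONS}")
    return [options[i:i + size] for i in range(0, len(options), size)]


games = ["Go Jetters", "Hey Duggee", "Waffle the Wonder Dog", "Move It", "Justin"]

# Each group would be offered in turn, moving on when the user asks for more.
for group in chunk_options(games):
    print(group)
```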

6. Strive to present options that are balanced in their appeal to children.

When first testing our navigation, we noticed that more children were choosing to play a game than listen to a story. Through further investigation, we learned that the word 'listen' was associated with being told to listen at school. This wasn't very appealing, especially when presented against the option to play a game.

Providing choices that are equally weighted in their appeal is an ongoing challenge. But by getting this right we can be sure that children have the best chance of navigating through and discovering all of our content.

Bad practice: Providing options with unbalanced appeal

Smart speaker: "Would you like to play a game or listen to a story?"

'Listen' is off-putting for children, whilst 'play' has an almost irresistible appeal. The options aren't fairly weighted.

Best practice: Providing options with similar appeal

Smart speaker: "Would you like a game or a story?"

By removing the verbs 'play' and 'listen', we offer two far more balanced options.

Handling errors

7. Don't keep children stuck in error loops. Turn a bad situation good by progressing them even when they are misunderstood.

Children get frustrated if they aren't able to do what they want within an experience. And if they're consistently misunderstood, they will abandon the experience altogether.

To prevent this, if a child is misunderstood twice in our navigation, we make a choice for them. Even if they can't speak clearly or can't remember what to say, they'll eventually reach some content.

We frame this randomly chosen piece of content as a surprise! The prospect of an unexpected treat is thrilling to a child and distracts from any frustration they may be feeling.

Bad practice: Maintaining a state of frustration

Smart speaker: "Would you like a game or a story?"

User: [Misheard utterance]

Smart speaker: "Sorry! I didn't quite catch that. Would you like a game or a story?"

User: [Misheard utterance]

Smart speaker: "Oops, I still don't understand. Would you like a game or a story?"

User: [Misheard utterance]

Smart speaker: "Oops, I still don't understand. Would you like a game or a story?"

Providing a second chance for a misheard utterance to be said again is good practice. But doing this multiple times is less sympathetic to the user. It can become a torturous loop they're unable to break out of.

Best practice: Progressing the action

Smart speaker: "Would you like a game or a story?"

User: [Misheard utterance]

Smart speaker: "Sorry! I didn't quite catch that. Would you like a game or a story?"

User: [Misheard utterance]

Smart speaker: "Oops, I still don't understand. Let's play a surprise game." Drum roll

Smart speaker: "Let's join in and dance in Justin's House!"

By providing a surprise after a second misheard utterance, the user is progressed to some content and away from a frustrating loop. Children love a surprise, so the experience of being misheard is positive, not negative.
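The rule behind this flow is small enough to sketch in code. The Python below is an illustration only, not the skill's implementation: `respond_to_mishearing`, `MAX_MISSES` and the list of surprises are hypothetical names, and it assumes the caller keeps count of consecutive misheard utterances.

```python
import random

MAX_MISSES = 2  # after the second misheard utterance, stop asking

REPROMPT = "Sorry! I didn't quite catch that. Would you like a game or a story?"


def respond_to_mishearing(miss_count, surprises, rng=random):
    """Decide what to do after `miss_count` consecutive misheard utterances.

    Returns a tuple of (action, payload): either a reprompt to speak,
    or a randomly chosen piece of content to play as a surprise.
    """
    if miss_count < MAX_MISSES:
        return ("reprompt", REPROMPT)
    return ("surprise", rng.choice(surprises))


surprises = ["Justin's House", "Go Jetters", "Hey Duggee"]
print(respond_to_mishearing(1, surprises))  # asks again
print(respond_to_mishearing(2, surprises))  # moves on to a surprise
```

Passing the random source in as `rng` keeps the choice easy to control in tests, while the default behaviour stays random for the child.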

8. Don't use language or tone to make the child feel as though they are to blame.

We've touched upon how easily children can become frustrated by a voice experience that doesn't understand them. But when a voice experience blames a child for the problem, it can go beyond frustration and become upsetting.

During testing, we met one particularly shy and sensitive child who, through some careful facilitation, found the confidence to speak to our smart speaker. But this was short lived, as the speaker didn't understand them and the response left them feeling frustrated and upset. Three words were to blame.

Bad practice: Putting the blame on the user

User: [Misheard utterance]

Smart speaker: "Sorry, I didn't understand what you said."

Those three words, 'what you said', put the responsibility for the misunderstanding on the child, making them feel at fault. Coming from trusted CBeebies presenter Rebecca, it's no wonder that our participant lost their confidence.

Best practice: Taking the blame

User: [Misheard utterance]

Smart speaker: "Sorry, I didn't understand."

By removing 'what you said', the emphasis is placed onto 'I' and not 'you'. The blame is taken away from the user entirely.

Writing content

9. Use a real voice to speak to children in a warm, friendly tone - avoid cold, monotone, synthesised voices.

Watching any TV programme for preschoolers, you'll notice a warm, friendly and inclusive tone. At present, the synthesised voices provided by smart speakers can't get close to this dynamic delivery.

When we first began working on the BBC Kids skill, we prototyped an experience where Alexa helped you navigate to Justin's Hide & Seek game. The handover from synthesised voice to Justin jarred. Badly. Alexa felt lifeless by comparison.

We decided there and then to banish synthesised voices from the BBC Kids skill altogether and turn that decision into one of our core design principles.

Bad practice: Using Alexa's voice

It's difficult not to feel underwhelmed by Alexa's jarring delivery after hearing Justin and Ubercorn.

Best practice: Using a human voice

CBeebies' Rebecca, on the other hand, delivers warmth and joy in every syllable. Her tone of voice knits together perfectly with Justin and Ubercorn to provide an altogether more cohesive experience. We carried this through to the skill - ensuring kids have the same connection with BBC Kids on a smart speaker as they do with CBeebies on TV.

When writing the navigation script, we stayed true to Rebecca's warm, friendly tone of voice and used language that reflected her natural style. At the record, we encouraged Rebecca to adapt the script, ensuring her delivery felt conversational - just as children are used to when watching her on CBeebies.

10. Use sound effects and music to break up dialogue and immerse children in make believe.

We observed that children lose attention during extended stretches of pure dialogue. But when dialogue is interspersed with sound effects, voice experiences are much more engaging.

Sound effects can be used to build a world out of audio, transporting the listener to virtually anywhere and increasing engagement even further. They are also important signposts to help the listener orient themselves in an experience that is entirely non-visual.

Bad practice: Uninterrupted dialogue

Pure uninterrupted dialogue for 24 seconds is hard for anyone to follow, let alone a three year old. Without a pause for a sound effect or some music, the brain never stops processing the information it's being fed. It asks the user for 24 seconds of pure concentration as they try to listen, understand and remember what's being said. In short, this is boring.

Best practice: Dialogue punctuated by sound effects & music

Punctuated by Duggee's 'awoof', the cheers of the Squirrels and some catchy music, the same information is far easier to absorb. It's broken up nicely, it's fun to listen to and it builds the world of Hey Duggee out of audio for the child to enjoy.

11. Encourage engagement beyond speaking and listening.

Children don't sit still and talk to a smart speaker; they are interacting with the world around them. Speaking isn't the only way they can engage with a product, and we've found that movement, dancing and performing are really popular ways to interact with the skill.

For many parents, smart speakers represent a welcome escape from screens and a healthy choice for their children. So, wherever possible, engage children in physical activity to enhance this healthy relationship with technology.

12. Test early and often. Try your design out with real children.

All of our principles came from making mistakes and learning from them in our research. On the way to releasing the BBC Kids skill, we conducted seven rounds of usability testing over a six-month period, with over 100 children.

Alongside what we learned about the content of our skill, we also learned that you don't need a fully working prototype to be able to test a voice experience. Using a Bluetooth speaker and some audio clips of your own voice, you can fake an intelligent experience using Wizard of Oz testing.

It's also vital to test the working skill on a real device as soon as you can. Whilst Wizard of Oz testing puts your design under the microscope, testing a working skill is essential to uncover issues that may arise from a smart speaker's natural language processing.

Conclusion

As we strive to craft more delightful voice experiences, these principles are likely to grow and evolve. Our user research is continuous and our learning is ongoing.

These principles were informed by a project to build a voice experience for children. However, we strongly believe that they have universal application. The considerations and thought processes demonstrated by these principles can be applied to any voice experience.