Are Voice UIs Making Natural Conversations?

Tech & Experience Design

The current state of Voice UIs and interactive robots

If you have a smart speaker, have you used it recently? Have you ever talked to a robot when you saw one on the street? Have you ever been annoyed by interactions with bots or digital signs?

Voice user interfaces (Voice UIs or VUIs) seem to have made a spectacular debut as future interfaces. But when people try them out, they are often frustrated that VUIs are somewhat inconvenient to use or do not work as expected. It appears that, in reality, they are not that widely used.

From what I have gathered, people seem to be very aware of smart speakers nowadays. Many of you reading this article probably know smart speakers exist, even if you don’t own one. Amazon Echo, Google Home, LINE’s CLOVA WAVE, and Apple’s HomePod debuted in Japan around 2016. Speakers with AI voice assistant capabilities attracted so much attention that 2017 was even called Year One of the age of smart speakers. Some of you may have purchased one that year and still have it.

And now, about five years later, many people have relegated their smart speakers to the closet or handed them down to their kids and rarely use them. Some say they use smart speakers daily, yet the only functions they actually use are weather forecasts, telling the time, and playing music. If you think about it, all of these are doable with a smartphone. I highly doubt whether we can really call these features “smart”.

Let’s turn our attention to VUIs around town. SoftBank’s Pepper is a famous interactive robot in Japan; many of you have probably seen it in retail stores and shopping malls. Since Pepper’s debut, various manufacturers have introduced interactive robots. But they are far removed from the fantasies of sci-fi movies, in which humans converse with robots and humanoids and enjoy the conveniences they bring.

If you have ever talked to a robot or an on-screen AI agent (avatar), you may have found that you could not have a genuine conversation with it. I can picture the scene: you ask it something, or tell it your name or a product’s name, and barely manage to get the information across using the simplest of words.

If it says, “My name is ***, and I’m an AI robot. Feel free to talk to me about anything!” do you think you can talk to it straight away?

In the end, the interaction ends with a generic self-introduction, greeting, or product description. In 2021, sales of Pepper were reportedly discontinued, and many other robots have similarly been withdrawn from sale or development. We don’t know the real reason for Pepper’s discontinuation. Still, interactive robots and avatars (agents) are in such a state of limbo that in early 2020 there was already widespread talk that the communication robot bubble was about to burst.

VUIs are also emerging in car navigation and infotainment systems. In addition to guiding drivers to their destination, these systems can lower the temperature of the car’s air conditioner when they hear the driver complain about the heat. They can even change the color of the car’s interior lighting. But I don’t expect many drivers are using these features yet.

VUIs and interactive robots showed great promise as next-generation interfaces, but why hasn’t the user experience improved that much?

Before getting into detail, there are two major reasons for this:

  1. Integration of VUIs without an understanding of the positives and negatives they bring
  2. Inability to communicate naturally with people in spoken conversations

Both issues are significant challenges for service and product designers. To begin with, it is not surprising that users are turned off when these systems fail to deliver pleasant voice-based experiences.

So how can we create pleasant voice-based experiences?

Understanding the characteristics of VUIs

The solution to the first issue would be for designers to first gain a solid understanding of the characteristics of VUIs and the technology behind them.

Many failures stem from the fact that too many manufacturers and service providers have adopted VUIs for naive reasons such as: “If we put in voice interfaces, it will give our product a next-generation feel,” “It will put us ahead of the competition because nobody else is doing it,” or “I have never experienced voice control myself, but it sounds fun and convenient, so I’m sure users will jump on it.” That is no way to create quality experiences. I have heard many stories of companies that invested large sums of money in smart-speaker services, convinced that voice operation would make them more user-friendly, only to find that the number of users did not increase at all because of poor usability.

When you are familiar with a smartphone screen or a physical button, touching it is faster and more reliable than using your voice, so in such cases it is only natural that VUIs diminish the user experience. The more tired we are, the less we want to say things aloud. Research shows that making requests aloud is surprisingly labor-intensive, and I have seen similar results in user interviews I conducted in the past.

User: “Hey, ***. Turn on the light.”

System: “I didn’t catch that. Could you say that again?”

Once you have had this experience, you may never use VUIs again. In the example above, wouldn’t it be easiest to use a motion sensor to turn on the lights when you enter the room?

On the other hand, if, for example, you want to check the weather in Yokohama a week from now, it is easier to do it with your voice. If you use your weather forecasting application of choice but do not live in the city of Yokohama, you need to take several steps: first select the Kanto Region, then Kanagawa Prefecture, then Yokohama City, then choose the dates for the next week, and so on. With your voice, all you need to say is, “What’s the weather like in Yokohama next week?” (although the exact phrasing may vary slightly depending on the AI assistant).
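What makes the single voice request efficient is that one utterance carries all the "slots" (location, date range) that a GUI collects step by step. As a minimal, purely illustrative sketch of this idea — the pattern, slot names, and function are my own assumptions, not any real assistant's API — a toy intent parser might look like this:

```python
import re

# Hypothetical sketch: one spoken utterance carries every "slot" a GUI
# would collect across several screens. The regex and slot names are
# illustrative only; real assistants use far more robust NLU pipelines.
UTTERANCE_PATTERN = re.compile(
    r"what's the weather like in (?P<location>[\w\s]+?) "
    r"(?P<when>today|tomorrow|next week)\??",
    re.IGNORECASE,
)

def parse_weather_request(utterance: str):
    """Extract intent slots from a single voice utterance, or None."""
    match = UTTERANCE_PATTERN.search(utterance)
    if match is None:
        return None  # in practice, the assistant would ask a clarifying question
    return {
        "intent": "weather_forecast",
        "location": match.group("location").strip(),
        "when": match.group("when").lower(),
    }

print(parse_weather_request("What's the weather like in Yokohama next week?"))
```

The point is not the regex itself but the contrast: the voice channel collapses a multi-screen selection flow into a single turn, which is exactly where VUIs hold an advantage.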

I will not go into the characteristics of VUIs in detail here. Still, it is a significant issue that we have not yet found enough use cases where VUIs have a clear advantage over other ways of interacting with devices. Creators have not communicated the benefits of VUIs well; they have simply presented VUIs without giving us any context.

Understanding voice-based communication and human beings

Let’s look at the second issue: the inability to communicate naturally with people in spoken conversations.

This means that although it looks like you are having a dialogue (conversation), in fact you are not. First of all, in our daily lives, we humans never say, “Please speak to me. I’ll answer anything.” Invited to say “anything,” you may well be at a loss for words. If you were dealing with a human being, you would want to ask things like, “What kind of questions can you answer?” or “What are you good at?”

Perhaps the people who created these robots and AI assistants are trying to convey how advanced their technology is by telling us, “Hey, I can answer anything!” But showcasing the technology is not the same as creating a good interactive experience. To start with, bragging and showing off your knowledge are typical examples of things people don’t want to hear in conversation. I believe interactive robots should incorporate active listening: techniques for steering a topic and drawing information out of the other person.

One more point on the subject of communication. In a conversation between human beings, there is always another person involved, and we are always mindful of our relationship with them when we talk. Japanese speakers, for example, use three forms of speech to show respect: the polite form, the respectful form, and the humble form. Here, I will introduce the PAC model (Parent, Adult, Child) from Transactional Analysis, a psychotherapeutic method of analyzing communication.

Roughly speaking, we divide our state into P (Parent: acting and thinking like a parent), A (Adult: acting and thinking calmly as an adult), and C (Child: acting and thinking like a child). We analyze the state of each person in the dialogue and judge that conflicts are unlikely in complementary transactions, where the lines of the conversation do not cross (the arrows are parallel in the diagram). Crossed transactions, on the other hand, can cause problems in communication, such as ending the conversation or making the listener feel uncomfortable.
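The complementary/crossed distinction above can be stated as a simple rule: a transaction is complementary when the reply comes from the ego state that was addressed, directed back at the ego state that spoke. A minimal sketch of that rule — the class and function names are my own, purely for illustration:

```python
from dataclasses import dataclass

# Hypothetical sketch of the PAC transaction rule from Transactional
# Analysis. The Stimulus class and is_complementary function are
# illustrative names, not part of any established library.

@dataclass(frozen=True)
class Stimulus:
    """One conversational turn: which ego state speaks, and which it addresses."""
    from_state: str  # "Parent", "Adult", or "Child"
    to_state: str

def is_complementary(stimulus: Stimulus, response: Stimulus) -> bool:
    """Complementary (parallel arrows): the response comes from the state
    the stimulus addressed, back to the state that sent the stimulus."""
    return (response.from_state == stimulus.to_state
            and response.to_state == stimulus.from_state)

# Adult asks Adult ("What time is it?"), Adult answers Adult: complementary.
print(is_complementary(Stimulus("Adult", "Adult"), Stimulus("Adult", "Adult")))  # True
# Adult asks Adult, but the reply scolds like a Parent to a Child: crossed.
print(is_complementary(Stimulus("Adult", "Adult"), Stimulus("Parent", "Child")))  # False
```

Seen this way, the design problem for conversational agents is that they rarely model which state they are speaking from, or which state the user addressed, so crossed transactions happen by accident.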

This is a universal, generic communication model regardless of language or culture. So, what about robots and humans? Humans, especially adults, can be in any of the states of P, A, or C and can change between states according to the situation. However, these distinctions are not clearly defined in many robots and conversational agents.

In the case of a cute child-type robot, the C (Child) state is probably the most common, but in some scenarios the robot may suddenly speak in the P or A state. Ask it, “What’s the best news of the day?” and it will talk about political news in a grown-up tone. We never experience this in interactions between humans; children have their own way of speaking, so the sudden shift feels confusing.

In human dialogue, there is always another person. Hence, the psychological states of both you and the other person are essential. People change the way they say things, and they also perceive the other person’s feelings through their subtle expressions and phrases.


There are many other reasons we cannot have a natural experience when communicating with a robot or AI assistant. VUIs are entering a very human realm, and it is quite challenging to figure out how robots (systems) should behave in conversations when not even humans can do it perfectly. Treating VUIs merely as a means (a tool) is not enough to create high-quality experiences.

First, those who build robots and design user experiences should understand not only the technology at hand but also how humans use their voices to interact and communicate with each other. Humans, for their part, are animals that can get used to new situations and adapt: we speak loudly and slowly to the elderly and use a softer tone with children. Perhaps new communication styles and experiences will emerge with robots and conversational agents, too.


In Tech and Experience Design, we discuss design from a broad range of perspectives that go beyond the UX design written about in textbooks. We delve deeply into what it means to design experiences, whether digital or analog, by weaving together the actual development process, global trends, and familiar perspectives such as our living spaces and human emotions and sensitivities.

Written By

Michinari Kohno

Michinari is a BXUX Director & Designer and the owner of NeomaDesign. He worked on UI/UX design at Sony for 22 years, mainly on global products like the PlayStation 3 and PlayStation 4. After Sony, he went independent and now consults on next-generation UI/UX, doing everything from concept design to project management and direction. He loves dancing in musicals himself, watching motor races, and walking his dog.
