How does a voice assistant work?
A voice assistant is a set of software components that perform voice and language processing in order to respond to a user’s request. It can be built into various objects equipped with microphones, loudspeakers, and computing capabilities (more or less developed depending on the case).
Objects incorporating a voice assistant (a speaker, a smartphone, or any other device capable of hosting one) can thus interact with users and deliver a service in response to a voice request.
The assistant can answer a question, play music, give the weather forecast, adjust the heating, activate lights, make online purchases, etc.
It is common to confuse a voice assistant with a smart speaker; the latter is only an object that contains a voice assistant.
The general operating principle of a voice assistant is characterized by 5 main steps:
Step 1: The user “wakes up” the speaker using a keyword
The speaker is constantly listening for the keyword but, in principle, does not record anything or carry out any operation until it has heard it. However, false activations can occur when the device wrongly believes it has detected the keyword (for example, when someone pronounces a word that resembles the activation word).
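As a rough illustration of this gating, the sketch below discards everything it hears until a word scores above a confidence threshold against the keyword. The keyword "computer", the threshold value, and the character-matching score are all assumptions made for the example; real devices use acoustic models rather than text comparison.

```python
WAKE_THRESHOLD = 0.8  # assumed cutoff, not taken from any real product


def keyword_score(heard: str, keyword: str) -> float:
    """Toy similarity: share of positions where the characters match."""
    if not heard or not keyword:
        return 0.0
    matches = sum(a == b for a, b in zip(heard.lower(), keyword.lower()))
    return matches / max(len(heard), len(keyword))


def is_wake_word(heard: str, keyword: str, threshold: float = WAKE_THRESHOLD) -> bool:
    """True when the heard word is close enough to the keyword."""
    return keyword_score(heard, keyword) >= threshold


def listen(stream, keyword: str = "computer"):
    """Return the words heard after the first activation; drop everything before."""
    awake = False
    captured = []
    for word in stream:
        if not awake:
            # While asleep, the word is only scored, never stored.
            awake = is_wake_word(word, keyword)
            continue
        captured.append(word)
    return captured
```

With this toy score, "commuter" matches "computer" in seven of eight positions and therefore triggers a false activation, which is the kind of near-miss described above. Raising the threshold reduces such cases at the cost of missing genuine activations.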
Step 2: The user is recognized (optional)
Some models allow users to pre-record samples of their voice so that they can be recognized later and given access to a service tailored to them rather than to other users of the device (parents, children, guests, etc.). This is called voice biometrics.
Because biometric data is sensitive data within the meaning of the GDPR, it may be processed in this context only with the explicit consent of the person concerned.
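One way to picture this step is to compare a numeric "voiceprint" of the incoming speech with the enrolled profiles, skipping any profile whose owner has not given explicit consent. The sketch below assumes voiceprints are plain lists of numbers and uses cosine similarity with an arbitrary threshold; the profile layout and the `identify` function are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify(sample, profiles, threshold=0.9):
    """Return the best-matching consenting user, or None.

    profiles maps a user name to (consented, voiceprint); this layout
    is an assumption made for the example.
    """
    best_name, best_score = None, threshold
    for name, (consented, voiceprint) in profiles.items():
        if not consented:
            continue  # no explicit consent: never compare this voiceprint
        score = cosine(sample, voiceprint)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


profiles = {
    "parent": (True, [1.0, 0.0]),
    "child": (False, [0.0, 1.0]),  # enrolled but without consent
}
```

Here `identify([0.9, 0.1], profiles)` returns `"parent"`, while a sample that matches only the non-consenting profile returns `None` and the device falls back to the default, non-personalized service.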
Step 3: The user states their request
Some speakers record the user’s requests locally, which leaves the user in control of their data. Most devices, however, send these requests to the cloud, i.e. to the servers of the voice assistant’s designer. In both cases, the device (or the servers) may need to keep:
- A history of transcribed requests, to allow the person to consult them and the publisher to adapt the functionalities of the service.
- A history of audio requests, to allow the person to listen to them again and the publisher to improve its speech processing technologies.
- Metadata associated with the request such as date, time, account name, etc.
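The three kinds of retained data listed above could be grouped in a single record, as in the sketch below. The `RequestRecord` class, its field names, and the `keep_transcript` / `keep_audio` settings are hypothetical and only show how a device might store less than it receives.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class RequestRecord:
    """One stored request; every field name here is hypothetical."""
    account: str                       # metadata: account name
    timestamp: datetime                # metadata: date and time
    transcript: Optional[str] = None   # kept for the transcribed history
    audio: Optional[bytes] = None      # kept for the audio history


def make_record(account, transcript, audio, keep_transcript=True, keep_audio=False):
    """Build a record that keeps only what the (assumed) settings allow."""
    return RequestRecord(
        account=account,
        timestamp=datetime.now(timezone.utc),
        transcript=transcript if keep_transcript else None,
        audio=audio if keep_audio else None,
    )


record = make_record("family", "play some music", b"\x00\x01")
```

With the default settings used here, the record keeps the transcript and metadata but drops the audio, so the user could review what was asked without the recording itself being retained.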
Step 4: The spoken word is automatically transcribed into text and then interpreted so that an appropriate response is provided
The assistant first transcribes the speech stream into words, then extracts the meaning of the request and determines the action or response to provide. A response phrase is then synthesized and played on the speaker, and/or a command is carried out (raising the blinds, increasing the temperature, playing a piece of music, answering a question, etc.).
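The interpretation stage can be sketched as a mapping from an already transcribed sentence to an intent, and from the intent to a reply. The example below is a deliberately simple keyword lookup; the intent names, trigger words, and reply texts are invented for illustration, and production assistants rely on statistical language understanding rather than a table like this.

```python
import re

# Hypothetical trigger words and intent names.
INTENTS = {
    "play": "play_music",
    "weather": "give_weather",
    "blinds": "raise_blinds",
    "temperature": "set_temperature",
}

# Hypothetical spoken replies for each intent.
RESPONSES = {
    "play_music": "Playing music.",
    "give_weather": "Here is the weather forecast.",
    "raise_blinds": "Raising the blinds.",
    "set_temperature": "Adjusting the temperature.",
}


def interpret(transcript: str) -> str:
    """Map a transcribed request to an intent name, or 'unknown'."""
    words = re.findall(r"[a-z]+", transcript.lower())
    for keyword, intent in INTENTS.items():
        if keyword in words:
            return intent
    return "unknown"


def handle(transcript: str):
    """Return (intent, reply text) for a transcribed request."""
    intent = interpret(transcript)
    return intent, RESPONSES.get(intent, "Sorry, I did not understand.")
```

For instance, `handle("What is the weather?")` yields the `give_weather` intent with its reply, while a sentence containing none of the trigger words falls through to the "did not understand" response.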
Step 5: The speaker returns to “standby”
Finally, the speaker returns to standby and listens for the keyword again.
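Taken together, the five steps form a loop that can be modeled as a small state machine. The sketch below is one possible reading of that cycle; the state and event names are assumptions, and the optional recognition step is represented as a state that is always passed through.

```python
from enum import Enum, auto


class State(Enum):
    STANDBY = auto()      # steps 1 and 5: waiting for the keyword
    IDENTIFYING = auto()  # step 2: optional user recognition
    LISTENING = auto()    # step 3: the user states the request
    PROCESSING = auto()   # step 4: transcription, interpretation, response


# Hypothetical event names; any other (state, event) pair is ignored.
TRANSITIONS = {
    (State.STANDBY, "keyword_heard"): State.IDENTIFYING,
    (State.IDENTIFYING, "user_checked"): State.LISTENING,
    (State.LISTENING, "request_ended"): State.PROCESSING,
    (State.PROCESSING, "response_done"): State.STANDBY,
}


def step(state: State, event: str) -> State:
    """Move to the next state, or stay put if the event does not apply."""
    return TRANSITIONS.get((state, event), state)


def run(events, state: State = State.STANDBY) -> State:
    """Apply a sequence of events and return the final state."""
    for event in events:
        state = step(state, event)
    return state
```

Feeding the four events in order brings the machine back to `State.STANDBY`, and an event such as `"request_ended"` received while in standby is simply ignored, mirroring the idea that nothing happens until the keyword is heard.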