At Phonic we advocate for the efficiency of human speech. Communicating with voice is fast (150 words per minute compared to 50 for typing), rich in emotion and accessible to nearly everyone on the planet. We've always known that these differences enable longer, more detailed survey responses, but we recently dug into the data to quantify exactly how much better audio open-ends are than text.
Approach
To do this, we conducted a study with two nearly identical surveys: one with traditional open-text responses and one where respondents answered with voice. The surveys were completed by 200 unique respondents (N=200), each randomly assigned to either the text or the voice version. The respondents were a general population sample from the United States, Canada and the United Kingdom.
The survey asked respondents about their granola bar preferences, imitating a typical market research study. We compared text and audio responses primarily on three metrics: length, descriptiveness and complexity.
1. Utterance Length: The total number of words in a response.
2. Descriptive Language: The total number of adjectives and adverbs in a response.
3. Lexical Complexity: The complexity of a response, measured by Flesch-Kincaid grade level.
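As a rough illustration of how two of these metrics can be computed, here is a minimal Python sketch. It is not the code used in the study: the function names are hypothetical, the syllable counter is a crude vowel-group heuristic, and the grade level uses the standard published Flesch-Kincaid formula. Counting adjectives and adverbs needs a part-of-speech tagger, so that metric is left out here.

```python
import re

WORD_PATTERN = re.compile(r"[A-Za-z']+")


def count_words(text: str) -> int:
    """Utterance length: number of word tokens in a response."""
    return len(WORD_PATTERN.findall(text))


def count_syllables(word: str) -> int:
    """Approximate syllables by counting vowel groups (heuristic, not exact)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Discount a trailing silent 'e' ("like"), but keep "-le" endings ("apple").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    words = WORD_PATTERN.findall(text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

Because the syllable count is approximate, the grade level from this sketch will differ slightly from what a dedicated readability library reports.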
Results
The results demonstrate a statistically significant increase in descriptive language, utterance length and lexical complexity in the audio responses as compared to the text. Audio responses are on average 2.8x greater in length and use 1.5x more descriptive words per response. The average complexity also jumped more than 2.5 grade levels.
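This post does not say which significance test was applied. One common option for comparing two independent groups is a permutation test on the difference in means, sketched below under that assumption. The function name and parameters are illustrative only.

```python
import random
from statistics import mean


def permutation_p_value(text_values, voice_values, n_permutations=10_000, seed=0):
    """One-sided p-value for mean(voice_values) > mean(text_values).

    Repeatedly shuffles the pooled values, splits them back into two groups
    of the original sizes, and counts how often the shuffled difference in
    means is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = mean(voice_values) - mean(text_values)
    pooled = list(text_values) + list(voice_values)
    n_text = len(text_values)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[n_text:]) - mean(pooled[:n_text])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)
```

The same function can be run once per metric, for example on per-response word counts or per-response grade levels.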
The quantitative measurements varied minimally across the six unstructured responses. Fewer than 5% of text responses were longer than 30 words, whereas 35% of voice responses were at least this length. Text response lengths follow an exponentially decreasing distribution, whereas audio response lengths are more accurately modelled by a lognormal distribution, which suggests there is a typical audio response length.
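One way to check which distribution describes a set of response lengths better is to fit both by maximum likelihood and compare the resulting log-likelihoods. The sketch below assumes that approach, which may differ from the analysis behind the results above. The function names are hypothetical, and all lengths must be positive.

```python
import math


def exponential_log_likelihood(lengths):
    """Maximized log-likelihood of an exponential fit (MLE rate = 1 / mean)."""
    n = len(lengths)
    total = sum(lengths)
    rate = n / total
    return n * math.log(rate) - rate * total


def lognormal_log_likelihood(lengths):
    """Maximized log-likelihood of a lognormal fit.

    The MLE parameters are the mean and (population) variance of the
    log-lengths. Requires every length > 0 and at least two distinct values.
    """
    logs = [math.log(x) for x in lengths]
    n = len(logs)
    mu = sum(logs) / n
    variance = sum((v - mu) ** 2 for v in logs) / n
    # Sum of log pdf terms; the squared-error term collapses to n / 2 at the MLE.
    return -sum(logs) - 0.5 * n * math.log(2 * math.pi * variance) - n / 2
```

The fit with the higher log-likelihood describes the data better. Both models here have a similar number of parameters (one versus two), so a large gap is informative, while a small gap would call for a penalized comparison such as AIC.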
Aside from the measurable quantitative differences, the voice responses differed from text in other important ways. Respondents were enthusiastic, shared personal anecdotes and jokes, offered more detail and otherwise engaged more authentically with the survey questions.
Takeaways
Overall, the study demonstrates that both the quantity and quality of information offered via voice responses far surpass what users are willing to type. The efficacy of voice surveys is supported by a substantial increase in descriptive language, utterance length and lexical complexity. These results are not unprecedented: they are a predictable consequence of using a lower-friction input mode, and this study begins to quantify the degree to which survey friction compromises data quality. When answering open-ended questions, speaking aloud comes more naturally than typing, and as a result people are more willing to elaborate on their thoughts and opinions to a degree not seen in text responses.