Dismayed by woeful AI chatbots, boffins hired real people – and went back to square one
Amazon Turk serfs have their own problems
Analysis Convinced that intelligent conversational assistants like Amazon Alexa, Microsoft Cortana, and Apple Siri are neither particularly intelligent nor capable of sophisticated conversation, computer boffins last year began testing a crowd-powered assistant embodied by Amazon Mechanical Turk workers.
The chatbot, a people-powered app called Chorus, proved better at conversation than software-based advisors, but hasn't managed to overcome poor human behavior.
Described in a recently published research paper, Chorus was developed by Ting-Hao (Kenneth) Huang and Jeffrey P. Bigham of Carnegie Mellon University, Walter S. Lasecki of the University of Michigan, and Amos Azaria of Ariel University.
The researchers undertook the project because chatbots are just shy of worthless, a sorry state of affairs made evident by the proliferation of labelled buttons in chatbot interfaces. Businesses the world over had hoped conversational software would replace face-to-face reps and call-center staff, since machines should be far cheaper and easier to run.
The problem is simply that natural language processing in software is not very good at the moment.
"Due to the lack of fully automated methods for handling the complexity of natural language and user intent, these services are largely limited to answering a small set of common queries involving topics like weather forecasts, driving directions, finding restaurants, and similar requests," the paper explains.
Jeff Bigham, associate professor at Carnegie Mellon's Human-Computer Interaction Institute, in a phone interview with The Register, said, "Today, if you look at what's out there, like Siri, they do a pretty good job using specific speech commands. But if you want to talk about anything you want, they all fail badly."
Bigham and his colleagues devised a system that connects Google Hangouts, through a third-party framework called Hangoutsbot, with the Chorus web server, which routes queries to on-demand workers participating in Amazon Mechanical Turk.
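The paper doesn't publish the server's code, but the relay idea is straightforward: user messages go into a queue for crowd workers, and worker replies are queued back to the user. A minimal sketch, assuming hypothetical class and method names (the real hop is Google Hangouts → Hangoutsbot → the Chorus web server → Mechanical Turk):

```python
import queue

class CrowdRelay:
    """Minimal sketch of the relay layer: chat messages are queued for
    crowd workers, and worker replies are queued back to the user.
    All names here are illustrative, not Chorus's actual API."""

    def __init__(self):
        self.to_workers = queue.Queue()  # user -> crowd
        self.to_user = queue.Queue()     # crowd -> user

    def user_message(self, text):
        # In Chorus, this hop arrives via Hangoutsbot from Google Hangouts.
        self.to_workers.put(text)

    def worker_reply(self, text):
        # Workers recruited on Mechanical Turk post candidate answers here.
        self.to_user.put(text)

    def next_for_workers(self):
        return self.to_workers.get_nowait()

    def next_for_user(self):
        return self.to_user.get_nowait()

relay = CrowdRelay()
relay.user_message("How many suitcases can I take from the US to Israel?")
task = relay.next_for_workers()           # shown to on-demand workers
relay.worker_reply("Most carriers allow one checked bag on that route.")
print(relay.next_for_user())
```

In the real system, several workers see each query and propose or vote on responses; the single-queue version above only illustrates the message routing, not that aggregation step.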
Chorus is not the first project to incorporate a living backend, the research paper acknowledges, pointing to projects like VizWiz, which crowdsources help for the blind. Its aim is to explore the challenges of deploying a crowd-based system and to suggest future avenues of research for improving conversational software.
Real people, it turns out, are fairly adept at extemporaneous conversation, even if they're basically meat-to-metal bridges for Google Search queries in Chorus.
During the test period last year, 59 people participated in 320 conversations, which on average lasted more than 10 minutes and involved more than 25 messages. A lengthy sample exchange presented in the paper details a conversation about the number of suitcases a person can take on a plane from the US to Israel. It reads like a call center transcript.
The average cost of each HIT – Amazon Mechanical Turk terminology for a task – came to $5.05, and the average daily cost was $28.90.
So far so good. But while people may have an edge with words, they bring with them their own set of problems.
First, they don't have a sense of when the conversation has ended. Software can be set to time out after inactivity, but people, lacking the usual social cues, may not be so savvy.
Chorus in fact implements a session timeout, but that didn't fully address the problem of waiting. "Often towards the end of a conversation, users respond slower or just simply leave," the paper explains, noting that one person who asked about wedding gown rentals in Seattle went silent for 40 minutes before responding "Thanks" after the session timeout.
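The mechanics of such a timeout are simple enough; the hard part, as the wedding-gown example shows, is that users drift back after it fires. A minimal sketch of an inactivity timeout, with an illustrative 300-second threshold (the paper's actual value isn't given here):

```python
import time

SESSION_TIMEOUT = 300  # seconds of silence before the session closes; illustrative value

class Session:
    """Sketch of an inactivity timeout: the session is considered over
    once no message has arrived for SESSION_TIMEOUT seconds."""

    def __init__(self, now=None):
        self.last_activity = now if now is not None else time.time()

    def touch(self, now=None):
        # Called whenever the user or a worker sends a message.
        self.last_activity = now if now is not None else time.time()

    def expired(self, now=None):
        now = now if now is not None else time.time()
        return (now - self.last_activity) > SESSION_TIMEOUT

s = Session(now=0)
s.touch(now=100)           # user sends a message at t=100
print(s.expired(now=200))  # False: only 100 s of silence so far
print(s.expired(now=500))  # True: 400 s exceeds the 300 s threshold
```

A user who returns 40 minutes later, as in the paper's anecdote, lands well past any reasonable threshold, which is why the timeout alone can't tell workers whether a conversation is truly finished.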
Lack of information about when conversations conclude can be a burden for workers and drives up costs for the system.
Then there's the problem of dealing with malice, from workers and from end users. Chorus saw spammers (workers who responded with meaningless information), flirters (workers showing too much interest in a user's personal information), and one instance of abuse.
That user, who spewed profanity and hate speech, appears to have been trying to recreate the glorious failure of Microsoft's Tay chatbot, which had to be taken down after internet users decided to hijack it to spew hate speech.
The Chorus message log suggests the abusive individual initially thought the app was a machine learning project. "The user later realized it was humans responding, and apologized to workers with 'sorry to disturb you,'" the paper explains. "The rest of this user's conversation became nonviolent and normal. The abusive conversation lasted nearly three conversational sessions till the user realized it was humans."
Bigham said he's heard concerns that Chorus could become the next Tay. "Chorus fortunately is not structured to quite so easily take on such behavior," he said.
Chorus revealed other challenges to human-supported chat, including keeping enough workers available to field queries in a timely manner, dealing with questions that have no clear answer, and handling requests for complex actions such as making a restaurant reservation.
"The next big paradigm shift in these systems is really being able to talk to them like a human assistant," said Bigham, who believes that shift remains a long way out.
"I don't see the path from what we have right now to a completely automated system that is as capable as me calling up a friend on the phone," he explained.
Even so, he hopes Chorus will serve as a platform to explore how conversational interaction can be made more automated. ®