Brit neural net pioneer just revolutionised speech recognition all over again
Deep learning with Dr Tony Robinson
Profile One of the pioneers of making what's called "machine learning" work in the real world is on the comeback trail.
At Cambridge University's Computer Science department in the 1990s, Dr Tony Robinson taught a generation of students who subsequently turned the Fens into the world's speech recognition centre. (Microsoft, Amazon and even secretive Apple have speech labs in Cambridge; Robinson's pioneering work lives on, having been tossed around Autonomy, HP and now MicroFocus.) With his latest venture, Speechmatics, Robinson finally wants to claim some of that success for himself.
A former student of Robinson's, techie and entrepreneur Matthew Karas, recalls him as "the most popular lecturer in that department by a long way; a lovely guy and a genuine scientific genius. He really is up there with the absolute greats."
What the Teessider achieved was to prove that neural networks could work for speech recognition.
"He did what the best speech scientists said was impossible," Karas recalls. "By 1994 he had a system in the top 10 in the world in the DARPA Continuous Speech Evaluation trial. The other nine systems were all hidden Markov models, and Tony's was the only neural network system. He proved it could get into top 10, which was a massive innovation."
With neural networks today tweaked and rebranded as "machine learning" and "deep learning" (which is how Robinson's Speechmatics brands its system), his legacy represents an important but largely unheralded British contribution to the modern world. Even more so since "deep learning" sounds better in a research paper than in reality.
How neural nets revolutionised speech recognition
Robinson himself explains:
The theory goes back to a lot of IBM work way before the 1980s. I could see the potential in the very late 1980s and early 1990s of neural nets and speech recognition. Recurrent neural nets had the wonderful advantage that they feed back on themselves. So much of how you say the next bit depends on how you said anything else. For example, I can tell you're a male speaker just from the sound of your voice. Hidden Markov models have this really weird assumption in them that all of that history didn't matter. The next sample could come from a male speaker or a female speaker, they lost all that consistency. It was the very first time that continuous recognition was done. We used some DSP chip we had lying around.
Karas, who recently won investment in his latest speech ventures, says: "Hidden Markov models work only if you know not just the phoneme context probabilities, but the word context probabilities. The list of viable three-word combinations is very very long.
"With neural networks, the system will assign a probability to something in context without having to know every possible context. It does it by trial and error. The others know the probability, because they have a list of all the hits and misses and divide one by the other. Now, almost all speech recognition systems use neural nets in some way."
But wider recognition has come late, and it's been a rocky road. The height of the dotcom bubble found Robinson and Karas each working for their own speech startups, with investment from Mike Lynch's Autonomy.
Autonomy invested in Robinson's first company SoftSound in May 2000. Karas in the meantime had set up BBC News Online, the skunkworks project that "saved the BBC", and then found an application for his speech recognition know-how in turning newsroom video into searchable text: Dremedia. But it was a third startup, Blinkx, that got Autonomy's attention, and through acquisition, Autonomy had built out a diversified set of businesses, losing interest in Robinson's work. He left when Autonomy acquired SoftSound outright in 2006.
Then for a while Robinson led the advanced speech group for SpinVox, which became notorious when the proportion of human transcription was revealed. Insiders told us that "no more than 2 per cent" of messages were actually machine transcribed, and SpinVox wanted Robinson to build a future system with a much higher level of automation. Within months the company was sold to Nuance. So it was back to the drawing board.
In recent years speech recognition from Amazon, Microsoft and Google has made phenomenal advances. What can Speechmatics boast? What is it and what does it do?
Language models falling short? Make a new one
Despite great advances from the US giants, there are still huge flaws. Six years after launch, Apple's Siri still can't cope with Scottish or Geordie accents. And adding new languages, while necessary to break into overseas markets, is a painstaking process.
Reflecting on his 90s work, Robinson had unfinished business. "We'd nearly made it to the tipping point of tipping everyone over. But we didn't. There was a period of slowdown in improvements, although it always got better year on year."
He went back to the drawing board. What emerged has caused ripples across the speech recognition community. Speechmatics showed a real-time, speaker-independent recognition system that could add new languages easily, but ran on an Android phone – or a server on your premises.
"If you read it you would not know it's incredible. It's a technology of such remarkable ingenuity. I was astonished when I heard it was possible," enthuses Karas.
To understand the breakthrough you need to grasp the importance of a probabilistic language model in speech recognition.
"A language model is a huge load of probability tables and data that match sound with word," Karas explains. "Neural networks help because it's quicker to get to a correlation without having to list every probability of every context. The context won't have to be enumerated (was that 'Tattoo' or 'Tattle?')."
But some contexts have a genuine lexical ambiguity, and no matter how much data you have, the machine will struggle. The two pronunciations of 'row' (rhyming with 'go' or with 'now') are an example. "Even with the context, you couldn't know how to say 'row' in the sentence, 'John and Jim were rowing'. A phonetic equivalent might be that the system would need more than just close word context, to distinguish between 'The doctor needs more patience' and 'The doctor needs more patients'," says Karas.
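Karas's "huge load of probability tables" can be shown in miniature with a count-based bigram model. This is a hypothetical toy, not Speechmatics' system, with an invented two-sentence corpus; note how the patience/patients pair leaves the model tied, just as he describes.

```python
from collections import Counter

# Invented toy corpus: every probability comes from a raw count,
# the "list of all the hits and misses" approach.
corpus = ("the doctor needs more patients "
          "the doctor needs more patience with the patients").split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_next(word, prev):
    """P(word | prev) estimated by dividing bigram by unigram counts."""
    return bigrams[(prev, word)] / unigrams[prev]

# 'more' precedes 'patients' and 'patience' equally often here, so the
# language model cannot break the tie -- and the acoustics can't either,
# because the two words sound identical.
print(p_next("patients", "more"), p_next("patience", "more"))  # -> 0.5 0.5
```

A real system's tables cover millions of word pairs and triples, which is exactly the "very very long" list of viable combinations Karas points to.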
Another wrinkle is accents: people speak differently at different times. Tony Blair, for example, famously adopted Mockney on occasion.
So Robinson devised a new way of doing a language model.
New models used to take months because the process would include the compilation of pronunciation dictionaries and other canonical data sources, which would then be applied to the training data from real speakers. The training data used to be marked up phonetically, which was only partially automated – requiring laborious manual correction.
"The new system uses an algorithm with an international phoneme set, which can work for a completely unknown language," says Karas. "Give it some Mongolian audio and Mongolian text and, just from the order and frequency of the characters in the text and the properties of the sound wave, it works out which time-slice of the audio matches which word in the text. You can process any transcribed or scripted source, with associated media, search the text, and the results will link directly to the right point in the audio within a few milliseconds."
"Since 2000 I've been with around half a dozen companies," Robinson said. "Small companies are always cash bound and all want to compete with the large companies. It was obvious with this background that if we wanted to compete with very large speech companies we had to produce more than 20 languages, roughly. So with my money how could we do this?"
Speechmatics put a live demo on its website, with a metered usage model. But, unsatisfied, Robinson dismantled the architecture and rebuilt it last year.
Rip it apart, cut the chaff, put it back together
"Without wanting to contribute too much to the hype around neural networks here, we had a system that was live on the Speechmatics site and the volume we were doing was going up and up. We knew we needed to get more efficient at it. We took the whole system apart, this time last year, and I set about with two bright guys I work with a task: take it all apart, work out what we need to do, stick it together in the most efficient order. Really question everything we need to do, every assumption. How much can we put in the neural networks? How much can we take away from the CPU-intensive part of it? Get rid of it as much as we can.
"Between the three of us, we came up with a new architecture for doing speech recognition. It heavily relies on neural network acoustic models and language models. We brought the memory down, and the speed up, so it was good enough to go on the phone. We put it on without too much work. But it's using only one processor core."
Even against a noisy background, the demo on an Android was stunningly accurate.
"Last year we were putting languages out every two weeks. There's 27 right now and some more are coming. We're tackling the hardest languages in the world, like Icelandic. A year ago building a new language model was an overnight job, but we've got that down."
There are several reasons why companies would want to use a speech specialist like Speechmatics, rather than Google.
"It's fun. We have so many different people coming to us with so many different needs. Everyone has understandable concerns about who's using the data for what: you want to know it's not leaving the building. We do on-premises work, halfway between cloud and the embedded stuff," Robinson says.
"We can just say: here's a copy, it's the same thing running on Android but we know you're a bank, you cannot have data leave the building, so here's something you can install on the tin and it runs the speech recognition in exactly the same way as it does with the cloud. The cloud is in many ways a shop front for us."
Speechmatics' ability to transcribe and index huge volumes of speech quickly has been noticed in finance and legal circles.
"You need to be able to unwind a financial transaction, and explain that this is the sequence of things that led up to it. A recorded conversation by itself is not good to you. You need to make it searachable. We're just a little tool in their grand scheme of things."
How is Speechmatics able to add new languages so quickly?
"It's the neural networks! We need some data for a particular language, but much less than you normally need, because we can pick up what we've done from other languages. How I make sounds with my mouth is quite similar to a Japanese speaker – you've got the same vocal apparatus. You're making the same sort of sounds.
"So a lot of what we've got to do is, first going from wave form, that acoustic data, is get to the phonemes, the basic sounds of the language. It isn't completely language dependent. So we can have thousands of hours from one language and a smaller amount from another, and just say tweak it a little bit."
Lowering the cost has some unexpected benefits.
"Icelandic has only about 400,000 speakers, and they're worried that their language will die out. But it's a country with only one-twentieth the population of London. If it was expensive to do, we'd never do Icelandic."
And the near future, and long-term goals?
"How can we ensure as many people as possible can use it? One of the things I like about commercial research is that people actually use it. You can publish four-page papers on your work, and people just fall asleep.
"We have released the API to our cloud version and the API to the real-time embedded one is almost ready. There are business problems to sort out – like licensing – but we want to stay the most accurate."
Even if the neural network hype bursts, "we've got a solid base of users." ®