Want a medal? Microsoft 7.2% less bad at speech recognition than IBM

In a machine learning tug-of-war, Microsoft may have just barely slipped ahead of IBM for speech transcription accuracy. Researchers are studying how to recognise human speech in a variety of settings – from real-time interactions to offline, pre-recorded voicemails. Boffins tell us that one application, particularly of offline …

  1. Anonymous Coward
    Boffin

    She said what?

    My company has been trialling putting corporate videos on Microsoft Stream. The accuracy of automatic transcription is astonishing, making video content searchable by word as well as metadata.

  2. imanidiot Silver badge

    I'm not holding my breath

    Error rates have been improving, but that doesn't really mean speech recognition will get GOOD any time soon. Humans tend to be able to gloss over a lot of misheard words and auto-correct from context. Automated recognition systems too often get this wrong.

  3. Pen-y-gors

    What sort of errors?

    It would be interesting to know what sort of errors they're getting, and what the humans get.

    It's very different whether the software interprets 'Trump' as 'rumpy-pumpy' or merely confuses heel, heal, he'll etc.

    Accents are a big problem - think strong Derry accent, where 'now' sounds like 'nigh'. Heaven knows how you train software to recognise that speaker A is from Glasgow and speaker B is from New Jersey.

    1. Mage Silver badge
      Coat

      Re: speaker A is from Glasgow and speaker B is from New Jersey.

      Context and engaging in conversation work best. Speech recognition / "AI" is poor at context and rubbish at "real" conversation.

      It's more brute force using a giant database.

      "Kin yer mammy sew?"

      Very different meaning in parts of Glasgow. Or it was 40 years ago.

    2. Teiwaz

      Re: What sort of errors?

      strong Derry accent 'now' sounds like 'nigh'.

      It's more a 'nai[gh]' (the strength of the 'gh' is dependent on emotion, alcohol intake and how long it's been raining).

  4. The Jon

    Outlook Voice Recognition

    Here's the last line (personally transcribed) from a voicemail received recently:

    I'll drop you an email in this respect as well and hopefully we'll catch up before Thursday. Take care, bye bye.

    And this is what Outlook automatically transcribed it as (see if you can spot the difference):

    I'll drop you an email in this respect as well and hopefully will touch before fisted take care bye bye.

    1. ratfox
      Trollface

      Re: Outlook Voice Recognition

      I heard that those automatic systems take into consideration the interests of the user, as extracted from their browser history, etc.

      1. Solarflare

        Re: Outlook Voice Recognition

        Personally, I would want you to buy me a coffee before any touchy-fisty business, but then I'm old-fashioned like that, I suppose.

    2. kain preacher

      Re: Outlook Voice Recognition

      So it knows what you really do on Thursdays. Man, you should clean out your emails from your domme.

  5. Anonymous Coward
    Thumb Down

    Every time IBM crows about something good they've done, I just remember WebSphere and Lotus Notes.

    1. Lord Elpuss Silver badge

      TBH WebSphere was (is?) pretty good. The issues arose due to the sheer spellbinding complexity - often requiring a dedicated team to spend days installing it. For me, the killer was the fact that every WS installation was a custom job, and only ever worked well for one specific environment. Migration, upgrade, patching or even just breathing hard in the server room could often necessitate a custom re-install.

      1. kain preacher

        WebSphere is used to torture the support team. Lotus Notes is to torture everyone. Combine the two and you get SharePoint. Wonder what happens if you combine SharePoint with Clippy, or better yet SharePoint running on WebSphere with Lotus Notes?

  6. PNGuinn
    Trollface

    Obligatory

    I'm so depressed ... I've got this pain ....

  7. Tony W

    Accents

    Humans learn to adjust for accents, which mostly make consistent changes to vowels, so I expect AI will eventually do the same. Dialect is another matter; it took me a year to understand broad Potteries.

    But I wish some of this expertise could be applied to the systems used on the phone by organisations like BT and British Gas. Both these systems (made by the same company, I'm sure) consistently fail to recognise when I say "yes", even on repeated attempts, and my accent is pretty ordinary London. Maybe I should say "Yep", "Yup" or "Yeah".

    1. Paul Herber Silver badge

      Re: Accents

      @TonyW

      Absolutely

  8. Nick Kew

    Nothing new here

    All the subjects mentioned here in the comments - unclear speech, stumbles, hesitation, um, ah, accents, intoxication, sloppy figures of speech, the cocktail-party problem - are precisely what makes speech recognition hard.

    And they were what made it hard when I worked in the field, back in the early 1990s. Commentards are identifying issues the researchers have been wrestling with for decades. Something has clearly improved since then, and I don't think it's *just* the march of hardware (Moore's law, etc), though certainly my project was eventually privileged to have use of a supercomputer with tens of megabytes of RAM.

    Sadly, performance measurement - the accuracy of speech recognition - is still based on some very suspect measures. The figures MS or IBM, or Apple or Google, report will be for very specific tasks that have limited value in measuring the real world. Accuracy measures are quite pernicious, in that they naturally favour a system that uses the same unit of classification as the test set.

    So in my day, our system, which worked with syllables as its primary unit, couldn't be meaningfully compared with the majority that used phonemes. And among the latter, different teams used different phoneme sets, thus setting themselves widely different tasks. A system that just classifies vowel-vs-consonant has a much easier task than one that classifies 80 different sounds, but you have to dig deeper into the results than this article does to tell that the former's 90% might be less rather than more impressive than the latter's 80%.

    I did try to push the information-theoretic measure of entropy as offering better cross-system comparisons, but they seem instead to have gone for defined tasks. Like, erm, testing diesel emissions. Or kids working to an exam. Or ...
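
    For the curious, here's a rough Python sketch of what I mean by an information-theoretic comparison: a Fano-style lower bound on the bits each decision conveys, assuming uniform class priors. The two example systems are the hypothetical 2-class-at-90% and 80-class-at-80% classifiers mentioned above; the numbers are illustrative, not anyone's published results.

    import math

    def bits_per_decision(num_classes: int, accuracy: float) -> float:
        """Fano-style lower bound on the information (in bits) conveyed per
        classification decision, assuming uniform priors over the classes."""
        p_err = 1.0 - accuracy
        h_prior = math.log2(num_classes)          # entropy of the task itself
        h_err = 0.0                               # binary entropy of the error event
        if 0.0 < p_err < 1.0:
            h_err = -p_err * math.log2(p_err) - (1.0 - p_err) * math.log2(1.0 - p_err)
        spread = p_err * math.log2(num_classes - 1) if num_classes > 1 else 0.0
        return max(0.0, h_prior - h_err - spread)

    # Hypothetical systems from the paragraph above:
    print(bits_per_decision(2, 0.90))   # vowel-vs-consonant, 90% correct
    print(bits_per_decision(80, 0.80))  # 80-phoneme classifier, 80% correct
    # The 80-class system conveys several times more information per decision,
    # despite its lower headline accuracy.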

  9. Anonymous Coward
    Anonymous Coward

    It isn't just machines that have trouble

    You don't need 100% accuracy, but you need 100% accuracy in the right places. My iPhone does voicemail-to-text, and even though it isn't always 100%, I generally don't have to listen to the message because a 95% accurate (or whatever) transcription is good enough.

    For meeting notes you'd want near-perfect accuracy, especially if you want them searchable. If I'm looking for the meeting where we discussed the outcome of "project athena", but its name was transcribed as "project tina" in the meeting where we were informed it was canceled, then I'm not going to be able to get that critical information via search. Whereas if I were reading the meeting notes, or received notice of the cancellation via voicemail, I could easily infer what "project tina" really was.
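
    (To make that concrete, here's a quick, purely illustrative Python sketch; the transcript string, query and threshold are all made up. Exact search misses the mangled name, while even a crude fuzzy match over the transcript can still surface it.)

    from difflib import SequenceMatcher

    transcript = "we were informed project tina has been canceled"
    query = "project athena"

    # Exact keyword search fails: the recogniser never wrote "athena".
    print(query in transcript)                  # False

    def best_similarity(query: str, text: str) -> float:
        """Best similarity ratio between the query and any same-length
        window of words in the transcript."""
        words = text.split()
        width = len(query.split())
        windows = (" ".join(words[i:i + width])
                   for i in range(len(words) - width + 1))
        return max(SequenceMatcher(None, query, w).ratio() for w in windows)

    # The best window ("project tina") scores well above a modest
    # threshold such as 0.7, so a fuzzy search would still flag the meeting.
    print(best_similarity(query, transcript))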

    1. Nick Kew

      Re: It isn't just machines that have trouble

      If I'm looking for the meeting where we discussed the outcome of "project athena", but its name was transcribed as "project tina"

      Now that's the kind of error that's very common in human-produced notes.

      Indeed, that applies to any name (other than one clearly expected in the context), because you don't have the reference point to correct what you heard slightly ambiguously. Cold-callers know this well, hence "This is Athena from [mumble]" obscures her affiliation without her failing to tell you it - and if you hear it as Tina that's an extra bonus.

      1. Anonymous Coward
        Anonymous Coward

        Re: It isn't just machines that have trouble

        Only if the notes are taken by a different person each time, one who isn't involved in any of the work. Otherwise there's no reason why a person should make that error if project athena has been discussed at previous meetings, is mentioned in presentations given in the meeting, etc.

  10. Griffo

    Will it ever be possible?

    Subject context is another item that must make speech recognition very difficult.

    A while back, I was reviewing the official police transcript of an interview with a man who was accused of killing his wife during a scuba accident.

    Luckily the video of the same interview was available, as the number of critical mis-transcriptions was amazing. Here was something a human had transcribed from a very clear AV feed, and had seriously gotten wrong on several occasions. Why? Because the transcriber had no knowledge of scuba terms, and had transcribed what he or she thought they had heard.

    But what they thought they had heard was not shaped by any knowledge of the subject, so it was wrongly transcribed.

    I suppose a computer may one day be able to work out the subject and then apply a specific set of industry / subject terms to it (which is why, I guess, they make medical-specific transcription software), but as a human I can never follow along when my wife switches topic 13 times in one conversation, so how will a computer?
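
    (For illustration, a hypothetical Python sketch of that "apply a subject-specific term list" idea: re-rank a recogniser's top guesses by rewarding the ones that contain known domain terms. The lexicon, the scores and the bonus weighting are all invented.)

    # Tiny, made-up scuba lexicon used to bias the choice between hypotheses.
    SCUBA_TERMS = {"regulator", "octopus", "ascent", "nitrox", "weight belt", "bcd"}

    def rescore(nbest, lexicon=SCUBA_TERMS, bonus=2.0):
        """nbest is a list of (hypothesis_text, recogniser_score) pairs, higher
        scores being better; return them re-ranked with a bonus for each
        domain term a hypothesis contains."""
        def adjusted(item):
            text, score = item
            hits = sum(term in text.lower() for term in lexicon)
            return score + bonus * hits
        return sorted(nbest, key=adjusted, reverse=True)

    # Invented example: the acoustically likelier guess loses to the one
    # that makes sense in a diving interview.
    guesses = [("she dropped her octopus on the accent", 4.3),
               ("she dropped her octopus on the ascent", 4.1)]
    print(rescore(guesses)[0][0])   # -> "she dropped her octopus on the ascent"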

  11. mako23

    Not a surprise

    The Microsoft speech translator understands words like "honesty", "dignity" and, most of all, "paying a decent redundancy".

    The IBM product doesn't
