Getting vocal about HERE spoken navigation

Our HERE apps for mobile phones come with voice navigation to give you the right directions every time you need to make a decision.

But did you know that there are actually three different kinds of voice packs available in the apps?*

The lowest level concentrates on basic driving directions – for example, ‘After 500 metres, turn left’. These directions are assembled on demand from pre-recorded voice snippets, an approach that lets us create local voice packs quickly and with fewer resources. Take heart, though: because it’s recorded, the voice might sound more natural than synthesised speech.
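For the curious, here is a rough sketch of how stitching recorded snippets together might work. It is only an illustration, not how the HERE apps actually do it: the snippet names, the file names and the build_prompt function are all invented for this example.

```python
# Hypothetical snippet inventory: every name and file here is an assumption
# made up for illustration, not part of any real voice pack.
SNIPPETS = {
    "after": "after.wav",
    "500": "500.wav",
    "metres": "metres.wav",
    "turn_left": "turn_left.wav",
    "turn_right": "turn_right.wav",
}


def build_prompt(distance_m, manoeuvre):
    """Return the ordered list of recordings to play for one instruction."""
    keys = ["after", str(distance_m), "metres", manoeuvre]
    # A recorded pack can only say what has been recorded in advance.
    missing = [key for key in keys if key not in SNIPPETS]
    if missing:
        raise KeyError(f"no recording for: {missing}")
    return [SNIPPETS[key] for key in keys]


playlist = build_prompt(500, "turn_left")
```

In this sketch, anything without a matching recording (a street name, say) simply cannot be spoken, which is why this level sticks to basic directions.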

In the audio below you can hear a sample in Brazilian Portuguese that says, "After 500 metres, turn left".

The second level consists of the regular voice packs available for US and UK English, French, German, Italian and Spanish. Instead of recordings, these employ a text-to-speech engine tuned to each particular language, to convert spelled-out street and exit names into an acoustic signal (i.e. spoken words) using a technique called statistical parametric synthesis.

This way, the spoken directions can say something like ‘After 500 metres, turn left onto King's Road’. This approach needs very little storage space (commonly about 5MB) and little processing power, so it will work on any phone. It is quite basic in terms of pronunciation and prosody, though: it does sound like a computerised voice.

The highest level consists of the hi-fi voice packs, also available in US and UK English, French, German, Italian and Spanish. These have the same capabilities as the second-level voice packs in terms of turning written words into speech, and they also use a specifically tuned text-to-speech engine rather than a collection of recordings. (If you think about it for a moment, the storage requirements for packs of recordings covering every street name would be unusably large even for the smallest country.)

However, the hi-fi packs apply a more sophisticated technique for generating speech called ‘unit selection’, so they speak with a voice that’s more natural and has more of the normal rhythm, stress and intonation of each of the supported languages.

This comes at the cost of more storage – the hi-fi voice packs weigh in at around 70MB – plus a little more memory and processor usage while they’re running. But if you have your phone hooked up to your car’s speaker system, for example, you’ll probably find that the additional comfort is worth it.

Sadly, weird variations in the way placenames are pronounced, born from years of history, are beyond the powers of pretty much any mobile speech generation system available today.

We promise to lead you correctly from Greenwich in SE London (pronounced ‘Grenn-idge’ by locals) to Leicester (‘Lester’) and onwards up to Edinburgh (‘Edin-burah’) – but we’ll do so phonetically!

The question is: what's your favourite voice? Let us know in the comments section below.

*This table shows which voice packs are available on the different operating systems.

Operating system	Voice packs available
Android	Highest level listed as “Hi-fi”; second- and lowest-level voices listed under “Regular”.
iOS	Highest level listed as “Hi-fi”; “Regular” contains only second-level text-to-speech voices.
Windows Phone	A mixture of text-to-speech and recorded voices.