How we find and update Points of Interest from street-level imagery

As part of our map-making process, our data scientists are always exploring new ways of using artificial intelligence and machine learning to detect and extract street-level features and automatically build our map.

An important part of keeping our maps fresh and comprehensive involves continuously capturing vast amounts of information about Points of Interest (POIs) from numerous sources around the world.

One method we’ve developed to find POIs, and verify existing ones, involves detecting business signs from street-level imagery captured by our HERE mapping vehicles.

We met two data scientists from our Chief Technology Office in Seattle, project lead Mark Thompson and Xing Fu, to find out what they’ve been working on.

“We’re currently collaborating with our Core Maps team to help them automate the process of verifying POIs to keep our map up to date,” Mark explains. “We’ve seen that deep learning techniques for extracting information from imagery have been dramatically successful over the last few years, and our HERE mapping vehicles have been capturing street-level imagery for over a decade now.

“Each of our fisheye camera-equipped cars gathers vast amounts of imagery and data each day that’s helping us build a real-time digital representation of the physical world (what we call the Reality Index). That’s a tremendously rich source of data that we own and can easily tap into.

“It was about this time last year when we asked ourselves: 'Is there a way we can automatically extract POIs from our own street-level imagery?' The answer, of course, is yes.”

Since then, Mark and Xing have developed deep learning techniques that automatically detect business signs from our street-level imagery. They can then match these signs to our map to see whether existing POIs are correct or need updating, and to add new ones.
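To make the matching step concrete, here is a minimal sketch in Python. The article does not say how signs are matched to the map, so everything here is an assumption for illustration: the function names, the 0.8 similarity threshold, and the use of fuzzy string comparison from the standard library. A production system would presumably also use the position of the vehicle to limit the candidate POIs.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def match_sign(sign_text, poi_names, min_ratio=0.8):
    """Compare transcribed sign text with nearby POI names (hypothetical helper).

    Returns ("existing", name) when the best fuzzy match reaches min_ratio,
    otherwise ("new_candidate", None).
    """
    sign = normalize(sign_text)
    best_name, best_ratio = None, 0.0
    for name in poi_names:
        ratio = SequenceMatcher(None, sign, normalize(name)).ratio()
        if ratio > best_ratio:
            best_name, best_ratio = name, ratio
    if best_ratio >= min_ratio:
        return ("existing", best_name)
    return ("new_candidate", None)


# Illustrative data only, not taken from any real map.
nearby_pois = ["Joe's Coffee", "Main Street Bakery"]
confirmed = match_sign("JOES  COFFEE", nearby_pois)
candidate = match_sign("Blue Lotus Yoga", nearby_pois)
```

In this toy example the first sign is matched to an existing POI despite the missing apostrophe, while the second has no close match and would be flagged as a possible new POI.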

Recognizing and transcribing the text on all kinds of signs from street-level imagery is a huge challenge, yet the techniques they’ve developed enable them to do so.

The process as it stands today involves two key steps, which are described in the following video.

The first step is “text localization”. Xing explains, “Raw imagery from the HERE mapping vehicle is fed into a fully convolutional neural network, which extracts the parts of the images that contain text. It does this by creating a heatmap that shows where in the image text is most likely to appear.

“This isn’t as straightforward as it sounds, as signs are hugely variable in appearance (font, layout, background, etc.), are sometimes captured in poor lighting, and are often obstructed from view. Still, the majority of signs are detected as the car passes.”
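As a rough illustration of how a heatmap can be turned into text regions, the sketch below thresholds a small grid of per-pixel text probabilities and groups neighbouring pixels into bounding boxes. This is not the team’s network or post-processing, which the article does not describe; the function name, the 0.5 threshold and the toy heatmap are all assumptions.

```python
from collections import deque


def text_boxes(heatmap, threshold=0.5):
    """Return (top, left, bottom, right) boxes around connected regions
    whose text probability is at least `threshold` (assumed value)."""
    if not heatmap:
        return []
    rows, cols = len(heatmap), len(heatmap[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or heatmap[r][c] < threshold:
                continue
            # Breadth-first flood fill over 4-connected neighbours.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and heatmap[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Toy 4x5 heatmap with two likely text regions.
heatmap = [
    [0.1, 0.9, 0.8, 0.1, 0.0],
    [0.0, 0.7, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.0, 0.0, 0.9],
]
boxes = text_boxes(heatmap)  # [(0, 1, 1, 2), (2, 4, 3, 4)]
```

Each box could then be cropped from the original image and passed on to the recognition step.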

The second step is “text recognition”. “Using deep learning,” Xing continues, “we classify text and filter out any ‘noise’, such as windows or patterns, which are sometimes misclassified as text. Our recurrent neural network uses a sequence model to transcribe the text, with an attention model focusing on specific parts of the images to better recognize individual characters, numbers and symbols.”
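The idea of an attention model focusing on specific parts of an image can be sketched with a single attention step. The article does not say which form of attention is used, so dot-product attention with a softmax is assumed here, and the feature vectors and query are made-up numbers standing in for values a trained network would produce.

```python
import math


def attend(query, features):
    """One dot-product attention step (assumed form, for illustration).

    `features` holds one vector per horizontal slice of a cropped sign;
    `query` stands in for the decoder state while reading one character.
    Returns the attention weights and the weighted-average context vector.
    """
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    context = [
        sum(w * feat[i] for w, feat in zip(weights, features))
        for i in range(dim)
    ]
    return weights, context


# Three image slices with two-dimensional features (toy values).
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 4.0]]
query = [0.0, 1.0]
weights, context = attend(query, features)
focus = weights.index(max(weights))  # slice the model "looks at" most
```

Here the weights sum to one and the third slice receives most of the attention, so the context vector passed to the character classifier is dominated by that part of the image.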

Now Mark and Xing, along with other teams in the Chief Technology Office, are working on mining other data from street-level imagery, such as logos, storefront images and access points.

“There are numerous other useful applications of extracting scene text that we’ll explore in the coming months,” Mark enthuses. “And there’s so much else we can do. It’s exciting to imagine what we’ll be able to achieve within the next few years as our deep learning techniques continue to evolve and advance.”