Over the last few years, AI-based image recognition has been gaining popularity in the retail sector and among large technology companies. These systems can detect multiple types of products on store shelves, but recognizing every individual packaging design remains a challenge. The number of distinct items to recognize in a general store can be very large, on the order of a few thousand. Even for a medium-sized store, this number far exceeds what standard object detectors are typically built to handle.

How the big players are doing it

One of the best-known examples of this technology is the unmanned store Amazon Go. In this store, dozens of CCTV cameras monitor shoppers’ behavior and use deep learning to identify the items they pick up. Nonetheless, image recognition accuracy on its own still leaves a lot to be desired. As a result, additional technologies such as Bluetooth and weight sensors are used to make sure that products are identified correctly.
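To make the idea of backing up vision with a weight sensor concrete, here is a minimal sketch. It is a hypothetical illustration and not a description of how Amazon Go actually works: the function name, the product names, the catalog weights, and the 15 g tolerance are all assumptions made up for the example.

```python
from typing import Dict


def fuse_vision_and_weight(
    vision_scores: Dict[str, float],
    catalog_weights_g: Dict[str, float],
    measured_delta_g: float,
    tolerance_g: float = 15.0,  # assumed sensor tolerance, purely illustrative
) -> str:
    """Return the best vision candidate whose catalog weight matches the
    weight change measured on the shelf (hypothetical fusion rule)."""
    # Keep only candidates whose known weight agrees with the sensor reading.
    consistent = {
        name: score
        for name, score in vision_scores.items()
        if abs(catalog_weights_g[name] - measured_delta_g) <= tolerance_g
    }
    # If nothing matches the sensor, fall back to the vision scores alone.
    pool = consistent or vision_scores
    return max(pool, key=pool.get)


# Two look-alike products: vision slightly prefers the wrong one.
scores = {"cola_330ml": 0.48, "cola_500ml": 0.52}
weights = {"cola_330ml": 350.0, "cola_500ml": 530.0}

# The shelf got 345 g lighter, which only fits the smaller can.
picked = fuse_vision_and_weight(scores, weights, measured_delta_g=345.0)
print(picked)
```

In this toy case the camera alone would pick the 500 ml can, while the weight reading tips the decision toward the 330 ml one. A real system would need calibrated sensors and a proper probabilistic model rather than a hard threshold.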

Following Amazon Go, Walmart opened a new store in 2019, the Intelligent Retail Lab, to investigate the use of artificial intelligence in retail services. Deep learning was combined with in-store cameras to automatically recognize out-of-stock products and alert staff when it was time to restock. In addition, intelligent retail services such as automatic vending machines and self-serve scales have recently become available.

The challenge

As noted in the introduction, retail product recognition is more challenging than general object detection because it comes with its own specific conditions. Below, we walk through the main issues of retail product recognition and group them into four categories.

Deep learning-based systems require a substantial amount of annotated training data, which is a significant challenge when only a few examples of each product are available.

There are a few open-source programs available for image labeling, and they produce two types of annotation: bounding boxes and masks. These annotation tools still rely on manual labor to label every object in each image. A common object detection dataset typically contains tens of thousands of training photos, so building a database with enough training data takes a very long time.
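The difference between the two annotation types can be shown with a small sketch. The `BoxAnnotation` class, the `mask_to_box` helper, and the toy mask below are hypothetical and do not follow the format of any particular labeling tool. The mask marks every pixel of the object, while the box keeps only the corners of the tightest rectangle around it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BoxAnnotation:
    """Bounding-box label: class name plus inclusive pixel corners."""
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int


# A mask is a grid of 0/1 values, one per pixel (1 = pixel belongs to the object).
Mask = List[List[int]]


def mask_to_box(label: str, mask: Mask) -> BoxAnnotation:
    """Derive the tightest bounding box around the labeled pixels of a mask."""
    coords = [
        (x, y)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value
    ]
    if not coords:
        raise ValueError("mask contains no labeled pixels")
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return BoxAnnotation(label, min(xs), min(ys), max(xs), max(ys))


# Toy 5x6 image with one roughly diamond-shaped product.
toy_mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

box = mask_to_box("cereal_box", toy_mask)
print(box)
```

A box is quick to draw but also covers background pixels around the product. A mask is more precise and takes more effort to produce, which adds to the labeling cost described above.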

Another issue is that the majority of training data for food product recognition in retail is collected under ideal conditions rather than in real-world scenarios. Items in an actual store look quite different: they may be partly hidden, badly lit, or turned at an odd angle.

There is also the matter of subcategory recognition, which means identifying different variations of the same product. Retail items are extremely difficult to tell apart because they look so similar in shape, color, typography, and size. The biggest problem is that objects from closely related subcategories often differ only in minor visual details, and sometimes even humans struggle to distinguish them.

Nevertheless, in a world where margins are tightening and customers’ available time is shrinking, product recognition will become increasingly critical.