CoStrategix Case Studies

Using ML Models to Improve Product Catalog Data Quality

The Challenge | Improve Data Quality and Drive Customer Motivation

The nation’s largest grocery chain knows it must build relevance, loyalty, and authority with its customers in order to command customer preference. The grocery chain has responded with innovative solutions to provide fresh food for everyone – both online and in-store. But to build trust among its internal and external customers, it needs to have concrete, unimpeachable data quality.

Because this grocery chain deals with hundreds of vendors, thousands of brands, and millions of SKUs, it is neither feasible nor cost-effective to manage data quality at this scale through manual effort. The retailer approached CoStrategix for its proven experience in applying advanced analytics to improve data quality, enrich data, and make the process more efficient by reducing dependency on human data curation.

Our Approach

This grocery chain approached CoStrategix with five distinct data challenges across several stages in the information lifecycle. CoStrategix thrives on tackling challenges like these, where there isn’t a clear path to the solution and the expectation is to deliver on an ambitious set of features.

 

For each of the five challenges, CoStrategix started with discovery sessions to determine the parameters of the problem. Armed with that understanding, we conducted numerous experiments and proposed an underlying solution hypothesis, or Proof of Concept (POC). Taking an agile approach with design sprints, we tested the validity of the initial premise and zeroed in on a Minimum Viable Product (MVP).

 

No single technique fit all of the big data challenges that the grocery chain faced with its product data, so CoStrategix applied a portfolio of models and tools. CoStrategix also built monitoring and frequent calibration into the system to ensure that models run as intended in production and continue to perform as required. The robust set of tools provided by Microsoft Azure Machine Learning and other related Azure services helped CoStrategix to develop, manage, deploy, and monitor the pipelines for all of the models.
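
As an illustration of what a production monitoring check might look like, the sketch below computes the population stability index (PSI), a common drift metric that compares the distribution of a feature at training time with its distribution in production. The case study does not say which drift metric or thresholds were used, so the function name `population_stability_index`, the bin count, and the `eps` floor are all assumptions made for this example.

```python
import math


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """Compare two samples of one numeric feature; larger values mean more drift.

    `expected` is the reference (training-time) sample and `actual` is the
    production sample. Bin edges are taken from the reference sample.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values are equal

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range are clipped into the end bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a signal to investigate or recalibrate, but any alerting threshold would need to be tuned per feature.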

Results

1. Improving Quality of Nutrition Facts Data
“Nutrition Facts” are included on all processed food packaging and include information such as calories, fat, carbohydrates, and more. This retailer includes all of these facts in its data catalog – and posts them on its digital channels. Converting the text on a package into strings of data that can be read by a computer is inherently difficult and often yields messy data. Furthermore, the data comes from many different sources, including original manufacturers, wholesalers, or 3rd-party data brokers, with varying degrees of quality. Often some of the nutrition facts data are incomplete, inconsistent, or inaccurate. To tackle this challenge, CoStrategix applied multiple unsupervised ML models to detect these anomalies and ensure they were corrected.

 

In addition to the quality issues, the training data also lacked sufficient examples to identify all the patterns across all the nutrients and product categories. A single model wasn’t an option. To get around this roadblock, CoStrategix built a portfolio of models, each trained for a specific group of related inputs. For example, the macronutrients (carbohydrates, protein, fats, or calories) could be trained together in one model, the trace elements (chromium, copper, iodine, or fluoride) in another, and individual food categories (milk, fish, or cereal) in others. By limiting the variability within each sample space, CoStrategix was able to train each model to recognize patterns with fewer samples, and then combine the outputs from these models in the final solution.
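
To make the idea of scoring each group separately concrete, here is a minimal sketch that flags outliers within each product category using a modified z-score based on the median absolute deviation (MAD). This is a simple statistical stand-in chosen for illustration; the case study does not name the unsupervised models that were actually used, and the record fields (`id`, `category`, `calories`) and the 3.5 threshold are assumptions.

```python
from collections import defaultdict
from statistics import median


def flag_outliers(records, field, group_key="category", threshold=3.5):
    """Return ids of records whose `field` is a robust outlier within its own group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)

    flagged = []
    for members in groups.values():
        values = [m[field] for m in members]
        med = median(values)
        mad = median([abs(v - med) for v in values])
        if mad == 0:
            continue  # no spread in this group, so there is nothing to score against
        for m in members:
            # 0.6745 rescales the MAD so the score is comparable to a z-score.
            score = 0.6745 * abs(m[field] - med) / mad
            if score > threshold:
                flagged.append(m["id"])
    return flagged
```

Scoring within a category means a value is compared only with similar products, so a calorie count that is normal for cereal is not mistaken for an error in milk, and vice versa.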

 

2. Handling Variation in Ingredient Lists
The grocery chain wanted to assess the “healthiness” of a food based on its ingredients. To help accomplish this task, CoStrategix aimed to parse out the list of ingredients on the packaging and match each item against a master list of ingredients. To do this, CoStrategix developed a recursive parsing algorithm, combined with a greedy matching algorithm. The parsing algorithm created various numeric representations of the text that would measure the strength and quality of a prospective match.

 

The ingredient strings tended to follow a set of well-defined rules, but not without special cases. Therefore, it took some effort to unravel and encode all those patterns into rules and an acceptable algorithm. For instance, CoStrategix had to create and apply different sets of rules for synonyms (such as salt and sodium chloride), foreign-language equivalents (such as water, eau, and agua), multiple matches (such as flour, bleached white flour, and whole wheat flour), and other outliers.
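
The combination of recursive parsing and greedy matching described above can be sketched roughly as follows. This is an illustrative simplification, not the algorithm CoStrategix built: the master list, the synonym table, and the function names are hypothetical, and the numeric match-strength scoring mentioned earlier is left out.

```python
# Hypothetical master list and synonym table, for illustration only.
MASTER = {"flour", "whole wheat flour", "bleached white flour", "water", "salt", "niacin"}
SYNONYMS = {"sodium chloride": "salt", "eau": "water", "agua": "water"}


def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return [p for p in parts if p]


def parse_ingredients(text):
    """Recursively flatten an ingredient string, descending into parenthesized sub-lists."""
    flat = []
    for part in split_top_level(text):
        open_idx = part.find("(")
        if open_idx != -1 and part.endswith(")"):
            name = part[:open_idx].strip()
            if name:
                flat.append(name)
            flat.extend(parse_ingredients(part[open_idx + 1:-1]))
        else:
            flat.append(part)
    return flat


def match_ingredient(raw, master=MASTER, synonyms=SYNONYMS):
    """Greedy match: prefer the longest master entry found as whole words in the name."""
    name = " ".join(raw.lower().split())
    name = synonyms.get(name, name)
    for candidate in sorted(master, key=lambda c: (-len(c), c)):
        if f" {candidate} " in f" {name} ":
            return candidate
    return None  # no confident match; a real system would route this for review
```

Trying the longest entries first is what lets "whole wheat flour" win over the shorter "flour" when both would match.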

 

3. Supplementing Nutrition Information with 3rd Party Data
The grocery website displays nutrition data about its food so that shoppers can make informed purchase decisions. Many products sold in a typical grocery store have nutritional information printed on the package. But fresh produce and fresh-cut meats and fish are generally not packaged beforehand, and therefore have no associated nutritional information. To address this gap, CoStrategix worked with the grocer to enrich nutritional information from a trusted 3rd party data source, in this case, the U.S. Department of Agriculture.

 

The primary difficulty was to match the food products against the 3rd party products in order to ensure that the nutritional information was accurately applied for each product. CoStrategix employed a combination of text analysis algorithms, as well as supervised machine learning models, in order to establish confident matches between the food product name and the USDA product description. Not all products were represented in the 3rd party data set, so confidence thresholds were established to determine when the nutritional data should be included.
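
A minimal sketch of threshold-based matching is shown below. It covers only the text-analysis side (the supervised models are not represented), and the similarity measure, the 0.6 threshold, and the function names are assumptions chosen for illustration rather than details from the project.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Blend word overlap (Jaccard) with character-level similarity, both in [0, 1]."""
    a, b = a.lower(), b.lower()
    tokens_a = set(a.replace(",", " ").split())
    tokens_b = set(b.replace(",", " ").split())
    union = tokens_a | tokens_b
    jaccard = len(tokens_a & tokens_b) / len(union) if union else 0.0
    char_ratio = SequenceMatcher(None, a, b).ratio()
    return (jaccard + char_ratio) / 2


def best_match(product_name, reference_descriptions, threshold=0.6):
    """Return (description, score), or (None, score) when confidence is below the threshold."""
    if not reference_descriptions:
        return None, 0.0
    score, description = max(
        (similarity(product_name, d), d) for d in reference_descriptions
    )
    if score >= threshold:
        return description, score
    return None, score  # leave nutrition data blank rather than risk a wrong match
```

Returning no match below the threshold reflects the point made above: when a product is not represented in the reference data, it is safer to omit the nutritional information than to attach the wrong values.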

 

4. Recognizing and Identifying Brands
For all new products added to its Product Information Management (PIM) system, the grocery chain wanted to automatically associate the brands to its master brand list and provide a level of confidence in the match. Only matches with a low level of confidence would be kicked out for manual review. When a brand did not already exist in its master list, the grocery chain wanted to identify a group of possible matches or create a new brand in the master catalog.

 

CoStrategix created intelligent, ML-driven software to determine brands from limited textual descriptions of products and match those to existing known brands in the PIM. For new brands, it applied a combination of cutting-edge language models and domain-specific computer vision models designed to recognize logos and brands in product images.
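
The routing logic (link automatically, send to manual review with a short list of candidates, or propose a new brand) can be sketched as below. Plain string similarity stands in here for the language and computer vision models, and the thresholds, action names, and function name `route_brand` are hypothetical.

```python
from difflib import SequenceMatcher


def route_brand(candidate, master_brands, auto_threshold=0.9, review_threshold=0.6, top_n=3):
    """Decide what to do with a brand string extracted from a new product record."""
    scored = sorted(
        (
            (SequenceMatcher(None, candidate.lower(), brand.lower()).ratio(), brand)
            for brand in master_brands
        ),
        reverse=True,
    )
    # High confidence: link to the existing master brand without human involvement.
    if scored and scored[0][0] >= auto_threshold:
        return {"action": "auto_link", "brand": scored[0][1], "confidence": scored[0][0]}
    # Moderate confidence: hand a short list of possible matches to a reviewer.
    possible = [brand for score, brand in scored[:top_n] if score >= review_threshold]
    if possible:
        return {"action": "manual_review", "candidates": possible}
    # Nothing close: suggest creating a new entry in the master catalog.
    return {"action": "propose_new_brand", "brand": candidate}
```

Keeping the two thresholds separate makes it possible to tune how much work reaches human reviewers independently of how aggressively new brands are proposed.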

 

5. Fixing Problems with Image Quality or Mismatches
In a grocery store, product freshness, shelf arrangement, and aisle placement all influence a consumer’s purchase. Online, that information has to be communicated through images. Thus, the grocery chain and its suppliers wish to display multiple high-quality images for all of their products. More importantly, it is imperative that those images are associated with the right items when they are added to the grocery product catalog.

 

The quality of images received from the original source, however, varies widely. Issues range from image focus to exposure to reflections captured in the image. In addition, the grocery chain needs to screen images for adult content and other content that might confuse a shopper or discourage a purchase (e.g., visible expiration dates in the past or non-English text). As previously stated, there also must be a mechanism to catch those cases where the image presents the wrong item, or the product is indistinguishable. To manage this complexity, CoStrategix applied several deep learning approaches that helped speed time-to-value.
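
For the focus and exposure issues specifically, a classical heuristic can illustrate what such a check measures. The sketch below is not the deep learning approach described above; it is a simple stand-in that scores sharpness with the variance of a Laplacian filter and exposure with mean brightness. The input format (a list of rows of 0-255 grayscale values, at least 3x3) and all thresholds are assumptions.

```python
from statistics import mean, pvariance


def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian over interior pixels; low values suggest blur."""
    responses = []
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return pvariance(responses)


def quality_issues(gray, blur_threshold=100.0, dark=40, bright=215):
    """Return a list of suspected quality problems for one grayscale image."""
    issues = []
    if laplacian_variance(gray) < blur_threshold:
        issues.append("possibly out of focus")
    brightness = mean([pixel for row in gray for pixel in row])
    if brightness < dark:
        issues.append("possibly underexposed")
    elif brightness > bright:
        issues.append("possibly overexposed")
    return issues
```

Checks like these cannot detect a wrong or indistinguishable product, which is where learned models are needed, but they could serve as an inexpensive first filter.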

Summary

While CoStrategix started out with an experimental approach to using intelligent solutions to solve data challenges, in the end, CoStrategix also incorporated scalability, flexibility, auditing, and monitoring. The foundation for any “trusted, unimpeachable” solution for the product catalog data needs to be:

 

  1. Robust to changes over time (exceptions, uncertainty, data drift, etc.)
  2. Elastic: both scalable and flexible
  3. Easy to monitor and maintain
  4. Automation-first, but also
  5. Supported by “humans-in-the-loop” (e.g., auditing predictions and enriching training data)

 

The ongoing transformation from in-store merchandising to digital commerce is causing a data revolution. This case study covers five different data challenges. The key is that no one solution fits all. With more than 600 product attributes, there is still so much opportunity.