How to Use Machine Learning for Master Data Management

Jatin Nath, December 14th, 2020

Improving the MDM Process

Master Data Management (MDM) is a system that manages an enterprise’s official master data assets to ensure uniformity and accuracy. To create master data assets, you typically group similar records and maintain a golden copy for each group. For finding duplicates or distinct records, traditional queries are adequate. But for grouping similar records (records with variation, not necessarily duplicates), traditional queries need additional help. Machine Learning (ML) gives computers the ability to learn without being explicitly programmed, and an MDM system can leverage ML to improve the MDM process.

Machine Learning

When we type something on a mobile device, it suggests words based on what it has learned from our previous usage. Machine Learning refers to computer programs that learn from experience instead of being explicitly programmed for each task. There are 3 core types of ML:


  • Supervised learning: Supervised learning means learning through training on examples of the desired output. A user trains the system with labelled examples. In record matching, a labelled pair is a set of 2 records together with the desired output (class), such as similar or not similar. The system learns to map an input to an output from these example input-output pairs and then determines the class labels for new inputs. Common techniques used in supervised machine learning are classification and regression. Recommender systems can also be an example here, as many of them are based on supervised machine learning.
  • Unsupervised learning: Unsupervised algorithms learn from data that has not been labelled. They identify commonalities in the data and react based on the presence or absence of such commonalities in each new piece of data. Unsupervised algorithms are useful for learning structure in the data. In this case, we do not have any desired output, but want to group our data. Clustering and anomaly detection commonly fall under unsupervised learning, and some neural networks (such as autoencoders) are also trained without labels. As an example, think of a digital photo gallery with thousands of images that we want to sort into 10 groups.
  • Reinforcement learning: Reinforcement learning is all about making decisions sequentially. The output depends on the state of the current input, and the next input depends on the output of the previous step. The agent/application learns to achieve a goal in an uncertain, potentially complex environment. Reinforcement learning algorithms are used in autonomous vehicles and in game development, where the computer learns to play a game against a human opponent. Recommender systems can also be an example of reinforcement learning.
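To make the supervised case concrete, here is a minimal sketch of learning from labelled pairs. It assumes a deliberately simple "model": a single similarity cut-off chosen from the training examples, with string similarity computed by Python's standard `difflib`. The names `similarity`, `train_threshold`, and `predict`, as well as the sample pairs, are hypothetical and only illustrate the idea; a production system would typically use a richer classifier and more features.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1]; 1.0 means identical strings (ignoring case)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def train_threshold(labelled_pairs):
    """Pick the similarity cut-off that classifies the most training pairs correctly.

    Candidate cut-offs are the midpoints between consecutive observed scores.
    """
    scored = [(similarity(a, b), label) for a, b, label in labelled_pairs]
    scores = sorted({score for score, _ in scored})
    candidates = [(lo + hi) / 2 for lo, hi in zip(scores, scores[1:])]

    def correct(threshold):
        return sum((score >= threshold) == label for score, label in scored)

    return max(candidates, key=correct)


# Hypothetical training data: (record A, record B, "are they similar?")
labelled_pairs = [
    ("John Smith", "Jon Smith", True),
    ("Maria Garcia", "Maria Garcia", True),
    ("John Smith", "Jane Doe", False),
    ("Maria Garcia", "Mark Gray", False),
]

threshold = train_threshold(labelled_pairs)


def predict(a: str, b: str) -> bool:
    """Classify an unseen pair using the learned cut-off."""
    return similarity(a, b) >= threshold


print(predict("Jonathan Smith", "Jonathon Smith"))
print(predict("John Smith", "Alice Wong"))
```

The point of the sketch is the workflow rather than the model: labelled pairs go in, a decision rule comes out, and that rule is then applied to pairs the system has not seen before.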

Master Data Management

Let’s use airline industry data as an example. Passengers book tickets through various sources (agents, friends, multiple online accounts, etc.). As a result, we will have multiple records for the same passenger, sometimes with variations in name, address and other details.


The airline would like to group the similar records and maintain a master copy of the record for each passenger, which may be used for personalized marketing, reward programs, customer retention efforts, etc.

Traditional queries can’t fully meet the requirements because of the variations in data (not duplicates, but similar data), hence we need to leverage ML to create the customer master copy. 
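As a small illustration of that gap, the sketch below compares exact de-duplication (roughly what a `SELECT DISTINCT` does) with a similarity check. The passenger strings, the `similar` helper, and the 0.8 cut-off are all assumptions made up for this example, and `difflib` stands in for whatever similarity measure a real system would use.

```python
from difflib import SequenceMatcher

# Hypothetical booking records for one passenger.
records = [
    "Robert Johnson, 12 Oak Street",
    "Robert Johnson, 12 Oak Street",  # exact duplicate
    "Robert Jonson, 12 Oak St.",      # same passenger, with variations
]

# Exact matching removes only the true duplicate.
distinct = set(records)


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two strings as similar when their character overlap is high enough."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


print(len(distinct))                     # the variant survives de-duplication
print(records[0] == records[2])          # exact comparison misses it
print(similar(records[0], records[2]))   # similarity comparison catches it
```

A fixed cut-off like this is brittle across different kinds of data, which is where a model trained on labelled pairs becomes useful.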


Zaloni’s DataOps platform, Arena, offers built-in data mastering that is powered by supervised machine learning techniques. It performs the steps below to create and maintain a golden copy of your important data.

  1. Training: An interactive interface lets users teach the system with training inputs (labelled pairs). A labelled pair is 2 records with a label that indicates whether they are similar or not.
  2. Pair formation: The system forms pairs of the input records in an optimized way.
  3. Matching: The trained system marks each of the input pairs as similar or not.
  4. Grouping: The system then groups the similar records to form clusters.
  5. Master record formation: The MDM system uses column-level rules (e.g. longest “name”, latest “phone number”, etc.) to form the master record for each group of similar records (cluster).
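The steps after training can be sketched in a few lines of Python. This is not Arena's implementation; it is an illustrative sketch under several assumptions: the record fields (`id`, `name`, `phone`, `updated`) are invented, pair formation uses a simple blocking key (first letter of the name), the "trained" matcher is replaced by a fixed similarity cut-off, and grouping uses a union-find structure so that matches are combined transitively.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical passenger records; field names are assumptions for illustration.
records = [
    {"id": 1, "name": "Robert Johnson", "phone": "555-0101", "updated": "2020-01-10"},
    {"id": 2, "name": "Robert A. Johnson", "phone": "555-0199", "updated": "2020-06-02"},
    {"id": 3, "name": "Rob Johnson", "phone": "555-0101", "updated": "2019-11-23"},
    {"id": 4, "name": "Maria Garcia", "phone": "555-0142", "updated": "2020-03-15"},
]


def form_pairs(recs):
    """Pair formation: compare only records sharing a blocking key
    (here, the first letter of the name) instead of every possible pair."""
    blocks = defaultdict(list)
    for rec in recs:
        blocks[rec["name"][0].lower()].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)


def is_match(a, b, threshold=0.75):
    """Matching: a fixed cut-off standing in for a trained classifier."""
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= threshold


def group(recs, matched_pairs):
    """Grouping: merge matched records into clusters with union-find."""
    parent = {rec["id"]: rec["id"] for rec in recs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a["id"])] = find(b["id"])

    clusters = defaultdict(list)
    for rec in recs:
        clusters[find(rec["id"])].append(rec)
    return list(clusters.values())


def master_record(cluster):
    """Master record formation: longest name, phone from the latest update."""
    return {
        "name": max((rec["name"] for rec in cluster), key=len),
        "phone": max(cluster, key=lambda rec: rec["updated"])["phone"],
    }


pairs = list(form_pairs(records))
matches = [(a, b) for a, b in pairs if is_match(a, b)]
clusters = group(records, matches)
masters = [master_record(cluster) for cluster in clusters]

for master in masters:
    print(master)
```

Blocking keeps the number of comparisons manageable on large inputs, at the cost of missing matches whose blocking keys differ; a real system would choose keys (or several passes) with that trade-off in mind.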



Traditional queries are effective for finding distinct records or grouping duplicates, but to find similarities or group similar records you should consider leveraging machine learning. Arena’s agile data mastering can help your organization easily and accurately master data.

To dive deeper into Arena’s data mastering capability and some of the common use cases, read our technical white paper: Arena Data Mastering for Golden Record Creation

If you are ready to take on the next step and learn more about the importance of Machine Learning Data Catalog functionality for your organization, download a complimentary copy of the Now Tech: Machine Learning Data Catalogs Q4, 2020 report.


About the Author

Jatin Nath is a Technical Manager at Zaloni and a coding expert in multiple languages and both front-end and back-end technologies. For 10+ years at Zaloni, he has provided deep technical expertise and become a trusted advisor, especially in the area of machine learning and data mastering.