How do you create a Golden Record?
Companies today are swamped with large quantities of poor quality data. You have to manage this so as to avoid the risk of negatively impacting your turnover and credibility. Establishing Golden Records for your data is not easy – otherwise every company would have one. To get a Golden Record, data must be matched, cleaned and consolidated. Without using a Master Data platform, this is an endless task because, once your data has finally been cleaned, it will already start becoming out of date. If you use a more technical approach to cleaning, one challenge is that the system often becomes too rigid and does not allow for flexibility with respect to how data entries should be combined.
A smarter way to perform these tasks is to use a Master Data Management platform like CluedIn. CluedIn is designed as a more dynamic approach, which somewhat turns the Golden Record concept on its head. Instead of determining which data and source is most correct and then constructing an algorithm, CluedIn uses a more statistical and automatic approach.
First and foremost, it is about matching data. As mentioned in our blog on data integration and modelling, you choose one or more unique references for your data. The references use CluedIn to find other data with the same unique references so that the data entries can be combined. For example, if you chose the CVR (company registration) number as a unique reference for business customers and a certain CVR number appears 6 times in your data (typically across sources) – then it is a match, and they can be consolidated into one customer rather than 6 individual customers. And this is how the journey towards a Golden Record begins.
Automatically merge, choose yourself or do both
Duplicates often have different data attributes – either they are input incorrectly at different times or they have not been updated with newer data. When data entries are combined, considerations must therefore be made as to which attributes are the most correct, i.e. the “winning” attributes, which will be part of your Global Record. The starting point is that CluedIn compares data across sources and considers 3 factors: the most recent date of creation or update, “trust levels” for the individual sources or attributes and, finally, the data’s accuracy measurement. However, it is rare that it gets this far since one of the previous factors usually exists. If you are nonetheless not satisfied with the “winning attribute” chosen by CluedIn, you have the option to make corrections later.
However, not all data entries have the same reference keys – or perhaps they entirely lack the unique references you have chosen. The next step is therefore to use fuzzy merging to reduce the quantity of manual work. Fuzzy merging is a merging of data that is nearly identical – i.e. the values are very alike but not 100% identical. This could include different spellings, spelling errors or different formats. Using fuzzy merging on selected fields where it is likely to be a match allows you to localise additional possible duplicate data entries. An example of a match that is not 100% but where fuzzy matching will likely find a near match is firstname.lastname@example.org and email@example.com. These are largely identical, but without being 100% so.
Some of the data fields are free text fields which can contain variations of data, so it could make sense to use fuzzy merging logic here. Among other things, you could look for names that are not quite identical – for example, one data entry could include a middle name or the name could be spelled incorrectly in another entry. It’s also possible that the telephone number appears with the country code in one place and without it in another place. You can choose the specific percentage match. When CluedIn finds a match for the fuzzy references you have selected, it combines the duplicate entries, and you are one step closer to your Golden Record.
Fuzzy merging is a good supplement to tracking down duplicates, but it is important that the rules and the data content are continuously evaluated. Otherwise, you could easily start merging data that should not be merged. An example of an attribute where excessively aggressive fuzzy matching can take place is first names. Two first names can be so identical that the algorithm may regard them as being nearly identical if not considered in relation to other data attributes in your golden record. For example, if Tim and Tom appear alone in a fuzzy match, it is likely that they will be merged and considered as one. Some may quickly conclude that they are not the same, but the algorithm is no smarter than we make it, so therefore, it is important to have as many attributes in play as possible in a fuzzy match.
Following a unique match on reference keys and a given fuzzy match, you may still have a set of duplicates that should be handled manually – possibly without your knowledge. Therefore, lists of possible duplicate entries are a good help in identifying these. It’s about identifying individual attributes and possible connections in data which you think can be identical data sets. The selected attributes and rules from the fuzzy logic can be a good starting point, particularly if you have chosen a less aggressive approach. You may have left out names in your fuzzy logic entirely; however, as an example, if the same customer appears in the system several times, it would be nice to have a list of the possible duplicates. This does not mean that all customers with the same name ARE duplicates, but it does provide an overview that you can work with further. As standard, CluedIn offers a duplicate list of names, however you can compose additional data queries to meet your needs. For example, name + address or name + town + country code. Based on the list of duplicates, you can manually combine the entries that you deem to be a match. The goal here is to end up with as little manual merging as possible, but if you are not sure which fuzzy matching rules best fit your data, you could make several different lists of duplicates and test the manual merging first.
Cleaning and enriching data
One way to increase the probability of being able to identify and merge duplicates with fuzzy matching is to clean the data for errors and deficiencies as well as to enrich it with additional attributes. Errors could be spelling mistakes and different formats that result in the data not matching. Deficiencies could be a middle name or a post code on addresses etc. With the CluedIn Clean tool, you can easily identify and clean your data of errors and deficiencies so that the values are identical and are identified by the fuzzy logic and merged with the now more accurate attributes.
Deficient data can also make it a challenge to match data, and this is particularly the case if it is the unique reference keys that are missing for the data entries. Missing data can either be input into CluedIn clean or retrieved from external sources, websites, public data banks, etc. You can either retrieve data using a standard connector to, for example, the Central Business Register, Dawa or the Central National Register or one that is adapted to the third-party supplier you need.
CluedIn also offers good insight into the quality of data. In the next blog post on data quality measurements, we will return to how you can gain this overview.
The illustration below shows the merge steps that data goes through in CluedIn so as to arrive at a Golden Record.