What other challenge do data-driven enterprises face, in addition to the hurdle of poor or dirty data? The answer: finding where their data, both structured and unstructured, is stored.
Corporations that rely on copious amounts of data run into this problem sooner rather than later. Where exactly is the particular data that is needed? Where does it reside? How does a team member access it?
In the last "big data" decade, enterprises relying on traditional data warehouses woke up to the "benefits" of Hadoop. After all, the former were unable to handle complex hierarchical data types and other unstructured data. But of late, the debate is back to square one: will Hadoop eventually replace data warehouses or not?
Some in the industry say Hadoop may not necessarily be the answer to all big data problems. For the last two years or so, a "new" trend has emerged: instead of centralizing data in HDFS clusters meant to fulfill the data needs of the entire enterprise, corporations are back to building systems that handle specific data storage, processing, and analytics tasks. Called a data catalog, this is a way of recording the databases across an enterprise, adding a description (metadata) to each, and so on. Rather than having to "find" the relevant data in an enterprise-wide Hadoop implementation, such catalogs let users quickly locate the source of the information they want.
Gartner, for one, defines a data catalog as a tool that "creates and maintains an inventory of data assets through the discovery, description, and organization of distributed datasets."
What is metadata in this process? It is information that identifies data and helps locate it, but that cannot be found in the data itself. Think of metadata as keywords that point to the location of data kept in a repository or catalog.
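To make this concrete, here is a minimal sketch of the idea in Python. The field names and the `find_datasets` helper are purely illustrative, not taken from any real catalog product: each record holds metadata (name, location, owner, tags) about a dataset without holding the data itself, and the catalog is searched by keyword.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    """Describes and locates a dataset without containing the data itself."""
    name: str          # human-readable dataset name
    location: str      # where the data actually resides (table, path, etc.)
    owner: str         # team or person responsible for the dataset
    description: str   # what the dataset contains
    tags: list = field(default_factory=list)  # keywords used to search the catalog

# A toy in-memory "catalog" of metadata records (hypothetical entries).
catalog = [
    DatasetMetadata("patient_visits", "warehouse.clinical.visits",
                    "clinical-data-team", "Outpatient visit records",
                    ["healthcare", "visits"]),
    DatasetMetadata("web_clicks", "hdfs://logs/clickstream",
                    "marketing", "Raw clickstream logs", ["web", "logs"]),
]

def find_datasets(keyword):
    """Return datasets whose tags or description mention the keyword."""
    keyword = keyword.lower()
    return [m for m in catalog
            if keyword in [t.lower() for t in m.tags]
            or keyword in m.description.lower()]

print([m.name for m in find_datasets("healthcare")])  # -> ['patient_visits']
```

The point of the sketch is that a search hits only the metadata, which is small and centralized, while the data itself can stay wherever it lives.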
Do catalogs worsen the data silo problem?
There are no sure answers to this. In some studies, a majority of respondents claim that deploying a data catalog made it even harder to find the data about their data. Others claim the complete opposite: data catalogs, they say, cut down the difficulty of finding data, and they would rather have their data residing in silos than face the gargantuan task of hunting it down.
The biggest advantage, say users of data catalogs, is that they "deliver context to the corporation's data". That means you can not only find the data about your data but also understand the meaning of the data you are about to use (or not use). Data catalogs are also of great help in two areas: data governance and data management. They are of great benefit to organizations working in heavily regulated fields such as healthcare.
Some of the other advantages of a data catalog are:
- to capture and manage technical, operational, or BI metadata
- to provide collaboration capabilities that enable the capture of additional user-provided or social metadata
- to help establish data lineage
Lineage is becoming a priority as enterprises build knowledge about their data. Nobody wants chaos around their data while building insights: who collected it, what transformations were applied to it, and so on. This is another area where data cataloging helps.
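The lineage idea above can be sketched in a few lines. Again, the record fields and dataset names here are hypothetical, not a real catalog API: each record notes who produced a dataset, which upstream datasets it came from, and what transformations were applied, so any dataset can be traced back to its sources.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LineageRecord:
    """Records where a dataset came from and how it was transformed."""
    dataset: str
    collected_by: str
    sources: List[str] = field(default_factory=list)          # upstream datasets
    transformations: List[str] = field(default_factory=list)  # steps applied

# Toy lineage graph: clean_visits is derived from raw_visits.
lineage = {
    "raw_visits": LineageRecord("raw_visits", "clinical-data-team"),
    "clean_visits": LineageRecord(
        "clean_visits", "analytics-team",
        sources=["raw_visits"],
        transformations=["dropped duplicate rows", "normalized date formats"],
    ),
}

def trace(dataset):
    """Walk upstream through the lineage graph, yielding each ancestor."""
    for src in lineage[dataset].sources:
        yield src
        yield from trace(src)

print(list(trace("clean_visits")))  # -> ['raw_visits']
```

With records like these in the catalog, the "who collected it, what was done to it" questions become lookups rather than detective work.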
Now, here’s the thing. Any data catalog is only as good as the relevance of the information it keeps, which means it needs to be constantly updated in a rapidly shifting work environment. This is where enterprises have started applying artificial intelligence (AI) and machine learning (ML) to enhance data lineage.
MLDCs, or machine learning data catalogs, are the new "in" thing in data cataloging. A Forrester report titled "Machine Learning Data Catalogs Put The Entire Business In Full View" says companies that utilize MLDCs are more than twice as effective at democratizing the use of data and enabling self-service.
We will look at this in part 2 of this blog post.
Image by homestead1997 from Pixabay