Data profiling examines an existing data source and collects statistics and summaries that describe its data.
This activity helps you:
- a. Find out the possible uses of the data.
- b. Tag data accurately for discoverability.
- c. Know what to expect in terms of quality.
- d. Know the expectations and challenges of joining the data with other data sources.
- e. Understand the data challenges for Master Data Management and Data Governance, and improve data quality.
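To make point (d) concrete, here is a minimal sketch of a join-readiness check on two key columns. The function name `join_key_profile` and the sample column names are hypothetical, chosen only for illustration. It assumes the keys are already loaded into plain Python lists, with `None` standing in for missing values.

```python
from collections import Counter


def join_key_profile(left_keys, right_keys):
    """Summarize how well the left key column would join to the right one."""
    # Null keys never match in a join, so count them separately.
    left_non_null = [k for k in left_keys if k is not None]
    right_counts = Counter(k for k in right_keys if k is not None)

    # A left row "matches" if its key appears at least once on the right.
    matched = sum(1 for k in left_non_null if k in right_counts)

    return {
        "left_rows": len(left_keys),
        "left_null_keys": len(left_keys) - len(left_non_null),
        "left_match_rate": matched / len(left_non_null) if left_non_null else 0.0,
        # Duplicate keys on the right side would fan out rows after the join.
        "right_duplicate_keys": sum(1 for c in right_counts.values() if c > 1),
    }


# Hypothetical sample data: order rows referencing customer ids.
orders_customer_id = [101, 102, 102, 105, None]
customers_id = [101, 102, 103, 103]

result = join_key_profile(orders_customer_id, customers_id)
print(result)
```

A low match rate or a non-zero duplicate count is the kind of finding you want before designing the join, not after loading the target tables.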
When to Profile?
If you are convinced of the value of profiling data, you need to time it well within your development cycle to get the best results. A few recommendations:
– Profile several times during the requirements and development phases of the DWH (data warehouse) project.
– Always profile right after identifying a new source of data.
– Profile more intensely to support dimensional data modeling.
– Profile intensely in the staging areas and in the target tables once they are populated.
How are you doing it?
Current approaches are mainly manual; depending on what you are comfortable with, you may:
– use Python/R data frames to compute the data profile metrics.
– use custom ETL to compute the values for your data.
– use your Excel skills to get such summaries from data that is small enough to fit in Excel.
– use specialized tools to get the profile values, store them for future reference, compare them across data sources, and track how the profile changes over time.
The last approach has the benefit of giving you more than just the basic profile information about the data.
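For a sense of what the basic profile metrics look like when computed by hand in Python, here is a minimal sketch. It uses only the standard library and treats rows as a list of dictionaries rather than a data frame so that it runs anywhere. The helper names `profile_column` and `profile_rows` and the sample rows are assumptions made for illustration.

```python
from collections import Counter
import statistics


def profile_column(values):
    """Compute basic profile metrics for one column (None means missing)."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    profile = {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(counts),
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }
    # Add numeric summaries only when every non-null value is a number.
    numeric = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if numeric and len(numeric) == len(non_null):
        profile.update(
            min=min(numeric),
            max=max(numeric),
            mean=statistics.mean(numeric),
        )
    return profile


def profile_rows(rows):
    """Profile every column found in a list of row dictionaries."""
    columns = sorted({col for row in rows for col in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Hypothetical sample rows standing in for a small source table.
rows = [
    {"customer_id": 1, "country": "DE", "amount": 10.0},
    {"customer_id": 2, "country": "DE", "amount": 30.0},
    {"customer_id": 3, "country": None, "amount": 20.0},
]

profiles = profile_rows(rows)
for column, metrics in profiles.items():
    print(column, metrics)
```

A sketch like this covers counts, nulls, distinct values, and simple numeric ranges. Storing the results and comparing them later is where the manual approach starts to need more plumbing.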
DataQ’s profiler gives you basic and advanced data profile information, along with the ability to compare data profiles across data sources and across time. This helps you streamline data requirements, support decision making during development, and align with your organization’s data governance goals.
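To illustrate the idea of comparing profiles across time, here is a small generic sketch. It is not DataQ’s API. The function `compare_profiles`, the 5% tolerance, and the sample metric values are all hypothetical. It assumes each profile is a flat dictionary of metric names to values.

```python
def compare_profiles(old, new, tolerance=0.05):
    """Return the metrics whose value changed by more than the tolerance.

    Numeric metrics are compared by relative change; anything else
    (or a numeric metric whose old value is 0) is reported on any change.
    """
    changes = {}
    for metric in sorted(set(old) | set(new)):
        before, after = old.get(metric), new.get(metric)
        if before == after:
            continue
        both_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in (before, after)
        )
        if both_numeric and before != 0:
            if abs(after - before) / abs(before) <= tolerance:
                continue  # small fluctuation, not worth flagging
        changes[metric] = (before, after)
    return changes


# Hypothetical profiles of the same column captured a week apart.
last_week = {"count": 1000, "null_count": 10, "distinct_count": 950}
today = {"count": 1020, "null_count": 60, "distinct_count": 955}

drift = compare_profiles(last_week, today)
print(drift)
```

Here the row count and distinct count moved by well under 5%, so only the jump in nulls is flagged. The same comparison works between two data sources if both are profiled with the same metrics.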