Data profiling examines the data in an existing data source and collects statistics or summaries about it.
This activity helps in:
When to Profile?
If you are convinced of the value of profiling data, you need to time it well within your development cycle to get the best results. A few recommendations:
– Profile several times during the requirements and development phases of the data warehouse (DWH) project.
– Profile every data source as soon as it is identified.
– Profile more intensely to support dimensional data modeling.
– Profile intensely in the staging areas and in the target tables once they are populated.
How are you doing it?
Current approaches are mainly manual; depending on what you are comfortable with, you may:
– use Python/R data frames to compute the data profile metrics.
– use custom ETL to compute the values for your data.
– use your Excel skills to build such summaries, provided the data is small enough to fit in Excel.
– use specialized tools to get the profile values, store them for future reference, compare them with different data sources, and track the profile changes with time.
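As an illustration of the data frame approach, here is a minimal sketch using pandas. The `profile` function, the choice of metrics, and the `customers` sample data are assumptions made for this example; they are not taken from any particular tool.

```python
import pandas as pd


def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row of basic profile metrics per column of df."""
    rows = []
    for name in df.columns:
        col = df[name]
        numeric = pd.api.types.is_numeric_dtype(col)
        rows.append({
            "column": name,
            "dtype": str(col.dtype),
            "rows": len(col),
            "nulls": int(col.isna().sum()),
            "distinct": int(col.nunique()),  # nunique() ignores nulls
            # Range and mean are only computed for numeric columns.
            "min": col.min() if numeric else None,
            "max": col.max() if numeric else None,
            "mean": col.mean() if numeric else None,
        })
    return pd.DataFrame(rows).set_index("column")


# Hypothetical sample data standing in for a real source table.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "country": ["DE", "US", None, "US"],
    "order_total": [120.0, 75.5, None, 300.0],
})

summary = profile(customers)
print(summary)
```

For a quick first look, pandas' built-in `DataFrame.describe()` gives similar numeric summaries; a hand-written function like this one is mainly useful when you want the same metrics for every column in a form you can store and compare later.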
The last approach, using a specialized tool, gives you more than just the basic profile information.
DataQ’s profiler gives you both basic and advanced data profile information. It also lets you compare data profiles across data sources and over time, which helps you streamline data requirements, make decisions during development, and align with your organization’s data governance goals.
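The idea of comparing profiles over time can be illustrated independently of any tool. The sketch below is not DataQ’s implementation; the `compare_profiles` function, the nested-dictionary layout of a stored profile, and the 10% tolerance are all assumptions chosen for the example.

```python
def compare_profiles(old: dict, new: dict, tolerance: float = 0.1) -> list:
    """Return (column, metric, before, after) tuples for notable changes.

    Each profile maps column name -> {metric name -> numeric value}.
    A metric is reported when its relative change exceeds `tolerance`.
    Metrics that exist only in `new` are not checked in this sketch.
    """
    changes = []
    for column in sorted(set(old) | set(new)):
        if column not in old or column not in new:
            # The column was added or dropped between the two profiles.
            changes.append((column, "present", column in old, column in new))
            continue
        for metric, before in old[column].items():
            after = new[column].get(metric)
            if after is None:
                changes.append((column, metric, before, None))
            elif before == 0:
                if after != 0:
                    changes.append((column, metric, before, after))
            elif abs(after - before) / abs(before) > tolerance:
                changes.append((column, metric, before, after))
    return changes


# Hypothetical stored profiles of the same table, one week apart.
last_week = {
    "country": {"nulls": 1, "distinct": 2},
    "order_total": {"nulls": 1, "mean": 165.0},
}
this_week = {
    "country": {"nulls": 40, "distinct": 2},
    "order_total": {"nulls": 1, "mean": 170.0},
    "email": {"nulls": 0},
}

changes = compare_profiles(last_week, this_week)
for change in changes:
    print(change)
```

The same function can compare two different sources instead of two points in time, as long as both profiles use the same column and metric names.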