Data profiling ("DP") is the practice of deriving metadata about a given data attribute; the metadata produced can range from the simple to the relatively complex.
DP is often one of the first steps performed when building the extract-transform-load ("ETL") back-room subsystem of a data warehouse. As such, data profiling often provides the first highly focused, objective look at the quality of data BEFORE it goes through the ETL process. At a minimum, the metadata produced by data profiling should include, for each data attribute (field in a file or column in a table):
1. Based on a fixed sample of data, how often is the field blank, empty, NULL or missing, and what percentage of records does this represent?
2. How many distinct values does the attribute contain over the entire data population, and what is the frequency of occurrence of each value?
3. If the attribute has a fixed domain of unique values, do all of its non-blank values fall within that domain? (Example: if gender may only have a value of "F" or "M", does it contain any "W", "U" or " "? Are there dates such as 01/01/0001 or 99/99/9999?) These can be signs of archaic "smart" data elements assigned "special" meanings by applications, and they may require special handling before being loaded into a more strictly typed data representation that requires higher-quality data.
4. For each distinct value of a given field/column, what is its frequency of occurrence in the sample data set? This is the equivalent of running the following SQL query for each column in a relational table:
   SELECT <column_name_here>, COUNT(*) FROM <table_name_here> GROUP BY <column_name_here>
5. Is the value of this column unique for all records or rows? This answers the question, "Does the count of unique values for this column equal the number of rows in the sample dataset?" Unique values may often serve as primary keys in relational, transactional systems or as business keys in dimensional data warehouses.
6. What data type does this field/column contain? If numeric, how many decimal places? Is it a date? If a character value, what are the minimum and maximum lengths found?
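The checks above can be sketched in a few lines of code. The following is a minimal, hedged illustration, not a production profiler: the "customer" table, its "gender" column, and the expected domain {"F", "M"} are all hypothetical sample data invented for the example, and an in-memory SQLite database stands in for a real source system.

```python
import sqlite3

# Hypothetical sample data: a "customer" table with a "gender" column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, gender TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [(1, "F"), (2, "M"), (3, "M"), (4, None), (5, "U")])

def profile_column(conn, table, column):
    """Derive the basic profiling metadata described in steps 1-6 above."""
    cur = conn.cursor()
    total = cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    # Step 1: how often is the field blank, empty or NULL?
    nulls = cur.execute(
        f"SELECT COUNT(*) FROM {table} "
        f"WHERE {column} IS NULL OR TRIM({column}) = ''"
    ).fetchone()[0]
    # Steps 2 and 4: distinct values and frequency of each
    # (SELECT col, COUNT(*) FROM tbl GROUP BY col).
    freq = dict(cur.execute(
        f"SELECT {column}, COUNT(*) FROM {table} GROUP BY {column}"
    ).fetchall())
    non_null_values = [v for v in freq if v is not None]
    lengths = [len(str(v)) for v in non_null_values]
    return {
        "rows": total,
        "null_or_blank_pct": 100.0 * nulls / total,
        "distinct_values": len(non_null_values),
        "value_frequencies": freq,
        # Step 5: distinct count == row count suggests a candidate key.
        "is_unique": len(freq) == total and nulls == 0,
        # Step 6: min/max character length found.
        "min_len": min(lengths) if lengths else 0,
        "max_len": max(lengths) if lengths else 0,
    }

stats = profile_column(conn, "customer", "gender")

# Step 3: flag values that fall outside the expected fixed domain.
domain = {"F", "M"}
out_of_domain = [v for v in stats["value_frequencies"]
                 if v is not None and v not in domain]
```

On this sample, the profile reports a 20% NULL/blank rate, three distinct non-null values ("F", "M", "U"), and flags "U" as outside the expected domain, exactly the kind of early warning data profiling is meant to surface before ETL design begins.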
From this basic set of data profiling metadata, one can begin to get a feel for the quality, integrity and uniqueness of each data attribute. This helps data warehouse architects (ETL, BI, BA) see where remedial efforts may be required in source systems, and where strategies for data suspense and repair will need to be accommodated in the design of the subsequent data warehouse ETL and publishing processes.
Good data quality requires designing, executing and committing to a practical, common-sense, sustainable, comprehensive data quality management plan supported by business and technical staff alike.
FOR DATA WAREHOUSE BUILDERS: Continuous data profiling should be part of every data quality plan for a data warehouse. Profile the data BEFORE you build the data warehouse. ALWAYS.