Smart questions
Smart answers
Smart people
INTELLIGENT WORK FORUMS
FOR COMPUTER PROFESSIONALS

Member Login




Remember Me
Forgot Password?
Join Us!

Come Join Us!

Are you a
Computer / IT professional?
Join Tek-Tips now!
  • Talk With Other Members
  • Be Notified Of Responses
    To Your Posts
  • Keyword Search
  • One-Click Access To Your
    Favorite Forums
  • Automated Signatures
    On Your Posts
  • Best Of All, It's Free!

Join Tek-Tips
*Tek-Tips's functionality depends on members receiving e-mail. By joining you are opting in to receive e-mail.

Donate Today!

Do you enjoy these
technical forums?
Donate Today! Click Here

Posting Guidelines

Promoting, selling, recruiting, coursework and thesis posting is forbidden.
Jobs from Indeed

Link To This Forum!

Partner Button
Add Stickiness To Your Site By Linking To This Professionally Managed Technical Forum.
Just copy and paste the
code below into your site.

Design

What is data profiling?
Posted: 1 May 07 (Edited 14 Jan 08)

Please check out my *FREEWARE* data profiling toolkit at: http://www.ipcdesigns.com/data_profiling/index.html  No registration required.  Please be sure to leave a comment.

Data profiling ("DP") is a methodology of deriving metadata about a given data attribute which can range from the simplistic to the relatively complex.  

DP is often one of the first steps performed when beginning to build the extract-transform-load ("ETL") subsystem back room process of a data warehouse.  As such data profiling often provides the first highly-focused, objective look at the quality of data BEFORE it goes through the ETL process.  The types of metadata produced by data profiling should include at a minimum for each data attribute (field in a file or column in a table):

1. Based on a fixed sample of data, how often is the field blank, empty, NULL or missing and what is this percentage?

2. How many different unique values does the attribute contain over the entire data population and what is the frequency of occurrence of each value?

3. If the data has a fixed domain of unique values, are all of the places where it has a value, represented in the fixed domain of values?  (Example: if gender may only have a value of "F" or "M", does it have any "W", "U" or " "?  Are there dates which have 01/01/0001 or 99/99/9999?  These can be signs of archaic "smart" data elements assigned to attributes to have "special" meanings to applications and may require special handling before being loaded into a more strictly typed data representation which requires higher quality data.

4. For each unique set of values for a given field / column, what is the frequency of occurrence of each value in the sample data set?  This is the equivalent of writing the following SQL query for each column in a relational table: SELECT <column_name_here>, count(*) FROM <table_name_here> GROUP BY <column_name_here>

5. Is the value of this column unique for all records or rows?  This answers the question, "Does the count of unique values for this column equal the number of rows in the sample dataset?" Unique values may often serve as primary keys in relational, transactional systems or as business keys in dimensional data warehouses.  
 
6. What is the data type to be contained in this field/column?  Numeric, w/how many decimal places?; Date?; character value, what is the min/max length found?

From this basic set of data profiling metadata one can begin to get a feel for the quality, integrity and uniqueness of each data attribute.  This can help data warehouse architects (ETL, BI, BA) get a view into the areas where remedial efforts may be required in source systems and strategies for data suspense and repair will need to be accomodated in the design of the subsequent data warehouse ETL processes and publishing.  

Good data quality requires the design, execution and committment to a pratical, common-sense, sustainable, comprehensive data quality management plan supported by both the business and technical staff alike.  

FOR DATA WAREHOUSE BUILDERS
Continous data profiling should be a part of all data quality plans for a data warehouse.  
Profile the data BEFORE you build the data warehouse. ALWAYS!!!

Good luck,
-Don McMunn



Back to Data warehousing general discussion FAQ Index
Back to Data warehousing general discussion Forum

My Archive

Close Box

Join Tek-Tips® Today!

Join your peers on the Internet's largest technical computer professional community.
It's easy to join and it's free.

Here's Why Members Love Tek-Tips Forums:

Register now while it's still free!

Already a member? Close this window and log in.

Join Us             Close