INTELLIGENT WORK FORUMS
FOR COMPUTER PROFESSIONALS
Come Join Us!
Are you a
Computer / IT professional?
Join Tek-Tips now!
- Talk With Other Members
- Be Notified Of Responses
To Your Posts
- Keyword Search
- One-Click Access To Your
- Automated Signatures
On Your Posts
- Best Of All, It's Free!
*Tek-Tips's functionality depends on members receiving e-mail. By joining you are opting in to receive e-mail.
Partner With Us!
"Best Of Breed" Forums Add Stickiness To Your Site
(Download This Button Today!)
"...Just a quick note to say, "THANKS!" for these forums...The site is very well layed out and easy to use. Thanks for bringing us together - we need each other."
Where in the world do Tek-Tips members come from?
What is data profiling?
Posted: 1 May 07 (Edited 14 Jan 08)
Please check out my *FREEWARE* data profiling toolkit at: http://www.ipcdesigns.com/data_profiling/index.html No registration required. Please be sure to leave a comment.
Data profiling ("DP") is a methodology of deriving metadata about a given data attribute which can range from the simplistic to the relatively complex.
DP is often one of the first steps performed when beginning to build the extract-transform-load ("ETL") subsystem back room process of a data warehouse. As such data profiling often provides the first highly-focused, objective look at the quality of data BEFORE it goes through the ETL process. The types of metadata produced by data profiling should include at a minimum for each data attribute (field in a file or column in a table):
1. Based on a fixed sample of data, how often is the field blank, empty, NULL or missing and what is this percentage?
2. How many different unique values does the attribute contain over the entire data population and what is the frequency of occurrence of each value?
3. If the data has a fixed domain of unique values, are all of the places where it has a value, represented in the fixed domain of values? (Example: if gender may only have a value of "F" or "M", does it have any "W", "U" or " "? Are there dates which have 01/01/0001 or 99/99/9999? These can be signs of archaic "smart" data elements assigned to attributes to have "special" meanings to applications and may require special handling before being loaded into a more strictly typed data representation which requires higher quality data.
4. For each unique set of values for a given field / column, what is the frequency of occurrence of each value in the sample data set? This is the equivalent of writing the following SQL query for each column in a relational table: SELECT <column_name_here>, count(*) FROM <table_name_here> GROUP BY <column_name_here>
5. Is the value of this column unique for all records or rows? This answers the question, "Does the count of unique values for this column equal the number of rows in the sample dataset?" Unique values may often serve as primary keys in relational, transactional systems or as business keys in dimensional data warehouses.
6. What is the data type to be contained in this field/column? Numeric, w/how many decimal places?; Date?; character value, what is the min/max length found?
From this basic set of data profiling metadata one can begin to get a feel for the quality, integrity and uniqueness of each data attribute. This can help data warehouse architects (ETL, BI, BA) get a view into the areas where remedial efforts may be required in source systems and strategies for data suspense and repair will need to be accomodated in the design of the subsequent data warehouse ETL processes and publishing.
Good data quality requires the design, execution and committment to a pratical, common-sense, sustainable, comprehensive data quality management plan supported by both the business and technical staff alike.
FOR DATA WAREHOUSE BUILDERS
Continous data profiling should be a part of all data quality plans for a data warehouse.
Profile the data BEFORE you build the data warehouse. ALWAYS!!!
Back to Data warehousing general discussion FAQ Index
Back to Data warehousing general discussion Forum
Join Tek-Tips® Today!
Join your peers on the Internet's largest technical computer professional community.
It's easy to join and it's free.
Here's Why Members Love Tek-Tips Forums:
- Talk To Other Members
- Notification Of Responses To Questions
- Favorite Forums One Click Access
- Keyword Search Of All Posts, And More...
Register now while it's still free!
Already a member? Close this window and log in.
Join Us Close