×
INTELLIGENT WORK FORUMS
FOR COMPUTER PROFESSIONALS

Log In

Come Join Us!

Are you a
Computer / IT professional?
Join Tek-Tips Forums!
  • Talk With Other Members
  • Be Notified Of Responses
    To Your Posts
  • Keyword Search
  • One-Click Access To Your
    Favorite Forums
  • Automated Signatures
    On Your Posts
  • Best Of All, It's Free!
  • Students Click Here

*Tek-Tips's functionality depends on members receiving e-mail. By joining you are opting in to receive e-mail.

Posting Guidelines

Promoting, selling, recruiting, coursework and thesis posting is forbidden.

Students Click Here

Jobs

need solution for IIel load late arriving dimension

need solution for IIel load late arriving dimension

need solution for IIel load late arriving dimension

(OP)
Hello,

Q. We have implemented a data lake (PDM based on IBM Banking DW) on a Netezza appliance and we use DataStage as the ETL tool.

I need to design multiple ETL flows for data (Events: opening of an account and transactions: pay at the grocery store, recieve ur salary etc…) coming from the same source system which would run every 20 mins and load several different tables and a common table (columns: surrogate key, source system id, unique id in source system =account number).

The trouble lies in

- The ETL flow must be able to run in parallel
- When they run in parallel, they could load the same reference row in the common table.As the insert of a value by one load won’t be seen by another session until that batch insert completes and implicitly commits. So if 2 loads load the same account not existing in the target at the same time, we will have dups.


We have "one commit principle" at our site due to which we can’t load the account_target during the processing of the ETL flow. It can be loaded only as the last step in the ETL flow. Thus, the issue….

I suggested using an intermediate table which we commit/write to during the execution of the job by using the netezza connector’s brute force solution https://www-01.ibm.com/support/knowledgecenter/SSZ...

and then using this intermediate table to load the facts/relationship tables.

So, with the brute force method, I ensure that only one of the conflicting/potential duplicate finds it’s way to the DB and the rest of the jobs which build the relationship tables/sub-types utilize the surrogate key which made it to the intermediate table.

However, my solution was turned down by the architect :( and i am now looking for alternates…..

by experience do you know one apart from

- Having a separate flow for the common table/lookup
- serializing the ETL flows….

Thanks

RE: need solution for IIel load late arriving dimension

You can load the fact record with a "Late Arriving/Unknown" value for the Dimension. Then, at some later time, maybe the next day, or when you receive notification that the dimension job has completed, run a separate "Late Arriving" process. That process will go across the fact table looking for rows that have the value of "Late Arriving" and replace it (UPDATE the record) with the proper dimension value.

==================================
adaptive uber info galaxies (bigger, better, faster, and more adept than agile innovative cognitive big data clouds)


RE: need solution for IIel load late arriving dimension

(OP)
Thanks for the response. However, the decision to apply the design pattern for late arriving dimension is already taken (create the reference/lookup/super type records for late arriving dimension using the fact (Events/Transactions....) records).

RE: need solution for IIel load late arriving dimension

I see. Due to the massively parallel of Netezza, you probably need to load all the dimensions before the facts (if practical). If not, you need to load as many as you can, then load the fact, then do the Late Arriving Cleanup process. Another method i have used is to let the duplicates happen, then have a Duplicate Resolution process to cleanup.

==================================
adaptive uber info galaxies (bigger, better, faster, and more adept than agile innovative cognitive big data clouds)


RE: need solution for IIel load late arriving dimension

(OP)
As the cleanup based on the duplication resolution process is expensive, we want to avoid it in the first place.....

RE: need solution for IIel load late arriving dimension

If you want to avoid the rework, then design the process so that no rework is involved.

This is a situation that arises in Project Management quite often. You have time/schedule, scope/quality, and budget/cost. These form the "iron triangle" of Project Management. If you want to reduce one leg of the triangle, one or both of the others will grow. I don't see how this situation is any different. If you want to save time, you're going to increase scope (duplicate processing) or increase computing power (cost). Discuss this among your team and decide what's most important: time, quality, or money. Or perhaps some intermediate compromise.

==================================
adaptive uber info galaxies (bigger, better, faster, and more adept than agile innovative cognitive big data clouds)


RE: need solution for IIel load late arriving dimension

(OP)
I totally agree :)

Red Flag This Post

Please let us know here why this post is inappropriate. Reasons such as off-topic, duplicates, flames, illegal, vulgar, or students posting their homework.

Red Flag Submitted

Thank you for helping keep Tek-Tips Forums free from inappropriate posts.
The Tek-Tips staff will check this out and take appropriate action.

Reply To This Thread

Posting in the Tek-Tips forums is a member-only feature.

Click Here to join Tek-Tips and talk with other members!

Close Box

Join Tek-Tips® Today!

Join your peers on the Internet's largest technical computer professional community.
It's easy to join and it's free.

Here's Why Members Love Tek-Tips Forums:

Register now while it's still free!

Already a member? Close this window and log in.

Join Us             Close