
Compare 2 datasets side by side

KUZZ (Technical User)
Hi all!

I have a table (tblBOM) that contains bill-of-materials information for a bunch of subassemblies.

The BOM table looks something like this (simplified version):
parentPtnr / childPtnr / qty_ofChild_perParent

I need to be able to compare programmatically all pairwise combinations of subassembly bills-of-materials and identify the % probability of a bill-of-materials match.

The % probability match could be a function of the childPtnr count per parent subassembly and of the quantity of each child per parent.

Any help/suggestions would be greatly appreciated.

Cheers!


The output table would look something like:
parentPtnr A / parentPtnr B / probability match Z%
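For reference, a minimal sketch of the two tables as Jet DDL run from VBA. The output-table name (tblBOMCompare), the TEXT sizes, and the numeric types are assumptions for illustration; only tblBOM and its three fields come from the description above.

'Sketch only: tblBOM as described, plus an assumed table for the pairwise results.
Sub CreateBOMTables()
    CurrentDb.Execute "CREATE TABLE tblBOM (" & _
        "parentPtnr TEXT(50), childPtnr TEXT(50), qty_ofChild_perParent LONG)", dbFailOnError
    CurrentDb.Execute "CREATE TABLE tblBOMCompare (" & _
        "parentPtnrA TEXT(50), parentPtnrB TEXT(50), probabilityMatch DOUBLE)", dbFailOnError
End Sub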

Good luck,
Kuzz

"Time spent debating the impossible subtracts from the time during which you
can try to accomplish it."
 
Can you explain the fields and how they are used? Normally when I do this it looks like
partNumber quantity parentPartNumber
Then if I had this data

A 1
AA 1 A
AB 2 A
AC 1 A
AAA 1 AA
ABA 1 AB
ABB 2 AB
ABC 1 AB

This would define a relationship
A
|--AA
|  |--AAA
|--AB (2)
|  |--ABA
|  |--ABB (2)
|  |--ABC
|--AC

This is a self-referencing table where each part references its parent part.
That may be what you are doing, but I cannot tell. You list the parent first, which for me is normally the foreign key, so my guess is you are not using the same type of self-referencing relationship.

parentPtnr / childPtnr / qty_ofChild_perParent


So if I had this
A
|--AA
|  |--AAA
|--AB (2)
|  |--ABA
|  |--ABB (2)
|  |--ABC
|--AC

and I want to compare to another assembly "System 2"
A
|--AA
|  |--AAA
|--AB (2)
|  |--ABB (2)
|  |--ABX
|  |--ABY
|--AD

You could do lots of calculations, but I am not sure what you are looking for.
Are there any industry standards in your business for doing this?

I have components at different levels, and I assume I would weight these somehow. In System 2 I have an additional level-2 component "AD", which I think I would give more weight than the swapped "ABX" and "ABY".
Also, a swapped component should be weighted less than a dropped component. Without really knowing what you are doing, it would be very hard to come up with a model. At a minimum I would think you want to identify
components in system 1 not in system 2, by level
components in system 2 not in system 1, by level
(a query sketch for this follows below)
But these are all guesses without knowing your business model. Any additional details would be helpful.
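A sketch of the first of those checks as an Access query built in VBA. It assumes a flattened, per-system component list, tblFlatBOM(systemPtnr, componentPtnr, bomLevel); that table is not from the thread, but it matches the flat BOM Kuzz describes in his reply below.

'Components of System 1 that do not appear in System 2, listed by level (sketch only).
Dim strSQL As String
strSQL = "SELECT componentPtnr, bomLevel FROM tblFlatBOM " & _
         "WHERE systemPtnr = 'System1' " & _
         "AND componentPtnr NOT IN " & _
         "(SELECT componentPtnr FROM tblFlatBOM WHERE systemPtnr = 'System2') " & _
         "ORDER BY bomLevel"
'Swap the two system names to get components of System 2 that are not in System 1.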
 
MajP, your understanding of my situation so far is basically correct. I have the self-referencing tblBOM that I can loop through to see how many assembly levels I have and which child components are part of which level.

From here I have created a flattened BOM that lists all child components and indicates which assembly level each component is at.

Assuming that I have, per your example, 2 subassemblies (System 1 and System 2) and I wanted to compute the % probability of a match between the two, I could run various checksums and compare the variance of the checksums.

I have 300 BOMs that I need to compare to one another, and I need to identify the % probability that any one BOM is a match with another. Once I identify high % probabilities (the top X%), I will do a human-eye comparison to verify that the 2 subassemblies really are very similar. The end goal is to find close substitutes when a specific subassembly is not in stock.
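For anyone following along, here is one way such a flattened BOM could be built with a recursive walk of tblBOM. This is a sketch only; Kuzz's actual flattening code may differ, and the tblFlatBOM table and its field names are the same assumed names used in the query sketch above.

'Flattens tblBOM(parentPtnr, childPtnr, qty_ofChild_perParent) into
'tblFlatBOM(systemPtnr, componentPtnr, bomLevel, extendedQty).  Sketch only.
Public Sub FlattenBOM(ByVal systemPtnr As String, ByVal currentParent As String, _
                      ByVal level As Long, ByVal qtySoFar As Double)
    Dim db As DAO.Database
    Dim rs As DAO.Recordset
    Set db = CurrentDb
    Set rs = db.OpenRecordset("SELECT childPtnr, qty_ofChild_perParent FROM tblBOM " & _
                              "WHERE parentPtnr='" & currentParent & "'")
    Do While Not rs.EOF
        'Record this child against the top-level system, with its level and rolled-up quantity
        db.Execute "INSERT INTO tblFlatBOM (systemPtnr, componentPtnr, bomLevel, extendedQty) " & _
                   "VALUES ('" & systemPtnr & "','" & rs!childPtnr & "'," & level & "," & _
                   qtySoFar * rs!qty_ofChild_perParent & ")", dbFailOnError
        'Recurse to pick up the next level down
        FlattenBOM systemPtnr, rs!childPtnr, level + 1, qtySoFar * rs!qty_ofChild_perParent
        rs.MoveNext
    Loop
    rs.Close
End Sub

'Usage: FlattenBOM "A", "A", 1, 1   'flattens top-level assembly A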

Good luck,
Kuzz

"Time spent debating the impossible subtracts from the time during which you
can try to accomplish it."
 
It sounds like you have most of this done, so I am not sure which part you have a question about. My first guess would be something like this: I would build a table to write my results to, basically storing metadata on my comparisons.

TblCompResults
systemOneID
systemTwoID
systemOneComponentCount
systemTwoComponentCount
countSystemOneComponentsInSystemTwo
countSystemTwoComponentsInSystemOne
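As a sketch, that table could be created from VBA like this (text part-number IDs are an assumption; match the type and length to your actual key):

'Sketch only; run once.
CurrentDb.Execute "CREATE TABLE TblCompResults (" & _
    "systemOneID TEXT(50), systemTwoID TEXT(50), " & _
    "systemOneComponentCount LONG, systemTwoComponentCount LONG, " & _
    "countSystemOneComponentsInSystemTwo LONG, " & _
    "countSystemTwoComponentsInSystemOne LONG)", dbFailOnError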

So now loop the systems. I believe you have to do
n * (n - 1) / 2 comparisons, which for 300 BOMs is 300 * 299 / 2 = 44,850.

So nest two loops running several queries based on your "flat" BOMs to get the count metrics. Write the counts to the results table. Run the code and go to bed. Once done you can do several comparisons: % similar components 1 in 2, % similar 2 in 1, % dissimilar 1 in 2, ...

Now this may make sense if the basic structure is the same but different components are used. However, certain components might deserve a weight: two computers with the same motherboard/chip but different peripherals are probably more similar than two with different motherboards/chips but similar peripherals.
 
So my code would be something like this (DAO; a sketch that assumes the flattened BOM is the tblFlatBOM table sketched earlier, with fields systemPtnr and componentPtnr, and text part numbers):

Sub CompareAllBOMs()
    Dim db As DAO.Database
    Dim rsSystems1 As DAO.Recordset
    Dim rsSystems2 As DAO.Recordset
    Dim sys1ID As String            'use your ID field's data type
    Dim sys2ID As String
    Dim sys1ComponentCount As Long
    Dim sys2ComponentCount As Long
    Dim sys1in2Count As Long
    Dim sys2in1Count As Long

    Set db = CurrentDb
    'Distinct list of systems (top-level subassemblies) in the flat BOM
    Set rsSystems1 = db.OpenRecordset("SELECT DISTINCT systemPtnr FROM tblFlatBOM")

    Do While Not rsSystems1.EOF
        sys1ID = rsSystems1!systemPtnr
        'Only pair sys1 with systems that sort after it, so each pair is compared once
        Set rsSystems2 = db.OpenRecordset( _
            "SELECT DISTINCT systemPtnr FROM tblFlatBOM WHERE systemPtnr > '" & sys1ID & "'")
        Do While Not rsSystems2.EOF
            sys2ID = rsSystems2!systemPtnr
            'Component counts per system from the flat BOM
            sys1ComponentCount = DCount("componentPtnr", "tblFlatBOM", "systemPtnr='" & sys1ID & "'")
            sys2ComponentCount = DCount("componentPtnr", "tblFlatBOM", "systemPtnr='" & sys2ID & "'")
            'Components of one system that also appear in the other
            sys1in2Count = DCount("componentPtnr", "tblFlatBOM", "systemPtnr='" & sys1ID & _
                "' AND componentPtnr IN (SELECT componentPtnr FROM tblFlatBOM WHERE systemPtnr='" & sys2ID & "')")
            sys2in1Count = DCount("componentPtnr", "tblFlatBOM", "systemPtnr='" & sys2ID & _
                "' AND componentPtnr IN (SELECT componentPtnr FROM tblFlatBOM WHERE systemPtnr='" & sys1ID & "')")
            'Write these values to the output table
            db.Execute "INSERT INTO TblCompResults (systemOneID, systemTwoID, " & _
                "systemOneComponentCount, systemTwoComponentCount, " & _
                "countSystemOneComponentsInSystemTwo, countSystemTwoComponentsInSystemOne) " & _
                "VALUES ('" & sys1ID & "','" & sys2ID & "'," & sys1ComponentCount & "," & _
                sys2ComponentCount & "," & sys1in2Count & "," & sys2in1Count & ")", dbFailOnError
            rsSystems2.MoveNext
        Loop
        rsSystems1.MoveNext
    Loop
End Sub


So you would have a table of all comparisons with fields for
system1ID
system2ID
number of components in system 1
number of components in system 2
number of similar components of 1 in 2
number of similar components of 2 in 1
Then you could do some additional calculations:
number of dissimilar components 1 in 2
number of dissimilar components 2 in 1
% similar 1 in 2: number similar 1 in 2 / total components in 1
% similar 2 in 1: number similar 2 in 1 / total components in 2
etc.
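Those follow-up calculations can come straight out of the results table; here is a sketch of the query, built in VBA, using only the fields listed above.

'Derived metrics from TblCompResults (sketch only).
Dim strSQL As String
strSQL = "SELECT systemOneID, systemTwoID, " & _
         "systemOneComponentCount - countSystemOneComponentsInSystemTwo AS dissimilar1in2, " & _
         "systemTwoComponentCount - countSystemTwoComponentsInSystemOne AS dissimilar2in1, " & _
         "countSystemOneComponentsInSystemTwo / systemOneComponentCount AS pctSimilar1in2, " & _
         "countSystemTwoComponentsInSystemOne / systemTwoComponentCount AS pctSimilar2in1 " & _
         "FROM TblCompResults"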

I would also weight all component types ahead of time. Let's say I am building a desktop computer. I would assign an importance weight from 0 to 1 (it could be any scale):
Mouse .1
Harddrive .4
motherboard .8
monitor .8
keyboard .2

Then you could do something like

sum of (each similar component * its weight) / sum of (all components in system x * its weight)

Example sys1, sys2, and sys3
Sys1 has 3 similar components to both 2 and 3, but in my mind clearly system 1 and 2 are more alike.

Sys1 has a similar motherboard, hard drive, and monitor to 2.
Sys1 has a similar keyboard, mouse, and hard drive to 3.

1v2 = (.4+.8+.8) / (.4+.8+.8 +.2 + .1)
= 2.0 / 2.3 = 87%
1v3 = (.1 + .2 + .4) /(.4+.8+.8 +.2 + .1)
= .7/2.3 = 30%
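If the weights are kept in a table, that same weighted calculation can be automated. A sketch, assuming a table tblComponentWeight(componentPtnr, weight) (not from the thread), the tblFlatBOM table sketched earlier, and sys1ID/sys2ID being the two systems from the comparison loop:

'Weighted similarity of system 1's components that also appear in system 2,
'as a fraction of system 1's total weight (sketch only).
Dim num As Double, den As Double
den = Nz(DSum("weight", "tblComponentWeight", _
      "componentPtnr IN (SELECT componentPtnr FROM tblFlatBOM WHERE systemPtnr='" & sys1ID & "')"), 0)
num = Nz(DSum("weight", "tblComponentWeight", _
      "componentPtnr IN (SELECT componentPtnr FROM tblFlatBOM WHERE systemPtnr='" & sys1ID & "') " & _
      "AND componentPtnr IN (SELECT componentPtnr FROM tblFlatBOM WHERE systemPtnr='" & sys2ID & "')"), 0)
If den > 0 Then Debug.Print "Weighted similarity 1 in 2: " & Format(num / den, "0%")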

Mo Better solution in my mind.
 
You could probably refine this even further.

That model should give an OK sense of similarity if all swap-outs for each type of component are basically equal.
The previous model would show differences in hard drives as pretty important to overall similarity, but a given swap could actually make little or a big difference.

Assume you build desktops with a choice of 4 hard drives:
1 BrandX 250GB
2 BrandX 500GB
3 BrandY 500GB
4 BrandY 1TB

A swap of 2 and 3 might be meaningless (if cost, warranty, etc. are the same); a swap of 1 and 4 is a big deal. You could then use a replacement-similarity matrix (my term):

       1     2     3     4
1      -    .5    .5    .1
2            -    .9    .5
3                  -    .6
4                        -

So 1 and 4 are not similar (.1), 2 and 3 are (.9)

Theoretically this sounds good, but building this would be much harder. You would have to have a swap table showing each component, each possible swap, and the similarity value.
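A sketch of what that swap table and lookup could look like (the table and field names here are assumptions, not anything from the thread):

'Assumed swap table: tblSwapSimilarity(componentPtnrA, componentPtnrB, similarity),
'storing each unordered pair once.  A lookup helper that checks both orderings:
Public Function SwapSimilarity(ByVal partA As String, ByVal partB As String) As Double
    Dim v As Variant
    If partA = partB Then
        SwapSimilarity = 1          'identical component
        Exit Function
    End If
    v = DLookup("similarity", "tblSwapSimilarity", _
        "(componentPtnrA='" & partA & "' AND componentPtnrB='" & partB & "') OR " & _
        "(componentPtnrA='" & partB & "' AND componentPtnrB='" & partA & "')")
    SwapSimilarity = Nz(v, 0)       'no entry means treat the parts as dissimilar
End Function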
 
BTW, although I pulled this out of ..., there is actually logical science behind this.

The first approach is a simple Jaccard index / Sørensen index.
My second approach is a weighted Jaccard index.
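For reference, the plain Jaccard index (intersection over union) can be computed straight from the counts already written to TblCompResults by the comparison loop above:

'Jaccard index = size of intersection / size of union, using the loop's variables.
'sys1in2Count is the intersection; the union is both totals minus the intersection.
jaccard = sys1in2Count / (sys1ComponentCount + sys2ComponentCount - sys1in2Count)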

Looking at the Hamming distance leads to maybe an even better model. You would compare system 1 to system 2 and begin making corrections to 2 until it is the same as 1. For each swap, add, or delete you would have a weight (penalty) value, then calculate the total cost. (Sounds like a great thesis topic: "Relational Database Modeling to Calculate Hamming Distance of Similarity for Subassembly BOMs".)
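One way to write that cost down (the count and penalty names here are placeholders, not anything defined in the thread):

'Total "edit cost" to turn system 2 into system 1; a lower cost means more similar.
'addCount/dropCount/swapCount would come from comparing the two flat BOMs;
'the penalty values are whatever your business says they should be.
totalCost = addCount * addPenalty + dropCount * dropPenalty + swapCount * swapPenalty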
 