Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations TouchToneTommy on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Need Genius Help - Finding multiple duplicates between columns - EXCEL 1

Kim296

MIS
Aug 24, 2012
116
US
I really need some help from a genius. I'm working on a dynamic excel spreadsheet. There are over 1,500 columns with list data below each one. Does anyone know how I can dynamically find two or more like values that appear in any of the 1,500 columns. I don't just want to highlight duplicates. I need to determine if more than two (same values) appear in any of the columns independently of each other. I've attached a sample of the column row values. Any direction in the right way to calculate this using a formula is appreciated. I've tried alot of different ideas over two weeks, but I am coming up empty.
 

Attachments

  • screen_shot4.jpg
    screen_shot4.jpg
    54.8 KB · Views: 12
I don't know if you ever worked with power query. Since a couple of recent Excel versions it is a built-in com add-in. A single query is a sequence of transformations that are recorded in power query desktop. The steps can be edited, it is also possible to write query without recording. The full M code in M language (as below) is displayed in advanced editor. All code below is a result of recording with small clean up.

I started with converting data to structured table and named it 'tData'.
The first query started as query from table, with some future steps. It is named 'tData' and remained as connection only. The M code:
Code:
let
    Source = Excel.CurrentWorkbook(){[Name="tData"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}, {"Column2", type text}, {"Column3", type text}}),
    #"Added Index" = Table.AddIndexColumn(#"Changed Type", "Index", 1, 1, Int64.Type),
    #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Added Index", {"Index"}, "Column", "Code")
in
    #"Unpivoted Other Columns"

The second query, 'qAggregate' is displayed below the table with source data. M code:
Code:
let
    Source = tData,
    #"Grouped Rows" = Table.Group(Source, {"Code"}, {{"Count", each Table.RowCount(_), Int64.Type}}),
    #"Filtered Rows" = Table.SelectRows(#"Grouped Rows", each ([Count] <> 1))
in
    #"Filtered Rows"

The final four columns query M code:
Code:
let
    Source = qAggregate,
    #"Merged Queries" = Table.NestedJoin(Source, {"Code"}, tData, {"Code"}, "tData", JoinKind.LeftOuter),
    #"Expanded {0}" = Table.ExpandTableColumn(#"Merged Queries", "tData", {"Index", "Column"}, {"Index", "Column"})
in
    #"Expanded {0}"
 
BTW, it is a result of the queries with your data:
View attachment 2350
I definitely appreciate your time and response.

I originally uploaded a few columns of my actual data in a screen shot, but it was confusing to the people trying to understand my question. So, I replaced it with sample data; hence, abc, bcd, etc..

After looking your suggestion in comparison to my worksheet, this approach will not work. In my worksheet, every code (within all the columns) are already narrowed down to duplicate values. There are no unique values in my worksheet to filter out.

I'm sorry that I can't explain this better. Thank you.
 
I don't know if you ever worked with power query. Since a couple of recent Excel versions it is a built-in com add-in. A single query is a sequence of transformations that are recorded in power query desktop. The steps can be edited, it is also possible to write query without recording. The full M code in M language (as below) is displayed in advanced editor. All code below is a result of recording with small clean up.

I started with converting data to structured table and named it 'tData'.
The first query started as query from table, with some future steps. It is named 'tData' and remained as connection only. The M code:
Code:
let
    Source = Excel.CurrentWorkbook(){[Name="tData"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}, {"Column2", type text}, {"Column3", type text}}),
    #"Added Index" = Table.AddIndexColumn(#"Changed Type", "Index", 1, 1, Int64.Type),
    #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Added Index", {"Index"}, "Column", "Code")
in
    #"Unpivoted Other Columns"

The second query, 'qAggregate' is displayed below the table with source data. M code:
Code:
let
    Source = tData,
    #"Grouped Rows" = Table.Group(Source, {"Code"}, {{"Count", each Table.RowCount(_), Int64.Type}}),
    #"Filtered Rows" = Table.SelectRows(#"Grouped Rows", each ([Count] <> 1))
in
    #"Filtered Rows"

The final four columns query M code:
Code:
let
    Source = qAggregate,
    #"Merged Queries" = Table.NestedJoin(Source, {"Code"}, tData, {"Code"}, "tData", JoinKind.LeftOuter),
    #"Expanded {0}" = Table.ExpandTableColumn(#"Merged Queries", "tData", {"Index", "Column"}, {"Index", "Column"})
in
    #"Expanded {0}"
I'm not familiar with Power query. This looks foreign to me, but I love a challenge. I will take this and try to learn what you are doing. Thank you so much for your time spent trying to help me find a solution. I really appreciate you!!
 
I guess my question (with possible solution) is ignored :(
No, I did reply. I would love to hear your suggestion on VBA. I haven't see the response. I will go back and look again. I'm sorry. I value everyone's time and responses.
 
OK, here is what I've got, assuming:
1. there are no Empty cells in any columns that separate data (columns can have different number of data, but once empty cell is detected, no data below for that column)
2. Data starts in Row 2 (Row 1 is a Header row)

Code:
Option Explicit

Sub Kim296()
Dim iCol1 As Integer
Dim iCol2 As Integer
Dim iRow1 As Integer
Dim iRow2 As Integer
Dim bFirstIsFound As Boolean
Dim iFoundRow1 As Integer
Dim iFoundRow2 As Integer

'Application.ScreenUpdating = False

For iCol1 = 1 To ActiveSheet.UsedRange.Columns.Count - 1
    For iCol2 = iCol1 + 1 To ActiveSheet.UsedRange.Columns.Count
        iRow1 = 2
        bFirstIsFound = False
        Do While Cells(iRow1, iCol1).Value <> ""
            iRow2 = 2
            Do While Cells(iRow2, iCol2).Value <> ""
                If Cells(iRow1, iCol1).Value = Cells(iRow2, iCol2).Value Then
                    If Not bFirstIsFound Then
                        iFoundRow1 = iRow1
                        iFoundRow2 = iRow2
                        bFirstIsFound = True
                    Else
                        Cells(iFoundRow1, iCol1).Interior.Color = vbYellow
                        Cells(iFoundRow2, iCol2).Interior.Color = vbYellow
                    
                        Cells(iRow1, iCol1).Interior.Color = vbYellow
                        Cells(iRow2, iCol2).Interior.Color = vbYellow
                    End If
                End If
                iRow2 = iRow2 + 1
            Loop
            iRow1 = iRow1 + 1
        Loop
    Next iCol2
Next iCol1

'Application.ScreenUpdating = False

End Sub

If you have a LOT of data, uncomment ScreenUpdating in 2 places, that should speed up this code.
 
OK, here is what I've got, assuming:
1. there are no Empty cells in any columns that separate data (columns can have different number of data, but once empty cell is detected, no data below for that column)
2. Data starts in Row 2 (Row 1 is a Header row)

Code:
Option Explicit

Sub Kim296()
Dim iCol1 As Integer
Dim iCol2 As Integer
Dim iRow1 As Integer
Dim iRow2 As Integer
Dim bFirstIsFound As Boolean
Dim iFoundRow1 As Integer
Dim iFoundRow2 As Integer

'Application.ScreenUpdating = False

For iCol1 = 1 To ActiveSheet.UsedRange.Columns.Count - 1
    For iCol2 = iCol1 + 1 To ActiveSheet.UsedRange.Columns.Count
        iRow1 = 2
        bFirstIsFound = False
        Do While Cells(iRow1, iCol1).Value <> ""
            iRow2 = 2
            Do While Cells(iRow2, iCol2).Value <> ""
                If Cells(iRow1, iCol1).Value = Cells(iRow2, iCol2).Value Then
                    If Not bFirstIsFound Then
                        iFoundRow1 = iRow1
                        iFoundRow2 = iRow2
                        bFirstIsFound = True
                    Else
                        Cells(iFoundRow1, iCol1).Interior.Color = vbYellow
                        Cells(iFoundRow2, iCol2).Interior.Color = vbYellow
                   
                        Cells(iRow1, iCol1).Interior.Color = vbYellow
                        Cells(iRow2, iCol2).Interior.Color = vbYellow
                    End If
                End If
                iRow2 = iRow2 + 1
            Loop
            iRow1 = iRow1 + 1
        Loop
    Next iCol2
Next iCol1

'Application.ScreenUpdating = False

End Sub

If you have a LOT of data, uncomment ScreenUpdating in 2 places, that should speed up this code.
Thank you very much for your time and suggestion. I am not very versed on VBA; I'm learning. I will try your suggestion today and see if I can figure it out. It looks pretty short and simple. Thank you again!
 
If you don't want to use VBA, you can try Python in Excel:

python_in_excel.png

Python:
# read excel data into data frame
df_inp = xl("A1:C6",headers=True)
columns = df_inp.columns.to_list()
my_list = []
for col in columns:
  my_list = my_list + df_inp[col].to_list()
# create dictionary: counting all non-empty elements
count = {}
for elem in my_list:
  if elem != None:
    if elem in count:
      count[elem] += 1
    else:
      count[elem] = 1
# remove all non-duplicates from dictionary
for key in list(count.keys())[:]:
  if count[key] == 1:
    count.pop(key, None)
# create resulting data frame
df_out = pd.DataFrame.from_dict({"data": list(count.keys()), "occurs": list(count.values())})
df_out.index += 1
# output data frame to excel
df_out
 
@mikrom

Instead of building a dictionary, try a set.

Use itertools to generate all the subsets.

Test for presence of each subset in each column.
 
Hi mintjulep,

Yes, thanks for the tip, I'll look at itertools, which I haven't used yet.

But the main purpose of my post was to point out that today we can use Python in Excel alongside VBA. This is a relatively new topic for me, so I tried it and posted a working example for solving OP's issue.
 
Hi mikrom,

I'm surprised to hear your claim of itertools ignorance, since you once helped me with it.

Unfortunately, my version of Excel doesn't seem eligible for python.

Without disparaging your solution, which is cool, your understanding of the OP's problem seems different from mine.

I understand:
"For all possible subsets in each column identify if the subset exists in any other column"

As an outline, I see something like:
Code:
For each source_column in columns
    For each subset in source_column
        For each test_column in columns
            if test_column != source_column
                if subset is in test_column
                    print (f'{subset} found in {test_column})
 
Hi mintjulep,
Then yes, I've used itertools before, but it was a long time ago, and I've already forgotten about it. There are many modules in Python, and if you don't use them regularly, you'll forget...

 
Here it is in raw Python.

Python:
import pandas as pd
import itertools as itt

data = {'column1': ['ABC', 'BCD', 'EFG', 'HU', 'KLM'],
        'column2': ['BCD', 'EFG', None, None, None],
        'column3': ['KLM', 'MNO', 'EFG', None, None]}

df = pd.DataFrame (data)
print (df)
for source_column in df.columns:
    source_column_set = (set(df[source_column].values))
    source_column_set.discard(None)
    for i in range(2, len(source_column_set) + 1):
        for i in itertools.combinations(source_column_set, i):
            subset = set(i)
            for test_column in df.columns:
                    if df[test_column].name != df[source_column].name:
                        test_column_set = (set(df[test_column].values))
                        test_column_set.discard(None)
                        if subset.issubset(test_column_set):
                            print (f'{subset} from {df[source_column].name} exists in {df[test_column].name}')

Which produces this output:
Code:
  column1 column2 column3
0     ABC     BCD     KLM
1     BCD     EFG     MNO
2     EFG    None     EFG
3      HU    None    None
4     KLM    None    None
{'KLM', 'EFG'} from column1 exists in column3
{'BCD', 'EFG'} from column1 exists in column2
{'BCD', 'EFG'} from column2 exists in column1
{'KLM', 'EFG'} from column3 exists in column1
f you don't use them regularly, you'll forget...

Yes. That's my motivation for the above.
 
I want to thank everyone for your suggestions. These are all excellent options for me to try. I appreciate everyone's time and expertise. Unfortunately, I feel as if EXCEL might not be a good option for this project as the file is becoming to large and running really slow. I may have to look at other alternative platforms.
 
If you don't want to use VBA, you can try Python in Excel:

View attachment 2358

Python:
# read excel data into data frame
df_inp = xl("A1:C6",headers=True)
columns = df_inp.columns.to_list()
my_list = []
for col in columns:
  my_list = my_list + df_inp[col].to_list()
# create dictionary: counting all non-empty elements
count = {}
for elem in my_list:
  if elem != None:
    if elem in count:
      count[elem] += 1
    else:
      count[elem] = 1
# remove all non-duplicates from dictionary
for key in list(count.keys())[:]:
  if count[key] == 1:
    count.pop(key, None)
# create resulting data frame
df_out = pd.DataFrame.from_dict({"data": list(count.keys()), "occurs": list(count.values())})
df_out.index += 1
# output data frame to excel
df_out
Wow, I didn't know this was a thing. Thank you so much for your response. Definitely above my knowledge, but I would love to try to learn this method.
 
Here it is in raw Python.

Python:
import pandas as pd
import itertools as itt

data = {'column1': ['ABC', 'BCD', 'EFG', 'HU', 'KLM'],
        'column2': ['BCD', 'EFG', None, None, None],
        'column3': ['KLM', 'MNO', 'EFG', None, None]}

df = pd.DataFrame (data)
print (df)
for source_column in df.columns:
    source_column_set = (set(df[source_column].values))
    source_column_set.discard(None)
    for i in range(2, len(source_column_set) + 1):
        for i in itertools.combinations(source_column_set, i):
            subset = set(i)
            for test_column in df.columns:
                    if df[test_column].name != df[source_column].name:
                        test_column_set = (set(df[test_column].values))
                        test_column_set.discard(None)
                        if subset.issubset(test_column_set):
                            print (f'{subset} from {df[source_column].name} exists in {df[test_column].name}')

Which produces this output:
Code:
  column1 column2 column3
0     ABC     BCD     KLM
1     BCD     EFG     MNO
2     EFG    None     EFG
3      HU    None    None
4     KLM    None    None
{'KLM', 'EFG'} from column1 exists in column3
{'BCD', 'EFG'} from column1 exists in column2
{'BCD', 'EFG'} from column2 exists in column1
{'KLM', 'EFG'} from column3 exists in column1

Yes. That's my motivation for the above.
Thank you so much. This is way above my knowledge, but I would try to learn this method. I appreciate your time.
 
I don't know if you ever worked with power query. Since a couple of recent Excel versions it is a built-in com add-in. A single query is a sequence of transformations that are recorded in power query desktop. The steps can be edited, it is also possible to write query without recording. The full M code in M language (as below) is displayed in advanced editor. All code below is a result of recording with small clean up.

I started with converting data to structured table and named it 'tData'.
The first query started as query from table, with some future steps. It is named 'tData' and remained as connection only. The M code:
Code:
let
    Source = Excel.CurrentWorkbook(){[Name="tData"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}, {"Column2", type text}, {"Column3", type text}}),
    #"Added Index" = Table.AddIndexColumn(#"Changed Type", "Index", 1, 1, Int64.Type),
    #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Added Index", {"Index"}, "Column", "Code")
in
    #"Unpivoted Other Columns"

The second query, 'qAggregate' is displayed below the table with source data. M code:
Code:
let
    Source = tData,
    #"Grouped Rows" = Table.Group(Source, {"Code"}, {{"Count", each Table.RowCount(_), Int64.Type}}),
    #"Filtered Rows" = Table.SelectRows(#"Grouped Rows", each ([Count] <> 1))
in
    #"Filtered Rows"

The final four columns query M code:
Code:
let
    Source = qAggregate,
    #"Merged Queries" = Table.NestedJoin(Source, {"Code"}, tData, {"Code"}, "tData", JoinKind.LeftOuter),
    #"Expanded {0}" = Table.ExpandTableColumn(#"Merged Queries", "tData", {"Index", "Column"}, {"Index", "Column"})
in
    #"Expanded {0}"
I have not worked with power query. Thank you so much for your time and suggestion.
 
Unfortunately, I feel as if EXCEL might not be a good option for this project as the file is becoming to large and running really slow. I may have to look at other alternative platforms.

I'll restate my first reply.

This sounds like you are asking for help at implementing a possible solution to a problem rather than asking for help to solve the actual underlying problem.

Why don't you take a step back and try to more clearly explain what you have, and what you need.
 
Just to illustrate how this "problem" and "solution" grow:
OP states that the data size is 1,500 x 300.
OP provides trivial example with size 3 x 5
The trivial example produced 5 results.

Increasing the data size to 4 x 7 produces 198 results.
How many results will be produced from the full set of data?
Code:
  column1 column2 column3 column4
0     ABC     BCD     KLM      HU
1     BCD     EFG     MNO     MNO
2     EFG    None     EFG     FOO
3      HU    None    None     BAR
4     KLM    None    None     ABC
5     FOO     FOO     BAR     EFG
6     BAR    None      HU     KLM
Result 1: {'KLM', 'EFG'} from column1 exists in column3
Result 2: {'KLM', 'EFG'} from column1 exists in column4
Result 3: {'KLM', 'BAR'} from column1 exists in column3
Result 4: {'KLM', 'BAR'} from column1 exists in column4
Result 5: {'KLM', 'FOO'} from column1 exists in column4
Result 6: {'KLM', 'HU'} from column1 exists in column3
Result 7: {'KLM', 'HU'} from column1 exists in column4
Result 8: {'KLM', 'ABC'} from column1 exists in column4
Result 9: {'BAR', 'EFG'} from column1 exists in column3
Result 10: {'BAR', 'EFG'} from column1 exists in column4
Result 11: {'EFG', 'FOO'} from column1 exists in column2
Result 12: {'EFG', 'FOO'} from column1 exists in column4
Result 13: {'EFG', 'HU'} from column1 exists in column3
Result 14: {'EFG', 'HU'} from column1 exists in column4
Result 15: {'ABC', 'EFG'} from column1 exists in column4
Result 16: {'BCD', 'EFG'} from column1 exists in column2

Not shown because of 10,000 character limit in a code block

Result 184: {'KLM', 'ABC', 'FOO', 'HU'} from column4 exists in column1
Result 185: {'BAR', 'EFG', 'FOO', 'HU'} from column4 exists in column1
Result 186: {'BAR', 'ABC', 'EFG', 'FOO'} from column4 exists in column1
Result 187: {'BAR', 'ABC', 'EFG', 'HU'} from column4 exists in column1
Result 188: {'BAR', 'EFG', 'MNO', 'HU'} from column4 exists in column3
Result 189: {'ABC', 'EFG', 'FOO', 'HU'} from column4 exists in column1
Result 190: {'BAR', 'ABC', 'FOO', 'HU'} from column4 exists in column1
Result 191: {'KLM', 'EFG', 'BAR', 'FOO', 'HU'} from column4 exists in column1
Result 192: {'KLM', 'EFG', 'BAR', 'FOO', 'ABC'} from column4 exists in column1
Result 193: {'KLM', 'EFG', 'BAR', 'HU', 'ABC'} from column4 exists in column1
Result 194: {'KLM', 'EFG', 'BAR', 'HU', 'MNO'} from column4 exists in column3
Result 195: {'KLM', 'EFG', 'FOO', 'HU', 'ABC'} from column4 exists in column1
Result 196: {'KLM', 'BAR', 'FOO', 'HU', 'ABC'} from column4 exists in column1
Result 197: {'EFG', 'BAR', 'FOO', 'HU', 'ABC'} from column4 exists in column1
Result 198: {'KLM', 'EFG', 'BAR', 'FOO', 'HU', 'ABC'} from column4 exists in column1
 

Part and Inventory Search

Sponsor

Back
Top