dirtyholmes
Programmer
Hi Guys,
I am trying to come up with a suitable file comparison algorithm and hoped someone here has done something similar in the past. I want to compare two files to get an idea of how closely they match each other. Each file consists of lines made up of several columns of numbers and text (telephone numbers, account numbers, etc.), with each line representing one record.
However, the records in each file may or may not be in the same order, so matching line by line is out of the question. Some files may also have additional records, and some records may almost, but not exactly, match a record in the other file (e.g. 4 of the 5 columns may be identical, but the 5th column may differ from the equivalent record in the other file).
Can anyone provide a watertight algorithm to quantify how closely these files match?
Data example
193 03989337 000060003+00060009.69003858
194 03989337 000060003+00060009.69003858
195 989337 000060003+00060009.69003858
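
To make the question more concrete, here is a rough sketch of the kind of score I have in mind. It is only an illustration under some assumptions: columns are separated by whitespace, a partial match is scored as the fraction of columns that agree, and leftover records are paired greedily (which is not guaranteed to find the best overall pairing). The names `parse`, `field_similarity` and `match_score` are made up for this example.

```python
from collections import Counter


def parse(lines):
    """Split each non-empty line into a tuple of whitespace-separated columns."""
    return [tuple(line.split()) for line in lines if line.strip()]


def field_similarity(a, b):
    """Fraction of column positions at which two records agree (0.0 to 1.0)."""
    width = max(len(a), len(b))
    if width == 0:
        return 1.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / width


def match_score(lines_a, lines_b):
    """Order-independent similarity of two files, from 0.0 to 1.0."""
    recs_a, recs_b = parse(lines_a), parse(lines_b)
    if not recs_a and not recs_b:
        return 1.0

    # Step 1: pair off identical records regardless of position (multiset intersection).
    count_a, count_b = Counter(recs_a), Counter(recs_b)
    exact = count_a & count_b
    total = float(sum(exact.values()))

    # Step 2: greedily pair each leftover record with its most similar leftover partner.
    left_a = list((count_a - exact).elements())
    left_b = list((count_b - exact).elements())
    for rec in left_a:
        if not left_b:
            break
        best = max(range(len(left_b)), key=lambda i: field_similarity(rec, left_b[i]))
        sim = field_similarity(rec, left_b[best])
        if sim > 0:
            total += sim
            del left_b[best]  # each record may be used in at most one pair

    # Records with no partner contribute nothing, so extra lines lower the score.
    return total / max(len(recs_a), len(recs_b))


file_a = [
    "193 03989337 000060003+00060009.69003858",
    "194 03989337 000060003+00060009.69003858",
    "195 989337 000060003+00060009.69003858",
]
# Hypothetical second file: same records in a different order, one column changed.
file_b = [
    "195 989337 000060003+00060009.69003858",
    "193 03989337 000060003+00060009.69003858",
    "194 03989338 000060003+00060009.69003858",
]
print(round(match_score(file_a, file_b), 3))
```

With the two files above, two records match exactly and the third agrees in 2 of its 3 columns. The greedy step compares every leftover record against every other leftover record, so it may be slow on large files with many mismatches.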