Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations bkrike on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Program to compare 2 text files 1

Status
Not open for further replies.

Pyramus

Programmer
Dec 19, 2001
237
GB
Whats the best strategy to do this? I started by reading in all the lines into two list (a list for each text file). I then used 2 while loops, cycling through the lists and if they found matches, they would delete the matching lines from the lists. I was then left with only the lines that didn't match.

This works fine, except when it comes to large files. I'm trying to compare two files of 500kb, which is about 13,000 lines of text. The above strategy just cant handle that many lines.

Any ideas?
 
Is it a memory problem or is it just a long CPU time? If it's a memory problem you can decrease the memory used by keeping one of the files on the disk and then just cycling through it instead of cycling through a "list." You could actually do that for both of the files, and just output to a third file the lines that don't match. What exactly is the problem, where your computer "can't handle" that many lines? If you program it right it shouldn't take up that much memory, so unless you're on an Atari that shouldn't really be a problem. Help us help you. MYenigmaSELF:-9
myenigmaself@yahoo.com
"If debugging is the process of removing bugs, then programming must be the process of putting them in." --Dykstra
 
It can't handle it in that I left it for 45 mins idling on 98% cpu usage and it still didn't finish the process. Here's the bit of code that cycles through the list looking for matches. I think its this taking ages but I can't think of another way to do it.

Code:
while (it1 != listFile1.end())
	{
		foundit = false;
		
		while (it2 != listFile2.end() && !foundit)
		{
			line1 = *it1;
			line2 = *it2;
			
			if ((_tcscmp(line1.c_str(), line2.c_str())) == 0)
			{
				//the line has been matched
				foundit = true;
				//if the line has been found, erase this line from both lists
				listFile2.erase(it2);
				it2 = listFile2.begin();
				listFile1.erase(it1);
				it1 = listFile1.begin();
			}
			else
				it2++;
		}
		
		it2 = listFile2.begin();
		if (!foundit)
			it1++;
	}
	
	//next just output remaining items in list into relevant sections 
	//in report file
 
Ok, so using total file I/O would probably just make that worse. My suggestion would be to sort one of the lists and put it in a tree structure. You've got too much data to do a linear search. Doing even a binary search can cut down your search time by a few orders of magnitude. Modify your search algorighm and/or your data and that should cut down the time it takes to run. MYenigmaSELF:-9
myenigmaself@yahoo.com
"If debugging is the process of removing bugs, then programming must be the process of putting them in." --Dykstra
 
I dont know much about algorithms :( How do I put it in a tree structure exactly?
 
Well it depends on your tree structure. I've never dealt with any of the MFC tree stuff nor the tree structures provided by the STL. I'm not even sure if there is any such data structure in the STL. I've always created my own tree structure class and just created a method that populates it for me. If you don't want to get into anything that complicated, and with trees you can make them as complicated as you like, with as many children/leaves as you like, branches... etc... It can get messy, so if you want to avoid that just sort your list. Then, I can't remember the name of the sort, but it's not terribly complicated. I learned it in High School. All you have to do is create a recursive function that returns a boolean value. Within the function you look at the middle value. Now note that you're comaring lexographicaly on the first character of the string. If they are equal you return true. Otherwise you process the list. If the comparison, such as a CString.Compare comparison is less than 0 then you pass the value and the left half of the list to the function again. Otherwise you pass the value and a reference to the right half (pass by reference so you don't create duplicates). Repeat this until your list is of size one or 0. If your list is of size one you simply compare the two values, and return true or false. If it is of size 0 you obviously return false since the value was never found. Then you allow the returned value to trickle back down through the recursion. Sounds complicated but implementation is very mathematicaly precise. This is an example of a lexographical binary search, since the maximum number of steps is the floor function of the number of elements divided by 2(binary) added to 2. Hope this isn't to confusing. I know data structures and especially complicated searching algorithm's can get complicated. Maybe your best bet is to just find a data structure that already does that for you. Find out if you can sort a CStringList and then see if the sorting algorithm is maximized for a sorted list. Good luck, and if you need more help just hollar.

p.s. you can also look into hashing functions and hashed data structures. eg. a hash map or hash list. the map would be more effective but I think it has a list back end anyway. basically it removes the string comparison problem by providing a hashing function that converts each string into a semi-unique integer. Good Luck! MYenigmaSELF:-9
myenigmaself@yahoo.com
"If debugging is the process of removing bugs, then programming must be the process of putting them in." --Dykstra
 
wow cheers m8. Looks like i should read up on this area. Voted you :)
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top