Operations with large text files 1

ChessPro · Feb 20, 2006

I’ve got a large text file – about 400 MB. The text has a special format, so it’s easy to find a particular line within the file, etc, but the format itself is not important in my question. I need to insert some new text at a specific point. The rest of the text remains unchanged. What is the most efficient technique on Object Pascal for that purpose (in terms of machine time used)?

Glenn9999 · Feb 24, 2006

towerbase wrote:

"It seems to me that what you guys are doing is developing a crude database. Whilst this may be an interesting exercise, why bother when good DBMSs already exist?"

While a DBMS could do this, this seems to be a perfect 1-1 relationship across the board, which would mean only one table. A DBMS would be definite overkill, and there's no
assurance that ChessPro would want to have the weight of installing the DBMS along with his/her app.

Glenn9999 · Feb 24, 2006

ChessPro wrote:

"Current version of my program does the following:
1. Copies and deletes selected games.
2. Moves selected games to a specified position in the list.
3. Adds new games and changes the contents of the existing games.
4. Allows user to access any game in the list specifying the game number.
5. Performs search by any field or a combination of fields specified by user (e.g., player’s name, place, result, etc.)

So now I have to preserve all those features reorienting my program to work with an index file. I think that the main requirement to the index file, given the possible big size of the main files, is its size and hence the time needed to build it. That’s why I think that a 3-field structure representing each game is too expensive. 35 bytes for each record, it makes 70 MB index file if we have 2,000,000 games in the main file. I think the index file should by one-dimensional – just the offset for each game."

Okay, let's look at how index file(s) - nothing says this 3-field structure has to exist in all of them. It depends
on what you're looking for in your functionality.

1. OK, fine. This is where you "inuse" field works. Copy the game out, or delete it by setting the flag. You don't have to restructure the main file.
2. Perfect for an index file. You can relate a game number as a key field to a record offset.
3. Good for your index files. Append the new game to the end of the main file, and adjust the index files accordingly.
4. Load your game number/offset index, find the appropriate game number.
5. This one is much trickier. It might be more preferential to just do a sequential search through the main file.

There's really a lot of logic and thought that could be applied here. In a lot of ways, the DBMS is better for that
purpose - it's going to handle all of this in the best way for you. The main question is if you're going to want the DBMS overhead.

ChessPro wrote:

"If I need to modify a game for example, I can do the following:
1. Copy the part of the file after the game being changed into a temporary file.
2. Delete the old game and all subsequent games in the main file.
3. Append the modified game.
4. Append the temporary file.
5. Update offsets for the games after the modified game by adding or subtracting the size difference between the
deleted block and the inserted block."

It doesn't need to be that involved. Simply open the file in read/write mode, seek to the appropriate record, and
overwrite it. If you want to add a record, seek to the end, and do the write.

The main point for the indexes is to minimize the amount of reading of the disk that you do (which takes a ton of time compared to memory writing) - I keep having these scary visions of you trying to copy around 400MB of data every time you want to make a change, or reading through 400MB of data every time you do a search. Not good, especially since you said multiple PROGRAMS are possibly going to access this.

Especially if you do a lookup, if you can get it done by reading an 8MB index file and then seeking to the exact spot you need in the 400MB file, then you're saving time. It seems an example of binary file manipulation might be in order, especially since there's a bit of confusion.

If I get the time, I'll try to come up with a quick example of what we're talking about. Depending on what your exact requirements are, though, it probably wouldn't hurt in the end for you to just go with the DBMS.

Glenn9999 · Feb 24, 2006

ChessPro wrote:

It creates a file with .pgi extension – must be an index file, but then for some reason deletes it after the main file is closed.

Or perhaps it translates the PGN into binary so it can work with it, and then translates it back to PGN upon exit?

Glenn9999 · Feb 25, 2006

(sorry about so many responses in a row)

The odd thing about a problem like this is that there is a medium option between "binary file" and "database". The funny thing about that, is that when I was working with COBOL, that option was there and it was called VSAM. It was supported innately through the system, and didn't require any additional code, besides describing the index field and any alternate indexes. The only requirement for it was that you had to code the logic to handle navigating it (read sequentially or set index field to search value and read and it would produce that exact record if it existed).

Basically what VSAM is, is a modified b-tree style indexed file, which is what we've been sort of hinting at. Ultimately, this would be the perfect solution for ChessPro's program if we were playing with COBOL, but we're playing with Delphi. That's one of those wierd legacy things about Pascal being Niklaus Wirth's teaching language. You basically have to construct everything for yourself, even what seem to might be the simplest thing. Borland's done a wonderful job in extending that toolset out, but this is still somewhat of a hole.

I understand some companies/groups produced a b-tree style file driver back in the Pascal days (the name "Turbo Power" comes to mind, but my memory is not good) , but I don't know if such a thing exists for Delphi. Perhaps, that might be another solution to search for.

buho · Feb 25, 2006

> towerbase
I would have thought that chess openings were a set moves that can be related to more than one game.
<

He is not saying he wants to do opening analisys or statistical analisys. Lets our solutions stick to what he actually have in mind.

> glenn

The problem is the edition. In the original file the "records" have not the same size; adding/deleting moves in a game is easy from a starting point of view, but compressing the text file can become a major pain.

What I'm saying is:

a) He need to mark the edited entries as "unused" and add the edited records to the end of the file and
b) He need to keep a "clean" text file to not trick another programs opening it (what I'm calling "compressing").

(Of course, growing/shrinking the record space on the fly is out of question, as it is a very expensive operation).

So, for an edition every other week the external index approach is fine; but if he is facing a lot of editing a dbase turns to be better, as an "exportation" will be easiest than a "compactation".

He don't need to use a sophiticated dbase, JET/Access will do well and, nowaday, it is nearly part of Windows. No need for drivers or engines install; it is there, and ADO is easy to use.

The only thing I dont like in this approach is the MDB grow, as the MDB compression is tricky (or impossible) to do with ADO.

buho (A).

towerbase · Feb 25, 2006

VSAM (Virtual Storage Access Method) was an IBM mainframe data access method. It was not specifically part of COBOL. One could access VSAM datasets from almost any language - APL, Assembler, PL/I and so on as well as from COBOL.

I am not really sure of the relevance of VSAM here as I don't think it is available on Windows. But Glenn9999's idea of using a proven piece of software to manage the indexes for ChessPro's application is a good one.

Glenn9999 said:
A DBMS would be definite overkill, and there's no assurance that ChessPro would want to have the weight of installing the DBMS along with his/her app.

The footprint for SQLite is quite small - around 200KB. I don't see that as either a large overhead or overkill as you aren't obliged to use all the facilities of SQLite.

While there might be a "perfect 1-1 relationship" at present, requirements may change. In the future, ChessPro might want to include biobraphical information about a chess player. In which case he would need a two table database. These kind of enhancement can be implemented more easily if the right choices are made now.

Andrew
Hampshire, UK

towerbase · Feb 25, 2006

I was simply querying the statement

buho said:
A move alone have not meaning and in no way a set of moves can be related with more than one game.

which seemed to me to be incorrect.

A good design will allow enhancements to the system which could answer questions such as "which chess opening gives the highest percentage of wins to White in the millions of games stored on the database". I would have thought that the answer to that question would be of interest to chess players.

Andrew
Hampshire, UK

buho · Feb 25, 2006

I'm not making a philosophical assertion or stating the nature of the universe

but working within the constraints/scope ChessPro described.

And in such scenario "A move alone have not meaning and in no way a set of moves can be related with more than one game." He is not making statistical analisys, so no need to break down the moves.

Outside that scope all bets are off, but is futile to discuss what can be needed in a specification which no one is proposing.

buho (A).

Glenn9999 · Feb 25, 2006

VSAM (Virtual Storage Access Method) was an IBM mainframe data access method. It was not specifically part of COBOL. One could access VSAM datasets from almost any language - APL, Assembler, PL/I and so on as well as from COBOL.

I am not really sure of the relevance of VSAM here as I don't think it is available on Windows. But Glenn9999's idea of using a proven piece of software to manage the indexes for ChessPro's application is a good one.

I was using VSAM as an example of a tool that I know exists which would handle the requirement perfectly, and to boot handled the index stuff to the point that you didn't have to worry about it in the code other than specifying it.

I also went on to say that I remember of similar tools back in the Pascal days, and even mentioned what I thought was one group/company that provided this tool.

The footprint for SQLite is quite small - around 200KB. I don't see that as either a large overhead or overkill as you aren't obliged to use all the facilities of SQLite.

I'm not talking about any resource usage or anything of that nature. Wouldn't this require additional software outside of what ChessPro would provide?

Come to think of it, I like the Jet solution perhaps the best. Little to no chance of not already having the software present.

ChessPro · Feb 25, 2006

buho said:
He is not saying he wants to do opening analisys or statistical analisys.

Correct. My program is a viewer with some unique replay and graphical annotation features, and not a search engine. Other established chess database management software can be used for that purpose. There is no sense in copying their features unless you come up with something new. I’ve just provided the basic management features, the use of which is justified for my program purpose.

Glenn9999 said:
Or perhaps it translates the PGN into binary so it can work with it, and then translates it back to PGN upon exit?

I’ve experimented on a 190MB PGN file holding 273144 games; the PGI file produced by ChessBase makes about 1MB. This is what we get as the size of an index file if we use 4-byte Cardinals to store games’ offsets. Besides there is an ‘I’ as the last letter in the extension, so I bet it’s an index file. The thing I can’t get though is why ChessBase deletes this PGI file after the file is closed and creates a new one every time the file is opened. It takes quite a while for such a large file even for ChessBase.
By the way, I tried this file on my program – one can take a nap while the games are loaded/ saved.

Glenn9999 said:
Simply open the file in read/write mode, seek to the appropriate record, and
overwrite it.

Unfortunately it can’t be done. Records are of various sizes. But certainly the algorithm I’ve suggested is too expensive. I think the following method is a good practical approach:
1. Add new information to the end of the file (thank you, buho!)
2. Store the unused index in a separate index file, which will hold all unused indexes (here I disagree with buho and Glenn9999 – no “inuse” Boolean should be used).
3. Update the index in the main index file (now the index points to the record in the end of the file).
4. On program exit delete unused records from the main text file using “unused” index file, delete “unused” index file, rebuild main index file.
5. If the main text file is too large (e.g. more than 10 MB) offer a choice for user (1. Compress changed file(s) now (recommended); 2. Compress changed file(s) later) with a notice that this procedure may take time.

buho · Feb 28, 2006

No inexpensive way to mark the edited records in the main file?

Say, all records start with "[Event ...]" and you have a pointer to this line.

Lets say the record start with:

Code:

[Event "F/S Return Match"]

What about changing it to:
[Event "EDITED"          ]
or may be
[Event "ERROR"           ]
adding the due #32 (spaces) chars to keep the record size unchanged?

It is a very inexpensive operation, as you can launch a quick block-write to change the field (the overall file layout is not changed).

The idea is warning the users opening the file in another program that this particular game is fishy. Of course the foreign program will open the game, but at least you are giving the user some warning.

The problem comes with games where the "Event" value ("F/S Return Match" in the example) have less than six/five chars. No idea about the likelihood of this; may be you can work it out in some other way (like hijacking not the "Event" field but the moves, so the game show no moves).

HTH.
buho (A).

ChessPro · Mar 6, 2006

buho said:
The idea is warning the users opening the file in another program that this particular game is fishy

I am afraid that the users would rather think that the software producing such changes to the game record is fishy.

buho · Mar 7, 2006

Matter of interface.

I think you have two options:

a) Compressing the file when the program closes, which will take some sensible time and some sensible disk resources.

b) Letting the user compress the file at will.

I personally doubt being using a program behaving like "a" for too much time, as I hate hoggers.

On the other side, if you are using "b" and you detect the user made changes in some games, you can show a messagge telling the user that a compress can be of use, and explaining him the situation if he chooses not to do so (the situation is, of course, some file entries being marked as "ERROR" or having not moves at all). Explain it in the help file too.

You are producing changes to the whole games file layout; having a record with no moves and/or marked as ERROR, CHANGED, EDITED or whatever is way better than having a fishy record other program can show as a valid one.

buho (A).

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Operations with large text files 1

ChessPro

Programmer

Glenn9999

Programmer

Glenn9999

Programmer

Glenn9999

Programmer

Glenn9999

Programmer

buho

Programmer

towerbase

Programmer

towerbase

Programmer

buho

Programmer

Glenn9999

Programmer

ChessPro

Programmer

buho

Programmer

ChessPro

Programmer

buho

Programmer

Similar threads

Part and Inventory Search

Sponsor