Operations with large text files 1

ChessPro · Feb 20, 2006

I’ve got a large text file – about 400 MB. The text has a special format, so it’s easy to find a particular line within the file, etc, but the format itself is not important in my question. I need to insert some new text at a specific point. The rest of the text remains unchanged. What is the most efficient technique on Object Pascal for that purpose (in terms of machine time used)?

buho · Feb 20, 2006

If you need to insert only one block (ie: some content at ONE point) the best way is to copy the first part of the file to another, add the new block and copy the remain of the file.

If you need to insert at more than one point, the best way is to load all the file in memory (like in a TStringList), insert the new blocks and save all the file at end.

Anyway, if you are inserting only one time or every other week, efficience is not important. Do it as you see best.

buho (A).

ChessPro · Feb 20, 2006

Thank you buho,

Still I have some doubts. If I am going to use the first method (copying to a new file), I don’t see another way but copying all lines of the old file to the new one using ReadLn and WriteLn procedures. But there are millions lines in the file (it’s 400 MB and one line is less than 80 characters), isn’t this operation going to be too long?
The second method you advise seems to be perfect, but again I have doubts because of the file size. All 400 MB are going to be loaded to the operating memory, right? And not all computers have such a large memory. Some users of my program may have problems because of that. Besides loading of 400 MB to the operating memory may also take quite a while…
Would appreciate more of your comments on that issue.

Glenn9999 · Feb 20, 2006

There is no way to physically insert bytes in a file in the middle of it. So you have to copy it. And yes, typically, you are going to use readln and writeln.

Investigate SetTextBuf for your speed issues. But realize, too, this is going to be inherently slow, simply because you are copying 400MB around.

buho · Feb 21, 2006

The file being bigger than the physical memory does not matter; Windows will swap to disk as needed. You can perfectly load it in a TStringList without any problem (other than slowiness).

As Glenn said, setting the file text buffer on steroids can help a lot. And it is gonna be slow anyway, copying 400 MB is a hard cookie for most machines. Pray for the user keeping the file in a local disk

.

Please, tell me more about the file format, if you don't need to read line by line to find the insertion point you don't need to copy line by line either.

On the other hand, if you need to read line by line to find the point, go copying while walking the file, you have no options.

buho (A).

ChessPro · Feb 21, 2006

Thank you guys,

It’s good to know that the physical memory is not a question, as the current version of my program (Smart Chess Viewer) is currently loading the whole file to a TList and then uses a TStringList to save changes to disk. The format of the file is pgn (portable game notation) used to store chess games collections. A sample chess game record in pgn format is as follows:

[Event "F/S Return Match"]
[Site "Belgrade, Serbia JUG"]
[Date "1992.11.04"]
[Round "29"]
[White "Fischer, Robert J."]
[Black "Spassky, Boris V."]
[Result "1/2-1/2"]

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 d6 8. c3
O-O 9. h3 Nb8 10. d4 Nbd7 1/2-1/2

As shown above it typically consists of seven tags and a movetext section. I’ve never had customers reporting that the program works too slow opening and saving pgn files. But that’s because pgn files are typically small (less than 10 MB). Other non-text formats are used to store large databases (e.g. cbv). But there is one truly gigantic pgn file presumably holding all the official games played till present time (2000000) and it’s about 400 MB. I don’t have this file and I try to figure out theoretically how SCV would work with this file. I am now also in doubt whether my method of loading and saving is the most efficient. The problem is that I have to resave all 2000000 games in the file in order to make changes just to one game. The user also has to wait until the whole file is loaded to TList before he can work with the games. Now I think that it’s better to load the games from the disk on the as-needed basis. The problem here is to reach a particular game in the file though. E.g., using a counter and ReadLn procedure.

Would appreciate your thoughts on that matter, guys.

Glenn9999 · Feb 21, 2006

sounds like a tailor made problem for a binary file, as opposed to text. Once you convert the 400MB to binary records (one record representing each game, perhaps), then you would have the option to seek the file to specific records as well as overwrite each record if you need to update them.

Then you can create an index file (s) that you can load to search them for whatever criteria you are interested in.

Or if you want to go that route, try setting it up in a database. All the file allocation and stuff should be automatic then, and all it takes is SQL on your end (or however BDE/firebird/Paradox does things) to make things happen.

Of course the first trick would be converting the data properly to binary file. Then writing your program so you can read whatever you end up with. That'll probably make your program easier, since reading binary data records is much more straightforward than translating out text.

towerbase · Feb 21, 2006

This is definitely a database appliction.

I wouldn't recommend Paradox for this quantity of data.

SQLite is probably your best bet. It is free and there are no significant licensing implications for distributing your application.

MySQL is excellent but you can't distribute it free with your application.

If you use a properly designed and indexed database you will be amazed at the responsiveness of your program.

Andrew
Hampshire, UK

ChessPro · Feb 22, 2006

I do realize that a database would be a perfect solution in my case, but the problem is that I have to save the file in the pgn (and not other) format for the purpose of compatibility. It’s a standard chess format read by many conventional chess software – databases, engines, viewers, etc.
Anyway the issue appears to be clear to me now. If I decide to develop a special format as an alternative to pgn in future, I will definitely think of a database! Thank you all!

Glenn9999 · Feb 22, 2006

There's nothing that says that you can't import this 400MB of text into your program as a binary file for processing and then provide an export option in your program to get any changes that might be made into this PGN format (in part or total).

About like a word processor that plays ball with files in its own format, but gives you an option to export to a more universal accepted format like PDF.

ChessPro · Feb 22, 2006

My main concern was file export. But I understand your idea: create an index file and process the file as binary. Thanks for the help!

buho · Feb 23, 2006

> create an index file

That is my bet too.

1) Create an external index file, so you can quickly positionate and block-read the game you need. The index have the key-field (Event and/or Site and/or Date and/or whatever, the offset into the file and the size of the block plus some flags -see "3").

2) Add any thing new to the end of the file (a very inexpensive action).

3) If the user changes something, mark the old index as "not valid", let the old content where it is and proceed like in "2".

4) Add a "Compact" command to your program. "Compact" works deleting the changed items in the main file (a long process, as you need to make a lot of partial copies).

5) Add an "Export" command. Export exports the selected block.

6) Add a "Query" command too, so the user can locate all games Capablanca played with blacks or all games Petrosian won in less than 12 moves.

The only problem I can see is your key being very long; for what you have shown us, the info header is bigger than the moves themselves. You will need to limit your index (and your search capabilites -forget "6"

.

buho (A).

Glenn9999 · Feb 23, 2006

Create an external index file, so you can quickly positionate and block-read the game you need. The index have the key-field (Event and/or Site and/or Date and/or whatever, the offset into the file and the size of the block plus some flags -see "3").

Even multiple index files that ties to the main binary file for each field he would want to search on. If this index search file is read into memory as a binary-tree style structure, then it can be used to key to the main file. This isn't unlike what indexes do when they are created in databases.

A index record would look like:

siteinfo: string[30];
recordnum: longint;
inuse: boolean;

Where siteinfo is one of your "sites", and recordnum is the physical record position in your main binary file. Of course, the trick comes in updating the records or deleting them, and that's partly where the inuse flag can come. If it's a valid record or not (i.e. the user hasn't deleted it).

Basically whatever works best for the data you have and what you are wanting to search on.

towerbase · Feb 23, 2006

It seems to me that what you guys are doing is developing a crude database. Whilst this may be an interesting exercise, why bother when good DBMSs already exist?

Surely what ChessPro needs to do is implement a way of importing pgn format files into a database and exporting data from the database in pgn format. This should be fairly trivial.

Andrew
Hampshire, UK

ChessPro · Feb 23, 2006

Creating an efficient index file seems to be an interesting problem. Let me describe some of my program’s functionalities to give you an idea of the scope of problems connected with rebuilding the code to work with an index file.
Current version of my program does the following:
1. Copies and deletes selected games.
2. Moves selected games to a specified position in the list.
3. Adds new games and changes the contents of the existing games.
4. Allows user to access any game in the list specifying the game number.
5. Performs search by any field or a combination of fields specified by user (e.g., player’s name, place, result, etc.)
Those features are not the purpose of my program though, as it serves mainly for showing the games (in animated mode using special chess annotation) and not for the games management. But I’ve provided basic management features for the user’s convenience.
So now I have to preserve all those features reorienting my program to work with an index file.
I think that the main requirement to the index file, given the possible big size of the main files, is its size and hence the time needed to build it. That’s why I think that a 3-field structure representing each game is too expensive. 35 bytes for each record, it makes 70 MB index file if we have 2,000,000 games in the main file. I think the index file should by one-dimensional – just the offset for each game. The end of the game block is the offset for the next game. To split the block into separate records is nothing in terms of machine time. In this case I will have 8MB index file for 2,000,000 games. If I need to modify a game for example, I can do the following:
1. Copy the part of the file after the game being changed into a temporary file.
2. Delete the old game and all subsequent games in the main file.
3. Append the modified game.
4. Append the temporary file.
5. Update offsets for the games after the modified game by adding or subtracting the size difference between the deleted block and the inserted block.
In this case I don’t need the “inuse” flag as all my records are active. I think such index file should work OK.
My other concern is whether to keep the index file after closing the main file or create a new one every time I open the file. The former would be perfect, but the problem is that other programs may access the main file and make changes to it. In this case my index file would work incorrectly. I think this can be solved using FileAge. If timestamp is wrong the program will create a new index file.

Would appreciate your opinion, guys.

buho · Feb 24, 2006

If you are going to delete/insert a lot, get all the info into a dbase and make an "Export All" command as Tower say.

An external index is a good idea only if you have little deletion/edition work.

buho (A).

Stretchwickster · Feb 24, 2006

towerbase said:
It seems to me that what you guys are doing is developing a crude database...Surely what ChessPro needs to do is implement a way of importing pgn format files into a database and exporting data from the database in pgn format.

I agree.

ChessPro, the most efficient way of doing this would be to design how your database is structured i.e. list all the data you need to store, normalize it into tables (e.g. you will at least have Game and Move tables in your database) and describe the relationships between the tables. All this falls under the banner of Entity-Relationship Modelling (ERM), so you should be able to come up with an E-R Diagram which clearly shows tables and relationships.
It should then be a straightforward exercise to implement your E-R Diagram as a database.

You will need to write code to import/export the data to/from the database tables as towerbase said.

Your requirements...

ChessPro said:
Current version of my program does the following:
1. Copies and deletes selected games.
2. Moves selected games to a specified position in the list.
3. Adds new games and changes the contents of the existing games.
4. Allows user to access any game in the list specifying the game number.
5. Performs search by any field or a combination of fields specified by user (e.g., player’s name, place, result, etc.)

...can easily be satisfied by constructing SQL queries for each of the above tasks.

Clive

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"To err is human, but to really foul things up you need a computer." (Paul Ehrlich)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To get the best answers from this forum see: faq102-5096

buho · Feb 24, 2006

Only one table will do.

A move alone have not meaning and in no way a set of moves can be related with more than one game.

buho (A).

ChessPro · Feb 24, 2006

A move alone have not meaning and in no way a set of moves can be related with more than one game.

That’s why I am somewhat reluctant to use a database for such kind of text files – there is no relationship between records except the games order. Using a database seems a sort of overkill in this case. I’ve also checked how ChessBase (a recognized chess database management program) handles pgn files. It creates a file with .pgi extension – must be an index file, but then for some reason deletes it after the main file is closed.

towerbase · Feb 24, 2006

buho said:
in no way a set of moves can be related with more than one game.

I would have thought that chess openings were a set moves that can be related to more than one game.

ChessPro said:
there is no relationship between records except the games order

Aren't the all games played by a particular player of interest?

The reason for using an established DBMS such as SQLite is that it saves you the effort in developing your code to create and manage the index file. Someone has done it already for you. It is not a trivial task because your requirements will keep changing and you will need to keep modifying the index handling code.

ChessPro said:
I do realize that a database would be a perfect solution in my case, but the problem is that I have to save the file in the pgn (and not other) format for the purpose of compatibility.

What is the problem with extracting data from the database in pgn format?

Andrew
Hampshire, UK

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Operations with large text files 1

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Similar threads

Log in

Part and Inventory Search

Sponsor