Performance on importing data into DB

Status
Not open for further replies.

SSJ

Programmer
Sep 26, 2002
54
PT
I'm trying to import data created from a production scheduling system into a database that serves my own system.
Thing is, the scheduler generates its result as text files (in a specific format) and I need to import that data into my system DB.

This file can contain a huge amount of data, 10,000+ lines (each line in the text file becomes a record in the DB). The way I was thinking of importing it would be to read the file line by line, parse each line accordingly and insert it into the DB. But as you can see, if 1 line = 1 INSERT statement, my program will hit the RDBMS 10,000+ times and performance will suffer really badly. And since a rescheduling might be performed several times a day, this import might also need to run several times a day.

I can't see a lighter way of doing this, so I'd appreciate hearing some alternatives that could help me reduce the load of the import.

TIA
 
1) Check if your RDBMS has an import facility. PostgreSQL has one.
Is this data for only one table? Which RDBMS?

2) If your data is correct and the program works well, set autocommit to off before inserting, and commit manually after all the rows are inserted. This will speed things up, especially when the table has indexes.

3) Use prepared statements. Your statement looks like
Code:
"INSERT INTO foo (
    bar
    , baz)
VALUES ?, ?)";
where the '?' act as placeholders for your values.
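Putting 2) and 3) together, a minimal JDBC sketch - table, columns and the ';' field separator are made up here, adjust them to your own schema and file format:

```java
import java.sql.*;
import java.util.List;

class Importer {
    // Split one line of the scheduler file into fields.
    // ';' as separator is an assumption - use whatever your format defines.
    static String[] parseLine(String line) {
        return line.split(";");
    }

    // Insert all lines in one transaction through a single prepared statement.
    static void importLines(Connection con, List<String> lines) throws SQLException {
        con.setAutoCommit(false);               // 2) commit once at the end
        PreparedStatement ps = con.prepareStatement(
                "INSERT INTO foo (bar, baz) VALUES (?, ?)"); // 3) placeholders
        try {
            for (String line : lines) {
                String[] f = parseLine(line);
                ps.setString(1, f[0]);
                ps.setString(2, f[1]);
                ps.executeUpdate();
            }
            con.commit();                       // one commit for the whole file
        } catch (SQLException e) {
            con.rollback();                     // don't leave a half-done import
            throw e;
        } finally {
            ps.close();
        }
    }
}
```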
 
If your table gets completely replaced at each reschedule, then stefan's suggestion to use import (or load, or whatever your DBMS calls it) is usually the quickest way.

I know this is a java forum, but you might want to consider using perl to parse the text file and reformat it for input to the import. It's fast, and execution time seems to be a priority if it happens several times a day.
 
Thanks for the replies.

I'm using SQL Server 2K. As for the import suggestion, I'm not sure I understood what you meant. Is this something that would need to be run from within SQL Server (I know I can do it using the DTS import system in SQL Server), or could I call it from my Java app? I really need to trigger the import from my Java code.

Also, I explained myself badly: what I need to do isn't exactly INSERTs into the DB, but updates to already existing records (basically updating the scheduler's planned dates for operations, and their necessary resources, according to the scheduling result). So I believe an import done that way would fail?

I'm already using prepared statements for the import as it is, but I still have to issue 1 UPDATE statement per line in the text file, and that's what I'm trying to avoid here.

And stevexff, what exactly would be the benefits of using Perl instead of Java for parsing the file? My current parser seems to be quick enough; would Perl give me extra performance here?
 
If you already have a parser, and you are happy with the performance, then that's OK. Like they say, "If it's not broke, don't fix it...". In your original post it seemed like you were still at the design stage :)

Assuming you get several versions of the scheduler output per day, would it be possible to compare current vs. previous on each iteration? Maybe run a diff on them to get the deltas?

If you could identify only those records that had changed during the last schedule run, you would be able to run a much smaller database update with just the changes.
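The delta idea can be sketched in plain Java, outside the DBMS - this assumes each line starts with a unique operation ID and ';'-separated fields, both of which are invented for the example:

```java
import java.util.*;

class Delta {
    // Key each line by its first field (assumed to be a unique operation ID).
    static Map<String, String> keyById(List<String> lines) {
        Map<String, String> m = new HashMap<String, String>();
        for (String line : lines) {
            m.put(line.split(";", 2)[0], line);
        }
        return m;
    }

    // Return only the lines that are new or different since the previous run.
    static List<String> changed(List<String> previous, List<String> current) {
        Map<String, String> old = keyById(previous);
        List<String> out = new ArrayList<String>();
        for (String line : current) {
            String id = line.split(";", 2)[0];
            if (!line.equals(old.get(id))) {
                out.add(line);
            }
        }
        return out;
    }
}
```

Only the returned lines would then need UPDATEs; if nearly every line changes between runs, the saving of course disappears.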
 
Well, theoretically that might be a good idea: I could compare the new scheduler result with the one currently stored in the database. Though I'd need to read all the necessary data from the database and then do the dirty work in memory, I guess.
I'm not sure I'd gain that much, though, since a scheduler result consists, besides the start and end date, of a group of resources (like machines, operators and tools) associated with each operation. So a lot of work would be needed to compare the changes between 2 scheduler results.

And I still suppose that each time a rescheduling is performed, at least most of the planned start and end dates will change, since a rescheduling usually means the real production system is getting behind or ahead of what the scheduler originally forecast. So most of the time all the planned times will change (even if only by a few minutes).

I'd have to perform some tests with the system fully working to measure the benefits of such an approach. But it's a solution worth thinking about.

 
When talking about an import, I meant a non-Java solution, that's right.
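That said, with SQL Server you can still trigger its bulk loader from Java by sending the T-SQL BULK INSERT statement over JDBC. Note that the file path must be visible to the *server*, not the client, and the table name and path below are invented for the sketch. It only does INSERTs, so for your update case you'd bulk-load into a staging table and then run one set-based UPDATE joining against it:

```java
import java.sql.*;

class BulkLoad {
    // Build the T-SQL; table name and server-side file path are placeholders.
    static String bulkInsertSql(String table, String serverSidePath) {
        return "BULK INSERT " + table
             + " FROM '" + serverSidePath + "'"
             + " WITH (FIELDTERMINATOR = ';', ROWTERMINATOR = '\\n')";
    }

    // Fire the bulk load from the Java side over an ordinary JDBC connection.
    static void load(Connection con, String table, String path) throws SQLException {
        Statement st = con.createStatement();
        try {
            st.executeUpdate(bulkInsertSql(table, path));
        } finally {
            st.close();
        }
    }
}
```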

I haven't used 'addBatch' till now - I guess it's intended for differently structured statements, while identical statements with changing values might better be handled by prepared statements.

Rereading your post I'm astonished: 10,000 updates isn't much. I recently inserted 1 million rows in 70 minutes over a local network - one 1 GHz machine running the Java program, one 1 GHz machine running the database.

So my program would have needed about 70 seconds for 10,000 rows. And your equipment might be faster...
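For the record, addBatch is also available on a PreparedStatement (since JDBC 2.0), so batching and placeholders combine. A sketch - the ops table, its columns and the batch size of 500 are all invented, tune them for your schema and driver:

```java
import java.sql.*;
import java.util.List;

class BatchUpdate {
    static final int BATCH_SIZE = 500; // arbitrary; tune for your driver

    // How many executeBatch() round trips are needed for n rows.
    static int batchCount(int rows, int size) {
        return (rows + size - 1) / size;
    }

    static void update(Connection con, List<String[]> rows) throws SQLException {
        PreparedStatement ps = con.prepareStatement(
                "UPDATE ops SET start_date = ?, end_date = ? WHERE op_id = ?");
        try {
            int inBatch = 0;
            for (String[] r : rows) {
                ps.setString(1, r[0]);
                ps.setString(2, r[1]);
                ps.setString(3, r[2]);
                ps.addBatch();
                if (++inBatch % BATCH_SIZE == 0) {
                    ps.executeBatch();   // one round trip per batch, not per row
                }
            }
            ps.executeBatch();           // flush the remainder
        } finally {
            ps.close();
        }
    }
}
```

That way 10,000 UPDATEs become about 20 round trips to the server instead of 10,000.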
 
SSJ

I wasn't suggesting that you compare the new scheduler output with the data on your database. Compare it with the previous iteration of the scheduler output, outside of the DBMS, which will be much faster. Then use the deltas as input to your database update.

But if you think almost every row will have some kind of change, then you won't get much benefit. The only other thing would be to split your tables so that the most volatile data (start and end times) is held in a separate table from the data that doesn't change much, like descriptions and tasks. This would cut down on the amount of I/O.

Generally it's best to go for the easy solution, as it's easy to understand and maintain. If you find it doesn't perform fast enough, you can make it more complicated...
 