
Perl on the server, does it fit the bill for this job?


hughed00 (Technical User)
Jan 15, 2001
I have a SQL server machine that has Perl on it. No one currently with the company knows how to use Perl. Unfortunately for me, I am supposed to come up with the answer to the following question, oh, like yesterday!

I apologize in advance for my ignorance, but what little I do know of Perl is that it seems like it would solve the current problem readily.

We have a very large fixed length field text file which is used as the source for a reporting database. It has only one 'unique' field (message ID). The problem is that in freakish system situations bogus records get written to the log which result in duplicate records or records with the same message ID.

How would I go about stripping out all occurrences of records containing the non-unique references to the message ID, EXCEPT the last record (this is the one that actually got delivered)?

Any guidance is greatly appreciated. I will be reading my recently acquired O'Reilly Perl books on the train....
 
Perl is great at text manipulation and searching, and was known in its early days simply for that task alone. Getting only the unique elements out of something is a chore I wrote a FAQ on here at Tek-Tips, and I'm sure you could change it to suit your task. You could probably even use sed or awk (maybe both) to do this chore too, but I suggest you get a programmer to do it, as the details will be necessarily meticulous.

"If you think you're too small to make a difference, try spending a night in a closed tent with a mosquito."
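A rough sketch of that "keep only the last occurrence" idea, assuming one record per line and a caller-supplied routine for pulling the key out of a record (the split-on-pipe extractor in the demo is just an illustration, not the real file layout):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Keep only the LAST record seen for each key; earlier records whose
# key shows up again later are dropped. Output order follows the first
# time each key appeared.
sub keep_last {
    my ($get_key, @records) = @_;
    my (%last, @order);
    for my $rec (@records) {
        my $key = $get_key->($rec);
        push @order, $key unless exists $last{$key};
        $last{$key} = $rec;    # a later record overwrites an earlier one
    }
    return map { $last{$_} } @order;
}

# Illustration only: key is the first pipe-delimited field.
my @clean = keep_last(sub { (split /\|/, $_[0])[0] },
                      "id1|first", "id2|only", "id1|last");
print "$_\n" for @clean;    # prints "id1|last" then "id2|only"
```

The hash does the heavy lifting: assigning to the same key twice simply replaces the value, so the last record wins for free.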
 
Agreeing with stillflame... Perl would do the job well.

How large is 'very large'? Large is a relative term... relative to the amount of RAM on the box. Perl will handle a lot of text very well. Perl has very strong regex and memory structure capabilities. I have written applications that convert 20 and 30 meg XML files to HTML very nicely. However, if you are repeatedly producing reports from this text database and it is truly very large (hundreds of megs), and if you expect the data set to grow, then... it seems like the real fix would be to put your SQL server software to work using your message ID as the primary key. You can set up the SQL server as an ODBC source and use the Perl DBI/DBD::ODBC modules to connect for your reporting.

good luck..



keep the rudder amid ship and beware the odd typo
 
By "very large" I am talking about 250 meg a month added to the text file source for the database. The database table is dropped every night and refreshed with the updated data, which actually goes pretty quickly.

The problem is the primary key issue, since the source text file is "dirty" (~100 bogus records per month out of ~1,000,000). The only way I can see to use the message ID is to clean out the bad records prior to importing. Since Perl is on the system, I thought it would be a possible tool for this that would not be overly difficult to use.

I get the impression by the posts that I should still lobby for some programming help as this is not a trivial bit of Perl scripting. Correct?

Thanks for your help
 
You could use Perl/DBI/DBD to do the entire trick. Depending on a number of issues, you could either use Perl to rewrite the file and then do the insert, OR, as you read the file, do an update or insert as needed. OR, TMTOWTDI... OR... OR...

If you are going to use Perl to connect to the database, get 'Programming the Perl DBI', by Alligator Descartes and Tim Bunce, from O'Reilly. It would be $35 (US) spent on 333 very valuable pages.

The following lays out a flow (leaves out a lot) for doing an insert, or an update if the message ID already exists. This is a little verbose to allow for some clarity.

#!/usr/local/bin/perl

use DBI;
# do the database connect stuff and construct the sql statements

# open the text file.
open(IPF,"<theLargeFile") or die "Failed to open input file, $!\n";

# while reading the file....
while ($line = <IPF>)
{
# split the line on a delimiter, maybe a pipe |
my @fields= split(/\|/,$line);

# if message ID is the first field.
my $id = $fields[0];
if ($messageIDs{$id})
{
# id already exists in DB.
# do an update instead of an insert.
}
else
{
# stuff the id into an associative array to record the fact that it exists
$messageIDs{$id} = 1;
# insert to the db for a new message ID
}
}

'hope this helps...
Good Luck.


keep the rudder amid ship and beware the odd typo
 
Thanks much.
I misled you, though, when I described the data: the fields are fixed length, but there is no delimiter. Fields are padded with spaces. Each record is delimited by a newline. The message ID is always 48 characters long and sits at the same position in the record.

Here is an example of the data, with records 2 and 3 having identical message IDs:

XSBH_HUB00052001020105:34:37414D5120584E41495F4E54585F30303300006B003A78F57D
EANCM5034316404080 EANCM5034316372020
ENTERPRISE LIVE NO DG0000001250

XSBH_HUB00052001020105:34:37414D5120584E41495F4E54585F30303300006B003A78F57E
EANCM5034316404080 EANCM5034316372020
ENTERPRISE LIVE NO DG0000001250

XSBH_HUB00052001020105:34:38414D5120584E41495F4E54585F30303300006B003A78F57E
EANCM5034316404080 EANCM5034316372020
ENTERPRISE LIVE NF DG0000001317
 
while ($line = <IPF>)
{
# use substr to clip id from line - substr($line,start_where,how_long);
$id = substr($line,28,48); # I may not have counted correctly. maybe 27,49 or other.
if ($messageIDs{$id})
{
# id already exists in DB.
# do an update instead of an insert.
}
else
{
# stuff the id into an associative array to record the fact that it exists
$messageIDs{$id} = 1;
# insert to the db for a new message ID
}
}
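Tying it together as a pre-clean step rather than an upsert: read all the lines once to find the last line number for each message ID, then keep only those lines. The (28, 48) offsets are carried over from the snippet above, so the same "I may not have counted correctly" caveat applies:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two-pass filter: only the LAST record for each message ID survives,
# matching "keep the one that actually got delivered".
sub clean_records {
    my @lines = @_;
    my %last_at;
    # pass 1: remember the index of the last line carrying each ID
    for my $i (0 .. $#lines) {
        $last_at{ substr($lines[$i], 28, 48) } = $i;
    }
    # pass 2: a line survives only if it IS that last occurrence
    return map  { $lines[$_] }
           grep { $last_at{ substr($lines[$_], 28, 48) } == $_ }
           0 .. $#lines;
}
```

In practice you would slurp the large file into an array, call clean_records on it, and write the result out to a scrubbed file for the nightly import; 250 meg a month is well within what a one-shot pass like this can handle.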


keep the rudder amid ship and beware the odd typo
 