Weird error with reading a .txt file :/

Status
Not open for further replies.

youradds

Programmer
Jun 27, 2001
817
GB
Hi,

I'm trying to write some code that reads the IAN.com datafeed. This is the file:


..the code I'm using is:

Code:
    open(GRAB, "<$datafile") || die "Error reading $datafile. Reason: $!";
    my $i = 0;
    while (<GRAB>) {
         $i++;
         chomp;
         print $_ . "\n";
         unless ($i % 1000)  { print " $i"; }
         unless ($i % 10000) { print "\n"; }
    }

    close(GRAB);

..however, ALL of the data comes up on one line :/

I managed to fix it with this:

Code:
# just a custom cleanup routine.. seeing as IAN.com don't seem 
# to want to fix up their database at their end :(
sub clean_string {

 my @rules_test =split //, $_[0];

 my $back;
 foreach (@rules_test) {
   $back .= $_ unless ord($_) == 0;
 }

 return $back;

}

..but this doesn't work on large files (cos it just times out and buggers up my server, as it's checking every single character of a 40 MB+ file :()

Does anyone have any suggestions?

TIA!

Andy
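
For reference, a minimal sketch of a faster version of the cleanup routine above, assuming the only goal is to drop NUL bytes as the original does. It uses Perl's built-in tr/// operator with the d modifier, which deletes matching characters in a single pass instead of splitting the string into a list of characters. The sample string is made up for illustration.

```perl
use strict;
use warnings;

# Delete every NUL byte in one pass. tr/// scans the string internally,
# so there is no per-character Perl loop and no temporary list.
sub clean_string {
    my ($text) = @_;
    $text =~ tr/\0//d;
    return $text;
}

# Made-up sample: ASCII text with a NUL after every character.
my $cleaned = clean_string("H\0o\0t\0e\0l\0");
print "$cleaned\n";
```

This still only strips the NULs; it does nothing about where the line breaks are or why the NULs are there in the first place.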
 
$/ can also be called $INPUT_RECORD_SEPARATOR (under "use English;")

$/ defaults to "\n", but it can be changed to "\r" or "\0" or whatever (there are a couple of different character strings above to guess from).

If you happen to know the source of your data, you may try something like
Code:
# do these two at startup
$line_sep{'default'} = $/ ;
$line_sep{'that_buggered_foreign_feed'} = "\r"; # or whatever

# <later in your code>
$feed_type = &whats_my_feed_type();
$/ = $line_sep{$feed_type};
while (<GRAB>) {
   # process the input
}
OK - obviously I'm doing a lot of hand waving here but your code above doesn't look like you need tutoring as much as a starting idea.
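
To make the idea concrete, here is a small self-contained sketch, assuming the records are separated by a bare carriage return. The sample data and the in-memory filehandle are stand-ins for the real feed file, which is not shown in this thread.

```perl
use strict;
use warnings;

# Assumed sample data: three records separated by bare carriage returns.
my $data = "first\rsecond\rthird\r";

# An in-memory filehandle, so the sketch runs without a real feed file.
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

my @records;
{
    # local restores the previous separator when this block exits.
    local $/ = "\r";
    while (my $line = <$fh>) {
        chomp $line;    # chomp removes the current value of $/
        push @records, $line;
    }
}
close($fh);

print scalar(@records), " records: @records\n";
```

Because the separator is changed with local inside a block, any other file reads in the script keep the default "\n" behaviour.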

I've had to do similar but had smaller files to work with and used the following construct to accomplish it
Code:
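# <FH> is read in scalar context here, so this only sees the whole file
# if the file contains no "\n" at all (or if $/ has been set to undef)
foreach my $line (split /$escape_chr/, <FH>) {
   # process line
}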
 
I don't think anything funky is going on; it's just that CuteFTP (and, in a way, your scripts) don't cater for the fact that the file is encoded in UTF-16. Those first two funny bytes you see are the byte order mark that identifies the file as such. I recommend you use the recode utility to convert it to, say, ISO-8859-1 (a.k.a. Latin-1), or some other character set if you prefer, before processing the data:

Code:
recode -f UTF-16LE/CR..ISO-8859-1 Hotel_Description_FR.txt > Hotel_Description_FR_L1.txt

Annihilannic.
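
As an alternative to converting the file first, the decoding can be sketched inside Perl itself with an :encoding(UTF-16) layer on open, combined with a "\r" record separator. This is not what was proposed above, just the same idea moved into the script. It assumes the file starts with a byte order mark and uses bare carriage returns between records; the temporary file and its contents are made up so the example can run on its own.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Build a small stand-in file: BOM + UTF-16LE text with CR separators.
my $sample = "Hotel A\rH\x{f4}tel B\r";
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out);
print {$out} "\xFF\xFE", encode('UTF-16LE', $sample);
close($out) or die "Cannot close $path: $!";

# The UTF-16 layer uses the BOM to pick the byte order and hands Perl
# decoded characters, so no stray NUL bytes are left in the data.
open(my $in, '<:encoding(UTF-16)', $path) or die "Cannot open $path: $!";

my @records;
{
    local $/ = "\r";    # matched against the decoded characters
    while (my $line = <$in>) {
        chomp $line;
        push @records, $line;
    }
}
close($in);

binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for @records;
```

With the data decoded on the way in, a NUL-stripping routine is no longer needed and accented characters survive intact.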
 
Looking at the hex dump, it looks like whatever OS they are using to create this file uses just the carriage return character (0x0D) as its end-of-line marker. According to Wikipedia, that implies they are creating the file on one of the Commodore machines, the Apple II family, Mac OS up to version 9, or OS-9. I would assume OS-9, as it is the only server-class OS on that list.

I don't think any of that is relevant, but I do have a suggestion. Bear in mind that I am just now reading Learning Perl, so I am the ultimate n00b, but I do use PHP a lot. PHP has a function fread() that lets you set the maximum number of bytes to read from a resource (usually a file pointer, but sometimes a socket, etc.). If Perl has a similar function, you could use it to read a set number of characters (call it a line) into an input queue and then search each line for that carriage return. Once you hit it, pull out everything before it and process it, then put everything after it back into a re-initialized input queue. The PHP code would look something like this:

PHP:
$inputqueue = '';
while (!feof($fh)) { // $fh is your file handle opened earlier with fopen()
  $inputqueue .= fread($fh, 1024); // read one kilobyte at a time
  // a chunk can hold several records, so loop until no "\r" is left
  while (($cr_loc = strpos($inputqueue, "\r")) !== FALSE) {
    $record = substr($inputqueue, 0, $cr_loc);
    // process $record

    // keep the rest of the chunk in the queue for the next pass
    $inputqueue = substr($inputqueue, $cr_loc + 1);
  }
}
// process whatever is left in $inputqueue as the final record

Not the best, but you get the general idea, I hope. Again, sorry that I can't provide that snippet in Perl.
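
For what it's worth, Perl's built-in read() plays the role of fread() here, so the same queue approach can be sketched in Perl roughly as follows. The sample data, the in-memory filehandle, and the tiny chunk size are assumptions chosen so that records span chunk boundaries; a real script would open the feed file and use a much larger chunk.

```perl
use strict;
use warnings;

# Assumed sample data with bare carriage returns between records.
my $data = "alpha\rbeta\rgamma";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

my $chunk_size = 4;     # tiny on purpose, to force records to span chunks
my $queue      = '';    # holds the unfinished tail of the previous chunk
my @records;

while (read($fh, my $chunk, $chunk_size)) {
    $queue .= $chunk;
    # Peel off every complete record currently sitting in the queue.
    while ((my $cr = index($queue, "\r")) >= 0) {
        push @records, substr($queue, 0, $cr);
        $queue = substr($queue, $cr + 1);
    }
}
# Whatever is left is a final record with no trailing separator.
push @records, $queue if length $queue;
close($fh);

print "$_\n" for @records;
```

Only one chunk plus one partial record is held in memory at a time, which is the point of the approach for a 40 MB+ file.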
 
Hi,

Thanks guys - still no joy with any of that stuff :(

Think I'm just gonna give up, and go with the method I have now (where it has to be downloaded, saved in WordPad, and re-uploaded in its fixed format). Not ideal - but at least it works :)

Thanks for all the help and suggestions though guys - much appreciated :)

Cheers

Andy
 