Weird error with reading a .txt file :/

youradds · Aug 19, 2008

Hi,

I'm trying to write some code, that reads the IAN.com datafeed, this is the file:

https://www.ian.com/affiliatecenter/include/Hotel_description_fr.zip

..the code I'm using is:

Code:

    open(GRAB, "<$datafile") || die "Error reading $datafile. Reason: $!";
    my $i = 0;
    while (<GRAB>) {
         $i++;
         chomp;
         print $_ . "\n";
         unless ($i % 1000)  { print " $i"; }
         unless ($i % 10000) { print "\n"; }
    }

    close(GRAB);

..however, ALL of the data comes up on one line :/

I managed to fix it with this:

Code:

# just a custom cleanup routine.. seeing as IAN.com don't seem 
# to want to fix up their database at their end :(
sub clean_string {

 my @rules_test =split //, $_[0];

 my $back;
 foreach (@rules_test) {
   $back .= $_ unless ord($_) == 0;
 }

 return $back;

}

..but this doesn't work on large files (cos it just times out, and buggers up my server, as it's checking every single character, one at a time, on a 40mb+ file)

Does anyone have any suggestions?

TIA!

Andy

PinkeyNBrain · Aug 20, 2008

$/ can also be called $INPUT_RECORD_SEPARATOR (under "use English;")

I believe $/ defaults to "\n", but can be changed to "\r" or 0x00 or whatever (seeing a couple of different character strings above to guess from).

If you happen to know the source of your data, you may try something like

Code:

# do these two at startup
$line_sep{'default'} = $/ ;
$line_sep{'that_buggered_foreign_feed'} = "\r"; # or whatever

# <later in your code>
$feed_type = &whats_my_feed_type();
$/ = $line_sep{$feed_type};
while (<GRAB>) {
   # process the input
}

OK - obviously I'm doing a lot of hand waving here but your code above doesn't look like you need tutoring as much as a starting idea.
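To make the idea above a bit more concrete, here is a minimal sketch that sets $/ only for the duration of the read and strips NUL bytes with tr/// instead of a per-character loop. The sub name read_cr_records and the sample data are made up for illustration, and "\r" as the separator is an assumption about the feed. Deleting NULs is only a workaround: it gives readable text when everything in the file fits in a single byte per character.

```perl
use strict;
use warnings;

# Hypothetical helper: read CR-terminated records from an open filehandle,
# delete NUL bytes, and return the cleaned records.
sub read_cr_records {
    my ($fh) = @_;
    local $/ = "\r";          # changed for this scope only, restored on exit
    my @records;
    while (my $line = <$fh>) {
        chomp $line;          # chomp removes the current $/, here "\r"
        $line =~ tr/\0//d;    # delete every NUL byte in a single pass
        push @records, $line;
    }
    return @records;
}

# In-memory filehandle standing in for the real data file.
my $sample = "first\rsecond\rthird";
open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
my @records = read_cr_records($fh);
close $fh;

print scalar(@records), " records read\n";
```

Because tr/// works on the whole string at once, it should cope with a 40mb file far better than splitting each line into single characters.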

I've had to do similar but had smaller files to work with and used the following construct to accomplish it

Code:

foreach $line (split /$escape_chr/, <FH>) {
   # process line
}
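
A slightly fuller sketch of that slurp-and-split construct, for files that fit comfortably in memory. The sub name split_whole_file and the "\r" separator are assumptions for the example, not anything taken from the feed itself.

```perl
use strict;
use warnings;

# Hypothetical helper: read the whole filehandle in one go and split it
# on a literal separator string.
sub split_whole_file {
    my ($fh, $separator) = @_;
    local $/;                          # undef $/ means "read everything"
    my $content = <$fh>;
    return () unless defined $content;
    # \Q...\E quotes the separator so it is matched literally.
    return split /\Q$separator\E/, $content;
}

# In-memory filehandle standing in for a small data file.
my $sample = "alpha\rbeta\rgamma\r";
open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
my @lines = split_whole_file($fh, "\r");
close $fh;

print "$_\n" for @lines;
```

Note that split drops trailing empty fields, so the final "\r" does not produce an extra empty line.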

youradds · Aug 20, 2008

Hi,

Thanks, will give that a go

Cheers

Andy

Annihilannic · Aug 20, 2008

I don't think anything funky is going on, it's just that CuteFTP (and in a way, your scripts) are not really catering for the fact that the file is encoded in UTF-16. Those first two funny bytes you see are the ones marking the file as such. I recommend you use recode to convert it to, say, ISO-8859-1 (a.k.a Latin1), or some other character set if you prefer, before processing the data:

Code:

recode -f UTF-16LE/CR..ISO-8859-1 Hotel_Description_FR.txt > Hotel_Description_FR_L1.txt
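
If running recode isn't an option, Perl can do the decoding itself with an encoding layer on open. This is a minimal sketch, assuming the file really is UTF-16 with a byte-order mark and uses a bare carriage return as its line ending. The sub name read_utf16_records and the sample text are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: open $source (a filename, or a reference to a byte
# string for testing) as UTF-16 and return its CR-terminated records.
sub read_utf16_records {
    my ($source) = @_;
    # The UTF-16 encoding reads the byte-order mark to pick the endianness.
    open my $fh, '<:encoding(UTF-16)', $source
        or die "Error reading data. Reason: $!";
    local $/ = "\r";                   # records end in a bare carriage return
    my @records;
    while (my $line = <$fh>) {
        chomp $line;
        push @records, $line;
    }
    close $fh;
    return @records;
}

# Sample bytes: BOM plus two CR-terminated records, encoded as UTF-16LE.
my $bytes   = encode('UTF-16LE', "\x{FEFF}h\x{f4}tel\rplage\r");
my @records = read_utf16_records(\$bytes);

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @records;
```

The records come back as ordinary Perl character strings, so accented characters survive and there are no stray zero bytes left to clean up.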

Annihilannic.

jet042 · Aug 20, 2008

Looking at the hex dump, it looks like whatever OS they are using to create this file uses just the carriage return character (0x0D) for its end-of-line marker. From Wikipedia, that implies they are creating the file on one of Commodore machines, Apple II family, Mac OS up to version 9, or OS-9. I would assume the use of OS-9 as it is the only server-class OS on that list.

I don't think any of that is relevant, but I do have a suggestion. Bear in mind that I am just now reading Learning Perl, so I am the ultimate n00b, but I do use PHP a lot. PHP has a function fread() that lets you set the maximum number of bytes to read from a resource (usually a file pointer, but sometimes a socket, etc). If Perl has a similar function, you could use it to read a set number of bytes at a time into an input queue and then search the queue for that carriage return. Once you hit it, pull out everything before it and process it, then keep everything after it as the start of the next record. The PHP code would look something like this:

PHP:

$inputqueue = '';
while (!feof($fh)) { // $fh is your file handle opened earlier with fopen()
  $inputqueue .= fread($fh, 1024); // read one kilobyte at a time
  // a chunk can hold several carriage returns, so loop until none are left
  while (($cr_loc = strpos($inputqueue, "\r")) !== FALSE) {
    $record = substr($inputqueue, 0, $cr_loc);
    // process $record

    // start over with the rest of the queue
    $inputqueue = substr($inputqueue, $cr_loc + 1);
  }
}
// process whatever is left in $inputqueue (a last record with no trailing "\r")

Not the best, but you get the general idea, I hope. Again, sorry that I can't provide that snippet in Perl.
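
For reference, Perl does have a similar function: read(). Below is a rough Perl sketch of the same chunked approach, assuming records end in "\r". The sub name each_cr_record, the callback interface, and the tiny chunk size in the usage are choices made for this example only.

```perl
use strict;
use warnings;

# Hypothetical helper: read $fh in fixed-size chunks and call $callback
# once per CR-terminated record, carrying partial records over to the
# next chunk.
sub each_cr_record {
    my ($fh, $callback, $chunk_size) = @_;
    $chunk_size ||= 1024;
    my $queue = '';
    while (read($fh, my $chunk, $chunk_size)) {
        $queue .= $chunk;
        # One chunk can contain several records, so loop until no CR is left.
        while ((my $cr = index($queue, "\r")) >= 0) {
            $callback->(substr($queue, 0, $cr));
            substr($queue, 0, $cr + 1) = '';   # drop the record and its CR
        }
    }
    $callback->($queue) if length $queue;      # last record without a CR
    return;
}

# In-memory filehandle and a 2-byte chunk size, to exercise the carry-over.
my $sample = "a\rbb\rccc";
open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
my @seen;
each_cr_record($fh, sub { push @seen, $_[0] }, 2);
close $fh;

print join(',', @seen), "\n";
```

In practice, setting $/ to "\r" gets the same result with less code. The manual buffer is mainly useful when the separator is not a fixed string.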

youradds · Aug 21, 2008

Hi,

Thanks guys - still no joy with any of that stuff

Think I'm just gonna give up, and go with the method I have now (where it has to be downloaded, saved in WordPad, and re-uploaded, in its fixed format). Not ideal - but at least it works

Thanks for all the help and suggestions though guys - much appreciated

Cheers

Andy
