Weird error with reading a .txt file :/

Status
Not open for further replies.

youradds

Programmer
Jun 27, 2001
817
GB
Hi,

I'm trying to write some code that reads the IAN.com datafeed; this is the file:


..the code I'm using is:

Code:
    open(GRAB, "<$datafile") || die "Error reading $datafile. Reason: $!";
    my $i = 0;
    while (<GRAB>) {
         $i++;
         chomp;
         print $_ . "\n";
         unless ($i % 1000)  { print " $i"; }   # progress marker every 1,000 lines
         unless ($i % 10000) { print "\n"; }    # ..and a line break every 10,000
    }

    close(GRAB);

..however, ALL of the data comes up on one line :/

I managed to fix it with this:

Code:
# just a custom cleanup routine.. seeing as IAN.com don't seem 
# to want to fix up their database at their end :(
sub clean_string {

    my @rules_test = split //, $_[0];

    my $back = '';                        # initialise, to avoid an undef warning
    foreach (@rules_test) {
        $back .= $_ unless ord($_) == 0;  # drop NUL (0x00) characters
    }

    return $back;
}

..but this doesn't work on large files (cos it just times out, and buggers up my server, as it's doing a check on every single character of a 40MB+ file :()
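Incidentally, the character-by-character loop can be collapsed into a single tr///d, which deletes the NUL bytes in one pass at C speed instead of splitting the string into a Perl list. A minimal sketch (the sub name is made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same effect as clean_string(), but tr///d removes every 0x00 byte
# in one pass without building a per-character list.
sub clean_string_fast {
    my ($str) = @_;
    $str =~ tr/\0//d;    # delete all NUL bytes
    return $str;
}

print clean_string_fast("H\0e\0l\0l\0o\0"), "\n";   # prints "Hello"
```

That removes the timeout symptom, though it still treats the NULs as garbage rather than addressing why they are there.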

Does anyone have any suggestions?

TIA!

Andy
 
Did you unzip the file? Is it a text file?

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Hi,

Yup, it's a text file (separated by |)

The problem is that, for some reason, when a Perl script reads it (or even when I download it and open it in EditPlus), it all comes up as one line.

..doing this:

my $back;
foreach (@rules_test) {
$back .= $_ unless ord($_) == 0;
}

..works fine (once that runs, you can download the file - and it has one entry per line, as it should be).

The problem is that this uses up HUGE amounts of memory - as it has to read 11MB into memory, and then go through every character and check if it needs to be changed.

Basically, I just need to convert the buggered "new lines" into proper ones, so that I can read the file properly with my script (at the moment it takes about 1 hour just to process the data and write it to a "clean" version where newlines exist - I need something that can do it in a single pass, which I can then run).

To be honest, downloading the link I gave in my first post, and opening it to see what I mean - should give you an idea of what problems I'm having.

The most annoying bit about it is that I reported this error 5-6 years ago, and they still haven't fixed it. Shows what they think of the affiliates that bring them in money *mad*

TIA

Andy
 
Sounds like a CRLF problem. As in, they create it on Unix, you are trying to read it under Windows. So as far as they are concerned, it isn't broken...
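If it is just a line-terminator mismatch, normalising them is a one-line substitution; a minimal sketch (the helper name and sample string are made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Normalise all three common line-ending styles to plain \n:
# "\r\n" (Windows), bare "\r" (old Mac), and "\n" (Unix).
sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;   # CR or CRLF both become a single LF
    return $text;
}

print normalize_newlines("a\r\nb\rc\n");   # prints a, b and c on separate lines
```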

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
Try opening the file in a different editor, Word seems to be able to handle \n or \r\n.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Make sure if you download the file you download it as binary and not ASCII.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 
travs69 said:
Make sure if you download the file you download as binary and not ascii.

I'd recommend the reverse actually, that way the line terminators would be converted to those used by the local system for free.

Annihilannic.
 
It's a .zip file. It's binary, so downloading it as ASCII doesn't sound like a good idea.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 
Err, yeah. Sorry, was paying too much attention to the title of the thread. :-$

Since the link provided is https:// and not FTP, I guess you aren't given the choice anyway.

Annihilannic.
 
Hi,

Thanks for the help guys - still no joy though :(

Please note, this is going to be an automated script - so downloading the file, saving it in Word/another editor etc, and then uploading it manually isn't really an option :(

Here's the full example of what I'm using:

Code:
#!/usr/local/bin/perl

    use strict;

    my $datafile = "./Hotel_Description_FR.txt";

#####################################

    system('rm -f Hotel_description_fr.zip');
    system('wget https://www.ian.com/affiliatecenter/include/Hotel_description_fr.zip');
    system('unzip -o Hotel_description_fr.zip');

#####################################

    print "Cleaning up datafile ... \n";

    my $back;
    open(GRAB, "<$datafile") || die "Error reading $datafile. Reason: $!";       
       while (<GRAB>) {
         $back .= $_;
       }
    close(GRAB);

    $back =~ s/\r/\n/g;   # the /s and /i flags were no-ops here

    open(WRITEIT, ">$datafile.new") || die "Error writing $datafile.new. Reason: $!";

      print WRITEIT $back;

    close(WRITEIT); 

  print qq|done...|;

....then, to test reading it:

Code:
#!/usr/local/bin/perl

    use strict;

    my $datafile = "./Hotel_Description_FR.txt";

    open(READIT, "<$datafile.new") || die "Error reading $datafile.new. Reason: $!";
     while (<READIT>) {
      print $_;
     }
    close(READIT); 

  print qq|done...|;

..and all that does is return it all on one line again :(

Any more suggestions?

TIA!

Andy
 
Ok, just tried loads of stuff here:


I tried these ones:

Code:
tr -d '\15\32' < Hotel_Description_FR.txt > Hotel_Description_FR.txt.2

awk '{ sub("\r$", ""); print }' Hotel_Description_FR.txt > Hotel_Description_FR.txt.2

perl -p -e 's/\r$//' < Hotel_Description_FR.txt > Hotel_Description_FR.txt.2


..and then, to see if it worked, I just ran my test.cgi script, which just does a while(<READIT>) { print $_ } thing - but nothing was returned. This is getting really annoying :(

Any suggestions are much appreciated.

Cheers

Andy
 
Hmmm.. in your original code it looks like you're removing all the occurrences of the character 0x00 - is there one (or more) next to each 'real' character if you look at the text file in a hex editor?

If so, the file is probably encoded in UCS-2 (if there is only one of those 0 characters) or UCS-4 (if there are three). UTF-8 is probably more what you're used to using.

For more info, you might want to have a look at perldoc perlunicode and the Encode module.
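Building on that, here's a sketch of what the Encode/PerlIO route could look like, assuming the file turns out to be UTF-16LE with bare-CR line terminators (the sub name and file names are illustrative, not from the feed docs):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: read a UTF-16LE file whose records end in a bare CR, and
# write them back out as UTF-8 with normal Unix newlines. The
# :encoding layer handles the NUL bytes properly instead of us
# stripping them by hand.
sub utf16_cr_to_utf8_lf {
    my ($src, $dst) = @_;
    open(my $in,  '<:encoding(UTF-16LE)', $src) or die "Error reading $src: $!";
    open(my $out, '>:encoding(UTF-8)',    $dst) or die "Error writing $dst: $!";
    local $/ = "\r";                  # records end in a bare CR, not LF
    while (my $line = <$in>) {
        chomp $line;                  # chomp respects $/, so this strips the CR
        $line =~ s/^\x{FEFF}//;       # drop a BOM if it leads the first record
        print $out "$line\n";         # write a normal Unix newline instead
    }
    close $in;
    close $out;
}

# e.g.: utf16_cr_to_utf8_lf('Hotel_Description_FR.txt', 'Hotel_Description_FR.txt.new');
```

Because readline and chomp both operate on decoded characters here, this streams the file record by record instead of slurping 40MB into memory.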
 
Just tested your scripts on a Linux box after downloading the data file and they seem to be working fine to me.

The file has extremely long lines in it - are you sure you just aren't noticing the carriage returns in all of that data? For example, if I run perl addy2 | head -1 I only get the column headings, so they are definitely correctly terminated.

Annihilannic.
 
rharsh, FYI:

Code:
$ file Hotel_Description_FR.txt
Hotel_Description_FR.txt: Little-endian UTF-16 Unicode English character data, with very long lines, with CR line terminators

Annihilannic.
 
Hi,

Ok, just opened the file up in CuteFTP Pro 8 (using the inline editor), and this is what I saw:


..then ran this script:

Code:
#!/usr/local/bin/perl
# French Import Script for IAN.com. 
# (c) UltraNerds.com, a division of PUGDOG(r) Enterprises.
# Brought to you via HotelSQL.com.
 
    use strict;
    use GT::SQL;
    use GT::SQL::Condition;    
    use lib './';
    use Links qw/$IN $DB $CFG/;

    Links::init('./');

    my $datafile = $CFG->{'admin_root_path'} . "/Hotel_Description_FR.txt";


#####################################

    chdir($CFG->{'admin_root_path'});
    system('rm -f Hotel_description_fr.zip');
    system('wget https://www.ian.com/affiliatecenter/include/Hotel_description_fr.zip');
    system('unzip -o Hotel_description_fr.zip');

#####################################

    print "Cleaning up datafile ... fff\n";

    my $back;
    open(GRAB, "<$datafile") || die "Error reading $datafile. Reason: $!";       
       while (<GRAB>) {
         $back .= $_;
       }
    close(GRAB);

    $back =~ s/\r/blaq/sig;

    open(WRITEIT, ">$datafile.new") || die "Error writing $datafile.new. Reason: $!";

      print WRITEIT $back;

    close(WRITEIT); 

  print qq|done...| ;

..and it then looks like:


There is something really funky going on :(

Cheers

Andy
 
Hi,

Annihilannic said:

Just tested your scripts on a Linux box after downloading the data file and they seem to be working fine to me.

The file has extremely long lines in it - are you sure you just aren't noticing the carriage returns in all of that data? For example, if I run perl addy2 | head -1 I only get the column headings, so they are definitely correctly terminated.

..so it works fine with the scripts I provided in the above posts? Even doing a:

tail -1 Hotel_Description_FR.txt comes up with just one long line (if I left it running, I imagine it would just load all 11MB of the data :/)

rharsh said:
Hmmm.. in your original code it looks like you're removing all the occurrences of the character 0x00 - is there one (or more) next to each 'real' character if you look at the text file in a hex editor?

If so, the file is probably encoded in UCS-2 (if there is only one of those 0 characters) or UCS-4 (if there are three). UTF-8 is probably more what you're used to using.

For more info, you might want to have a look at perldoc perlunicode and the Encode module.

Not sure - what kind of hex editor would you recommend?

TIA

Andy
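For what it's worth, a dedicated hex editor isn't strictly needed just to peek at the first bytes - od -c on the server, or a few lines of Perl, will do. A sketch (the helper name is made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Render the first $n bytes of a string as space-separated hex, e.g.
# to spot a UTF-16 BOM (ff fe) or interleaved 00 bytes in a datafeed.
sub hex_head {
    my ($data, $n) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, substr($data, 0, $n);
}

# A UTF-16LE "Hi" starts with the BOM ff fe, then each char
# is followed by a NUL byte:
print hex_head("\xff\xfeH\0i\0", 6), "\n";   # prints "ff fe 48 00 69 00"
```

In practice you would slurp the start of the feed file in :raw mode and pass those bytes to the helper.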
 
Welp, far from an ideal solution - but what I've found is that if you download the file, open it in WordPad, then save and re-upload it - then it works fine. So this is what I ended up using:

Code:
#!/usr/local/bin/perl

    use strict;

    my $datafile = "./Description.txt";
    my $datafile_base = "./Hotel_Description_FR.txt";


#####################################

    system('rm -f Hotel_description_fr.zip');
    system('wget https://www.ian.com/affiliatecenter/include/Hotel_description_fr.zip');
    system('unzip -o Hotel_description_fr.zip');

#####################################

   print qq|\n\n|;

   my $username = &promptUser("Please now download $datafile_base, open in WordPad, then save it as Description.txt - then upload to your server in this folder, and type \"yes\" here.");

    
    open(GRAB, "<$datafile") || die "Error reading $datafile. Reason: $!";       
       while (<GRAB>) {
         print $_ . "\n";
       }
    close(GRAB);



  print qq|done...| ;

sub promptUser {

   my ($promptString,$defaultValue) = @_;

   if ($defaultValue) {
      print $promptString, "[", $defaultValue, "]: ";
   } else {
      print $promptString, ": ";
   }

   $| = 1;               # force a flush after our print
   $_ = <STDIN>;         # get the input from STDIN (presumably the keyboard)

   chomp;

   if ("$defaultValue") {
      return $_ ? $_ : $defaultValue;    # return $_ if it has a value
   } else {
      return $_;
   }
}


Basically, it downloads and decompresses the zip, then asks you to download the .txt file, save it as Description.txt, and upload it - and finally confirm it's uploaded, before it carries on.
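If the file really is UTF-16LE, as the file(1) output earlier in the thread suggested, the WordPad step could in principle be replaced with an Encode-based re-encoding pass inside the script itself, which would make it cron-able. A sketch (untested against the real feed):

```perl
#!/usr/local/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Cron-friendly sketch replacing the manual WordPad step: slurp the
# raw UTF-16LE bytes, decode them (which disposes of the NUL bytes
# properly), fix the line endings, and write the result back as UTF-8.
sub reencode_feed {
    my ($src, $dst) = @_;
    open(my $in, '<:raw', $src) or die "Error reading $src. Reason: $!";
    my $raw = do { local $/; <$in> };      # slurp the whole file
    close $in;

    my $text = decode('UTF-16LE', $raw);
    $text =~ s/^\x{FEFF}//;                # drop the BOM if there is one
    $text =~ s/\r\n?/\n/g;                 # normalise CR / CRLF to LF

    open(my $out, '>:raw', $dst) or die "Error writing $dst. Reason: $!";
    print $out encode('UTF-8', $text);
    close $out;
}

# e.g.: reencode_feed('./Hotel_Description_FR.txt', './Description.txt');
```

This does hold the whole decoded file in memory, so for a 40MB feed the streaming PerlIO-layer approach would be lighter; but either way there is no manual editor step left.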

I was hoping to get it working via cron - but I guess I'm just hoping for too much with this feed :(

The VERY weird thing is that their other feed files work fine (the images, hotel details, and the English versions of their data) - it's just the foreign ones that are buggered :'(

Anyway, if anyone's got any more ideas, I'm all ears - but if not, thanks to everyone who's tried helping in this thread :)

Cheers

Andy
 
