Weird error with reading a .txt file :/

Status
Not open for further replies.

youradds

Programmer
Jun 27, 2001
817
GB
Hi,

I'm trying to write some code that reads the IAN.com datafeed; this is the file:


..the code I'm using is:

Code:
    open(GRAB, "<$datafile") || die "Error reading $datafile. Reason: $!";
    my $i = 0;
    while (<GRAB>) {
         $i++;
         chomp;
         print $_ . "\n";
         unless ($i % 1000)  { print " $i"; }   # progress marker every 1,000 lines
         unless ($i % 10000) { print "\n"; }    # ..and a line break every 10,000
    }

    close(GRAB);

..however, ALL of the data comes up on one line :/

I managed to fix it with this:

Code:
# just a custom cleanup routine.. seeing as IAN.com don't seem 
# to want to fix up their database at their end :(
sub clean_string {

    my @rules_test = split //, $_[0];

    my $back = '';                        # initialise, to avoid an undef warning
    foreach (@rules_test) {
        $back .= $_ unless ord($_) == 0;  # drop NUL (0x00) characters
    }

    return $back;
}

..but this doesn't work on large files (cos it just times out, and buggers up my server, as it's doing a check on every single character of a 40MB+ file :()
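Incidentally, the character-by-character loop can be collapsed into a single tr///d, which deletes the NUL bytes in one pass at C speed instead of splitting the string into a Perl list. A minimal sketch (the sub name is made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same effect as clean_string(), but tr///d removes every 0x00 byte
# in one pass without building a per-character list.
sub clean_string_fast {
    my ($str) = @_;
    $str =~ tr/\0//d;    # delete all NUL bytes
    return $str;
}

print clean_string_fast("H\0e\0l\0l\0o\0"), "\n";   # prints "Hello"
```

That removes the timeout symptom, though it still treats the NULs as garbage rather than addressing why they are there.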

Does anyone have any suggestions?

TIA!

Andy
 
Did you unzip the file? Is it a text file?

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Hi,

Yup, it's a text file (separated by |)

The problem is that, for some reason, when a Perl script reads it (or even when I download it and open it in EditPlus), it all comes up as one line.

..doing this:

my $back;
foreach (@rules_test) {
$back .= $_ unless ord($_) == 0;
}

..works fine (once that runs, you can download the file - and it has one entry per line, as it should be).

The problem is that this uses up HUGE amounts of memory - as it has to read 11MB into memory, and then go through every character and check if it needs to be changed.

Basically, I just need to convert the buggered "new lines" into proper ones, so that I can read the file properly with my script (at the moment it takes about 1 hour just to process the data and write it to a "clean" version where newlines exist - I need something that can do it in a single pass, which I can then run).

To be honest, downloading the link I gave in my first post, and opening it to see what I mean - should give you an idea of what problems I'm having.

The most annoying bit about it is that I reported this error 5-6 years ago, and they still haven't fixed it. Shows what they think of the affiliates that bring them in money *mad*

TIA

Andy
 
Sounds like a CRLF problem. As in, they create it on Unix, you are trying to read it under Windows. So as far as they are concerned, it isn't broken...
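If it is just a line-terminator mismatch, normalising them is a one-line substitution; a minimal sketch (the helper name and sample string are made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Normalise all three common line-ending styles to plain \n:
# "\r\n" (Windows), bare "\r" (old Mac), and "\n" (Unix).
sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;   # CR or CRLF both become a single LF
    return $text;
}

print normalize_newlines("a\r\nb\rc\n");   # prints a, b and c on separate lines
```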

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
Try opening the file in a different editor, Word seems to be able to handle \n or \r\n.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Make sure if you download the file you download it as binary and not ASCII.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 
travs69 said:
Make sure if you download the file you download as binary and not ascii.

I'd recommend the reverse actually, that way the line terminators would be converted to those used by the local system for free.

Annihilannic.
 
It's a .zip file. It's binary, so downloading it as ASCII doesn't sound like a good idea.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 
Err, yeah. Sorry, was paying too much attention to the title of the thread. :-$

Since the link provided is https:// and not FTP, I guess you aren't given the choice anyway.

Annihilannic.
 
Hi,

Thanks for the help guys - still no joy though :(

Please note, this is going to be an automated script - so downloading the file, saving it in Word/another editor etc, and then uploading it manually isn't really an option :(

Here's the full example of what I'm using:

Code:
#!/usr/local/bin/perl

    use strict;

    my $datafile = "./Hotel_Description_FR.txt";

#####################################

    system('rm -f Hotel_description_fr.zip');
    system('wget https://www.ian.com/affiliatecenter/include/Hotel_description_fr.zip');
    system('unzip -o Hotel_description_fr.zip');

#####################################

    print "Cleaning up datafile ... \n";

    my $back;
    open(GRAB, "<$datafile") || die "Error reading $datafile. Reason: $!";       
       while (<GRAB>) {
         $back .= $_;
       }
    close(GRAB);

    $back =~ s/\r/\n/g;   # the /s and /i flags were no-ops here

    open(WRITEIT, ">$datafile.new") || die "Error writing $datafile.new. Reason: $!";

      print WRITEIT $back;

    close(WRITEIT); 

  print qq|done...|;

....then, to test reading it:

Code:
#!/usr/local/bin/perl

    use strict;

    my $datafile = "./Hotel_Description_FR.txt";

    open(READIT, "<$datafile.new") || die "Error reading $datafile.new. Reason: $!";
     while (<READIT>) {
      print $_;
     }
    close(READIT); 

  print qq|done...|;

..and all that does is return it all on one line again :(

Any more suggestions?

TIA!

Andy
 
Ok, just tried loads of stuff here:


I tried these ones:

Code:
tr -d '\15\32' < Hotel_Description_FR.txt > Hotel_Description_FR.txt.2

awk '{ sub("\r$", ""); print }' Hotel_Description_FR.txt > Hotel_Description_FR.txt.2

perl -p -e 's/\r$//' < Hotel_Description_FR.txt > Hotel_Description_FR.txt.2


..and then, to see if it worked, I just ran my test.cgi script, which just does a while(<READIT>) { print $_ } thing - but nothing was returned. This is getting really annoying :(

Any suggestions are much appreciated.

Cheers

Andy
 
Hmmm.. in your original code it looks like you're removing all the occurrences of the character 0x00 - is there one (or more) next to each 'real' character if you look at the text file in a hex editor?

If so, the file is probably encoded in UCS-2 (if there is only one of those 0 characters) or UCS-4 (if there are three). UTF-8 is probably more what you're used to using.

For more info, you might want to have a look at perldoc perlunicode and the Encode module.
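Building on that, here's a sketch of what the Encode/PerlIO route could look like, assuming the file turns out to be UTF-16LE with bare-CR line terminators (the sub name and file names are illustrative, not from the feed docs):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: read a UTF-16LE file whose records end in a bare CR, and
# write them back out as UTF-8 with normal Unix newlines. The
# :encoding layer handles the NUL bytes properly instead of us
# stripping them by hand.
sub utf16_cr_to_utf8_lf {
    my ($src, $dst) = @_;
    open(my $in,  '<:encoding(UTF-16LE)', $src) or die "Error reading $src: $!";
    open(my $out, '>:encoding(UTF-8)',    $dst) or die "Error writing $dst: $!";
    local $/ = "\r";                  # records end in a bare CR, not LF
    while (my $line = <$in>) {
        chomp $line;                  # chomp respects $/, so this strips the CR
        $line =~ s/^\x{FEFF}//;       # drop a BOM if it leads the first record
        print $out "$line\n";         # write a normal Unix newline instead
    }
    close $in;
    close $out;
}

# e.g.: utf16_cr_to_utf8_lf('Hotel_Description_FR.txt', 'Hotel_Description_FR.txt.new');
```

Because readline and chomp both operate on decoded characters here, this streams the file record by record instead of slurping 40MB into memory.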
 
Just tested your scripts on a Linux box after downloading the data file and they seem to be working fine to me.

The file has extremely long lines in it - are you sure you just aren't noticing the carriage returns in all of that data? For example, if I run perl addy2 | head -1 I only get the column headings, so they are definitely correctly terminated.

Annihilannic.
 
rharsh, FYI:

Code:
$ file Hotel_Description_FR.txt
Hotel_Description_FR.txt: Little-endian UTF-16 Unicode English character data, with very long lines, with CR line terminators

Annihilannic.
 
Hi,

Ok, just opened the file up in CuteFTP Pro 8 (using the inline editor), and this is what I saw:


..then ran this script:

Code:
#!/usr/local/bin/perl
# French Import Script for IAN.com. 
# (c) UltraNerds.com, a division of PUGDOG(r) Enterprises.
# Brought to you via HotelSQL.com.
 
    use strict;
    use GT::SQL;
    use GT::SQL::Condition;    
    use lib './';
    use Links qw/$IN $DB $CFG/;

    Links::init('./');

    my $datafile = $CFG->{'admin_root_path'} . "/Hotel_Description_FR.txt";


#####################################

    chdir($CFG->{'admin_root_path'});
    system('rm -f Hotel_description_fr.zip');
    system('wget https://www.ian.com/affiliatecenter/include/Hotel_description_fr.zip');
    system('unzip -o Hotel_description_fr.zip');

#####################################

    print "Cleaning up datafile ... fff\n";

    my $back;
    open(GRAB, "<$datafile") || die "Error reading $datafile. Reason: $!";       
       while (<GRAB>) {
         $back .= $_;
       }
    close(GRAB);

    $back =~ s/\r/blaq/sig;

    open(WRITEIT, ">$datafile.new") || die "Error writing $datafile.new. Reason: $!";

      print WRITEIT $back;

    close(WRITEIT); 

  print qq|done...| ;

..and it then looks like:


There is something really funky going on :(

Cheers

Andy
 
Hi,

Annihilannic said:

Just tested your scripts on a Linux box after downloading the data file and they seem to be working fine to me.

The file has extremely long lines in it - are you sure you just aren't noticing the carriage returns in all of that data? For example, if I run perl addy2 | head -1 I only get the column headings, so they are definitely correctly terminated.

..so it works fine with the scripts I provided in the above posts? Even doing a:

tail -1 Hotel_Description_FR.txt comes up with just one long line (if I left it running, I imagine it would just load all 11MB of the data :/)

rharsh said:
Hmmm.. in your original code it looks like you're removing all the occurrences of the character 0x00 - is there one (or more) next to each 'real' character if you look at the text file in a hex editor?

If so, the file is probably encoded in UCS-2 (if there is only one of those 0 characters) or UCS-4 (if there are three). UTF-8 is probably more what you're used to using.

For more info, you might want to have a look at perldoc perlunicode and the Encode module.

Not sure - what kind of hex editor would you recommend?

TIA

Andy
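For what it's worth, a dedicated hex editor isn't strictly needed just to peek at the first bytes - od -c on the server, or a few lines of Perl, will do. A sketch (the helper name is made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Render the first $n bytes of a string as space-separated hex, e.g.
# to spot a UTF-16 BOM (ff fe) or interleaved 00 bytes in a datafeed.
sub hex_head {
    my ($data, $n) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, substr($data, 0, $n);
}

# A UTF-16LE "Hi" starts with the BOM ff fe, then each char
# is followed by a NUL byte:
print hex_head("\xff\xfeH\0i\0", 6), "\n";   # prints "ff fe 48 00 69 00"
```

In practice you would slurp the start of the feed file in :raw mode and pass those bytes to the helper.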
 
Welp, far from an ideal solution - but what I've found is that if you download the file, open it in WordPad, then save and re-upload it - then it works fine. So this is what I ended up using:

Code:
#!/usr/local/bin/perl

    use strict;

    my $datafile = "./Description.txt";
    my $datafile_base = "./Hotel_Description_FR.txt";


#####################################

    system('rm -f Hotel_description_fr.zip');
    system('wget https://www.ian.com/affiliatecenter/include/Hotel_description_fr.zip');
    system('unzip -o Hotel_description_fr.zip');

#####################################

   print qq|\n\n|;

   my $username = &promptUser("Please now download $datafile_base, open in WordPad, then save it as Description.txt - then upload to your server in this folder, and type \"yes\" here.");

    
    open(GRAB, "<$datafile") || die "Error reading $datafile. Reason: $!";       
       while (<GRAB>) {
         print $_ . "\n";
       }
    close(GRAB);



  print qq|done...| ;

sub promptUser {

   my ($promptString,$defaultValue) = @_;

   if ($defaultValue) {
      print $promptString, "[", $defaultValue, "]: ";
   } else {
      print $promptString, ": ";
   }

   $| = 1;               # force a flush after our print
   $_ = <STDIN>;         # get the input from STDIN (presumably the keyboard)

   chomp;

   if ("$defaultValue") {
      return $_ ? $_ : $defaultValue;    # return $_ if it has a value
   } else {
      return $_;
   }
}


Basically, it downloads and decompresses the zip, then asks you to download the .txt file, save it as Description.txt, and upload it - and finally confirm it's uploaded, before it carries on.
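If the file really is UTF-16LE, as the file(1) output earlier in the thread suggested, the WordPad step could in principle be replaced with an Encode-based re-encoding pass inside the script itself, which would make it cron-able. A sketch (untested against the real feed):

```perl
#!/usr/local/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Cron-friendly sketch replacing the manual WordPad step: slurp the
# raw UTF-16LE bytes, decode them (which disposes of the NUL bytes
# properly), fix the line endings, and write the result back as UTF-8.
sub reencode_feed {
    my ($src, $dst) = @_;
    open(my $in, '<:raw', $src) or die "Error reading $src. Reason: $!";
    my $raw = do { local $/; <$in> };      # slurp the whole file
    close $in;

    my $text = decode('UTF-16LE', $raw);
    $text =~ s/^\x{FEFF}//;                # drop the BOM if there is one
    $text =~ s/\r\n?/\n/g;                 # normalise CR / CRLF to LF

    open(my $out, '>:raw', $dst) or die "Error writing $dst. Reason: $!";
    print $out encode('UTF-8', $text);
    close $out;
}

# e.g.: reencode_feed('./Hotel_Description_FR.txt', './Description.txt');
```

This does hold the whole decoded file in memory, so for a 40MB feed the streaming PerlIO-layer approach would be lighter; but either way there is no manual editor step left.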

I was hoping to get it working via cron - but I guess I'm just hoping for too much with this feed :(

The VERY weird thing is that their other feed files work fine (the images, hotel details, and the English versions of their data) - it's just the foreign ones that are buggered :'(

Anyway, if anyone's got any more ideas, I'm all ears - but if not, thanks to everyone who's tried helping in this thread :)

Cheers

Andy
 
