Convert some files from html to plaintext

mailint1 · Nov 11, 2007

I have many html files named like these:

c:\dir\femo-black.html
c:\dir\loren-white.html
c:\dir\spark-white.html
c:\dir\kim-black.html
c:\dir\paul-white.html

How can I convert only the files named "c:\dir\*-white.html" to plaintext files named c:\dir\(original filename)-text.txt?

BTW, do you know a better Perl module than HTML::FormatText (http://search.cpan.org/~sburke/HTML-Format-2.04/lib/HTML/FormatText.pm) to convert HTML to plaintext?

travs69 · Nov 11, 2007

What do you mean by "plain text"? HTML is plain text. Do you want to pull the text out of the documents without the HTML code or something?

Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;

mailint1 · Nov 11, 2007

Yes, I mean the text without the HTML code, like what you see in a text-mode browser (e.g. Lynx or W3M).

MillerH · Nov 11, 2007

You need to be more specific. There are lots of different meanings for html to text. One would be to simply remove the html tags. Another would be to have some quasi rendering of the html in text.

For rendering look into HTML::FormatText. For stripping look into HTML::Parser, specifically the hstrip example in the examples directory.

There are plenty of other solutions, but I suggest that you just search CPAN until you find something that fits your purpose.

- Miller
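To make the rendering route concrete, here is a minimal sketch of the batch job using HTML::TreeBuilder and HTML::FormatText (both from CPAN, not core). It is only one way to do it. The helper names text_name_for and html_to_text are made up for illustration, the 72-column right margin is an arbitrary choice, and I am assuming "(original filename)" means the name without the .html suffix, so loren-white.html becomes loren-white-text.txt.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical helper: build the output name next to the input file,
# e.g. c:/dir/loren-white.html -> c:/dir/loren-white-text.txt
sub text_name_for {
    my ($html_file) = @_;
    my ($basename, $dir) = fileparse($html_file, qr{\.html?}i);
    return "${dir}${basename}-text.txt";
}

# Hypothetical helper: render one HTML file as formatted plain text.
sub html_to_text {
    my ($html_file) = @_;

    # Loaded here rather than at the top so the naming helper can be
    # tried out even where the CPAN modules are not installed.
    require HTML::TreeBuilder;
    require HTML::FormatText;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($html_file)
        or die "Cannot parse $html_file: $!";

    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
    my $text = $formatter->format($tree);
    $tree->delete;    # release the parse tree
    return $text;
}

# Forward slashes work on Windows and avoid backslash quoting in glob.
for my $html_file (glob 'c:/dir/*-white.html') {
    my $txt_file = text_name_for($html_file);

    open my $out, '>', $txt_file or die "Cannot write $txt_file: $!";
    print {$out} html_to_text($html_file);
    close $out or die "Cannot close $txt_file: $!";

    print "$html_file -> $txt_file\n";
}
```

The glob pattern does the "only the white files" selection, so no name splitting is needed for that part. If the pages contain non-ASCII characters you may also want to set an encoding layer on the output handle.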

chrismassey · Nov 12, 2007

Hey, I have a question.

For my own benefit, I wrote a short script that performs a task similar to what mailint1 requires. I receive the syntax error "Bad name after txt' at Script line 12." Does anybody know why there is a "bad name"?

Code:

###############
#! /usr/bin/perl
use strict;
use CGI ':standard';
###############

my @file = ('c:\dir\femo-black.html', 'c:\dir\loren-white.html', 'c:\dir\spark-white.html', 'c:\dir\kim-black.html', 'c:\dir\paul-white.html');

print "Content-type: text/html\n\n";

my $new_name_p1 = 'c:\dir\';
my $new_name_p2 = '-text.txt';

foreach (@file) {
	my @split_file = split(/\\/, $_);
	my @split_off_ext = split(/\./, $split_file[-1]);
	my @split_file_name = split(/\-/, $split_off_ext[0]);
	if ($split_file_name[1] eq "white") {
		$_ = "$new_name_p1 . $split_file_name[0] . $new_name_p2";
	}
	print "<p>$_";
}

Thanks, Chris

MillerH · Nov 12, 2007

Chris,

I didn't even bother trying to count which line #12 was. However, here is how I would suggest that you modify your coding style to be more self-documenting.

Code:

#! /usr/bin/perl

use CGI ':standard';
use File::Basename qw(fileparse);
use File::Spec::Functions qw(catfile);

use strict;

my @files = qw(
	c:\dir\femo-black.html
	c:\dir\loren-white.html
	c:\dir\spark-white.html
	c:\dir\kim-black.html
	c:\dir\paul-white.html
);

print "Content-type: text/html\n\n";

FILE:
foreach my $filename (@files) {
	my ($basename, $path, $suffix) = fileparse($filename, qr{\.[^.]*});
	
	my ($person, $color) = $basename =~ m{^(.*)-(.*)$}
		? ($1, $2)
		: do {warn "Unrecognized file: $filename"; next FILE};
	
	if ($color eq "white") {
		$filename = catfile($path, "${person}-text.txt");
	}

	print "<p>$filename\n";
}

1;

__END__

------------------------------------------------------------
Pragmas (perl 5.8.8) used:
    strict - Perl pragma to restrict unsafe constructs

Core (perl 5.8.8) modules used:
    CGI - Simple Common Gateway Interface Class
    File::Basename - Parse file paths into directory, filename and suffix.
    File::Spec::Functions - portably perform operations on file names

- Miller
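
For anyone who lands here with the same "Bad name after txt'" message: as far as I can tell it comes from the line above the one reported. In 'c:\dir\' the final backslash escapes the closing quote, so the string keeps going until the opening quote of '-text.txt', and Perl then tries to parse -text.txt' as code. A second problem in the same script is that the dots in "$new_name_p1 . $split_file_name[0] . $new_name_p2" sit inside the double quotes, so they end up in the string instead of concatenating. A small sketch of both fixes, keeping Chris's variable names (text_name is a name I made up for the example):

```perl
use strict;
use warnings;

# In a single-quoted string \' is an escaped quote and \\ is a literal
# backslash, so a trailing backslash has to be doubled.
my $new_name_p1 = 'c:\dir\\';
my $new_name_p2 = '-text.txt';

# Hypothetical helper: join the pieces into the new file name.
sub text_name {
    my ($person) = @_;

    # Concatenate outside the quotes; inside "..." the dots and spaces
    # would be copied into the result literally.
    return $new_name_p1 . $person . $new_name_p2;
}

print text_name('loren'), "\n";    # prints c:\dir\loren-text.txt
```

Plain interpolation, "${new_name_p1}${person}${new_name_p2}", would work just as well as the explicit dots.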
