Convert some files from html to plaintext

Status
Not open for further replies.

mailint1 (Technical User), Nov 10, 2007
I have many html files named like these:

c:\dir\femo-black.html
c:\dir\loren-white.html
c:\dir\spark-white.html
c:\dir\kim-black.html
c:\dir\paul-white.html

How can I convert only the files named "c:\dir\*-white.html" to plaintext files named c:\dir\(original filename)-text.txt?

BTW, do you know a better Perl module than HTML::FormatText for converting HTML to plaintext?
 
What do you mean by "plain text"? HTML is plain text. Do you want to pull the text out of the documents without the HTML code, or something like that?



~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 
Yes, I mean the text without the HTML code, like what you see in a text-mode browser (e.g. Lynx or W3M).
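
For the file-selection half of the question, here is a minimal sketch using core modules only. It assumes "(original filename)" means the base name without the .html extension, so loren-white.html becomes loren-white-text.txt, and that forward slashes are acceptable in the directory path (Perl on Windows generally accepts them). The names text_path_for, convert_white_files and the $to_text callback are made up for illustration; the actual HTML-to-text step is deliberately left to whatever routine you plug in.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Glob qw(bsd_glob);

# Map "c:/dir/loren-white.html" to "c:/dir/loren-white-text.txt".
# Returns nothing for files that do not end in "-white.html".
sub text_path_for {
    my ($html_path) = @_;
    my ($name, $dir, $ext) = fileparse($html_path, qr{\.html}i);
    return unless $ext && $name =~ /-white\z/;
    return "${dir}${name}-text.txt";
}

# Convert every matching file in $dir with a caller-supplied converter
# ($to_text is a hypothetical code ref: HTML string in, text string out).
sub convert_white_files {
    my ($dir, $to_text) = @_;
    # bsd_glob does not split the pattern on whitespace, unlike plain glob.
    for my $in (bsd_glob("$dir/*-white.html")) {
        my $out = text_path_for($in) or next;
        open my $ifh, '<', $in or die "Cannot read $in: $!";
        my $html = do { local $/; <$ifh> };   # slurp the whole file
        close $ifh;
        open my $ofh, '>', $out or die "Cannot write $out: $!";
        print {$ofh} $to_text->($html);
        close $ofh or die "Cannot close $out: $!";
    }
}
```

You would call it as convert_white_files('c:/dir', $converter), where $converter is any code reference that turns an HTML string into text, for example one built on HTML::FormatText.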
 
You need to be more specific. "HTML to text" can mean several different things. One is simply removing the HTML tags. Another is producing a quasi-rendering of the HTML as text.

For rendering, look into HTML::FormatText. For stripping, look into HTML::Parser, specifically the hstrip example in the examples directory.

There are plenty of other solutions, but I suggest that you just search CPAN until you find something that fits your purpose.

- Miller
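
To make the difference between the two approaches concrete, here is a rough sketch of each. It assumes HTML::TreeBuilder, HTML::FormatText and HTML::Parser are installed from CPAN; the function names render_text and strip_tags and the margin values are only illustrative, not anything either module prescribes.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN
use HTML::FormatText;    # CPAN
use HTML::Parser;        # CPAN

# Rendering: lay the document out roughly as a text-mode browser would.
sub render_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $text = HTML::FormatText->new(leftmargin => 0, rightmargin => 72)
                               ->format($tree);
    $tree->delete;   # free the parse tree
    return $text;
}

# Stripping: keep only the text between tags, with entities decoded.
sub strip_tags {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        text_h      => [ sub { $text .= shift }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;
    return $text;
}

my $html = '<h1>Title</h1><p>Hello <b>world</b> &amp; friends</p>';
print render_text($html);
print strip_tags($html), "\n";
```

The rendered version keeps headings and paragraphs apart, while the stripped version simply runs the text together, so which one you want depends on how much of the page layout has to survive.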
 
Hey, I have a question.

For my own benefit, I wrote a short script that performs a task similar to what mailint1 requires. I receive the syntax error "Bad name after txt' at Script line 12.". Does anybody know why there is a "bad name"?

Code:
###############
#! /usr/bin/perl
use strict;
use CGI ':standard';
###############

my @file = ('c:\dir\femo-black.html', 'c:\dir\loren-white.html', 'c:\dir\spark-white.html', 'c:\dir\kim-black.html', 'c:\dir\paul-white.html');

print "Content-type: text/html\n\n";

my $new_name_p1 = 'c:\dir\';
my $new_name_p2 = '-text.txt';

foreach (@file) {
	my @split_file = split(/\\/, $_);
	my @split_off_ext = split(/\./, $split_file[-1]);
	my @split_file_name = split(/\-/, $split_off_ext[0]);
	if ($split_file_name[1] eq "white") {
		$_ = "$new_name_p1 . $split_file_name[0] . $new_name_p2";
	}
print "<p>$_";
}

Thanks, Chris
 
Chris,

I didn't even bother trying to count which line was #12. However, here is how I would suggest modifying your coding style to be more self-documenting.

Code:
#! /usr/bin/perl

use CGI ':standard';
use File::Basename qw(fileparse);
use File::Spec::Functions qw(catfile);

use strict;

my @files = qw(
	c:\dir\femo-black.html
	c:\dir\loren-white.html
	c:\dir\spark-white.html
	c:\dir\kim-black.html
	c:\dir\paul-white.html
);

print "Content-type: text/html\n\n";

FILE:
foreach my $filename (@files) {
	my ($basename, $path, $suffix) = fileparse($filename, qr{\.[^.]*});
	
	my ($person, $color) = $basename =~ m{^(.*)-(.*)$}
		? ($1, $2)
		: do {warn "Unrecognized file: $filename"; next FILE};
	
	if ($color eq "white") {
		$filename = catfile($path, "${person}-text.txt");
	}

	print "<p>$filename\n";
}

1;

__END__
------------------------------------------------------------
Pragmas (perl 5.8.8) used :
  • strict - Perl pragma to restrict unsafe constructs
Core (perl 5.8.8) Modules used :
  • CGI - Simple Common Gateway Interface Class
  • File::Basename - Parse file paths into directory, filename and suffix.
  • File::Spec::Functions - portably perform operations on file names

- Miller
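
As for the error itself, it appears to come from line 11 of Chris's script rather than line 12. In a single-quoted Perl string a backslash still escapes a following quote, so 'c:\dir\' does not end where it looks like it does; the string runs on into line 12, which is presumably why the parser then trips over -text.txt'. Separately, the assignment inside the if block puts the . operators inside double quotes, so the resulting name would contain literal " . " sequences. A minimal sketch of both fixes, keeping the script's own naming (which drops the "-white" part); the helper name text_name is just for illustration:

```perl
use strict;
use warnings;

# In a single-quoted string, \' is an escaped quote, so 'c:\dir\' never closes.
# Doubling the final backslash gives the intended value.
my $dir    = 'c:\dir\\';     # value: c:\dir\
my $suffix = '-text.txt';

# Build the new name for a "<person>-white.html" path; returns undef otherwise.
sub text_name {
    my ($path) = @_;
    my ($person) = $path =~ m{([^\\/]+)-white\.html\z} or return;
    return $dir . $person . $suffix;   # concatenate outside the quotes
}

for my $path ('c:\dir\loren-white.html', 'c:\dir\kim-black.html') {
    my $new = text_name($path);
    print defined $new ? "$new\n" : "$path (skipped)\n";
}
```

This prints c:\dir\loren-text.txt for the first path and reports the second as skipped.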
 