RTF File Parsing 2

Status
Not open for further replies.

audiopro

Programmer
Apr 1, 2004
3,165
GB
I am writing a very basic parser for RTF files but I am having problems understanding the file format. Has anyone had any experience with it?
The format is documented in a specification of over 290 pages, but it only really lists the control codes and other control structures.


Keith
 
I have read many documents about RTF parsers and I am just about making sense of them all. There is a logical file format in there somewhere but it is not an easy format to work with. There are parsers out there but they all seem to be beta versions with their own little quirks.
I needed a way for club members to upload text articles to their website so that they could be added to an articles database. The articles are indexed on content and are dynamically inserted into an existing page, so uploading them as HTML was not really an option. The articles needed basic formatting, i.e. underline, bold, italic and font sizing. RTF files seemed like the best way and this is my attempt at a parser. It works for us but does have some quirky problems which need sorting out. The main one being, if the document contains a backslash, the whole thing falls over. The members are happy to work under that restriction as it is much better than the previous method, which involved them editing the HTML. They post over thirty items each week so this has saved a lot of editing.

Please be gentle with your criticism of my writing style, but your comments and suggested improvements would be much appreciated.
Code:
#!/bin/perl
use CGI::Carp qw(fatalsToBrowser);

#######################################################################

#   	Basic RTF File Parser by Keith Ward
#	file name - rtfparse.cgi
#     call - www.domain.ext/dir/rtfparse.cgi?filename=<file>.rtf

#######################################################################

use warnings;
use strict;
use CGI;

print "Content-type: text/html\n\n";	# prepare for HTML output
my $query = new CGI;


my $WholeFile;
my $x;
my $LENF;
my $ThisChar;
my $Curlies=0;
my $ControlCode='';
my $FontSize;
my $ActiveControl=0;
my $TextContent='';
my $BFlag=0;
my $IFlag=0;
my $UFlag=0;
my $SpanCode=0;
my $Stile;
my $CurLev=0;
my $CurLev2=0;
my $CurLev3=0;
my $DataStream=0;
my $FontTable=0;
my $StyleTable=0;
my $BreakFlag=0;


		# File name is input as part of query string

	my $file=$query->param("filename") || "";
	

		# Open target file

	open (LOG, "<", $file) || die "Cannot open the requested file: $!";
	{ local $/; $WholeFile = <LOG> }
	close (LOG);

		#Get the length of the file for the process loop

	$LENF=length($WholeFile);

		#Check that the selected file is a valid RTF document.

	my $FileCheck=substr($WholeFile,2,3);
	if($FileCheck ne 'rtf'){
		print "Invalid File Type<br>";
		exit;
	}


		# Start of main program loop

	for($x=0; $x< $LENF; ++$x){

			# Take chars in 1 at a time

		$ThisChar =substr($WholeFile,$x,1);

			# Various functions use the curly bracket count as a file position pointer

		if($ThisChar eq '{'){++$Curlies}
		if($ThisChar eq '}'){--$Curlies}

			# RTF files contain sets of data we do not need in this instance
			# Datastream, Font table and stylesheet so we can just bypass them.
			# {\* symbol marks the start of a data stream

		if(substr($WholeFile,$x,3) eq '{\*'){
			$CurLev=$Curlies;
			$DataStream = 1;
		}

			# End of datastream, Font table or Stylesheet - set relevant variable back to zero

		if(($DataStream == 1) && ($Curlies < $CurLev)){$DataStream = 0}
		if(($FontTable == 1) && ($Curlies < $CurLev2)){$FontTable = 0}
		if(($StyleTable ==1) &&($Curlies < $CurLev3)){$StyleTable = 0}


			# Do not process if one of the ignored areas 

		if(($FontTable == 0) && ($DataStream == 0) && ($StyleTable == 0)){

				# If $ThisChar is a possible end of a control word

			if(($ThisChar eq ' ') || ($ThisChar eq '\\') || ($ThisChar eq '{') || ($ThisChar eq '}')){

					# If the assembled word starts with a backslash it is probably a control word
					# Any backslashes in the document body will break the document

				if(substr($ControlCode,0,1) eq '\\'){

						# This is a list of unwanted commands which try to display text
						# These are ignored by making the text content nothing

					if($ControlCode eq '\\title'){$TextContent=''}
					if($ControlCode eq '\\author'){$TextContent=''}
					if($ControlCode eq '\\operator'){$TextContent=''}
					if($ControlCode eq '\\creatim'){$TextContent=''}

					if($TextContent){

						# T E X T   O U T P U T T I N G

							# This section processes the text part of the document
							# various flags are set in the control section below and 
							# the status of those flags sets HTML formatting codes.

							# First the start codes

						if($BFlag == 1){print "\n<b>"}
						if($IFlag == 1){print "\n<i>"}
						if($UFlag == 1){print "\n<u>"}

							# Then the text

						print "$TextContent\n";

							# Then the end codes - notice they are in reverse order

						if($UFlag == 1){print "</u>\n"}
						if($IFlag == 1){print "</i>\n"}
						if($BFlag == 1){print "</b>\n"}
						if($BreakFlag == 1){print "<br>"}

							# Then reset the formatting flags

						$BFlag=0;
						$IFlag=0;
						$UFlag=0;
						$SpanCode=0;
						$BreakFlag=0;

					}else{

						# C O N T R O L   C O D E   H A N D L I N G

						if($ControlCode eq '\\b'){$BFlag=1}
						if($ControlCode eq '\\par'){$BreakFlag=1}
						if($ControlCode eq '\\i'){$IFlag=1}
						if($ControlCode eq '\\ul'){$UFlag=1}
						if($ControlCode =~ /^\\fs(\d+)$/){
							$SpanCode=1;
							$FontSize=$1/2;	# \fsN gives the font size in half-points
						}
					}

						# Reset the content vars to '' after processing

					$ControlCode='';
					$TextContent='';

				}else{

					# Content is not a control word so should be text
					if($ControlCode){
						chomp $ControlCode;

							# Add space which was ignored in the earlier 
							# Control word end check routine 

						if($ThisChar eq ' '){$ControlCode.=' '}

							# Add text to TextContent variable

						$TextContent.=$ControlCode;
					}
				}

					# Reset ControlCode after processing

				$ControlCode='';

			}else{
					# Some stray line end and null chars can get through so ignore them

				if(ord($ThisChar) > 29){
					$ControlCode.=$ThisChar;
				}

					# Set $FontTable to 1 to skip over unwanted font table information

				if($ControlCode eq '\\fonttbl'){
					$FontTable = 1;
					$CurLev2=$Curlies;
				}

					# Set $StyleTable to 1 to skip over unwanted stylesheet information

				if($ControlCode eq '\\stylesheet'){
					$StyleTable = 1;
					$CurLev3=$Curlies;
				}
			}

				# When a control word is followed by another one, the start backslash in the 
				# control word would have been ignored - This line replaces the backslash

			if($ThisChar eq '\\'){$ControlCode=$ThisChar}
		}
	}
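
The brace-depth bookkeeping used above to skip the font table, the stylesheet and the `{\*` groups can also be pulled out into a single routine. What follows is only a sketch under assumptions: `strip_groups` and the sample string are hypothetical and not part of the script above, and any group that begins with one of the supplied prefixes is treated as skippable.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): remove every group that
# opens with one of the given prefixes, including anything nested inside it.
sub strip_groups {
    my ($rtf, @prefixes) = @_;
    my $out   = '';
    my $depth = 0;    # current brace nesting level
    my $skip  = 0;    # depth at which the skipped group opened; 0 = not skipping
    my $len   = length $rtf;

    for (my $i = 0; $i < $len; $i++) {
        my $ch = substr($rtf, $i, 1);

        # An escaped brace or backslash is literal text, not structure
        if ($ch eq '\\' && $i + 1 < $len && substr($rtf, $i + 1, 1) =~ /[\\{}]/) {
            $out .= substr($rtf, $i, 2) unless $skip;
            $i++;
            next;
        }

        if ($ch eq '{') {
            $depth++;
            if (!$skip) {
                for my $p (@prefixes) {
                    if (substr($rtf, $i, length $p) eq $p) { $skip = $depth; last }
                }
            }
        }

        $out .= $ch unless $skip;

        if ($ch eq '}') {
            # Leaving the group that started the skip ends the skip
            $skip = 0 if $skip && $depth == $skip;
            $depth--;
        }
    }
    return $out;
}

my $sample = '{\rtf1{\fonttbl{\f0 Arial;}}{\*\generator Demo;}Hello \b world\b0}';
print strip_groups($sample, '{\*', '{\fonttbl', '{\stylesheet'), "\n";
# prints {\rtf1Hello \b world\b0}
```

Running this as a first pass would leave the main loop with only the body text and formatting codes to deal with, at the cost of walking the file twice.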

Keith
 
What does this mean?

"The main one being, if the document contains a backslash, the whole thing falls over."

Do you get script errors, or does the RTF to HTML translation not work, or what?

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Sorry, that was a bit vague.

If the document to be parsed contains a backslash, the following word is treated as a control code so is removed.
This was one such example from a posted document.
Code:
username\email address
became
Code:
username
address
There are some other minor glitches but the overall function works.
I was really wondering if there is a better approach
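
On the backslash problem: RTF stores a literal backslash in the body text as `\\` (and literal braces as `\{` and `\}`), so a parser can test for those escapes before deciding that a backslash starts a control word. Here is a minimal sketch of that idea, assuming a regex-based tokenizer rather than the character loop above. The names `tokenize_rtf` and `plain_text` and the token layout are invented for illustration, and hex escapes and other control symbols are simply dropped.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: returns a list of [type, value, number] entries.
sub tokenize_rtf {
    my ($rtf) = @_;
    my @tokens;
    while ($rtf =~ /\G(?:
          \\([\\{}])                    # 1: escaped backslash or brace, literal text
        | \\([A-Za-z]+)(-?\d+)?[ ]?     # 2,3: control word, optional number, one delimiter space
        | \\'[0-9A-Fa-f]{2}             # hex-escaped byte, dropped in this sketch
        | ([{}])                        # 4: group open or close
        | ([^\\{}\r\n]+)                # 5: run of plain text
        | [\r\n]+                       # raw line breaks carry no meaning in RTF
        | \\.                           # any other control symbol, ignored here
    )/gcxs) {
        if    (defined $1) { push @tokens, ['text',  $1] }
        elsif (defined $2) { push @tokens, ['ctrl',  $2, $3] }
        elsif (defined $4) { push @tokens, ['group', $4] }
        elsif (defined $5) { push @tokens, ['text',  $5] }
    }
    return @tokens;
}

# Hypothetical helper: join only the text tokens back together.
sub plain_text {
    return join '', map { $_->[1] } grep { $_->[0] eq 'text' } @_;
}

# In single quotes, four backslashes produce the two that RTF uses
# for one literal backslash.
my $body = 'username\\\\email address \b bold\b0';
print plain_text(tokenize_rtf($body)), "\n";
# prints username\email address bold
```

With the escapes recognised first, `username\email address` survives intact, because the word after the backslash is never looked up as a control code.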


Keith
 
You could create a simple hash table that checks if the control sequence is one you want to allow, and if it is not in the hash table, don't do anything with that particular sequence.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
I am not sure how to incorporate the functions into a hash table. I have googled for hash tables but the examples I have seen have a single key and a single value. I would need to put a formula into a hash, i.e. '$BreakFlag=1', if it were to be a neater solution. Some of the functions require more than one action, which is why I chose the 'if' statement route.
At the moment I am only using 12 control codes but I keep fine tuning it and add additional codes as the need arises.
I am looking into including pictures in the documents but work keeps getting in the way.

Keith
 
Using the hash should have no effect on your flagging system. The hash is just there to tell the script whether the \xxxxx sequence is a valid one that will be translated into HTML code. A very short example:

Code:
my %RTF = (
   '\b' => 1,   # just so it has a true value
);

if( $RTF{'\b'} ){$BFlag=1}
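
A hash value can also be a reference to an anonymous sub, which covers the case where one control code needs more than one action. The following is a sketch only, not the original script: `%state`, `%handler`, `flag` and `handle_control` are hypothetical names, and it assumes the control code arrives as a string such as `\fs24`.

```perl
use strict;
use warnings;

my %state = (bold => 0, italic => 0, underline => 0, break => 0, span => 0, font_size => 0);

# Hypothetical helper: build a handler that turns a flag on, or off when
# the control word carries a 0 (as in \b0).
sub flag {
    my ($name) = @_;
    return sub { $state{$name} = (defined $_[0] && $_[0] == 0) ? 0 : 1 };
}

# Keys are control words without the backslash. Values are code references,
# so a single entry can carry out as many actions as it needs.
my %handler = (
    b   => flag('bold'),
    i   => flag('italic'),
    ul  => flag('underline'),
    par => sub { $state{break} = 1 },
    fs  => sub {
        my ($half_points) = @_;
        return unless defined $half_points;
        $state{span}      = 1;
        $state{font_size} = $half_points / 2;   # \fsN is in half-points
    },
);

# Hypothetical helper: split the code into word and number, then dispatch.
sub handle_control {
    my ($code) = @_;
    my ($word, $num) = $code =~ /^\\([A-Za-z]+)(-?\d+)?$/ or return 0;
    my $action = $handler{$word} or return 0;   # unknown words are ignored
    $action->($num);
    return 1;
}

handle_control($_) for '\b', '\fs24', '\email';
print "bold=$state{bold} size=$state{font_size}\n";
# prints bold=1 size=12
```

Adding a new control code then means adding one entry to `%handler`, and anything not listed (such as `\email` here) is skipped without touching the flags.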


------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 