Help with file converting

Status
Not open for further replies.

Kirsle

I'm attempting to write a script to take Prologue SundayPlus media files and convert them into plain text format, keeping track of the different slides in the file.

SP media files are a type of RichText file, with \'s everywhere to define the font styles and everything. From dissecting these files in Notepad, I found that \par is a newline and \qc is the code to tell SP that a new slide is beginning (media files work kinda like PowerPoint in that they all have slides).

Here's an example input file:
Code:
[#GLOBAL_RECT: rect(0, 0, 1024, 768), #opacity: 100, #SHADOW_ON: 0, #SHADOW_COLOR: rgb( 0, 0, 0 ), #SHADOW_OPACITY: 100, #SHADOW_POSITION: "RB", #SHADOW_OFFSET: [4, 4], #FILE_TYPE: "Song", #title: "", #Author: "", #Copyright: "", #CCLI: "", #Cell1: [#RTF: "{\rtf1\ansi\deff0 {\fonttbl{\f0\fswiss Arial;}{\f1\fmodern Monotype Corsiva;}{\f2\froman Times New Roman;}}{\colortbl\red0\green0
\blue0;\red0\green0\blue224;\red224\green0\blue0;\red224\green0\blue224;\red102\green102\blue153;\red51\green153\blue102;\red0\green255
\blue0;\red255\green255\blue0;\red248\green248\blue0;}{\stylesheet{\s0\fs24\ql\li0\ri0\fi0\sb0\sa0\sl0 Normal Text;}{\s2\fs24\ql\li0
\ri0\fi0\sb0\sa0\sl0 Normal;}{\s3\fs24\ql\li0\ri0\fi0\sb0\sa0\sl0 Plain Text;}{\s4\fs130\cf4\ql\li0\ri0\fi0\sb0\sa0\sl0 heading 1;}
{\s5\fs192\cf5\ql\li0\ri0\fi0\sb0\sa0\sl0 heading 2;}{\s6\fs96\cf4 Body Text;}{\s7\b\fs96\cf6 Author;}{\s8\b\fs40 CCLI;}{\s9\b\fs40 Copyright;}
{\s10\fs120\cf7 Lyrics;}{\s11\b\f1\fs144 Title;}}\margl1800 \margr1800 \margt1440 \margb1440 \pard \f0\fs24{\pard \b\f2\fs192\cf8
\qc\par
American Dream\par
Casting Crowns\par
}}", #Align: #center], #CELL2: [#RTF: "{\rtf1\ansi\deff0 {\fonttbl{\f0\fswiss Arial;}{\f1\fnil EUROPA;}{\f2\fmodern Monotype Corsiva;}{\f3\froman Times New Roman;}}{\colortbl
\red0\green0\blue0;\red0\green0\blue224;\red224\green0\blue0;\red224\green0\blue224;\red40\green168\blue208;\red102\green102\blue153;
\red51\green153\blue102;\red0\green255\blue0;\red255\green255\blue0;\red248\green248\blue0;}{\stylesheet{\s0\fs24\ql\li0\ri0\fi0\sb0
\sa0\sl0 Normal Text;}{\s2\fs24\ql\li0\ri0\fi0\sb0\sa0\sl0 Normal;}{\s3\fs24\ql\li0\ri0\fi0\sb0\sa0\sl0 Plain Text;}{\s4\fs130\cf5
\ql\li0\ri0\fi0\sb0\sa0\sl0 heading 1;}{\s5\fs192\cf6\ql\li0\ri0\fi0\sb0\sa0\sl0 heading 2;}{\s6\fs96\cf5 Body Text;}{\s7\b\fs96\cf7 Author;}
{\s8\b\fs40 CCLI;}{\s9\b\fs40 Copyright;}{\s10\fs120\cf8 Lyrics;}{\s11\b\f2\fs144 Title;}}\margl1800 \margr1800 \margt1440 \margb1440 
\pard \f0\fs24{\pard \b\f3\fs192\cf9\qc\par
All work no play\par
may have made Jack\par
a dull boy\par
}}", #Align: #center], #CELL3: [#RTF: "{\rtf1\ansi\deff0 {\fonttbl{\f0\fswiss Arial;}{\f1\fmodern Monotype Corsiva;}{\f2\froman Times New Roman;}}{\colortbl\red0\green0
\blue0;\red0\green0\blue224;\red224\green0\blue0;\red224\green0\blue224;\red102\green102\blue153;\red51\green153\blue102;\red0\green255
\blue0;\red255\green255\blue0;\red248\green248\blue0;}{\stylesheet{\s0\fs24\ql\li0\ri0\fi0\sb0\sa0\sl0 Normal Text;}{\s2\fs24\ql\li0
\ri0\fi0\sb0\sa0\sl0 Normal;}{\s3\fs24\ql\li0\ri0\fi0\sb0\sa0\sl0 Plain Text;}{\s4\fs130\cf4\ql\li0\ri0\fi0\sb0\sa0\sl0 heading 1;}
{\s5\fs192\cf5\ql\li0\ri0\fi0\sb0\sa0\sl0 heading 2;}{\s6\fs96\cf4 Body Text;}{\s7\b\fs96\cf6 Author;}{\s8\b\fs40 CCLI;}{\s9\b\fs40 Copyright;}
{\s10\fs120\cf7 Lyrics;}{\s11\b\f1\fs144 Title;}}\margl1800 \margr1800 \margt1440 \margb1440 \pard \f0\fs24{\pard \b\f2\fs192\cf8
\qc\par
but all work no God\par
has left Jack with\par
a lost soul.\par

You can skim through that and easily find where the plaintext is in it. That bit of file data would display, in 96 pt yellow Times New Roman, three slides:

Code:
American Dream
Casting Crowns

--------

All work no play
may have made Jack
a dull boy

--------

but all work no God
has left Jack with
a lost soul

Now first off, here's my code:

Code:
use Data::Dumper;
opendir (DIR, ".");
foreach my $ptf (sort(grep(/\.ptf$/i, readdir(DIR)))) {
	print "Reading $ptf...\n";

	open (FILE, $ptf);
	my @data = <FILE>;
	close (FILE);
	chomp @data;

	my $out = {}; # Save output in their slide blocks
	my @text = ();
	my $i = -1;

	# Parse out everything except for text
	my $c = 0;
	foreach my $line (@data) {
		$c++;
		print "\t$c\n";
		next unless $line =~ /\\par$/i; # \par tends to mean end-of-lines

		# Look for a keyword that we're starting a new slide.
		if ($line =~ /\\qc/) {
			$i++;
			# print "\a$c\n";
			$out->{$i} = [];
			if ($ptf eq 'I Will Sing Of My Redeemer.ptf') {
				print "$c: $line\n";
			}
		}

		# Skip the -1 paragraph
		next if $i < 0;

		# Remove all unnecessary formatting.
		$line =~ s/\\(\w+)\s*//ig;

		# Change the formatting of special chars.
		$line =~ s~\\:~:~ig;  # \: = :
		$line =~ s~\^\^~"~ig; # ^^ = "
		$line =~ s~\^~'~ig;   # ^  = '
		$line =~ s~{~~ig;     # {}
		$line =~ s~}~~ig;     # {}

		# print "\t$line\n";
		push (@{$out->{$i}}, $line);
	}

	print "$c lines\n";

	print Dumper($out);
	open (OUT, ">$ptf.txt");
	for (my $j = 0; exists $out->{$j}; $j++) {
		print OUT join ("\n", @{$out->{$j}}) . "<para>\n\n";
	}
	close (OUT);
}
closedir (DIR);

On a lot of the files I've tested this against, it works fine. There is one file so far that doesn't split itself up into different slides.

This bit of code should check if \qc exists anywhere in the line, right?

Code:
if ($line =~ /\\qc/) {

I had the code block here to print out helpful details about that line, but it seems to skip line 49 in the file I'm working with, when I know that line 49 has a \qc in it:

Code:
{\s10\fs120\cf7 Lyrics;}{\s11\b\fs144 Title;}}\margl1800 \margr1800 \margt1440 \margb1440 \pard \f0\fs24{\pard \b\f2\fs204\cf8\qc I will tell the

It's the last escaped code in that line before "I will tell"

I've been trying to debug this for the longest time, but it doesn't seem to be doing much good. I've tried specifically printing out $data[48] to show what's on line 49 to the console window, and it shows exactly what IS on line 49. But for some reason, this IF statement just isn't matching here.

The output file looks like this:

I will sing of my Redeemer and His wondrous love to me;<para>


Hymn #539
I Will Sing of
My Redeemer
<para>

Sing, O sing
of my Redeemer,
with His blood He
purchased me;<para>

on the cross He
sealed my pardon,
paid the debt and made me free.

wondrous story,
how, my lost
estate to save,

love and mercy,
He the ransom
freely gave.

of my Redeemer,
with His blood He purchased me;<para>


on the cross He
sealed my pardon,
paid the debt and made me free.<para>

I will praise my
dear Redeemer,
His triumphant
pow'r I'll tell,<para>

how the victory He giveth, over sin and death and hell.<para>

Sing, O sing
of my Redeemer,
with His blood He purchased me;<para>

on the cross He
sealed my pardon,
paid the debt and made me free.<para>

I will sing of my Redeemer and His heav'nly love to me;<para>

He from death to
life hath brought me, Son of God with
Him to be.<para>

Sing, O sing
of my Redeemer,
with His blood He purchased me;<para>

on the cross He
sealed my pardon,
paid the debt and made me free.<para>

The middle section (from "on the cross He" down to "freely gave.") should've been broken into different slides with that <para> tag, but for some reason it's just not matching the \qc on these lines... specifically line 49, which is what I've been debugging with.

Does anybody have any idea what could be wrong?

Thanks in advance.
 
Just a thought: if '\qc' matches when it appears at the beginning of a line but not in the middle of a line, is the syntax correct? I always include the 'm' switch, but this may be the default for this type of search.
Just seems too much of a coincidence to ignore.

Keith
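
For what it's worth, a quick check suggests the 'm' switch isn't the factor here. In Perl, /m only changes where ^ and $ are allowed to match, so an unanchored pattern like /\\qc/ finds \qc anywhere in the string either way. A minimal sketch (the first sample string is trimmed from the line 49 quoted above; the second string and all variable names are made up for illustration):

```perl
use strict;
use warnings;

# Sample line with \qc in the middle (trimmed from the line 49 quoted above).
my $line = '\pard \f0\fs24{\pard \b\f2\fs204\cf8\qc I will tell the';

# An unanchored pattern matches anywhere in the string, with or without /m.
my $plain = ($line =~ /\\qc/)  ? 1 : 0;
my $multi = ($line =~ /\\qc/m) ? 1 : 0;

# /m only affects ^ and $: here it lets ^ match right after an embedded newline.
my $two_lines  = "first\n\\qc second";
my $anchored   = ($two_lines =~ /^\\qc/)  ? 1 : 0;
my $anchored_m = ($two_lines =~ /^\\qc/m) ? 1 : 0;

# prints: plain=1 multi=1 anchored=0 anchored_m=1
print "plain=$plain multi=$multi anchored=$anchored anchored_m=$anchored_m\n";
```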
 
Thanks for the effort, but that wasn't the problem. I added a bit of test code outside of that foreach loop:

Code:
	if ($ptf eq 'I Will Sing Of My Redeemer.ptf') {
		print "Line 49: $data[48]\n";

		if ($data[48] =~ /\\qc/) {
			print "It contains \\qc\n";
		}
		else {
			print "It doesn't contain \\qc\n";
		}

		exit;
	}

It matched, so I looked back up at the other if statement and the surrounding code. I found a little bit of an issue a few lines up: this line 49 that had the \qc on it DID NOT end with a \par. The code was set to skip all lines that didn't end with \par.
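
To make the ordering problem concrete, here is a small sketch of what seems to have happened. The two sample lines are modeled on line 49 and the lyric line after it, and the counters are hypothetical names added just for this demo. It only illustrates why the \qc test was never reached; it isn't necessarily the change that ended up in the script.

```perl
use strict;
use warnings;

# First line starts a slide but has no trailing \par (modeled on line 49);
# the second is an ordinary lyric line.
my @data = (
    '\pard \f0\fs24{\pard \b\f2\fs204\cf8\qc I will tell the',
    'wondrous story,\par',
);

my ($skipped, $slides) = (0, 0);
foreach my $line (@data) {
    # Original order: the \par guard runs first, so the \qc test below
    # is never reached for a line that lacks a trailing \par.
    unless ($line =~ /\\par$/i) { $skipped++; next; }
    $slides++ if $line =~ /\\qc/;
}

# Reordered: look for \qc before deciding whether to skip the line.
my $slides_fixed = 0;
foreach my $line (@data) {
    $slides_fixed++ if $line =~ /\\qc/;
    next unless $line =~ /\\par$/i;
}

# prints: skipped=1 slides=0 slides_fixed=1
print "skipped=$skipped slides=$slides slides_fixed=$slides_fixed\n";
```

Note that the reordered loop still drops the text on the \qc line itself ("I will tell the"), so the guard would need further loosening to keep those words.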

I changed the code accordingly, but now that \par is no longer a requirement, every now and then a few bits and pieces of the formatting surrounding the text get put in with the "plain text".

For example, there are bits and pieces of the formatting that look like this:

Code:
{\s10\fs120\cf7 Lyrics;}

And what I end up with is some lines in the output file beginning with "Lyrics;" and other similar words. I put in a regexp to take out {.+?} from the output files, since these characters are rarely (if ever) used in SundayPlus files, but this just led to more problems.
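
One possible reason for the extra problems (this is an assumption, since the exact failure isn't described) is that the RTF groups are nested, and a non-greedy {.+?} pairs an outer "{" with the first inner "}". A rough sketch using a font table group from the sample data; the variable names are made up for the demo:

```perl
use strict;
use warnings;

# A nested group taken from the font table in the sample data above.
my $rtf = '{\fonttbl{\f0\fswiss Arial;}{\f1\fmodern Monotype Corsiva;}}';

# Non-greedy {.+?} stops at the first closing brace, so it pairs the
# outer "{" with an inner "}" and leaves a stray brace behind.
(my $naive = $rtf) =~ s/\{.+?\}//g;

# Removing only innermost groups, over and over, respects the nesting.
my $nested = $rtf;
1 while $nested =~ s/\{[^{}]*\}//g;

# prints: naive=[}] nested=[]
print "naive=[$naive] nested=[$nested]\n";
```

The innermost-first loop would also strip the group that holds the lyrics if run over a whole cell, so it is not a drop-in fix, just a picture of why the simple pattern misbehaves.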

I'll continue working with it. I'm sure I'll figure it out eventually. If worst comes to worst, though, I can always just use Win32::GuiTest and have it physically copy-and-paste the text from SundayPlus so that it doesn't have to worry about the formatting characters at all.
 
I figured that organizing the input data in a way that's easier to read would help me parse it.

Here's what I converted so far:

Code:
[
	#GLOBAL_RECT: rect(0, 0, 1024, 768),
	#opacity: 100,
	#SHADOW_ON: 1,
	#SHADOW_COLOR: rgb( 0, 0, 0 ),
	#SHADOW_OPACITY: 100,
	#SHADOW_POSITION: "RB",
	#SHADOW_OFFSET: [4, 4],
	#FILE_TYPE: "Song",
	#title: "",
	#Author: "",
	#Copyright: "",
	#CCLI: "",
	#CELL2: [
		#RTF: "{
				\rtf1\ansi\deff0
				{\fonttbl{\f0\fswiss Arial;}
				{\f1\fmodern Monotype Corsiva;}
				{\f2\froman Times New Roman;}
			}
			{
				\colortbl\red0\green0\blue0;
				\red0\green0\blue224;
				\red224\green0\blue0;
				\red224\green0\blue224;
				\red102\green102\blue153;
				\red51\green153\blue102;
				\red0\green255\blue0;
				\red255\green255\blue0;
				\red248\green248\blue0;
			}
			{
				\stylesheet
				{\s0\fs24\ql\li0\ri0\fi0\sb0\sa0\sl0 Normal Text;}
				{\s2\fs24\ql\li0\ri0\fi0\sb0\sa0\sl0 Normal;}
				{\s3\fs24\ql\li0\ri0\fi0\sb0\sa0\sl0 Plain Text;}
				{\s4\fs130\cf4\ql\li0\ri0\fi0\sb0\sa0\sl0 heading 1;}
				{\s5\fs192\cf5\ql\li0\ri0\fi0\sb0\sa0\sl0 heading 2;}
				{\s6\fs96\cf4 Body Text;}
				{\s7\b\fs96\cf6 Author;}
				{\s8\b\fs40 CCLI;}
				{\s9\b\fs40 Copyright;}
				{\s10\fs120\cf7 Lyrics;}
				{\s11\b\f1\fs144 Title;}
			}
			\margl1800
			\margr1800
			\margt1440
			\margb1440
			\pard
			\f0\fs24
			{
				\pard \b\f2\fs204\cf8
				\qc I will sing of my Redeemer and His wondrous love to me;\par
			}
		}",
		#Align: #center,
		#MARKER_NAME: "verse 1",
		#Hotkey: "1"
	],
	#Cell1: [
		#MARKER_NAME: "title and page",
		#RTF: "{
				\rtf1\ansi\deff0
				{\fonttbl{\f0\fswiss Arial;}
				{\f1\fmodern Monotype Corsiva;}
				{\f2\froman Times New Roman;}
			}
			{
				\colortbl\red0\green0\blue0;
				\red0\green0\blue224;
				\red224\green0\blue0;
				\red224\green0\blue224;
				\red102\green102\blue153;
				\red51\green153\blue102;
				\red0\green255\blue0;
				\red255\green255\blue0;
				\red248\green248\blue0;
			}
			{
				\stylesheet
				{\s0\fs24\ql\li0\ri0\fi0\sb0\sa0\sl0 Normal Text;}
				{\s2\fs24\ql\li0\ri0\fi0\sb0\sa0\sl0 Normal;}
				{\s3\fs24\ql\li0\ri0\fi0\sb0\sa0\sl0 Plain Text;}
				{\s4\fs130\cf4\ql\li0\ri0\fi0\sb0\sa0\sl0 heading 1;}
				{\s5\fs192\cf5\ql\li0\ri0\fi0\sb0\sa0\sl0 heading 2;}
				{\s6\fs96\cf4 Body Text;}
				{\s7\b\fs96\cf6 Author;}
				{\s8\b\fs40 CCLI;}
				{\s9\b\fs40 Copyright;}
				{\s10\fs120\cf7 Lyrics;}
				{\s11\b\f1\fs144 Title;}
			}
			\margl1800
			\margr1800
			\margt1440
			\margb1440
			\pard
			\f0\fs24
			{
				\pard
				\b\f2\fs204\cf8
				\qc\par
				Hymn #539\par
				I Will Sing of\par
				My Redeemer\par
			}
		}",
		#Align: #center
	],

It seems to follow this pattern for all the cells. I think I'll code a new converter program (keeping the one I already have in case I mess it up) that will try to follow this format to know how many levels deep to go in order to find the actual lyrics, for each CELL#.
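
The "how many levels deep" idea could be sketched roughly like this. It assumes, based only on the layout above, that the lyrics live in the last group sitting directly inside the outermost {...} (brace depth 2). The function name depth2_groups and the shortened sample string are hypothetical, and the sketch doesn't handle escaped braces (\{ or \}).

```perl
use strict;
use warnings;

# Return the contents of each group that sits directly inside the
# outermost {...}, i.e. at brace depth 2. Deeper braces are kept as text.
sub depth2_groups {
    my ($rtf) = @_;
    my ($depth, $current, @groups) = (0, '');
    foreach my $ch (split //, $rtf) {
        if ($ch eq '{') {
            $depth++;
            next if $depth <= 2;      # don't keep the two outer braces
        }
        elsif ($ch eq '}') {
            $depth--;
            if ($depth == 1) {        # a depth-2 group just closed
                push @groups, $current;
                $current = '';
                next;
            }
        }
        $current .= $ch if $depth >= 2;
    }
    return @groups;
}

# Shortened, made-up cell in the same shape as the data above.
my $rtf = '{\rtf1\ansi{\fonttbl{\f0\fswiss Arial;}}\margb1440 \pard'
        . '{\pard \b\f2\fs204\cf8\qc\par Hymn #539\par}}';

my @groups = depth2_groups($rtf);
my $lyrics = $groups[-1];   # assumption: lyrics are the last depth-2 group

# prints: \pard \b\f2\fs204\cf8\qc\par Hymn #539\par
print "$lyrics\n";
```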
 
Sorry for all the posting, but I found the solution to my problems here!

I went ahead and wrote a new converter program and it works great!

Code:
opendir (DIR, ".");
foreach my $ptf (sort(grep(/\.ptf$/i, readdir(DIR)))) {
	print "Reading $ptf...\n";

	open (FILE, $ptf) or die "Can't read $ptf: $!";
	my @data = <FILE>;
	close (FILE);
	chomp @data;

	# Ignore newline formatting.
	my $content = join ("<<<newline>>>", @data);

	# Search for the individual cells.
	my $cells = {};
	my $regexp = '#CELL(\d+): \[(.*?)\]';
	while ($content =~ s/$regexp//i) {
		my $number = $1;
		my $text = $2;

		# Create an arrayref for this cell number.
		$cells->{$number} = [];

		# Split the cell at the margin definitions, so that
		# RTF and headers are on one side and the lyrics
		# are on the other.
		my ($headers,$lyrics) = split(/\\margb/i, $text, 2);

		# The lyrics are inside the next {block}
		my ($words) = $lyrics =~ /\{(.*?)\}/i;

		# We now have the visible words in this block, along
		# with (maybe) some minor formatting.
		my @lines = split(/<<<newline>>>/, $words);

		foreach my $line (@lines) {
			# Remove all unnecessary formatting.
			$line =~ s/\\(\w+)\s*//ig;

			# Change the formatting of special chars.
			$line =~ s~\\:~:~ig;  # \: = :
			$line =~ s~\^\^~"~ig; # ^^ = "
			$line =~ s~\^~'~ig;   # ^  = '
			$line =~ s~{~~ig;     # {}
			$line =~ s~}~~ig;     # {}
			next unless length $line;

			# Push this line of text into this cell.
			push (@{$cells->{$number}}, $line);
		}
	}

	# Now $cells should be populated with only the plain text
	# that was parsed out of the PTF files.

	open (OUT, ">$ptf.txt") or die "Can't write $ptf.txt: $!";
	for (my $i = 1; exists $cells->{$i}; $i++) {
		print OUT join ("\n", @{$cells->{$i}}) . "<para>\n\n";
	}
	close (OUT);
}
closedir (DIR);
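
For anyone reusing the cleanup part, here is a small standalone sketch of the same substitutions applied to one line. The helper name clean_line and the input string are hypothetical; the ^ and ^^ usage is assumed from the comments in the script. The point it shows is that the ^^ replacement has to run before the single ^ one, otherwise every ^^ would turn into two apostrophes.

```perl
use strict;
use warnings;

# Apply the same cleanup steps, in the same order, to one line.
sub clean_line {
    my ($line) = @_;
    $line =~ s/\\(\w+)\s*//g;  # drop control words such as \qc and \par
    $line =~ s/\\:/:/g;        # \: = :
    $line =~ s/\^\^/"/g;       # ^^ = "  (must run before the single ^)
    $line =~ s/\^/'/g;         # ^  = '
    $line =~ s/[{}]//g;        # stray braces
    return $line;
}

# Hypothetical input line, not taken from a real SundayPlus file.
my $cleaned = clean_line('\qc He said ^^pow^r I^ll tell^^\par');

# prints: He said "pow'r I'll tell"
print "$cleaned\n";
```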
 
I guess that is the problem with writing a routine like that.
You start with nothing and keep adding bits and pieces until you reach a point where the routines contradict each other. I have learnt from experience that once a complicated script is working, re-writing it results in a much smaller, easier to manage script. The benefit of hindsight enables us to include all the criteria we never even thought of when the project started. I do not know any alternative to this development process, maybe some of the more experienced programmers can suggest a better approach.
At least you got it sorted, job done, more knowledge - move on.


Keith
 