parsing files

Status
Not open for further replies.

la

Programmer
Dec 5, 2004
6
IE
I want to write a file parser and interpreter for a markup language such as HTML or CFML (more like CFML).

I just wanted to get your opinion on the easiest way to go about this, and why.

For example, say you have an HTML file with the following content:

<HTML>
<!-- forget the head we just want page output -->
<BODY>
<P>
<FONT COLOR="red"> How are you? </FONT>
</P>
</BODY>
</HTML>

After you read this file into your Java program, it should output "How are you?" in red text. This would be displayed in a frame or something; it doesn't matter, the main thing is the actual parsing.

NOTE: before you tell me to use the built-in Java HTML parser, remember that I am not trying to parse and interpret HTML. It is another markup language; this is just an example to get an idea of how to parse.

Any advice on just techniques without code would also help. Thanks to all.
 
Write a parser.
Remember that document markup is generally looked at as a tree structure. So you'll probably want to parse the tokens into a tree, but even a list would do what you want.
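
To make the tree idea concrete, here is a minimal sketch of what such a structure could look like for the sample page. The names MarkupNode, element, text, add and outline are made up for illustration; they are not from any library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node type for a markup tree: either an element (tag) or a run of text.
class MarkupNode {
    final String tag;   // element name, or null for a text node
    final String text;  // text content, or null for an element
    final List<MarkupNode> children = new ArrayList<>();

    private MarkupNode(String tag, String text) {
        this.tag = tag;
        this.text = text;
    }

    static MarkupNode element(String tag) { return new MarkupNode(tag, null); }
    static MarkupNode text(String text) { return new MarkupNode(null, text); }

    // Appends a child and returns this node so calls can be chained.
    MarkupNode add(MarkupNode child) {
        children.add(child);
        return this;
    }

    // Depth-first walk that renders the tree as an indented outline.
    String outline() {
        StringBuilder sb = new StringBuilder();
        outline(sb, 0);
        return sb.toString();
    }

    private void outline(StringBuilder sb, int depth) {
        for (int i = 0; i < depth; i++) sb.append("  ");
        sb.append(tag != null ? "<" + tag + ">" : "\"" + text + "\"").append('\n');
        for (MarkupNode child : children) child.outline(sb, depth + 1);
    }
}

class TreeDemo {
    public static void main(String[] args) {
        // The sample page from the question, built by hand instead of parsed.
        MarkupNode root = MarkupNode.element("HTML")
            .add(MarkupNode.element("BODY")
                .add(MarkupNode.element("P")
                    .add(MarkupNode.element("FONT")
                        .add(MarkupNode.text("How are you?")))));
        System.out.print(root.outline());
    }
}
```

Each nesting level in the markup becomes one level in the tree, so generating output later is just a recursive walk over the children.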

So I see parsing as (at least) three phases:
1.) Read the input and break it into tokens, looking for simple syntax errors (StringTokenizer or regular expressions).
2.) Build a structure for your tokens to go into, validating structure as you go.
3.) Process the structure and generate output.
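
The three phases above can be sketched roughly as follows. This is only a toy under stated assumptions: the class and method names (MarkupSketch, tokenize, parse, render, interpret) are invented, the only attribute it understands is COLOR, it does not handle self-closing tags or a ">" inside an attribute value, and "output" here just means printing the text with the colour it inherits.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MarkupSketch {
    // Phase 1 pattern: a comment, an opening or closing tag, or a run of text.
    private static final Pattern TOKEN =
        Pattern.compile("<!--.*?-->|</?[A-Za-z][^>]*>|[^<]+", Pattern.DOTALL);
    // Pulls NAME="value" pairs out of an opening tag.
    private static final Pattern ATTR = Pattern.compile("(\\w+)\\s*=\\s*\"([^\"]*)\"");

    static class Node {
        final String name;   // upper-cased tag name, "#text", or "#root"
        final String value;  // text content for "#text" nodes, else null
        String color;        // COLOR attribute if present
        final List<Node> children = new ArrayList<>();
        Node(String name, String value) { this.name = name; this.value = value; }
    }

    // Phase 1: split the input into tokens, dropping comments and blank text.
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(src);
        int pos = 0;
        while (m.find()) {
            // A gap between matches means something like a stray '<'.
            if (m.start() != pos) throw new IllegalArgumentException("Unexpected input at " + pos);
            pos = m.end();
            String t = m.group();
            if (t.startsWith("<!--") || t.trim().isEmpty()) continue;
            tokens.add(t.startsWith("<") ? t : t.trim());
        }
        if (pos != src.length()) throw new IllegalArgumentException("Unexpected input at " + pos);
        return tokens;
    }

    // Phase 2: build the tree with a stack of open tags, validating nesting as we go.
    static Node parse(List<String> tokens) {
        Node root = new Node("#root", null);
        Deque<Node> open = new ArrayDeque<>();
        open.push(root);
        for (String t : tokens) {
            if (t.startsWith("</")) {
                String name = t.substring(2, t.length() - 1).trim().toUpperCase();
                if (open.size() == 1 || !open.peek().name.equals(name))
                    throw new IllegalArgumentException("Unexpected closing tag " + t);
                open.pop();
            } else if (t.startsWith("<")) {
                String body = t.substring(1, t.length() - 1).trim();
                Node n = new Node(body.split("\\s+")[0].toUpperCase(), null);
                Matcher a = ATTR.matcher(body);
                while (a.find()) if (a.group(1).equalsIgnoreCase("COLOR")) n.color = a.group(2);
                open.peek().children.add(n);
                open.push(n);
            } else {
                open.peek().children.add(new Node("#text", t));
            }
        }
        if (open.size() != 1) throw new IllegalArgumentException("Unclosed tag <" + open.peek().name + ">");
        return root;
    }

    // Phase 3: walk the tree and emit each text run with the colour it inherits.
    static void render(Node n, String color, StringBuilder out) {
        if (n.color != null) color = n.color;
        if (n.value != null) out.append('[').append(color).append("] ").append(n.value).append('\n');
        for (Node c : n.children) render(c, color, out);
    }

    static String interpret(String src) {
        StringBuilder out = new StringBuilder();
        render(parse(tokenize(src)), "default", out);
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "<HTML>\n<!-- forget the head -->\n<BODY>\n<P>\n"
            + "<FONT COLOR=\"red\"> How are you? </FONT>\n</P>\n</BODY>\n</HTML>";
        System.out.print(interpret(page)); // prints "[red] How are you?"
    }
}
```

Keeping the phases as separate methods means you can swap the regex tokenizer for a hand-written scanner later without touching the tree builder or the output code.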

 
Thanks for your feedback, jstreich. That's basically along the same lines as I am approaching it. Simple and to the point.

I find that one of the biggest problems is that there is no help, books or articles (at least I couldn't find any) on interpreting markup languages, as opposed to regular languages such as C and the rest. I find a lot of open source C, C++ and Python interpreters and grammars, and material on how to implement them, but none on markup languages such as HTML, CFML and the like.

Does anyone know of any places to look for resources on writing interpreters or, better still, markup language interpreters?
 
If your markup language is a kind of SGML application, I think you'd find it useful to have a look at the APIs most people use for XML processing: Xerces for parsing, Xalan and Saxon for XSLT transformation, and so on.
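
As a rough illustration of that route, here is a sketch using the SAX API that ships with the JDK (javax.xml.parsers and org.xml.sax). It only works if your markup happens to be well-formed XML, which the sample page is. The class names SaxSketch and ColorHandler, and the idea of tracking a COLOR attribute on a stack, are assumptions made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxSketch {
    // Collects each non-blank text run together with the innermost COLOR in scope.
    static class ColorHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        private final StringBuilder text = new StringBuilder();
        private final Deque<String> colors = new ArrayDeque<>();

        // Emits any buffered text; characters() may deliver one run in several chunks.
        private void flush() {
            String t = text.toString().trim();
            if (!t.isEmpty()) out.append('[').append(colors.peek()).append("] ").append(t).append('\n');
            text.setLength(0);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            flush();
            String color = atts.getValue("COLOR");
            String inherited = colors.isEmpty() ? "default" : colors.peek();
            colors.push(color != null ? color : inherited);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            flush();
            colors.pop();
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // Parses the markup and returns one "[color] text" line per text run.
    static String interpret(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            ColorHandler handler = new ColorHandler();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.out.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse markup as XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<HTML>\n<!-- forget the head -->\n<BODY>\n<P>\n"
            + "<FONT COLOR=\"red\"> How are you? </FONT>\n</P>\n</BODY>\n</HTML>";
        System.out.print(interpret(page)); // prints "[red] How are you?"
    }
}
```

The parser does the tokenizing and the nesting checks for you, so your own code is reduced to the callbacks that decide what each tag means.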

Furthermore, a good place to start when writing a parser is a lex/yacc (lexical/syntactic) structure. I'm sure there are plenty of examples of this, and you'll probably be able to find a Java implementation, such as JFlex with CUP, JavaCC, or ANTLR.

Cheers,

Dian

 