Removing tags from HTML code


(OP)
I have some HTML code in a TRichEdit and I want to strip out all the tags to leave the text I am interested in. For example, filtering this HTML code:

<HTML>
<HEAD>
  <TITLE> My Site </TITLE>
</HEAD>
<BODY>
  <B> Lots of useful information </B>
  <H1> And some more </H1>
</BODY>
</HTML>

would give the following as output:

  My Site
  Lots of useful information
  And some more


Here is the Delphi code I have so far:

  startPos := 0;
  lineNo := 0;
  with richEditHTML do
  begin
    textLen := Length(richEditHTML.Text);
    repeat
      beginFound := richEditHTML.FindText('<', startPos, textLen, []);
      if beginFound <> - 1 then
      begin
        startPos := beginFound;
        textLen := textLen - startPos;
        endFound := richEditHTML.FindText('>', startPos, textLen, []);
        SelStart := beginFound;
        SelLength := (endFound - beginFound) + 1;
        SelText := '' + #13#10;
        SelStart := SelStart + 1;
        Inc(lineNo);
      end;
    until (beginFound = -1) OR (lineNo = 189);
  end;

Unfortunately, I had to put a limit on how many tags it removes because it seems to mess up when it finds the 190th tag! The code works as required up to this point. Another problem is that lots of whitespace is still floating around after doing this. Btw, the text is about 61,000 characters over 1300 lines.

Any help would be much appreciated!

Clive
www.kucu.co.uk
Ex nihilo, nihil fit (Out of nothing, nothing comes)
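
A possible reading of the 190-tag limit: `textLen := textLen - startPos` subtracts the absolute position of each tag on every pass, so the search length handed to FindText shrinks much faster than the text itself and may eventually stop short of the remaining tags. One way to sidestep the selection bookkeeping altogether is to work on a plain string and assign the result back to the control afterwards. The following is a minimal sketch of that idea, not the original poster's method; `StripTags`, `TidyLines` and `Sample` are hypothetical names, and the scanner assumes simple markup (no `>` inside attribute values, no comments or script blocks, no entity decoding).

```pascal
program StripTagsDemo;

{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils, Classes;

// Hypothetical helper: copies every character that lies outside a <...> pair.
// Each tag is replaced by one space so neighbouring words do not fuse.
function StripTags(const Html: string): string;
var
  i: Integer;
  InTag: Boolean;
begin
  Result := '';
  InTag := False;
  for i := 1 to Length(Html) do
    case Html[i] of
      '<': InTag := True;
      '>':
        if InTag then
        begin
          InTag := False;
          Result := Result + ' ';
        end
        else
          Result := Result + Html[i];  // a stray '>' outside a tag is kept
    else
      if not InTag then
        Result := Result + Html[i];
    end;
end;

// Hypothetical helper: trims every line and drops the ones left empty,
// which deals with the whitespace that remains once the tags are gone.
function TidyLines(const S: string): string;
var
  Lines: TStringList;
  i: Integer;
  Line: string;
begin
  Result := '';
  Lines := TStringList.Create;
  try
    Lines.Text := S;
    for i := 0 to Lines.Count - 1 do
    begin
      Line := Trim(Lines[i]);
      if Line <> '' then
      begin
        if Result <> '' then
          Result := Result + #13#10;
        Result := Result + Line;
      end;
    end;
  finally
    Lines.Free;
  end;
end;

const
  // Shortened version of the HTML from the question.
  Sample =
    '<HTML>'#13#10 +
    '<BODY>'#13#10 +
    '  <B> Lots of useful information </B>'#13#10 +
    '  <H1> And some more </H1>'#13#10 +
    '</BODY>'#13#10 +
    '</HTML>';

begin
  Writeln(TidyLines(StripTags(Sample)));
end.
```

In a form, the same two calls could be applied to the control's contents, for example `richEditHTML.Text := TidyLines(StripTags(richEditHTML.Text));`, which makes a single pass over the text instead of one selection change per tag.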

RE: Removing tags from HTML code

hi Clive

You could use a TDomDocument and read each section by tag name, but some codes will fall through the net; in my case, I just clean those up with StringReplace.

eg, here's a snippet of my code
:
var   NodeList : IXMLDomNodeList;
      XMLDoc   : TDomDocument;
:
:

    Status:=XMLDoc.load(sFile);
    if (Status=False) then
      raise exception.Create('Could not load the XML file');
:
:
  
    NodeList := XMLDoc.getElementsByTagName('title');
    StoryTitle :=NodeList.item[0].Get_Text;
    StoryTitle := Form_main.CheckForChars(StoryTitle);
    
    //setup Rich edit formatting.

    RichEdit.SelStart := 0;
    RichEdit.SelLength := length(Storytitle);
    RichEdit.SelAttributes.Color := clMaroon;
    RichEdit.SelAttributes.Style := [fsBold];

    RichEdit.lines.Add(Storytitle);
    RichEdit.SelAttributes.Style := [];
    RichEdit.SelAttributes.Color := clBlack;
    //Get Story Body
    NodeList := XMLDoc.getElementsByTagName('fulltext');
    for ii := 0 to NodeList.length -1 do
    begin
      sline := NodeList.item[ii].Get_text;
      sline := StringReplace(sline, '<P>', '', [rfReplaceAll]);
:
etc ...

hth
lou
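
The StringReplace cleanup mentioned above, for codes that slip past the parser, could be gathered into one routine along these lines. This is only a sketch: `CleanLeftovers` is a hypothetical name, and the list of leftovers is an assumed example rather than the set used in the code above.

```pascal
program CleanLeftoversDemo;

{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

// Hypothetical helper: replaces a fixed list of leftover codes.
// '&amp;' is handled last so that text such as '&amp;nbsp;' is decoded
// only once (to '&nbsp;') rather than twice.
function CleanLeftovers(const S: string): string;
const
  Pairs: array[0..4, 0..1] of string = (
    ('<P>', ''),
    ('</P>', ''),
    ('<BR>', #13#10),
    ('&nbsp;', ' '),
    ('&amp;', '&')
  );
var
  i: Integer;
begin
  Result := S;
  for i := Low(Pairs) to High(Pairs) do
    Result := StringReplace(Result, Pairs[i, 0], Pairs[i, 1],
      [rfReplaceAll, rfIgnoreCase]);
end;

begin
  Writeln(CleanLeftovers('<p>Fish &amp; chips</p>'));
end.
```

Keeping the pairs in one table means a newly discovered leftover only needs one extra row (and a larger array bound) instead of another StringReplace line.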

RE: Removing tags from HTML code

(OP)
Cheers for the suggestions peeps...
Lou, what do I need to stick in my "uses" clause to get access to a TDomDocument and an IXMLDomNodeList?

Clive

RE: Removing tags from HTML code

hi Clive

Ah, yes, one minor detail... you need to import the MSXML_TLB type library and put MSXML_TLB in your uses clause. Do you have this file?

lou

RE: Removing tags from HTML code

(OP)
Hey Lou,

I'm afraid I don't have this file... where can I get it from?

Clive

RE: Removing tags from HTML code

hi Clive

You need IE5 or newer on your machine. Have a look at this link and search the page for MSXML_TLB, or here's the relevant snippet:

"Select Project/Import Type Library. This will display the Import Type Library dialog. Select "Microsoft XML, Version 2.0 (version 2.0)" from the list box and click the "Create Unit" button. This will add MSXML_TLB to your project."

http://bdn.borland.com/article/0,1410,26882,00.html

Or, another example:

http://delphi.about.com/library/bluc/text/uc050601a.htm

lou

RE: Removing tags from HTML code

hi

Just fyi, if you search on t'internet (northern lass) for MSXML_TLB you'll find a lot of examples of the parser.

lou
