C# Noob: HTML object NullReferenceException error - out of clues...


(OP)
Hello friends,

I usually program in VB6 and am still very new to C#, but I feel that I can no longer avoid it.
Here's my dilemma (please forgive the amount of detail):
We receive XML files. The node contents vary from plain text to HTML. The HTML can be in CDATA sections, escaped into entities, only half escaped, or doubly escaped, often with partially invalid tags and generally rather...umm...just bad!
This means they can be any of these:
<seg>yadda</seg>
<seg>&lt;p&gt;yadda...</seg>
<seg>&lt;p>yadda...</seg>
<seg><![CDATA[<html><p>yadda...]]></seg>

Get an idea what I'm dealing with?

What I'm trying to do is process this XML before production and unify all nodes with HTML content to CDATA sections with proper HTML content.
This is how I achieve this (htm contains the unprocessed content of that node):

CODE --> C#

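// Decode twice so that doubly escaped markup is unescaped as well.
htm = System.Web.HttpUtility.HtmlDecode(System.Web.HttpUtility.HtmlDecode(htm));
// Empty the node before writing the cleaned content back.
tuv.SelectSingleNode("myns:seg", nsmgr).InnerText = "";
int pos = htm.IndexOf("<");
// Only content that contains markup gets wrapped in a CDATA section.
if (pos >= 0)
{
	System.Xml.XmlCDataSection cd = tmx.CreateCDataSection(htm);
	tuv.SelectSingleNode("myns:seg", nsmgr).AppendChild(cd);
}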
As you can see, I decode twice - in order to also cover doubly escaped stuff like "<seg>&amp;lt;p&amp;gt;yadda" - yes, we also get stuff like that.

This works - but it also unescapes ampersands and < > within HTML text, which is bad. To remedy this, I thought of loading the string into an HTML object. Reading back the innerHTML I just loaded into the object, it is properly HTML escaped.
I first did this in VB6 and it works just fine. Alas, it doesn't in C#.
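
The double decoding described above, and its side effect on escaped characters inside the text, can be reproduced with a minimal sketch. It assumes System.Net.WebUtility as a stand-in for System.Web.HttpUtility (so no System.Web reference is needed); the class and method names are made up for illustration.

```csharp
using System;
using System.Net;

static class DoubleDecodeDemo
{
    // Decodes twice, mirroring the nested HtmlDecode call in the post.
    public static string DecodeTwice(string s)
    {
        return WebUtility.HtmlDecode(WebUtility.HtmlDecode(s));
    }

    static void Main()
    {
        // Doubly escaped markup needs both passes to become a real tag.
        Console.WriteLine(DecodeTwice("&amp;lt;p&amp;gt;yadda"));
        // Side effect: an escaped "<" inside the text becomes a bare "<".
        Console.WriteLine(DecodeTwice("&lt;p&gt;1 &amp;lt; 2&lt;/p&gt;"));
    }
}
```

The second call shows the problem: after two passes the text between the tags contains a raw "<" that is no longer valid HTML text.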

Here's what I tried:

CODE --> C#

if (pos >= 0)
{
	HtmlAgilityPack.HtmlDocument htmobj=null;
	//htmobj.LoadHtml("<html></html>");
	//htmobj.OptionReadEncoding = false;
	htmobj.LoadHtml(htm);
	htm = htmobj.ToString(); 
The htmobj.LoadHtml(htm); line (marked in red in the original post) throws a NullReferenceException - and I don't know why!
I've already tried with the two commented lines, as well as with:

CODE --> C#

HtmlAgilityPack.HtmlDocument htmobj = new HtmlAgilityPack.HtmlDocument();
That however tries to locate HtmlDocument.cs which cannot be found.
I've switched to HAP because MSHTML couldn't hack it for me either (too restrictive).

Can you give me a hint on what I'm doing wrong?
I've already googled my fingers off.

Thanks for any help!
MakeItSo

“Knowledge is power. Information is liberating. Education is the premise of progress, in every society, in every family.” (Kofi Annan)
Oppose SOPA, PIPA, ACTA; measures to curb freedom of information under whatever name whatsoever.

RE: C# Noob: HTML object NullReferenceException error - out of clues...

(OP)
Aaaargh!

Sorry guys, solved.
Guess what: I do have .Net Framework 4 but I needed the HAP files for .Net 2.
With the .Net 2 DLL, this line works without any hiccups:

CODE --> C#

HtmlAgilityPack.HtmlDocument htmobj = new HtmlAgilityPack.HtmlDocument();
htmobj.LoadHtml(htm); 
The .ToString() part doesn't work yet...
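
As a side note on the earlier attempt: a variable that is set to null refers to no object, so any method call through it throws a NullReferenceException, whichever build of the library is referenced. A minimal sketch of that difference, using StringBuilder as a stand-in for HtmlDocument (the class and method names are hypothetical):

```csharp
using System;
using System.Text;

static class NullReferenceDemo
{
    // Returns true if calling a method through the reference throws NullReferenceException.
    public static bool ThrowsOnUse(StringBuilder sb)
    {
        try
        {
            sb.Append("x");
            return false;
        }
        catch (NullReferenceException)
        {
            return true;
        }
    }

    static void Main()
    {
        StringBuilder declaredOnly = null;               // like: HtmlDocument htmobj = null;
        StringBuilder constructed = new StringBuilder(); // like: new HtmlDocument()
        Console.WriteLine(ThrowsOnUse(declaredOnly));    // prints True
        Console.WriteLine(ThrowsOnUse(constructed));     // prints False
    }
}
```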

VB6 is so beautifully simple in comparison!

Anyway, maybe it helps someone else in the future...

Cheers,
MiS


RE: C# Noob: HTML object NullReferenceException error - out of clues...

(OP)
Last update to conclude this thingy:

CODE --> C#

HtmlAgilityPack.HtmlDocument htmobj = new HtmlAgilityPack.HtmlDocument();
htmobj.LoadHtml(htm);
htm = htmobj.DocumentNode.InnerHtml; 
Works like a charm now and even converts all HTML tags to lower case, so I don't even have mixed case anymore. What a beaut!
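
For anyone who wants the pieces in one place, here is a rough sketch of the whole round trip: decode twice, run the result through HtmlAgilityPack, and write it back as a CDATA section. It is an illustration under assumptions, not the production code: the SegNormalizer class and Normalize method are hypothetical, namespaces (myns, nsmgr) are left out for brevity, System.Net.WebUtility stands in for System.Web.HttpUtility, and plain-text nodes are simply left untouched. The HtmlAgilityPack assembly has to be referenced separately.

```csharp
using System;
using System.Net;
using System.Xml;

static class SegNormalizer
{
    // Hypothetical helper: rewrites one <seg> node so that HTML content
    // ends up in a single CDATA section.
    public static void Normalize(XmlNode seg)
    {
        // InnerText concatenates text and CDATA children, so both input styles are covered.
        string htm = WebUtility.HtmlDecode(WebUtility.HtmlDecode(seg.InnerText));

        // No markup found: treat it as plain text and leave the node alone.
        if (htm.IndexOf('<') < 0)
            return;

        // Let HtmlAgilityPack parse the fragment and serialize it again.
        HtmlAgilityPack.HtmlDocument htmobj = new HtmlAgilityPack.HtmlDocument();
        htmobj.LoadHtml(htm);
        htm = htmobj.DocumentNode.InnerHtml;

        // Replace the old children with one CDATA section.
        while (seg.HasChildNodes)
            seg.RemoveChild(seg.FirstChild);
        seg.AppendChild(seg.OwnerDocument.CreateCDataSection(htm));
    }

    static void Main()
    {
        XmlDocument tmx = new XmlDocument();
        tmx.LoadXml("<tuv><seg>&amp;lt;P&amp;gt;yadda&amp;lt;/P&amp;gt;</seg></tuv>");
        Normalize(tmx.SelectSingleNode("/tuv/seg"));
        Console.WriteLine(tmx.OuterXml);
    }
}
```

How invalid or unclosed tags are serialized depends on the HtmlAgilityPack version and options, so the output is worth checking against real samples.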