
XML encoding with Apache XERCES

Status
Not open for further replies.

zeoth

Programmer
Apr 17, 2002
2
US
I am trying to encode an XML document I created so that it is UTF-8 compliant. So far, I keep getting back a string that hasn't been encoded, and I am not sure why. Here's part of what I did.


import java.io.*;
import org.w3c.dom.*;
import org.apache.xml.serialize.*;
import org.apache.xerces.parsers.DOMParser;
import org.apache.xerces.dom.*;
import org.xml.sax.InputSource;

// Class name is illustrative; the post showed only the two methods.
public class SerializeTest {

    /**
     * Serialize the DOM tree to a String.
     */
    public static String serializeDOMTree(Document document, int indent) throws Exception {
        StringWriter writer = new StringWriter();
        OutputFormat outputFormat = new OutputFormat(document, "UTF-8", true);
        outputFormat.setIndent(indent);
        outputFormat.setIndenting(indent > 0);
        outputFormat.setLineWidth(0);
        outputFormat.setPreserveSpace(false);
        // Use CR LF as the line separator
        char[] cr = {0x0d, 0x0a};
        outputFormat.setLineSeparator(new String(cr));
        XMLSerializer serializer = new XMLSerializer(writer, outputFormat);
        serializer.serialize(document);

        return writer.toString();
    }

    public static void main(String[] args) {
        // Build <message><trial agentid="..."/></message>
        Document doc = new DocumentImpl();
        Element message = doc.createElement("message");
        doc.appendChild(message);

        Element agent = doc.createElement("trial");
        agent.setAttribute("agentid", "This is only a TEST@");
        message.appendChild(agent);
        try {
            String mess = serializeDOMTree(doc, 1);
            System.out.println(mess);
        } catch (Exception e) {
            // Report the failure instead of silently swallowing it
            e.printStackTrace();
        }
    }
}

Why does it keep giving me non-encoded output?
I get:
<?xml version="1.0" encoding="UTF-8"?>
<message>
<trial agentid="This is only a TEST@"/>
</message>

What I should be seeing is something like this (I forgot what @ encodes to, so I just used %12):
<?xml version="1.0" encoding="UTF-8"?>
<message>
<trial agentid="This+is+only+a+TEST%12"/>
</message>


What am I doing wrong?
 
Well... The output you get seems fine to me. I might be missing something, but why do you expect the result you wrote? "@" character IS a UTF-8 character. And why do you expect to get the "+" signs there?

A conversion between encodings shouldn't produce additional characters. The exception is characters the target encoding doesn't recognize, as when Unicode strings are converted to ASCII strings and you get a lot of strange-looking characters. But you won't get characters that carry a meaning of their own just by converting from one encoding to another...
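To make the distinction concrete, here is a minimal sketch using only standard JDK classes (nothing from Xerces). The expected output in the question, with "+" for spaces and a "%" escape for "@", looks like URL (form) encoding, which is a separate mechanism from the UTF-8 character encoding. The class name Utf8VsUrlEncoding and both helper methods are made up for this illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class Utf8VsUrlEncoding {

    // Hypothetical helper: hex dump of the bytes a string becomes in UTF-8.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    // Hypothetical helper: URL (form) encoding, which is what yields "+" and "%XX".
    static String urlEncoded(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "@" is plain ASCII, so UTF-8 stores it as the single byte 0x40.
        System.out.println(utf8Hex("@"));                        // 40
        // URL encoding is what rewrites spaces and "@".
        System.out.println(urlEncoded("This is only a TEST@")); // This+is+only+a+TEST%40
    }
}
```

So for a string that contains only ASCII characters, serializing as UTF-8 leaves the text looking exactly the same; the "+" and "%40" only appear if the value is URL-encoded as a separate step.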

I would appreciate it if you let me know if I am missing something there.

Nosferatu
We are what we eat...
There's no such thing as free meal...
once stated: methane@personal.ro
 
The reason I think the output looks wrong is that I don't believe that string is UTF-8 compliant. Most of the problems we are having are with funny characters such as the nbsp (non-breaking space). I thought UTF-8 encoding was supposed to take care of such funny characters by encoding them into something ASCII-looking, so the document could be sent through something like BizTalk, which blows up on characters like the nbsp.
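For what it's worth, UTF-8 does not turn a non-breaking space into something ASCII-looking: U+00A0 becomes the two bytes C2 A0. If the goal is an XML document whose bytes are all plain ASCII, one option is to serialize with an ASCII output encoding, so that the serializer has to write such characters as numeric character references. The sketch below uses the JDK's own javax.xml.transform serializer rather than the Xerces XMLSerializer from the question, and it assumes the stock JDK implementation; the class name AsciiSafeXml and its helpers are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class AsciiSafeXml {

    // Hypothetical helper: build <message><trial agentid="..."/></message>
    // and serialize it to bytes in the requested output encoding.
    static byte[] serialize(String agentId, String encoding) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element message = doc.createElement("message");
            doc.appendChild(message);
            Element trial = doc.createElement("trial");
            trial.setAttribute("agentid", agentId);
            message.appendChild(trial);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical helper: true if every byte is in the 7-bit ASCII range.
    static boolean isPureAscii(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) { // Java bytes are signed, so values above 0x7F are negative
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String value = "TEST\u00A0@"; // contains a non-breaking space

        byte[] ascii = serialize(value, "US-ASCII");
        System.out.println(new String(ascii, StandardCharsets.US_ASCII));
        System.out.println("pure ASCII: " + isPureAscii(ascii));

        byte[] utf8 = serialize(value, "UTF-8");
        System.out.println("pure ASCII with UTF-8: " + isPureAscii(utf8));
    }
}
```

With the US-ASCII run I would expect the nbsp to show up as the reference &#160; in the attribute value, which any XML parser reads back as the original character. Whether the receiving system copes with character references is a separate question that this sketch does not answer.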

I think I am missing something.

Thanks
 