xerces c 1.7.0 ICU for unicode

Hi,

I am using Xerces-C 1.7.0 (ICU build) to parse XML files. The XML file contains some Chinese characters, so I am using the ICU build for Unicode support. I declared the encoding as UTF-8:

    <?xml version="1.0" encoding="UTF-8"?>

Part of the XML file contains the following Chinese characters:

    <Convert>
      <FromValue>TRUE</FromValue>
      <ToValue>您是如</ToValue>
    </Convert>
    <Convert>
      <FromValue>FALSE</FromValue>
      <ToValue>您好</ToValue>
    </Convert>

I am using DOM to parse the XML file, with the following code:

    static const XMLCh gLS[] = { chLatin_L, chLatin_S, chNull };
    DOMImplementation *impl = DOMImplementationRegistry::getDOMImplementation(gLS);
    DOMBuilder *CtlParser = ((DOMImplementationLS*)impl)->createDOMBuilder(DOMImplementationLS::MODE_SYNCHRONOUS, 0);

    CtlParser->setFeature(XMLUni::fgDOMNamespaces, true);
    CtlParser->setFeature(XMLUni::fgXercesSchema, true);
    CtlParser->setFeature(XMLUni::fgXercesSchemaFullChecking, true);
    CtlParser->setFeature(XMLUni::fgDOMValidateIfSchema, true);

    // create our error handler and install it
    XMLErrorHandler errorHandler;
    CtlParser->setErrorHandler(&errorHandler);

    CtlDoc = CtlParser->parseURI(XMLFilePath);
    if (errorHandler.getSawErrors())
    {
        cout << errorHandler.ReturnErrorMessage();
    }

I am getting the following error:

    Message: An exception occurred! Type:UTFDataFormatException, Message:invalid byte 2 (�) of a 2-byte sequence.

I do not understand why I am getting this error even though I am using the Xerces-C ICU build, which is supposed to work with Unicode characters. If I remove the Chinese characters, parsing completes without any error message.

If anybody has worked with Unicode in Xerces-C, please help me. Did I miss any of the parser settings for Unicode?

Thanks in advance,
Jaya Nageswar.
Re: xerces c 1.7.0 ICU for unicode

Jaya Nageswar wrote:
> I am getting the following error.
> Message: An exception occurred! Type:UTFDataFormatException,
> Message:invalid byte 2 (�) of a 2-byte sequence.

This indicates your file is not really encoded in UTF-8.

> I do not understand why I am getting this error even though I am using
> the Xerces-C ICU build. ICU build is supposed to work with Unicode characters.
> If I remove the Chinese characters, I am not getting any error message while
> parsing.

Xerces-C supports UTF-8 even without using the ICU transcoders.

> If anybody has worked with Unicode in Xerces-C, please help me. Did I miss any
> of the parser settings for Unicode?

Your file is not encoded in UTF-8, so the parser reports an error. You can either fix the file so it's properly encoded, or update the encoding in the XML declaration to reflect the actual encoding.

Dave
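To illustrate the kind of check behind an error like "invalid byte 2 of a 2-byte sequence", here is a minimal sketch of a UTF-8 well-formedness test in standard C++. The function name `isValidUtf8` is hypothetical and is not part of Xerces-C or ICU; the parser's own decoder may differ in detail. The idea is only that a lead byte announces how many continuation bytes (of the form `10xxxxxx`) must follow, and a file saved in another encoding usually breaks that pattern.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Xerces-C or ICU API).
// Returns true if `bytes` is well-formed UTF-8. Rejects stray continuation
// bytes, truncated sequences, overlong forms, UTF-16 surrogates, and values
// above U+10FFFF.
bool isValidUtf8(const std::string& bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;   // number of continuation bytes expected
        unsigned long cp = 0;    // decoded code point
        unsigned long min = 0;   // smallest code point allowed for this length

        if (lead < 0x80) {       // plain ASCII, one byte
            ++i;
            continue;
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1; cp = lead & 0x1F; min = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; cp = lead & 0x0F; min = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; cp = lead & 0x07; min = 0x10000;
        } else {
            return false;        // stray continuation byte or invalid lead byte
        }

        if (n - i - 1 < extra) {
            return false;        // sequence is cut off at the end of the input
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) {
                return false;    // e.g. "invalid byte 2 of a 2-byte sequence"
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;        // overlong form, out of range, or surrogate
        }
        i += extra + 1;
    }
    return true;
}
```

For example, the UTF-8 bytes `E6 82 A8` (the character 您, U+60A8) pass this check, while a pair such as `C4 FA`, where a 2-byte lead byte is followed by something that is not a continuation byte (as a legacy multi-byte encoding might produce), fails at the second byte. Running a check like this over the raw file contents is one way to confirm whether the file really matches the encoding named in its XML declaration.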
Re: xerces c 1.7.0 ICU for unicode

Hi David,

Thanks for the update. I translated the characters from UCS-2 to UTF-8 using C APIs. I had actually taken these Chinese characters (您是如) from Google Translate and used them in the XML file to test the Unicode support. When I translated these characters from UCS-2 to UTF-8 using C APIs, I got these characters (귦꺡髧„). Now I am not getting the errors from the Xerces parser.

But I have a question. Will the characters themselves change from one format to another? If I have a string "abcd", will it change from one format to another? I understand that the encoding differs between formats, but I do not understand why the characters themselves are changing from one format to another. Any information related to this will be a great help to me.

Thanks,
Jaya Nageswar.

On Wed, Sep 3, 2008 at 3:18 AM, David Bertoni <dbertoni@...> wrote:
> Your file is not encoded in UTF-8, so the parser reports an error. You can
> either fix the file so it's properly encoded, or update the encoding in the
> XML declaration to reflect the actual encoding.
>
> Dave
Re: xerces c 1.7.0 ICU for unicode

Jaya Nageswar wrote:
> Thanks for the update. I translated the characters from UCS-2 to UTF-8 using
> C APIs. I had actually taken these Chinese characters (您是如) from Google
> Translate and used them in the XML file to test the Unicode support. When I
> translated these characters from UCS-2 to UTF-8 using C APIs, I got these
> characters (귦꺡髧„). Now I am not getting the errors from the Xerces parser.

I don't think you "got" any characters from the transcoding APIs. Also, you need to be careful when associating the glyphs you see on a display device with a particular character, since they depend on the font and the encoding assumed by the application and rendering system.

> But I have a question. Will the characters themselves change from one format
> to another? If I have a string "abcd", will it change from one format to
> another? I understand that the encoding differs between formats, but I do not
> understand why the characters themselves are changing from one format to
> another. Any information related to this will be a great help to me.

I suggest you read this article on Wikipedia:

http://en.wikipedia.org/wiki/Character_encoding

Dave
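As a small illustration of the point about characters versus encodings, the sketch below encodes the same Unicode code point as UTF-8 and as big-endian UTF-16 and prints the bytes in hex. The helper names `encodeUtf8`, `encodeUtf16BE`, and `toHex` are hypothetical and written for this example only (they are not Xerces-C or ICU functions), and the code assumes its input is a valid Unicode scalar value. The character stays the same; only the byte sequence that represents it changes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encode one code point as UTF-8.
// Assumes `cp` is a valid Unicode scalar value.
std::string encodeUtf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: encode one code point as big-endian UTF-16.
// Code points above U+FFFF become a surrogate pair.
std::string encodeUtf16BE(char32_t cp) {
    std::string out;
    if (cp < 0x10000) {
        out += static_cast<char>((cp >> 8) & 0xFF);
        out += static_cast<char>(cp & 0xFF);
    } else {
        const char32_t v = cp - 0x10000;
        const char32_t hi = 0xD800 | (v >> 10);
        const char32_t lo = 0xDC00 | (v & 0x3FF);
        out += static_cast<char>((hi >> 8) & 0xFF);
        out += static_cast<char>(hi & 0xFF);
        out += static_cast<char>((lo >> 8) & 0xFF);
        out += static_cast<char>(lo & 0xFF);
    }
    return out;
}

// Hypothetical helper: format raw bytes as space-separated uppercase hex.
std::string toHex(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (i != 0) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

With these helpers, `toHex(encodeUtf8(U'a'))` gives `61` and `toHex(encodeUtf16BE(U'a'))` gives `00 61`, while for 您 (U+60A8) the results are `E6 82 A8` and `60 A8`. So "abcd" remains "abcd" in every encoding, but its bytes differ, and a program that reads UTF-8 bytes while assuming some other encoding will display different glyphs even though the stored data is correct.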