Preserving the doctype and entity references


Preserving the doctype and entity references

by andrew welch

I thought I'd have a go at this as it comes up quite often, the goal
being that people can use it from the command line without needing to
write any Java.

I've written an XMLReader replacement that generates PIs for the
doctype and entities.  It works as expected from Java, but not from
the command line using the -x switch.

The test XML is:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>lexical example</title>
    </head>
    <body>
        <p>hello world</p>
    </body>
</html>

The transform is:

<xsl:template match="/">
  PIs <xsl:value-of select="count(//processing-instruction())"/>
</xsl:template>

The class itself is below.  The output when run from Java is "PIs 3",
so the PIs are reaching the stylesheet.  The code running the
transform is:

XMLReader customXMLReader = new CustomXMLReader();
SAXTransformerFactory stf =
    (SAXTransformerFactory) TransformerFactory.newInstance();
TransformerHandler handler = stf.newTransformerHandler(
    new StreamSource("C:\\users\\andrew\\documents\\test.xsl"));

xmlReader.setProperty("http://xml.org/sax/properties/lexical-handler",
    new CustomLexicalHandler(handler));
handler.setResult(new StreamResult(System.out));
customXMLReader.setContentHandler(handler);

customXMLReader.parse("C:\\users\\andrew\\documents\\test.xml");


From the command line I'm using:

java -cp blah/CustomXMLReader.jar;blah\saxon9.jar
net.sf.saxon.Transform -x:com.andrewjwelch.CustomXMLReader test.xml test.xsl

and getting the output "PIs 0"....

The class is below.  If I litter it with System.out calls I can see that
it is used and that parse(String systemId) is called, but none of the
lexical methods are.  Any ideas?

package com.andrewjwelch;

import java.io.IOException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.XMLFilterImpl;
import org.xml.sax.helpers.XMLReaderFactory;

public class CustomXMLReader extends XMLFilterImpl implements LexicalHandler {

    private boolean isProcessingDTD;
    private XMLReader xmlReader;

    public CustomXMLReader() throws Exception {
        super();
        xmlReader = XMLReaderFactory.createXMLReader();
        xmlReader.setProperty(
            "http://xml.org/sax/properties/lexical-handler", this);
        super.setParent(xmlReader);
    }

    @Override
    public void parse(InputSource input) throws SAXException, IOException {
        super.parse(input);
    }

    @Override
    public void parse(String systemId) throws SAXException, IOException {
        super.parse(systemId);
    }

    public void startDTD(String name, String publicId, String systemId)
            throws SAXException {
        super.processingInstruction("doctype-public", publicId);
        super.processingInstruction("doctype-system", systemId);
        isProcessingDTD = true;
    }

    public void endDTD() throws SAXException {
        isProcessingDTD = false;
    }

    public void startEntity(String name) throws SAXException {
        if (!isProcessingDTD) {
            super.processingInstruction("entity", name);
        }
    }

    public void endEntity(String name) throws SAXException { }

    public void startCDATA() throws SAXException { }

    public void endCDATA() throws SAXException { }

    public void comment(char[] ch, int start, int length)
            throws SAXException { }

}


--
Andrew Welch
http://andrewjwelch.com
Kernow: http://kernowforsaxon.sf.net/

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
saxon-help mailing list archived at http://saxon.markmail.org/
saxon-help@...
https://lists.sourceforge.net/lists/listinfo/saxon-help 

Re: Preserving the doctype and entity references

by Michael Kay

Saxon nominates itself as the lexical handler by calling
parser.setProperty("...lexical-handler", ce)

(see Sender line 378).

"parser" here is your CustomXMLReader, which doesn't override setProperty,
so the base class does parent.setProperty(), causing the lexical events to
be sent straight from Xerces to Saxon's ReceivingContentHandler (which
ignores most of them) rather than to your filter.

You simply need to implement setProperty() to intercept this call.

If you want to do things properly you should pass all the lexical events on
to Saxon after dealing with them yourself. Saxon needs to know about
comments, and it needs to know about the start and end of the DTD so that it
can ignore comments and PIs occurring therein. It also likes to be told
about unparsed entities.
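
A minimal sketch of that interception, for illustration only (it is not
Saxon's code nor the final CustomXMLReader). The class name
InterceptingFilter, the field "downstream" and the demo() helper are
hypothetical names. The filter keeps the handler passed to setProperty()
instead of forwarding it to the parent, and passes every lexical event on
after emitting its own PIs. The PIs are sent before startDTD is forwarded,
on the assumption that the consumer drops PIs seen between startDTD and
endDTD.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: intercepts the lexical-handler property instead of passing it on. */
public class InterceptingFilter extends XMLFilterImpl implements LexicalHandler {
    static final String LEXICAL = "http://xml.org/sax/properties/lexical-handler";

    private LexicalHandler downstream; // the handler the caller (e.g. Saxon) asked to register
    private boolean inDTD;

    public InterceptingFilter(XMLReader parent) throws SAXException {
        super(parent);
        parent.setProperty(LEXICAL, this); // the filter itself listens to the real parser
    }

    @Override
    public void setProperty(String name, Object value)
            throws SAXNotRecognizedException, SAXNotSupportedException {
        if (LEXICAL.equals(name)) {
            downstream = (LexicalHandler) value; // remember it; do not forward to the parent
        } else {
            super.setProperty(name, value);
        }
    }

    public void startDTD(String name, String publicId, String systemId) throws SAXException {
        // Emit the PIs before forwarding startDTD so the consumer does not
        // discard them as DTD content.
        if (publicId != null) processingInstruction("doctype-public", publicId);
        if (systemId != null) processingInstruction("doctype-system", systemId);
        inDTD = true;
        if (downstream != null) downstream.startDTD(name, publicId, systemId);
    }

    public void endDTD() throws SAXException {
        inDTD = false;
        if (downstream != null) downstream.endDTD();
    }

    public void startEntity(String name) throws SAXException {
        if (!inDTD) processingInstruction("entity", name);
        if (downstream != null) downstream.startEntity(name);
    }

    public void endEntity(String name) throws SAXException {
        if (downstream != null) downstream.endEntity(name);
    }

    public void startCDATA() throws SAXException {
        if (downstream != null) downstream.startCDATA();
    }

    public void endCDATA() throws SAXException {
        if (downstream != null) downstream.endCDATA();
    }

    public void comment(char[] ch, int start, int length) throws SAXException {
        if (downstream != null) downstream.comment(ch, start, length);
    }

    /** Parses xml through the filter and returns what a downstream consumer sees. */
    public static List<String> demo(String xml) throws Exception {
        List<String> seen = new ArrayList<>();
        DefaultHandler2 recorder = new DefaultHandler2() {
            @Override public void processingInstruction(String t, String d) { seen.add("pi:" + t + " " + d); }
            @Override public void startDTD(String n, String p, String s) { seen.add("startDTD:" + n); }
            @Override public void startEntity(String n) { seen.add("startEntity:" + n); }
            @Override public void comment(char[] ch, int st, int len) { seen.add("comment:" + new String(ch, st, len)); }
        };
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        InterceptingFilter filter = new InterceptingFilter(parser);
        filter.setContentHandler(recorder);
        filter.setProperty(LEXICAL, recorder); // this is the call the filter must intercept
        filter.parse(new InputSource(new StringReader(xml)));
        return seen;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo("<!DOCTYPE doc [<!ENTITY e \"x\">]><doc><!--c-->&e;</doc>"));
    }
}
```

For use with the -x switch the class would also need a no-argument
constructor that creates its own parent parser, as the original
CustomXMLReader does. Unparsed entity declarations arrive through
DTDHandler, which XMLFilterImpl already forwards.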

Michael Kay
http://www.saxonica.com/
 


Re: Preserving the doctype and entity references

by Michael Kay


By the way, it occurred to me that it would be nice to send the DTD
information to Saxon in the same format that the saxon:doctype extension
uses for output.

http://www.saxonica.com/documentation/extensions/instructions/doctype.html

If you do that, then the application could easily copy the internal DTD to
the output if it chose to, or it could do so selectively, for example
copying only the entity declarations.

There must be some way of bringing saxon:entity-ref into the picture as
well.

Michael Kay
http://www.saxonica.com/ 


Re: Preserving the doctype and entity references

by andrew welch

2008/7/18 Michael Kay <mike@...>:

> Saxon nominates itself as the lexical handler by calling
> parser.setProperty("...lexical-handler", ce)
>
> (see Sender line 378).
>
> "parser" here is your CustomXmlReader; which doesn't implement setProperty,
> so the base class does parent.setProperty(), causing the lexical events to
> be sent straight from Xerces to Saxon's ReceivingContentHandler (which
> ignores most of them) rather than to your filter.
>
> You simply need to implement setProperty() to intercept this call.

Ahh great, thanks.

> If you want to do things properly you should pass all the lexical events on
> to Saxon after dealing with them yourself. Saxon needs to know about
> comments, and it needs to know about the start and end of the DTD so that it
> can ignore comments and PIs occurring therein. It also likes to be told
> about unparsed entities.

Ok...


--
Andrew Welch
http://andrewjwelch.com
Kernow: http://kernowforsaxon.sf.net/


Re: Preserving the doctype and entity references

by andrew welch

2008/7/18 Michael Kay <mike@...>:

>
> By the way, it occurred to me that it would be nice to send the DTD
> information to Saxon in the same format that the saxon:doctype extension
> uses for output.
>
> http://www.saxonica.com/documentation/extensions/instructions/doctype.html
>
> If you do that, then the application could easily copy the internal DTD to
> the output if it chose to, or it could do so selectively, for example
> copying only the entity declarations.
>
> There must be some way of bringing saxon:entity-ref into the picture as
> well.

Ok, sounds like it could be potentially useful.

I was going to maybe convert cdata sections to markup, wrapped in
<x:cdata> or something... I'll take a look (workload permitting)


--
Andrew Welch
http://andrewjwelch.com
Kernow: http://kernowforsaxon.sf.net/


Re: Preserving the doctype and entity references

by andrew welch

>> Saxon nominates itself as the lexical handler by calling
>> parser.setProperty("...lexical-handler", ce)
>>
>> (see Sender line 378).
>>
>> "parser" here is your CustomXmlReader; which doesn't implement setProperty,
>> so the base class does parent.setProperty(), causing the lexical events to
>> be sent straight from Xerces to Saxon's ReceivingContentHandler (which
>> ignores most of them) rather than to your filter.
>>
>> You simply need to implement setProperty() to intercept this call.
>
> Ahh great, thanks.
>
>> If you want to do things properly you should pass all the lexical events on
>> to Saxon after dealing with them yourself. Saxon needs to know about
>> comments, and it needs to know about the start and end of the DTD so that it
>> can ignore comments and PIs occurring therein. It also likes to be told
>> about unparsed entities.
>
> Ok...

I'm working on this now but the entity events aren't making sense at
the moment.... (I've never really used DTDs)

Given:

<node>hello&mdash;world</node>

I get the events:

characters "hello"
startEntity "mdash"
endEntity "mdash"
characters "-world"

I guess the question is, what output is likely to be most useful:

- A pi such as <?entity mdash?> without the entity expanded
- An element <entity name="mdash">-</entity> with the entity expansion
as the contents
- something else...?

And then given the answer to that:

- What's the point of having two events startEntity and endEntity if
they fire one after the other?
- How can I prevent entity expansion, or better still separate the
characters() from the expansion from other characters() calls (in the
example get the dash on its own without 'world' being included)


thanks
--
Andrew Welch
http://andrewjwelch.com
Kernow: http://kernowforsaxon.sf.net/


Re: Preserving the doctype and entity references

by Dave Pawson-2

2008/7/23 Andrew Welch <andrew.j.welch@...>:

> Given:
>
> <node>hello&mdash;world</node>
>
> I get the events:
>
> characters "hello"
> startEntity "mdash"
> endEntity "mdash"
> characters "-world"
>
> I guess the question is, what output is likely to be most useful:
>
> - A pi such as <?entity mdash?> without the entity expanded
> - An element <entity name="mdash">-</entity> with the entity expansion
> as the contents
> - something else...?

Safer would be an element, since an entity expansion can contain
anything (including markup)
Perhaps with a really odd name or namespaced?
<xxx:entity>Entity expansion text</xxx:entity>

Mike, would the saxon namespace be appropriate?


>
> And then given the answer to that:
>
> - What's the point of having two events startEntity and endEntity if
> they fire one after the other?

Like text, you may not get all the content of the entity expansion in one hit.
Stack it up until you get the end event. Then process it.
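
A rough sketch of that stacking idea, assuming the parser delivers the
expansion between startEntity and endEntity (the order David Carlisle
expects elsewhere in this thread, not the order Andrew reports seeing).
EntityCollector and its fields are hypothetical names. The demo feeds the
handler hand-made events rather than a real parse, so it shows the
bookkeeping only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import org.xml.sax.ext.DefaultHandler2;

/** Hypothetical handler: buffers the text of each entity until its end event arrives. */
public class EntityCollector extends DefaultHandler2 {
    // One frame per open entity; the innermost open entity is on top.
    private final Deque<String> names = new ArrayDeque<>();
    private final Deque<StringBuilder> buffers = new ArrayDeque<>();

    /** Completed entities as "name=expansion", in the order they closed. */
    public final List<String> completed = new ArrayList<>();
    /** Text that was not inside any entity. */
    public final StringBuilder plainText = new StringBuilder();

    @Override
    public void startEntity(String name) {
        names.push(name);
        buffers.push(new StringBuilder());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // characters() may fire several times for one run of text, so always append.
        StringBuilder target = buffers.isEmpty() ? plainText : buffers.peek();
        target.append(ch, start, length);
    }

    @Override
    public void endEntity(String name) {
        // Only now is the expansion known to be complete.
        String text = buffers.pop().toString();
        completed.add(names.pop() + "=" + text);
        // A nested entity's text also belongs to the enclosing entity.
        if (!buffers.isEmpty()) buffers.peek().append(text);
    }

    public static void main(String[] args) {
        EntityCollector c = new EntityCollector();
        c.characters("hello".toCharArray(), 0, 5);
        c.startEntity("mdash");
        c.characters("\u2014".toCharArray(), 0, 1);
        c.endEntity("mdash");
        c.characters("world".toCharArray(), 0, 5);
        System.out.println(c.completed + " / " + c.plainText);
    }
}
```

If the parser really does fire endEntity before the expansion text, as
reported above, the buffer for that entity will simply come back empty.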


> - How can I prevent entity expansion, or better still separate the
> characters() from the expansion from other characters() calls (in the
> example get the dash on its own without 'world' being included)


stack. bottom contains the element/entity you're dealing with last.


HTH



--
Dave Pawson
XSLT XSL-FO FAQ.
http://www.dpawson.co.uk


Re: Preserving the doctype and entity references

by andrew welch

> Safer would be an element, since an entity expansion can contain
> anything (including markup)

yes, I think having an element with expanded entity as its contents is
quite nice....

> Perhaps with a really odd name or namespaced?
> <xxx:entity>Entity expansion text</xxx:entity>

It's namespaced at the moment - as is the marked up cdata section.

> Mike, would the saxon namespace be appropriate?

I'm half tempted to make this some form of... ahem, commercial
software so I may keep it in my namespace. (it's a typical big company
problem).  It would still be freely available to everyone, just
require a license for commercial use.

>> - How can I prevent entity expansion, or better still separate the
>> characters() from the expansion from other characters() calls (in the
>> example get the dash on its own without 'world' being included)
>
>
> stack. bottom contains the element/entity you're dealing with last.

Ok, given:

hello&mdash;world

the events are:

characters "hello"
startEntity
endEntity
characters "-world"

so as you can see the end entity event fires before the event with the
expanded contents of that entity.  Even then, the expanded contents
are coming through in the same characters event as text "world".
There doesn't appear to be a way of determining what is the expanded
entity and what isn't...



--
Andrew Welch
http://andrewjwelch.com
Kernow: http://kernowforsaxon.sf.net/


Re: Preserving the doctype and entity references

by David Carlisle


aren't you supposed to see

startEntity "mdash"
characters "-"
endEntity "mdash"
characters "world"

________________________________________________________________________
The Numerical Algorithms Group Ltd is a company registered in England
and Wales with company number 1249803. The registered office is:
Wilkinson House, Jordan Hill Road, Oxford OX2 8DR, United Kingdom.

This e-mail has been scanned for all viruses by Star. The service is
powered by MessageLabs.
________________________________________________________________________


Re: Preserving the doctype and entity references

by andrew welch

> aren't you supposed to see
>
> startEntity "mdash"
> characters "-"
> endEntity "mdash"
> characters "world"

That was my expectation... I'll investigate a bit more now.


--
Andrew Welch
http://andrewjwelch.com
Kernow: http://kernowforsaxon.sf.net/


Re: Preserving the doctype and entity references

by David Carlisle


> I guess the question is, what output is likely to be most useful:

It depends what use one wants...

people asking for unexpanded entities are often doing some kind of
"modified identity transform" and they want entities to go back as they
were.

In which case
- A pi such as <?entity mdash?>
would (perhaps) be fine (except for entities in attribute values, but I'm
not sure SAX reliably reports those anyway?) as that is easy to pick up
in the XSL and write back as an entity ref.

This is more or less equivalent to what you can do now anyway: do a
global sed or perl replace of & to [[[amp]]], run the transform, and
then replace [[[amp]]] back to &.
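
The same round trip can be sketched in Java instead of sed or perl. The
class and method names are made up for illustration, and the sketch
assumes the string [[[amp]]] never occurs in the real document.

```java
/** Hypothetical helper for the placeholder trick: hide '&' before parsing, restore it afterwards. */
public class AmpersandPlaceholder {
    // Placeholder taken from the message above; assumed not to occur in the source text.
    static final String PLACEHOLDER = "[[[amp]]]";

    /** Hides every '&' so the XML parser sees no entity references at all. */
    public static String protect(String xml) {
        return xml.replace("&", PLACEHOLDER);
    }

    /** Puts the '&' characters back in the serialized transform output. */
    public static String restore(String xml) {
        return xml.replace(PLACEHOLDER, "&");
    }

    public static void main(String[] args) {
        String source = "<node>hello&mdash;world &amp; more</node>";
        String hidden = protect(source);
        System.out.println(hidden);          // what the parser and stylesheet would see
        System.out.println(restore(hidden)); // identical to the source again
    }
}
```

Note that this hides the predefined references (&amp;, &lt; and so on)
too, so the stylesheet sees text such as "[[[amp]]]amp;" rather than "&".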


but if you want a modified identity transform that "preserves"
entities but where the predicates such as

 test="mo[.='& #2013;']"
or harder
 test="mo[text()='& #2013;']"

test as true whether or not the original document uses an entity then
the options are more limited.

<mo><entity name="mdash">-</entity></mo>

would work for the first form, but not the second.

<mo><?entityStart name="mdash"?>-<?entityEnd name="mdash"?></mo>

would work for both, and be closer to the sax events (possibly)

Of course neither would work for

 test="mo[node()[1]='& #2013;']"

but if you do that you probably shouldn't expect this kind of filter to
work....
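
A small sketch of the paired-PI form, again assuming the expansion text
arrives between the two entity events. EntityMarkers and demo() are
hypothetical names, and the demo drives the handler with hand-made events
rather than a real parser. Parameter entities (names starting with "%")
and the external subset ("[dtd]") are skipped, following the naming
convention documented for LexicalHandler.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.ext.DefaultHandler2;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical adapter: brackets each general entity expansion with a pair of PIs. */
public class EntityMarkers extends DefaultHandler2 {
    private final ContentHandler out;

    public EntityMarkers(ContentHandler out) {
        this.out = out;
    }

    private static boolean isGeneralEntity(String name) {
        return !name.startsWith("%") && !name.equals("[dtd]");
    }

    @Override
    public void startEntity(String name) throws SAXException {
        if (isGeneralEntity(name)) out.processingInstruction("entityStart", "name=\"" + name + "\"");
    }

    @Override
    public void endEntity(String name) throws SAXException {
        if (isGeneralEntity(name)) out.processingInstruction("entityEnd", "name=\"" + name + "\"");
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        out.characters(ch, start, length); // text stays a plain text node between the PIs
    }

    /** Feeds one synthetic entity through the adapter and records what comes out. */
    public static List<String> demo() throws SAXException {
        List<String> seen = new ArrayList<>();
        DefaultHandler recorder = new DefaultHandler() {
            @Override public void processingInstruction(String t, String d) { seen.add("pi:" + t + " " + d); }
            @Override public void characters(char[] ch, int st, int len) { seen.add("text:" + new String(ch, st, len)); }
        };
        EntityMarkers markers = new EntityMarkers(recorder);
        markers.startEntity("mdash");
        markers.characters("\u2014".toCharArray(), 0, 1);
        markers.endEntity("mdash");
        return seen;
    }

    public static void main(String[] args) throws SAXException {
        System.out.println(demo());
    }
}
```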

David



Re: Preserving the doctype and entity references

by andrew welch

> people asking for unexpanded entities are often doing some kind of
> "modified identity transform" and they want entities to go back as they
> were.
>
> In which case
> - A pi such as <?entity mdash?>
> would (perhaps) be fine

Except that you also need to stop the entity being expanded....


> <mo><entity name="mdash">-</entity></mo>
>
> would work for the first form, but not the second.
>
> <mo><?entityStart name="mdash"?>-<?entityEnd name="mdash"?></mo>
>
> would work for both, and be closer to the sax events (possibly)

I've moved it on to xml-dev now as it's non Saxon specific, but I
think for unparsed entities you get both events at the same time as a
kind of hack to reuse them from parsed entities, instead of having an
event of its own... which makes determining the expanded content
difficult.

> Of course neither would work for
>
>  test="mo[node()[1]='& #2013;']"
>
> but if you do that you probably shouldn't expect this kind of filter to
> work....

Good point... there's probably something that could be done, but for
now there are more basic problems.

thanks
--
Andrew Welch
http://andrewjwelch.com
Kernow: http://kernowforsaxon.sf.net/


Re: Preserving the doctype and entity references

by Michael Kay

> Given:
>
> <node>hello&mdash;world</node>
>
> I get the events:
>
> characters "hello"
> startEntity "mdash"
> endEntity "mdash"
> characters "-world"

That's not what I would have expected, but it's not something I have ever
tried to do.
>
> I guess the question is, what output is likely to be most useful:
>
> - A pi such as <?entity mdash?> without the entity expanded
> - An element <entity name="mdash">-</entity> with the entity
> expansion as the contents
> - something else...?

An element is going to be easier to process than a pair of PIs, but on the
other hand it is very likely to stop existing code working if the code isn't
expecting it. Your call!

Another option (if you can get the info from the parser) is to output a PI
giving the entity name only, leaving the user to find the content from the
entity definition in the DTD if they need it - on the theory that they
probably just want to copy the entity reference to the output rather than
processing its expansion.

Michael Kay
http://www.saxonica.com/



Re: Preserving the doctype and entity references

by Dave Pawson-2

2008/7/23 Andrew Welch <andrew.j.welch@...>:

> Ok, given:
>
> hello&mdash;world
>
> the events are:
>
> characters "hello"
> startEntity
> endEntity
> characters "-world"
>
> so as you can see the end entity event fires before the event with the
> expanded contents of that entity.

No, that seems wrong?

Should be
hello[xxxxx]world

where xxx is the entity expansion?

What is the text value of the entity Andrew?




> Even then, the expanded contents
> are coming through in the same characters event as text "world".

Ah! Seems like the parser is expanding entities before generating events?
That's wrong somehow? Configuration perhaps?


I'll have a look in sax2 and get back to you if I find anything.


regards





--
Dave Pawson
XSLT XSL-FO FAQ.
http://www.dpawson.co.uk
