[bug #20252] [gnu.org #336933] RFC2047 header encoding bug


by John Tartar


URL:
  <http://savannah.nongnu.org/bugs/?20252>

                 Summary: [gnu.org #336933] RFC2047 header encoding bug
                 Project: MHonArc
            Submitted by: jaginsberg
            Submitted on: Monday 06/25/2007 at 15:13
                Category: Character Sets
                Severity: 3 - Normal
              Item Group: Incorrect Behavior
                  Status: None
                 Privacy: Public
             Assigned to: None
             Open/Closed: Open
         Discussion Lock: Any
        Operating System: All
            Perl Version: 5.8.3-19.5
       Component Version: 0.7.3
           Fixed Release:

    _______________________________________________________

Details:

From a user report on lists.gnu.org:

"""
To provide a bit more info, I looked at the headers in my mailbox, and
matched them with HTMLised messages from lists.gnu.org.

It seems that only some forms of encoding are affected. I don't know which
encoding the first example uses, but it displays fine in my mail client
(mutt):

This (http://lists.gnu.org/archive/html/grub-devel/2007-06/msg00004.html):

From: =?UTF-8?B?VmVzYSBKw6TDpHNrZWzDpGluZW4=?= <chaac@...>

displays as:

From: Vesa JÃÃskelÃinen

this (http://lists.gnu.org/archive/html/grub-devel/2007-05/msg00155.html):

From: =?ISO-8859-1?Q?Vesa_J=E4=E4skel=E4inen?= <chaac@...>

displays as:

From: Vesa Jääskeläinen
"""

Thanks!

-jag
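
For readers who want to check the two headers quoted above, here is a minimal Python sketch using the standard library's email.header module. The helper name decode_display_name is made up for this note, and the address part of each header is left out because it is elided in the report.

```python
from email.header import decode_header, make_header


def decode_display_name(raw: str) -> str:
    """Decode RFC 2047 encoded-words in a header value to a Unicode string."""
    return str(make_header(decode_header(raw)))


# Display names copied from the two From: headers in the report.
b64_header = "=?UTF-8?B?VmVzYSBKw6TDpHNrZWzDpGluZW4=?="
q_header = "=?ISO-8859-1?Q?Vesa_J=E4=E4skel=E4inen?="

print(decode_display_name(b64_header))  # Vesa Jääskeläinen
print(decode_display_name(q_header))    # Vesa Jääskeläinen
```

Both encoded-words decode to the same name, which suggests the difference in the archived pages arises after header decoding rather than in the headers themselves.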




    _______________________________________________________

Reply to this item at:

  <http://savannah.nongnu.org/bugs/?20252>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.nongnu.org/

---------------------------------------------------------------------
To sign-off this list, send email to majordomo@... with the
message text UNSUBSCRIBE MHONARC-DEV


[bug #20252] [gnu.org #336933] RFC2047 header encoding bug

by John Tartar


Follow-up Comment #1, bug #20252 (project mhonarc):

Not a mhonarc bug.

Almost certainly, mhonarc is converting the name to UTF-8,
but Apache is sending the web page out with an ISO-8859-1
header. Here's an example of mhonarc doing just fine with
a message from the same person.

http://www.mail-archive.com/grub-devel@.../msg00411.htm

Solution (going forward) is to have mhonarc produce UTF-8
for everything, and for the webserver to label it as such.
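
As a rough illustration of the mislabelling scenario described in this comment, the Python sketch below takes correct UTF-8 bytes and reads them back as ISO-8859-1, which is what a browser would do if the server declared the wrong charset. It is only a model of that hypothesis, not a trace of what lists.gnu.org actually does.

```python
name = "Vesa Jääskeläinen"

# The page body holds correct UTF-8 bytes...
utf8_bytes = name.encode("utf-8")

# ...but the reader interprets them as ISO-8859-1.
mislabelled = utf8_bytes.decode("iso-8859-1")

print(mislabelled)  # Vesa JÃ¤Ã¤skelÃ¤inen
```

Under this model each umlaut turns into two characters, which may be worth comparing with the display quoted in the original report.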



[bug #20252] [gnu.org #336933] RFC2047 header encoding bug

by John Tartar


Follow-up Comment #2, bug #20252 (project mhonarc):

And here's the exact message. Note the combination of
Chinese, English, and umlauts; unicode is the only answer.

http://www.mail-archive.com/grub-devel@.../msg02923.html
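
A small Python sketch of why a single-byte charset cannot hold such a mix. The sample string is invented for this note and is not taken from the linked message.

```python
def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of text is representable in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical sample mixing umlauts, Chinese and English.
mixed = "Jääskeläinen 中文 hello"

print(can_encode("Jääskeläinen", "iso-8859-1"))  # True
print(can_encode(mixed, "iso-8859-1"))           # False
print(can_encode(mixed, "utf-8"))                # True
```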



[bug #20252] [gnu.org #336933] RFC2047 header encoding bug

by John Tartar


Follow-up Comment #3, bug #20252 (project mhonarc):

I understand what you're trying to say, but I'm not sure you're correct.

First, Apache is returning the pages UTF-8 encoded:

HEAD /archive/html/grub-devel/2007-06/msg00004.html HTTP/1.0
Host: lists.gnu.org

HTTP/1.1 200 OK
Date: Tue, 26 Jun 2007 14:49:29 GMT
Server: Apache/2.0.51 (Fedora)
Last-Modified: Fri, 01 Jun 2007 21:19:30 GMT
ETag: "e3105a-11a8-c5549c80"
Accept-Ranges: bytes
Content-Length: 4520
Connection: close
Content-Type: text/html; charset=UTF-8
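
The charset can be pulled out of a Content-Type value like the one above with the Python standard library. This is a minimal sketch; the helper name declared_charset is made up for this note, and the header value is pasted in by hand rather than fetched over the network.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


print(declared_charset("text/html; charset=UTF-8"))  # utf-8
print(declared_charset("text/html"))                 # None
```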

Second, the encodings presented as entities between the two pages are
different. In the first URL, msg00004.html, the special characters are
written as Ã whereas in the second URL, they are written as ä.

The correct character has a Unicode code point of U+00E4, an iso-8859-1 encoding
of 0xe4, and a utf-8 encoding of 0xc3 0xa4. Given that, what I'm imagining has
happened is that in the first case, the UTF-8 characters are assumed to be
iso-8859-1, an 8 bit character encoding, and are written as the first byte of
the UTF-8 encoding; however in the second case, I'm supposing that it is
properly transcoding from utf-8 to latin1.

But I'm not very fluent in the internals of MHonArc. Thoughts?

-jag
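
The byte values mentioned in this comment, and the "first byte only" guess, can be checked with a short Python sketch. The function first_byte_as_latin1 is a made-up model of the suspected fault for this note; it is not MHonArc code.

```python
ch = "ä"
print(hex(ord(ch)))             # 0xe4
print(ch.encode("iso-8859-1"))  # b'\xe4'
print(ch.encode("utf-8"))       # b'\xc3\xa4'


def first_byte_as_latin1(text: str) -> str:
    """Keep only the first UTF-8 byte of each character and read it as Latin-1."""
    return "".join(c.encode("utf-8")[:1].decode("iso-8859-1") for c in text)


garbled = first_byte_as_latin1("Vesa Jääskeläinen")
print(garbled)  # Vesa JÃÃskelÃinen
```

The result matches the display quoted in the original report, so the guess is at least consistent with the symptom.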



[bug #20252] [gnu.org #336933] RFC2047 header encoding bug

by John Tartar


Follow-up Comment #4, bug #20252 (project mhonarc):


Hi, I'm the person who reported this problem to gnu.org sysadmins.

> Given that, what I'm imagining has happened is that in the first
> case, the UTF-8 characters are assumed to be iso-8859-1, an 8 bit
> character encoding, and are written as the first byte of the
> UTF-8 encoding; however in the second case, I'm supposing that it
> is properly transcoding from utf-8 to latin1.

Sounds plausible.  I suspect what makes it behave differently is that in the
first case, subject is base64-encoded (in the second it isn't).

I'm afraid I can't really help.  I have zero knowledge about MHonArc
internals (and am not very fluent in perl either).  Sorry.


[bug #20252] [gnu.org #336933] RFC2047 header encoding bug

by John Tartar


Follow-up Comment #5, bug #20252 (project mhonarc):

Erm, I meant to say From of course ;-)


Re: [bug #20252] [gnu.org #336933] RFC2047 header encoding bug

by Jeff Breidenbach

[ -savannah because I am lazy ]

Ok, well we do have proof that mhonarc is capable of doing the right thing
on the exact same message. I use the TEXTENCODE resource to send
everything to UTF-8, which is probably the recommended mhonarc way of
doing things these days anyway.

http://www.mhonarc.org/MHonArc/doc/resources/textencode.html
http://www.mhonarc.org/MHonArc/doc/rcfileexs/utf-8-encode.mrc.html

So while one could dive in deep and try to figure out what is going on,
another choice is to just try TEXTENCODE and see if it all magically
works. And if so, tell the bug tracking system. If that doesn't do the
trick, I don't know what to say other than "works for me".

[bug #20252] [gnu.org #336933] RFC2047 header encoding bug

by John Tartar


Follow-up Comment #6, bug #20252 (project mhonarc):

For me to do an accurate analysis, I first need access to
the original raw mail message.

Then, it will help to know what resource settings are
being used for the archive in question, since some resources
affect how mhonarc processes character sets.

A quick and dirty test is to run mhonarc (with default settings)
on just the message in question to see what happens.  If
the HTML created looks proper, then the problem is due
to some resource setting.  One example would be resource
settings that assume a single charset for all messages.

If the HTML looks bad, then one possibility is how the
email message is encoded, i.e., if the message does not
conform to email standards, things may not turn out
right.

Since character set processing may leverage different
Perl modules depending on what the given perl installation
provides, it is possible some module may be introducing
errors.

Of course, there may be a bug in MHonArc, but I cannot
tell without testing.  Since Jeff states that the message
can be rendered properly, we at least know something
does work properly :)

Note that even though Jeff states that using TEXTENCODE to
convert everything to UTF-8 can be done, and is generally a good idea,
it depends on the search engine that is being used
for the archives.  The gnu.org archives use Namazu, so
UTF-8 encoding is not an option in this case.

[bug #20252] [gnu.org #336933] RFC2047 header encoding bug

by John Tartar


Follow-up Comment #7, bug #20252 (project mhonarc):


I'm attaching a raw copy of the message that generated the bad html.

I don't have a mhonarc install to test it.  Is it possible to install and
process a single message right away without setting up MTA integration, etc?


(file #13182)
    _______________________________________________________

Additional Item Attachment:

File name: bad_message                    Size:3 KB


Re: [bug #20252] [gnu.org #336933] RFC2047 header encoding bug

by Jeff Breidenbach

> I don't have a mhonarc install to test it.  Is it possible to install and
> process a single message right away without setting up MTA integration, etc?

Yes.

Side note #1: I have the names of 564 gnu.org
and nongnu.org mailing lists that have been hand-checked
and determined to be completely overrun
by spam.  Is there anyone at the FSF I should give
these to?

Side note #2: mail-archive.com has kept secondary
archives for FSF lists with permission for some time now.
If it would be helpful, we'd be happy to add FSF branding to
those archives and swap primary/secondary roles with
the FSF-maintained archives.

Cheers,
Jeff

[bug #20252] [gnu.org #336933] RFC2047 header encoding bug

by John Tartar


Follow-up Comment #8, bug #20252 (project mhonarc):

Just download mhonarc from www.mhonarc.org, install, and
run.  Mhonarc is independent of any MTA.

To convert a single message, you can do:

mhonarc -single message.822 > message.html

I ran the above on the sample attached message, and
the output looks correct.  The name in question
got translated to:

Vesa Jääskeläinen

I'd need to see the resource settings used by the gnu.org
archives to see what may be wrong in the configuration.
My guess is they could be "flattening out" character handling
for lists, maybe for performance and/or security reasons.
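
The single-message test above could be wrapped in a small Python script so the output can be checked automatically. This is only a sketch: it assumes a mhonarc executable on the PATH, uses the -single option exactly as quoted above, and the helper names (build_command, convert_single, classify) are invented for this note.

```python
import html
import shutil
import subprocess
from typing import List, Optional

EXPECTED = "Vesa Jääskeläinen"
TRUNCATED = "Vesa JÃÃskelÃinen"


def build_command(message_path: str) -> List[str]:
    """Command line for converting one raw message, as quoted in the thread."""
    return ["mhonarc", "-single", message_path]


def convert_single(message_path: str) -> Optional[bytes]:
    """Run mhonarc on one message; return raw HTML bytes, or None if not installed."""
    if shutil.which("mhonarc") is None:
        return None
    result = subprocess.run(build_command(message_path),
                            stdout=subprocess.PIPE, check=True)
    return result.stdout


def classify(page: str) -> str:
    """Classify already-decoded HTML text after resolving character entities."""
    text = html.unescape(page)
    if EXPECTED in text:
        return "ok"
    if TRUNCATED in text:
        return "truncated"
    return "unknown"
```

The bytes returned by convert_single still need to be decoded with whatever charset the generated page declares before being passed to classify; that step is left out here because it depends on the resource settings in use.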

[bug #20252] [gnu.org #336933] RFC2047 header encoding bug

by John Tartar


Follow-up Comment #9, bug #20252 (project mhonarc):


Tried that, and got the same result (with 0xe4).  However, this is latin1,
and can't possibly work.  As Jeff said:

"Note the combination of Chinese, English, and umlauts; unicode is the only
answer."

The HTML generated at lists.gnu.org seems to be utf-8 but truncated at one
byte.  Is it possible that mhonarc handled this conversion, or do we have
another conversion tool in place here?


[bug #20252] [gnu.org #336933] RFC2047 header encoding bug

by John Tartar


Update of bug #20252 (project mhonarc):

                  Status:                    None => Works For Me          

    _______________________________________________________

Follow-up Comment #10:

> The HTML generated at lists.gnu.org seems to be utf-8 but
> truncated at one byte. Is it possible that mhonarc handled
> conversion, or do we have another conversion tool in place
> here?

Only the gnu.org folks can answer that.

At this time, I cannot confirm that this problem is
due to mhonarc.

You may want to submit a bug to the gnu folks directly on this.
I cannot provide much more help w/o knowing what resource
configuration they are using and if there is custom
processing that may introduce problems.

[bug #20252] [gnu.org #336933] RFC2047 header encoding bug

by John Tartar


Follow-up Comment #11, bug #20252 (project mhonarc):

Isn't it the same as bug #11187? Sounds similar at least.
(That one was fixed in mhonarc 2.6.11 while gnu.org uses 2.6.10.)


[bug #20252] [gnu.org #336933] RFC2047 header encoding bug

by John Tartar


Follow-up Comment #12, bug #20252 (project mhonarc):


We are still running MHonArc 2.6.10 on the GNU lists server. I've just tested
the conversion of the message with

  ./mhonarc -single input > output.html

which generates the faulty encoding with 2.6.10, but does the right thing
with 2.6.16.

Our lists server is due for a major software upgrade sometime this fall at
which point we will regenerate the html archives, which will resolve this
problem.

Thanks,
Ward.


[bug #20252] [gnu.org #336933] RFC2047 header encoding bug

by John Tartar


Follow-up Comment #13, bug #20252 (project mhonarc):

Cool, nice to see this will be fixed.

Best regards


[bug #20252] [gnu.org #336933] RFC2047 header encoding bug

by John Tartar


Follow-up Comment #14, bug #20252 (project mhonarc):

My guess is that the following bug (fixed in v2.6.11) is
the source of the problem:

https://savannah.nongnu.org/bugs/?11187

Since it appears the problem is fixed in a later
version of mhonarc, I'm going to close this item.

If the problem persists after GNU's software update,
we can reopen the issue.


[bug #20252] [gnu.org #336933] RFC2047 header encoding bug

by John Tartar


Update of bug #20252 (project mhonarc):

             Open/Closed:                    Open => Closed                
           Fixed Release:                         => N/A                    

