New "Unicode" bundle in the Review trunk

Dear all,

there's a new bundle called "Unicode" in the Review trunk. It is meant to be a place where we can gather any kind of scripts, commands, etc. related to general Unicode issues, meaning non-ASCII text. It should also be a place where we can gather scripts for specific languages like Japanese, Chinese, Greek etc. This bundle is the first stage; how to split it up is a future task.

So if anyone already has such scripts or is willing to help, please let us/me know.

So far it contains the following:

- Normalize according to the canonical (de)composition of accented characters
- Delete Diacritics: façadë έ だ => facade ε た
- Convert to a similar Unicode Character: type the letter 'c' to get a list of "cçćĉċčƈ¢ɕʗḉ⒞ⓒc¢"
- Convert to Greek Character: type 'n' to get "ν"
- Show Unicode Name: select some letters to get a list of their Unicode names, like LATIN SMALL LETTER A

I have many other scripts, but I need some time to polish them up.

To get this bundle, simply use the Subversion Bundle's checkout with

http://macromates.com/svn/Bundles/trunk/Review/Bundles/Unicode.tmbundle

and save it to the Desktop or wherever.

I know that dealing with non-ASCII scripts in TM 1.x is a bit tricky, but TM 2.0 will come ;)

Cheers,

--Hans

______________________________________________________________________
For new threads USE THIS: textmate@...
(threading gets destroyed and the universe will collapse if you don't)
http://lists.macromates.com/mailman/listinfo/textmate
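For readers wondering how a "Delete Diacritics" command can work: the following is only a minimal sketch in Python 3, not the bundle's actual script (which is not shown in this thread). It assumes the usual approach of decomposing to NFD and dropping combining marks; the function name `strip_diacritics` is made up for illustration.

```python
import unicodedata


def strip_diacritics(text):
    """Hypothetical helper: remove combining marks after canonical decomposition."""
    # NFD splits e.g. "ç" into "c" followed by U+0327 COMBINING CEDILLA.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero canonical combining class for marks.
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left so the result is in NFC again.
    return unicodedata.normalize("NFC", kept)


# The example string from the announcement above.
print(strip_diacritics("façadë έ だ"))  # prints: facade ε た
```

The same `unicodedata.normalize()` call, with "NFC" or "NFD" as the form, is also what a "Normalize" command would presumably rely on.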
|
|
Re: New "Unicode" bundle in the Review trunk

Hans-Joerg Bibiko wrote:
> there's a new bundle called "Unicode" in the review trunk.
> [...]
> - Show Unicode Name: select some letters to get a list of the Unicode
>   names like LATIN SMALL LETTER A
> [...]

One small note:

In the character name script you should probably call unicodedata.name() with a second argument in case the character has no name, i.e. replace

    res = a + " : " + unicodedata.name(a)

with

    res = a + " : " + unicodedata.name(a, "U+%04X" % ord(a))

Furthermore it would be great if this script could display all the information there is in the Python Unicode database, i.e. stuff like

    unicodedata.category()
    unicodedata.bidirectional()
    unicodedata.decimal()

etc.

Servus,
   Walter
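To make the suggestion concrete, here is a small Python 3 sketch (the thread itself uses Python 2) that collects the properties named above for one character. The function name `char_properties` and the dictionary layout are assumptions for illustration; the `unicodedata` calls are the standard-library ones.

```python
import unicodedata


def char_properties(ch):
    """Hypothetical helper: gather a few Unicode database fields for one character."""
    return {
        # The second argument is returned instead of raising ValueError
        # when the character has no name (e.g. control characters).
        "name": unicodedata.name(ch, "U+%04X" % ord(ch)),
        "category": unicodedata.category(ch),
        "bidirectional": unicodedata.bidirectional(ch),
        # decimal() also accepts a default for characters without a decimal value.
        "decimal": unicodedata.decimal(ch, None),
    }


for ch in "a7\n":
    print(char_properties(ch))
```

Passing a default to `name()` and `decimal()` keeps the loop from aborting on the first character that lacks the property.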
|
|
Re: New "Unicode" bundle in the Review trunk

On 30.05.2008, at 17:32, Walter Dörwald wrote:
> One small note:
>
> In the character name script you should probably call
> unicodedata.name() with a second argument in case the character has
> no name, i.e. replace
>
>     res = a + " : " + unicodedata.name(a)
>
> with
>
>     res = a + " : " + unicodedata.name(a, "U+%04X" % ord(a))

Thanks for the hint! These are more or less the first scripts which I wrote in Python ;) The reason is that Python ships with some Unicode data by default.

> Furthermore it would be great if this script could display all
> information there is in the Python Unicode database, i.e. stuff like
>
>     unicodedata.category()
>     unicodedata.bidirectional()
>     unicodedata.decimal()

Yes. I have such a script in Perl which also shows info about Unicode code points etc.

Servus,

--Hans
|
|
Re: New "Unicode" bundle in the Review trunk

Hans-Jörg Bibiko wrote:
> Thanks for the hint! These are more or less the first scripts which I
> wrote in python ;)
> Caused by the issue that python has installed some Unicode data per
> default.

Here's another patch (against the current version). It shows both the codepoint and the name.

BTW, you don't have to use a regular expression to split a string into characters; simply iterating through it does the trick:

Index: Commands/Show Unicode Names.tmCommand
===================================================================
--- Commands/Show Unicode Names.tmCommand (revision 9813)
+++ Commands/Show Unicode Names.tmCommand (working copy)
@@ -8,11 +8,13 @@
     <string>#!/usr/bin/python
 import unicodedata
 import sys
-import re

-for a in re.compile("(?um)(.)").split(unicode(sys.stdin.read(), "UTF-8")):
-    if (len(a)==1) and (a != '\n'):
-        res = a + " : " + unicodedata.name(a, "U+%04X" % ord(a))
+for a in unicode(sys.stdin.read(), "UTF-8"):
+    if a != '\n':
+        res = u"%s : U+%04X" % (a, ord(a))
+        name = unicodedata.name(a, None)
+        if name:
+            res += u" : %s" % name
         print res.encode("UTF-8")</string>
     <key>fallbackInput</key>
     <string>character</string>

>> Furthermore it would be great if this script could display all
>> information there is in the Python Unicode database, i.e. stuff like
>>
>>     unicodedata.category()
>>     unicodedata.bidirectional()
>>     unicodedata.decimal()
> Yes. I have such a script in Perl which also shows up info about Unicode
> code points etc.

OK, now I see that the script displays information about every character in the selection. Adding more info might be a space problem.

Another problem: Using Ctrl-Shift-U as the shortcut hides the "Convert To Lowercase" command.

Servus,
   Walter
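For anyone reading this on a current system: the patched loop above is Python 2 (`unicode()`, `print` statement). A rough Python 3 equivalent might look like the sketch below. The function name `describe` and returning a list instead of printing are assumptions made here so the logic can be tried outside TextMate.

```python
import unicodedata


def describe(text):
    """Hypothetical Python 3 port: one 'char : U+XXXX : NAME' line per character."""
    lines = []
    # Iterating over a str yields its characters; no regular expression needed.
    for ch in text:
        if ch == "\n":
            continue
        line = "%s : U+%04X" % (ch, ord(ch))
        # None as the default means "no name available".
        name = unicodedata.name(ch, None)
        if name:
            line += " : %s" % name
        lines.append(line)
    return lines


print("\n".join(describe("aν\n")))
```

In a TextMate command the input would come from `sys.stdin` rather than a literal string.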
|
|
Re: New "Unicode" bundle in the Review trunk

On 02.06.2008, at 00:04, Walter Dörwald wrote:
> Here's another patch (against the current version). It shows both
> the codepoint and the name.
>
> BTW, you don't have to use a regular expression to split a string
> into characters, simply iterating through it does the trick:
> [...]

Thanks! Just committed to the trunk.

> >> Furthermore it would be great if this script could display all
> >> information there is in the Python Unicode database, i.e. stuff like
> >> [...]
> > Yes. I have such a script in Perl which also shows up info about
> > Unicode code points etc.

Just added to the bundle a prototype of 'Show Unicode Properties'.

> Another problem: Using Ctrl-Shift-U as the shortcut hides the
> "Convert To Lowercase" command.

Yes. This was a bad key combo. I changed it temporarily to CTRL+OPT+APPLE+U.

BTW: Can Python handle Unicode code points in the planes beyond the BMP, meaning greater than U+FFFF? I tried it out and found that Python uses UTF-16 internally. But take e.g. UCS hex 20000, which is D840 DC00 in UTF-16: I can print that character to TM, but unicodedata fails because it expects one character, not two (?)

Servus,

--der Hans
|
|
Re: New "Unicode" bundle in the Review trunk

Sorry, but am I missing something? I get this error

-------------------------
Traceback (most recent call last):
  File "/tmp/temp_textmate.0WYiu4", line 50, in <module>
    result=dialog.menu([re.sub(r"(?=[^a-zA-Z0-9_ .\/\-\x7F-\xFF\n])", r'\\', a) + "\t" + unicodedata.name(a, "U+%04X" % ord(a)) for a in suggestions])
  File "/Applications/TextMate.app/Contents/SharedSupport/Support/lib/dialog.py", line 51, in menu
    plist = to_plist(menu)
UnboundLocalError: local variable 'menu' referenced before assignment
-------------------------

when I try "Convert to Greek..." or "Convert to Similar...".

Alexey Blinov

On Mon, Jun 2, 2008 at 3:09 AM, Hans-Jörg Bibiko <bibiko@...> wrote:
> [...]
> Thanks! Just committed to the trunk.
> [...]
> Just added to the bundle a prototype of 'Show Unicode Properties'
> [...]
|
|
Re: New "Unicode" bundle in the Review trunk

On 2 Jun 2008, at 15:26, Alexey Blinov wrote:
> Sorry but do i miss something? I have that error
> -------------------------
> Traceback (most recent call last):
> [...]
>   File "/Applications/TextMate.app/Contents/SharedSupport/Support/lib/dialog.py",
> line 51, in menu
>     plist = to_plist(menu)
> UnboundLocalError: local variable 'menu' referenced before assignment
> -------------------------
> when try to "Convert to Greek..." or "Convert to Similar..."

You have to upgrade dialog.py in /Applications/TextMate.app/Contents/SharedSupport/Support/lib. The old version didn't support UTF-8.

Cheers,

Hans
|
|
Re: New "Unicode" bundle in the Review trunk

Hans-Jörg Bibiko wrote:
> BTW: Can Python handle Unicode codepoints which are specified in Unicode
> pane B, meaning greater U+FFFF? I tried it out. I found out that Python
> uses UTF-16 internally.

At least the Python that ships with the OS uses 2-byte Unicode characters with partial UTF-16 support:

Python 2.5.2 (r252:60911, Apr 8 2008, 18:54:00)
[GCC 3.3.5 (Debian 1:3.3.5-13)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
65535

The size of a Unicode character is specified at compile time with the --enable-unicode option, so you *could* compile a wide Python with:

    ./configure --enable-unicode=ucs4

> But e.g. UCS hex: 20000 ; UTF-16: D840 DC00 .
> I can print that character to TM but unicodedata fails because it
> expects one character but not two (?)

There are some spots in the Python code base where, in narrow builds, surrogate pairs are interpreted properly as characters outside the BMP, but unicodedata isn't one of them (so it's not actually real UTF-16 throughout). There's an open issue on the Python bugtracker about that: http://bugs.python.org/issue1706460

So there are two options:

1) Apple starts compiling its Python with --enable-unicode=ucs4
2) Python gets fixed so that surrogate pairs can be passed to unicodedata functions.

I think I might give 2) a try.

Servus,
   Walter
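As an illustration of the surrogate-pair arithmetic discussed here, the sketch below combines the pair D840 DC00 from the example back into a single code point. It is written for Python 3, where `str` is no longer limited to 16-bit units; the helper name `combine_surrogates` is hypothetical and nothing like it is part of the bundle as far as this thread shows.

```python
import sys
import unicodedata


def combine_surrogates(high, low):
    """Hypothetical helper: turn a UTF-16 surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD840, 0xDC00)
print("U+%X" % code_point)  # prints: U+20000

# True on wide builds and on every Python 3.3+; False on a narrow Python 2.
print(sys.maxunicode > 0xFFFF)

# With a single-character string, unicodedata can look the character up.
print(unicodedata.name(chr(code_point), "<no name>"))
```

On a narrow Python 2 build a workaround along these lines would still need unicodedata to accept the result, which is exactly the limitation described above.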
|
|
Re: New "Unicode" bundle in the Review trunk

Walter Dörwald wrote:
> Hans-Jörg Bibiko wrote:
>
>> On 02.06.2008, at 00:04, Walter Dörwald wrote:
>>> Here's another patch (against the current version). It shows both the
>>> codepoint and the name.
>>> [...]

Here's another suggestion for the current Bundle version:

To get the UTF-8 bytes of a character, you're doing the following:

    print " UTF-8 : " + " ".join(repr(char.encode("UTF-8")).split('\\x')).lstrip("' ").rstrip("'").upper()

This only works for characters with a codepoint >= 128. The following code should work better:

    print " UTF-8 : %s" % " ".join(hex(ord(c))[2:].upper() for c in char)

Furthermore the code:

    decomp = unicodedata.decomposition(char).lstrip(' ').rstrip(' ')

can be simplified to:

    decomp = unicodedata.decomposition(char).strip()

(strip() strips from both ends, and stripping all whitespace is the default when no argument is given.)

Hope that helps.

Servus,
   Walter
|
|
Re: New "Unicode" bundle in the Review trunk

Walter Dörwald wrote:
> This only works for characters with a codepoint >= 128. The following
> code should work better:
>
>     print " UTF-8 : %s" % " ".join(hex(ord(c))[2:].upper() for c in char)

Oops, that was of course supposed to be:

    print " UTF-8 : %s" % " ".join(hex(ord(c))[2:].upper() for c in char.encode("utf-8"))

Servus,
   Walter
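For comparison, a Python 3 sketch of the same idea: iterating over a `bytes` object already yields integers there, so `ord()` is not needed. The function name `utf8_hex` is made up here, and the two-digit zero padding is a choice of this sketch, not something the thread's one-liner does (`hex(ord(c))[2:]` would print a newline as "A" rather than "0A").

```python
def utf8_hex(char):
    """Hypothetical helper: UTF-8 bytes of a character as space-separated hex."""
    # encode() returns bytes; iterating over bytes yields ints in Python 3.
    return " ".join("%02X" % byte for byte in char.encode("utf-8"))


print(" UTF-8 : %s" % utf8_hex("ç"))  # prints:  UTF-8 : C3 A7
```

This works for ASCII characters as well as for code points above U+FFFF, since the encoding step is done by `str.encode()`.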
|
|
Re: New "Unicode" bundle in the Review trunk

On 3 Jun 2008, at 17:29, Walter Dörwald wrote:
> Oops, that was of course supposed to be:
>
>     print " UTF-8 : %s" % " ".join(hex(ord(c))[2:].upper()
>     for c in char.encode("utf-8"))

Once again, thanks a lot for teaching me Python ;) The code changes are in the SVN trunk.

--Hans
|
|
Re: New "Unicode" bundle in the Review trunk

On 03.06.2008, at 17:29, Walter Dörwald wrote:
> Oops, that was of course supposed to be:
>
>     print " UTF-8 : %s" % " ".join(hex(ord(c))[2:].upper()
>     for c in char.encode("utf-8"))

Could it be that this isn't allowed in Python for Tiger? I get an error for invalid syntax referring to 'for'.

Mac OSX 10.4.11 ppc; Python 2.4.2

On my 10.5.3 Mac it works(?)

--Hans
|
|
Re: New "Unicode" bundle in the Review trunk

Hans-Jörg Bibiko wrote:
> Could it be that this isn't allowed in Python for Tiger? I get an error
> for invalid syntax referring to 'for'
> Mac OSX 10.4.11 ppc; Python 2.4.2
>
> On my 10.5.3 Mac it works(?)

AFAICR Tiger has Python 2.3, which didn't support generator expressions. The following should work:

    print " UTF-8 : %s" % " ".join([hex(ord(c))[2:].upper() for c in char.encode("utf-8")])

(i.e. replace the generator expression with a list comprehension by adding [] around the join argument.)

Hope that helps!

Servus,
   Walter
|
|
Re: New "Unicode" bundle in the Review trunk

On 03.06.2008, at 22:44, Walter Dörwald wrote:
>     print " UTF-8 : %s" % " ".join([hex(ord(c))[2:].upper()
>     for c in char.encode("utf-8")])

Thanks. This did the trick. I thought that I did try out this [ ]-notation, but anyway ... the main thing is that it works :)

--Hans
|
|
Re: New "Unicode" bundle in the Review trunk

Hi,

there are some more commands available.

Furthermore I wrote some basic syntax highlighting stuff to display 'no ASCII', 'no Latin', and 'all combining diacritics' characters.

Most of the commands also support Unicode code points higher than U+FFFF. [But be careful! Up to now TM 1 can display these characters (more or less), but each such character counts as two characters in TM 1! If you place the caret in between and invoke a command, TM will crash immediately! But TM 2 supports these chars ;) ]

I wrote a new Chinese Traditional <> Simplified Converter. It also converts characters > U+FFFF (Apple's doesn't ;) ), and it shows all those characters which have more than one counterpart as snippets {A=B|C}. There's a command which shows a menu displaying B and C etc. to disambiguate (Apple does not do that). All Unicode data come from the latest Unicode 5.1, and it's easy to upgrade.

Show Unicode Properties also shows all known information about Chinese/Japanese/Korean ideographs, like radical, readings, Wubi Xing codes, etc. All these data come from Apple's Character Palette internals ;) But I'm thinking about integrating Unicode's UniHan database. This zip file (6MB) won't be part of that bundle. Anyone who wants to use it can download it (I will provide a command for that).

Last but not least I want to say thank you to Walter Dörwald, who helped me a lot with the Python scripts.

Cheers,

--Hans