k means


by Christophe Genolini

Hi devel list,

I am using k-means with a non-standard distance. As far as I can see,
the function kmeans can use four different algorithms, but not a
user-defined distance.

In addition, kmeans cannot handle missing values, although there are
several ways k-means could deal with them; one is to use a distance
that takes missing values into account, such as a distance with the
Gower adjustment (which is what the regular dist() function in R
uses).
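
To make the point about dist() concrete, here is a small sketch in base
R. The vectors a and b are made-up illustration data. It compares the
value returned by dist() with a hand computation that drops the column
containing NA and rescales the sum of squares by the proportion of
columns used, and it checks that kmeans() refuses the same data.

```r
a <- c(1, 2, NA, 4)
b <- c(1, 1, NA, 1)

# dist() drops any column where either value is missing, then scales the
# sum of squares up by (total columns) / (columns actually used).
d_na <- as.numeric(dist(rbind(a, b)))

# The same quantity computed by hand from the complete columns only
used <- !is.na(a) & !is.na(b)
d_manual <- sqrt(sum((a[used] - b[used])^2) * length(a) / sum(used))

print(all.equal(d_na, d_manual))

# kmeans(), by contrast, stops with an error when the data contain NA
res <- tryCatch(kmeans(rbind(a, b, a + 1, b + 1), centers = 2),
                error = function(e) e)
print(inherits(res, "error"))
```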

So would it be possible to adapt kmeans to let the user supply a
'distance to use' argument?

Christophe

______________________________________________
R-devel@... mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: k means

by Bill.Venables

I would not support an extension of kmeans to do this.  I think it is
best left simple and fast as it now is.  

I can think of three ways you might handle your problem:

1. Use, for example, pam() in the cluster package, which does a similar
job to kmeans (not quite the same, of course) with a general distance
measure.

2. If you are working with a non-standard metric and you really want to
use the k-means algorithm, then perhaps one way to do so is to first
compute an approximate Euclidean coordinatisation of the points with
multidimensional scaling and then use kmeans.  (e.g. cmdscale,
isoMDS, sammon, ...)  I've no idea what the traps are with this
approach, but it seems kind of feasible.

3. If the algorithms are there and available as you say, write the code
yourself and contribute it to the R-project as a simple package.
Everyone will benefit.
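
Options 1 and 2 might look roughly like the following sketch. The toy
data, the choice of Manhattan distance as a stand-in for the
non-standard metric, and the variable names are all assumptions made
for illustration; pam() comes from the cluster package, while
cmdscale() and kmeans() are in the stats package.

```r
library(cluster)  # provides pam()

set.seed(1)
# Toy data: two well-separated groups of 10 points in 3 dimensions
x <- rbind(matrix(rnorm(30, mean = 0), ncol = 3),
           matrix(rnorm(30, mean = 5), ncol = 3))

# Any dissimilarity could be used here; Manhattan is only a placeholder
d <- dist(x, method = "manhattan")

# Option 1: partitioning around medoids, directly on the dissimilarities
fit_pam <- pam(d, k = 2)
print(table(fit_pam$clustering))

# Option 2: approximate Euclidean coordinates from classical MDS,
# followed by ordinary kmeans on those coordinates
coords <- cmdscale(d, k = 2)
fit_km <- kmeans(coords, centers = 2, nstart = 10)
print(table(fit_km$cluster))
```

Note that option 2 clusters the embedded coordinates, so its result is
only as good as the MDS approximation of the original distances.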


Bill Venables
CSIRO Laboratories
PO Box 120, Cleveland, 4163
AUSTRALIA
Office Phone (email preferred): +61 7 3826 7251
Fax (if absolutely necessary):  +61 7 3826 7304
Mobile:                         +61 4 8819 4402
Home Phone:                     +61 7 3286 7700
mailto:Bill.Venables@...
http://www.cmis.csiro.au/bill.venables/ 


Re: k means

by Friedrich Leisch

>>>>> On Mon, 12 May 2008 19:24:55 +0200,
>>>>> cgenolin  (c) wrote:

  > Hi devel list,
  > I am using k-means with a non-standard distance. As far as I can see,
  > the function kmeans can use four different algorithms, but not a
  > user-defined distance.

  > In addition, kmeans cannot handle missing values, although there are
  > several ways k-means could deal with them; one is to use a distance
  > that takes missing values into account, such as a distance with the
  > Gower adjustment (which is what the regular dist() function in R
  > uses).

  > So would it be possible to adapt kmeans to let the user supply a
  > 'distance to use' argument?

As Bill Venables already pointed out, that does not make much sense,
especially as there are already R functions for doing that. Package
flexclust implements a k-means-type clustering algorithm where the
user can provide arbitrary distance measures; have a look at

     http://www.stat.uni-muenchen.de/~leisch/papers/Leisch-2006.pdf

The code you need to write for using a new distance measure is
minimal, and there are two examples in the paper describing in detail
what needs to be done.
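
For readers unfamiliar with the idea, a k-means-type algorithm with a
pluggable distance can be sketched in a few lines of base R. This is
not the flexclust implementation: kcentroids_sketch, dist_manhattan and
cent_median are hypothetical names invented for this illustration, and
the toy data are made up. The sketch only shows the general scheme of
alternating between assigning points to the nearest centroid and
recomputing the centroids.

```r
# Hypothetical helper, not part of flexclust: a bare-bones k-centroids loop.
# dist_fun(x, centers) must return an nrow(x) by nrow(centers) matrix;
# cent_fun(x) must return one centroid (a vector) for the rows of x.
kcentroids_sketch <- function(x, k, dist_fun, cent_fun, max_iter = 100) {
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # random rows as start
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    d <- dist_fun(x, centers)
    new_cluster <- max.col(-d, ties.method = "first")  # nearest centroid
    if (identical(new_cluster, cluster)) break         # assignments stable
    cluster <- new_cluster
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      # keep the previous centroid if a cluster becomes empty
      if (nrow(members) > 0) centers[j, ] <- cent_fun(members)
    }
  }
  list(cluster = cluster, centers = centers)
}

# Manhattan distance from every row of x to every centroid
dist_manhattan <- function(x, centers) {
  z <- matrix(0, nrow = nrow(x), ncol = nrow(centers))
  for (j in seq_len(nrow(centers))) {
    z[, j] <- rowSums(abs(sweep(x, 2, centers[j, ])))
  }
  z
}

# Column-wise medians are the matching centroid for Manhattan distance
cent_median <- function(x) apply(x, 2, median)

set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 6), ncol = 2))
fit <- kcentroids_sketch(x, k = 2, dist_fun = dist_manhattan,
                         cent_fun = cent_median)
print(table(fit$cluster))
```

The distance and centroid functions have to match each other (medians
for Manhattan distance, means for squared Euclidean distance), otherwise
the centroid update is not guaranteed to reduce the within-cluster
distances.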

Hope this helps,
Fritz Leisch

--
-----------------------------------------------------------------------
Prof. Dr. Friedrich Leisch

Institut für Statistik                          Tel: (+49 89) 2180 3165
Ludwig-Maximilians-Universität                  Fax: (+49 89) 2180 5308
Ludwigstraße 33
D-80539 München                     http://www.statistik.lmu.de/~leisch
-----------------------------------------------------------------------
   Journal Computational Statistics --- http://www.springer.com/180 
          Münchner R Kurse --- http://www.statistik.lmu.de/R


Re: k means

by Christophe Genolini

Hi list,

I tried flexclust, but I cannot see what is wrong with my
(very simple) code...
Would you have a few minutes to check it?

Thanks for your help.

Christophe
--- 8< --------------------------------
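library(flexclust)  # provides kccaFamily() and kcca()

# Toy data with missing values
data  <- rbind(c(1,2 ,NA,4 ),
               c(1,1 ,NA,1 ),
               c(2,3 ,4 ,5 ),
               c(2,2 ,2 ,2 ),
               c(3,NA,NA,6 ),
               c(3,NA,NA,3 ),
               c(2,4 ,4 ,NA),
               c(2,3 ,2 ,NA))

# Two test centers for checking the distance function
distTest <- rbind(c(0,0,0,0),
                  c(1,1,1,1))

# Distance from every row of x to every row of centers, computed with
# dist() so that missing values are taken into account
distNA <- function(x,centers){
    z <- matrix(0,nrow=nrow(x),ncol=nrow(centers))
    for(k in 1:nrow(centers)){
        z[,k]<- apply(x,1,function(x){dist(rbind(x,centers[k,]))})
    }
    z
}

distNA(data,distTest)

km <- kccaFamily(dist=distNA,cent=colMeans)
kcca(x=data,k=2,family=km)
kcca(x=data,k=3,family=km)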

--- 8< --------------------------------







Re: k means

by Friedrich Leisch

>>>>> On Sat, 17 May 2008 00:54:55 +0200,
>>>>> cgenolin  (c) wrote:

  > Hi list,
  > I tried flexclust, but I cannot see what is wrong with my
  > (very simple) code...
  > Would you have a few minutes to check it?

  > Thanks for your help.

  > Christophe
  > --- 8< --------------------------------
  > data  <- rbind(c(1,2 ,NA,4 ),
  >                c(1,1 ,NA,1 ),
  >                c(2,3 ,4 ,5 ),
  >                c(2,2 ,2 ,2 ),
  >                c(3,NA,NA,6 ),
  >                c(3,NA,NA,3 ),
  >                c(2,4 ,4 ,NA),
  >                c(2,3 ,2 ,NA))

  > distTest <- rbind(c(0,0,0,0),
  >                   c(1,1,1,1))

  > distNA <- function(x,centers){
  >     z <- matrix(0,nrow=nrow(x),ncol=nrow(centers))
  >     for(k in 1:nrow(centers)){
  >         z[,k]<- apply(x,1,function(x){dist(rbind(x,centers[k,]))})
  >     }
  >     z
  > }

  > distNA(data,distTest)

  > km <- kccaFamily(dist=distNA,cent=colMeans)
  > kcca(x=data,k=2,family=km)
  > kcca(x=data,k=3,family=km)

I don't think this is really appropriate for r-devel; you should ask
either the package author (me) or r-help.

Anyway, colMeans does not remove missing values by default, so you
also need a special function for the centroid computation:

R> centNA <- function(x) colMeans(x, na.rm=TRUE)
R> km <- kccaFamily(dist=distNA,cent=centNA)
R> kcca(x=data,k=2,family=km)
kcca object of family 'distNA'

call:
kcca(x = data, k = 2, family = km)

cluster sizes:

1 2
5 3
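
The colMeans behaviour behind this fix can be checked in isolation with
base R. The matrix m below is a made-up example that reuses three rows
of the data above; all_na is likewise only for illustration.

```r
m <- rbind(c(1, 2, NA, 4),
           c(1, 1, NA, 1),
           c(2, 3,  4, 5))

# Default: a single missing value makes the whole column mean NA
plain <- colMeans(m)
print(plain)

# With na.rm = TRUE the mean is taken over the available values only
cent <- colMeans(m, na.rm = TRUE)
print(cent)

# Caveat: a column with no observed values at all still yields NaN
all_na <- colMeans(matrix(NA_real_, nrow = 2, ncol = 1), na.rm = TRUE)
print(all_na)
```

The last case matters for small clusters: if every member of a cluster
is missing in some column, the centroid has NaN there, so the distance
function must tolerate missing values in the centers as well.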


Hope this helps,
Fritz
