Smooth histograms


Smooth histograms

by Philipp K. Janert


I just submitted a patch (1970923) which
draws a smooth histogram-like curve
for a random collection of points, using
a Gaussian kernel density estimation
algorithm.

Demos are found here:
        www.philipp-janert.com/kdensity

The new method has the following advantages
over the classic way of generating histograms
using "smooth frequency":
- the resulting histogram is a smooth curve,
        making the effect of binning less severe
- it handles intermediate "bins" with no points
        in them gracefully. (smooth freq does so
        only if used "with boxes", but if you use
        "with lines" for example, the line will not
        drop to zero if an intermediate bin is
        empty)
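
The second point can be illustrated with a small sketch. It is written in Python rather than gnuplot, and the sample values, the bin width and the helper name `polyline` are made up for illustration; it only mimics the idea of summing points per bin and is not the gnuplot implementation.

```python
import math
from collections import Counter

data = [0.05, 0.12, 0.18, 0.55, 0.61]   # made-up sample
width = 0.1                              # made-up bin width

# Count points per bin. Only bins that contain at least one point
# get an entry, so the empty bins 2, 3 and 4 are simply absent.
counts = Counter(math.floor(x / width) for x in data)
occupied = sorted(counts)

def polyline(bin_index):
    """Height of a straight line drawn between the nearest occupied bins."""
    lo = max(b for b in occupied if b <= bin_index)
    hi = min(b for b in occupied if b >= bin_index)
    if lo == hi:
        return float(counts[lo])
    t = (bin_index - lo) / (hi - lo)
    return counts[lo] + t * (counts[hi] - counts[lo])

print(occupied)       # bins that contain points
print(counts[3])      # true count in the empty bin 3
print(polyline(3))    # what a connecting line shows at bin 3
```

In this sketch the connecting line passes through 1.5 at bin 3 although no point falls into that bin.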

The method is invoked like a weighted smoothing
algorithm:
        plot "data" u 1:(1):(1) smooth kdensity
where the 2nd parameter is the weight of each
point and the 3rd parameter is the bandwidth
to be used.
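
For readers who want to see what a weighted Gaussian kernel density estimate computes, here is a minimal sketch in Python. It is not the code from the patch: the function name `kdensity`, the sample data and the grid are assumptions, and the normalization (each kernel has an area equal to its weight) is the textbook choice, which may differ in detail from what the patch does.

```python
import math

def kdensity(points, weights, bandwidths, x):
    """Sum of weighted Gaussian kernels, one per data point, evaluated at x."""
    total = 0.0
    for xi, wi, hi in zip(points, weights, bandwidths):
        z = (x - xi) / hi
        # Gaussian with standard deviation hi, scaled by the weight wi
        total += wi * math.exp(-0.5 * z * z) / (hi * math.sqrt(2.0 * math.pi))
    return total

data = [0.1, 0.2, 0.25, 0.7]          # made-up sample
weights = [1.0] * len(data)           # corresponds to the (1) weight column
bandwidths = [0.05] * len(data)       # made-up bandwidth

step = 0.01
grid = [i * step for i in range(-50, 151)]          # x from -0.5 to 1.5
curve = [kdensity(data, weights, bandwidths, x) for x in grid]
area = sum(curve) * step                            # Riemann sum of the curve
```

With a weight of 1 per point, the curve in this sketch integrates to the number of points; a weight of 1/N would give a curve with unit area instead.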

This patch complements the "smooth cumulative"
algorithm as another way to visualize the
distribution of a collection of random points.

Comments and suggestions are welcome.

Best,

                Ph.

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
gnuplot-beta mailing list
gnuplot-beta@...
https://lists.sourceforge.net/lists/listinfo/gnuplot-beta

Re: Smooth histograms

by plotter

On Sat, 24 May 2008 01:15:37 +0200, Philipp K. Janert <janert@...>  
wrote:

> I just submitted a patch (1970923) which
> draws a smooth histogram-like curve
> for a random collection of points, using
> a Gaussian kernel density estimation
> algorithm.
> Demos are found here:
> www.philipp-janert.com/kdensity


very interesting.

My initial impression on looking at your top left example is that there is  
a phase shift of +half a box in x most visible in the 0.01 and 0.05 plots.

Considering that x=0 is in the middle of the first box it appears that the  
fits are responding in a way that aligns with the right of each box.

It's a bit subjective due to the nature of the data, but this is my  
impression for the peaks at 0.2, 0.4 and 0.5.

Maybe you could test this effect with a rapid change in the data.


I also think there is an edge effect at the beginning and end of the data.  
This is a common problem when applying this sort of technique to image  
data: how to deal with edges when the kernel goes outside the data. There  
are several "solutions" which involve falsely extending the data, but  
applying a kernel to an incomplete sample range is effectively filling it  
with zeros and is equally false.

It's like a running mean, which cannot be meaningful up to the edges of the  
sample range since there are not enough samples to take the mean over. This  
also gives an artificial drop-off at the edges. This is also rather marked  
near the origin in your lognormal example.

In image processing it's just a case of prettying up the edges but in a  
scientific context this is clearly not appropriate.

I think the only rigorous way to deal with this is not to plot the part  
where the data is incomplete.
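
The running-mean analogy can be made concrete with a short Python sketch. The function names and the constant test signal are invented for illustration; the first variant treats samples outside the range as zero, the second only returns values where the full window fits inside the data. An odd window length is assumed.

```python
def running_mean_zero_padded(values, window):
    """Running mean that treats samples outside the data range as zero."""
    half = window // 2
    out = []
    for i in range(len(values)):
        total = 0.0
        for j in range(i - half, i + half + 1):
            if 0 <= j < len(values):
                total += values[j]
        out.append(total / window)
    return out

def running_mean_valid(values, window):
    """Running mean only where the whole window lies inside the data."""
    half = window // 2
    return [sum(values[i - half:i + half + 1]) / window
            for i in range(half, len(values) - half)]

flat = [1.0] * 7                               # constant signal, no real drop-off
padded = running_mean_zero_padded(flat, 3)     # same length, sags at both ends
valid = running_mean_valid(flat, 3)            # shorter, but flat throughout
```

For the constant signal, the zero-padded result falls to 2/3 at both ends even though the data does not drop there, while the restricted version stays at 1 and is simply one sample shorter at each end.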

I hope the comments are useful.

best regards, Peter.


Re: Smooth histograms

by Philipp K. Janert

On Saturday 24 May 2008 00:11, you wrote:

> On Sat, 24 May 2008 01:15:37 +0200, Philipp K. Janert <janert@...>
>
> wrote:
> > I just submitted a patch (1970923) which
> > draws a smooth histogram-like curve
> > for a random collection of points, using
> > a Gaussian kernel density estimation
> > algorithm.
> > Demos are found here:
> > www.philipp-janert.com/kdensity
>
> very interesting.
>
> My initial impression on looking at your top left example is that there is
> a phase shift of +half a box in x most visible in the 0.01 and 0.05 plots.
>
The edge effect is actually in the histogram,
not in the kernel density (yet another advantage
of k-densities over histograms: the annoying
bin-placement problem goes away).
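
The bin-placement problem can be seen in a few lines of Python. The sample values and the bin width are made up; the only difference between the two histograms is whether the bin edges or the bin centres sit at multiples of the width.

```python
import math
from collections import Counter

data = [0.04, 0.06, 0.14, 0.16]   # made-up sample
width = 0.1                        # made-up bin width

# Bins [0, 0.1), [0.1, 0.2), ...: edges at multiples of the width.
left_aligned = Counter(math.floor(x / width) for x in data)

# Bins [-0.05, 0.05), [0.05, 0.15), ...: centres at multiples of the width.
centred = Counter(math.floor(x / width + 0.5) for x in data)

print(sorted(left_aligned.items()))
print(sorted(centred.items()))
```

The same four points give two bins of two points in the first case and a 1-2-1 pattern in the second, which is a shift of half a box. A kernel density estimate has no bin origin to choose, only a bandwidth.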

The code does what all current gnuplot
smoothing algos do: they stop at the min
and max data point in the sample. I think
this is reasonable.

Best,

                Ph.


Re: Smooth histograms

by plotter

On Mon, 26 May 2008 01:41:31 +0200, Philipp K. Janert <janert@...>  
wrote:

> On Saturday 24 May 2008 00:11, you wrote:
>> On Sat, 24 May 2008 01:15:37 +0200, Philipp K. Janert <janert@...>
>>
>> wrote:
>> > I just submitted a patch (1970923) which
>> > draws a smooth histogram-like curve
>> > for a random collection of points, using
>> > a Gaussian kernel density estimation
>> > algorithm.
>> > Demos are found here:
>> > www.philipp-janert.com/kdensity
>>
>> very interesting.
>>
>> My initial impression on looking at your top left example is that there is
>> a phase shift of +half a box in x most visible in the 0.01 and 0.05 plots.
>>
>  The edge effect is actually in the histogram,
> not in the kernel density (yet another advantage
> of k-densities over histograms: the annoying
> bin-placement problem goes away).

hmm, never been much of a fan of bins and histograms, that's probably why.  
More for sociologists and economists.

>
> The code does what all current gnuplot
> smoothing algos do: they stop at the min
> and max data point in the sample. I think
> this is reasonable.
>

Well, I'm not sure that is comparable. IIRC all the "smoothing" algos  
(apart from unique) are splines; these are calculated over 4 data points.  
In fact they would require just one point outside the data range at each  
end. I have not looked at how they are dealt with, but it is unlikely to be  
important for one point.

However, techniques using a kernel require half the kernel width outside  
each end of the data range.
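
To put a rough number on this, the following Python sketch computes how much of a Gaussian kernel centred at x lies inside the data range. The function names, the range [0, 1] and the bandwidth 0.05 are assumptions for illustration. A Gaussian kernel has no finite width, so "half the kernel width" is read here as a few bandwidths.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def kernel_mass_inside(x, lo, hi, bandwidth):
    """Fraction of a Gaussian kernel centred at x that lies inside [lo, hi]."""
    return normal_cdf((hi - x) / bandwidth) - normal_cdf((lo - x) / bandwidth)

lo, hi, h = 0.0, 1.0, 0.05                         # made-up range and bandwidth
at_edge = kernel_mass_inside(lo, lo, hi, h)        # kernel centred on the edge
near_edge = kernel_mass_inside(lo + 2 * h, lo, hi, h)
in_middle = kernel_mass_inside(0.5, lo, hi, h)
```

At the edge of the range half of the kernel mass lies outside, so for evenly spread data the estimate there is expected to drop to roughly half of its interior value; two bandwidths inside the range the deficit is already only about 2%.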

I would guess by looking at your examples that the missing data are  
initialised as zero. Is that correct?

Sorry to be a stickler for detail, it must be my rigorous physics  
training coming out. As people become less and less aware of what all  
these software tools are actually doing for them, it becomes more and more  
important that they do not introduce distortions.

Don't think I'm knocking your efforts; I'm pretty impressed overall.


best regards, Peter.




Re: Smooth histograms

by plotter

On Mon, 26 May 2008 01:41:31 +0200, Philipp K. Janert <janert@...>  
wrote:

> The code does what all current gnuplot
> smoothing algos do: they stop at the min
> and max data point in the sample. I think
> this is reasonable.
> Best,
> Ph.

What I would suggest is that it only produces a plot line over the range  
where the data is complete.

If the sample is large enough for this to be negligible, it won't be  
noticed anyway.

If it is significant in relation to the data sample, the plot line will  
stop at the point where it is no longer mathematically valid. That would  
seem to be the correct thing to do.

I see little justification for extending it beyond that range.
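
One possible way to implement this suggestion, sketched in Python: trim the plotted x range by a few bandwidths at each end. The function name `trusted_range` and the cutoff of three bandwidths are assumptions; since a Gaussian kernel never reaches exactly zero, some cutoff has to be chosen.

```python
def trusted_range(points, bandwidth, cutoff=3.0):
    """Return the x range that lies at least cutoff * bandwidth inside the
    data, or None if the sample is too narrow for any such range."""
    lo = min(points) + cutoff * bandwidth
    hi = max(points) - cutoff * bandwidth
    return (lo, hi) if lo < hi else None

wide = trusted_range([0.0, 0.3, 0.6, 1.0], 0.05)   # range much wider than kernel
narrow = trusted_range([0.0, 0.1, 0.2], 0.05)      # range comparable to kernel
```

For a sample that is wide compared with the bandwidth the trimmed margin is small, and for a sample that is too narrow the function returns None, meaning nothing would be plotted.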


It must be a trivial change to make if you accept the principle.

best regards, Peter.
