comment on my erlang Spamfilter

View: New views
6 Messages — Rating Filter:   Alert me  

comment on my erlang Spamfilter

by hask ellian :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

I made a simple spamfilter in Erlang. It takes 2 files with previous spam and good emails and then counts how many times the most frequent words from the spammy emails and the good emails occurs and then calculates the quote spam/(spam+good) in the file you want to test and returns  a number between 0 and 1.
It could easily be improved in numerous ways but the main point for me was to learn Erlang. This isn't exactly what Erlang is for but it s way to get started.
I'd be happy to receive comments on the Erlang-ness of the code and improvements.
File I/O seems slow, is there a better way? In Haskell it is fairly instant.


-module(antispam).
-export([take/2,count/2,count_all/1,most_common/2,count_set_in_list/2,
classify/0,readfile/1]).

take(N,List) -> i_take(N,List,0,[]).
    i_take(N,List,Count,Acc) ->
     if Count < N andalso List /= [] ->
         i_take(N,tl(List),Count+1,Acc++[hd(List)]);
        Count == N ->
         Acc;
        true ->
         []
     end.

count(Tok,List) -> i_count(Tok,List,0).  
    i_count(Tok,List,Acc) ->
        if Tok == hd(List) andalso List /= [] ->
            i_count(Tok,tl(List),Acc+1);
           Tok /= hd(List) andalso List /= [] ->
            i_count(Tok,tl(List),Acc);
           true ->
            Acc
    end.

count_all(List) ->
    Unique = lists:usort(List),
    [{U, count(U, List)} || U <- Unique].

count_set_in_list(Set,List) ->
    S = [{S, count(S, List)} || S <- Set],
    lists:sum(lists:map(fun({H,T}) -> T end, S)).

most_common(Stringlist,Xmost) ->
    No_preps = lists:filter(fun(X) -> length(X) > 4 end, Stringlist),
    Sorted_by_count = lists:keysort(2, count_all(No_preps)),
    TakeX = take(Xmost, lists:reverse(Sorted_by_count)),
    lists:map(fun({H,T}) -> H end, TakeX).

readfile(FileName) ->
    {ok, Binary} = file:read_file(FileName),
    string:tokens(binary_to_list(Binary), " ").

classify() ->
    GoodWords = most_common(readfile("C:/Users/saftarn/Desktop/emails/okemails.txt"), 20),
    BadWords  = most_common(readfile("C:/Users/saftarn/Desktop/emails/spam.txt"), 20),
    GoodCount = count_set_in_list(GoodWords, readfile("C:/Users/saftarn/Desktop/emails/test.txt")),
    BadCount  = count_set_in_list(BadWords,  readfile("C:/Users/saftarn/Desktop/emails/test.txt")),
    T = BadCount + GoodCount,
    if T /= 0 ->
    BadCount / T;
       true ->
        0.5
    end.


_______________________________________________
erlang-questions mailing list
erlang-questions@...
http://www.erlang.org/mailman/listinfo/erlang-questions

Re: comment on my erlang Spamfilter

by Lev Walkin :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

hask ellian wrote:

> I made a simple spamfilter in Erlang. It takes 2 files with previous
> spam and good emails and then counts how many times the most frequent
> words from the spammy emails and the good emails occurs and then
> calculates the quote spam/(spam+good) in the file you want to test and
> returns  a number between 0 and 1.
> It could easily be improved in numerous ways but the main point for me
> was to learn Erlang. This isn't exactly what Erlang is for but it s way
> to get started.
> I'd be happy to receive comments on the Erlang-ness of the code and
> improvements.
> File I/O seems slow, is there a better way? In Haskell it is fairly instant.
>
>
> -module(antispam).
> -export([take/2,count/2,count_all/1,most_common/2,count_set_in_list/2,
> classify/0,readfile/1]).
>
> take(N,List) -> i_take(N,List,0,[]).
>     i_take(N,List,Count,Acc) ->
>      if Count < N andalso List /= [] ->
>          i_take(N,tl(List),Count+1,Acc++[hd(List)]);
>         Count == N ->
>          Acc;
>         true ->
>          []
>      end.

Consider using lists:sublist() instead of reimplementing
the standard library methods.

> count(Tok,List) -> i_count(Tok,List,0).  
>     i_count(Tok,List,Acc) ->
>         if Tok == hd(List) andalso List /= [] ->
>             i_count(Tok,tl(List),Acc+1);
>            Tok /= hd(List) andalso List /= [] ->
>             i_count(Tok,tl(List),Acc);
>            true ->
>             Acc
>     end.

Consider using

        length([T == Tok || T <- List])

which is much more concise and manageable.

> count_all(List) ->
>     Unique = lists:usort(List),
>     [{U, count(U, List)} || U <- Unique].

This one has O(N^2) complexity. Consider using ets if the data
set is larger than, say, 1000 elements.

> count_set_in_list(Set,List) ->
>     S = [{S, count(S, List)} || S <- Set],
>     lists:sum(lists:map(fun({H,T}) -> T end, S)).

count_set_in_list(Set, List) ->
        lists:sum([count(S, List) || S <- Set]).

> most_common(Stringlist,Xmost) ->
>     No_preps = lists:filter(fun(X) -> length(X) > 4 end, Stringlist),
>     Sorted_by_count = lists:keysort(2, count_all(No_preps)),
>     TakeX = take(Xmost, lists:reverse(Sorted_by_count)),
>     lists:map(fun({H,T}) -> H end, TakeX).



> readfile(FileName) ->
>     {ok, Binary} = file:read_file(FileName),
>     string:tokens(binary_to_list(Binary), " ").
>
> classify() ->
>     GoodWords =
> most_common(readfile("C:/Users/saftarn/Desktop/emails/okemails.txt"), 20),
>     BadWords  =
> most_common(readfile("C:/Users/saftarn/Desktop/emails/spam.txt"), 20),
>     GoodCount = count_set_in_list(GoodWords,
> readfile("C:/Users/saftarn/Desktop/emails/test.txt")),
>     BadCount  = count_set_in_list(BadWords,  
> readfile("C:/Users/saftarn/Desktop/emails/test.txt")),
>     T = BadCount + GoodCount,
>     if T /= 0 ->
>     BadCount / T;
>        true ->
>         0.5
>     end.
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@...
> http://www.erlang.org/mailman/listinfo/erlang-questions

_______________________________________________
erlang-questions mailing list
erlang-questions@...
http://www.erlang.org/mailman/listinfo/erlang-questions

Re: comment on my erlang Spamfilter

by Hynek Vychodil :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message



On Thu, Jul 24, 2008 at 1:35 PM, Lev Walkin <vlm@...> wrote:
hask ellian wrote:
> I made a simple spamfilter in Erlang. It takes 2 files with previous
> spam and good emails and then counts how many times the most frequent
> words from the spammy emails and the good emails occurs and then
> calculates the quote spam/(spam+good) in the file you want to test and
> returns  a number between 0 and 1.
> It could easily be improved in numerous ways but the main point for me
> was to learn Erlang. This isn't exactly what Erlang is for but it s way
> to get started.
> I'd be happy to receive comments on the Erlang-ness of the code and
> improvements.
> File I/O seems slow, is there a better way? In Haskell it is fairly instant.
>
>
> -module(antispam).
> -export([take/2,count/2,count_all/1,most_common/2,count_set_in_list/2,
> classify/0,readfile/1]).
>
> take(N,List) -> i_take(N,List,0,[]).
>     i_take(N,List,Count,Acc) ->
>      if Count < N andalso List /= [] ->
>          i_take(N,tl(List),Count+1,Acc++[hd(List)]);
>         Count == N ->
>          Acc;
>         true ->
>          []
>      end.

Consider using lists:sublist() instead of reimplementing
the standard library methods.

> count(Tok,List) -> i_count(Tok,List,0).
>     i_count(Tok,List,Acc) ->
>         if Tok == hd(List) andalso List /= [] ->
>             i_count(Tok,tl(List),Acc+1);
>            Tok /= hd(List) andalso List /= [] ->
>             i_count(Tok,tl(List),Acc);
>            true ->
>             Acc
>     end.

Consider using

       length([T == Tok || T <- List])

I guess you mean
count(Tok, List) -> length([T || T<-List, T=:=Tok]).

or may be better (less memory consuming)

count(Tok, List) -> lists:foldl(fun(Tok, Acc)->Acc+1; (_,Acc)->Acc end, 0, List).

but tail recursive version is not so bad at all (guess fastest)

count(Tok, List) -> i_count(Tok, List, 0).
i_count(_, [], Count) -> Count;
i_count(Tok, [Tok|Rest], Count) -> i_count(Tok, Rest, Count+1);
i_count(Tok, [_|Rest], Count) -> i_count(Tok, Rest, Count).

and non tail recursive version can work for no so much big Lists reasonably well

count(_, [])->0;
count(Tok, [Tok|Rest])->1+count(Tok, Rest);
count(Tok, [_|Rest)->count(Tok, Rest).



which is much more concise and manageable.

> count_all(List) ->
>     Unique = lists:usort(List),
>     [{U, count(U, List)} || U <- Unique].

This one has O(N^2) complexity. Consider using ets if the data
set is larger than, say, 1000 elements.

> count_set_in_list(Set,List) ->
>     S = [{S, count(S, List)} || S <- Set],
>     lists:sum(lists:map(fun({H,T}) -> T end, S)).

count_set_in_list(Set, List) ->
       lists:sum([count(S, List) || S <- Set]).

It is O(N*M) ets or dict should perform better too.

count_set_in_list(Set, List) ->
     lists:foldl(fun(X, D)->dict:update_counter(X, 1, D) end, dict:new(), List),
     lists:foldl(fun(K, Acc)->
          try dict:fetch(K, CountAll) of
             Count -> Acc+Count
          catch _:_->Acc end
       end, 0, Set).

or less memory consuming and probalbly faster

count_set_in_list(Set, List) ->
   S = sets:from_list(Set);
   lists:foldl(fun(K, Acc) ->
      case sets:is_element(K, S) of
          true -> Acc+1;
          false -> Acc
      end
   end, 0, List).

But for short Set it can be slower than

count_set_in_list(Set, List) ->
   lists:foldl(fun(K, Acc) ->
      case lists:member(K, Set) of
          true -> Acc+1;
          false -> Acc
      end
   end, 0, List).

which can be simplified as again O(N*M)

count_set_in_list(Set, List) ->
       length([ok || X<-List, Y<-Set, X=:=Y]).



> most_common(Stringlist,Xmost) ->
>     No_preps = lists:filter(fun(X) -> length(X) > 4 end, Stringlist),
>     Sorted_by_count = lists:keysort(2, count_all(No_preps)),
>     TakeX = take(Xmost, lists:reverse(Sorted_by_count)),
>     lists:map(fun({H,T}) -> H end, TakeX).



> readfile(FileName) ->
>     {ok, Binary} = file:read_file(FileName),
>     string:tokens(binary_to_list(Binary), " ").
>
> classify() ->
>     GoodWords =
> most_common(readfile("C:/Users/saftarn/Desktop/emails/okemails.txt"), 20),
>     BadWords  =
> most_common(readfile("C:/Users/saftarn/Desktop/emails/spam.txt"), 20),
>     GoodCount = count_set_in_list(GoodWords,
> readfile("C:/Users/saftarn/Desktop/emails/test.txt")),
>     BadCount  = count_set_in_list(BadWords,
> readfile("C:/Users/saftarn/Desktop/emails/test.txt")),
>     T = BadCount + GoodCount,
>     if T /= 0 ->
>     BadCount / T;
>        true ->
>         0.5
>     end.
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@...
> http://www.erlang.org/mailman/listinfo/erlang-questions

_______________________________________________
erlang-questions mailing list
erlang-questions@...
http://www.erlang.org/mailman/listinfo/erlang-questions



--
--Hynek (Pichi) Vychodil

_______________________________________________
erlang-questions mailing list
erlang-questions@...
http://www.erlang.org/mailman/listinfo/erlang-questions

Re: comment on my erlang Spamfilter

by James Hague :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

readfile(FileName) ->
    {ok, Binary} = file:read_file(FileName),
    string:tokens(binary_to_list(Binary), " ").

Were I writing this, I wouldn't have called string:tokens at all, but
directly looped through Binary looking for words.  More than once I've
found string:tokens to be a hotspot, and there's no compelling reason
to do the tokenization as a separate step in this case.  That also
avoids the 8x blowup caused by binary_to_list on a potentially large
file.
_______________________________________________
erlang-questions mailing list
erlang-questions@...
http://www.erlang.org/mailman/listinfo/erlang-questions

Re: comment on my erlang Spamfilter

by Richard O'Keefe :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Lev Walkin has already commented on this.
I'm just going to cover a few points he omitted.

On 24 Jul 2008, at 10:58 pm, hask ellian wrote:
> take(N,List) -> i_take(N,List,0,[]).
>     i_take(N,List,Count,Acc) ->
>      if Count < N andalso List /= [] ->
>          i_take(N,tl(List),Count+1,Acc++[hd(List)]);
>         Count == N ->
>          Acc;
>         true ->
>          []
>      end.

(1) Where is the comment saying what this does?
(2) It is a bad idea to use 'andalso' or 'orelse' in guards;
     the current compiler generates MUCH worse code for them
     than the classic ',' and ';'.
(3) You wouldn't use head, tail, and not null in Haskell,
     so why do it in Erlang?
(4) Indenting the auxiliary function is something I used to
     do back when I first learned Prolog.  Lawrence Byrd
     commented wryly, "You really miss Algol, don't you?"
     I learned not to do that, because Prolog predicates
     don't nest, and NEITHER DO ERLANG FUNCTIONS.  (Funs,
     yes.  Named functions, no.)  To use indentation when
     there is no actual nesting is misleading.
(5) Jamming two function definitions together with no
     blank lines between makes them hard to read.
(6) Oh yeah, you would count DOWN in Haskell, not up.
     Erlang is no different.

Let's fix all those and see what we get.

        % take(N, List) -> Prefix
        % return the first N elements of List, if it has
        % at least that many, or all of it if it is shorter.
        % Error if N < 0 or List is not a list.

        take(N, List) when integer(N), N >= 0 ->
            take_loop(N, List).

        take_loop(N, [X|Xs]) when N > 0 ->
            [X | take_loop(N-1, Xs)];
        take_loop(_, _) ->
            [].

Drat.  In rewriting it to use pattern matching, I
accidentally fixed a serious performance bug.
In Erlang, as in Haskell, List ++ [Element] is
expensive; it copies all of List.  Your code took
O(N**2) time; the replacement is O(N).

As Lev Walkin noted, you should familiarise yourself
with the documentation of the 'lists' module.
An even better implementation is

        take(N, List) ->
            lists:sublist(List, N).

-- I actually think 'sublist' is a lousy name and that
-- 'take' is a better one.  But if you use the library
-- function, other people will be more likely to understand
-- you (and it's documented and tested already).


>
>
> count(Tok,List) -> i_count(Tok,List,0).
>     i_count(Tok,List,Acc) ->
>         if Tok == hd(List) andalso List /= [] ->
>             i_count(Tok,tl(List),Acc+1);
>            Tok /= hd(List) andalso List /= [] ->
>             i_count(Tok,tl(List),Acc);
>            true ->
>             Acc
>     end.

Same issues.  Generally, if you see an 'if', it
should either have been a 'case' or a function.

This would be

        %   count(Item, List) -> Count
        %   count the number of elements of List which equal Item.

        count(Item, List) ->
            count_loop(Item, List, 0).

        count_loop(Item, [Item|Xs], N) ->
            count_loop(Item, Xs, N+1);
        count_loop(Item, [_|Xs], N) ->
            count_loop(Item, Xs, N);
        count_loop(_, [], N) ->
            N.

This corresponds exactly to the Haskell

        count item list = count_loop item list 0

        count_loop item (x:xs) n | x == item = count_loop item xs (n+1)
        count_loop item (_:xs) n             = count_loop item xs n
        count_loop _    []     n             = n

As Lev Walkin notes, length([X | X <- List, X == Item]) would be
short and simple.  Erlang doesn't do deforestation (yet), so this
would be more expensive than using your own function here.
However, we soon learn that you really do not want to do this at all.
>
>
> count_all(List) ->
>     Unique = lists:usort(List),
>     [{U, count(U, List)} || U <- Unique].

This takes O(N**2) time when O(N.lgN) is possible.
Not a good idea.

What you want to do is to sort the list WITHOUT removing
duplicates, then count up the blocks.

        count_all(List) ->
            count_all_blocks(lists:sort(List)).

        count_all_blocks([]) -> [];
        count_all_blocks([X|Xs]) ->
            count_rest_of_block(X, Xs, 1).

        count_rest_of_block(X, [Y|Ys], N) when Y =< X ->
            count_rest_of_block(X, Ys, N+1);
        count_rest_of_block(X, Xs, N) ->
            [{X,N} | count_all_blocks(Xs)].

The sorting phase is still O(N.lgN), but now the gathering
and counting phase is O(N) instead of O(N**2).
>
>
> count_set_in_list(Set,List) ->
>     S = [{S, count(S, List)} || S <- Set],
>     lists:sum(lists:map(fun({H,T}) -> T end, S)).

Why isn't this just

        count_set_in_list(Set, List) ->
            lists:sum([count(S, List) || S <- Set]).

What is the point of wrapping {S, _} around the counts
when the next thing you do is to throw that part away?

Once again, this is an O(N**2) implementation of an
operation that can be done in O(N.lgN) time.  More
precisely, if Set and List are sorted, it can be done
in O(|Set|+|Length|) time; if they are not sorted,
you have to sort them first.

There are library modules ordsets, gb_sets, gb_trees,
and dict that might be of use to you instead.

>
>
> most_common(Stringlist,Xmost) ->
>     No_preps = lists:filter(fun(X) -> length(X) > 4 end, Stringlist),
>     Sorted_by_count = lists:keysort(2, count_all(No_preps)),
>     TakeX = take(Xmost, lists:reverse(Sorted_by_count)),
>     lists:map(fun({H,T}) -> H end, TakeX).

If you want to collect the K "biggest" things out of N, or
of course the K "smallest", that can be done in O(N.lg K)
time.  You are doing it in O(N.lg N).  Probably not worth
bothering about that.  Actually, there are linear expected
time algorithms for finding the Kth biggest element, and
then you can use an O(N) pass to find everything that big
or bigger, and then O(K.lg K) to sort those, if you
really want to, so O(N+K.lg K).  That might be worth having.

> readfile(FileName) ->
>     {ok, Binary} = file:read_file(FileName),
>     string:tokens(binary_to_list(Binary), " ").
>
> classify() ->
>     GoodWords = most_common(readfile("C:/Users/saftarn/Desktop/
> emails/okemails.txt"), 20),
>     BadWords  = most_common(readfile("C:/Users/saftarn/Desktop/
> emails/spam.txt"), 20),
>     GoodCount = count_set_in_list(GoodWords, readfile("C:/Users/
> saftarn/Desktop/emails/test.txt")),
>     BadCount  = count_set_in_list(BadWords,  readfile("C:/Users/
> saftarn/Desktop/emails/test.txt")),
>     T = BadCount + GoodCount,
>     if T /= 0 ->
>     BadCount / T;
>        true ->
>         0.5
>     end.

Something went wonky with the indentation of your 'if'.

You have two different concerns mixed up in your classify() function:
(A) where are my files?
(B) what do I want to calculate?

Split them.

classify_lists(Test_List, Good_List, Bad_List) ->
     Good_Count = count_set_in_list(most_common(Good_List, 20),  
Test_List),
     Bad_Count  = count_set_in_list(most_common(Bad_List,  20),  
Test_List),
     case Good_Count + Bad_Count
       of 0 -> 0.5
        ; T -> Bad_Count / T
     end.

classify_files(Test_File, Good_File, Bad_File) ->
     classify_lists(readfile(Test_File), readfile(Good_File),  
readfile(Bad_File)).

classify() ->
     D = "C:/Users/saftarn/Desktop/emails/"),
     classify_files(D ++ "test.txt", D ++ "okemails.txt", D ++  
"spam.txt").

In making this change, I also fixed the performance bug where you
read the test data twice.

We now notice a repeated pattern, which might as well be split out:

    probe_count(Word_List, Base_List) ->
        count_set_in_list(most_common(Word_List, 20), Base_List).

Now we can think about improving THIS function instead of the
individual bits that it is made of.

And the thought occurs at once:  why read, sort, and pick the cream of
*ALL* the good words and *ALL* the bad words *EVERY* time you want to
classify a message?  Why not just STORE the top 20 good words and the
top 20 bad words, and keep them around?

I'm also a little confused about where these word lists come from in
the first place.  If you are just extracting words from email messages,
you are like to find that the top few words in BOTH lists will be
something like [the,and,a].

Oddly enough, I recently had my first misclassification of an e-mail
message because of something not unlike this.  The message was
information from the University about medical research funding that
was available from the l-*-t-t-*-r-y.  You can guess why it was
misclassified.

_______________________________________________
erlang-questions mailing list
erlang-questions@...
http://www.erlang.org/mailman/listinfo/erlang-questions

Re: comment on my erlang Spamfilter

by Richard O'Keefe :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


On 25 Jul 2008, at 2:18 am, James Hague wrote:

> readfile(FileName) ->
>    {ok, Binary} = file:read_file(FileName),
>    string:tokens(binary_to_list(Binary), " ").
>
> Were I writing this, I wouldn't have called string:tokens at all, but
> directly looped through Binary looking for words.

I guess another comment is appropriate.
"What's a word?"  (This takes most of a 2-hour lecture in our
information retrieval paper!)

This is not going to identify "The" with "the".
It is going to take "time-to-live" as one word.
It is not going to realise that "these, and those"
contains the word "these".  (It's going to think that
the first word is "these,".)

The obvious thing is to run some separate program that
splits a document into words, and writes them out one per
line, but of course string:tokens("foo\nbar\n", " ") is
going to think the whole string is one token.

Oh, and let's consider a popular device used by spammers.
Let us suppose that Erlang is a "naughty word".  They
might write it as "3r1ang" and rely on the human eye to
read what they meant.

The simplest definition of a "word", that works much of the
time, is
        a sequence of letters and apostrophes such that
        each apostrophe has a letter on each side.

(This will be confused by "3r1ang"; I did say it works
"much of the time", not "all the time".)

I would be inclined to do
  - breaking documents into words
  - converting them to lower case
  - removing words in a stop list of maybe 100 words
  - MAYBE stemming using Porter's algorithm (if it's
    English you are interested in, find something else
    for other languages)
  - sorting-and-counting
in an outboard program, because this is high(er) volume
stuff.  I'd then hack on the results in Erlang.

--
If stupidity were a crime, who'd 'scape hanging?







_______________________________________________
erlang-questions mailing list
erlang-questions@...
http://www.erlang.org/mailman/listinfo/erlang-questions
LightInTheBox - Buy quality products at wholesale price