|
View:
New views
6 Messages
—
Rating Filter:
Alert me
|
|
|
comment on my erlang SpamfilterI made a simple spamfilter in Erlang. It takes 2 files with previous spam and good emails and then counts how many times the most frequent words from the spammy emails and the good emails occurs and then calculates the quote spam/(spam+good) in the file you want to test and returns a number between 0 and 1.
It could easily be improved in numerous ways but the main point for me was to learn Erlang. This isn't exactly what Erlang is for but it s way to get started. I'd be happy to receive comments on the Erlang-ness of the code and improvements. File I/O seems slow, is there a better way? In Haskell it is fairly instant. -module(antispam). -export([take/2,count/2,count_all/1,most_common/2,count_set_in_list/2, classify/0,readfile/1]). take(N,List) -> i_take(N,List,0,[]). i_take(N,List,Count,Acc) -> if Count < N andalso List /= [] -> i_take(N,tl(List),Count+1,Acc++[hd(List)]); Count == N -> Acc; true -> [] end. count(Tok,List) -> i_count(Tok,List,0). i_count(Tok,List,Acc) -> if Tok == hd(List) andalso List /= [] -> i_count(Tok,tl(List),Acc+1); Tok /= hd(List) andalso List /= [] -> i_count(Tok,tl(List),Acc); true -> Acc end. count_all(List) -> Unique = lists:usort(List), [{U, count(U, List)} || U <- Unique]. count_set_in_list(Set,List) -> S = [{S, count(S, List)} || S <- Set], lists:sum(lists:map(fun({H,T}) -> T end, S)). most_common(Stringlist,Xmost) -> No_preps = lists:filter(fun(X) -> length(X) > 4 end, Stringlist), Sorted_by_count = lists:keysort(2, count_all(No_preps)), TakeX = take(Xmost, lists:reverse(Sorted_by_count)), lists:map(fun({H,T}) -> H end, TakeX). readfile(FileName) -> {ok, Binary} = file:read_file(FileName), string:tokens(binary_to_list(Binary), " "). classify() -> GoodWords = most_common(readfile("C:/Users/saftarn/Desktop/emails/okemails.txt"), 20), BadWords = most_common(readfile("C:/Users/saftarn/Desktop/emails/spam.txt"), 20), GoodCount = count_set_in_list(GoodWords, readfile("C:/Users/saftarn/Desktop/emails/test.txt")), BadCount = count_set_in_list(BadWords, readfile("C:/Users/saftarn/Desktop/emails/test.txt")), T = BadCount + GoodCount, if T /= 0 -> BadCount / T; true -> 0.5 end. _______________________________________________ erlang-questions mailing list erlang-questions@... http://www.erlang.org/mailman/listinfo/erlang-questions |
|
|
Re: comment on my erlang Spamfilterhask ellian wrote:
> I made a simple spamfilter in Erlang. It takes 2 files with previous > spam and good emails and then counts how many times the most frequent > words from the spammy emails and the good emails occurs and then > calculates the quote spam/(spam+good) in the file you want to test and > returns a number between 0 and 1. > It could easily be improved in numerous ways but the main point for me > was to learn Erlang. This isn't exactly what Erlang is for but it s way > to get started. > I'd be happy to receive comments on the Erlang-ness of the code and > improvements. > File I/O seems slow, is there a better way? In Haskell it is fairly instant. > > > -module(antispam). > -export([take/2,count/2,count_all/1,most_common/2,count_set_in_list/2, > classify/0,readfile/1]). > > take(N,List) -> i_take(N,List,0,[]). > i_take(N,List,Count,Acc) -> > if Count < N andalso List /= [] -> > i_take(N,tl(List),Count+1,Acc++[hd(List)]); > Count == N -> > Acc; > true -> > [] > end. Consider using lists:sublist() instead of reimplementing the standard library methods. > count(Tok,List) -> i_count(Tok,List,0). > i_count(Tok,List,Acc) -> > if Tok == hd(List) andalso List /= [] -> > i_count(Tok,tl(List),Acc+1); > Tok /= hd(List) andalso List /= [] -> > i_count(Tok,tl(List),Acc); > true -> > Acc > end. Consider using length([T == Tok || T <- List]) which is much more concise and manageable. > count_all(List) -> > Unique = lists:usort(List), > [{U, count(U, List)} || U <- Unique]. This one has O(N^2) complexity. Consider using ets if the data set is larger than, say, 1000 elements. > count_set_in_list(Set,List) -> > S = [{S, count(S, List)} || S <- Set], > lists:sum(lists:map(fun({H,T}) -> T end, S)). count_set_in_list(Set, List) -> lists:sum([count(S, List) || S <- Set]). > most_common(Stringlist,Xmost) -> > No_preps = lists:filter(fun(X) -> length(X) > 4 end, Stringlist), > Sorted_by_count = lists:keysort(2, count_all(No_preps)), > TakeX = take(Xmost, lists:reverse(Sorted_by_count)), > lists:map(fun({H,T}) -> H end, TakeX). > readfile(FileName) -> > {ok, Binary} = file:read_file(FileName), > string:tokens(binary_to_list(Binary), " "). > > classify() -> > GoodWords = > most_common(readfile("C:/Users/saftarn/Desktop/emails/okemails.txt"), 20), > BadWords = > most_common(readfile("C:/Users/saftarn/Desktop/emails/spam.txt"), 20), > GoodCount = count_set_in_list(GoodWords, > readfile("C:/Users/saftarn/Desktop/emails/test.txt")), > BadCount = count_set_in_list(BadWords, > readfile("C:/Users/saftarn/Desktop/emails/test.txt")), > T = BadCount + GoodCount, > if T /= 0 -> > BadCount / T; > true -> > 0.5 > end. > > > ------------------------------------------------------------------------ > > _______________________________________________ > erlang-questions mailing list > erlang-questions@... > http://www.erlang.org/mailman/listinfo/erlang-questions _______________________________________________ erlang-questions mailing list erlang-questions@... http://www.erlang.org/mailman/listinfo/erlang-questions |
|
|
Re: comment on my erlang SpamfilterOn Thu, Jul 24, 2008 at 1:35 PM, Lev Walkin <vlm@...> wrote:
I guess you mean count(Tok, List) -> length([T || T<-List, T=:=Tok]). or may be better (less memory consuming) count(Tok, List) -> lists:foldl(fun(Tok, Acc)->Acc+1; (_,Acc)->Acc end, 0, List). but tail recursive version is not so bad at all (guess fastest) count(Tok, List) -> i_count(Tok, List, 0). i_count(_, [], Count) -> Count; i_count(Tok, [Tok|Rest], Count) -> i_count(Tok, Rest, Count+1); i_count(Tok, [_|Rest], Count) -> i_count(Tok, Rest, Count). and non tail recursive version can work for no so much big Lists reasonably well count(_, [])->0; count(Tok, [Tok|Rest])->1+count(Tok, Rest); count(Tok, [_|Rest)->count(Tok, Rest).
It is O(N*M) ets or dict should perform better too. count_set_in_list(Set, List) -> lists:foldl(fun(X, D)->dict:update_counter(X, 1, D) end, dict:new(), List), lists:foldl(fun(K, Acc)-> try dict:fetch(K, CountAll) of Count -> Acc+Count catch _:_->Acc end end, 0, Set). or less memory consuming and probalbly faster count_set_in_list(Set, List) -> S = sets:from_list(Set); lists:foldl(fun(K, Acc) -> case sets:is_element(K, S) of true -> Acc+1; false -> Acc end end, 0, List). But for short Set it can be slower than count_set_in_list(Set, List) -> lists:foldl(fun(K, Acc) -> case lists:member(K, Set) of true -> Acc+1; false -> Acc end end, 0, List). which can be simplified as again O(N*M) count_set_in_list(Set, List) -> length([ok || X<-List, Y<-Set, X=:=Y]).
-- --Hynek (Pichi) Vychodil _______________________________________________ erlang-questions mailing list erlang-questions@... http://www.erlang.org/mailman/listinfo/erlang-questions |
|
|
Re: comment on my erlang Spamfilterreadfile(FileName) ->
{ok, Binary} = file:read_file(FileName), string:tokens(binary_to_list(Binary), " "). Were I writing this, I wouldn't have called string:tokens at all, but directly looped through Binary looking for words. More than once I've found string:tokens to be a hotspot, and there's no compelling reason to do the tokenization as a separate step in this case. That also avoids the 8x blowup caused by binary_to_list on a potentially large file. _______________________________________________ erlang-questions mailing list erlang-questions@... http://www.erlang.org/mailman/listinfo/erlang-questions |
|
|
Re: comment on my erlang SpamfilterLev Walkin has already commented on this.
I'm just going to cover a few points he omitted. On 24 Jul 2008, at 10:58 pm, hask ellian wrote: > take(N,List) -> i_take(N,List,0,[]). > i_take(N,List,Count,Acc) -> > if Count < N andalso List /= [] -> > i_take(N,tl(List),Count+1,Acc++[hd(List)]); > Count == N -> > Acc; > true -> > [] > end. (1) Where is the comment saying what this does? (2) It is a bad idea to use 'andalso' or 'orelse' in guards; the current compiler generates MUCH worse code for them than the classic ',' and ';'. (3) You wouldn't use head, tail, and not null in Haskell, so why do it in Erlang? (4) Indenting the auxiliary function is something I used to do back when I first learned Prolog. Lawrence Byrd commented wryly, "You really miss Algol, don't you?" I learned not to do that, because Prolog predicates don't nest, and NEITHER DO ERLANG FUNCTIONS. (Funs, yes. Named functions, no.) To use indentation when there is no actual nesting is misleading. (5) Jamming two function definitions together with no blank lines between makes them hard to read. (6) Oh yeah, you would count DOWN in Haskell, not up. Erlang is no different. Let's fix all those and see what we get. % take(N, List) -> Prefix % return the first N elements of List, if it has % at least that many, or all of it if it is shorter. % Error if N < 0 or List is not a list. take(N, List) when integer(N), N >= 0 -> take_loop(N, List). take_loop(N, [X|Xs]) when N > 0 -> [X | take_loop(N-1, Xs)]; take_loop(_, _) -> []. Drat. In rewriting it to use pattern matching, I accidentally fixed a serious performance bug. In Erlang, as in Haskell, List ++ [Element] is expensive; it copies all of List. Your code took O(N**2) time; the replacement is O(N). As Lev Walkin noted, you should familiarise yourself with the documentation of the 'lists' module. An even better implementation is take(N, List) -> lists:sublist(List, N). -- I actually think 'sublist' is a lousy name and that -- 'take' is a better one. But if you use the library -- function, other people will be more likely to understand -- you (and it's documented and tested already). > > > count(Tok,List) -> i_count(Tok,List,0). > i_count(Tok,List,Acc) -> > if Tok == hd(List) andalso List /= [] -> > i_count(Tok,tl(List),Acc+1); > Tok /= hd(List) andalso List /= [] -> > i_count(Tok,tl(List),Acc); > true -> > Acc > end. Same issues. Generally, if you see an 'if', it should either have been a 'case' or a function. This would be % count(Item, List) -> Count % count the number of elements of List which equal Item. count(Item, List) -> count_loop(Item, List, 0). count_loop(Item, [Item|Xs], N) -> count_loop(Item, Xs, N+1); count_loop(Item, [_|Xs], N) -> count_loop(Item, Xs, N); count_loop(_, [], N) -> N. This corresponds exactly to the Haskell count item list = count_loop item list 0 count_loop item (x:xs) n | x == item = count_loop item xs (n+1) count_loop item (_:xs) n = count_loop item xs n count_loop _ [] n = n As Lev Walkin notes, length([X | X <- List, X == Item]) would be short and simple. Erlang doesn't do deforestation (yet), so this would be more expensive than using your own function here. However, we soon learn that you really do not want to do this at all. > > > count_all(List) -> > Unique = lists:usort(List), > [{U, count(U, List)} || U <- Unique]. This takes O(N**2) time when O(N.lgN) is possible. Not a good idea. What you want to do is to sort the list WITHOUT removing duplicates, then count up the blocks. count_all(List) -> count_all_blocks(lists:sort(List)). count_all_blocks([]) -> []; count_all_blocks([X|Xs]) -> count_rest_of_block(X, Xs, 1). count_rest_of_block(X, [Y|Ys], N) when Y =< X -> count_rest_of_block(X, Ys, N+1); count_rest_of_block(X, Xs, N) -> [{X,N} | count_all_blocks(Xs)]. The sorting phase is still O(N.lgN), but now the gathering and counting phase is O(N) instead of O(N**2). > > > count_set_in_list(Set,List) -> > S = [{S, count(S, List)} || S <- Set], > lists:sum(lists:map(fun({H,T}) -> T end, S)). Why isn't this just count_set_in_list(Set, List) -> lists:sum([count(S, List) || S <- Set]). What is the point of wrapping {S, _} around the counts when the next thing you do is to throw that part away? Once again, this is an O(N**2) implementation of an operation that can be done in O(N.lgN) time. More precisely, if Set and List are sorted, it can be done in O(|Set|+|Length|) time; if they are not sorted, you have to sort them first. There are library modules ordsets, gb_sets, gb_trees, and dict that might be of use to you instead. > > > most_common(Stringlist,Xmost) -> > No_preps = lists:filter(fun(X) -> length(X) > 4 end, Stringlist), > Sorted_by_count = lists:keysort(2, count_all(No_preps)), > TakeX = take(Xmost, lists:reverse(Sorted_by_count)), > lists:map(fun({H,T}) -> H end, TakeX). If you want to collect the K "biggest" things out of N, or of course the K "smallest", that can be done in O(N.lg K) time. You are doing it in O(N.lg N). Probably not worth bothering about that. Actually, there are linear expected time algorithms for finding the Kth biggest element, and then you can use an O(N) pass to find everything that big or bigger, and then O(K.lg K) to sort those, if you really want to, so O(N+K.lg K). That might be worth having. > readfile(FileName) -> > {ok, Binary} = file:read_file(FileName), > string:tokens(binary_to_list(Binary), " "). > > classify() -> > GoodWords = most_common(readfile("C:/Users/saftarn/Desktop/ > emails/okemails.txt"), 20), > BadWords = most_common(readfile("C:/Users/saftarn/Desktop/ > emails/spam.txt"), 20), > GoodCount = count_set_in_list(GoodWords, readfile("C:/Users/ > saftarn/Desktop/emails/test.txt")), > BadCount = count_set_in_list(BadWords, readfile("C:/Users/ > saftarn/Desktop/emails/test.txt")), > T = BadCount + GoodCount, > if T /= 0 -> > BadCount / T; > true -> > 0.5 > end. Something went wonky with the indentation of your 'if'. You have two different concerns mixed up in your classify() function: (A) where are my files? (B) what do I want to calculate? Split them. classify_lists(Test_List, Good_List, Bad_List) -> Good_Count = count_set_in_list(most_common(Good_List, 20), Test_List), Bad_Count = count_set_in_list(most_common(Bad_List, 20), Test_List), case Good_Count + Bad_Count of 0 -> 0.5 ; T -> Bad_Count / T end. classify_files(Test_File, Good_File, Bad_File) -> classify_lists(readfile(Test_File), readfile(Good_File), readfile(Bad_File)). classify() -> D = "C:/Users/saftarn/Desktop/emails/"), classify_files(D ++ "test.txt", D ++ "okemails.txt", D ++ "spam.txt"). In making this change, I also fixed the performance bug where you read the test data twice. We now notice a repeated pattern, which might as well be split out: probe_count(Word_List, Base_List) -> count_set_in_list(most_common(Word_List, 20), Base_List). Now we can think about improving THIS function instead of the individual bits that it is made of. And the thought occurs at once: why read, sort, and pick the cream of *ALL* the good words and *ALL* the bad words *EVERY* time you want to classify a message? Why not just STORE the top 20 good words and the top 20 bad words, and keep them around? I'm also a little confused about where these word lists come from in the first place. If you are just extracting words from email messages, you are like to find that the top few words in BOTH lists will be something like [the,and,a]. Oddly enough, I recently had my first misclassification of an e-mail message because of something not unlike this. The message was information from the University about medical research funding that was available from the l-*-t-t-*-r-y. You can guess why it was misclassified. _______________________________________________ erlang-questions mailing list erlang-questions@... http://www.erlang.org/mailman/listinfo/erlang-questions |
|
|
Re: comment on my erlang SpamfilterOn 25 Jul 2008, at 2:18 am, James Hague wrote: > readfile(FileName) -> > {ok, Binary} = file:read_file(FileName), > string:tokens(binary_to_list(Binary), " "). > > Were I writing this, I wouldn't have called string:tokens at all, but > directly looped through Binary looking for words. I guess another comment is appropriate. "What's a word?" (This takes most of a 2-hour lecture in our information retrieval paper!) This is not going to identify "The" with "the". It is going to take "time-to-live" as one word. It is not going to realise that "these, and those" contains the word "these". (It's going to think that the first word is "these,".) The obvious thing is to run some separate program that splits a document into words, and writes them out one per line, but of course string:tokens("foo\nbar\n", " ") is going to think the whole string is one token. Oh, and let's consider a popular device used by spammers. Let us suppose that Erlang is a "naughty word". They might write it as "3r1ang" and rely on the human eye to read what they meant. The simplest definition of a "word", that works much of the time, is a sequence of letters and apostrophes such that each apostrophe has a letter on each side. (This will be confused by "3r1ang"; I did say it works "much of the time", not "all the time".) I would be inclined to do - breaking documents into words - converting them to lower case - removing words in a stop list of maybe 100 words - MAYBE stemming using Porter's algorithm (if it's English you are interested in, find something else for other languages) - sorting-and-counting in an outboard program, because this is high(er) volume stuff. I'd then hack on the results in Erlang. -- If stupidity were a crime, who'd 'scape hanging? _______________________________________________ erlang-questions mailing list erlang-questions@... http://www.erlang.org/mailman/listinfo/erlang-questions |
| Free Forum Powered by Nabble | Forum Help |