[Rule Set proposal] French Rules

Hi,

This is my first post on this list and first ruleset, so please point me to the right place/documents if I am doing anything wrong.

According to a search of this list on markmail.org, there have been few subjects about spam in French and (no disrespect meant) I would agree with the comments I read about the current French Ruleset being inadequate (tried it, did not keep any of it).

So I would like to propose a set for French Rules and get your feedback.

You can find both the rules and some sample spam email messages (two of them missing, I have hits in my log files, but deleted them) at the following URL: http://www.saphirtech.fr/spam/

I have been running these for about a month sitewise on three domains, I have not seen any false positives (yet).

Sincerely,
JG

#####################################################################################
##### FRENCH SPECIFIC SPAMASSASSIN RULES.
##### USE AND REDISTRIBUTE WITH THIS NOTE AT YOUR OWN RISK AND PLEASURE.
##### AUTHOR: John GALLET
##### Version: 2008-JUNE-17
##### Latest: http://www.saphirtech.fr/
##### Status: It Works For Me (tm)
#####################################################################################

# Spam is legal in France !
body FR_SPAMISLEGAL /\b(Conform.+ment|En vertu).{0,5}(article.{0,4}34.{0,4})?la loi\b/i
describe FR_SPAMISLEGAL French: pretends spam is (l)awful.
lang fr describe FR_SPAMISLEGAL Invoque la loi informatique et libertes.
score FR_SPAMISLEGAL 2.5

body FR_SPAMISLEGAL_2 /\bdroit d.acc.+s.{1,3}(de modification)?.{0,5}de rectification\b/i
describe FR_SPAMISLEGAL_2 French: pretends spam is (l)awful.
lang fr describe FR_SPAMISLEGAL_2 Invoque le droit de rectification cnil.
score FR_SPAMISLEGAL_2 2.5

#####
# yeah, sure.
body FR_NOTSPAM /\b(ceci|ce).{1,9} n.est pas.{1,5}spam\b/i
describe FR_NOTSPAM French: claims not to be spam.
lang fr describe FR_NOTSPAM Affirme ne pas etre du spam.
score FR_NOTSPAM 4.0

#####
## I can pay my taxes
body FR_PAYLESSTAXES /\b(paye|calcul|simul|r.+dui|investi).{1,7}(moins|vo|ses).{0,5}imp.+t(s)?\b/i
describe FR_PAYLESSTAXES French: Pay less taxes
lang fr describe FR_PAYLESSTAXES Simulateurs et reductions d'impots.
score FR_PAYLESSTAXES 2.0

body FR_REALESTATE_INVEST /\b(loi)? (de.robien|girardin).{1,15}(neuf|recentr.+|ancien|IR|IS|imp.+t(s)?|industriel(le)?)\b/i
describe FR_REALESTATE_INVEST French: Invest in real-estate with tax-reductions
lang fr describe FR_REALESTATE_INVEST Reduction impots immobilier.
score FR_REALESTATE_INVEST 2.5

#####
# I won at the casino
body FR_ONLINEGAMBLING /\b(casino(s)?|jeu(x)?|joueur(s)?) (en ligne|de grattage)\b/i
describe FR_ONLINEGAMBLING French: Online gambling
lang fr describe FR_ONLINEGAMBLING Jeux en ligne.
score FR_ONLINEGAMBLING 2.0

#####
# I am so lucky to receive spam
body FR_YOURELUCKY /\b(tentez)? votre (jour de)? chance\b/i
describe FR_YOURELUCKY French: it's your lucky day (sure).
lang fr describe FR_YOURELUCKY Jeux de hasard et de chance.
score FR_YOURELUCKY 1.0

#####
# Baby, did you forget to take your meds ?
body FR_ONLINEMEDS /\bpharmacie(s)? (en ligne|internet)\b/i
describe FR_ONLINEMEDS French: Online meds ordering
lang fr describe FR_ONLINEMEDS Achat de medicaments en ligne.
score FR_ONLINEMEDS 3.0

######
# Tell me why
body FR_REASON_SUBSCRIBE /\bVous recevez ce(t|tte)? (message|mail|m.+l|lettre|news.+) (car|parce que)\b/i
describe FR_REASON_SUBSCRIBE French: you subscribed to my spam.
lang fr describe FR_REASON_SUBSCRIBE Indique pourquoi vous recevez le courrier.
score FR_REASON_SUBSCRIBE 1.5

#####
# How to unsubscribe
body FR_HOWTOUNSUBSCRIBE /\b(souhaitez|d.+sirez|pour).{1,10}(plus.{1,}recevoir|d.+sincrire|d.+sinscription).{0,10}(information|email|mail|mailing|newsletter|message|offre|promotion)(s)?\b/i
describe FR_HOWTOUNSUBSCRIBE French: how to unsubscribe
lang fr describe FR_HOWTOUNSUBSCRIBE Indique comment se desabonner.
score FR_HOWTOUNSUBSCRIBE 2.0

####
# Various "CRM" (Could Remove Me)
#####
header FR_MAILER_1 X-Mailer =~ /(delosmail|cabestan|ems|mp6|wamailer|phpmailer|eMailink|Accucast|Benchmail)/i
describe FR_MAILER_1 French spammy X-Mailer
lang fr describe FR_MAILER_1 X-Mailer couramment employe pour des spams en francais.
score FR_MAILER_1 4.0

header FR_MAILER_2 X-EMV- =~ /.+/
describe FR_MAILER_2 French spammy mailer header
lang fr describe FR_MAILER_2 X-Mailer couramment employe pour des spams en francais.
score FR_MAILER_2 4.0

#####################################################################################
##### END FRENCH SPECIFIC SPAMASSASSIN RULES.
#####################################################################################
|
|
Re: [Rule Set proposal] French Rules

On Tue, Jun 17, 2008 at 12:11 PM, John GALLET <spamassassinlist@...> wrote:
> Hi,
>
> This is my first post on this list and first ruleset, so please point me to
> the right place/documents if I am doing anything wrong.
>
> According to a search of this list on markmail.org, there have been few
> subjects about spam in French and (no disrespect meant) I would agree with
> the comments I read about the current French Ruleset being inadequate (tried
> it, did not keep any of it).
>
> So I would like to propose a set for French Rules and get your feedback.
>
> You can find both the rules and some sample spam email messages (two of them
> missing, I have hits in my log files, but deleted them) at the following
> URL: http://www.saphirtech.fr/spam/
>
> I have been running these for about a month sitewise on three domains, I
> have not seen any false positives (yet).
>
> Sincerely,
> JG

I was able to access the URL you mentioned, but not all of the files below it. I received:

"Forbidden
You don't have permission to access /spam/FR_PAYLESSTAXES.txt on this server."

Dave
|
|
Re: [Rule Set proposal] French Rules

Hi,

> I was able to access the URL you mentioned, but not all of the files
> below it. I received:
> "Forbidden
> You don't have permission to access /spam/FR_PAYLESSTAXES.txt on this server."

Sorry guys, only the ruleset file (the one I tried, of course) was readable; all the non-empty spam samples had bad permissions. This is fixed.

I am still missing samples for two rules: even though I did have hits according to /var/spool/maillog, I did not save the messages.

John
|
|
Re: [Rule Set proposal] French Rules

John GALLET writes:
> Hi,
>
> This is my first post on this list and first ruleset, so please point me
> to the right place/documents if I am doing anything wrong.
>
> According to a search of this list on markmail.org, there have been few
> subjects about spam in French and (no disrespect meant) I would agree with
> the comments I read about the current French Ruleset being inadequate
> (tried it, did not keep any of it).
>
> So I would like to propose a set for French Rules and get your feedback.

by the way, if you're reasonably perl-capable, it might be worthwhile using the algorithm I use to generate the JM_SOUGHT ruleset for english spam: http://taint.org/tag/rule-discovery

you just give it a corpus of spam samples and it generates the rules for you. The code is in SpamAssassin SVN.

--j.
|
|
Re: [Rule Set proposal] French Rules

> I still miss samples for two rules, even if I did had hits according to
> /var/spool/maillog I did not save them.

I added a sample for the FR_NOTSPAM rule, and I removed the FR_YOURELUCKY rule because I see other forms of the text getting through, so it is not effective. On the other hand, nearly all these messages are caught by RBL rules, so I might even drop it for good if I can't find an effective pattern.

John

PS: reminder, rules and samples are available at http://www.saphirtech.fr/spam/
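One possible reason the FR_YOURELUCKY pattern lets variants through: as posted, each optional group sits between two literal spaces, so when a group is absent the pattern demands two consecutive spaces. The sketch below checks this with Python's `re` module, which handles this particular pattern the same way Perl does. The sample phrases and the `FIXED` variant are illustrative assumptions, not part of the posted ruleset.

```python
import re

# Pattern exactly as posted in the FR_YOURELUCKY rule (case-insensitive).
POSTED = re.compile(r"\b(tentez)? votre (jour de)? chance\b", re.I)

# Hypothetical rewrite: each optional word carries its own trailing space.
FIXED = re.compile(r"\b(?:tentez )?votre (?:jour de )?chance\b", re.I)

# Made-up sample phrases, not taken from the spam corpus.
SAMPLES = [
    "Tentez votre chance",
    "Tentez votre jour de chance",
    "C'est votre jour de chance",
]


def hits(pattern, samples):
    """Return the samples in which the pattern matches somewhere."""
    return [s for s in samples if pattern.search(s)]


if __name__ == "__main__":
    print("posted:", hits(POSTED, SAMPLES))
    print("fixed: ", hits(FIXED, SAMPLES))
```

With these samples the posted pattern misses the plain "Tentez votre chance" wording, because nothing fills the gap between the two spaces around the optional "jour de" group.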
|
|
RE: [Rule Set proposal] French Rules

> -----Original Message-----
> From: jm@... [mailto:jm@...]
> Sent: Wednesday, June 18, 2008 12:10 PM
> To: John GALLET
> Cc: users@...
> Subject: Re: [Rule Set proposal] French Rules
>
> ...omitted...
>
> by the way, if you're reasonably perl-capable, it might be worthwhile
> using the algorithm I use to generate the JM_SOUGHT ruleset for english
> spam: http://taint.org/tag/rule-discovery
>
> you just give it a corpus of spam samples and it generates the rules
> for you. The code is in SpamAssassin SVN.
>
> --j.

Nah, that's great!

I regret I can only occasionally read interesting messages due to my own time constraints. Otherwise I could have read about this set of scripts weeks ago...

How is this code supposed to be used? I see these scripts in rule-dev: maildir-scan-headers, seek-phrases-in-corpus, seek-phrases-in-log and strip-high-scorers-from-log.

Could you give us a brief description of what they do and how to use them?

Nice idea, Justin!

Giampaolo
|
|
Re: [Rule Set proposal] French Rules

Giampaolo Tomassoni writes:
> How this code is supposed to be used? I see these scripts in rule-dev:
> maildir-scan-headers, seek-phrases-in-corpus, seek-phrases-in-log and
> strip-high-scorers-from-log.
>
> Give us a brief description of their work and usage.

Basically, you collect 2 corpora:

1. a big corpus of ham samples, stuff that you do not want to match.

2. a smaller corpus of spam samples.

You run "seek-phrases-in-corpus" over the 2 corpora, and it'll spit out the patterns; you can then write rules based on these.

Alternatively run "mass-check" and "seek-phrases-in-log" directly as that script does, to get a bit more control (and generate real SpamAssassin rules). That's what the JM_SOUGHT scripts do. See below:

http://taint.org/x/2008/seekrules_run

that script also calls "mk_meta_rule", which is here:

http://taint.org/x/2008/mk_meta_rule

--j.
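The idea behind the two-corpus approach can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm that seek-phrases-in-corpus actually implements: it counts word n-grams that occur in several spam messages and discards any that also occur in ham. The function name `candidate_phrases`, its parameters and the toy corpora are all hypothetical.

```python
from collections import Counter

# Toy corpora, invented for this example (one string per message).
SPAM = [
    "tentez votre chance au casino en ligne",
    "bonus gratuit au casino en ligne aujourd'hui",
    "vous recevez ce message car vous aimez le casino en ligne",
]
HAM = [
    "vous recevez ce message car vous etes inscrit",
    "reunion en ligne demain matin",
]


def ngrams(text, n):
    """Yield the word n-grams of a message, lowercased."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def candidate_phrases(spam, ham, n=3, min_spam_hits=2):
    """Return (phrase, spam_hits) pairs for n-grams found in at least
    min_spam_hits spam messages and in no ham message at all."""
    ham_seen = set()
    for msg in ham:
        ham_seen.update(ngrams(msg, n))

    spam_hits = Counter()
    for msg in spam:
        # set() so that a phrase counts once per message
        spam_hits.update(set(ngrams(msg, n)))

    found = [(p, c) for p, c in spam_hits.items()
             if c >= min_spam_hits and p not in ham_seen]
    # Most frequent first; ties broken alphabetically for stable output.
    return sorted(found, key=lambda pc: (-pc[1], pc[0]))


if __name__ == "__main__":
    for phrase, count in candidate_phrases(SPAM, HAM):
        print(count, phrase)
```

The ham corpus does the real work here: any phrase it contains is vetoed, which is why it should be large and representative of the mail you want to keep.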
|
|
RE: [Rule Set proposal] French Rules

> -----Original Message-----
> From: jm@... [mailto:jm@...]
> Sent: Thursday, June 19, 2008 5:28 PM
> To: Giampaolo Tomassoni
> Cc: jm@...; users@...
> Subject: Re: [Rule Set proposal] French Rules
>
> Basically, you collect 2 corpora:
>
> 1. a big corpus of ham samples, stuff that you do not want to match.
>
> 2. a smaller corpus of spam samples.
>
> You run "seek-phrases-in-corpus" over the 2 corpora, and it'll spit out
> the patterns; you can then write rules based on these.
>
> Alternatively run "mass-check" and "seek-phrases-in-log" directly as
> that script does, to get a bit more control (and generate real
> SpamAssassin rules). That's what the JM_SOUGHT scripts do. See below:
>
> http://taint.org/x/2008/seekrules_run
>
> that script also calls "mk_meta_rule", which is here:
> http://taint.org/x/2008/mk_meta_rule

Running seek-phrases-in-corpus I get a lot of these:

"Wide character in print at /home/whatever/masses/plugins/Dumptext.pm line 26."

Is it an issue with UTF-8 multibyte characters?

Giampaolo
|
|
Re: [Rule Set proposal] French Rules

Giampaolo Tomassoni writes:
> Running seek-phrases-in-corpus I get a lot of these:
>
> "Wide character in print at
> /home/whatever/masses/plugins/Dumptext.pm line 26."
>
> Is it an issue with UTF-8 multibyte characters?

yes. It seems harmless -- I never got around to tracking it down.
|
|
RE: [Rule Set proposal] French Rules

> -----Original Message-----
> From: jm@... [mailto:jm@...]
> Sent: Thursday, June 19, 2008 5:49 PM
> To: Giampaolo Tomassoni
> Cc: jm@...; users@...
> Subject: Re: [Rule Set proposal] French Rules
>
> ...omitted...
>

OK, I see I have to get a copy of some reference mass-check: mine is mostly in Italian, and I'm getting a lot of patterns that could easily result in FPs. See:

# 1.000 6.655 0.000
body SEEK_OKRP_V /We/
# 1.000 4.292 0.000
body SEEK_ZHYXLF / Redmond, WA /
# 1.000 4.292 0.000
body SEEK_EFMKIR /Microsoft/
# 1.000 4.040 0.000
body SEEK_V__XNS /Get/
# 1.000 3.841 0.000
body SEEK_EXHMOF /This/

Thank you Justin,

Giampaolo
|
|
Re: [Rule Set proposal] French Rules

Giampaolo Tomassoni writes:
> Ok, I see I have to get a copy of some reference mass-check: mine is mostly
> in Italian and I'm getting a lot of stuff which could easily result in FPs.
> See:
>
> # 1.000 6.655 0.000
> body SEEK_OKRP_V /We/
> # 1.000 4.292 0.000
> body SEEK_ZHYXLF / Redmond, WA /
> # 1.000 4.292 0.000
> body SEEK_EFMKIR /Microsoft/
> # 1.000 4.040 0.000
> body SEEK_V__XNS /Get/
> # 1.000 3.841 0.000
> body SEEK_EXHMOF /This/

yeah, you'll need to ensure your ham corpus contains lots of both english _and_ Italian text ;)

--j.
|
|
hit frequencies (was Re: [Rule Set proposal] French Rules)

Hi,

First of all, thanks to Justin for patiently helping me to install mass-check and pointing me in the right direction. I will try to run the algorithms tonight to see what they come up with.

In the meantime, you can find a hit-frequencies report at: http://www.saphirtech.fr/spam/freqs_2008_06_23.txt

All rules are prefixed with FR_ and are available in the same directory.

I must say I did not double check for stray spam in my mailbox before using it as a ham corpus but it *should* be clean. I'll double check for next run. The spam corpus was 100% French spam, hand-picked over the last week through the "probably-spam" class (default score values 5-15).

Any feedback on the results (not enough in corpus, bad rules, good rules, etc.) appreciated.

Sincerely,
JG
|
|
Re: hit frequencies (was Re: [Rule Set proposal] French Rules)

On Mon, 23 Jun 2008, John GALLET wrote:

> First of all, thanks to Justin for patiently helping me to install
> mass-check and pointing me in the right direction.

Applause for Justin! This is the sort of thing we need to see for many more specialized spam categories...

> I will try to run the algorithms tonight to see what they come up with.

Thanks for taking this burden upon yourself. One other thing you should be prepared to do, if you're willing to devote long-term responsibility to these rules, is to provide sa-update-compatible feeds of your dynamic rules. This is another thing that Justin can probably help you with.

--
 John Hardin KA7OHZ http://www.impsec.org/~jhardin/
 jhardin@... FALaholic #11174 pgpk -a jhardin@...
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
 The problem is when people look at Yahoo, slashdot, or groklaw and
 jump from obvious and correct observations like "Oh my God, this
 place is teeming with utter morons" to incorrect conclusions like
 "there's nothing of value here". -- Al Petrofsky, in Y! SCOX
-----------------------------------------------------------------------
 11 days until the 232nd anniversary of the Declaration of Independence
|
|
Re: hit frequencies (was Re: [Rule Set proposal] French Rules)

John GALLET wrote:
> Any feedback on the results (not enough in corpus, bad rules, good
> rules, etc.) appreciated.

Looking at the rules, I'm worried about false positives on genuine opt-in advertising. I have a number of users who choose to receive all kinds of advertising blurb, so I'll run your rules with very low scores for a while to see what gets hit.

John.

--
-- Over 3000 webcams from ski resorts around the world - www.snoweye.com
-- Translate your technical documents and web pages - www.tradoc.fr
|
|
Re: hit frequencies (was Re: [Rule Set proposal] French Rules)

Hi again,

> Looking at the rules, I'm worried about false positives on genuine opt-in
> advertising. I have a number of users who choose to receive all kinds of
> advertising blurb,

This is one of the reasons why I did not hunt for "click here" and "if you can't see this email in html". Now correct me if I am wrong (ouch, no, not on the head), but isn't this what whitelist_from is for? I was never able to let the Intel newsletter through (it is in English); it would always be caught by SA. The same went for genuine Microsoft Support answers (ok, don't laugh).

> so I'll run your rules with very low scores for a while to see what gets
> hit.

You can find a little more information, and exactly this suggestion, at http://www.saphirtech.fr/spamassassin.html

JG
|
|
Re: hit frequencies (was Re: [Rule Set proposal] French Rules)

> Thanks for taking this burden upon yourself. One other thing you should be
> prepared to do, if you're willing to devote long-term responsibility to these
> rules, is to provide sa-update-compatible feeds of your dynamic rules. This
> is another thing that Justin can probably help you with.

I am happy to try, but honestly I am not worried about the feed part: all it boils down to is putting the right file at the right place (be it push or pull, ftp or rsync, whatever). What I am more worried about is testing the rules regularly and, even before that, checking that they are valid. They are "good" on my system with my users, but then they were custom-tailored to be so.

JG
|
|
Re: hit frequencies (was Re: [Rule Set proposal] French Rules)

On 6/23/2008 4:36 PM, John GALLET wrote:
> Hi,
>
> First of all, thanks to Justin for patiently helping me to install
> mass-check and pointing me in the right direction. I will try to run the
> algorithms tonight to see what they come up with.
>
> In the meantime, you can find a hit-frequencies report at:
> http://www.saphirtech.fr/spam/freqs_2008_06_23.txt
>
> All rules are prefixed with FR_ and are available in the same directory.
>
> I must say I did not double check for stray spam in my mailbox before
> using it as a ham corpus but it *should* be clean. I'll double check for
> next run. The spam corpus was 100% French spam, hand-picked over the
> last week through the "probably-spam" class (default score values 5-15).
>
> Any feedback on the results (not enough in corpus, bad rules, good
> rules, etc.) appreciated.

I excluded the last two rules from my masscheck to avoid FPs, as these ESPs/X-Mailers are definitely grey ("import rcpt list and blast" sort of ESPs), not black for global use.

#counts FR_SPAMISLEGAL 8s/2h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts FR_SPAMISLEGAL_2 5s/2h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts FR_NOTSPAM 0s/0h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts FR_PAYLESSTAXES 0s/0h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts FR_REALESTATE_INVEST 0s/0h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts FR_ONLINEGAMBLING 0s/0h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts FR_ONLINEMEDS 0s/0h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts FR_REASON_SUBSCRIBE 1s/1h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts FR_HOWTOUNSUBSCRIBE 7s/16h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08

If these are the hit rates with a very minimal daily corpus, I don't know if the present ruleset is ready for production, unless you have 0 tolerance for any bulk, period.
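To make the counts above easier to compare, here is a small Python sketch that turns each `Ns/Mh` pair into spam and ham hit percentages plus an S/O ratio. It assumes S/O is computed as spam% / (spam% + ham%), which is how I understand SpamAssassin's hit-frequencies report defines it; the helper name `summarize` is made up for illustration.

```python
SPAM_TOTAL = 1166  # spam messages in the corpus (the "1166s" figure)
HAM_TOTAL = 2693   # ham messages in the corpus (the "2693h" figure)

# (rule, spam hits, ham hits) copied from the non-zero #counts lines
COUNTS = [
    ("FR_SPAMISLEGAL", 8, 2),
    ("FR_SPAMISLEGAL_2", 5, 2),
    ("FR_REASON_SUBSCRIBE", 1, 1),
    ("FR_HOWTOUNSUBSCRIBE", 7, 16),
]


def summarize(spam_hits, ham_hits, spam_total=SPAM_TOTAL, ham_total=HAM_TOTAL):
    """Return (spam%, ham%, S/O) for one rule.

    S/O is assumed to be spam% / (spam% + ham%); it is None when the
    rule hit nothing at all.
    """
    spam_pct = 100.0 * spam_hits / spam_total
    ham_pct = 100.0 * ham_hits / ham_total
    total = spam_pct + ham_pct
    s_o = spam_pct / total if total else None
    return spam_pct, ham_pct, s_o


if __name__ == "__main__":
    for rule, s, h in COUNTS:
        spam_pct, ham_pct, s_o = summarize(s, h)
        print(f"{rule:22s} spam {spam_pct:5.2f}%  ham {ham_pct:5.2f}%  S/O {s_o:.2f}")
```

On these figures FR_HOWTOUNSUBSCRIBE hits ham about as often as spam (S/O near 0.5), while FR_SPAMISLEGAL leans clearly towards spam (S/O around 0.9), though with so few hits neither number is very reliable.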
|
|
seekrules over French spam (was Re: [Rule Set proposal] French Rules)