[Product-Developers] Re: Looking for A word filter for plone

Mark Phillips mark at phillipsmarketing.biz
Thu Dec 6 14:57:42 UTC 2007


On Thu, 2007-12-06 at 09:14 -0500, Reinout van Rees wrote:
> Mark Phillips wrote:
> 
> > I had thought of that, but was less sure how to proceed with that idea
> > than with a "simple" word filter. :-)
> 
> I'm pretty sure the schoolkids will gladly take on the challenge of
> tricking the simple word filter. Inserting one extra space in
> "overgehaalde dekzwabber" or so will break the simple filter.
> 
> It will just not be worth the effort.
> 
> Bayesian filters sound better. An advantage is that they will
> effectively be trained on single words, so they will filter out the
> bad stuff pretty quickly.
> 
> Reinout
> 
> --
> Reinout van Rees  - Programmer at http://zestsoftware.nl/
> http://vanrees.org/weblog/          reinout @ vanrees.org
> "Information overload isn't the problem. If it was, you'd
> walk into a library and die." (David Allen)
> 
> 
I agree with you. Since I found the links to the Bayesian filter in
Python, I think it might not be that hard to implement in a workflow.

1. User submits a document for review.
2. The document is scanned and sent to either the spam state or the
review (ham) state. The spam state holds items that score close to the
threshold. Anything scored as 100% spam is rejected automatically.
3. The reviewer has two worklists: spam and ham.
3a. From the ham worklist, the reviewer can reject, publish, or mark an
item as spam. Items marked as spam and items that are published both go
back to the filter to train it.
3b. From the spam worklist, the reviewer can reject or publish an item.
Rejected items and published items both go back to the filter to train
it.
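
The steps above could be sketched roughly like this in plain Python. This
is only an illustration, not the Bayesian filter linked earlier in the
thread and not real Plone workflow code: TinyBayes, route,
record_decision, REJECT_AT and SPAM_AT are all hypothetical names, and
the threshold values are guesses.

```python
import math
import re
from collections import Counter

# Hypothetical thresholds: the workflow only says "100% spam" and
# "close to the threshold", so these numbers are assumptions.
REJECT_AT = 0.99   # at or above this score: rejected automatically
SPAM_AT = 0.5      # at or above this score: goes to the spam worklist


class TinyBayes:
    """Minimal naive Bayes text classifier with add-one smoothing."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        return re.findall(r"[a-z']+", text.lower())

    def train(self, text, label):
        self.words[label].update(self.tokens(text))
        self.docs[label] += 1

    def spam_probability(self, text):
        total_docs = sum(self.docs.values())
        if total_docs == 0:
            return 0.5  # untrained filter: no opinion either way
        vocab = len(set(self.words["spam"]) | set(self.words["ham"]))
        toks = self.tokens(text)
        log_score = {}
        for label in ("spam", "ham"):
            # Smoothed class prior, then smoothed per-word likelihoods.
            score = math.log((self.docs[label] + 1) / (total_docs + 2))
            n = sum(self.words[label].values())
            for tok in toks:
                score += math.log((self.words[label][tok] + 1) / (n + vocab + 1))
            log_score[label] = score
        # Turn the two log scores into P(spam); clamp to avoid overflow.
        diff = max(min(log_score["ham"] - log_score["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))


def route(probability):
    """Step 2: pick the state for a newly submitted document."""
    if probability >= REJECT_AT:
        return "rejected"
    if probability >= SPAM_AT:
        return "spam"
    return "review"


def record_decision(classifier, text, worklist, action):
    """Steps 3a/3b: feed the reviewer's decision back into the filter.

    Returns the label used for training, or None when the steps above
    do not mention training (rejecting an item from the ham worklist).
    """
    if action == "publish":
        label = "ham"
    elif action == "spam" or (worklist == "spam" and action == "reject"):
        label = "spam"
    else:
        label = None
    if label is not None:
        classifier.train(text, label)
    return label


if __name__ == "__main__":
    bayes = TinyBayes()
    bayes.train("buy cheap pills now", "spam")
    bayes.train("school trip report for class", "ham")
    print(route(bayes.spam_probability("cheap pills")))
```

In a real product the classifier would have to be stored persistently,
and route() would be called from whatever hook fires on the submit
transition; both of those are left out here.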

A rough cut off the top of my head. Any suggestions?

BTW, what the heck does "overgehaalde dekzwabber" mean in English? I
couldn't find a Dutch web translation service that would translate it.
:-)

Mark






