"Match whole word only" ?

Post by Pumbaa80 »

Hi!
This goes mainly to Wladimir, but I'd like everybody to contribute their thoughts:

First of all, great work! I just read your article on http://adblockplus.org/blog/investigati ... algorithms
and here is my idea:

I'd like a way to prevent filters like "ad" from also matching "head" etc.
At the moment, the only way to do that seems to be using several filters like this:

Code:

.ad.
/ad.
/ad/*
/ad-
.ad-
.ad_
/ad_

or a (slow) regular expression of the form

Code:

/([^\w]|_)ad([^\w]|_)/

It should be possible to add a "match whole word only" checkbox that can be enabled for each filter individually. If the checkbox is enabled, Adblock Plus could take the word "ad" and either convert it to a regular expression like the one above, or match "ad" with the Boyer-Moore algorithm and then check the characters to its left and right.
I know that this will slow down the algorithm, but I think it's worth it.
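
To make the second variant concrete, here is a rough Python sketch (not Adblock Plus code; the function names are made up for illustration). It uses Python's built-in `str.find` as a stand-in for the Boyer-Moore search and then looks at the neighbouring characters. One assumption that differs from the regular expression above: the start and the end of the URL also count as word boundaries.

```python
def is_word_char(ch: str) -> bool:
    # Same idea as the regular expression above: ASCII letters and digits
    # belong to a word; "_" and every other character act as separators.
    return ch.isascii() and ch.isalnum()


def matches_whole_word(url: str, word: str) -> bool:
    """Return True if `word` occurs in `url` with a separator on both sides."""
    start = url.find(word)  # stand-in for the Boyer-Moore substring search
    while start != -1:
        end = start + len(word)
        # Assumption: the beginning and end of the URL count as boundaries.
        left_ok = start == 0 or not is_word_char(url[start - 1])
        right_ok = end == len(url) or not is_word_char(url[end])
        if left_ok and right_ok:
            return True
        # This occurrence was embedded in a longer word; try the next one.
        start = url.find(word, start + 1)
    return False


if __name__ == "__main__":
    for url in ("http://example.com/head.png",
                "http://example.com/ad/banner.gif",
                "http://example.com/show_ad.gif"):
        print(url, matches_whole_word(url, "ad"))
```

The extra cost per hit is only two character comparisons, so the slowdown should mostly come from filters that hit often inside longer words.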
Another method would be to split each URL into its components, i.e.

Code:

http://ad.server.com/?showad.php&ad=test

becomes

Code:

http
ad
server
com
showad
php
ad
test

and then apply the Boyer-Moore algorithm to that list for the filters that have "whole word only" enabled. I think this would roughly double the execution time.
Still, it could be very useful.
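
A minimal Python sketch of the splitting idea, under my own assumptions: every run of characters other than ASCII letters and digits is treated as a separator, and `split_url` and `matching_filters` are hypothetical names. Because a whole-word filter has to equal a complete component, the sketch swaps the Boyer-Moore pass for a plain set lookup; that substitution is mine, not part of the proposal.

```python
import re


def split_url(url: str) -> list[str]:
    # Split on every run of characters that is not an ASCII letter or digit
    # and drop the empty strings that can appear at either end.
    return [part for part in re.split(r"[^A-Za-z0-9]+", url) if part]


def matching_filters(url: str, whole_word_filters: set[str]) -> set[str]:
    """Return the whole-word filters that equal one of the URL's components."""
    return whole_word_filters & set(split_url(url))


if __name__ == "__main__":
    url = "http://ad.server.com/?showad.php&ad=test"
    print(split_url(url))
    print(matching_filters(url, {"ad", "ads"}))
```

Note that "showad" stays one component, so a whole-word filter "ad" matches this URL only because of "ad.server.com" and "ad=test".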

What do you think?

Post by bur »

I think that's a really good idea. Having a placeholder like $ or something similar that covers several special characters or even strings like "http://", ".", "/" or "?" would be very helpful.

With the new advantage of using simple filters, you either have to create many filters like "*/ads/", "//ads.", ".ads." and so on, or use a plain "ads", which would create loads of false positives. Being able to put in a simple "$ads$" would be great.

So this would really be a very nice feature. :)

Post by bur »

I think that feature wouldn't even involve much work, as Adblock could simply expand a "$ads" into all corresponding filters like "/ads", "-ads", "?ads", ".ads" and so on, and then just treat them like normal filters.

So this $-placeholder would work more like a macro and wouldn't require any new filter algorithms.

Post by Guest »

Right, that was my second thought. I guess the special characters
./?-_=&%|
should be sufficient ("|" marking the beginning or the end of a line). So a word like "pattern" would be hashed 81 times, once for each pairing of a preceding and a trailing special character (9 × 9), producing an "invisible" list with

Code:

.pattern.
/pattern.
?pattern=
&pattern=
etc...

The idea with the "$" is even better than a difficult-to-implement checkbox.
However, in most cases I guess the "$"s should be used in pairs(?)
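
For illustration, the macro expansion could look roughly like this in Python. This is only a sketch under assumptions: `expand_placeholder` is a made-up name, the "$" is only recognised at the very beginning and/or end of a filter, and the character set is the one listed above.

```python
from itertools import product

# The nine special characters suggested above; "|" stands for the
# beginning or the end of the line.
SPECIAL_CHARS = "./?-_=&%|"


def expand_placeholder(pattern: str, placeholder: str = "$") -> list[str]:
    """Expand a leading and/or trailing placeholder into plain filters."""
    leading = pattern.startswith(placeholder)
    # A lone "$" is treated as a leading placeholder only.
    trailing = pattern.endswith(placeholder) and len(pattern) > len(placeholder)
    word = pattern[len(placeholder):] if leading else pattern
    if trailing:
        word = word[:-len(placeholder)]
    prefixes = SPECIAL_CHARS if leading else [""]
    suffixes = SPECIAL_CHARS if trailing else [""]
    # Every prefix is combined with every suffix.
    return [p + word + s for p, s in product(prefixes, suffixes)]


if __name__ == "__main__":
    print(len(expand_placeholder("$pattern$")))
    print(expand_placeholder("$ads"))
```

With both placeholders set, "$pattern$" yields the 9 × 9 = 81 plain filters mentioned above, while a single "$" as in "$ads" yields only 9.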

Greetings
Pumba