That's an unnecessary assumption; we had mms:// ads, for example.
mms:// ads exist, but they are not frequent enough to deserve a place in a syntax meant for very common usage.
But why should we? If the filter is "^referrer=", why should it match "?referrer=Firefox" but not "!referrer=Firefox"? In its current form the separator placeholder seems more useful in the common scenarios (meaning: the end of the domain name and query parameters). Particularly for query parameters I've seen the strangest solutions, so this regexp intentionally excludes only relatively few characters.
Because they are defined in RFC 2396. Reserved characters are very common in use; all URIs should follow RFC 2396.
I'm afraid you totally misunderstood the idea; your code below will be very slow with the number of filters Adblock Plus usually deals with... Maybe
blog/investigat ... algorithms will help (it doesn't quite describe the current algorithm, but it is close enough).
I know the algorithm; I have read it.
Here is an example to explain my code.
You have some filter rules:
bar.com/adpic/
bar.com/sexypic*
foo.com/adpic/
You want to filter this URL:
http://foo.bar.com/adpic/a.jpg
First: the program generates a shortcut for every rule and stores them in the variable 'shortcuts':
||bar.com/adpic/ => bar com adpic => adpic
||bar.com/sexypic* => bar com sexypic* => bar
||foo.com/adpic/ => foo com adpic => adpic
The variable 'shortcuts' looks like this:
{
bar: (RegExp '||bar.com/sexypic*')
adpic: [(RegExp '||bar.com/adpic/'), (RegExp '||foo.com/adpic/')]
}
This step can be finished before any URL is checked.
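The build phase described above can be sketched in Python (the language of my script). This is only a sketch under assumptions: rule_to_regex is a simplified stand-in for real Adblock rule compilation, and the choice of which token becomes the key is naive (the last token not touching a wildcard), so the sexypic rule keys on 'com' here rather than 'bar' as in the hand-worked example:

```python
import re

def rule_to_regex(rule):
    # Simplified conversion; real Adblock syntax also handles ||, ^, etc.
    return re.compile(re.escape(rule.lstrip('|')).replace(r'\*', '.*'))

def build_shortcuts(rules):
    """Build the shortcut table once, before any URL is checked."""
    shortcuts = {}
    for rule in rules:
        # Candidate tokens: 3+ word chars, not adjacent to a '*' wildcard.
        candidates = re.findall(r'(?<![\w%*])[\w%]{3,}(?![\w%*])', rule)
        if not candidates:
            continue  # no usable token; such rules need a slow path
        key = candidates[-1]  # naive pick; a smarter pick uses the rarest token
        regex = rule_to_regex(rule)
        if key not in shortcuts:
            shortcuts[key] = regex                    # single rule: store the regex
        elif isinstance(shortcuts[key], list):
            shortcuts[key].append(regex)              # third or later rule
        else:
            shortcuts[key] = [shortcuts[key], regex]  # promote to a list
    return shortcuts
```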
Now you get the URL 'http://foo.bar.com/adpic/a.jpg'.
The program breaks it into tokens first:
url='http://foo.bar.com/adpic/a.jpg' => tokens=['http', 'foo', 'bar', 'com', 'adpic', 'jpg']
Now, for each element in tokens, do an IN operation:
'http' in shortcuts? => no, skip this
'foo' in shortcuts? => no, skip this
'bar' in shortcuts? => yes; get shortcuts['bar'], its length=0, so it is a RegExp; run the match, no match, skip
...
'adpic' in shortcuts? => yes!
Now get the element: r = shortcuts['adpic'].
Notice that the type of r is not fixed: it may be a RegExp or an Array.
By checking r.length we know that r is an Array this time (all the RegExp objects have length set to 0).
So for each element of r (now all the elements are RegExps) we do the regex match;
if any element matches, the URL is certainly in the filter list. Return yes this time.
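The lookup phase just described could look like this in Python (a sketch: the JavaScript original tells a RegExp apart from an Array by its length property, here isinstance is used instead, and the compiled regexes are simplified stand-ins for real rule compilation):

```python
import re

# A tiny shortcut table assumed to be built ahead of time,
# mirroring the worked example above.
shortcuts = {
    'bar':   re.compile(r'bar\.com/sexypic.*'),
    'adpic': [re.compile(r'bar\.com/adpic/'), re.compile(r'foo\.com/adpic/')],
}

def url_matches(url, shortcuts):
    """Return True if any rule in the shortcut table matches the URL."""
    # Break the URL into the same kind of tokens used as shortcut keys.
    for token in re.findall(r'[\w%]{3,}', url):
        entry = shortcuts.get(token)
        if entry is None:
            continue          # 'http', 'foo', 'jpg': not a key, skip
        rules = entry if isinstance(entry, list) else [entry]
        if any(r.search(url) for r in rules):
            return True       # some rule under this token matched
    return False
```

For the example URL, 'bar' hits the single sexypic regex (no match), then 'adpic' hits the two-element list and the first regex matches.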
In your algorithm, it takes len(url)-7 operations (an IN operation plus a "MustBe" substring each time).
In this algorithm, it takes len(url)/avglen(key of shortcut) operations (an IN operation and, conditionally at a very small rate depending on the IN operator, an array loop), and only one re.match.
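To put rough numbers on that comparison (the len(url)-7 above assumes fixed 8-character shortcuts, one hash lookup per 8-character substring of the URL):

```python
import re

url = 'http://foo.bar.com/adpic/a.jpg'

# Fixed-length shortcuts: one hash lookup per 8-character substring.
substring_lookups = len(url) - 7
# Token shortcuts: one hash lookup per token of 3+ word characters.
token_lookups = len(re.findall(r'[\w%]{3,}', url))

print(substring_lookups, token_lookups)  # 23 vs 6 for this URL
```

The gap grows with URL length, since tokens average several characters each while substrings advance one character at a time.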
I am using this algorithm in an AutoProxy switch script in Python.
With 15000 rules, a URL of len=2000 only takes 1 ms to process (XP SP3, Python 2.6, i750 CPU).
With [\w%*]{3,}, I just changed the list to Adblock's list and ignored all EHH rules; the log is below.
And this algorithm makes it possible to grab tokens from regexes!
Which means regexes can be optimized..
I have done that in my script, so there are few slow rules ..........
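Pulling a usable token out of a regex rule could be sketched like this (a rough heuristic of my own, not a regex parser; a real implementation must also reject tokens that are optional or sit next to a quantifier, since those may not appear literally in matching URLs):

```python
import re

def token_from_regex(pattern):
    """Try to pull a literal token (3+ word chars) out of a regex source,
    so even regex rules can get a shortcut key instead of running
    against every URL."""
    # Blank out escapes, character classes, and metacharacters,
    # then look for a plain literal run in what remains.
    stripped = re.sub(r'\\.|\[[^\]]*\]|[(){}|?+*^$.]', ' ', pattern)
    tokens = re.findall(r'[\w%]{3,}', stripped)
    # Prefer the longest literal; it is the least likely to be ambiguous.
    return max(tokens, key=len) if tokens else None
```

A regex with no literal run of 3+ characters gets no key and stays on the slow path.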
Code:
E:\Projects\proxy-workingset\local>cmd /k py26 proxy.py
getting autoproxy data from sources:
http://malwaredomains.lanik.us/malwaredomains_full.txt... 244649bytes, in 1.66s.
converting autoproxy data to hashed compiledRegex ...
rule_optimize: [\w*%]{3,} total: 13162 direct: 0 proxy: 13162
direct_hash(very fast): 0 / 0.00%
direct_shash(fast) : 0 / 0.00% keys: 0 / 0.00%
direct_share(slow) : 0*/ 0.00%
proxy_hash(very fast) : 12325 / 93.64%
proxy_shash(fast) : 826 / 6.28% keys: 588 / 71.19%
proxy_share(slow) : 11*/ 0.08%
* : key of speed, lower is better
---------------------------------------------------
ListenAddress: 127.0.0.1:8086
HTTPS Enabled: YES
OpenSSLModule: YES
Proxies Count: GAE(4) PHP(1) DIRECT(1)
AutoProxyList: 1
---------------------------------------------------
getting autoproxy data from sources:
https://easylist-downloads.adblockplus.org/easylist.txt... 235531bytes. in 2.219s
converting autoproxy data to hashed compiledRegex ...
rule_optimize: [\w*%]{3,} total: 5780 direct: 359 proxy: 5421
direct_hash(very fast): 352 / 98.05%
direct_shash(fast) : 7 / 1.95% keys: 6 / 85.59%
direct_share(slow) : 0*/ 0.00%
proxy_hash(very fast) : 4672 / 86.18%
proxy_shash(fast) : 738 / 13.61% keys: 407 / 55.15%
proxy_share(slow) : 11*/ 0.20%
* : key of speed, lower is better
Config reloaded
hash means an ID mapped to a single CompiledRegex.
shash means a hybrid of hash and share: an ID mapped to a CompiledRegex list (an Array in JavaScript).
share means no ID for the rule; these are executed every time a URL passes in.
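Under the same sketch assumptions as before, the three buckets in the log fall out of the shortcut table's value types (the table and the keyless rule here are hypothetical examples):

```python
import re

# Hypothetical table: one single-regex key, one two-regex key,
# plus one keyless rule with no 3+ char literal token to key on.
shortcuts = {
    'bar':   re.compile(r'bar\.com/sexypic.*'),
    'adpic': [re.compile(r'bar\.com/adpic/'), re.compile(r'foo\.com/adpic/')],
}
keyless_rules = [re.compile(r'^\W+$')]

def classify(shortcuts, keyless_rules):
    """hash  = key mapped to a single regex (fastest),
       shash = key mapped to a regex list,
       share = no key: run against every URL (slowest)."""
    hash_n  = sum(1 for v in shortcuts.values() if not isinstance(v, list))
    shash_n = sum(len(v) for v in shortcuts.values() if isinstance(v, list))
    return {'hash': hash_n, 'shash': shash_n, 'share': len(keyless_rules)}
```

The log percentages above are just these three counts over the rule total.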
No performance cost but high maintenance cost. What's the point if nobody will use it? If somebody only needs the engine - here it is.
I know little about the Firefox XPCOM interfaces; I just know how to use them, not how to write them ...
And I am very poor at UI design (my XUL knowledge is about on par with my HTML)..... so I would like someone else to take on this heavy work, LoL...