Feature request: Ignoring query string

Wladimir Palant

Feature request: Ignoring query string

Post by Wladimir Palant »

Reposting feature requests from lyricconch who couldn't post them from China (but will hopefully be able to reply):

1. A || at the end of a filter should make it ignore all characters in the query string. This is very useful to avoid filtering URLs such as http://www.google.com/search?q=www.something.com

2. I'd like ABP to offer an XPCOM component such as "@adblockplus.org/urlmatcher;1" so that the matcher module can be reused by other extensions. For example, an extension that searches for "downloadable" URLs according to some kind of rule would not have to loop over its rules itself to find those URLs.
Wladimir Palant

Re: Feature request: Ignoring query string

Post by Wladimir Palant »

1. To provide some context for this question: I guess that this is about Google Instant. Google introduced that service a few days ago, it is enabled by default for searches on google.com. The search results are loaded and displayed immediately as you type in the search term. This works by sending off an XMLHttpRequest that will retrieve the results, the search terms the user entered are in the query string of that request. And I already found two subscriptions using the filter /[^a-z]count[^a-z]/ which would block the request if the user entered the word "count". Note that this isn't a new issue - same would have happened when using Google in a frame. This new Google feature only made the problem more obvious.

I doubt that adjusting the filter syntax to deal with this scenario is the right solution. There are already several solutions available. For example, one could use the $~xmlhttprequest option. However, this only solves the immediate problem and doesn't change the fact that a bad filter is being used. IMO it is better to remove the ambiguity from the filter and replace it with specific filters - even if it means that several filters have to be added. Looking at EasyList, the filters used instead of the regular expression above are "/ad?count=", "/ad_count.", "_ad_count.". Note that these three filters are significantly faster than the regular expression.
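
To illustrate the difference, here is a minimal sketch in plain JavaScript (not Adblock Plus code). The two URLs are made up for the example, and the three specific filters are treated as simple substring tests, which is an assumption about how they behave.

```javascript
// The ambiguous regular expression filter quoted above.
const broadFilter = /[^a-z]count[^a-z]/;

// The specific EasyList filters, treated here as plain substrings (assumption).
const specificFilters = ["/ad?count=", "/ad_count.", "_ad_count."];

function matchesSpecific(url) {
  return specificFilters.some(part => url.includes(part));
}

// Hypothetical URLs: a search request containing the word "count", and an ad script.
const instantUrl = "http://www.google.com/s?q=count+dracula";
const adUrl = "http://example.com/ad_count.js";

console.log(broadFilter.test(instantUrl), matchesSpecific(instantUrl));
console.log(broadFilter.test(adUrl), matchesSpecific(adUrl));
```

The broad expression hits both URLs, while the specific filters only hit the ad script.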

2. Starting with Adblock Plus 1.3 an extension can already use the Matcher class. Here is the code:

if ("@adblockplus.org/abp/private;1" in Components.classes)
{
  var abpURL = Components.classes["@adblockplus.org/abp/private;1"]
                         .getService(Components.interfaces.nsIURI);
  Components.utils.import(abpURL.spec + "Matcher.jsm");
  alert(new Matcher());
}
Note however that this is private API. Unlike the public API it can change at any time, backwards compatibility isn't being considered here. And I don't want to make this public API - the matcher is an internal mechanism and I might decide to change it significantly to improve performance. This happened several times in the past, I don't want to be bound by compatibility concerns simply to allow other extensions to do something only marginally related to Adblock Plus.

If you want to use the matcher without being afraid that an update to Adblock Plus will break you - the code is there, just take it and put it into your extension. The source code license is MPL, which essentially means: you are free to use it but keep the license text at the top (the one mentioning the initial developer).
lyricconch
Posts: 14
Joined: Fri Apr 16, 2010 8:57 am

Re: Feature request: Ignoring query string

Post by lyricconch »

~ ~ i am here ~ ~ ~ ~ ~~

More regular expression optimizations:
|| at the beginning => ^https?:\/+[^\/]*[\/.]\b
This can be used to match "domain or subdomain" exactly.
Example:
filter : ||bar.com/
uri : https://bar.com/
regex : ^https://\bbar.com
uri : http://foo.bar.com/
regex : ^http://foo.\bbar.com
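
A small sketch of this proposal in plain JavaScript. The helper name anchorToRegExp is hypothetical, and it only handles the || prefix followed by literal text.

```javascript
// Hypothetical helper: turn a filter such as "||bar.com/" into the proposed regex.
function anchorToRegExp(filter) {
  // Escape regex metacharacters in the part after "||".
  const literal = filter.slice(2).replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^https?:\\/+[^\\/]*[\\/.]\\b" + literal);
}

const barFilter = anchorToRegExp("||bar.com/");

console.log(barFilter.test("https://bar.com/"));    // domain itself
console.log(barFilter.test("http://foo.bar.com/")); // subdomain
console.log(barFilter.test("http://foobar.com/"));  // different domain
```

The first two URLs match and the third does not, because the prefix has to end in "/" or "." right before "bar".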

^ => [:@\/;?=&$+,]+/
These are the RFC 2396 reserved characters.
I found that you just replace ^ with *, which creates a ".*" in the regex, and that will be slow.
For example, take the regex "aaa.*bbb" matched against the URI "http://aaa.com/adsfasdfljxcvm,weorulsdjflsjdl".
First, "aaa" matches at positions 8-10. Then .* matches as much as it can, so it runs to the end of the string, and the "bbb" after .* makes the match fail. The regex engine now moves back one character and tries again. It fails again and again until .* matches nothing, and then fails for good. It does a lot of useless work in this process.
But with [:@\/;?=&$+,]+ it fails at position 12 immediately.
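
The two variants can be compared with a short sketch in plain JavaScript. It only shows the match results, not the timing, and the second URL is made up for the example.

```javascript
const sampleUrl = "http://aaa.com/adsfasdfljxcvm,weorulsdjflsjdl";

// Separator treated like a wildcard: ".*" has to backtrack over the whole rest of the URL.
const wildcardVariant = /aaa.*bbb/;

// Separator limited to the reserved characters: gives up as soon as "." follows "aaa".
const reservedVariant = /aaa[:@\/;?=&$+,]+bbb/;

console.log(wildcardVariant.test(sampleUrl), reservedVariant.test(sampleUrl));

// Hypothetical URL where "aaa" and "bbb" are separated only by reserved characters.
console.log(wildcardVariant.test("http://aaa/?bbb"), reservedVariant.test("http://aaa/?bbb"));
```

Both expressions reject the long URL, but the second one does so without scanning to the end of the string first.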

1. This syntax is easy to implement and useful for ignoring the query part, also for performance reasons.
A URI normally has only one "?". When building the regex for a filter, change "filterregex" into this:
^[^?]*filterregex
Then "filterregex" can never start matching behind the question mark.
(Often a URI has a long query, for example on Google; a Google App Engine URL can be 4000+ bytes!
But a URI's path won't be long, so it is acceptable and useful to ignore the query part, because ^[^?]* will
fail immediately once the current position gets past the path part.)
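
A minimal sketch of this prefix in plain JavaScript. The helper name ignoreQuery is hypothetical, and the second URL is made up for the example.

```javascript
// Hypothetical helper: wrap a filter's regex source so that a match
// has to begin before the first "?" of the URL.
function ignoreQuery(filterRegexSource) {
  return new RegExp("^[^?]*(?:" + filterRegexSource + ")");
}

const somethingFilter = ignoreQuery("www\\.something\\.com");

// The domain only appears inside the query string here.
console.log(somethingFilter.test("http://www.google.com/search?q=www.something.com"));

// Here it appears in the host part.
console.log(somethingFilter.test("http://www.something.com/banner.gif?x=1"));
```

Note that the prefix only constrains where a match starts. A filter whose text spans the "?" itself can still match.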

2. Thanks for the code ... I thought that ABP could be written in a new style,
"powered by data" instead of "powered by program logic", as shown below:

ContentPolicy -----+             +---> Matcher(abpwhitelist) => Actions(allow)
HttpSink      -----+--> URI -----|
DOM event     -----+             +---> Matcher(abplist)      => Actions(block)

In this model it is very easy to add functionality based on URI matching, which is very common in extensions (Stylish, Greasemonkey and so on all require "URI matching").
Wladimir Palant

Re: Feature request: Ignoring query string

Post by Wladimir Palant »

lyricconch wrote: More regular expression optimizations:
|| at the beginning => ^https?:\/+[^\/]*[\/.]\b
This can be used to match "domain or subdomain" exactly.
Some URLs to extend your testing:
ftp://example.com/
http://тест.рф/ (yes, internationalized domain names are fun)
And a filter:
||/example.com (shouldn't match http://example.com/ but with your regexp it will)
lyricconch wrote: ^ => [:@\/;?=&$+,]+/
These are the RFC 2396 reserved characters.
Theoretically - yes. But what about the exclamation mark, for example? Theoretically nobody should be building URLs using ! as a separator - but if you look around you will find such URLs. So the separator placeholder accepts everything that isn't common in a URL.

Given that Adblock Plus almost never tests URLs against regular expressions (most of the time the URL is already dismissed by preceding checks) - this kind of micro-optimization isn't very useful.
lyricconch wrote: 1. This syntax is easy to implement and useful for ignoring the query part, also for performance reasons.
A URI normally has only one "?". When building the regex for a filter, change "filterregex" into this:
^[^?]*filterregex
Then "filterregex" can never start matching behind the question mark.
(Often a URI has a long query, for example on Google; a Google App Engine URL can be 4000+ bytes!
But a URI's path won't be long, so it is acceptable and useful to ignore the query part, because ^[^?]* will
fail immediately once the current position gets past the path part.)
As I said above, Adblock Plus usually doesn't run regular expressions so that optimizing them doesn't make much sense. The important scenario is testing of filter shortcuts that happens before any regular expressions are tested. Filter shortcuts are tested for all filters simultaneously - meaning that to improve performance you would need to ignore the query string on all filters. And I don't think you want that...
lyricconch wrote: 2. Thanks for the code ... I thought that ABP could be written in a new style,
"powered by data" instead of "powered by program logic", as shown below:

ContentPolicy -----+             +---> Matcher(abpwhitelist) => Actions(allow)
HttpSink      -----+--> URI -----|
DOM event     -----+             +---> Matcher(abplist)      => Actions(block)

In this model it is very easy to add functionality based on URI matching, which is very common in extensions (Stylish, Greasemonkey and so on all require "URI matching").
How about you ask Stylish/GreaseMonkey guys whether they want to depend on Adblock Plus code? I bet that they don't :)
This kind of approach would be acceptable if the Matcher module were built into Firefox so you could rely on it being always there and offering a reasonably stable API. But Firefox isn't interested in the highly specialized code that Adblock Plus is using. And Stylish, GreaseMonkey etc. all have different ideas on how matching should work.
lyricconch
Posts: 14
Joined: Fri Apr 16, 2010 8:57 am

Re: Feature request: Ignoring query string

Post by lyricconch »

Wladimir Palant wrote: Some URLs to extend your testing:
ftp://example.com/
http://тест.рф/ (yes, internationalized domain names are fun)
And a filter:
||/example.com (shouldn't match http://example.com/ but with your regexp it will)
ABP focuses on filtering web page ads .... an ftp:// URL won't be an ad. When filtering ftp://..., one should not use ||. It is reasonable that the shorthand only covers the most common situation. I just use || for subdomains and http/https without a regexp, and I always write the domain after ||, so ||/example doesn't seem very useful.... :shock: As for those strange domains, they end with '/' as well, so this expression works for them too. [^\/] always fails at the root path, so the expression is limited to the scheme-auth-host part. (By the way, it doesn't seem a good idea to filter user-defined schemes such as flashget:// thunder:// emule:// irc:// ....... or even more.... I have an exec:// scheme for executing programs from the location bar....)
btw2: ||/example.com => https?:\/+[^\/]*[\/.]\b\/example\.com
To match this rule, at least three / must appear after http: - one from \/+, one from [\/.] and one from \/example. So it won't match "http://example.com"; the \b also makes the match fail.
Wladimir Palant wrote: Theoretically - yes. But what about the exclamation mark, for example? Theoretically nobody should be building URLs using ! as a separator - but if you look around you will find such URLs. So the separator placeholder accepts everything that isn't common in a URL.

Given that Adblock Plus almost never tests URLs against regular expressions (most of the time the URL is already dismissed by preceding checks) - this kind of micro-optimization isn't very useful.
Since "most of the time the URL is already dismissed by preceding checks", the final test (the regexp) should be much more exact. When deciding on a filter rule we usually consider its common usage, and a placeholder that stands for uncommon characters is not very useful ---- we can type the "uncommon char" directly, because few URLs can share a rule containing an uncommon character. /~jimmy/ can be a user directory on one server, and it doesn't seem a good idea to filter every /jimmy/.
A performance problem occurs in a special situation: when there are many ^, for example:
ad^p^gif => ad.*p.*gif
Now, with a URL of length 2000, it will execute about count("p")*2000/2 "simple match" steps.
With one more ^ that is count("p")*1000 extra steps, with two more ^ count("p")*2000 steps .....
Wladimir Palant wrote: As I said above, Adblock Plus usually doesn't run regular expressions so that optimizing them doesn't make much sense. The important scenario is testing of filter shortcuts that happens before any regular expressions are tested. Filter shortcuts are tested for all filters simultaneously - meaning that to improve performance you would need to ignore the query string on all filters. And I don't think you want that...
It is more a "feature" than an "optimization"; optimization is not the major goal.
For performance: the shortcut can be generated with filter.match(/[\w%*]{3,}/g), taking the longest match. This generates word tokens, so more filters can be optimized and fewer loop iterations are needed. (A URL of length 100 needs 93 iterations with an 8-character hash but only 25 with tokens of 4 characters on average. A shortcut entry can be one of two types, a single regexp or an array, as shown below; by setting regex.length = 0 it is very easy to find out which type it is.)

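// Shortcut table: token => single RegExp, or array of RegExps sharing that token.
// Created without a prototype so that tokens like "constructor" are not found by "in".
var shortcuts = Object.create(null);

// Match a URL against the shortcut table.
function matchesAny(url) {
  var tokens = url.match(/[\w%]{3,}/g) || [];
  for (var i = 0; i < tokens.length; i++) {
    if (!(tokens[i] in shortcuts))
      continue;
    var r = shortcuts[tokens[i]];
    if (r.length) {
      // Several filters share this token: r is an array of regexes.
      for (var j = 0; j < r.length; j++)
        if (r[j].test(url))
          return true;
    } else if (r.test(url)) {
      // Single filter: r is a RegExp tagged with length = 0.
      return true;
    }
  }
  return false;
}

// Register a filter's regex under its shortcut key.
function addShortcut(key, filterregex) {
  var regex = new RegExp(filterregex);
  regex.length = 0;  // marks a single RegExp as opposed to an array
  if (key in shortcuts) {
    var r = shortcuts[key];
    if (r.length)
      r.push(regex);
    else
      shortcuts[key] = [r, regex];
  } else {
    shortcuts[key] = regex;
  }
}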
Wladimir Palant wrote: How about you ask Stylish/GreaseMonkey guys whether they want to depend on Adblock Plus code? I bet that they don't :)
This kind of approach would be acceptable if the Matcher module were built into Firefox so you could rely on it being always there and offering a reasonably stable API. But Firefox isn't interested in the highly specialized code that Adblock Plus is using. And Stylish, GreaseMonkey etc. all have different ideas on how matching should work.
They won't..... but ... with this approach ABP can be extended with a lot of features without performance cost. The ABP matcher is not highly specialized... URL filtering is almost the same in every case. ABP now has the "best engine", so why not make it a "widely used engine"? Someone could write a "modify header" script, register it with the ABP matcher interface and get it working immediately, or a "dynamic proxy switcher" ... and more and more .....
Wladimir Palant

Re: Feature request: Ignoring query string

Post by Wladimir Palant »

lyricconch wrote: ABP focuses on filtering web page ads .... an ftp:// URL won't be an ad.
That's an unnecessary assumption, we had mms:// ads for example.
lyricconch wrote: we can type the "uncommon char" directly
But why should we? If the filter is "^referrer=" - why should it match "?referrer=Firefox" but not "!referrer=Firefox"? In its current form the separator placeholder seems to be more useful in the common scenarios (meaning: domain name ending and query parameters). Particularly for query parameters I've seen the most strange solutions so this regexp intentionally excludes only relatively few characters.
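
The "^referrer=" example can be sketched in plain JavaScript as follows. The permissive class below (anything except letters, digits and _ - . %) is only an approximation of the separator placeholder, not the exact expression Adblock Plus uses, and the URLs are made up.

```javascript
// Separator limited to the RFC 2396 reserved characters (the proposal).
const reservedSeparator = /[:@\/;?=&$+,]referrer=/;

// Permissive separator: any character that is not a letter, digit, "_", "-", "." or "%"
// (an approximation, see the note above).
const permissiveSeparator = /[^\w\-.%]referrer=/;

const questionMarkUrl = "http://example.com/?referrer=Firefox";
const exclamationUrl = "http://example.com/page!referrer=Firefox";

console.log(reservedSeparator.test(questionMarkUrl), permissiveSeparator.test(questionMarkUrl));
console.log(reservedSeparator.test(exclamationUrl), permissiveSeparator.test(exclamationUrl));
```

Only the permissive variant also catches the "!" form, while both still reject a parameter such as "myreferrer=".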
lyricconch wrote: For performance: the shortcut can be generated with filter.match(/[\w%*]{3,}/g), taking the longest match.
I'm afraid you totally misunderstood the idea, your code below will be very slow with the filter numbers Adblock Plus usually deals with... Maybe blog/investigating-filter-matching-algorithms will help (this doesn't quite describe the current algorithm but close enough).
lyricconch wrote: with this approach ABP can be extended with a lot of features without performance cost.
No performance cost but high maintenance cost. What's the point if nobody will use it? If somebody only needs the engine - here it is.
lyricconch
Posts: 14
Joined: Fri Apr 16, 2010 8:57 am

Re: Feature request: Ignoring query string

Post by lyricconch »

Wladimir Palant wrote: That's an unnecessary assumption, we had mms:// ads for example.
mms:// ads exist, but they are not frequent enough to deserve a place in such a commonly used syntax.
Wladimir Palant wrote: But why should we? If the filter is "^referrer=" - why should it match "?referrer=Firefox" but not "!referrer=Firefox"? In its current form the separator placeholder seems to be more useful in the common scenarios (meaning: domain name ending and query parameters). Particularly for query parameters I've seen the most strange solutions so this regexp intentionally excludes only relatively few characters.
Because they are defined in RFC 2396. Reserved characters are very commonly used, and all URIs should follow RFC 2396.
Wladimir Palant wrote: I'm afraid you totally misunderstood the idea, your code below will be very slow with the filter numbers Adblock Plus usually deals with... Maybe blog/investigat ... algorithms will help (this doesn't quite describe the current algorithm but close enough).
I know the algorithm, I have read it.
Here is an example to explain my code:

You have some filter rules:
||bar.com/adpic/
||bar.com/sexypic*
||foo.com/adpic/
You want to check this URL:
http://foo.bar.com/adpic/a.jpg

First, the program generates a shortcut for every rule and stores them in the variable 'shortcuts':
||bar.com/adpic/ => bar com adpic => adpic
||bar.com/sexypic* => bar com sexypic* => bar
||foo.com/adpic/ => foo com adpic => adpic
The variable 'shortcuts' looks like this:
{
bar: (RegExp'||bar.com/sexypic*')
adpic: [(RegExp'||bar.com/adpic/'), (RegExp'||foo.com/adpic/')]
}
This can be finished before any URL is checked.

Now you get the URL 'http://foo.bar.com/adpic/a.jpg'.
The program first breaks it into tokens:
url='http://foo.bar.com/adpic/a.jpg' => tokens=['http', 'foo', 'bar', 'com', 'adpic', 'jpg']
Now do an IN operation for each element of tokens:
'http' in shortcuts? => no, skip it
'foo' in shortcuts? => no, skip it
'bar' in shortcuts? => yes, get shortcuts['bar'], find that its length is 0, so it is a single regex; it does not match, skip it
...
'adpic' in shortcuts? => yes!
Now get the element: r = shortcuts['adpic']
Note that the type of r is not known in advance, it may be a RegExp or an Array.
By checking r.length we know that r is an array this time (every RegExp has its length set to 0),
so we run the regex match for each element in r (all of the elements are RegExps).
If any element matches, the URL is definitely covered by the filter list. Return yes this time.
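
The walkthrough above can be condensed into a runnable sketch in plain JavaScript. All function names are hypothetical, the filter-to-regex conversion only handles the || prefix and the * wildcard, and a Map of arrays is used instead of the length = 0 trick to keep the example short.

```javascript
// Hypothetical helper: convert a simplified filter ("||" prefix, "*" wildcard) to a RegExp.
function filterToRegExp(filter) {
  let prefix = "";
  let body = filter;
  if (body.startsWith("||")) {
    prefix = "^https?:\\/+[^\\/]*[\\/.]\\b"; // anchor proposed earlier in the thread
    body = body.slice(2);
  }
  const escaped = body
    .split("*")
    .map(part => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(prefix + escaped);
}

// Pick the longest token that does not contain a wildcard; null marks a "slow" filter.
function pickKeyword(filter) {
  let best = null;
  for (const token of filter.match(/[\w%*]{3,}/g) || []) {
    if (token.includes("*")) continue;
    if (best === null || token.length > best.length) best = token;
  }
  return best;
}

// Build the keyword => regex list table, plus a list of filters without a keyword.
function buildShortcutTable(filters) {
  const table = { shortcuts: new Map(), slow: [] };
  for (const filter of filters) {
    const regex = filterToRegExp(filter);
    const key = pickKeyword(filter);
    if (key === null) {
      table.slow.push(regex);
    } else {
      if (!table.shortcuts.has(key)) table.shortcuts.set(key, []);
      table.shortcuts.get(key).push(regex);
    }
  }
  return table;
}

// Tokenize the URL and only run the regexes registered under one of its tokens.
function urlMatches(url, table) {
  for (const token of url.match(/[\w%]{3,}/g) || []) {
    const candidates = table.shortcuts.get(token);
    if (candidates && candidates.some(re => re.test(url))) return true;
  }
  return table.slow.some(re => re.test(url));
}

const exampleTable = buildShortcutTable(["||bar.com/adpic/", "||bar.com/sexypic*", "||foo.com/adpic/"]);
console.log([...exampleTable.shortcuts.keys()]);
console.log(urlMatches("http://foo.bar.com/adpic/a.jpg", exampleTable));
```

Because lookups are by whole token, a filter keyword only helps when it appears in the URL as a complete word.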

In your algorithm, this takes len(url)-7 steps (an IN operation plus a mandatory substr each).
In this algorithm, it takes len(url)/avglen(shortcut key) steps (an IN operation plus, conditionally and only rarely, depending on the IN result, an array loop) and only one regex match.
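
As a rough illustration of these two counts, here is a sketch in plain JavaScript with hypothetical helper names. It only counts table lookups and says nothing about the actual running time of either implementation.

```javascript
// Fixed-width shortcuts: one lookup per 8-character substring of the URL.
function substringLookups(url, width = 8) {
  return Math.max(url.length - width + 1, 0);
}

// Word tokens: one lookup per token of at least 3 word characters.
function tokenLookups(url) {
  return (url.match(/[\w%]{3,}/g) || []).length;
}

const exampleUrl = "http://foo.bar.com/adpic/a.jpg";
console.log(substringLookups(exampleUrl), tokenLookups(exampleUrl));
```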

I am using this algorithm in an automatic proxy switching script written in Python.
With 15000 rules, a URL of length 2000 takes only 1ms to process (XP SP3, Python 2.6, i750 CPU).
With [\w%*]{3,} I just changed the list to Adblock's list and ignored all EHH rules; the log is below.

This algorithm also makes it possible to grab tokens from regexes!
That means regexes can be optimized too.
I have done that in my script, so there are few slow rules ..........

E:\Projects\proxy-workingset\local>cmd /k py26 proxy.py
getting autoproxy data from sources:
  http://malwaredomains.lanik.us/malwaredomains_full.txt...  244649bytes, in 1.66s. 
converting autoproxy data to hashed compiledRegex ...
  rule_optimize: [\w*%]{3,}     total:  13162   direct:      0  proxy:  13162
  direct_hash(very fast):      0 / 0.00%
  direct_shash(fast)    :      0 / 0.00%        keys:      0 / 0.00%
  direct_share(slow)    :      0*/ 0.00%
  proxy_hash(vary fast) :  12325 / 93.64%
  proxy_shash(fast)     :    826 / 6.28%        keys:    588 / 71.19%
  proxy_share(slow)     :     11*/ 0.08%
  * : key of speed, lower is bette
---------------------------------------------------
ListenAddress: 127.0.0.1:8086
HTTPS Enabled: YES
OpenSSLModule: YES
Proxies Count: GAE(4) PHP(1) DIRECT(1)
AutoProxyList: 1
---------------------------------------------------
getting autoproxy data from sources:
  https://easylist-downloads.adblockplus.org/easylist.txt...  235531bytes. in 2.219s 
converting autoproxy data to hashed compiledRegex ...
  rule_optimize: [\w*%]{3,}     total:   5780   direct:    359  proxy:   5421
  direct_hash(very fast):    352 / 98.05%
  direct_shash(fast)    :      7 / 1.95%        keys:      6 / 85.59%
  direct_share(slow)    :      0*/ 0.00%
  proxy_hash(vary fast) :   4672 / 86.18%
  proxy_shash(fast)     :    738 / 13.61%       keys:    407 / 55.15%
  proxy_share(slow)     :     11*/ 0.20%
  * : key of speed, lower is bette
Config reloaded
hash means an ID that maps to a single CompiledRegex.
shash means a hybrid (hash-share): an ID that maps to a list of CompiledRegex objects (an array in JavaScript).
share means the rule has no ID; these rules are executed every time a URL is passed in.
Wladimir Palant wrote: No performance cost but high maintenance cost. What's the point if nobody will use it? If somebody only needs the engine - here it is.
I know little about Firefox XPCOM interfaces; I only know how to use them, not how to write them ...
And I am very poor at UI design (my XUL knowledge is about as good as my HTML)..... so I would like someone else to take on this heavy work, LoL...
Wladimir Palant

Re: Feature request: Ignoring query string

Post by Wladimir Palant »

I now understand what algorithm you are proposing.
lyricconch wrote: With 15000 rules, a URL of length 2000 takes only 1ms to process (XP SP3, Python 2.6, i750 CPU).
Not very useful for comparison :)
The matching performance test uses 350 URLs with a total length of 38000. With my SU9400 CPU it takes 52ms to run (number of filters doesn't matter) - that's 2.7ms for 2000 characters. It has to match twice - once for the blacklist and once for the whitelist, so processing the filters once takes 1.4ms. Now add to this that your CPU is a lot faster than mine and that Python might be faster than JavaScript (not so sure about that any more) - the results are too close to know which algorithm is faster.

However, looking at EasyList: what will be the shortcuts for the filters "&ad_type_" or "/img/topad_"? How will the matching work for URLs "http://foo.bar.com/?a=b&ad_type_skyscraper=yes" and "http://foo.bar.com/img/topad_skyskraper.gif"?
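
The problem behind this question can be reproduced with a short sketch in plain JavaScript. The helper names are hypothetical, and the tokenizer is the [\w%*]{3,} variant proposed earlier in the thread.

```javascript
// Keyword of a filter: longest [\w%*]{3,} token that contains no wildcard.
function wordKeyword(filter) {
  let best = null;
  for (const token of filter.match(/[\w%*]{3,}/g) || []) {
    if (token.includes("*")) continue;
    if (best === null || token.length > best.length) best = token;
  }
  return best;
}

// Tokens of a URL under the same character class (without "*").
function wordUrlTokens(url) {
  return url.match(/[\w%]{3,}/g) || [];
}

const typeUrl = "http://foo.bar.com/?a=b&ad_type_skyscraper=yes";
const topadUrl = "http://foo.bar.com/img/topad_skyskraper.gif";

console.log(wordKeyword("&ad_type_"), wordUrlTokens(typeUrl));
console.log(wordKeyword("/img/topad_"), wordUrlTokens(topadUrl));
```

In both cases the filter keyword ends in "_", while the URL token runs on past the underscore, so a lookup by whole token never finds the filter.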
lyricconch
Posts: 14
Joined: Fri Apr 16, 2010 8:57 am

Re: Feature request: Ignoring query string

Post by lyricconch »

I am matching twice as well .... ~ ~ and running with 11 slow rules = =! LoL ~ ~

Wladimir Palant wrote: However, looking at EasyList: what will be the shortcuts for the filters "&ad_type_" or "/img/topad_"? How will the matching work for URLs "http://foo.bar.com/?a=b&ad_type_skyscraper=yes" and "http://foo.bar.com/img/topad_skyskraper.gif"?

They should be "type" and "topad" after changing \w to a-zA-Z\d,
but I kept \w because it looks "beautiful" .... maybe that is the reason for the 11 slow rules in EasyList?! I will try again.

[\w%*]{3,} picks only _A-Za-z0-9%* to generate an ID, and a token that contains "*" is dismissed.
A short rule should still have a good enough "filtering rate" (a URL containing the word "adtop" will be checked quickly).
I have defined some "never an ID" words to avoid generating an ID with a large array.
Generally they are short enough to be considered last anyway, but I simply made it impossible for them to become an ID,
for example: http https com net org htm html .... Or just raise the minimum length from 3 to 4.
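
A sketch of this adjusted tokenizer in plain JavaScript. The helper names are hypothetical, and the stop-word list contains only the words named above.

```javascript
// Words that may never become an ID, as listed above.
const NEVER_AN_ID = new Set(["http", "https", "com", "net", "org", "htm", "html"]);

// Keyword of a filter: longest alphanumeric token, skipping wildcards and stop words.
function alnumKeyword(filter) {
  let best = null;
  for (const token of filter.match(/[a-zA-Z\d%*]{3,}/g) || []) {
    if (token.includes("*") || NEVER_AN_ID.has(token.toLowerCase())) continue;
    if (best === null || token.length > best.length) best = token;
  }
  return best;
}

// URL tokens under the same character class; "_" now acts as a word break.
function alnumUrlTokens(url) {
  return url.match(/[a-zA-Z\d%]{3,}/g) || [];
}

console.log(alnumKeyword("&ad_type_"), alnumKeyword("/img/topad_"));
console.log(alnumUrlTokens("http://foo.bar.com/?a=b&ad_type_skyscraper=yes"));
console.log(alnumUrlTokens("http://foo.bar.com/img/topad_skyskraper.gif"));
```

With the underscore excluded, the keywords "type" and "topad" appear as whole tokens in the two URLs, so the lookup succeeds.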
Last edited by lyricconch on Thu Sep 30, 2010 1:16 pm, edited 1 time in total.
lyricconch
Posts: 14
Joined: Fri Apr 16, 2010 8:57 am

Re: Feature request: Ignoring query string

Post by lyricconch »

These are the slow rules in EasyList with my algorithm:

rule_hash      rule_regex                             rule
/ad-1.5.       \/ad\-1\.5\.                           /ad-1.5.
/ad.pl?z       \/ad\.pl\?z                            /ad.pl?z
/ad/?id=       \/ad\/\?id\=                           /ad/?id=
/ad/im.js      \/ad\/im\.js                           /ad/im.js
/im-ad/im-     \/im\-ad\/im\-                         /im-ad/im-
/js.ng/c       \/js\.ng\/c                            /js.ng/c
/js.ng/s       \/js\.ng\/s                            /js.ng/s
e/js.ng/       e\/js\.ng\/                            e/js.ng/
m/js.ng/       m\/js\.ng\/                            m/js.ng/
||x7.to/ad/    ^https?:\/+[^\/]*[\/.]\bx7\.to\/ad\/   ||x7.to/ad/
v3.co.uk##.ad  v3\.co\.uk\#\#\.ad                     v3.co.uk##.ad
Without a * placed around an expression, it is considered a "full word"
in my function:
abcdef is an ID but *abcdef* is not an ID.
So I just treat ad_type_ as a whole word; using this (\w) with EasyList would be a bug.
(My script uses gfw-list. In China we cannot access all Internet sites; for example, we must use a proxy to log in to Twitter or to access Google over SSL.... The script was written for this purpose.)

Most of the time people write whole words; when someone writes only part of a word, he should have to tell the program so (this is not very common).

My algorithm simply never optimizes these short-ID rules (I want to keep the average ID length up).
It seems some tokens could be an ID even though they are only 2 characters long,
such as "ng" "im" "ad" "x7"?
They won't be common in other URLs (my IDs (tokens) are always full words).
Wladimir Palant

Re: Feature request: Ignoring query string

Post by Wladimir Palant »

Ok, this algorithm definitely has some potential. Here is the patch I used: http://pastebin.mozilla.org/801439, seems to work correctly but needs a lot more testing.

And here are the results of performance testing:

                       |           | Firefox 4          | Firefox 3.6.10
-----------------------+-----------+--------------------+-------------------
Matcher initialization | ABP 1.3a  | 26.3 ms +/- 1.2 ms | 30.1 ms +/- 0.9 ms
                       | patched   | 29.3 ms +/- 0.6 ms | 36.5 ms +/- 1.2 ms
-----------------------+-----------+--------------------+-------------------
URL matching           | ABP 1.3a  | 47.0 ms +/- 0.7 ms | 59.9 ms +/- 1.9 ms
                       | patched   | 41.7 ms +/- 0.9 ms | 74.5 ms +/- 2.2 ms
-----------------------+-----------+--------------------+-------------------
URL matching without   | ABP 1.3a  | 47.1 ms +/- 1.4 ms | 59.5 ms +/- 1.0 ms
slow filters           | patched   | 23.7 ms +/- 0.7 ms | 43.4 ms +/- 1.5 ms
For the last test I removed the filters that would be slow with the new algorithm, for a fairer comparison. In other words, your algorithm seems to do very well in Firefox 4, especially once the existing filter subscriptions are updated. In Firefox 3.6 the results aren't as obvious: with the current filter subscriptions this algorithm will be slower. Filter subscriptions will need to be updated to get an advantage here.

Matcher initialization is slightly slower with this algorithm but that can probably be optimized.

For reference, these filters are displayed as slow in the current version of EasyList:

-120x600
-160x600
-300x250
-336x280
-ad.html
.120x600
.160x600
.300x250
.336x280
/120x600
/160x600
/336x280
/468by60
/468x60r
/_ui/ad/*
/ad-1.5.
/ad.aspx
/ad.html
/ad.pl?z
/ad/?id=
/ad/adtech
/ad/banner
/ad/code
/ad/common
/ad/frame
/ad/google
/ad/header
/ad/im.js
/ad/inad
/ad/init
/ad/lrec
/ad/mercury
/ad/mrec
/ad/rectangle
/ad/serve
/ad/skyscraper
/ad/sponsor
/ad/text
/ad/view
/ad/wr.php
/ad?pageid
/ad_250x
/ad_300x
/ad_js.php
/adsadclient
/dj_ad.js
/hm_ad.php
/im-ad/im-
/js-ad.php
/js.ng/c
/js.ng/s
/js/ad_eo_
/sd/ad.php
/vr-ad.png
/ww_ad.png
120x600.
120x600b
160.html
160x600-
160x600.
300x250.
468x15.h
468x60.g
468x60.h
468x60.j
468x60.s
468x60a.
468x60ba
728.html
728banne
728x90.g
728x90.h
728x90.j
728x90.p
728x90.s
728x90px
_120x600
_160x290
_160x600
_300x250
_336x280
_580x100
_728by90
_ad.aspx
_ad.html
e/js.ng/
e/js_ng/
e120x600
m/js.ng/
n728x90_
r/ad.htm
||x7.to/ad/
lyricconch
Posts: 14
Joined: Fri Apr 16, 2010 8:57 am

Re: Feature request: Ignoring query string

Post by lyricconch »

The optimization parameters are adjustable ....

For example:
use [A-Za-z0-9%*]{4,} to generate the token ID;
if that fails, use \b(?=.{,2}\d)\w\w\w\b for another scan (word length 3, with at least one digit);
if that fails again, use \b(?=.?\d)\w\w\b for a deeper scan (word length 2, with at least one digit);
if that fails again, use \b(?!.{0,2}_)\w\w?\w?\b for a final check (word of 1-3 characters, no _);
if that fails again, @#$%#$%#$^$^% there is nothing left to say....
Use [A-Za-z0-9%*]+ to break the URL into tokens.
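
The cascade above could look like this in plain JavaScript. The function name is hypothetical. The {,2} quantifier is written as {0,2} because JavaScript does not accept the open lower bound, and picking the longest candidate within a stage is an assumption of this sketch.

```javascript
// The four scans described above, tried in order.
const ID_STAGES = [
  /[A-Za-z0-9%*]{4,}/g,        // tokens of 4 or more characters
  /\b(?=.{0,2}\d)\w\w\w\b/g,   // 3-character words with at least one digit
  /\b(?=.?\d)\w\w\b/g,         // 2-character words with at least one digit
  /\b(?!.{0,2}_)\w\w?\w?\b/g,  // words of 1-3 characters without "_"
];

// Return the token ID for a filter, or null if no stage yields one.
function pickTokenId(filter) {
  for (const stage of ID_STAGES) {
    // Tokens touching a wildcard are dismissed, as in the earlier posts.
    const candidates = (filter.match(stage) || []).filter(t => !t.includes("*"));
    if (candidates.length > 0) {
      return candidates.reduce((a, b) => (b.length > a.length ? b : a));
    }
  }
  return null;
}

console.log(pickTokenId("/ad/banner"));
console.log(pickTokenId("||x7.to/ad/"));
console.log(pickTokenId("/ad-1.5."));
```

Filters that are slow under the plain {3,} rule, such as ||x7.to/ad/, get a short ID this way instead of none.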

It is very important to mark "bad IDs" such as "http" "https" "www" "com".
These IDs make max(r.length) bigger, and a lot of the time you will run a useless match. But a "bad ID" is better than "no ID", because a filter without an ID is executed for ALL URLs while one with a "bad ID" is only executed for MOST URLs.
unique ID > long ID > short ID > bad ID > no ID ...

If the "partial word requires *" option is turned on, almost all rules can be optimized without any change.
(This is the most common situation; people seldom enter a partial word and leave it at that.)

In my opinion, [\w%*]{3,} is the best parameter (fast, widely applicable, simple and beautiful).
Wladimir Palant

Re: Feature request: Ignoring query string

Post by Wladimir Palant »

lyricconch wrote: If the "partial word requires *" option is turned on, almost all rules can be optimized without any change.
(This is the most common situation; people seldom enter a partial word and leave it at that.)
This is not really an option for Adblock Plus - there are tons of existing filter lists, it would break compatibility badly. Subscription authors can add the separator placeholder and the like where it makes sense.

The other parameters can be tweaked of course, I just don't have time to test them today.
Wladimir Palant

Re: Feature request: Ignoring query string

Post by Wladimir Palant »

Here is a somewhat tweaked patch: http://pastebin.mozilla.org/802751
Performance of matcher initialization has been improved, additionally common keywords like "www" are now less likely to be selected. This should be as much optimization as we can get - requiring keywords to have at least 4 characters would speed up matching a little bit again but would be problematic for filter subscriptions IMO.

Performance testing results again:

                       |           | Firefox 4          | Firefox 3.6.10
-----------------------+-----------+--------------------+-------------------
Matcher initialization | ABP 1.3a  | 24.7 ms +/- 0.7 ms | 32.3 ms +/- 0.6 ms
                       | patched   | 28.2 ms +/- 1.0 ms | 35.8 ms +/- 1.4 ms
-----------------------+-----------+--------------------+-------------------
URL matching without   | ABP 1.3a  | 49.5 ms +/- 1.0 ms | 61.0 ms +/- 1.0 ms
slow filters           | patched   | 23.8 ms +/- 0.5 ms | 40.9 ms +/- 0.8 ms
The performance gains are looking good, that's definitely something we want to have. However, until the filter subscriptions have been adapted the performance will be worse than what we have now - so I am scheduling this for Adblock Plus 1.4 to give subscription authors enough time. We need this time to do thorough testing anyway.
lyricconch
Posts: 14
Joined: Fri Apr 16, 2010 8:57 am

Re: Feature request: Ignoring query string

Post by lyricconch »

1. Don't use "let" in a loop, it generates unnecessary instructions.
2. Using .length to support shared IDs (an ID => array of regexes) is OK. The arrays will not be large, and the length operation is well optimized.

D:\Projects\mozilla-central\ff\dist\bin>js -mj
js> dis(function fvar(){var j;for(var i=9;i;i--){var k=0;j+=k;}})
flags: LAMBDA NULL_CLOSURE
main:
00000: getlocal 0
00003: pop
00004: int8 9
00006: setlocal 1
00009: pop
00010: goto 34 (24)
00013: trace
00014: zero
00015: setlocal 2
00018: pop
00019: getlocal 0
00022: getlocal 2
00025: add
00026: setlocal 0
00029: pop
00030: localdec 1
00033: pop
00034: getlocal 1
00037: ifne 13 (-24)
00040: stop

Source notes:
0: 0 [ 0] decl offset 0
2: 6 [ 6] decl offset 0
4: 9 [ 3] for cond 24 update 20 tail 27
8: 15 [ 6] decl offset 0
10: 25 [ 10] xdelta
11: 25 [ 0] assignop
js> dis(function flet(){var j;for(let i=9;i;i--){let k=0;j+=k;}})
flags: LAMBDA NULL_CLOSURE
main:
00000: getlocal 0
00003: pop
00004: enterblock depth 0 {i: 0}
00007: int8 9
00009: setlocal 1
00012: pop
00013: goto 43 (30)
00016: trace
00017: enterblock depth 1 {k: 0}
00020: zero
00021: setlocal 2
00024: pop
00025: getlocal 0
00028: getlocal 2
00031: add
00032: setlocal 0
00035: pop
00036: leaveblock 1
00039: localdec 1
00042: pop
00043: getlocal 1
00046: ifne 16 (-30)
00049: leaveblock 1
00052: stop

Source notes:
0: 0 [ 0] decl offset 0
2: 9 [ 9] xdelta
3: 9 [ 0] decl offset 2
5: 12 [ 3] for cond 30 update 26 tail 33
9: 21 [ 9] xdelta
10: 21 [ 0] decl offset 2
12: 31 [ 10] xdelta
13: 31 [ 0] assignop
js> dis(function(x)x.length)
flags: LAMBDA EXPR_CLOSURE NULL_CLOSURE
main:
00000: getarg 0
00003: length
00004: return
00005: stop

Source notes:
0: 3 [ 3] pcbase offset 3
js> dis(function(x)"length" in x)
flags: LAMBDA EXPR_CLOSURE NULL_CLOSURE
main:
00000: string "length"
00003: getarg 0
00006: in
00007: return
00008: stop

Source notes:
js> dis(function(x)x.other)
flags: LAMBDA EXPR_CLOSURE NULL_CLOSURE
main:
00000: getargprop 0 "other"
00005: return
00006: stop