Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple issues in components/prism-nginx.js #1718

Closed
ghost opened this issue Feb 6, 2019 · 23 comments · Fixed by #2793
Closed

Multiple issues in components/prism-nginx.js #1718

ghost opened this issue Feb 6, 2019 · 23 comments · Fixed by #2793

Comments

@ghost
Copy link

ghost commented Feb 6, 2019

  1. nginx shouldn’t inherit anything from clike. nginx doesn’t have classes, booleans, operators, etc.;
  2. Directive shouldn’t be classified as keyword, because some parameters are keywords too.

Comments

  1. The current regex fails in the following example.
http {# This must be recognized as a comment
}
  1. The current regex doesn’t consider single­‑quoted strings.
location ~ '/example'# This shouldn’t be recognized as a comment
{}

Directives

Problem statement

Directives list is obsolete and inconsistent. AFAIK, the current list is based largely on a third­‑party gist.

Currently Prism supports 339 directives. nginx contains more than 693 directives. Beside that, some directives shown in components/prism-nginx.js are undocumented, removed years ago or not directives at all. See the table below.

Directive Status in nginx
CONTENT_, DOCUMENT_, etc. Unknown
auth Renamed to pop3_auth in 2007 ⁠¹
devpoll_changes An undocumented directive
devpoll_events An undocumented directive
epoll_events An undocumented directive
fastcgi_redirect_errors Renamed to fastcgi_intercept_errors in 2006 ⁠¹
if_not_empty A parameter for fastcgi_param and scgi_param
kqueue_changes An undocumented directive
kqueue_events An undocumented directive
log_format_combined Must be log_format combined. combined is a parameter
more_set_headers A part of ngx_headers_more. It is not distributed with nginx ⁠²
optimize_server_names Replaced by server_name_in_redirect in 2008 ⁠¹, removed in 2015 ⁠³
post_action A dangerous undocumented directive
proxy A parameter ⁠¹
proxy_redirect_errors Renamed to proxy_intercept_errors in 2006 ⁠¹
proxy_upstream_fail_timeout Not supported since 2006 ⁠¹
proxy_upstream_max_fails Not supported since 2006 ⁠¹
rtsig_overflow_events Removed in 2015 ⁠¹
rtsig_overflow_test Removed in 2015 ⁠¹
rtsig_overflow_threshold Removed in 2015 ⁠¹
rtsig_signo Removed in 2015 ⁠¹
satisfy_any Replaced by satisfy in 2008 ⁠¹, removed in 2015 ⁠³
so_keepalive A parameter for listen
worker_rlimit_sigpending Removed in 2015 ⁠⁴
xslt_entities Misspelled xml_entities?

¹ ⁠https://nginx.org/en/CHANGES.
² ⁠https://github.com/openresty/headers-more-nginx-module.
³ ⁠https://hg.nginx.org/nginx/rev/2911b7e5491b.
⁴ ⁠https://hg.nginx.org/nginx/rev/967594ba7571.

There are many more removed directives. Here they are:

  • obsolete aio directives and rtsig_signo (see http://hg.nginx.org/nginx/rev/adba26ff70b5) were removed in 2015;
  • limit_zone is removed in 2014;
  • secure_link_expires is removed in 2010;
  • open_file_cache_retest is removed in 2007;
  • fastcgi_upstream_fail_timeout, fastcgi_upstream_max_fails, fastcgi_x_powered_by, memcached_upstream_fail_timeout, memcached_upstream_max_fails, proxy_pass_server, proxy_pass_x_powered_by, restrict_host_names were removed in 2006;
  • fastcgi_root, fastcgi_set_var, fastcgi_params, default_charset, post_accept_timeout, proxy_add_x_forwarded_for, proxy_pass_unparsed_uri, proxy_preserve_host, proxy_set_x_real_ip, proxy_set_x_url, proxy_x_var, redirect, server_names_hash_threshold, server_names_hash were removed in 2005.

There are many more undocumented directives. These are degradation, degrade, eventport_events, gzip_hash, gzip_no_buffer, gzip_window, http2_pool_size, http2_streams_index_size, postpone_gzipping, ssi_ignore_recycled_buffers, uwsgi_string (see https://forum.nginx.org/read.php?29,277920,277920).

A proposed solution

On the one hand, there is a relatively fast way to update the list:

  1. save directives list from components/prism-nginx.js as old.txt;
  2. replace all occurrences of | in old.txt with \n;
  3. sort lines in old.txt alphabetically;
  4. save directives list from https://nginx.org/en/docs/dirindex.html as new.txt;
  5. remove names of modules from new.txt: \s+\(\w+\);
  6. replace (^[\w]+)(?:\n\1)+$ in new.txt with \1 to remove duplicates;
  7. find entries to remove: diff old.txt new.txt | grep '^<' | sed 's/^< *//';
  8. find entries to add: diff old.txt new.txt | grep '^>' | sed 's/^> *//'.

On the other hand, the resulting list is gonna be really long (I get 631 entries without third­‑party modules).

Maybe it’s better to replace the list with a single regular expression?

Numbers

Is it obligatory to highlight numbers? What about domain2.com, 128k, 192.168.0.1:8000?

Strings

  1. Add EOLs. No escaping required. Moreover, escaping is impossible.
  2. Add escape sequences. Only the following escape sequences are available: \", \', \\, \n, \r, \t. I’ve checked all the other symbols from 0x00 to 0x7E inside both " and '. \xFF, \uFFFF, \u{FFFF} are not supported.
  3. Add interpolation ("$example" and "${example}"). As I previously mentioned, the dollar sign ($) can’t be escaped with \, so nginx interprets \$x as a slash (\) and a subsequent $x variable.

Variables

  1. Variable names can contain digits. Examples are $geoip_city_country_code3, $geoip_country_code3, $http2, $time_iso8601. See https://nginx.org/en/docs/varindex.html.
  2. Names of user­‑defined variables (set $name value;) can only consist of digits, uppercase and lowercase English letters and underscores. Position is irrelevant ($0, $_ are OK). /\$\w+/ is the best choice for a regular variable.
  3. Built­‑in variables with an underscore at the end ($arg_, $cookie_, $http_, $jwt_claim_, $jwt_header_, $sent_http_, $sent_trailer_, $upstream_cookie_, $upstream_http_, $upstream_trailer_) accept a trailing part. This part can contain any character except /[\x00-\x1F\s"';\\{]/. /[;{]/ are only allowed inside strings ("$arg_;", "$arg_{" are OK). Escaping with a slash (\) is useless.
@RunDevelopment
Copy link
Member

Would you like to add a PR implementing your proposed solutions?

nginx doesn’t have [...], booleans, [...]

Don't on and off pretty much behave like booleans?

Maybe it’s better to replace the list with a single regular expression?

That would be ideal but it's usually quite hard to find this expression.

Is it obligatory to highlight numbers?

No, it isn't. If they get in your way or produce too many errors feel free to remove or change them.
Personally, I would like to see them stay in a stricter form and with units.
/(^|\s)(?:\d*\.)?\d+[a-z]*(?=$|[;\s])/ (with lookbehind: true) should work.

@ghost
Copy link
Author

ghost commented Feb 6, 2019

Don't on and off pretty much behave like booleans?

They do. But clike doesn’t implement support for on and off. Anyway, these are only tokens for specific directives.

That would be ideal but it's usually quite hard to find this expression.

nginx syntax is pretty strict. A directive can be found by its position. A directive is placed at the beginning of a text or preceded by ;, {, }.

name_1 parameters_1;
name_2 {
    name_2_1 parameters2_1;
    # A comment...
    name_2_2 parameters_2_2;
}
name_3 parameters_3;

Names of directives can be described with \w+.

'DIRECTIVE': {
	pattern: /((?:^|[;{}])\s*)\w+[^;{]*?(?=\s*[;{]|$)/, // Directive’s name and parameters
	lookbehind: true,
	inside: {
		'NAME': /^\w+(?=[\s;{]|$)/ // Name of a directive
	}
}

We have to find name and parameters first, because ^\s*\w+ can also catch a parameter.

Would you like to add a PR implementing your proposed solutions?

Sure! Most ideas are already implemented. But I have doubts about directives. What do you think about searching for directives by their position? Doesn’t it sound too naive?

/(^|\s)(?:\d*\.)?\d+[a-z]*(?=$|[;\s])/

I’ll try to insert your regex into my code.

@RunDevelopment
Copy link
Member

Sure! Most ideas are already implemented.

👍

But clike doesn’t implement support for on and off. Anyway, these are only tokens for specific directives.

This is a suggestion from my side because highlight.js also highlights on and off in the doc.

But I have doubts about directives. What do you think about searching for directives by their position?

I think it's a good approach but your current pattern does not account for strings:

name ';oh no';

It's good to know that the strict comment rules work in our favour here.

@ghost
Copy link
Author

ghost commented Feb 6, 2019

This is a suggestion from my side because highlight.js also highlights on and off in the doc.

Personally, I can’t see the difference between on/off and other tokens. For example, ssl_verify_client supports four mutually exclusive properties: on, off, optional, optional_no_ca. expires has epoch, max, off, but not on. That’s why I find the highlighting of on and off confusing.

Anyway, I’ll try to add it for compatibility.

I think it's a good approach but your current pattern does not account for strings:

name ';oh no';

I forgot to mention I also have a greedy string pattern. It catches ';oh no'.

@ghost
Copy link
Author

ghost commented Feb 6, 2019

I’ll make a PR on the weekend.

@RunDevelopment
Copy link
Member

I forgot to mention I also have a greedy string pattern. It catches ';oh no'.

But then this causes problems, doesn't it?

name ';oh no' parameter;

@ghost
Copy link
Author

ghost commented Feb 7, 2019

It does. Thank you.

Maybe, this works better.

'comment': {
	pattern: /(^|[\s;{}])#.*/,
	lookbehind: true
},
'function': {
	pattern: /((?:^|[;{}])\s*)\w+(?:(["'])(?:\\\2|(?!\2)[\S\s])*\2|\\["';{]|[^"';{])+/,
	lookbehind: true,
	inside: {
		'attr-name': /^\w+(?=[\s;{]|$)/
	}
}

Here are some test cases I used:

name parameter1;
name "parameter1";
name parameter1 parameter2;
name parameter1 "parameter2";
name "parameter1" parameter2;
name para\;meter1;
name "para;meter1";
name "para\;meter1";
internal;
internal    ;

# A multiline parameter
name "para

meter1";

name {
    name parameter1  'parameter2' \; par#ameter3;
    name parameter1 \" 'hello' par#ameter2;
    name parameter1; name parameter1;
    name parameter1 \{ 'hello';
    name {
        internal;
        name parameter1 parameter2;
        name para;meter1; # An error!
        name para{meter1; # An error!
        name parameter1 {; # An error!
    }
}

@RunDevelopment
Copy link
Member

This has problems with this:

name " #foo"; #bar

Also, I don't think that we should capture trailing spaces like in:

internal      ;

This should fix all of that:

Prism.languages.nginx = {
    'comment': {
        pattern: /((?:^|[{};])\s*)#.*/m,
        lookbehind: true
    },
    'directive': {
        pattern: /\b\w+(?:[^;{}"'\\]|\\.|("|')(?:(?!\1)[^\\]|\\.)*\1)*?(?=\s*[;{])/,
        greedy: true
    },
    'punctuation': /[{};]/
};

The lookbehind of the comment pattern is a little more complex so we can avoid greedy rematching which causes #1492. I also simplified the escape rules but that shouldn't cause problems.
I hope that works!

Also, we don't have to concern ourselves about how to parse syntactically incorrect code.

@ghost
Copy link
Author

ghost commented Feb 7, 2019

'comment': {
	pattern: /((?:^|[{};])\s*)#.*/m,
	lookbehind: true
},

Hm, this look similar to the lookbehind I used before. Could you explain, what exactly helps to avoid the bug you mentioned, because the example has problems with

location = "/example" # This is a comment
{}

# can be immediately preceded by [\s;{}], so \s* doesn’t work well.

@RunDevelopment
Copy link
Member

The reason I made these changes to the lookbehind is to avoid greedy rematching:

Because comment is matched before directive, it is possible for some comments to be wrong before directive is matched. E.g.

name "#foo"; name; #bar

The greedy matching of directive will then 'correct' these wrong matches of comment applying the comment pattern on the text after the current match of directive. We want to avoid this rematching of the comment pattern for two reasons:

  1. Performance. We match the comment pattern more often than we have to.
  2. The bug I mentioned. The bug boils down to the rematching being aborted too early. So there might still be text which has to be matched but it isn't.
    You don't have to worry about this too much though because it's really hard to provoke.

This is why I want to avoid rematching.
So, how does the modified lookahead achieve that?

It avoids matching non-comments like this:

location = "/example" # This is not a comment
{}

# can be immediately preceded by [\s;{}], so \s* doesn’t work well.

\s* alone doesn't, but because the ^ of (?:^|[{};]) is combined with the m flags, ^ matches the start of the text or the start of a line. And if it's the start of a line, the line is preceded by either \r or \n which are part of \s.
So (?:^|[{};])\s* also guaranttes that the comment is preceded by [\s;{}]. It's just a little stricter.

@ghost
Copy link
Author

ghost commented Feb 7, 2019

Thank you!

they only accept # at the beginning of a line for a comment

This answer from Stack Exchange is not complete.

location = "/example" # This IS a comment. There is a space
{# This is a comment after "{". No spaces required
    return 200 "Hello, world!";# This is a comment after ";". No spaces required
}# This is a comment after "}". No spaces required

location = /example # This IS a comment. There is a space
{}

location = "/example"# This is NOT a comment. There is no space
{}

location = /example# This is NOT a comment. There is no space
{}

Proof.

So (?:^|[{};])\s* also guaranttes that the comment is preceded by [\s;{}].

It doesn’t guarantte that the comment is preceded by \s.

@RunDevelopment
Copy link
Member

Proof.

Sorry, but I don't see the proof... I assume nginx.conf. Which line?

It doesn’t guarantee that the comment is preceded by \s.

Well, true. It doesn't in the case of the start of the file, but that should be it.


Also, is there a spec page of the NGINX configuration syntax?

@ghost
Copy link
Author

ghost commented Feb 7, 2019

Sorry, but I don't see the proof

Damn. There are examples of ; # and { #, but there are no examples of anything #. I’m sorry. See the explanation below.

Also, is there a spec page of the NGINX configuration syntax?

I’m not sure. I haven’t found it yet. I use nginx -t -c /path/to/nginx.conf to reveal the rules.

events # A comment
{
    worker_connections 512;
}

# A comment

http# This is not a comment. There is no space
{
    server {# A comment
        listen 80;# A comment

        location = "/example1" # This is a comment. There is a space
        {# This is a comment after "{". No spaces required
            return 200 "Hello, world!";# This is a comment after ";". No spaces required
        }# This is a comment after "}". No spaces required

        location = /example2 # This is a comment. There is a space
        {}

        location = "/example3"# This is not a comment. There is no space
        {}

        location = /example4# This is not a comment. There is no space
        {}
    }# A comment
}

nginx outputs the following:

nginx: [emerg] unknown directive "http#" in /path/to/nginx.conf:9
nginx: [emerg] unexpected "#" in /path/to/nginx.conf:21
nginx: [emerg] invalid number of arguments in "location" directive in /path/to/nginx.conf:25

Thus, the # This is not a comment part in your example is actually a comment.

location = "/example" # This is not a comment
{}

P.S. I hope I don’t bother you too much :)

@RunDevelopment
Copy link
Member

These examples I can work with!

But, it's the "/example3"# from location = "/example3"# This is not a comment. There is no space syntactically invalid?

@RunDevelopment
Copy link
Member

Just to confirm this:
It is invalid for a string to be followed by [^\s;]. So after a string, there has to be a white space or ;.

@RunDevelopment
Copy link
Member

Alright, assuming strings can indeed only be followed by a white space or ;, this should work.

Prism.languages.nginx = {
    'comment': {
        pattern: /(^|[\s{};])#.*/,
        lookbehind: true
    },
    'directive': {
        pattern: /\b\w+(?:[^;{}"'\\]|\\.|("|')(?:(?!\1)[^\\]|\\.)*\1)*?(?=\s*[;{])/,
        greedy: true,
        inside: {
            'string': {
                // I assumed that " and ' can be escaped using \" and \' and that \ can be escaped using \\
                pattern: /((?:^|[^\\])(?:\\\\)*)("|')(?:(?!\2)[^\\]|\\.)*\2/,
                lookbehind: true
            },
            'comment': {
                pattern: /(\s)#.*/,
                lookbehind: true,
                greedy: true
            },
            'keyword': /^\S+/

            // other patterns
        }
    },
    'punctuation': /[{};]/
};

I tested everything with the examples here and this:

events # A comment
{
    worker_connections 512;
}

# A comment

http# This is not a comment. There is no space
{
    server {# A comment
        listen 80;# A comment

        location = "/example1" # This is a comment. There is a space
        {# This is a comment after "{". No spaces required
            return 200 "Hello, world!";# This is a comment after ";". No spaces required
        }# This is a comment after "}". No spaces required

        location = /example2 # This is a comment. There is a space
        {}

        location = /example4# This is not a comment. There is no space
        {}
    }# A comment
}
name parameter1;
name "parameter1";
name parameter1 parameter2;
name parameter1 "parameter2";
name "parameter1" parameter2;
name para\;meter1;
name "para;meter1";
name "para\;meter1";
internal;
internal    ;

# A multiline parameter
name "para

meter1";

name {
    name parameter1  'parameter2' \; par#ameter3;
    name parameter1 \" 'he"llo' par#ameter2;
    name parameter1; name parameter1;
    name parameter1 \{ 'hello';
    name {
        internal;
        name parameter1 parameter2;
        name para;meter1; # An error!
        name para{meter1; # An error!
        name parameter1 {; # An error!
    }
}

name "#foo"; name; #bar
name " #foo"; #bar
name ';oh no' parameter;

@ghost
Copy link
Author

ghost commented Feb 7, 2019

So after a string, there has to be a white space or ;.

It looks that way.

@ghost
Copy link
Author

ghost commented Feb 7, 2019

this should work.

Your solution is perfect!

@RunDevelopment
Copy link
Member

Your solution is perfect!

It isn't...
There aren't any comments in the following, are there?

name # foo;

Also, I have another question: How does NGINX handle the following:

name # am I a comment;
;

Anyway, here's a corrected version. I also updated the name directive expression for the name:

Only directive and directive.inside.comment changed.

Prism.languages.nginx = {
    'comment': {
        pattern: /(^|[\s{};])#.*/,
        lookbehind: true
    },
    'directive': {
        pattern: /(^|\s)\w(?:[^;{}"'\\]|\\.|("|')(?:(?!\2)[^\\]|\\.)*\2)*?(?=[ \t]*[;{])/,
        lookbehind: true,
        greedy: true,
        inside: {
            'string': {
                // I assumed that " and ' can be escaped using \" and \' and that \ can be escaped using \\
                pattern: /((?:^|[^\\])(?:\\\\)*)("|')(?:(?!\2)[^\\]|\\.)*\2/,
                lookbehind: true
            },
            'comment': {
                pattern: /(\s)#.*(?=[\r\n])/,
                lookbehind: true,
                greedy: true
            },
            'keyword': /^\S+/

            // other patterns
        }
    },
    'punctuation': /[{};]/
};

@ghost
Copy link
Author

ghost commented Feb 13, 2019

There aren't any comments in the following, are there?

name # foo;

nginx recognizes # foo; as a comment. The punctuation at the end of the line is the only problem in your previous solution. Yes, name # foo; causes a syntax error, but parsing # foo as a parameter is wrong. If someone wants to pass a # with a preceding space, he should use quotes: name " # foo";. For example,

location # A comment;
{
}

causes nginx: [emerg] invalid number of arguments in "location" directive, because nginx doesn’t take # A comment; as a parameter. Removing spaces after # has no effect.

I think, your previous solution could be fixed easily:

'directive': {
	pattern: /\b\w+(?:("|')(?:(?!\1)[^\\]|\\.)*\1|(\s)#.*|[^;{}"'\\]|\\.)*?(?=\s*[;{])/,
	// ...
}

Also, I have another question: How does NGINX handle the following:

name # am I a comment;
;

No syntax errors. The first token punctuation span is unwanted.

Oh, I see the GitHub highlighter fails on some examples :)

@RunDevelopment
Copy link
Member

I applied your fix!

Prism.languages.nginx = {
    'comment': {
        pattern: /(^|[\s{};])#.*/,
        lookbehind: true
    },
    'directive': {
        pattern: /(^|\s)\w(?:[^;{}"'\\\s]|\\.|("|')(?:(?!\2)[^\\]|\\.)*\2|\s(?:#.*)?)*?(?=[ \t]*[;{])/,
        lookbehind: true,
        greedy: true,
        inside: {
            'string': {
                pattern: /((?:^|[^\\])(?:\\\\)*)("|')(?:(?!\2)[^\\]|\\.)*\2/,
                lookbehind: true
            },
            'comment': {
                pattern: /(\s)#.*(?=[\r\n])/,
                lookbehind: true,
                greedy: true
            },
            'keyword': /^\S+/

            // other patterns
        }
    },
    'punctuation': /[{};]/
};

I just have a bad feeling that \s(?:#.*)? might cause catastrophic backtracking. So if you find (relatively short) examples which take longer than ~1sec to highlight, please tell me.

@ghost
Copy link
Author

ghost commented Feb 13, 2019

@RunDevelopment, thank you! I’ll test it and write you back in a couple of days.

What about the next pattern?

'directive': {
	pattern: /(^|\s)\w(?:[^;{}"'\\\s]|\\.|("|')(?:(?!\2)[^\\]|\\.)*\2|\s#.*|\s+)*?(?=\s*[;{])/,
	// ...
}

@RunDevelopment
Copy link
Member

RunDevelopment commented Feb 13, 2019

Upon testing the two regexes I found that my pattern fails in linear time while yours takes exponential time. So please use my pattern.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant