Querying for UTF-8 attribute values #48

Open
noebi opened this issue Mar 1, 2021 · 0 comments

Hi,
we recently discovered that nodeattr does not support queries involving UTF-8 encoded attribute values.
We actually have quite a lot of such values but apparently never needed to use them in queries. Also, I haven't seen any
documentation that says attributes must be ASCII.

The problem seems to be the lex tokenizer, whose attribute pattern matches only ASCII characters.
While it's not really straightforward to teach flex about UTF-8, a relatively simple patch seems to do most of the work:

--- genders-1.28.1/src/libgenders/genders_query_parse.l 2021-02-28 11:13:08.580111309 +0100
+++ genders-1.28.1.utf8/src/libgenders/genders_query_parse.l    2021-02-28 11:13:45.383330719 +0100
@@ -41,8 +41,19 @@
 
 %}
 
+ASC     [a-zA-Z0-9]
+ASCC    [a-zA-Z0-9_\.\=:%\\\/\+]
+
+U       [\x80-\xbf]
+U2      [\xc2-\xdf]
+U3      [\xe0-\xef]
+U4      [\xf0-\xf4]
+
+UT     {ASC}|{U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
+UTC    {ASCC}|{U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
+
 %%
-[a-zA-Z0-9][a-zA-Z0-9_\.\=:%\\\/\+]*([\-\|&]?[a-zA-Z0-9_\.\=:%\\\/\+]+)* yylval.attr = strdup(yytext); return ATTRTOK;
+{UT}{UTC}*([\-\|&]?{UTC}+)* yylval.attr = strdup(yytext); return ATTRTOK;
 \(                                                                       return LPARENTOK;
 \)                                                                       return RPARENTOK;
 \|\|                                                                     return UNIONTOK;
diff -r -u genders-1.28.1/src/libgenders/Makefile.am genders-1.28.1.utf8/src/libgenders/Makefile.am
--- genders-1.28.1/src/libgenders/Makefile.am   2020-05-15 21:52:08.000000000 +0200
+++ genders-1.28.1.utf8/src/libgenders/Makefile.am      2021-02-28 11:18:00.873911772 +0100
@@ -31,7 +31,7 @@
 
 # achu: -o option in lex/flex is not portable, use -t and write to stdout
 genders_query_parse.c: genders_query.c $(srcdir)/genders_query_parse.l
-       $(LEX) -t $(srcdir)/genders_query_parse.l > $(srcdir)/genders_query_parse.c
+       $(LEX) -8 -t $(srcdir)/genders_query_parse.l > $(srcdir)/genders_query_parse.c
 
 # achu: -o option in yacc/bison is not portable, use -b instead
 genders_query.c: $(srcdir)/genders_query.y
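
For anyone reviewing the patch, here is a rough sketch in C of what the UT definition accepts, byte by byte. The names is_cont, is_asc, ut_match_len and ut_self_check are made up for illustration and are not part of genders; the sketch only mirrors the character classes from the patch and assumes an ASCII-based execution character set.

```c
#include <assert.h>
#include <stddef.h>

/* U: a UTF-8 continuation byte, [\x80-\xbf]. */
static int is_cont(unsigned char c)
{
    return c >= 0x80 && c <= 0xbf;
}

/* ASC: [a-zA-Z0-9], spelled out so the result does not depend on the locale. */
static int is_asc(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9');
}

/* Hypothetical helper: number of bytes the UT pattern would consume at the
 * start of the NUL-terminated string s, or 0 if UT does not match there. */
static size_t ut_match_len(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t need, i;

    if (is_asc(p[0]))
        return 1;                      /* {ASC} */
    if (p[0] >= 0xc2 && p[0] <= 0xdf)
        need = 1;                      /* {U2}{U} */
    else if (p[0] >= 0xe0 && p[0] <= 0xef)
        need = 2;                      /* {U3}{U}{U} */
    else if (p[0] >= 0xf0 && p[0] <= 0xf4)
        need = 3;                      /* {U4}{U}{U}{U} */
    else
        return 0;                      /* NUL, stray continuation byte, etc. */

    /* A NUL byte is not a continuation byte, so this never reads past the
     * terminator of a truncated sequence. */
    for (i = 1; i <= need; i++)
        if (!is_cont(p[i]))
            return 0;
    return need + 1;
}

/* Small self-check covering each alternative of the UT pattern. */
static void ut_self_check(void)
{
    assert(ut_match_len("a") == 1);                 /* ASCII letter */
    assert(ut_match_len("_") == 0);                 /* in ASCC, not in ASC */
    assert(ut_match_len("\xc3\xa9") == 2);          /* U+00E9 */
    assert(ut_match_len("\xe2\x82\xac") == 3);      /* U+20AC */
    assert(ut_match_len("\xf0\x9f\x98\x80") == 4);  /* U+1F600 */
    assert(ut_match_len("\xc3") == 0);              /* truncated sequence */
    assert(ut_match_len("\xa9") == 0);              /* lone continuation byte */
}
```

Calling ut_self_check() from a main function exercises all cases. Note that these classes are a bit looser than strict UTF-8: they also accept overlong three-byte forms such as \xe0\x80\x80 and surrogate encodings starting with \xed\xa0, which a strict decoder would reject. For tokenizing attribute names that is probably acceptable, but it means the lexer does not validate its input.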

Any chance of seeing something like this in one of the next releases?
