Optimize #at_css and #css initialization #14

Open: wants to merge 11 commits into base: master
82 changes: 44 additions & 38 deletions README.md
@@ -2,7 +2,7 @@

[![CI](https://github.com/serpapi/nokolexbor/actions/workflows/ci.yml/badge.svg)](https://github.com/serpapi/nokolexbor/actions/workflows/ci.yml)

Nokolexbor is a drop-in replacement for Nokogiri. It's 5.2x faster at parsing HTML and up to 997x faster at CSS selectors.
Nokolexbor is a drop-in replacement for Nokogiri. It's 4.7x faster at parsing HTML and up to 1352x faster at CSS selectors.

It's a performance-focused HTML5 parser for Ruby based on [Lexbor](https://github.com/lexbor/lexbor/). It supports both CSS selectors and XPath. Nokolexbor's API is designed to be 1:1 compatible as much as possible with [Nokogiri's API](https://github.com/sparklemotion/nokogiri).

@@ -106,75 +106,81 @@ end

## Benchmarks

Benchmark parsing google result page (368 KB) and selecting nodes using CSS and XPath. Run on MacBook Pro (2019) 2.3 GHz 8-Core Intel Core i9.
Benchmarks of parsing a Google search result page (367 KB) and finding nodes using CSS selectors and XPath.

CPU: AMD Ryzen 5 5600 (Ubuntu 20.04 on Windows 10 WSL 2).

Run with: `ruby bench/bench.rb`

| | Nokolexbor (iters/s) | Nokogiri (iters/s) | Diff |
| ---------- | ------------- | ----------- | -------------- |
| parsing | 487.6 | 93.5 | 5.22x faster |
| at_css | 50798.8 | 50.9 | 997.87x faster |
| css | 7437.6 | 52.3 | 142.11x faster |
| at_xpath | 57.077 | 53.176 | same-ish |
| xpath | 51.523 | 58.438 | same-ish |
| ---------- | ------------- | ------------ | --------------- |
| parsing | 994.8 | 211.8 | 4.70x faster |
| at_css | 202963.7 | 150.1 | 1352.33x faster |
| css | 9787.9 | 150.0 | 65.27x faster |
| at_xpath | 153.2 | 154.6 | same-ish |
| xpath | 153.2 | 154.3 | same-ish |

<details>
<summary>Raw data</summary>

```
Warming up --------------------------------------
Nokolexbor parse 56.000 i/100ms
Nokogiri parse 8.000 i/100ms
Nokolexbor parse (367 KB)
100.000 i/100ms
Nokogiri parse (367 KB)
20.000 i/100ms
Calculating -------------------------------------
Nokolexbor parse 487.564 (±10.9%) i/s - 9.688k in 20.117173s
Nokogiri parse 93.470 (±21.4%) i/s - 1.736k in 20.024163s
Nokolexbor parse (367 KB)
994.773 (± 0.9%) i/s - 19.900k in 20.006124s
Nokogiri parse (367 KB)
211.793 (±12.3%) i/s - 4.180k in 20.093299s

Comparison:
Nokolexbor parse: 487.6 i/s
Nokogiri parse: 93.5 i/s - 5.22x (± 0.00) slower
Nokolexbor parse (367 KB): 994.8 i/s
Nokogiri parse (367 KB): 211.8 i/s - 4.70x (± 0.00) slower

Warming up --------------------------------------
Nokolexbor at_css 5.548k i/100ms
Nokogiri at_css 6.000 i/100ms
Nokolexbor at_css 20.195k i/100ms
Nokogiri at_css 15.000 i/100ms
Calculating -------------------------------------
Nokolexbor at_css 50.799k (±13.8%) i/s - 987.544k in 20.018481s
Nokogiri at_css 50.907 (±35.4%) i/s - 828.000 in 20.666258s
Nokolexbor at_css 202.964k (± 0.7%) i/s - 4.059M in 20.000626s
Nokogiri at_css 150.084 (± 0.7%) i/s - 3.015k in 20.089207s

Comparison:
Nokolexbor at_css: 50798.8 i/s
Nokogiri at_css: 50.9 i/s - 997.87x (± 0.00) slower
Nokolexbor at_css: 202963.7 i/s
Nokogiri at_css: 150.1 i/s - 1352.33x (± 0.00) slower

Warming up --------------------------------------
Nokolexbor css 709.000 i/100ms
Nokogiri css 4.000 i/100ms
Nokolexbor css 977.000 i/100ms
Nokogiri css 15.000 i/100ms
Calculating -------------------------------------
Nokolexbor css 7.438k (±14.7%) i/s - 145.345k in 20.083833s
Nokogiri css 52.338 (±36.3%) i/s - 816.000 in 20.042053s
Nokolexbor css 9.788k (± 0.4%) i/s - 196.377k in 20.063658s
Nokogiri css 149.956 (± 0.7%) i/s - 3.000k in 20.006363s

Comparison:
Nokolexbor css: 7437.6 i/s
Nokogiri css: 52.3 i/s - 142.11x (± 0.00) slower
Nokolexbor css: 9787.9 i/s
Nokogiri css: 150.0 i/s - 65.27x (± 0.00) slower

Warming up --------------------------------------
Nokolexbor at_xpath 2.000 i/100ms
Nokogiri at_xpath 4.000 i/100ms
Nokolexbor at_xpath 15.000 i/100ms
Nokogiri at_xpath 15.000 i/100ms
Calculating -------------------------------------
Nokolexbor at_xpath 57.077 (±31.5%) i/s - 920.000 in 20.156393s
Nokogiri at_xpath 53.176 (±35.7%) i/s - 876.000 in 20.036717s
Nokolexbor at_xpath 153.190 (± 0.7%) i/s - 3.075k in 20.073628s
Nokogiri at_xpath 154.588 (± 0.6%) i/s - 3.105k in 20.086664s

Comparison:
Nokolexbor at_xpath: 57.1 i/s
Nokogiri at_xpath: 53.2 i/s - same-ish: difference falls within error
Nokogiri at_xpath: 154.6 i/s
Nokolexbor at_xpath: 153.2 i/s - same-ish: difference falls within error

Warming up --------------------------------------
Nokolexbor xpath 3.000 i/100ms
Nokogiri xpath 3.000 i/100ms
Nokolexbor xpath 15.000 i/100ms
Nokogiri xpath 15.000 i/100ms
Calculating -------------------------------------
Nokolexbor xpath 51.523 (±31.1%) i/s - 903.000 in 20.102568s
Nokogiri xpath 58.438 (±35.9%) i/s - 852.000 in 20.001408s
Nokolexbor xpath 153.159 (± 0.7%) i/s - 3.075k in 20.077580s
Nokogiri xpath 154.322 (± 1.3%) i/s - 3.090k in 20.026288s

Comparison:
Nokogiri xpath: 58.4 i/s
Nokolexbor xpath: 51.5 i/s - same-ish: difference falls within error
Nokogiri xpath: 154.3 i/s
Nokolexbor xpath: 153.2 i/s - same-ish: difference falls within error
```
</details>
10 changes: 1 addition & 9 deletions ext/nokolexbor/extconf.rb
@@ -64,14 +64,6 @@ def which(cmd)
append_cflags("-DLEXBOR_STATIC")
append_cflags("-DLIBXML_STATIC")

def sys(cmd)
puts "-- #{cmd}"
unless ret = xsystem(cmd)
raise "ERROR: '#{cmd}' failed"
end
ret
end

# Thrown when we detect CMake is taking too long and we killed it
class CMakeTimeout < StandardError
end
@@ -138,7 +130,7 @@ def apply_patch(patch_file, chdir)

Dir.chdir("build") do
run_cmake(10 * 60, ".. -DCMAKE_INSTALL_PREFIX:PATH=#{INSTALL_DIR} #{lexbor_cmake_flags.join(' ')}")
sys("#{MAKE} install")
system("#{MAKE}", "install")
end
end

2 changes: 1 addition & 1 deletion ext/nokolexbor/libxml/tree.h
@@ -23,7 +23,7 @@ extern "C" {
#endif

static size_t tmp_len;
#define NODE_NAME(node) lxb_dom_node_name_qualified((node), &tmp_len)
#define NODE_NAME(node) lxb_dom_node_name_qualified((lxb_dom_node_t *)(node), &tmp_len)
#define NODE_NS_HREF(node) ((node)->prefix ? lxb_ns_by_id((node)->owner_document->ns, (node)->ns, &tmp_len) : NULL)
#define NODE_NS_PREFIX(node) lxb_ns_by_id((node)->owner_document->prefix, (node)->prefix, &tmp_len)

8 changes: 4 additions & 4 deletions ext/nokolexbor/nl_attribute.c
@@ -141,7 +141,7 @@ nl_attribute_parent(VALUE self)
if (attr->owner == NULL) {
return Qnil;
}
return nl_rb_node_create(attr->owner, nl_rb_document_get(self));
return nl_rb_node_create((lxb_dom_node_t *)attr->owner, nl_rb_document_get(self));
}

/**
@@ -158,7 +158,7 @@ nl_attribute_previous(VALUE self)
if (attr->prev == NULL) {
return Qnil;
}
return nl_rb_node_create(attr->prev, nl_rb_document_get(self));
return nl_rb_node_create((lxb_dom_node_t *)attr->prev, nl_rb_document_get(self));
}

/**
@@ -175,7 +175,7 @@ nl_attribute_next(VALUE self)
if (attr->next == NULL) {
return Qnil;
}
return nl_rb_node_create(attr->next, nl_rb_document_get(self));
return nl_rb_node_create((lxb_dom_node_t *)attr->next, nl_rb_document_get(self));
}

static VALUE
@@ -189,7 +189,7 @@ nl_attribute_inspect(VALUE self)

return rb_sprintf("#<%" PRIsVALUE " %s=\"%s\">", c,
lxb_dom_attr_qualified_name(attr, &len),
attr_value == NULL ? "" : attr_value);
attr_value == NULL ? "" : (char *)attr_value);
}

void Init_nl_attribute(void)
43 changes: 10 additions & 33 deletions ext/nokolexbor/nl_document.c
@@ -5,11 +5,6 @@ extern VALUE mNokolexbor;
extern VALUE cNokolexborNode;
VALUE cNokolexborDocument;

#ifdef HAVE_PTHREAD_H
#include <pthread.h>
pthread_key_t p_key_parser;
#endif

static void
free_nl_document(lxb_html_document_t *document)
{
@@ -50,24 +45,19 @@ nl_document_parse(VALUE self, VALUE rb_string_or_io)
const char *html_c = StringValuePtr(rb_html);
size_t html_len = RSTRING_LEN(rb_html);

#ifdef HAVE_PTHREAD_H
lxb_html_parser_t *g_parser = (lxb_html_parser_t *)pthread_getspecific(p_key_parser);
#else
lxb_html_parser_t *g_parser = NULL;
#endif
if (g_parser == NULL) {
g_parser = lxb_html_parser_create();
lxb_status_t status = lxb_html_parser_init(g_parser);
static lxb_html_parser_t *html_parser = NULL;
if (html_parser == NULL) {
html_parser = lxb_html_parser_create();
lxb_status_t status = lxb_html_parser_init(html_parser);
if (status != LXB_STATUS_OK) {
lxb_html_parser_destroy(html_parser);
html_parser = NULL;
nl_raise_lexbor_error(status);
This leaks the initial allocation unless `status` is `LXB_STATUS_ERROR_OBJECT_IS_NULL`; for any other status it also leaks whatever `lxb_html_parser_init` allocated before failing. The parser object also leaks in the non-`HAVE_PTHREAD_H` case, because cleanup relies on the pthread key destructor being called.
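
A minimal sketch of the leak-free pattern this comment points toward: create, init, and on failure destroy before reporting the error, publishing the cached pointer only after init succeeds. Everything here (`demo_parser_t`, `demo_parser_create`, `demo_parser_init`, `demo_parser_destroy`, `get_cached_parser`, and the allocation counter) is a hypothetical stand-in written for illustration, not the Lexbor API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a parser object; NOT the Lexbor type. */
typedef struct {
    void *tree; /* allocated by demo_parser_init */
} demo_parser_t;

static int live_allocations = 0;    /* outstanding allocations, for the demo only */
static bool fail_next_init = false; /* lets a caller force one init failure */

static demo_parser_t *demo_parser_create(void)
{
    demo_parser_t *p = calloc(1, sizeof(*p));
    if (p != NULL) {
        live_allocations++;
    }
    return p;
}

/* Returns 0 on success. On failure it may leave partial state behind,
 * which is why the caller must still destroy the object. */
static int demo_parser_init(demo_parser_t *p)
{
    if (p == NULL) {
        return 1; /* analogous to an "object is NULL" status */
    }
    p->tree = malloc(16);
    if (p->tree == NULL) {
        return 2;
    }
    live_allocations++;
    if (fail_next_init) {
        fail_next_init = false;
        return 3; /* fails after allocating */
    }
    return 0;
}

/* Accepts NULL and partially initialized objects. */
static demo_parser_t *demo_parser_destroy(demo_parser_t *p)
{
    if (p == NULL) {
        return NULL;
    }
    if (p->tree != NULL) {
        free(p->tree);
        live_allocations--;
    }
    free(p);
    live_allocations--;
    return NULL;
}

/* Lazily created parser, reused across calls. */
static demo_parser_t *cached_parser = NULL;

static demo_parser_t *get_cached_parser(int *status_out)
{
    *status_out = 0;
    if (cached_parser == NULL) {
        demo_parser_t *p = demo_parser_create();
        int status = demo_parser_init(p);
        if (status != 0) {
            demo_parser_destroy(p); /* release everything before reporting */
            *status_out = status;
            return NULL;            /* cache stays NULL, so the next call retries */
        }
        cached_parser = p;          /* publish only a fully initialized object */
    }
    return cached_parser;
}
```

Because the cache is assigned only after a successful init, a failed attempt leaves nothing allocated and nothing half-built for later calls to pick up. The sketch assumes callers are serialized; a process-wide static like this is not safe to initialize from several threads at once.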

}
g_parser->tree->scripting = true;
#ifdef HAVE_PTHREAD_H
pthread_setspecific(p_key_parser, g_parser);
#endif
html_parser->tree->scripting = true;
}

lxb_html_document_t *document = lxb_html_parse(g_parser, (const lxb_char_t *)html_c, html_len);
lxb_html_document_t *document = lxb_html_parse(html_parser, (const lxb_char_t *)html_c, html_len);

if (document == NULL) {
rb_raise(rb_eRuntimeError, "Error parsing document");
@@ -104,7 +94,7 @@ static VALUE
nl_document_get_title(VALUE self)
{
size_t len;
lxb_char_t *str = lxb_html_document_title(nl_rb_document_unwrap(self), &len);
lxb_char_t *str = lxb_html_document_title((lxb_html_document_t *)nl_rb_document_unwrap(self), &len);
return str == NULL ? rb_str_new("", 0) : rb_utf8_str_new(str, len);
}

@@ -126,7 +116,7 @@ nl_document_set_title(VALUE self, VALUE rb_title)
{
const char *c_title = StringValuePtr(rb_title);
size_t len = RSTRING_LEN(rb_title);
lxb_html_document_title_set(nl_rb_document_unwrap(self), (const lxb_char_t *)c_title, len);
lxb_html_document_title_set((lxb_html_document_t *)nl_rb_document_unwrap(self), (const lxb_char_t *)c_title, len);
return rb_title;
}

@@ -142,21 +132,8 @@ nl_document_root(VALUE self)
return nl_rb_node_create(lxb_dom_document_root(doc), self);
}

static void
free_parser(void *data)
{
lxb_html_parser_t *g_parser = (lxb_html_parser_t *)data;
if (g_parser != NULL) {
g_parser = lxb_html_parser_destroy(g_parser);
}
}

void Init_nl_document(void)
{
#ifdef HAVE_PTHREAD_H
pthread_key_create(&p_key_parser, free_parser);
#endif

cNokolexborDocument = rb_define_class_under(mNokolexbor, "Document", cNokolexborNode);
rb_define_singleton_method(cNokolexborDocument, "new", nl_document_new, 0);
rb_define_singleton_method(cNokolexborDocument, "parse", nl_document_parse, 1);
52 changes: 32 additions & 20 deletions ext/nokolexbor/nl_node.c
@@ -159,7 +159,7 @@ nl_node_attribute(VALUE self, VALUE rb_name)
if (attr->owner == NULL) {
attr->owner = lxb_dom_interface_element(node);
}
return nl_rb_node_create(attr, nl_rb_document_get(self));
return nl_rb_node_create((lxb_dom_node_t *)attr, nl_rb_document_get(self));
}

/**
@@ -185,7 +185,7 @@ nl_node_attribute_nodes(VALUE self)
if (attr->owner == NULL) {
attr->owner = lxb_dom_interface_element(node);
}
rb_ary_push(ary, nl_rb_node_create(attr, rb_doc));
rb_ary_push(ary, nl_rb_node_create((lxb_dom_node_t *)attr, rb_doc));
attr = attr->next;
}

@@ -366,28 +366,32 @@ nl_node_find(VALUE self, VALUE selector, lxb_selectors_cb_f cb, void *ctx)
lxb_dom_node_t *node = nl_rb_node_unwrap(self);

lxb_status_t status;
lxb_css_parser_t *parser = NULL;
lxb_selectors_t *selectors = NULL;
static lxb_css_parser_t *css_parser = NULL;
static lxb_selectors_t *selectors = NULL;
lxb_css_selector_list_t *list = NULL;

/* Create CSS parser. */
parser = lxb_css_parser_create();
status = lxb_css_parser_init(parser, NULL, NULL);
if (status != LXB_STATUS_OK) {
goto cleanup;
/* CSS parser. */
if (css_parser == NULL) {
css_parser = lxb_css_parser_create();
status = lxb_css_parser_init(css_parser, NULL, NULL);
if (status != LXB_STATUS_OK) {
goto init_error;
}
}

/* Selectors. */
selectors = lxb_selectors_create();
status = lxb_selectors_init(selectors);
if (status != LXB_STATUS_OK) {
goto cleanup;
if (selectors == NULL) {
selectors = lxb_selectors_create();
status = lxb_selectors_init(selectors);
if (status != LXB_STATUS_OK) {
goto init_error;
}
}

/* Parse and get the log. */
list = lxb_css_selectors_parse_relative_list(parser, (const lxb_char_t *)selector_c, selector_len);
if (parser->status != LXB_STATUS_OK) {
status = parser->status;
list = lxb_css_selectors_parse_relative_list(css_parser, (const lxb_char_t *)selector_c, selector_len);
if (css_parser->status != LXB_STATUS_OK) {
status = css_parser->status;
goto cleanup;
}

@@ -398,11 +402,19 @@ nl_node_find(VALUE self, VALUE selector, lxb_selectors_cb_f cb, void *ctx)
}

cleanup:
/* Destroy all object for all CSS Selector List. */
lxb_css_selector_list_destroy_memory(list);

return status;

init_error:
/* Destroy Selectors object. */
(void)lxb_selectors_destroy(selectors, true);
lxb_selectors_destroy(selectors, true);
selectors = NULL;

/* Destroy resources for CSS Parser. */
(void)lxb_css_parser_destroy(parser, true);
lxb_css_parser_destroy(css_parser, true);
css_parser = NULL;

/* Destroy all object for all CSS Selector List. */
lxb_css_selector_list_destroy_memory(list);
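
The two exit paths in the `nl_node_find` hunks above can be sketched in isolation: a per-call object is always released at `cleanup`, while the reusable static object is torn down only at `init_error`. This is an illustrative reduction with hypothetical names (`demo_engine_t`, `demo_find`, `engine_creations`), not the Lexbor API, and the "selector matching" is reduced to a string length.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the cached CSS parser / selectors objects. */
typedef struct {
    int ready;
} demo_engine_t;

static demo_engine_t *engine = NULL; /* reused across calls */
static int engine_creations = 0;     /* how many times the engine was built */

/* Returns 0 on success and stores a result in *matches_out. */
static int demo_find(const char *selector, size_t *matches_out)
{
    int status = 0;
    char *list = NULL; /* per-call object, freed on every normal exit */

    /* One-time setup of the shared engine. */
    if (engine == NULL) {
        engine = calloc(1, sizeof(*engine));
        if (engine == NULL) {
            status = 1;
            goto init_error;
        }
        engine->ready = 1;
        engine_creations++;
    }

    /* Per-call work: copy ("parse") the selector. */
    list = malloc(strlen(selector) + 1);
    if (list == NULL) {
        status = 2;
        goto cleanup;
    }
    strcpy(list, selector);
    if (list[0] == '\0') {
        status = 3; /* treat an empty selector as a parse error */
        goto cleanup;
    }

    *matches_out = strlen(list);

cleanup:
    /* Release only the per-call object; the engine stays cached. */
    free(list);
    return status;

init_error:
    /* Setup failed: drop the shared state so a later call can retry. */
    free(engine);
    engine = NULL;
    return status;
}
```

A parse error on one call goes through `cleanup` and leaves the cached engine intact, so the next call does not pay the setup cost again.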
@@ -1014,9 +1026,9 @@ static VALUE
nl_node_add_sibling(VALUE self, VALUE next_or_previous, VALUE new)
{
bool insert_after;
if (rb_eql(rb_String(next_or_previous), rb_str_new_literal("next"))) {
if (rb_str_cmp(rb_String(next_or_previous), rb_str_new_literal("next")) == 0) {
insert_after = true;
} else if (rb_eql(rb_String(next_or_previous), rb_str_new_literal("previous"))) {
} else if (rb_str_cmp(rb_String(next_or_previous), rb_str_new_literal("previous")) == 0) {
insert_after = false;
} else {
rb_raise(rb_eArgError, "Unsupported inserting position");
4 changes: 2 additions & 2 deletions ext/nokolexbor/xml_tree.c
@@ -339,8 +339,8 @@ nl_xmlGetNodePath(const lxb_dom_node_t *node)

} else if (cur->type == LXB_DOM_NODE_TYPE_ATTRIBUTE) {
sep = "/@";
name = (const char *) lxb_dom_attr_qualified_name(cur, &tmp_len);
next = ((lxb_dom_attr_t_ptr)cur)->owner;
name = (const char *) lxb_dom_attr_qualified_name((lxb_dom_attr_t_ptr)cur, &tmp_len);
next = (lxb_dom_node_t *)((lxb_dom_attr_t_ptr)cur)->owner;
} else {
nl_xmlFree(buf);
nl_xmlFree(buffer);