Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use string scanner with baseparser #105

Merged
merged 7 commits into from
Jan 17, 2024

Conversation

naitoh
Copy link
Contributor

@naitoh naitoh commented Jan 6, 2024

Using StringScanner reduces the string copying process and speeds up the process.

And I removed unnecessary methods.

https://github.com/ruby/rexml/actions/runs/7549990000/job/20554906140?pr=105

ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [x86_64-linux]
Calculating -------------------------------------
                     rexml 3.2.6      master  3.2.6(YJIT)  master(YJIT) 
                 dom       4.868       5.077        8.137         8.303 i/s -     100.000 times in 20.540529s 19.696590s 12.288900s 12.043666s
                 sax      13.597      13.953       19.206        20.948 i/s -     100.000 times in 7.354343s 7.167142s 5.206745s 4.773765s
                pull      15.641      16.918       22.266        25.378 i/s -     100.000 times in 6.393424s 5.910955s 4.491201s 3.940471s
              stream      14.339      15.844       19.810        22.206 i/s -     100.000 times in 6.973856s 6.311350s 5.047957s 4.503244s

Comparison:
                              dom
        master(YJIT):         8.3 i/s 
         3.2.6(YJIT):         8.1 i/s - 1.02x  slower
              master:         5.1 i/s - 1.64x  slower
         rexml 3.2.6:         4.9 i/s - 1.71x  slower

                              sax
        master(YJIT):        20.9 i/s 
         3.2.6(YJIT):        19.2 i/s - 1.09x  slower
              master:        14.0 i/s - 1.50x  slower
         rexml 3.2.6:        13.6 i/s - 1.54x  slower

                             pull
        master(YJIT):        25.4 i/s 
         3.2.6(YJIT):        22.3 i/s - 1.14x  slower
              master:        16.9 i/s - 1.50x  slower
         rexml 3.2.6:        15.6 i/s - 1.62x  slower

                           stream
        master(YJIT):        22.2 i/s 
         3.2.6(YJIT):        19.8 i/s - 1.12x  slower
              master:        15.8 i/s - 1.40x  slower
         rexml 3.2.6:        14.3 i/s - 1.55x  slower
  • YJIT=ON : 1.02x - 1.14x faster
  • YJIT=OFF : 1.02x - 1.10x faster

@naitoh naitoh marked this pull request as draft January 6, 2024 23:22
@naitoh naitoh force-pushed the use_StringScanner_with_baseparser branch from 09b7fb9 to 4c56eb8 Compare January 7, 2024 07:26
@naitoh naitoh marked this pull request as ready for review January 7, 2024 07:37
lib/rexml/parseexception.rb Outdated Show resolved Hide resolved
@naitoh naitoh force-pushed the use_StringScanner_with_baseparser branch from 4c56eb8 to 8edd4ce Compare January 8, 2024 22:55
@naitoh naitoh requested a review from kou January 8, 2024 23:04
lib/rexml/parsers/baseparser.rb Outdated Show resolved Hide resolved
def read
begin
@buffer << readline
@scanner.string = @scanner.rest + readline
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use @scanner << readline here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing to @scanner << readline causes the following error in JRuby.

https://github.com/ruby/rexml/actions/runs/7434514894/job/20228750659#step:4:43

Error: test_rexml(REXMLTests::TestIssuezillaParsing):
  REXML::ParseException: No close tag for /issuezilla/issue[118]/activity[2]
  Line: -1
  Position: -1
  Last 80 unconsumed characters:
/Users/naitoh/ghq/github.com/naitoh/rexml/lib/rexml/parsers/treeparser.rb:28:in `parse'
/Users/naitoh/ghq/github.com/naitoh/rexml/lib/rexml/document.rb:448:in `build'
/Users/naitoh/ghq/github.com/naitoh/rexml/lib/rexml/document.rb:101:in `initialize'
org/jruby/RubyClass.java:904:in `new'
/Users/naitoh/ghq/github.com/naitoh/rexml/test/test_rexml_issuezilla.rb:8:in `block in test_rexml'
      5:     include Helper::Fixture
      6:     def test_rexml
      7:       doc = File.open(fixture_path("ofbiz-issues-full-177.xml")) do |f|
  =>  8:         REXML::Document.new(f)
      9:       end
     10:       ctr = 1
     11:       doc.root.each_element('//issue') do |issue|
org/jruby/RubyIO.java:1179:in `open'
/Users/naitoh/ghq/github.com/naitoh/rexml/test/test_rexml_issuezilla.rb:7:in `test_rexml'
org/jruby/RubyKernel.java:1310:in `catch'
org/jruby/RubyKernel.java:1305:in `catch'
org/jruby/RubyKernel.java:1310:in `catch'
org/jruby/RubyKernel.java:1305:in `catch'

I am not sure why the error is occurring, but I am thinking that ruby/strscan#78 may be affected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the issue shows that scanner.string = scanner.rest + XXX has a problem but scanner << XXX doesn't have a problem. So it may not be related...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ruby/strscan#78 has been fixed. Could you try with the latest strscan?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried, but it did not fix it...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@headius Cc: @kou
I think the problem is with JRuby's StringScanner, so I have created issue.

ruby/strscan#83

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@naitoh Thank you! I will try to fix it today.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ruby/strscan#83 is fixed and will be released soon!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kou
ruby/strscan#84 has been merged into master and I confirmed that JRuby's @scanner << readline works.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

lib/rexml/source.rb Outdated Show resolved Hide resolved
lib/rexml/source.rb Outdated Show resolved Hide resolved
lib/rexml/source.rb Outdated Show resolved Hide resolved
lib/rexml/source.rb Outdated Show resolved Hide resolved
lib/rexml/source.rb Outdated Show resolved Hide resolved
lib/rexml/source.rb Outdated Show resolved Hide resolved
lib/rexml/source.rb Outdated Show resolved Hide resolved
match = @source.match( ENTITYDECL, true ).to_a.compact
match[0] = :entitydecl
match = @source.match( ENTITYDECL, true )
match = match.nil? ? [:entitydecl] : [:entitydecl, *match.captures.compact.reject(&:empty?)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. This assumes that match is StringScanner.
How about returning @scanner.captures instead of @scanner by @source.match?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that ruby/strscan#72 needs to be merged in order to use @scanner.captures.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merged and released.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed source#match? to return scanner.captures instead of @scanner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add compact option to @source.match with 8bc8955

Improve processing speed by returning @scanner.captures.compact if @compact=true and @scanner if compact=false.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added match? and removed compact option in 50b3057.

Copy link
Contributor Author

@naitoh naitoh Jan 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed Source#match? and return @scanner in Source#match.

@naitoh naitoh force-pushed the use_StringScanner_with_baseparser branch from 8edd4ce to fcc4db8 Compare January 13, 2024 05:12
@naitoh
Copy link
Contributor Author

naitoh commented Jan 13, 2024

@kou
I have implemented the fix, but for some reason CI fails for head only.
(The error that occurs when strscan 3.0.8 is not used is being output. Do you know why?)

I used @scanner.captures instead of @scanner in source#match?, but the performance was much worse.

I don't think this is a good idea...

https://github.com/ruby/rexml/actions/runs/7510468370/job/20448939679?pr=105

                     rexml 3.2.6      master  3.2.6(YJIT)  master(YJIT) 
                 dom       4.981       4.923        7.432         8.192 i/s -     100.000 times in 20.075435s 20.314165s 13.455163s 12.207556s
                 sax      13.060      12.877       18.979        18.779 i/s -     100.000 times in 7.657007s 7.765487s 5.269112s 5.325084s
                pull      16.223      15.909       23.373        22.213 i/s -     100.000 times in 6.164121s 6.285777s 4.278464s 4.501960s
              stream      14.449      14.439       20.160        19.273 i/s -     100.000 times in 6.920919s 6.925817s 4.960348s 5.188534s

@naitoh naitoh requested a review from kou January 13, 2024 08:17
@naitoh naitoh force-pushed the use_StringScanner_with_baseparser branch from fcc4db8 to 8227cc2 Compare January 13, 2024 13:15
@naitoh
Copy link
Contributor Author

naitoh commented Jan 13, 2024

Add compact option to @source.match with 8bc8955

Improve processing speed by returning @scanner.captures.compact if @compact=true and @scanner if compact=false.

https://github.com/ruby/rexml/actions/runs/7512802060/job/20453872698?pr=105

                     rexml 3.2.6      master  3.2.6(YJIT)  master(YJIT) 
                 dom       5.012       5.100        8.407         8.440 i/s -     100.000 times in 19.951141s 19.607244s 11.894233s 11.848374s
                 sax      13.083      13.761       19.094        21.722 i/s -     100.000 times in 7.643699s 7.266801s 5.237383s 4.603641s
                pull      15.565      16.990       22.291        24.950 i/s -     100.000 times in 6.424717s 5.885687s 4.486175s 4.008003s
              stream      14.348      14.817       19.798        21.514 i/s -     100.000 times in 6.969712s 6.749020s 5.051075s 4.648116s

def read
begin
@buffer << readline
@scanner.string = @scanner.rest + readline
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the issue shows that scanner.string = scanner.rest + XXX has a problem but scanner << XXX doesn't have a problem. So it may not be related...

rexml.gemspec Outdated Show resolved Hide resolved
@kou
Copy link
Member

kou commented Jan 13, 2024

Improve processing speed by returning @scanner.captures.compact if @compact=true and @scanner if compact=false.

It seems that calling @scanner.captures when @source.match result isn't used slows down.

How about providing Source#match? that just true/false instead of matched result/nil like Regexp#match? and Regexp#match?
(And removed the added compact.)

We'll use match? for many cases, for example:

diff --git a/lib/rexml/parsers/baseparser.rb b/lib/rexml/parsers/baseparser.rb
index 305b120..610209e 100644
--- a/lib/rexml/parsers/baseparser.rb
+++ b/lib/rexml/parsers/baseparser.rb
@@ -223,13 +223,13 @@ module REXML
             return process_instruction
           when DOCTYPE_START
             base_error_message = "Malformed DOCTYPE"
-            @source.match(DOCTYPE_START, true)
+            @source.match?(DOCTYPE_START, true)
             @nsstack.unshift(curr_ns=Set.new)
             name = parse_name(base_error_message)
-            if @source.match(/\A\s*\[/um, true)
+            if @source.match?(/\A\s*\[/um, true)
               id = [nil, nil, nil]
               @document_status = :in_doctype
-            elsif @source.match(/\A\s*>/um, true)
+            elsif @source.match?(/\A\s*>/um, true)
               id = [nil, nil, nil]
               @document_status = :after_doctype
             else

@kou
Copy link
Member

kou commented Jan 13, 2024

I have implemented the fix, but for some reason CI fails for head only.
(The error that occurs when strscan 3.0.8 is not used is being output. Do you know why?)

Hmm. https://github.com/ruby/rexml/pull/105/files#r1451610001 may fix this.

@naitoh naitoh force-pushed the use_StringScanner_with_baseparser branch from 8bc8955 to 50b3057 Compare January 14, 2024 03:19
@naitoh
Copy link
Contributor Author

naitoh commented Jan 14, 2024

I have implemented the fix, but for some reason CI fails for head only.
(The error that occurs when strscan 3.0.8 is not used is being output. Do you know why?)

Hmm. https://github.com/ruby/rexml/pull/105/files#r1451610001 may fix this.

It was not fixed...

@naitoh
Copy link
Contributor Author

naitoh commented Jan 14, 2024

Added match? and removed compact option in 50b3057.
Slight improvement over #105 (comment) ...

https://github.com/ruby/rexml/actions/runs/7516658356/job/20461969215?pr=105

                     rexml 3.2.6      master  3.2.6(YJIT)  master(YJIT) 
                 dom       4.932       4.866        8.354         8.015 i/s -     100.000 times in 20.275441s 20.548772s 11.969668s 12.477243s
                 sax      13.182      13.048       19.279        19.058 i/s -     100.000 times in 7.586344s 7.664254s 5.186958s 5.247065s
                pull      15.635      15.440       22.523        22.527 i/s -     100.000 times in 6.396004s 6.476610s 4.439974s 4.439205s
              stream      14.473      14.247       19.828        19.634 i/s -     100.000 times in 6.909637s 7.019042s 5.043251s 5.093328s

@naitoh naitoh requested a review from kou January 14, 2024 03:36
@kou
Copy link
Member

kou commented Jan 14, 2024

OK. It seems that we don't access all captured results in our use case.
Let's remove Source#match? and return @scanner in Source#match.

@kou
Copy link
Member

kou commented Jan 14, 2024

ruby/ruby#9536 will fix the CI failure.

@naitoh naitoh force-pushed the use_StringScanner_with_baseparser branch from 50b3057 to ec62e37 Compare January 14, 2024 12:54
@naitoh
Copy link
Contributor Author

naitoh commented Jan 14, 2024

Let's remove Source#match? and return @scanner in Source#match.

I removed Source#match? and return @scanner in Source#match.

https://github.com/ruby/rexml/actions/runs/7519306454/job/20467689597?pr=105

                      rexml 3.2.6      master  3.2.6(YJIT)  master(YJIT) 
                 dom       4.907       5.070        8.226         8.248 i/s -     100.000 times in 20.377790s 19.723392s 12.157131s 12.123566s
                 sax      12.805      13.697       19.942        20.809 i/s -     100.000 times in 7.809347s 7.300932s 5.014450s 4.805517s
                pull      15.677      16.530       23.046        25.013 i/s -     100.000 times in 6.378722s 6.049606s 4.339091s 3.997863s
              stream      14.310      14.990       20.231        22.022 i/s -     100.000 times in 6.988153s 6.671139s 4.943000s 4.540999s
  • YJIT=ON : 1.00x - 1.09x faster
  • YJIT=OFF : 1.04x - 1.07x faster

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

head CI jobs are fixed.

match = @source.match( ENTITYDECL, true ).to_a.compact
match[0] = :entitydecl
match = @source.match( ENTITYDECL, true )
match = match.nil? ? [:entitydecl] : [:entitydecl, *match.captures.compact]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need match.nil? check here?
(Is there any case that the above @source.match() failed?)

Copy link
Contributor Author

@naitoh naitoh Jan 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the match.nil? check is removed, all tests succeed.

But, if the string <!ENTITY> comes in, @source.match() responds with nil and undefined method ``captures' for nil is raised.

However, since <!ENTITY> violates the XML specification and should be treated as an error.
I removed the match.nil? check.

https://xml.coverpages.org/xmlBNF.html

EntityDecl ::= '<!ENTITY' S Name S EntityDef S? '>'   /* General entities */
	| '<!ENTITY' S '%' S Name S EntityDef S? '>'   /* Parameter entities */

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Could you add a test for the case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we can do it as a separated PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK
I added test for this case.

lib/rexml/source.rb Outdated Show resolved Hide resolved
lib/rexml/source.rb Outdated Show resolved Hide resolved
def read
begin
@buffer << readline
@scanner.string = @scanner.rest + readline
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ruby/strscan#78 has been fixed. Could you try with the latest strscan?

test/test_core.rb Outdated Show resolved Hide resolved
[Why]
Using StringScanner reduces the string copying process and speeds up the process.
@naitoh naitoh force-pushed the use_StringScanner_with_baseparser branch from ec62e37 to 995d3e2 Compare January 15, 2024 15:11
@naitoh naitoh requested a review from kou January 15, 2024 23:00
@naitoh naitoh force-pushed the use_StringScanner_with_baseparser branch from 995d3e2 to ba9f7fc Compare January 16, 2024 11:52
test/test_document.rb Outdated Show resolved Hide resolved
Removed `attr_reader :buffer` and added similar `def buffer` and `def buffer_encoding=` interfaces.
@naitoh naitoh force-pushed the use_StringScanner_with_baseparser branch from ba9f7fc to eeb45e1 Compare January 16, 2024 23:38
@naitoh naitoh requested a review from kou January 16, 2024 23:46
lib/rexml/source.rb Outdated Show resolved Hide resolved
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Could you update the PR description before we merge this?

@naitoh
Copy link
Contributor Author

naitoh commented Jan 17, 2024

@kou
Updated PR description.
How do you like it?

@kou kou merged commit 810d228 into ruby:master Jan 17, 2024
39 checks passed
@kou
Copy link
Member

kou commented Jan 17, 2024

Thanks!

@naitoh naitoh deleted the use_StringScanner_with_baseparser branch January 17, 2024 06:46
@naitoh
Copy link
Contributor Author

naitoh commented Jan 17, 2024

Thanks for your review!!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants