Add support for detecting invalid XML that has unsupported content before root element (#184)

## Why?

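XML with non-whitespace character data before the root element is not well-formed. Only the prolog (an XML declaration, comments, processing instructions, whitespace, and an optional DOCTYPE declaration) may precede it.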

https://www.w3.org/TR/2006/REC-xml11-20060816/#document

```
[1]   document   ::=   ( prolog element Misc* ) - ( Char* RestrictedChar Char* )
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-prolog

```
[22]   	prolog	   ::=   	XMLDecl Misc* (doctypedecl Misc*)?
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-XMLDecl

```
[23]   	XMLDecl	   ::=   	'<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-Misc

```
[27]   	Misc	   ::=   	Comment | PI | S
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PI

```
[16]   	PI	   ::=   	'<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PITarget

```
[17]   	PITarget	   ::=   	Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))
```
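
Production [17] excludes only the three-letter name `xml`, in any letter case, from processing instruction targets. A minimal sketch of that check follows; `reserved_pi_target?` is a hypothetical helper for illustration, not part of REXML, and it assumes the target has already been validated as a `Name`.

```ruby
# Hypothetical helper (not REXML API): true when a PI target is exactly
# "xml" in any letter case, the one name that production [17] excludes.
def reserved_pi_target?(target)
  target.match?(/\Axml\z/i)
end

reserved_pi_target?("xml")            # => true
reserved_pi_target?("XmL")            # => true
reserved_pi_target?("abc")            # => false
reserved_pi_target?("xml-stylesheet") # => false (longer names are allowed)
```

This is why `<?abc version="1.0" ?>` is an ordinary processing instruction, while `<?xml ... ?>` anywhere other than the very start of the document is rejected.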

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-doctypedecl

```
[28]   	doctypedecl	   ::=   	'<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>'
```
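
Taken together, the productions above say that only `Misc` items (comments, processing instructions, whitespace), the XML declaration, and a DOCTYPE declaration may appear before the root element. The sketch below illustrates that rule with a standalone check. It is not how REXML implements it: `LEADING_MISC` and `stray_text_before_root` are hypothetical names, the regular expressions only approximate the grammar (no `Name` validation, no `--` check inside comments), and text that follows a DOCTYPE declaration is not examined.

```ruby
# Approximation of a run of leading Misc items: Comment | PI | S.
# The XML declaration is matched by the PI alternative here.
LEADING_MISC = /\A(?:<!--.*?-->|<\?.*?\?>|\s+)*/m

# Hypothetical helper: returns the stray character data found before the
# first markup that is not a comment or PI, or nil when there is none.
def stray_text_before_root(xml)
  rest = xml.sub(LEADING_MISC, "")
  return nil if rest.empty? || rest.start_with?("<")
  rest[/\A[^<]+/]
end

stray_text_before_root('b<a></a>')                      # => "b"
stray_text_before_root('<!-- ok comment --><a></a>')    # => nil
stray_text_before_root('<?abc version="1.0" ?><a></a>') # => nil
```

The inputs are the same documents used in the tests added by this commit: the first one is rejected, the other two are accepted.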

See: #164 (comment)
naitoh committed Jul 22, 2024
1 parent 2c39c91 commit 2bca7bd
Showing 4 changed files with 60 additions and 22 deletions.
10 changes: 7 additions & 3 deletions lib/rexml/parsers/baseparser.rb
@@ -486,11 +486,15 @@ def pull_event
             if text.chomp!("<")
               @source.position -= "<".bytesize
             end
-            if @tags.empty? and @have_root
+            if @tags.empty?
               unless /\A\s*\z/.match?(text)
-                raise ParseException.new("Malformed XML: Extra content at the end of the document (got '#{text}')", @source)
+                if @have_root
+                  raise ParseException.new("Malformed XML: Extra content at the end of the document (got '#{text}')", @source)
+                else
+                  raise ParseException.new("Malformed XML: Content at the start of the document (got '#{text}')", @source)
+                end
               end
-              return pull_event
+              return pull_event if @have_root
             end
             return [ :text, text ]
           end
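
The branch changed in this hunk can be read as a small decision table over three inputs: the text, whether any element is open, and whether the root has been seen. The sketch below restates it as a standalone method so the outcomes are easy to compare. `top_level_text_action` is a hypothetical name, `ArgumentError` stands in for `REXML::ParseException`, and `:skip` stands in for the recursive `pull_event` call.

```ruby
# Hypothetical restatement of the text branch in pull_event (not REXML API).
def top_level_text_action(text, tags_empty:, have_root:)
  # Inside an open element, text is always an ordinary :text event.
  return [:text, text] unless tags_empty

  unless /\A\s*\z/.match?(text)
    where = have_root ? "Extra content at the end" : "Content at the start"
    raise ArgumentError, "Malformed XML: #{where} of the document (got '#{text}')"
  end

  # Whitespace only: skipped after the root, still emitted before it.
  have_root ? :skip : [:text, text]
end

top_level_text_action("hi", tags_empty: false, have_root: true) # => [:text, "hi"]
top_level_text_action(" \n", tags_empty: true, have_root: true) # => :skip
top_level_text_action(" \n", tags_empty: true, have_root: false) # => [:text, " \n"]
```

Note that whitespace-only text before the root is still returned as a `:text` event, because the early `return pull_event` only applies once `@have_root` is set.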
12 changes: 12 additions & 0 deletions test/parse/test_comment.rb
@@ -110,6 +110,18 @@ def test_after_doctype_malformed_comment_end
       end
     end

+    def test_before_root
+      parser = REXML::Parsers::BaseParser.new('<!-- ok comment --><a></a>')
+
+      events = {}
+      while parser.has_next?
+        event = parser.pull
+        events[event[0]] = event[1]
+      end
+
+      assert_equal(" ok comment ", events[:comment])
+    end
+
     def test_after_root
       parser = REXML::Parsers::BaseParser.new('<a></a><!-- ok comment -->')

43 changes: 24 additions & 19 deletions test/parse/test_processing_instruction.rb
@@ -25,25 +25,6 @@ def test_no_name
         DETAIL
       end

-      def test_garbage_text
-        # TODO: This should be parse error.
-        # Create test/parse/test_document.rb or something and move this to it.
-        doc = parse(<<-XML)
-x<?x y
-<!--?><?x -->?>
-<r/>
-        XML
-        pi = doc.children[1]
-        assert_equal([
-                       "x",
-                       "y\n<!--",
-                     ],
-                     [
-                       pi.target,
-                       pi.content,
-                     ])
-      end
-
       def test_xml_declaration_not_at_document_start
         exception = assert_raise(REXML::ParseException) do
           parser = REXML::Parsers::BaseParser.new('<a><?xml version="1.0" ?></a>')
@@ -62,6 +43,30 @@ def test_xml_declaration_not_at_document_start
       end
     end

+    def test_comment
+      doc = parse(<<-XML)
+<?x y
+<!--?><?x -->?>
+<r/>
+      XML
+      assert_equal([["x", "y\n<!--"],
+                    ["x", "-->"]],
+                   [[doc.children[0].target, doc.children[0].content],
+                    [doc.children[1].target, doc.children[1].content]])
+    end
+
+    def test_before_root
+      parser = REXML::Parsers::BaseParser.new('<?abc version="1.0" ?><a></a>')
+
+      events = {}
+      while parser.has_next?
+        event = parser.pull
+        events[event[0]] = event[1]
+      end
+
+      assert_equal("abc", events[:processing_instruction])
+    end
+
     def test_after_root
       parser = REXML::Parsers::BaseParser.new('<a></a><?abc version="1.0" ?>')

17 changes: 17 additions & 0 deletions test/parse/test_text.rb
@@ -4,6 +4,23 @@
 module REXMLTests
   class TestParseText < Test::Unit::TestCase
     class TestInvalid < self
+      def test_before_root
+        exception = assert_raise(REXML::ParseException) do
+          parser = REXML::Parsers::BaseParser.new('b<a></a>')
+          while parser.has_next?
+            parser.pull
+          end
+        end
+
+        assert_equal(<<~DETAIL.chomp, exception.to_s)
+          Malformed XML: Content at the start of the document (got 'b')
+          Line: 1
+          Position: 4
+          Last 80 unconsumed characters:
+          <a>
+        DETAIL
+      end
+
       def test_after_root
         exception = assert_raise(REXML::ParseException) do
           parser = REXML::Parsers::BaseParser.new('<a></a>c')