Parse in quirksmode if no doctype html
Fixes #2197
jhy committed Sep 10, 2024
1 parent d3104a0 commit 8601e85
Showing 3 changed files with 25 additions and 2 deletions.
2 changes: 2 additions & 0 deletions CHANGES.md
@@ -22,6 +22,8 @@
 character. [2169](https://github.com/jhy/jsoup/issues/2169)
 * When tracking source ranges, a text node following an invalid self-closing element may be left
 untracked.[2175](https://github.com/jhy/jsoup/issues/2175)
+* When a document has no doctype, or a doctype not named `html`, it should be parsed in Quirks
+Mode. [2197](https://github.com/jhy/jsoup/issues/2197)

 ## 1.18.1 (2024-Jul-10)

5 changes: 3 additions & 2 deletions src/main/java/org/jsoup/parser/HtmlTreeBuilderState.java
@@ -25,18 +25,19 @@ enum HtmlTreeBuilderState {
                 tb.insertCommentNode(t.asComment());
             } else if (t.isDoctype()) {
                 // todo: parse error check on expected doctypes
-                // todo: quirk state check on doctype ids
                 Token.Doctype d = t.asDoctype();
                 DocumentType doctype = new DocumentType(
                     tb.settings.normalizeTag(d.getName()), d.getPublicIdentifier(), d.getSystemIdentifier());
                 doctype.setPubSysKey(d.getPubSysKey());
                 tb.getDocument().appendChild(doctype);
                 tb.onNodeInserted(doctype);
-                if (d.isForceQuirks())
+                // todo: quirk state check on more doctype ids, if deemed useful (most are ancient legacy and presumably irrelevant)
+                if (d.isForceQuirks() || !doctype.name().equals("html") || doctype.publicId().equalsIgnoreCase("HTML"))
                     tb.getDocument().quirksMode(Document.QuirksMode.quirks);
                 tb.transition(BeforeHtml);
             } else {
                 // todo: check not iframe srcdoc
+                tb.getDocument().quirksMode(Document.QuirksMode.quirks); // missing doctype
                 tb.transition(BeforeHtml);
                 return tb.process(t); // re-process token
             }
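
For readers skimming the hunk above, the new decision can be sketched as a standalone function. This is only an illustration of the condition's shape, not jsoup code: the class `QuirksDecisionSketch`, the enum `Mode`, and the method `decide` are hypothetical names, and the sketch assumes the doctype name has already been lower-cased and that a missing public identifier is passed as an empty string.

```java
// Standalone sketch of the quirks-mode decision; none of these names exist in jsoup.
final class QuirksDecisionSketch {
    enum Mode { NO_QUIRKS, QUIRKS }

    /**
     * @param hasDoctype  false when the first significant token is not a doctype
     * @param forceQuirks the force-quirks flag carried by the doctype token
     * @param name        doctype name, assumed already lower-cased by the caller
     * @param publicId    public identifier, assumed to be "" when absent
     */
    static Mode decide(boolean hasDoctype, boolean forceQuirks, String name, String publicId) {
        if (!hasDoctype) return Mode.QUIRKS;                       // missing doctype
        if (forceQuirks) return Mode.QUIRKS;                       // tokeniser flagged the doctype
        if (!name.equals("html")) return Mode.QUIRKS;              // doctype not named html
        if (publicId.equalsIgnoreCase("HTML")) return Mode.QUIRKS; // public id check from the hunk
        return Mode.NO_QUIRKS;
    }

    public static void main(String[] args) {
        System.out.println(decide(false, false, "", ""));        // no doctype: QUIRKS
        System.out.println(decide(true, false, "html", ""));     // <!DOCTYPE html>: NO_QUIRKS
        System.out.println(decide(true, false, "foo", ""));      // doctype named foo: QUIRKS
        System.out.println(decide(true, false, "html", "html")); // public id "html": QUIRKS
    }
}
```

The first branch corresponds to the `else` arm of the hunk (no doctype token at all); the remaining three correspond to the combined `if` condition on the doctype token.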
20 changes: 20 additions & 0 deletions src/test/java/org/jsoup/parser/HtmlParserTest.java
@@ -1888,4 +1888,24 @@ private static void assertMathNamespace(Element el) {
         img.ownerDocument().outputSettings().charset("ascii");
         assertEquals("<img multi=\"&#x1f4af;\" single=\"&#x1f4af;\" hexsingle=\"&#x1f4af;\">", img.outerHtml());
     }
+
+    @Test void tableInPInQuirksMode() {
+        // https://github.com/jhy/jsoup/issues/2197
+        String html = "<p><span><table><tbody><tr><td><span>Hello table data</span></td></tr></tbody></table></span></p>";
+        Document doc = Jsoup.parse(html);
+        assertEquals(Document.QuirksMode.quirks, doc.quirksMode());
+        assertEquals(
+            "<p><span><table><tbody><tr><td><span>Hello table data</span></td></tr></tbody></table></span></p>", // quirks, allows table in p
+            TextUtil.normalizeSpaces(doc.body().html())
+        );
+
+        // doctype set, no quirks
+        html ="<!DOCTYPE html><p><span><table><tbody><tr><td><span>Hello table data</span></td></tr></tbody></table></span></p>";
+        doc = Jsoup.parse(html);
+        assertEquals(Document.QuirksMode.noQuirks, doc.quirksMode());
+        assertEquals(
+            "<p><span></span></p><table><tbody><tr><td><span>Hello table data</span></td></tr></tbody></table><p></p>", // no quirks, p gets closed
+            TextUtil.normalizeSpaces(doc.body().html())
+        );
+    }
 }