Fix reading HTML via XmlReader API when root element is absent #16

Open
wants to merge 1 commit into
base: master

greatvovan

Consider the following snippet:

            // Requires: using System.IO; using Sgml; using static System.Console;
            string toParse = "<p>one</p><p>two</p><p>three</p>";
            var sr = new StringReader(toParse);
            const string NULL = "NULL";

            using (var reader = new SgmlReader()
                {InputStream = sr, DocType = "HTML", IgnoreDtd = true})
            {
                // Print every node the reader yields until it reports end of input.
                while (!reader.EOF)
                {
                    reader.Read();
                    WriteLine($"[{reader.NodeType,10}] {reader.Name ?? NULL,5}: " +
                        $"{reader.Value ?? NULL}");
                }
            }

On version 1.8.12, SgmlReader fails to read the whole document because it reaches this section twice:

            if (this.Depth == 1)
            {
                if (this.m_rootCount == 1)
                {
                    // Hmmm, we found another root level tag, soooo, the only
                    // thing we can do to keep this a valid XML document is stop
                    this.m_state = State.Eof;
                    return false;
                }
                this.m_rootCount++;
            }

It returns only the first <p> element and the opening tag of the second.

This worked in the good old version 1.8.6, but I don't have its source code: the GitHub history starts at version 1.8.7, which is already broken.

This commit fixes the behavior when the user has explicitly set DocType to "HTML".
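The diff itself is not shown in this conversation, so the following is only a simplified model of the root check quoted above and of one way the described fix could gate it. `RootCheckSketch`, `ShouldStop`, `IsExplicitHtml`, and `AcceptedRoots` are hypothetical names, and the case-insensitive comparison of DocType is an assumption, not something confirmed by the PR.

```csharp
using System;

public static class RootCheckSketch
{
    // Mirrors the quoted check: stop when a second root-level tag appears.
    // The allowMultipleRoots flag is hypothetical; it stands in for
    // "the user explicitly set DocType to HTML".
    public static bool ShouldStop(int depth, int rootCount, bool allowMultipleRoots)
    {
        return depth == 1 && rootCount == 1 && !allowMultipleRoots;
    }

    // Assumption: the DocType value is compared case-insensitively.
    public static bool IsExplicitHtml(string docType)
    {
        return string.Equals(docType, "HTML", StringComparison.OrdinalIgnoreCase);
    }

    // Simplified simulation: how many of `rootTags` consecutive root-level
    // tags pass the check before the reader would switch to the Eof state.
    public static int AcceptedRoots(int rootTags, string docType)
    {
        bool allowMultipleRoots = IsExplicitHtml(docType);
        int rootCount = 0;
        int accepted = 0;
        for (int i = 0; i < rootTags; i++)
        {
            if (ShouldStop(1, rootCount, allowMultipleRoots))
            {
                break; // corresponds to m_state = State.Eof; return false;
            }
            rootCount++;
            accepted++;
        }
        return accepted;
    }
}
```

In this model, three sibling `<p>` tags with no DocType stop after the first root, while the same input with DocType set to "HTML" passes all three through the check.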

@UweKeim

UweKeim commented May 24, 2016

Awesome, this is exactly the problem I'm having here, too. Hopefully it gets integrated and published to NuGet soon.

I'm now wrapping my HTML fragments inside an artificial <div> root tag to fulfil the requirement of one root only. Of course, this only works when reading from the document, not when modifying it.
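The wrapping workaround described above might look roughly like this. It is a minimal sketch, not code from this PR: `FragmentWrapper`, `WrapInRoot`, and `TopLevelElementNames` are hypothetical names, and the reader comes from a caller-supplied factory so the block compiles against the base class library alone. It assumes the first element the reader reports is the artificial wrapper.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class FragmentWrapper
{
    // Hypothetical wrapper element; any element allowed around the fragment would do.
    public const string RootName = "div";

    // Surrounds a multi-root fragment with a single artificial root element.
    public static string WrapInRoot(string fragment)
    {
        return "<" + RootName + ">" + fragment + "</" + RootName + ">";
    }

    // Returns the names of the fragment's own top-level elements,
    // skipping the artificial root that was added around them.
    public static List<string> TopLevelElementNames(
        string fragment, Func<TextReader, XmlReader> createReader)
    {
        var names = new List<string>();
        int wrapperDepth = -1; // depth of the artificial root, once seen

        using (var reader = createReader(new StringReader(WrapInRoot(fragment))))
        {
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element)
                {
                    continue;
                }
                if (wrapperDepth < 0)
                {
                    // Assumption: the first element reported is the wrapper.
                    wrapperDepth = reader.Depth;
                    continue;
                }
                // Depth is compared relative to the wrapper, so the sketch does
                // not depend on which depth a reader assigns to the root.
                if (reader.Depth == wrapperDepth + 1)
                {
                    names.Add(reader.Name);
                }
            }
        }
        return names;
    }
}
```

To try it with SgmlReader, the factory could be `text => new SgmlReader { InputStream = text, DocType = "HTML", IgnoreDtd = true }`, using the same properties as the snippet at the top of this PR. Because the wrapper exists only in the string handed to the reader, this approach suits reading and does not help when the document has to be modified and written back.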

@lovettchris

I'm worried this change is too HTML-centric. What about SgmlReader over other types of data, like OFX?

@greatvovan
Author

@lovettchris I cannot see how it affects OFX and other types. What prompted this particular concern?
