html: add option to set MaxBuf in Parse #214

Open
wants to merge 1 commit into
base: master

@Jarcis-cy commented Jun 20, 2024

I ran into a problem with html.Parse, which goes through the following
call chain: html.Parse -> ParseWithOptions -> p.parse() -> p.tokenizer.Next()
-> readByte(). The readByte() function contains this check:

if z.maxBuf > 0 && z.raw.end-z.raw.start >= z.maxBuf {
    z.err = ErrBufferExceeded
    return 0
}

This check only takes effect when maxBuf is set to a positive value. However,
html.Parse offers no way to call SetMaxBuf, and there is no exported
ParseOption that sets it through ParseWithOptions either. As a result, parsing
a very large HTML document, such as the page at
http://vod.culture.ihns.cas.cn, can increase memory usage significantly.
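To make the effect of that check concrete, here is a minimal, self-contained sketch. It is a toy model, not the x/net/html source: toyTokenizer, consume, and errBufferExceeded are hypothetical names introduced only for illustration, and the byte source is a plain slice rather than an io.Reader. It assumes that a maxBuf of zero means "no limit", as the condition above implies.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// errBufferExceeded stands in for html.ErrBufferExceeded in this toy model.
var errBufferExceeded = errors.New("max buffer exceeded")

// toyTokenizer is a hypothetical, simplified stand-in for the real
// tokenizer. It treats the whole input as a single token, so start
// never advances.
type toyTokenizer struct {
	input  []byte
	start  int // start of the currently buffered token
	end    int // one past the last byte read
	maxBuf int // 0 disables the limit
	err    error
}

// readByte returns the next byte and applies the same style of size
// check as the snippet quoted above.
func (z *toyTokenizer) readByte() byte {
	if z.end >= len(z.input) {
		z.err = io.EOF
		return 0
	}
	x := z.input[z.end]
	z.end++
	if z.maxBuf > 0 && z.end-z.start >= z.maxBuf {
		z.err = errBufferExceeded
		return 0
	}
	return x
}

// consume reads the input until EOF or until the limit is hit. It
// returns the number of bytes read and any error other than io.EOF.
func consume(input string, maxBuf int) (int, error) {
	z := &toyTokenizer{input: []byte(input), maxBuf: maxBuf}
	for z.err == nil {
		z.readByte()
	}
	if z.err == io.EOF {
		return z.end, nil
	}
	return z.end, z.err
}

func main() {
	// No limit: the whole input is read.
	fmt.Println(consume("<p>hello</p>", 0))
	// A small limit: reading stops early with errBufferExceeded.
	fmt.Println(consume("<p>hello</p>", 4))
}
```

With no limit the model keeps buffering for as long as input arrives, which is the behavior behind the memory growth described above.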

To solve this problem, I wrote a function using reflection:

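// Requires imports: reflect, unsafe, golang.org/x/net/html.
func ParseOptionSetMaxBuf(maxBuf int) html.ParseOption {
    // Build a func(*parser) value at runtime. The parser type is unexported,
    // so its type is taken from the first parameter of html.ParseOption.
    funcValue := reflect.MakeFunc(
        reflect.FuncOf([]reflect.Type{reflect.TypeOf((*html.ParseOption)(nil)).Elem().In(0)}, nil, false),
        func(args []reflect.Value) (results []reflect.Value) {
            parserValue := args[0].Elem()

            // Reach the unexported tokenizer field through unsafe, since
            // reflect alone refuses Interface() on unexported fields.
            tokenizerField := parserValue.FieldByName("tokenizer")
            tokenizerPtr := reflect.NewAt(tokenizerField.Type(), unsafe.Pointer(tokenizerField.UnsafeAddr())).Elem().Interface()

            // Call SetMaxBuf only if the field's value provides it.
            if tokenizer, ok := tokenizerPtr.(interface {
                SetMaxBuf(int)
            }); ok {
                tokenizer.SetMaxBuf(maxBuf)
            }

            return nil
        },
    )
    var option html.ParseOption
    reflect.ValueOf(&option).Elem().Set(funcValue)
    return option
}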

I then used it as follows:

html.ParseWithOptions(bytes.NewReader(data), util.ParseOptionSetMaxBuf(len(data)*3))

In my testing, setting maxBuf to at least 1.04 times the body length was
enough for parsing to complete normally.

Therefore, would it be feasible to introduce a function similar to
ParseOptionEnableScripting that allows users to set MaxBuf?
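For illustration, such an option could follow the same functional-option pattern. The sketch below is self-contained and is not the patch under review: the tokenizer and parser types are hypothetical stand-ins for the unexported types in the html package, newParser is an invented helper, and the shape of ParseOption (a function taking a *parser) is assumed from the reflection workaround above.

```go
package main

import "fmt"

// tokenizer is a hypothetical stand-in that models only the buffer limit.
type tokenizer struct{ maxBuf int }

// SetMaxBuf records the limit, mirroring the method name used above.
func (z *tokenizer) SetMaxBuf(n int) { z.maxBuf = n }

// parser is a hypothetical stand-in for the unexported parser type.
type parser struct {
	tokenizer *tokenizer
	scripting bool
}

// ParseOption configures a parser before parsing starts (assumed shape).
type ParseOption func(p *parser)

// ParseOptionEnableScripting is modelled here only to show the pattern
// that the new option would copy.
func ParseOptionEnableScripting(enable bool) ParseOption {
	return func(p *parser) { p.scripting = enable }
}

// ParseOptionSetMaxBuf is the hypothetical option requested above: it
// forwards the limit to the parser's tokenizer.
func ParseOptionSetMaxBuf(maxBuf int) ParseOption {
	return func(p *parser) { p.tokenizer.SetMaxBuf(maxBuf) }
}

// newParser is an invented helper that applies options in order, in the
// way ParseWithOptions is assumed to.
func newParser(opts ...ParseOption) *parser {
	p := &parser{tokenizer: &tokenizer{}, scripting: true}
	for _, opt := range opts {
		opt(p)
	}
	return p
}

func main() {
	p := newParser(ParseOptionEnableScripting(false), ParseOptionSetMaxBuf(1<<20))
	fmt.Println(p.scripting, p.tokenizer.maxBuf)
}
```

An option of this shape would let callers pass the limit to ParseWithOptions directly, with no need for reflect or unsafe.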

Environment:

  • Go version: 1.21
  • OS: Tested on Ubuntu 22.04 and Windows 11

@gopherbot
Contributor

This PR (HEAD: cbd34d5) has been imported to Gerrit for code review.

Please visit Gerrit at https://go-review.googlesource.com/c/net/+/593635.

Important tips:

  • Don't comment on this PR. All discussion takes place in Gerrit.
  • You need a Gmail or other Google account to log in to Gerrit.
  • To change your code in response to feedback:
    • Push a new commit to the branch used by your GitHub PR.
    • A new "patch set" will then appear in Gerrit.
    • Respond to each comment by marking as Done in Gerrit if implemented as suggested. You can alternatively write a reply.
    • Critical: you must click the blue Reply button near the top to publish your Gerrit responses.
    • Multiple commits in the PR will be squashed by GerritBot.
  • The title and description of the GitHub PR are used to construct the final commit message.
    • Edit these as needed via the GitHub web interface (not via Gerrit or git).
    • You should word wrap the PR description at ~76 characters unless you need longer lines (e.g., for tables or URLs).
  • See the Sending a change via GitHub and Reviews sections of the Contribution Guide as well as the FAQ for details.

@gopherbot
Contributor

Message from Gopher Robot:

Patch Set 1:

(1 comment)


Please don’t reply on this GitHub thread. Visit golang.org/cl/593635.
After addressing review feedback, remember to publish your drafts!

@gopherbot
Contributor

Message from Gopher Robot:

Patch Set 1:

Congratulations on opening your first change. Thank you for your contribution!

Next steps:
A maintainer will review your change and provide feedback. See
https://go.dev/doc/contribute#review for more info and tips to get your
patch through code review.

Most changes in the Go project go through a few rounds of revision. This can be
surprising to people new to the project. The careful, iterative review process
is our way of helping mentor contributors and ensuring that their contributions
have a lasting impact.

During May-July and Nov-Jan the Go project is in a code freeze, during which
little code gets reviewed or merged. If a reviewer responds with a comment like
R=go1.11 or adds a tag like "wait-release", it means that this CL will be
reviewed as part of the next development cycle. See https://go.dev/s/release
for more details.



@Jarcis-cy changed the title from "Add option to set MaxBuf in html.Parse" to "html: add option to set MaxBuf in Parse" on Jun 20, 2024
@gopherbot
Contributor

Message from Ian Lance Taylor:

Patch Set 2: Hold+1

(1 comment)


