Commit

Merge branch 'master' into Stream
# Conflicts:
#	pom.xml
Isira-Seneviratne committed Dec 26, 2023
2 parents 00ad0c7 + d4b2c36 commit 5502bcf
Showing 58 changed files with 2,531 additions and 962 deletions.
36 changes: 36 additions & 0 deletions CHANGES.md
@@ -0,0 +1,36 @@
# jsoup Changelog

## 1.17.2 (Pending)

### Improvements

* Added `Element.attribute(String)` and `Attributes.attribute(String)` to more simply obtain an `Attribute` object.
[2069](https://github.com/jhy/jsoup/issues/2069)
* If source tracking is on, and an Attribute's key is changed (via `Attribute.setKey(String)`), the source range is
now still tracked in `Attribute.sourceRange()`. [2070](https://github.com/jhy/jsoup/issues/2070)
* Added support for the `[*]` selector, which matches elements that have any attribute. Also restored support for
selecting by an empty attribute name prefix (`[^]`). [2079](https://github.com/jhy/jsoup/issues/2079)

### Bug Fixes

* When tracking the source position of attributes, if the source attribute name was mixed-case but the parser was
lower-case normalizing attribute names, the source position for that attribute was not tracked
correctly. [2067](https://github.com/jhy/jsoup/issues/2067)
* When tracking the source position of a body fragment parse, a null pointer exception was
thrown. [2068](https://github.com/jhy/jsoup/issues/2068)
* A multi-point encoded emoji entity may be incorrectly decoded to the replacement
character. [2074](https://github.com/jhy/jsoup/issues/2074)
* (Regression) in a selector like `parent [attr=va], other`, the `, OR` was binding to `[attr=va]` instead of
`parent [attr=va]`, causing incorrect selections. The fix includes an EvaluatorDebug class that generates a sexpr
to represent the query, allowing simpler and more thorough query parse
tests. [2073](https://github.com/jhy/jsoup/issues/2073)
* When generating XML-syntax output from parsed HTML, script nodes containing (pseudo) CData sections would have an
extraneous CData section added, causing script execution errors. Now, the data content is emitted in an HTML/XML/XHTML
polyglot format, if the data is not already within a CData section. [2078](https://github.com/jhy/jsoup/issues/2078)
* The `:has` evaluator held a non-thread-safe Iterator, and so if an Evaluator object was shared across multiple
concurrent threads, a NoSuchElement exception may be thrown, and the selected results may be incorrect. Now, the
iterator object is a thread-local. [2088](https://github.com/jhy/jsoup/issues/2088)
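
The thread-local approach described in the last fix can be illustrated with a minimal, standard-library-only sketch. It is not jsoup's actual `:has` implementation: `Cursor`, `SharedMatcher`, and `allThreadsAgree` are hypothetical names used only to show how one shared object can give each thread its own mutable, reusable iteration state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ThreadLocalCursorSketch {
    /** Hypothetical reusable cursor; mutable, so unsafe to share between threads. */
    static final class Cursor {
        private List<Integer> items;
        private int pos;

        void restart(List<Integer> items) { this.items = items; this.pos = 0; }
        boolean hasNext() { return pos < items.size(); }
        int next() { return items.get(pos++); }
    }

    /** Hypothetical shared object: one instance, but one Cursor per calling thread. */
    static final class SharedMatcher {
        private final ThreadLocal<Cursor> cursor = ThreadLocal.withInitial(Cursor::new);

        int sum(List<Integer> items) {
            Cursor c = cursor.get(); // this thread's own cursor
            c.restart(items);
            int total = 0;
            while (c.hasNext()) total += c.next();
            return total;
        }
    }

    /** Runs many concurrent sums against one shared matcher and checks each result. */
    static boolean allThreadsAgree(int threads, int size) throws Exception {
        List<Integer> items = new ArrayList<>();
        for (int i = 1; i <= size; i++) items.add(i);
        int expected = size * (size + 1) / 2;

        SharedMatcher matcher = new SharedMatcher();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int t = 0; t < threads * 10; t++)
                results.add(pool.submit(() -> matcher.sum(items)));
            for (Future<Integer> f : results)
                if (f.get() != expected) return false;
            return true;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("all threads agree: " + allThreadsAgree(4, 100));
    }
}
```

If the `Cursor` were instead a plain field of `SharedMatcher`, concurrent calls could reset or advance each other's position, which is the kind of failure the changelog entry describes.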

---
Older changes for versions 0.1.1 (2010-Jan-31) through 1.17.1 (2023-Nov-27) may be found in
[change-archive.txt](./change-archive.txt).
4 changes: 2 additions & 2 deletions README.md
@@ -1,6 +1,6 @@
# jsoup: Java HTML Parser

**jsoup** is a Java library that simplifies working with real-world HTML and XML. It offers an easy-to-use API for URL fetching, data parsing, extraction, and manipulation using DOM API methods, CSS, and xpath selectors.
**jsoup** is a Java library that makes it easy to work with real-world HTML and XML. It offers an easy-to-use API for URL fetching, data parsing, extraction, and manipulation using DOM API methods, CSS, and xpath selectors.

**jsoup** implements the [WHATWG HTML5](https://html.spec.whatwg.org/multipage/) specification, and parses HTML to the same DOM as modern browsers.

@@ -42,7 +42,7 @@ jsoup is an open source project distributed under the liberal [MIT license](http
When used in Android projects, [core library desugaring](https://developer.android.com/studio/write/java8-support#library-desugaring) with the [NIO specification](https://developer.android.com/studio/write/java11-nio-support-table) should be enabled to support Java 8+ features.

## Development and support
If you have any questions on how to use jsoup, or have ideas for future development, please get in touch via the [mailing list](https://jsoup.org/discussion).
If you have any questions on how to use jsoup, or have ideas for future development, please get in touch via [jsoup Discussions](https://github.com/jhy/jsoup/discussions).

If you find any issues, please file a [bug](https://jsoup.org/bugs) after checking for duplicates.

41 changes: 39 additions & 2 deletions CHANGES → change-archive.txt
@@ -1,6 +1,9 @@
jsoup changelog
jsoup Changelog Archive

Release 1.17.1 [PENDING]
Contains change notes for versions 0.1.1 (2010-Jan-31) through 1.17.1 (2023-Nov-27).
More recent changes may be found in CHANGES.md.

Release 1.17.1 [27-Nov-2023]
* Improvement: in Jsoup.connect(), added support for request-level authentication, supporting authentication to
proxies and to servers.
<https://github.com/jhy/jsoup/pull/2046>
@@ -23,6 +26,25 @@ Release 1.17.1 [PENDING]
* Improvement: repackaged the library with native (vs automatic) JPMS module support.
<https://github.com/jhy/jsoup/pull/2025>

* Improvement: better fidelity of source positions when tracking is enabled. And implicitly created or closed elements
are tracked and detectable via Range.isImplicit().
<https://github.com/jhy/jsoup/pull/2056>

* Improvement: when source tracking is enabled, the source position for attribute names and values is now available.
Attribute#sourceRange() provides the ranges.
<https://github.com/jhy/jsoup/pull/2057>

* Improvement: when running concurrently under Java 21+ Virtual Threads, virtual threads could be pinned to their
carrier platform thread when parsing an input stream. To improve performance, particularly when parsing fetched
URLs, the internal ConstrainableInputStream has been replaced by ControllableInputStream, which avoids the locking
which caused that pinning.
<https://github.com/jhy/jsoup/issues/2054>

* Improvement: in Jsoup.Connect, allow any XML mimetype as a supported mimetype. Was previously limited to
`{application|text}/xml`. This enables, for example, fetching SVGs with an image/svg+xml mimetype, without having to
disable mimetype validation.
<https://github.com/jhy/jsoup/issues/2059>

* Bugfix: when outputting with XML syntax, HTML elements that were parsed as data nodes (<script> and <style>) should
be emitted as CDATA nodes, so that they can be parsed correctly by an XML parser.
<https://github.com/jhy/jsoup/pull/1720>
@@ -38,12 +60,27 @@ Release 1.17.1 [PENDING]
* Bugfix: in W3CDom, if the jsoup input document contained an empty doctype, the conversion would fail with a
DOMException. Now, said doctype is discarded, and the conversion continues.

* Bugfix: when cleaning a document containing SVG elements (or other foreign elements that have preserved case names),
the cleaned output would be incorrectly nested if the safelist had a different case than the input document.
<https://github.com/jhy/jsoup/issues/2049>

* Bugfix: when cleaning a document, the output style of unknown self-closing tags from the input was not preserved in
the output. (So a <foo /> in the input, if safe-listed, would be output as <foo></foo>.)
<https://github.com/jhy/jsoup/issues/2049>

* Build Improvement: added a local test proxy implementation, for proxy integration tests.
<https://github.com/jhy/jsoup/pull/2029>

* Build Improvement: added tests for HTTPS request support, using a local self-signed cert. Includes proxy tests.
<https://github.com/jhy/jsoup/pull/2032>

* Change: the InputStream returned in Connection.Response.bodyStream() is no longer a ConstrainableInputStream, and
so is not subject to settings such as timeout or maximum size. It is now a plain BufferedInputStream around the
response stream. Whilst this behaviour was not documented, you may have been inadvertently relying on those
constraints. The constraints are still applied to other methods such as .parse() and .bufferUp(). So if you do want
a constrained BufferedInputStream, you may do Connection.Response.bufferUp().bodyStream().
<https://github.com/jhy/jsoup/issues/2054>

Release 1.16.2 [20-Oct-2023]
* Improvement: optimized the performance of complex CSS selectors, by adding a cost-based query planner. Evaluators
are sorted by their relative execution cost, and executed in order of lower to higher cost. This speeds the
20 changes: 12 additions & 8 deletions pom.xml
@@ -5,7 +5,7 @@

<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.17.1-SNAPSHOT</version><!-- remember to update previous version below for japicmp -->
<version>1.17.2-SNAPSHOT</version><!-- remember to update previous version below for japicmp -->
<url>https://jsoup.org/</url>
<description>jsoup is a Java library that simplifies working with real-world HTML and XML. It offers an easy-to-use API for URL fetching, data parsing, extraction, and manipulation using DOM API methods, CSS, and xpath selectors. jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers.</description>
<inceptionYear>2009</inceptionYear>
@@ -42,7 +42,7 @@
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.11.0</version>
<version>3.12.0</version>
<configuration>
<encoding>UTF-8</encoding>
<compilerArgs>
@@ -88,14 +88,17 @@
<version>2.3.3_r2</version>
</signature>
<ignores>
<ignore>java.util.Set</ignore> <!-- Set#stream() -->
<ignore>java.util.function.*</ignore>
<ignore>java.util.stream.*</ignore>
<ignore>java.lang.ThreadLocal</ignore>
<ignore>java.io.UncheckedIOException</ignore>
<ignore>java.util.stream.*</ignore>
<ignore>java.util.List</ignore> <!-- List#stream() -->
<ignore>java.util.Objects</ignore>
<ignore>java.util.Optional</ignore>
<ignore>java.util.Set</ignore> <!-- Set#stream() -->
<ignore>java.util.Spliterator</ignore>
<ignore>java.util.Spliterators</ignore>
<ignore>java.util.Optional</ignore>

<ignore>java.net.HttpURLConnection</ignore><!-- .setAuthenticator(java.net.Authenticator) in Java 9; only used in multirelease 9+ version -->
</ignores>
@@ -108,10 +111,11 @@
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-javadoc-plugin</artifactId>
<version>3.6.2</version>
<version>3.6.3</version>
<configuration>
<doclint>none</doclint>
<source>8</source>
<linksource>true</linksource>
</configuration>
<executions>
<execution>
@@ -192,15 +196,15 @@
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>3.2.2</version>
<version>3.2.3</version>
<configuration>
<!-- smaller stack to find stack overflows -->
<argLine>-Xss256k</argLine>
</configuration>
</plugin>
<plugin>
<artifactId>maven-failsafe-plugin</artifactId>
<version>3.2.2</version>
<version>3.2.3</version>
<executions>
<execution>
<goals>
@@ -372,7 +376,7 @@
<plugins>
<plugin>
<artifactId>maven-failsafe-plugin</artifactId>
<version>3.2.2</version>
<version>3.2.3</version>
<executions>
<execution>
<goals>
16 changes: 11 additions & 5 deletions src/main/java/org/jsoup/Connection.java
@@ -211,13 +211,14 @@ default Connection newRequest(URL url) {
/**
* Add an input stream as a request data parameter. For GETs, has no effect, but for POSTs this will upload the
* input stream.
* <p>Use the {@link #data(String, String, InputStream, String)} method to set the uploaded file's mimetype.</p>
* @param key data key (form item name)
* @param filename the name of the file to present to the remote server. Typically just the name, not path,
* component.
* @param inputStream the input stream to upload, that you probably obtained from a {@link java.io.FileInputStream}.
* You must close the InputStream in a {@code finally} block.
* @return this Connection, for chaining
* @see #data(String, String, InputStream, String) if you want to set the uploaded file's mimetype.
* @see #data(String, String, InputStream, String)
*/
Connection data(String key, String filename, InputStream inputStream);

@@ -871,10 +872,15 @@ interface Response extends Base<Response> {
Response bufferUp();

/**
* Get the body of the response as a (buffered) InputStream. You should close the input stream when you're done with it.
* Other body methods (like bufferUp, body, parse, etc) will not work in conjunction with this method.
* <p>This method is useful for writing large responses to disk, without buffering them completely into memory first.</p>
* @return the response body input stream
Get the body of the response as a (buffered) InputStream. You should close the input stream when you're done
with it.
<p>Other body methods (like bufferUp, body, parse, etc) will generally not work in conjunction with this method,
as it consumes the InputStream.</p>
<p>Any configured max size or maximum read timeout applied to the connection will not be applied to this stream,
unless {@link #bufferUp()} is called prior.</p>
<p>This method is useful for writing large responses to disk, without buffering them completely into memory
first.</p>
@return the response body input stream
*/
BufferedInputStream bodyStream();
}
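
The Javadoc above notes that the body stream is useful for writing large responses to disk without buffering them fully in memory. A minimal sketch of that pattern, using only the standard library: `saveTo` is a hypothetical helper, and the `ByteArrayInputStream` stands in for a stream that would, in real use, come from `Connection.Response.bodyStream()`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class StreamToDiskSketch {
    /**
     * Hypothetical helper: streams the body to the target file and closes the
     * stream when done, so the whole body is never held in memory at once.
     */
    static Path saveTo(InputStream body, Path target) throws IOException {
        try (InputStream in = body) { // try-with-resources closes the stream
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a response body; assumed data for illustration only.
        InputStream body = new ByteArrayInputStream(new byte[256 * 1024]);
        Path target = Files.createTempFile("response-body", ".bin");
        saveTo(body, target);
        System.out.println("wrote " + Files.size(target) + " bytes to " + target);
        Files.delete(target);
    }
}
```

Because the stream is consumed as it is copied, the caller should not expect to read the same body again afterwards, in line with the Javadoc's caution about other body methods.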
2 changes: 1 addition & 1 deletion src/main/java/org/jsoup/examples/ListLinks.java
@@ -24,7 +24,7 @@ public static void main(String[] args) throws IOException {

print("\nMedia: (%d)", media.size());
for (Element src : media) {
if (src.normalName().equals("img"))
if (src.nameIs("img"))
print(" * %s: <%s> %sx%s (%s)",
src.tagName(), src.attr("abs:src"), src.attr("width"), src.attr("height"),
trim(src.attr("alt"), 20));
18 changes: 9 additions & 9 deletions src/main/java/org/jsoup/helper/DataUtil.java
@@ -1,7 +1,8 @@
package org.jsoup.helper;

import org.jsoup.internal.ConstrainableInputStream;
import org.jsoup.internal.ControllableInputStream;
import org.jsoup.internal.Normalizer;
import org.jsoup.internal.SharedConstants;
import org.jsoup.internal.StringUtil;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
@@ -32,6 +33,8 @@
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;

import static org.jsoup.internal.SharedConstants.DefaultBufferSize;

/**
* Internal static utilities for handling data.
*
@@ -42,7 +45,6 @@ public final class DataUtil {
public static final Charset UTF_8 = Charset.forName("UTF-8"); // Don't use StandardCharsets, as those only appear in Android API 19, and we target 10.
static final String defaultCharsetName = UTF_8.name(); // used if not found in header or meta charset
private static final int firstReadBufferSize = 1024 * 5;
static final int bufferSize = 1024 * 32;
private static final char[] mimeBoundaryChars =
"-_1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ".toCharArray();
static final int boundaryLength = 32;
@@ -127,7 +129,7 @@ public static Document load(InputStream in, @Nullable String charsetName, String
* @throws IOException on IO error
*/
static void crossStreams(final InputStream in, final OutputStream out) throws IOException {
final byte[] buffer = new byte[bufferSize];
final byte[] buffer = new byte[DefaultBufferSize];
int len;
while ((len = in.read(buffer)) != -1) {
out.write(buffer, 0, len);
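
For reference, the copy loop in this hunk can be written as a standalone, runnable sketch. The class name, the `copy` method, and the returned byte count are illustrative additions. The 32 KiB buffer matches the removed `bufferSize` constant; the value of `SharedConstants.DefaultBufferSize` is not shown in this diff, so treating it as the same size is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class CrossStreamsSketch {
    // Assumed size: equals the removed bufferSize constant (1024 * 32).
    static final int BUFFER_SIZE = 1024 * 32;

    /** Copies all bytes from in to out through a fixed buffer; returns the byte count. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        final byte[] buffer = new byte[BUFFER_SIZE];
        long total = 0;
        int len;
        // read() returns -1 at end of stream; it may return fewer bytes than the buffer holds
        while ((len = in.read(buffer)) != -1) {
            out.write(buffer, 0, len); // write only the bytes actually read
            total += len;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(new byte[70_000]), out);
        System.out.println("copied " + copied + " bytes");
    }
}
```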
@@ -137,13 +139,13 @@ static void crossStreams(final InputStream in, final OutputStream out) throws IO
static Document parseInputStream(@Nullable InputStream input, @Nullable String charsetName, String baseUri, Parser parser) throws IOException {
if (input == null) // empty body
return new Document(baseUri);
input = ConstrainableInputStream.wrap(input, bufferSize, 0);
input = ControllableInputStream.wrap(input, DefaultBufferSize, 0);

@Nullable Document doc = null;

// read the start of the stream and look for a BOM or meta charset
try {
input.mark(bufferSize);
input.mark(DefaultBufferSize);
ByteBuffer firstBytes = readToByteBuffer(input, firstReadBufferSize - 1); // -1 because we read one more to see if completed. First read is < buffer size, so can't be invalid.
boolean fullyRead = (input.read() == -1);
input.reset();
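
The mark / read / reset sequence in this hunk lets the parser inspect the start of the stream and then rewind it. A minimal standalone sketch of the same idea, limited to detecting a UTF-8 byte order mark: `startsWithUtf8Bom` is a hypothetical helper and much narrower than jsoup's actual BOM and meta charset detection.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class BomPeekSketch {
    /**
     * Hypothetical helper: checks whether the stream starts with the UTF-8 BOM
     * (EF BB BF), then rewinds so the caller can read from the beginning.
     */
    static boolean startsWithUtf8Bom(InputStream in) throws IOException {
        if (!in.markSupported())
            throw new IllegalArgumentException("stream must support mark/reset");
        in.mark(3); // remember the position; allow up to 3 bytes of read-ahead
        byte[] head = new byte[3];
        int read = 0;
        while (read < 3) { // a single read() call may return fewer bytes than requested
            int n = in.read(head, read, 3 - read);
            if (n == -1) break;
            read += n;
        }
        in.reset(); // rewind to the marked position
        return read == 3
            && (head[0] & 0xFF) == 0xEF
            && (head[1] & 0xFF) == 0xBB
            && (head[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(withBom));
        System.out.println("has BOM: " + startsWithUtf8Bom(in));
        System.out.println("first byte after peek: 0x" + Integer.toHexString(in.read()));
    }
}
```

The read-ahead limit passed to `mark` must cover everything read before `reset`, which is why the hunk marks with the full buffer size while reading a smaller first chunk.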
@@ -206,7 +208,7 @@ else if (first instanceof Comment) {
if (doc == null) {
if (charsetName == null)
charsetName = defaultCharsetName;
BufferedReader reader = new BufferedReader(new InputStreamReader(input, Charset.forName(charsetName)), bufferSize); // Android level does not allow us try-with-resources
BufferedReader reader = new BufferedReader(new InputStreamReader(input, Charset.forName(charsetName)), DefaultBufferSize); // Android level does not allow us try-with-resources
try {
if (bomCharset != null && bomCharset.offset) { // creating the buffered reader ignores the input pos, so must skip here
long skipped = reader.skip(1);
@@ -245,9 +247,7 @@ else if (first instanceof Comment) {
* @throws IOException if an exception occurs whilst reading from the input stream.
*/
public static ByteBuffer readToByteBuffer(InputStream inStream, int maxSize) throws IOException {
Validate.isTrue(maxSize >= 0, "maxSize must be 0 (unlimited) or larger");
final ConstrainableInputStream input = ConstrainableInputStream.wrap(inStream, bufferSize, maxSize);
return input.readToByteBuffer(maxSize);
return ControllableInputStream.readToByteBuffer(inStream, maxSize);
}

static ByteBuffer emptyByteBuffer() {