Some hosts return 404/503/non-200 when links are checked #165

Open
hellt opened this issue Apr 19, 2021 · 7 comments

hellt commented Apr 19, 2021

Describe the bug

Checks of external links to media resources hosted on Twitter, such as https://pbs.twimg.com/media/EuF4GgyXUAEZ3j5?format=jpg, report 404, although curl has no issues with the same URL:

❯ curl -vL https://pbs.twimg.com/media/EuF4GgyXUAEZ3j5\?format\=jpg > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 93.184.220.70...
* TCP_NODELAY set
* Connected to pbs.twimg.com (93.184.220.70) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
} [227 bytes data]
* TLSv1.2 (IN), TLS handshake, Server hello (2):
{ [98 bytes data]
* TLSv1.2 (IN), TLS handshake, Certificate (11):
{ [2937 bytes data]
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
{ [333 bytes data]
* TLSv1.2 (IN), TLS handshake, Server finished (14):
{ [4 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
} [70 bytes data]
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS handshake, Finished (20):
} [16 bytes data]
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
{ [1 bytes data]
* TLSv1.2 (IN), TLS handshake, Finished (20):
{ [16 bytes data]
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=California; L=San Francisco; O=Twitter, Inc.; CN=*.twimg.com
*  start date: Nov  5 00:00:00 2020 GMT
*  expire date: Nov  9 23:59:59 2021 GMT
*  subjectAltName: host "pbs.twimg.com" matched cert's "*.twimg.com"
*  issuer: C=US; O=DigiCert Inc; CN=DigiCert TLS RSA SHA256 2020 CA1
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x7fce8000aa00)
> GET /media/EuF4GgyXUAEZ3j5?format=jpg HTTP/2
> Host: pbs.twimg.com
> User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)
> Accept: */*
> Referer:
>
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
< HTTP/2 200
< accept-ranges: bytes
< access-control-allow-origin: *
< access-control-expose-headers: Content-Length
< age: 106
< cache-control: max-age=604800, must-revalidate
< content-type: image/jpeg
< date: Mon, 19 Apr 2021 09:05:11 GMT
< last-modified: Sat, 13 Feb 2021 08:04:38 GMT
< server: ECS (amb/6B85)
< strict-transport-security: max-age=631138519
< surrogate-key: media media/bucket/4 media/1360500615718326273
< timing-allow-origin: https://twitter.com, https://mobile.twitter.com
< x-cache: HIT
< x-connection-hash: 1223340481ffa7d392cf6199e5d2bd1f
< x-content-type-options: nosniff
< x-response-time: 238
< x-tw-cdn: VZ
< content-length: 79258
<
{ [16383 bytes data]
100 79258  100 79258    0     0   814k      0 --:--:-- --:--:-- --:--:--  814k
* Connection #0 to host pbs.twimg.com left intact
* Closing connection 0

Here is the error from htmltest:

Non-OK status: 404 --- 2021/transparently-redirecting-packets/frames-between-interfaces/index.html --> https://pbs.twimg.com/media/EuF4GgyXUAEZ3j5?format=jpg

To Reproduce

Steps to reproduce the behaviour:

  1. embed a link to a twitter hosted media resource, for example https://pbs.twimg.com/media/EuF4GgyXUAEZ3j5?format=jpg

.htmltest.yml

bare config

Expected behaviour

An error is not reported since the resource is available.

Actual behaviour

404 is returned

hellt added the bug label Apr 19, 2021

jtopper commented Mar 23, 2022

I've found something similar. I believe it's because this service is fronted by Cloudflare, which, not recognising the source of the request, serves up a CAPTCHA page with a 403 instead of the resource. I guess the fix would be to manipulate the requests htmltest makes so that they look more like those of a real browser, but that seems non-trivial.
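For illustration, a minimal Go sketch of what "looking more like a real browser" could mean at the header level. The function name `newBrowserLikeRequest` and all header values are assumptions made for this example; they are not what htmltest sends, and headers alone may well not be enough to get past bot protection.

```go
package main

import (
	"fmt"
	"net/http"
)

// newBrowserLikeRequest (hypothetical helper) builds a GET request carrying
// headers a desktop browser typically sends. The values are illustrative
// assumptions, not htmltest's defaults.
func newBrowserLikeRequest(url string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", "Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0")
	req.Header.Set("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
	req.Header.Set("Accept-Language", "en-US,en;q=0.5")
	return req, nil
}

func main() {
	req, err := newBrowserLikeRequest("https://pbs.twimg.com/media/EuF4GgyXUAEZ3j5?format=jpg")
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	// Print the headers instead of sending, so the sketch runs offline.
	for name, values := range req.Header {
		fmt.Println(name+":", values[0])
	}
}
```

To actually send the request, pass it to `http.DefaultClient.Do(req)` and inspect `resp.StatusCode`.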

Owner

wjdp commented May 28, 2022

I've tested the URLs here with htmltest in two setups: unchanged, and configured with a curl user agent and with the Range header we add removed. The upstream hosts behaved the same in both cases.

| URL | Status (htmltest) | Status (htmltest as curl) |
| --- | --- | --- |
| https://www.php.net/manual/en/book.pcntl.php | 200 | 200 |
| https://play.google.com/store/apps/details?id=com.azure.authenticator&hl=en&gl=US | 404 | 404 |
| https://old.reddit.com/r/golang/comments/teu78z/118_is_released/ | 200 | 200 |
| https://www.reddit.com/r/golang/comments/teu78z/118_is_released/ | 200 | 200 |
| https://pbs.twimg.com/media/EuF4GgyXUAEZ3j5?format=jpg | 404 | 404 |
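A comparison like this can be sketched in Go roughly as follows. This is not the code used for the results above. To keep it runnable offline it talks to a local stand-in server (`newPickyServer`, hypothetical) that returns 403 unless the user agent starts with `curl/`. The `example-checker/1.0` user agent and the `Range: bytes=0-0` value are likewise assumptions for the example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// newPickyServer is a hypothetical stand-in for an upstream host: it answers
// 403 unless the User-Agent starts with "curl/".
func newPickyServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !strings.HasPrefix(r.UserAgent(), "curl/") {
			w.WriteHeader(http.StatusForbidden)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
}

// probeStatus sends a GET with the given headers set and returns the status code.
func probeStatus(url string, headers map[string]string) (int, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}
	for name, value := range headers {
		req.Header.Set(name, value)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	srv := newPickyServer()
	defer srv.Close()

	// Two header profiles; both are assumptions made for this sketch.
	profiles := []struct {
		name    string
		headers map[string]string
	}{
		{"checker-like", map[string]string{"User-Agent": "example-checker/1.0", "Range": "bytes=0-0"}},
		{"curl-like", map[string]string{"User-Agent": "curl/7.64.1", "Accept": "*/*"}},
	}
	for _, p := range profiles {
		code, err := probeStatus(srv.URL, p.headers)
		if err != nil {
			fmt.Println(p.name, "error:", err)
			continue
		}
		fmt.Println(p.name, "->", code)
	}
}
```

Pointing `probeStatus` at the real URLs instead of `srv.URL` would show whether a given host reacts to the header change at all.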

wjdp changed the title from "checks of external links on pbs.twimg.com/media return 404" to "Some hosts return 404/503 when links are checked" May 28, 2022
wjdp changed the title from "Some hosts return 404/503 when links are checked" to "Some hosts return 404/503/non-200 when links are checked" May 28, 2022

arranf commented May 28, 2022

I can provide exact examples that work fine with curl but don't succeed with htmltest. This is reliably reproducible. What kind of logs/output would help you verify?

Owner

wjdp commented May 29, 2022

@arranf Just a list of URLs you've found problematic. I've not pushed the branch yet, but I have been adding these to a unit test to help track them. I then plan to tweak request params (as above, trying to pretend to be curl) to identify what's causing these requests to be blocked.

I doubt we'll have this completely fixed for all hosts but am hoping for an improvement.


arranf commented May 30, 2022

- ^https?://(www\.)?play\.google\.com\b # Always fails with htmltest
- ^https?://(www\.)?crates\.io\b # Always fails with htmltest
- ^https?://(www\.)?lastpass\.com\b # Always fails with htmltest
- ^https?://help\.elgato\.com\b # Always fails with htmltest
- ^https?://uk\.pcpartpicker\.com\b # Always fails with htmltest
- ^https?://(www\.)?corsair\.com\b # Always fails with htmltest
- 'https://docs.github.com/en/get-started/using-git/about-git-rebase' # Not sure why this is 403ing
- ^https?://(www\.)?reddit\.com\b # Always fails with htmltest

This is a list copied from my .htmltest.yml
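To see which links such patterns would cover, here is a small Go sketch that compiles three of them and tests a few URLs from this thread. It assumes the list is used as an ignore list and that patterns are matched as Go regular expressions against the full URL. `isIgnored` is a hypothetical helper, not htmltest's matching code.

```go
package main

import (
	"fmt"
	"regexp"
)

// ignorePatterns holds three of the patterns from the list above.
var ignorePatterns = []*regexp.Regexp{
	regexp.MustCompile(`^https?://(www\.)?play\.google\.com\b`),
	regexp.MustCompile(`^https?://(www\.)?crates\.io\b`),
	regexp.MustCompile(`^https?://(www\.)?reddit\.com\b`),
}

// isIgnored (hypothetical helper) reports whether any pattern matches the URL.
func isIgnored(url string) bool {
	for _, re := range ignorePatterns {
		if re.MatchString(url) {
			return true
		}
	}
	return false
}

func main() {
	urls := []string{
		"https://play.google.com/store/apps/details?id=com.azure.authenticator&hl=en&gl=US",
		"https://www.reddit.com/r/golang/comments/teu78z/118_is_released/",
		"https://old.reddit.com/r/golang/comments/teu78z/118_is_released/",
		"https://www.php.net/manual/en/book.pcntl.php",
	}
	for _, u := range urls {
		fmt.Printf("%-5v %s\n", isIgnored(u), u)
	}
}
```

Because the patterns are anchored with `^` and only allow an optional `www.` prefix, other subdomains such as `old.reddit.com` are not matched by the reddit pattern.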


theory commented Jun 12, 2022

I found a couple more:

  • transunion.com returns a 403. It does the same in curl, but changes to a 301 if I give it a valid user agent.
  • https://bookshop.org/books/project-hail-mary/9780593135204 returns a 403 to htmltest but a 200 to curl.

Contributor

istr commented Dec 2, 2022

And also https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32016R0679: fine for curl, 500 for htmltest.
(But this is due to StripQueryString defaulting to true, and I doubt that this default is the best choice here...)
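To make the effect concrete, here is a small Go sketch of what dropping the query string does to URLs from this thread. It assumes that stripping means removing everything from `?` onward; `stripQuery` is a hypothetical helper, and how htmltest implements the option is not shown in this thread.

```go
package main

import (
	"fmt"
	"net/url"
)

// stripQuery (hypothetical helper) returns rawURL without its query string.
func stripQuery(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	u.RawQuery = ""
	u.ForceQuery = false // avoid a dangling "?" in the output
	return u.String(), nil
}

func main() {
	urls := []string{
		"https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32016R0679",
		"https://pbs.twimg.com/media/EuF4GgyXUAEZ3j5?format=jpg",
	}
	for _, raw := range urls {
		stripped, err := stripQuery(raw)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Println(raw, "->", stripped)
	}
}
```

For hosts where the query string selects the resource, the stripped URL is a different request and can get a different response. Whether that explains any of the other failures reported here would need to be checked per URL, for example by rerunning with StripQueryString set to false.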
