fix(cfscrape): Fix for challenge where part of formula is stored in a div #5462

Merged
3 commits merged into SickChill:develop on May 15, 2019

Conversation

CodyWoolaver
Contributor

@CodyWoolaver commented May 7, 2019

Original credit to https://github.com/lukele - all I have done is carry on the conversations; he did the heavy lifting.

This is not a full version and has not been released by any official cfscrape repo. I am creating this PR to alleviate the numerous pain points that others are experiencing. I have been running this version on my own setup (with a Deluge web API as a search provider) for the last several weeks with zero issues.

I am open to comments and suggestions; however, my actual knowledge of cfscrape is quite limited.

Fixes #5185
Fixes #5420
Fixes #5423

Proposed changes in this pull request:

cc'ing individuals related to ongoing conversations in #5423
@pro-src @mitch71h @lukele @WebSpider

@ghost

ghost commented May 7, 2019

The user agent randomization, header order, and query string param order are going to be a problem if aiming for a general solution. collections.OrderedDict is an unused import, probably a remnant of somebody trying to solve the aforementioned. I would not rely on dictionary insertion order prior to Python 3.7. However, if this is working for the domains that SickChill is targeting, then it LGTM as a temporary fix.
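
For illustration, here is a minimal sketch (not code from this PR or from cfscrape) of pinning a header order on a requests session; the header names and values are placeholders, and whether the order survives all the way to the wire also depends on requests/urllib3 internals:

# On Python < 3.7 a plain dict does not guarantee insertion order,
# hence the explicit OrderedDict.
from collections import OrderedDict

import requests

session = requests.Session()
session.headers = OrderedDict([
    ('User-Agent', 'Mozilla/5.0 ...'),  # placeholder value
    ('Accept', 'text/html,application/xhtml+xml,*/*;q=0.8'),
    ('Accept-Encoding', 'gzip, deflate'),
])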

@CodyWoolaver
Contributor Author

I have removed the unused import.

Could you explain your issues with user agent randomization on class initialization? It uses a random user agent for each instance of the class instead of the same one for every instance.

I am also not sure of the importance of header order when it comes to making a request. To my knowledge there is nothing enforcing a specific order of headers in a request, though I could be terribly wrong.
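
For context, the per-instance user-agent randomization described above would look roughly like this hypothetical sketch; the class name and the user-agent list are placeholders, not cfscrape's actual code:

import random

# Placeholder list; cfscrape ships its own set of user agents.
DEFAULT_USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'Mozilla/5.0 (X11; Linux x86_64) ...',
]

class Scraper(object):
    def __init__(self):
        # Each instance picks its own user agent once at construction time,
        # rather than every instance sharing one module-level choice.
        self.user_agent = random.choice(DEFAULT_USER_AGENTS)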

@miigotu
Contributor

miigotu commented May 8, 2019

Hi, I am planning on switching to https://github.com/SickChill/cloudscraper with js2py after some testing. It is possible many of these fixes are already there.

@ghost

ghost commented May 8, 2019

@miigotu There are issues with js2py: it is not spec compliant, it requires excessive polyfills (it lacks https://tc39.github.io/ecma262/#sec-additional-ecmascript-features-for-web-browsers), it is slow, and it has security issues. It was dropped by cloudflare-scrape due to the security issues that the author notes himself. I personally will not be maintaining a version that uses js2py. There is also the issue of cloudscraper.py unnecessarily playing with SSL ciphers and lowering security, i.e. preventing the use of TLSv1.3 and causing CAPTCHAs on systems with OpenSSL versions prior to 1.1.1.

@ghost

ghost commented May 8, 2019

@CodyWoolaver Those problems are due to Cloudflare's BIC (Browser Integrity Check). The HTTP spec doesn't define any header order or letter casing, but Cloudflare, being what it is, has been able to profile requests and filter based on those attributes. This has been seen time and time again in countless bug reports.

@miigotu
Contributor

miigotu commented May 8, 2019

I commented on the other issue, and I'm open to other options. We need to work out the drama surrounding all this, it seems.

If there are issues with js2py, maybe we can improve js2py?
Node just creates so many problems for some users who don't have root or even sudo on their seedboxes, among other places.

If we must continue with a Node-reliant method, even just for the time being, that is OK also.

Whatever version is used, I will be maintaining a fork here.

@ghost

ghost commented May 8, 2019

If there are issues with js2py, maybe we can improve js2py?

I've already fixed it once: PiotrDabkowski/Js2Py#157
It can be improved if you want to invest the time in it, but there are so many edge cases that it's going to be a pain whenever it does break. It's like relying on fragile regular expressions to do web scraping where the target site keeps updating, but instead of changing the regex you have to find your way around and update a large code base. If Cloudflare ever starts using ES6 features, js2py will be horribly slow. There are many stale issues in the js2py repository, and then there's the security aspect, of course. IMHO, the current use of js2py in cloudscraper is very hacky, but enough of the negativity, lol - you currently can use js2py.

I hope that makes sense as the reason why I don't want to maintain a version that uses it.

@CodyWoolaver
Contributor Author

A new cfscrape version has been built - I am going to pull in that change and test things out. They bumped the version from 1.9.7 to 2.0.0, so I am not sure whether there are breaking changes.

@ghost

ghost commented May 13, 2019

@CodyWoolaver There aren't any breaking changes in the public API.

Node support did change: Anorov/cloudflare-scrape#241
Some security concerns were addressed.
Diff: https://github.com/Anorov/cloudflare-scrape/compare/f35c46..67f80e

If you guys run into this issue: Anorov/cloudflare-scrape#235
Please let us know. I've written a script to help generate reports for it. If you could run it, that'd be fantastic. It's really the only way I see us being able to fully address the issue without shooting in the dark.

@CodyWoolaver
Contributor Author

I am seeing some issues around the headers again - the same issue as before when it came to passing in additional headers, Accept in particular. I will dig deeper tonight. The issues are not around Cloudflare, but other websites requiring specific header formats that are not being passed in. I don't have all the details yet.

@miigotu merged commit 3925a60 into SickChill:develop on May 15, 2019
@ghost

ghost commented May 15, 2019

@CodyWoolaver Have you identified any problems? The master branch has been updated to contain a fix to avoid CAPTCHA, and the fix has also been released. It should be noted that whether or not you experience the problem that the recent release resolves is believed to depend on your CPU.

@CodyWoolaver
Contributor Author

Seems to be intermittent and I have not nailed down the full effects. I assume my CPU is not a problem, though - I am using an Intel(R) Xeon(R) CPU E5-2620 v4.

The standard requests work fine; it is just the interactions with the Deluge web API that fail.

@lukele

lukele commented May 15, 2019

@CodyWoolaver would you have a small test case for us to test with?

@ghost

ghost commented May 15, 2019

If you are receiving a CAPTCHA from Cloudflare, the recent release fixes a case that is seemingly only experienced by some users, depending on OpenSSL version and CPU. Otherwise, yes, the CPU would be irrelevant.

I'll look over the requests codebase to review how session headers are handled; maybe I'll spot some improvement to be made.

@CodyWoolaver
Contributor Author

CodyWoolaver commented May 15, 2019

My test case, within SickChill, is to manually search for an episode. I receive a notification informing me the episode has been found; however, in my logs I receive this message:

2019-05-15 14:03:08 WARNING SEARCHQUEUE-BACKLOG-268592 :: Deluge: Unable to send Torrent

The message is not very helpful, and I have a terrible SickRage setup for debugging.

@ghost

ghost commented May 15, 2019

@CodyWoolaver All I need is the exact headers that should normally be sent; I should be able to debug it with that information. Any other request-specific information such as the payload, query params, etc. would also be helpful, but not necessary if the headers are known to be the problem.

@CodyWoolaver
Contributor Author

Setting a reminder for myself tonight to get back to you - Can't test it from work :(

@ghost

ghost commented May 15, 2019

Ah, okay, thx. I'm multi-tasking at the moment, but a little later I will experiment. I really think that I should be able to determine the problem without any additional information, if it resides with the library. If @lukele doesn't beat me to it. 😃

@ghost

ghost commented May 15, 2019

@lukele

lukele commented May 15, 2019

@CodyWoolaver does this happen with a specific search provider/episode?

@lukele

lukele commented May 15, 2019

@CodyWoolaver Do you have "validate certificate" enabled in SickChill's Deluge settings? I wonder if that might be causing the problems. Also, are you running Deluge on localhost or on another server?

From the code it also looks like running SickChill with debug logs on should quickly help identify the issue, since it prints the data being sent and the host for the connection, which is a good starting point.

@ghost

ghost commented May 15, 2019

I did find one issue but I haven't finished looking into this yet: Anorov/cloudflare-scrape#247


You can see how it's affected here: https://github.com/SickChill/SickChill/blob/175df89cabf295358b8caa2d96dafadb2347622d/sickbeard/helpers.py#L1373-L1378

If you want to see whether this is the problem or not:

from collections import OrderedDict

def make_session():
    session = requests.Session()

    # Preserve the header order by wrapping the default headers in an OrderedDict
    # before handing the session to cfscrape.
    session.headers = OrderedDict(session.headers)
    session.headers.update({'User-Agent': USER_AGENT, 'Accept-Encoding': 'gzip,deflate'})

    session = cfscrape.create_scraper(sess=session)
    return session

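If it helps, one quick way to see what header order the session actually ends up with (purely illustrative):

session = make_session()
print(list(session.headers.keys()))  # inspect the resulting header order
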
@lukele

lukele commented May 15, 2019

Is only the order affected, or are the headers themselves affected as well? If the former, that should be no problem, since Cloudflare is not the issue in this case, from what I understand.

@ghost

ghost commented May 15, 2019

Yeah, it's the former, but I'm just throwing it out there... It does have something to do with the headers. 😃

I just found another issue. This will override the HTTPS adapter and undo the fix to avoid CAPTCHA. I can't think of a good solution for this at the moment... :/

https://github.com/SickChill/SickChill/blob/175df89cabf295358b8caa2d96dafadb2347622d/lib/cachecontrol/wrapper.py#L5-L21

@lukele

lukele commented May 15, 2019

I worried something like this would happen. Unfortunately, the way requests is built, there's no easy way around it. The CacheControlAdapter could maybe inherit from the CaptchaAdapter.
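
A rough sketch of that idea; the cfscrape adapter's import path and class name below are assumptions rather than confirmed API, and whether it works depends on both adapters cooperating via super():

from cachecontrol.adapter import CacheControlAdapter
# Placeholder import: whichever HTTPAdapter subclass cfscrape uses to avoid CAPTCHA.
from cfscrape import CloudflareAdapter

class CachedCaptchaAdapter(CacheControlAdapter, CloudflareAdapter):
    # CacheControl's caching wraps send()/build_response(), while the cfscrape
    # adapter's TLS tweaks would apply further down the MRO.
    pass

# The session would then mount this single adapter for HTTPS instead of the two
# separate adapters overriding each other.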

@ghost

ghost commented May 15, 2019

The only problem with that is that they have to modify a dependency. I don't think they mind in this case, but it's probably not preferred.

@lukele

lukele commented May 15, 2019

Unrelated to any cfscrape issues, this error could happen if the connection to a Deluge host, which was established during _get_auth, is no longer active. But yeah, debug logs would tell the whole story, I assume.

@ghost

ghost commented May 15, 2019

I've completed my review of cfscrape, requests, and both Deluge API clients, one of which doesn't use cfscrape at all. The usage of cfscrape by the other shouldn't be causing problems if they are not related to Cloudflare. I would replace the usage of cfscrape with requests in helpers.py - specifically, return the regular requests session in make_session - to eliminate cfscrape as the cause.
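
As a sketch of that debugging step (mirroring the helpers.py excerpt quoted above; USER_AGENT is assumed to be the existing module-level constant):

import requests

def make_session():
    # Temporary variant: return a plain requests session so any remaining
    # failure can be ruled in or out as a cfscrape problem.
    session = requests.Session()
    session.headers.update({'User-Agent': USER_AGENT, 'Accept-Encoding': 'gzip,deflate'})
    return session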

@ghost

ghost commented May 15, 2019

I'm starting to feel bad about polluting this PR... lol... Sorry about that. If you want to open an issue over at cloudflare-scrape's repo, that's cool.

@ghost mentioned this pull request on May 30, 2019