fix(cfscrape): Fix for challenge where part of formula is stored in a div #5462

Merged
3 commits merged into SickChill:develop on May 15, 2019

Conversation

CodyWoolaver
Contributor

@CodyWoolaver commented May 7, 2019

Original credit to https://github.com/lukele - all I have done is carry on the conversations; he did the heavy lifting.

This is not a full version and has not been released by any official cfscrape repo. I am creating this PR to alleviate the numerous pain points that others are experiencing. I have been running this version on my own setup (with a Deluge web API as a search provider) for the last several weeks with zero issues.

I am open to comments and suggestions; however, my actual knowledge of cfscrape is quite limited.

Fixes #5185
Fixes #5420
Fixes #5423

Proposed changes in this pull request:

cc'ing individuals related to ongoing conversations in #5423
@pro-src @mitch71h @lukele @WebSpider

@ghost

ghost commented May 7, 2019

The user agent randomization, header order, and query string param order are going to be a problem if aiming for a general solution. collections.OrderedDict is an unused import, probably a remnant of somebody trying to solve the aforementioned. I would not rely on dictionary insertion order prior to Python 3.7. However, if this is working for the domains that SickChill is targeting, then it LGTM as a temporary fix.
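
For illustration, here is a minimal sketch (not code from this PR or from cfscrape) of pinning a header order on a requests session; the header names and values are placeholders, and whether the order survives all the way to the wire also depends on requests/urllib3 internals:

# On Python < 3.7 a plain dict does not guarantee insertion order,
# hence the explicit OrderedDict.
from collections import OrderedDict

import requests

session = requests.Session()
session.headers = OrderedDict([
    ('User-Agent', 'Mozilla/5.0 ...'),  # placeholder value
    ('Accept', 'text/html,application/xhtml+xml,*/*;q=0.8'),
    ('Accept-Encoding', 'gzip, deflate'),
])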

@CodyWoolaver
Contributor Author

I have removed the unused import.

Could you explain your issues with user agent randomization on class initialization? It uses a random user agent for each instance of the class instead of the same one for every instance.

I am also not sure of the importance of header order when it comes to making a request. To my knowledge there is nothing enforcing a specific order of headers in a request, though I could be terribly wrong.
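
For context, the per-instance user-agent randomization described above would look roughly like this hypothetical sketch; the class name and the user-agent list are placeholders, not cfscrape's actual code:

import random

# Placeholder list; cfscrape ships its own set of user agents.
DEFAULT_USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'Mozilla/5.0 (X11; Linux x86_64) ...',
]

class Scraper(object):
    def __init__(self):
        # Each instance picks its own user agent once at construction time,
        # rather than every instance sharing one module-level choice.
        self.user_agent = random.choice(DEFAULT_USER_AGENTS)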

@miigotu
Contributor

miigotu commented May 8, 2019

Hi, I am planning on switching to https://github.com/SickChill/cloudscraper with js2py after some testing. It is possible many of these fixes are already there.

@ghost

ghost commented May 8, 2019

@miigotu There are issues with js2py: it is not spec compliant, it requires excessive polyfills (it lacks https://tc39.github.io/ecma262/#sec-additional-ecmascript-features-for-web-browsers), it is slow, and it has security issues. It was dropped by cloudflare-scrape due to the security issues that the author notes himself. I personally will not be maintaining a version that uses js2py. There is also the issue of cloudscraper.py unnecessarily playing with SSL ciphers and lowering security, i.e. preventing the use of TLSv1.3 and causing CAPTCHAs on systems with OpenSSL versions prior to 1.1.1.

@ghost

ghost commented May 8, 2019

@CodyWoolaver Those problems are due to Cloudflare's BIC (Browser Integrity Check). The HTTP spec doesn't define any header order or letter casing, but Cloudflare, being what it is, has been able to profile requests and filter based on those attributes. This has been seen time and time again in countless bug reports.

@miigotu
Contributor

miigotu commented May 8, 2019

I commented on the other issue, and I'm open to other options. We need to work out the drama surrounding all this, it seems.

If there are issues with js2py, maybe we can improve js2py?
Node just creates so many problems for some users who don't have root or even sudo on their seedboxes, among other places.

If we must continue with a Node-reliant method, even just for the time being, that is OK also.

Whatever version is used, I will be maintaining a fork here.

@ghost

ghost commented May 8, 2019

If there are issues with js2py, maybe we can improve js2py?

I've already fixed it once: PiotrDabkowski/Js2Py#157
It can be improved if you want to invest the time in it, but there are so many edge cases that it's going to be a pain whenever it does break. It's like relying on fragile regular expressions to do web scraping where the target site keeps updating, but instead of changing the regex you have to find your way around and update a large code base. If Cloudflare ever starts using ES6 features, js2py will be horribly slow. There are many stale issues in the js2py repository, and then there's the security aspect, of course. IMHO, the current use of js2py in cloudscraper is very hacky, but enough of the negativity, lol - you currently can use js2py.

I hope that makes sense as the reason why I don't want to maintain a version that uses it.

@CodyWoolaver
Contributor Author

A new cfscrape version has been built - I am going to pull in that change and test things out. They bumped the version from 1.9.7 to 2.0.0, so I am not sure whether there are breaking changes.

@ghost

ghost commented May 13, 2019

@CodyWoolaver There aren't any breaking changes in the public API.

Node support did change: Anorov/cloudflare-scrape#241
Some security concerns were addressed.
Diff: https://github.com/Anorov/cloudflare-scrape/compare/f35c46..67f80e

If you guys run into this issue: Anorov/cloudflare-scrape#235
Please let us know. I've written a script to help generate reports for it. If you could run it, that'd be fantastic. It's really the only way I see us being able to fully address the issue without shooting in the dark.

@CodyWoolaver
Contributor Author

I am seeing some issues around the headers again - the same issue as before when it came to passing in additional headers, Accept in particular. I will dig deeper tonight. The issues are not around Cloudflare, but other websites requiring specific header formats that are not being passed in. I don't have all the details yet.

@miigotu merged commit 3925a60 into SickChill:develop on May 15, 2019
@ghost

ghost commented May 15, 2019

@CodyWoolaver Have you identified any problems? The master branch has been updated to contain a fix to avoid CAPTCHA, and the fix has also been released. It should be noted that whether or not you experience the problem that the recent release resolves is believed to depend on your CPU.

@CodyWoolaver
Contributor Author

Seems to be intermittent and I have not nailed down the full effects. I assume my CPU is not a problem, though - I am using an Intel(R) Xeon(R) CPU E5-2620 v4.

The standard requests work fine; it is just the interactions with the Deluge web API that fail.

@lukele

lukele commented May 15, 2019

@CodyWoolaver would you have a small test case for us to test with?

@ghost

ghost commented May 15, 2019

If you are receiving a CAPTCHA from Cloudflare, the recent release fixes a case that is seemingly only experienced by some users, depending on OpenSSL version and CPU. Otherwise, yes, the CPU would be irrelevant.

I'll look over the requests codebase to review how session headers are handled; maybe I'll spot some improvement to be made.

@CodyWoolaver
Contributor Author

CodyWoolaver commented May 15, 2019

My test case, within SickChill, is to manually search for an episode. I receive a notification informing me the episode has been found; however, in my logs I receive this message:

2019-05-15 14:03:08 WARNING SEARCHQUEUE-BACKLOG-268592 :: Deluge: Unable to send Torrent

The message is not very helpful, and I have a terrible SickRage setup for debugging.

@ghost

ghost commented May 15, 2019

@CodyWoolaver All I need is the exact headers that should normally be sent; I should be able to debug it with that information. Any other request-specific information such as the payload, query params, etc. would also be helpful, but not necessary if the headers are known to be the problem.

@CodyWoolaver
Contributor Author

Setting a reminder for myself tonight to get back to you - Can't test it from work :(

@ghost

ghost commented May 15, 2019

Ah, okay, thx. I'm multi-tasking at the moment, but a little later I will experiment. I really think that I should be able to determine the problem without any additional information, if it resides with the library. If @lukele doesn't beat me to it. 😃

@ghost

ghost commented May 15, 2019

@lukele

lukele commented May 15, 2019

@CodyWoolaver does this happen with a specific search provider/episode?

@lukele

lukele commented May 15, 2019

@CodyWoolaver Do you have "validate certificate" enabled in SickChill's Deluge settings? I wonder if that might be causing the problems. Also, are you running Deluge on localhost or on another server?

From the code it also looks like running SickChill with debug logs on should quickly help identify the issue, since it prints the data being sent and the host for the connection, which is a good starting point.

@ghost

ghost commented May 15, 2019

I did find one issue but I haven't finished looking into this yet: Anorov/cloudflare-scrape#247


You can see how it's affected here: https://github.com/SickChill/SickChill/blob/175df89cabf295358b8caa2d96dafadb2347622d/sickbeard/helpers.py#L1373-L1378

If you want to see whether this is the problem or not:

from collections import OrderedDict

def make_session():
    session = requests.Session()

    # Preserve the header order by wrapping the default headers in an OrderedDict
    # before handing the session to cfscrape.
    session.headers = OrderedDict(session.headers)
    session.headers.update({'User-Agent': USER_AGENT, 'Accept-Encoding': 'gzip,deflate'})

    session = cfscrape.create_scraper(sess=session)
    return session

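If it helps, one quick way to see what header order the session actually ends up with (purely illustrative):

session = make_session()
print(list(session.headers.keys()))  # inspect the resulting header order
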
@lukele

lukele commented May 15, 2019

Is only the order affected, or are the headers themselves affected as well? If the former, that should be no problem, since Cloudflare is not the issue in this case, from what I understand.

@ghost

ghost commented May 15, 2019

Yeah, it's the former, but I'm just throwing it out there... It does have something to do with the headers. 😃

I just found another issue. This will override the HTTPS adapter and undo the fix to avoid CAPTCHA. I can't think of a good solution for this at the moment... :/

https://github.com/SickChill/SickChill/blob/175df89cabf295358b8caa2d96dafadb2347622d/lib/cachecontrol/wrapper.py#L5-L21

@lukele

lukele commented May 15, 2019

I worried something like this would happen. Unfortunately, the way requests is built, there's no easy way around it. The CacheControlAdapter could maybe inherit from the CaptchaAdapter.
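
A rough sketch of that idea; the cfscrape adapter's import path and class name below are assumptions rather than confirmed API, and whether it works depends on both adapters cooperating via super():

from cachecontrol.adapter import CacheControlAdapter
# Placeholder import: whichever HTTPAdapter subclass cfscrape uses to avoid CAPTCHA.
from cfscrape import CloudflareAdapter

class CachedCaptchaAdapter(CacheControlAdapter, CloudflareAdapter):
    # CacheControl's caching wraps send()/build_response(), while the cfscrape
    # adapter's TLS tweaks would apply further down the MRO.
    pass

# The session would then mount this single adapter for HTTPS instead of the two
# separate adapters overriding each other.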

@ghost

ghost commented May 15, 2019

The only problem with that is that they have to modify a dependency. I don't think they mind in this case, but it's probably not preferred.

@lukele

lukele commented May 15, 2019

Unrelated to any cfscrape issues, this error could happen if the connection to a Deluge host, which was established during _get_auth, is no longer active. But yeah, debug logs would tell the whole story, I assume.

@ghost

ghost commented May 15, 2019

I've completed my review of cfscrape, requests, and both Deluge API clients, one of which doesn't use cfscrape at all. The usage of cfscrape by the other shouldn't be causing problems if they are not related to Cloudflare. I would replace the usage of cfscrape with requests in helpers.py - specifically, return the regular requests session in make_session - to eliminate cfscrape as the cause.
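
As a sketch of that debugging step (mirroring the helpers.py excerpt quoted above; USER_AGENT is assumed to be the existing module-level constant):

import requests

def make_session():
    # Temporary variant: return a plain requests session so any remaining
    # failure can be ruled in or out as a cfscrape problem.
    session = requests.Session()
    session.headers.update({'User-Agent': USER_AGENT, 'Accept-Encoding': 'gzip,deflate'})
    return session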

@ghost

ghost commented May 15, 2019

I'm starting to feel bad about polluting this PR... lol... Sorry about that. If you want to open an issue over at cloudflare-scrape's repo, that's cool.

@ghost mentioned this pull request on May 30, 2019