This repository has been archived by the owner on Mar 12, 2024. It is now read-only.

buff data crawl stops halfway #42

Closed

gsuckerp opened this issue Nov 16, 2020 · 9 comments
Labels: question (Further information is requested)

Comments

gsuckerp commented Nov 16, 2020

No description provided.

gsuckerp (Author)

I got banned by buff...

gsuckerp reopened this Nov 16, 2020
gsuckerp (Author)

Switched to another account and it still can't crawl??
```
2020-11-16 20:51:46,820 [INFO ] Page 292 / 379
2020-11-16 20:51:46,821 [INFO ] Successful attempt to fetch from 62ad83754a1bd6c7bf95ec6c12e43070110150e8
Traceback (most recent call last):
  File "C:\Users\61669\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\61669\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\61669\PycharmProjects\pythonProject\oddish-master\oddish-master\src\main.py", line 13, in <module>
    table = item_crawler.crawl()
  File "C:\Users\61669\PycharmProjects\pythonProject\oddish-master\oddish-master\src\crawl\item_crawler.py", line 123, in crawl
    return crawl_website()
  File "C:\Users\61669\PycharmProjects\pythonProject\oddish-master\oddish-master\src\crawl\item_crawler.py", line 77, in crawl_website
    csgo_items.extend(crawl_goods_by_price_section(None))
  File "C:\Users\61669\PycharmProjects\pythonProject\oddish-master\oddish-master\src\crawl\item_crawler.py", line 110, in crawl_goods_by_price_section
    items_json = page_json['data']['items']
KeyError: 'data'
```

puppylpg (Owner) commented Nov 17, 2020

There are roughly a few possible causes for the crawl failing partway:

  1. 379 pages were counted at the start, but item listings changed during the crawl, so there were no longer that many pages;
  2. A network problem made one page's fetch fail, leaving that page without data;
  3. The cache expiry was set too long, and an earlier bug left the stored cache with a broken structure, so this run read corrupted data from the cache. You can clear the cache, though you don't have to.

In any case, my latest code changes handle all of these. Give the new code a try (see the defensive-parsing sketch after this comment).
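For illustration only, a minimal sketch of the kind of guard that avoids the `KeyError: 'data'` from the traceback above. The function name mirrors the traceback, but the signature and body here are assumptions, not the actual fix in oddish:

```python
import logging

def crawl_goods_by_price_section(page_json):
    """Hypothetical sketch: tolerate responses that lack the expected
    'data' key (error page, ban page, or a page that no longer exists)
    instead of raising KeyError: 'data'."""
    data = page_json.get('data') if isinstance(page_json, dict) else None
    if data is None:
        # The caller extend()s the result, so an empty list is a safe fallback.
        logging.warning("Response carries no 'data' field; skipping this page")
        return []
    return data.get('items', [])
```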

puppylpg (Owner)

> I got banned by buff...

That shouldn't happen, should it? This problem shouldn't come up anymore, since not much data is fetched from buff now. Could you have been running it too frequently?

gsuckerp (Author)

> > I got banned by buff...
>
> That shouldn't happen, should it? This problem shouldn't come up anymore, since not much data is fetched from buff now. Could you have been running it too frequently?

Ah, it really was too frequent. Fortunately buff and steam can now be crawled separately, so I can set a longer interval for crawling buff.

puppylpg (Owner)

> Ah, it really was too frequent. Fortunately buff and steam can now be crawled separately, so I can set a longer interval for crawling buff.

Haha, don't run it too often; once or twice a day is plenty.
Or narrow the crawl range a bit; it looks like you crawl quite a lot in one run, three hundred plus pages.
Also increase the crawl interval (a randomized-delay sketch follows this comment).
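As a hedged illustration of the "increase the interval" advice, a sketch of sleeping a random amount between page fetches so the rhythm isn't fixed; the bounds and helper name are assumptions, not oddish's actual configuration:

```python
import random
import time

MIN_DELAY, MAX_DELAY = 3.0, 6.0  # assumed bounds, in seconds

def polite_sleep():
    """Wait a random interval between page requests; a fixed interval
    (e.g. always exactly 2 s) is much easier for a site to flag as a bot."""
    time.sleep(random.uniform(MIN_DELAY, MAX_DELAY))
```

gsuckerp's next comment notes that setting the interval's upper bound to 2 effectively made the interval non-random, which is exactly the failure mode a randomized delay is meant to avoid.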

puppylpg added the question label Nov 17, 2020
gsuckerp (Author)

> Or narrow the crawl range a bit; it looks like you crawl quite a lot in one run, three hundred plus pages.
> Also increase the crawl interval.

Yesterday I set a 3-6 second interval and crawled three hundred plus pages, and today I still got banned.
Several accounts have crawled from this one computer, and all of them have been banned by buff before, so NetEase already knows these accounts of mine are linked.
It's my own fault for crawling with my main account back then and setting the interval's upper bound to 2, which effectively made the interval non-random and got the account banned.
Is there anything I can do now? Baidu results suggest changing the user-agent, or the IP, and so on. Any advice?

puppylpg (Owner)

> Is there anything I can do now? Baidu results suggest changing the user-agent, or the IP, and so on. Any advice?

Switching IPs is fairly costly, since you have to set up some proxy servers. Changing the user-agent is definitely worth considering; I hardcoded the user-agent at the beginning, which really isn't ideal, and I plan to change it soon (a rotation sketch follows this comment).
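As an illustration of rotating user-agents instead of hardcoding one, a minimal sketch using requests; the header pool and helper are assumptions for this sketch, not the change that later landed in #45:

```python
import random
import requests

# A small pool of realistic browser user-agent strings (illustrative values).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0",
]

def fetch(url):
    """Send each request with a randomly chosen user-agent
    rather than one fixed string."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)
```

Rotating the user-agent alone won't defeat account- or IP-based bans, but it removes one easy fingerprint.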

puppylpg reopened this Nov 17, 2020
puppylpg (Owner)

user-agent is done in #45
