This repository has been archived by the owner on Mar 12, 2024. It is now read-only.

Add check before get page data. #43

Merged: 1 commit, Nov 17, 2020
37 changes: 22 additions & 15 deletions CHANGELOG.md
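@@ -29,18 +29,18 @@
   - Show progress while crawling prices, so the total run time can be anticipated;
   - Set timeout=5s when crawling; on timeout, report an error and return;

-# v1.6.0(2019-12-03)
+## v1.6.0(2019-12-03)
 * Features
   - The configuration is now extracted into its own file, `config/config.ini`, which makes configuring more user-friendly;

-# v2.0.0(2019-12-04)
+## v2.0.0(2019-12-04)
 * Features
   - Let buff do the filtering: apply the price filter on buff first and crawl only the items that fall within the price range. Compared with the earlier approach of crawling everything and then filtering by price, this is several times more efficient;
   - File names now carry a price-range tag, so data already crawled for one price range does not affect crawling other price ranges in the same time period;
 * bugfix
   - Parse the configuration with `RawConfigParser`, which [leaves special characters such as percent signs in the config unescaped](https://stackoverflow.com/questions/14340366/configparser-and-string-with);

-# v3.0.0(2019-12-05)
+## v3.0.0(2019-12-05)
 * Features
   - The project is officially named [oddish | 走路草](https://www.pokemon.com/us/pokedex/oddish), Pokémon No. 43;
   - Support a blacklist and a whitelist of item categories; see the README for details;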
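Review note on the v2.0.0 `RawConfigParser` entry: a minimal sketch of why the parser class matters for values that contain a percent sign. The section name, key, and URL below are hypothetical; the real `config/config.ini` is not shown in this PR.

```python
import configparser

# Hypothetical config text; the section, key, and URL are made up for illustration.
SAMPLE = "[steam]\nurl = https://example.com/search?q=%E4%B8%AD\n"


def read_url(parser_cls):
    """Parse SAMPLE with the given parser class and return the 'url' value."""
    parser = parser_cls()
    parser.read_string(SAMPLE)
    return parser.get("steam", "url")


# RawConfigParser performs no interpolation, so '%' is returned untouched.
raw_value = read_url(configparser.RawConfigParser)

# ConfigParser applies BasicInterpolation on get(), which rejects a bare '%'.
try:
    read_url(configparser.ConfigParser)
    interpolation_failed = False
except configparser.InterpolationSyntaxError:
    interpolation_failed = True
```

With `ConfigParser`, the same value would have to be written with doubled percent signs (`%%`) or read with `raw=True`.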
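@@ -49,63 +49,70 @@
   - Add a simple timeout-and-retry mechanism when requesting URLs;
   - Use `json.loads()` to convert the plain string list;

-# v3.1.0(2019-12-07)
+## v3.1.0(2019-12-07)
 * bugfix
   - The earlier implementation that takes the .25 of Steam selling prices as the average price was faulty (the change had been made in the wrong place); fixed;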

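-# v3.2.0(2020-04-19)
+## v3.2.0(2020-04-19)
 * Features
   - When an invalid cookie causes login to fail, show a friendly hint instead of crashing with a pile of stack traces;
   - Exclude all items other than weapons by default;
   - Add a cookie example;

-# v3.3.0(2020-05-24)
+## v3.3.0(2020-05-24)
 * Features
   - Add badges;
   - Add a GitHub Sponsor button;

-# v3.4.0(2020-07-27)
+## v3.4.0(2020-07-27)
 * Features
   - Support a custom crawl interval;

-# v3.5.0(2020-08-06)
+## v3.5.0(2020-08-06)
 * Features
   - Switch the source of Steam historical prices to the Community Market;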

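-# v3.6.0(2020-08-18)
+## v3.6.0(2020-08-18)
 * Features
   - The blacklist and whitelist support wildcard patterns;
 * bugfix
   - Fix incorrect argument passing during timeout retries;
   - Remove obsolete configuration options;
   - Fix a bug where the historical transaction count could not be fetched correctly;

-# v3.6.1(2020-09-10)
+## v3.6.1(2020-09-10)
 * bugfix
   - When crawling many items, some entries will inevitably fail; these are now ignored and the program continues normally;

-# v3.7.0(2020-10-16)
+## v3.7.0(2020-10-16)
 * Features
   - Introduce per-page caching to make crawling more fault-tolerant;
   - Allow direct connections without a proxy;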

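-# v3.8.0(2020-10-16)
+## v3.8.0(2020-10-16)
 * Features
   - Add standard open-source files: license, CONTRIBUTING, templates, etc.;

-# v3.8.1(2020-10-19)
+## v3.8.1(2020-10-19)
 * Features
   - Support multiple issue templates;

-# v3.8.2(2020-10-23)
+## v3.8.2(2020-10-23)
 * bugfix
   - Fix the cache file encoding problem on Windows
   - Re-crawl when the cache is unexpectedly invalid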

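-# v3.8.3(2020-11-12)
+## v3.8.3(2020-11-12)
 * bugfix
   - There are now more sticker categories, so the default blacklist/whitelist uses wildcards to exclude stickers;
   - Add several checks to avoid errors when data is missing in special cases:
     + Before writing the cache, check whether the crawled content is None because of a timeout;
     + Check for an empty data table; no suggestions are given when nothing was crawled;
     + When Steam historical prices are returned, check that the 'prices' key exists;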

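+## v3.8.4(2020-11-17)
+* bugfix
+  - When crawling by page number, validate the data structure: with a lot to crawl and items selling quickly, the page count obtained at the start may no longer exist;
+* Features
+  - Add `requirements.txt`;
+  - Add documentation to the README on using the UU accelerator;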
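Review note on the wildcard blacklist/whitelist entries (v3.6.0, v3.8.3): a minimal sketch of how wildcard filtering of item categories could work, using the standard-library `fnmatch` module. This is an assumption for illustration; the changelog does not say which matching mechanism oddish uses, and the function name and category names below are hypothetical.

```python
from fnmatch import fnmatchcase


def is_allowed(category, whitelist, blacklist):
    """Return True when category misses the blacklist and passes the whitelist.

    An empty whitelist is treated as "allow everything" (an assumption,
    not something the changelog states).
    """
    if any(fnmatchcase(category, pattern) for pattern in blacklist):
        return False
    if not whitelist:
        return True
    return any(fnmatchcase(category, pattern) for pattern in whitelist)


# Hypothetical category names, for illustration only.
categories = ["weapon_ak47", "sticker_tournament16", "sticker_capsule"]
kept = [
    c for c in categories
    if is_allowed(c, whitelist=["weapon_*"], blacklist=["sticker*"])
]
```

A single pattern such as `sticker*` keeps matching as new sticker categories appear, which is the situation the v3.8.3 entry describes.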
15 changes: 10 additions & 5 deletions src/crawl/item_crawler.py
@@ -90,13 +90,16 @@ def crawl_goods_by_price_section(category=None):

     if root_json is not None:
         if 'data' not in root_json:
-            log.info('Error happens:')
-            log.info(root_json)
+            log.error('Error happens:')
+            log.error(root_json)
             if 'error' in root_json:
-                log.info('Error field: ' + root_json['error'])
-            log.info('Please paste correct buff cookie to config, current cookie:' + BUFF_COOKIE)
+                log.error('Error field: ' + root_json['error'])
+            log.error('Please paste correct buff cookie to config, current cookie:' + BUFF_COOKIE)
             exit(1)

+        if ('total_page' not in root_json['data']) or ('total_count' not in root_json['data']):
+            log.error("No specific page and count info for root page. Please check buff data structure.")
+
         total_page = root_json['data']['total_page']
         total_count = root_json['data']['total_count']
         log.info('Totally {} items of {} pages to crawl.'.format(total_count, total_page))
@@ -105,14 +108,16 @@ def crawl_goods_by_price_section(category=None):
             log.info('Page {} / {}'.format(page_num, total_page))
             page_url = goods_section_page_url(category, page_num)
             page_json = get_json_dict(page_url, buff_cookies)
-            if page_json is not None:
+            if (page_json is not None) and ('data' in page_json) and ('items' in page_json['data']):
                 # items on this page
                 items_json = page_json['data']['items']
                 for item in items_json:
                     # get item
                     csgo_item = collect_item(item)
                     if csgo_item is not None:
                         category_items.append(csgo_item)
+            else:
+                log.warn("No specific data for page {}. Skip this page.".format(page_url))

     return category_items
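
Review note: as the first hunk reads, the new `total_page`/`total_count` check only logs an error; the lines after it still index `root_json['data']['total_page']`, which would raise `KeyError` if that key were missing. Below is a minimal sketch of one way to pull both structure checks into small helpers that the caller can use to stop or skip early. The helper names `extract_page_info` and `extract_items` are hypothetical and not part of this PR.

```python
def extract_page_info(root_json):
    """Return (total_page, total_count), or None when the structure is unexpected."""
    data = root_json.get('data') if isinstance(root_json, dict) else None
    if not isinstance(data, dict):
        return None
    if 'total_page' not in data or 'total_count' not in data:
        return None
    return data['total_page'], data['total_count']


def extract_items(page_json):
    """Return the item list of one page, or an empty list when it is missing."""
    if not isinstance(page_json, dict):
        return []
    data = page_json.get('data')
    if not isinstance(data, dict):
        return []
    items = data.get('items')
    return items if isinstance(items, list) else []
```

A caller could return the (possibly empty) `category_items` when `extract_page_info` yields `None`, and log a warning and continue when `extract_items` yields an empty list for a page.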
