FishC Forum


Python variable turns into a huge number after adding 1

Posted on 2017-11-14 17:21:37

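Here is the situation: I am working on my thesis and want to crawl conference papers from IEEE, so I wrote a crawler to tally how many papers fall under each topic and rank the topics. While counting the papers, though, I found that the increment operation gives absurd results.
I first scrape the list of topics from the IEEE WCNC 2015 call for papers, then use keywords from each topic to count the matching paper entries on the conference proceedings pages of the IEEE Xplore site.
When I do this, however, the running count blows up.
The main program looks like this: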
    from bs4 import BeautifulSoup
    import requests
    import exp   # a separate module I wrote; it is shown further down

    download_url = 'http://wcnc2015.ieee-wcnc.org/call-for-papers'
    down_data = requests.get(download_url)
    soup = BeautifulSoup(down_data.text, 'lxml')
    # each <li> in the call-for-papers body is one topic
    topics = soup.select('#node-46 > div > div.field.field-name-body.field-type-text-with-summary.field-label-hidden > div > div > ul > li ')
    topicnumber = 1
    for topic in topics:
        paperamount = ''
        topicdata={
            'topicid':str(topicnumber),
            'title':str(topic).split('</li>',1)[0].split('\n\t\t',1)[1],
            'amount':''
        }
        topicnumber +=1
        if '&amp;' in topicdata['title']:
            topicdata['title'] = topicdata['title'].replace('&amp;','&')
        topicdata['amount'] = exp.get_amount_2015(topicdata['topicid'])
        print(topicdata)



The file containing the function is as follows:
    from bs4 import BeautifulSoup
    import requests


    def get_amount_2015(topic_number):
        papercount = 0   # running counter
        for i in range(1,50):    # assume at most about 50 pages; the loop breaks out early below
            url = 'http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?filter%3DAND%28p_IS_Number%3A7127309%29&rowsPerPage=100&pageNumber=' + str(i) + '&resultAction=REFINE&resultAction=ROWS_PER_PAGE'
            wb_data = requests.get(url)
            soup = BeautifulSoup(wb_data.text, 'lxml')
            titles = soup.select('#results-blk > div > ul > li > div.txt > h3 > a > span')    # select the paper titles
            next_exist = soup.select('#results-blk > div.pagination-wrap > div > a.next.ir')   # select the "next page" link
            cur = str(i)                           # current page number
            nex = str(next_exist).split('gotoPage(\'')[1].split('\')">')[0] # page number the "next" link points to
            for title in titles:
                title = str(title.get_text())  # a bit verbose, but left as is for now; each title is counted in turn
                temp = title
                papercount += ('nterference' in temp)  if topic_number == '1' else papercount     # this is where the problem shows up
                papercount += ((('ognitive' in temp) or ('ltra-wideband' in temp) ) and (('adio' in temp) or ('ireless' in temp) or ('ell' in temp))) if topic_number == '2' else papercount
                papercount += (('op' in temp) or('hop' in temp) or ('ooperat' in temp) ) if (topic_number == '3') else papercount
                papercount += (('Modul' in temp) or ('cod' in temp) or ('divers' in temp) ) if topic_number == '4' else papercount
                papercount += (('Equali' in temp) or ('Synchro' in temp) or (('estimation' in temp) and ('Channel' in temp))) if topic_number == '5' else papercount
                papercount += ( ('Space-time' in temp) or ('STC' in temp) or ('antenna' in temp) ) if topic_number == '6' else papercount
                papercount += ( ('OFDM' in temp) or ('Orthogonal Frequency' in temp) or ('Code Division Multiple' in temp) or ('CDMA' in temp) or ('Spread Spectrum' in temp)) if topic_number == '7' else papercount
                papercount += ((('Model' in temp) or ('Character' in temp)) and ('Channel' in temp)) if topic_number == '8' else papercount
                papercount += (('Interference' in temp) or (('Detection' in temp) and (('user' in temp) or ('User' in temp)))) if topic_number == '9' else papercount
                papercount += ('Iterati' in temp) if topic_number == '10' else papercount
                papercount += (('Theor' in temp) and (('Radio' in temp) or ('Wireless' in temp) or ('Cell' in temp))) if topic_number == '11' else papercount
                papercount += ('Signal' in temp) if topic_number == '12' else papercount
                papercount += ('Propagat' in temp) if topic_number == '13' else papercount
                papercount += (('Multiple' in temp) and ('Access' in temp)) if topic_number == '14' else papercount
                papercount += (('Cognit' in temp) or ('Cooperative' in temp)) if topic_number == '15' else papercount
                papercount += ('Collaborat' in temp) if topic_number == '16' else papercount
                papercount += (('Mesh' in temp) or ('Ad-hoc' in temp) or ('D2D' in temp) or ('Device-to-Device' in temp) or ('Relay' in temp) or ('Sensor' in temp)) if topic_number == '17' else papercount
                papercount += ('Theor' in temp) if topic_number == '18' else papercount
                papercount += (('Resource' in temp) or ('Allocation' in temp) or (('Resource' in temp) and ('Management' in temp)) or (('Resource' in temp) and ('Schedul' in temp)))  if topic_number == '19' else papercount
                papercount += ((('Cross-layer' in temp) and ('A' in temp)) or (('Cross-layer' in temp) and ('Security' in temp))) if topic_number == '20' else papercount
                papercount += (('Software-Defined' in temp) or ('SDN' in temp) or ('RFID' in temp) or ('Radio Frequency Identification' in temp)) if topic_number == '21' else papercount
                papercount += (('Adaptab' in temp) or ('Reconfigur' in temp)) if topic_number == '22' else papercount
                papercount += ('Protocols' in temp)  if topic_number == '23' else papercount
                papercount += (('B3G' in temp) or ('4G' in temp) or ('4G' in temp) or ('WiMAX' in temp) or ('WLAN' in temp) or ('WPAN' in temp) or ('Wi-Fi' in temp)) if topic_number == '24' else papercount
                papercount += ('QoS' in temp) if topic_number == '25' else papercount
                papercount += (('Local' in temp) or ('Position' in temp))  if topic_number == '26' else papercount
                papercount += (('Estimat' in temp) or ('Process' in temp)) if topic_number == '27' else papercount
                papercount += (('Mesh' in temp) or ('Ad-hoc' in temp) or ('D2D' in temp) or ('Device-to-Device' in temp) or ('Relay' in temp) or ('Sensor' in temp)) if topic_number == '28' else papercount
                papercount += (('Mobility' in temp) or ('Location' in temp) or ('Handoff' in temp) )if topic_number == '29' else papercount
                papercount += (('IP' in temp) or ('TCP' in temp) or ('UDP' in temp)) if topic_number == '30' else papercount
                papercount += (('Multicast' in temp) or ('Rout' in temp)) if topic_number == '31' else papercount
                papercount += ('Routing' in temp) or ('Rout' in temp) if topic_number == '32' else papercount
                papercount += (('Multimedia' in temp) or ('Traffic' in temp)) if topic_number == '33' else papercount
                papercount += (('Broadcast' in temp) or ('Multicast' in temp) or ('Stream' in temp)) if topic_number == '34' else papercount
                papercount += (('Congestion' in temp) or ('Admission' in temp) or ('Control' in temp)) if topic_number == '35' else papercount
                papercount += (('Middleware' in temp) or ('Proxies' in temp) or ('Proxy' in temp)) if topic_number == '36' else papercount
                papercount += (('Security' in temp) or ('Privacy' in temp)) if topic_number == '37' else papercount
                papercount += (('E2E' in temp) or ('End to End' in temp)) if topic_number == '38' else papercount
                papercount += ('Heterogeneous' in temp)  if topic_number == '39' else papercount
                papercount += (('Capacity' in temp) or ('Throughput' in temp) or ('Outage' in temp) or ('Coverage' in temp)) if topic_number == '40' else papercount
                papercount += (('Emerging' in temp) or ('application' in temp) ) if topic_number == '41' else papercount
                papercount += ((('aware' in temp) or ('Aware' in temp) ) and (('Location' in temp) or ('Context' in temp)) )if topic_number == '42' else papercount
                papercount += (('Medicine' in temp) or ('Telemedicine' in temp) or ('Health' in temp)) if topic_number == '43' else papercount
                papercount += ('Transport' in temp) if topic_number == '44' else papercount
                papercount += ((('Cognitive' in temp) or ('Sensor' in temp)) and ('Application' in temp) or ((' a ' in temp) or ('A' in temp)) ) if topic_number == '45' else papercount
                papercount += ('Transport' in temp) if topic_number == '46' else papercount
                papercount += ((('Content' in temp) and ('distribution' in temp))or ('Home' in temp)) if topic_number == '47' else papercount
                papercount += (('Service' in temp) and (('Architecture' in temp) or ('Portab' in temp))) if topic_number == '48' else papercount
                papercount += ('Transport' in temp) if topic_number == '49' else papercount
                papercount += (('Interfaces' in temp) or ('P2P' in temp) or ('Peer-to-peer' in temp)) if topic_number == '50' else papercount
                papercount += (('Dynamic' in temp) or ('Autonomic' in temp)) if topic_number == '51' else papercount
                papercount += (('Regulation' in temp) or ('Standard' in temp) or ('Spectrum' in temp))if topic_number == '52' else papercount
                papercount += (('Test' in temp) or ('Prototype' in temp)) if topic_number == '53' else papercount
                papercount += (('Personal' in temp) or ('Discover' in temp) or ('Profil' in temp)) if topic_number == '53' else papercount

            if (cur == nex):
                break  # stop once the last page has been reached
        return str(papercount)
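As an aside on the structure of the function above: the per-topic keyword tests could also be kept in a lookup table, so that each title is tested against a single rule. The following is only a sketch under assumptions, not the poster's code: `TOPIC_RULES` and `count_matches` are hypothetical names, only three of the rules are copied over, and the sample titles are made up.

```python
# Hypothetical table: topic number -> test applied to one paper title.
# Only three rules are copied from get_amount_2015 for illustration.
TOPIC_RULES = {
    '1': lambda t: 'nterference' in t,
    '10': lambda t: 'Iterati' in t,
    '14': lambda t: 'Multiple' in t and 'Access' in t,
}


def count_matches(titles, topic_number):
    """Count the titles that satisfy the rule for one topic number."""
    rule = TOPIC_RULES.get(topic_number)
    if rule is None:
        return 0  # unknown topic: nothing to count
    return sum(1 for title in titles if rule(title))


# Made-up titles, for illustration only.
sample_titles = [
    'Interference characterization in dense networks',
    'Iterative detection for MIMO',
    'Multiple Access with feedback',
    'Co-channel interference mitigation',
]

for topic in ('1', '10', '14', '99'):
    print(topic, count_matches(sample_titles, topic))
```

With one rule per topic there is no `else` branch at all, so nothing is added to the counter for topics that do not apply.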


Here is one line of the output:

    {'topicid': '1', 'title': 'Interference characterization', 'amount': '759177558963387793036328736532440653553919368838858179201651800073597053587528681717195616028176755023399202719530867500113454411442568396262888749080744434

Why is this number so huge?
Alarmed, I started debugging to see what was going on, and found that the counting itself goes wrong.
When I change

    papercount += ('nterference' in temp)  if topic_number == '1' else papercount

to
    if topic_number == '1':
        print('--- match start; papercount before: ',papercount)
        print('Boolean value of the match, which looks fine:','nterference' in temp)
        papercount += ('nterference' in temp)  # to be safe, increment on a separate line to check whether my matching is at fault
        print('--- match done; papercount after: ', papercount)
        print('===================================')

To my surprise, the output is:

    ===================================
    --- match start; papercount before:  0
    Boolean value of the match, which looks fine: False
    --- match done; papercount after:  0
    ===================================
    --- match start; papercount before:  0
    Boolean value of the match, which looks fine: False
    --- match done; papercount after:  0
    ===================================
    --- match start; papercount before:  0
    Boolean value of the match, which looks fine: True
    --- match done; papercount after:  1
    ===================================
    --- match start; papercount before:  9007199254740992
    Boolean value of the match, which looks fine: False
    --- match done; papercount after:  9007199254740992
    ===================================
    --- match start; papercount before:  81129638414606681695789005144064
    Boolean value of the match, which looks fine: False
    --- match done; papercount after:  81129638414606681695789005144064
    ===================================
    --- match start; papercount before:  730750818665451459101842416358141509827966271488
    Boolean value of the match, which looks fine: True
    --- match done; papercount after:  730750818665451459101842416358141509827966271489
    ===================================
    --- match start; papercount before:  6582018229284824168619876730229402019930943462543326652649177088
    Boolean value of the match, which looks fine: False
    --- match done; papercount after:  6582018229284824168619876730229402019930943462543326652649177088
    ===================================
    --- match start; papercount before:  59285549689505892056868344324448208820874232148889098426616889693747311380791296
    Boolean value of the match, which looks fine: False
    --- match done; papercount after:  59285549689505892056868344324448208820874232148889098426616889693747311380791296
    ===================================


How can adding False lead to this? Could an expert please take a look?
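A quick arithmetic check on the debug output above: the first nonzero "before" value, 9007199254740992, equals 2**53, and the later values appear to follow the same pattern. That would be consistent with `papercount` being doubled 53 times per title, once for each of the remaining `papercount += ... else papercount` lines (topics '2' to '53', with the '53' test appearing twice). The sketch below simulates only that assumption and does not run the scraper; `simulate` and `OTHER_LINES` are names introduced here.

```python
OTHER_LINES = 53  # assumed: one doubling per remaining one-line conditional


def simulate(matches):
    """Return the value of papercount at the start of each title's iteration."""
    papercount = 0
    before = []
    for matched in matches:
        before.append(papercount)
        papercount += matched           # the isolated increment for topic '1'
        for _ in range(OTHER_LINES):
            papercount += papercount    # a non-matching line adds papercount to itself
    return before


# Match results in the order they appear in the debug output.
matches = [False, False, True, False, False, True, False, False]
before = simulate(matches)
for value in before:
    print(value)

print(before[3] == 2**53)               # True
print(before[4] == 2**106)              # True
print(before[6] == 2**212 + 2**53)      # True
```

If the printed values agree with the "before" lines of the debug output, the `+= False` step is not the culprit: it adds 0 as expected, and the growth happens in the lines that follow it.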

Posted on 2017-11-14 17:55:23
You are concatenating strings.

Original poster | Posted on 2017-11-14 20:05:01
SixPy, posted on 2017-11-14 17:55:
You are concatenating strings.

But

    1+False

also gives 1. And as far as type coercion goes, a bool is converted to an int as well.
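The point about `1+False` can be checked directly. A minimal sketch: in Python, `bool` is a subclass of `int`, so adding the result of an `in` test to an integer counter adds 1 or 0 and does not concatenate anything. The sample titles are made up for illustration.

```python
# bool is a subclass of int: True behaves as 1 and False as 0 in arithmetic.
print(issubclass(bool, int))   # True
print(1 + False)               # 1

# Made-up titles, for illustration only.
sample_titles = [
    'Interference characterization',
    'Channel estimation for OFDM',
    'Co-channel interference mitigation',
]

count = 0
for title in sample_titles:
    count += ('nterference' in title)   # adds 1 on a match, 0 otherwise

print(count)                   # 2
print(type(count))             # <class 'int'>
```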

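Original poster | Posted on 2017-11-14 20:42:27
Solved.

    papercount = papercount+ ('nterference' in temp) if topic_number == '1' else papercount

and

    papercount += ('nterference' in temp)  if topic_number == '1' else papercount

are not equivalent. In the second line the entire conditional expression is the right-hand side of +=, so it means papercount += (('nterference' in temp) if topic_number == '1' else papercount): whenever the topic number does not match, papercount is added to itself and doubles. This pattern has to be written in the form

    a = b if c else d

and there is no room for carelessness here.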
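The difference between the two statements can be made visible with the standard `ast` module. This is a small sketch rather than part of the original program: `hit` and `topic` are placeholder names standing in for the `in` test and `topic_number`, and the three `step_*` functions are introduced here only for comparison.

```python
import ast

aug = ast.parse("papercount += hit if topic == '1' else papercount").body[0]
plain = ast.parse("papercount = papercount + hit if topic == '1' else papercount").body[0]

# In both statements the conditional expression is the whole right-hand side.
print(type(aug).__name__, type(aug.value).__name__)      # AugAssign IfExp
print(type(plain).__name__, type(plain.value).__name__)  # Assign IfExp
# With "=", the addition sits inside the true branch of the conditional.
print(type(plain.value.body).__name__)                   # BinOp
# With "+=", the else branch is papercount itself, so it gets added to papercount.
print(aug.value.orelse.id)                               # papercount


def step_aug(papercount, hit, topic):
    papercount += hit if topic == '1' else papercount
    return papercount


def step_plain(papercount, hit, topic):
    papercount = papercount + hit if topic == '1' else papercount
    return papercount


def step_if(papercount, hit, topic):
    # A plain if statement avoids the question of precedence altogether.
    if topic == '1':
        papercount += hit
    return papercount


print(step_aug(5, True, '2'))    # 10: doubled, because the topic does not match
print(step_plain(5, True, '2'))  # 5
print(step_if(5, True, '2'))     # 5
```

All three functions return 6 for `(5, True, '1')`; they differ only when the topic does not match, which is exactly the case that made the count explode.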