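Here is the situation: I have been working on my thesis and wanted to crawl conference papers from IEEE, so I wrote a crawler to tally the papers under each topic and see how the topics rank. While tallying, I found that the increment operation gives absurd results.
I first crawl the list of paper topics from the IEEE WCNC 2015 site, then use keywords from each topic to crawl paper entries from the conference proceedings page on ieeexplore.
Doing it this way ran into a problem: the incremented counter blows up numerically.
The main program looks like this: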
    from bs4 import BeautifulSoup
    import requests
    import exp  # my own module, listed further down

    download_url = 'http://wcnc2015.ieee-wcnc.org/call-for-papers'
    down_data = requests.get(download_url)
    soup = BeautifulSoup(down_data.text, 'lxml')
    topics = soup.select('#node-46 > div > div.field.field-name-body.field-type-text-with-summary.field-label-hidden > div > div > ul > li ')
    topicnumber = 1
    for topic in topics:
        paperamount = ''
        topicdata = {
            'topicid': str(topicnumber),
            'title': str(topic).split('</li>', 1)[0].split('\n\t\t', 1)[1],
            'amount': ''
        }
        topicnumber += 1
        # Turn the HTML entity back into a plain ampersand
        if '&amp;' in topicdata['title']:
            topicdata['title'] = topicdata['title'].replace('&amp;', '&')
        topicdata['amount'] = exp.get_amount_2015(topicdata['topicid'])
        print(topicdata)
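As a side note on the ampersand handling in the main program: the standard library's `html.unescape` decodes every HTML entity in one call, not just `&amp;`. A minimal sketch, where `clean_title` is a hypothetical helper and the sample title is made up for illustration:

```python
from html import unescape


def clean_title(raw):
    """Decode HTML entities and trim the whitespace around a topic title."""
    return unescape(raw).strip()


# Hypothetical raw text, shaped like the inner text of one <li> on the topics page
print(clean_title('\n\t\tModulation &amp; coding '))  # Modulation & coding
```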
The file with the function is as follows:

    def get_amount_2015(topic_number):
        from bs4 import BeautifulSoup
        import requests
        papercount = 0  # the counter
        for i in range(1, 50):  # assume at most about 50 pages; there is a break further down
            url = 'http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?filter%3DAND%28p_IS_Number%3A7127309%29&rowsPerPage=100&pageNumber=' + str(i) + '&resultAction=REFINE&resultAction=ROWS_PER_PAGE'
            wb_data = requests.get(url)
            soup = BeautifulSoup(wb_data.text, 'lxml')
            titles = soup.select('#results-blk > div > ul > li > div.txt > h3 > a > span')  # parse the list of titles
            next_exist = soup.select('#results-blk > div.pagination-wrap > div > a.next.ir')  # parse the "next page" link
            cur = str(i)  # current page number
            nex = str(next_exist).split('gotoPage(\'')[1].split('\')">')[0]  # page number the "next" link points to
            for title in titles:
                title = str(title.get_text())  # a bit verbose, but I am leaving it for now; it just counts entry by entry
                temp = title
                papercount += ('nterference' in temp) if topic_number == '1' else papercount  # this is where the problem shows up
                papercount += ((('ognitive' in temp) or ('ltra-wideband' in temp) ) and (('adio' in temp) or ('ireless' in temp) or ('ell' in temp))) if topic_number == '2' else papercount
                papercount += (('op' in temp) or('hop' in temp) or ('ooperat' in temp) ) if (topic_number == '3') else papercount
                papercount += (('Modul' in temp) or ('cod' in temp) or ('divers' in temp) ) if topic_number == '4' else papercount
                papercount += (('Equali' in temp) or ('Synchro' in temp) or (('estimation' in temp) and ('Channel' in temp))) if topic_number == '5' else papercount
                papercount += ( ('Space-time' in temp) or ('STC' in temp) or ('antenna' in temp) ) if topic_number == '6' else papercount
                papercount += ( ('OFDM' in temp) or ('Orthogonal Frequency' in temp) or ('Code Division Multiple' in temp) or ('CDMA' in temp) or ('Spread Spectrum' in temp)) if topic_number == '7' else papercount
                papercount += ((('Model' in temp) or ('Character' in temp)) and ('Channel' in temp)) if topic_number == '8' else papercount
                papercount += (('Interference' in temp) or (('Detection' in temp) and (('user' in temp) or ('User' in temp)))) if topic_number == '9' else papercount
                papercount += ('Iterati' in temp) if topic_number == '10' else papercount
                papercount += (('Theor' in temp) and (('Radio' in temp) or ('Wireless' in temp) or ('Cell' in temp))) if topic_number == '11' else papercount
                papercount += ('Signal' in temp) if topic_number == '12' else papercount
                papercount += ('Propagat' in temp) if topic_number == '13' else papercount
                papercount += (('Multiple' in temp) and ('Access' in temp)) if topic_number == '14' else papercount
                papercount += (('Cognit' in temp) or ('Cooperative' in temp)) if topic_number == '15' else papercount
                papercount += ('Collaborat' in temp) if topic_number == '16' else papercount
                papercount += (('Mesh' in temp) or ('Ad-hoc' in temp) or ('D2D' in temp) or ('Device-to-Device' in temp) or ('Relay' in temp) or ('Sensor' in temp)) if topic_number == '17' else papercount
                papercount += ('Theor' in temp) if topic_number == '18' else papercount
                papercount += (('Resource' in temp) or ('Allocation' in temp) or (('Resource' in temp) and ('Management' in temp)) or (('Resource' in temp) and ('Schedul' in temp))) if topic_number == '19' else papercount
                papercount += ((('Cross-layer' in temp) and ('A' in temp)) or (('Cross-layer' in temp) and ('Security' in temp))) if topic_number == '20' else papercount
                papercount += (('Software-Defined' in temp) or ('SDN' in temp) or ('RFID' in temp) or ('Radio Frequency Identification' in temp)) if topic_number == '21' else papercount
                papercount += (('Adaptab' in temp) or ('Reconfigur' in temp)) if topic_number == '22' else papercount
                papercount += ('Protocols' in temp) if topic_number == '23' else papercount
                papercount += (('B3G' in temp) or ('4G' in temp) or ('4G' in temp) or ('WiMAX' in temp) or ('WLAN' in temp) or ('WPAN' in temp) or ('Wi-Fi' in temp)) if topic_number == '24' else papercount
                papercount += ('QoS' in temp) if topic_number == '25' else papercount
                papercount += (('Local' in temp) or ('Position' in temp)) if topic_number == '26' else papercount
                papercount += (('Estimat' in temp) or ('Process' in temp)) if topic_number == '27' else papercount
                papercount += (('Mesh' in temp) or ('Ad-hoc' in temp) or ('D2D' in temp) or ('Device-to-Device' in temp) or ('Relay' in temp) or ('Sensor' in temp)) if topic_number == '28' else papercount
                papercount += (('Mobility' in temp) or ('Location' in temp) or ('Handoff' in temp) )if topic_number == '29' else papercount
                papercount += (('IP' in temp) or ('TCP' in temp) or ('UDP' in temp)) if topic_number == '30' else papercount
                papercount += (('Multicast' in temp) or ('Rout' in temp)) if topic_number == '31' else papercount
                papercount += ('Routing' in temp) or ('Rout' in temp) if topic_number == '32' else papercount
                papercount += (('Multimedia' in temp) or ('Traffic' in temp)) if topic_number == '33' else papercount
                papercount += (('Broadcast' in temp) or ('Multicast' in temp) or ('Stream' in temp)) if topic_number == '34' else papercount
                papercount += (('Congestion' in temp) or ('Admission' in temp) or ('Control' in temp)) if topic_number == '35' else papercount
                papercount += (('Middleware' in temp) or ('Proxies' in temp) or ('Proxy' in temp)) if topic_number == '36' else papercount
                papercount += (('Security' in temp) or ('Privacy' in temp)) if topic_number == '37' else papercount
                papercount += (('E2E' in temp) or ('End to End' in temp)) if topic_number == '38' else papercount
                papercount += ('Heterogeneous' in temp) if topic_number == '39' else papercount
                papercount += (('Capacity' in temp) or ('Throughput' in temp) or ('Outage' in temp) or ('Coverage' in temp)) if topic_number == '40' else papercount
                papercount += (('Emerging' in temp) or ('application' in temp) ) if topic_number == '41' else papercount
                papercount += ((('aware' in temp) or ('Aware' in temp) ) and (('Location' in temp) or ('Context' in temp)) )if topic_number == '42' else papercount
                papercount += (('Medicine' in temp) or ('Telemedicine' in temp) or ('Health' in temp)) if topic_number == '43' else papercount
                papercount += ('Transport' in temp) if topic_number == '44' else papercount
                papercount += ((('Cognitive' in temp) or ('Sensor' in temp)) and ('Application' in temp) or ((' a ' in temp) or ('A' in temp)) ) if topic_number == '45' else papercount
                papercount += ('Transport' in temp) if topic_number == '46' else papercount
                papercount += ((('Content' in temp) and ('distribution' in temp))or ('Home' in temp)) if topic_number == '47' else papercount
                papercount += (('Service' in temp) and (('Architecture' in temp) or ('Portab' in temp))) if topic_number == '48' else papercount
                papercount += ('Transport' in temp) if topic_number == '49' else papercount
                papercount += (('Interfaces' in temp) or ('P2P' in temp) or ('Peer-to-peer' in temp)) if topic_number == '50' else papercount
                papercount += (('Dynamic' in temp) or ('Autonomic' in temp)) if topic_number == '51' else papercount
                papercount += (('Regulation' in temp) or ('Standard' in temp) or ('Spectrum' in temp))if topic_number == '52' else papercount
                papercount += (('Test' in temp) or ('Prototype' in temp)) if topic_number == '53' else papercount
                # Note: the topic id '53' is repeated from the line above; 'Profil' now gets the "in temp" test it was missing
                papercount += (('Personal' in temp) or ('Discover' in temp) or ('Profil' in temp)) if topic_number == '53' else papercount
            if (cur == nex):
                break  # leave the loop once the last page is reached
        return str(papercount)
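For comparison, the same kind of keyword test can also be written as a lookup table keyed by topic id, so that only the rule for the requested topic ever touches the counter. This is only a sketch: `RULES` and `count_matches` are hypothetical names, just three of the topics above are copied in, and the sample titles are invented.

```python
# Hypothetical rule table: topic id -> test applied to one title string.
# Only three of the rules from the listing are reproduced here.
RULES = {
    '1': lambda t: 'nterference' in t,
    '10': lambda t: 'Iterati' in t,
    '14': lambda t: ('Multiple' in t) and ('Access' in t),
}


def count_matches(titles, topic_number):
    """Count the titles that satisfy the rule registered for topic_number."""
    rule = RULES.get(topic_number)
    if rule is None:
        return 0  # no rule registered for this topic id
    return sum(1 for title in titles if rule(title))


# Invented titles, for illustration only
sample = ['Interference Alignment', 'Iterative Decoding', 'Multiple Access Schemes']
print(count_matches(sample, '1'))  # 1
```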
Here is one line of the output:

    {'topicid': '1', 'title': 'Interference characterization', 'amount': '759177558963387793036328736532440653553919368838858179201651800073597053587528681717195616028176755023399202719530867500113454411442568396262888749080744434

Why is this number so huge?
It scared me into debugging right away to see what was going on.
It turned out that the counting itself had gone wrong.
I tried changing

    papercount += ('nterference' in temp) if topic_number == '1' else papercount

to

    if topic_number == '1':
        print('match start ------, papercount before:', papercount)
        print('boolean of the match, which looks fine too:', 'nterference' in temp)
        papercount += ('nterference' in temp)  # I was not convinced, so one more line to check whether my matching is at fault
        print('match done ------, papercount after:', papercount)
        print('===================================')
and the output, of all things, is:

    ===================================
    match start ------, papercount before: 0
    boolean of the match, which looks fine too: False
    match done ------, papercount after: 0
    ===================================
    match start ------, papercount before: 0
    boolean of the match, which looks fine too: False
    match done ------, papercount after: 0
    ===================================
    match start ------, papercount before: 0
    boolean of the match, which looks fine too: True
    match done ------, papercount after: 1
    ===================================
    match start ------, papercount before: 9007199254740992
    boolean of the match, which looks fine too: False
    match done ------, papercount after: 9007199254740992
    ===================================
    match start ------, papercount before: 81129638414606681695789005144064
    boolean of the match, which looks fine too: False
    match done ------, papercount after: 81129638414606681695789005144064
    ===================================
    match start ------, papercount before: 730750818665451459101842416358141509827966271488
    boolean of the match, which looks fine too: True
    match done ------, papercount after: 730750818665451459101842416358141509827966271489
    ===================================
    match start ------, papercount before: 6582018229284824168619876730229402019930943462543326652649177088
    boolean of the match, which looks fine too: False
    match done ------, papercount after: 6582018229284824168619876730229402019930943462543326652649177088
    ===================================
    match start ------, papercount before: 59285549689505892056868344324448208820874232148889098426616889693747311380791296
    boolean of the match, which looks fine too: False
    match done ------, papercount after: 59285549689505892056868344324448208820874232148889098426616889693747311380791296
    ===================================
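A quick arithmetic check on the numbers in this log, offered as an observation and not a diagnosis. The pairs in `logged` are copied from the printed before/after values; `growth_factors` is a hypothetical helper that divides each "before" value by the "after" value of the previous title.

```python
# (before, after) pairs copied from the debug output
logged = [
    (0, 0),
    (0, 0),
    (0, 1),
    (9007199254740992, 9007199254740992),
    (81129638414606681695789005144064, 81129638414606681695789005144064),
    (730750818665451459101842416358141509827966271488,
     730750818665451459101842416358141509827966271489),
    (6582018229284824168619876730229402019930943462543326652649177088,
     6582018229284824168619876730229402019930943462543326652649177088),
    (59285549689505892056868344324448208820874232148889098426616889693747311380791296,
     59285549689505892056868344324448208820874232148889098426616889693747311380791296),
]


def growth_factors(pairs):
    """Ratio of each 'before' value to the previous title's 'after' value."""
    factors = []
    for (_, prev_after), (before, _) in zip(pairs, pairs[1:]):
        if prev_after == 0:
            continue  # zero stays zero, so there is no ratio to compute
        quotient, remainder = divmod(before, prev_after)
        factors.append((quotient, remainder))
    return factors


for quotient, remainder in growth_factors(logged):
    print(quotient == 2 ** 53, remainder)
```

If the log is copied correctly, every ratio comes out as exactly 2**53 with no remainder, so between one title and the next the counter is multiplied by the same power of two.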
How did False turn into this? Could any expert help take a look?
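One reading that fits these numbers, as a hedged sketch: in Python a conditional expression binds more loosely than the tests inside it, so `papercount += X if cond else papercount` is parsed as `papercount += (X if cond else papercount)`. Whenever `cond` is false, the statement adds `papercount` to itself. The listing has 53 such lines after the one for topic '1', which would be consistent with the jump from 1 to 9007199254740992 (2**53) in the log. The function names and sample values below are hypothetical.

```python
import ast

# The statement from the listing, with `hit` standing in for the keyword test
stmt = "papercount += hit if topic_number == '1' else papercount"
node = ast.parse(stmt).body[0]
# The whole right-hand side is a single conditional expression
print(type(node).__name__, type(node.value).__name__)  # AugAssign IfExp


def buggy_step(papercount, hit, topic_number, wanted):
    # Same shape as the lines in the listing: doubles when the topic differs
    papercount += hit if topic_number == wanted else papercount
    return papercount


def guarded_step(papercount, hit, topic_number, wanted):
    # Only touch the counter when this rule belongs to the requested topic
    if topic_number == wanted:
        papercount += hit
    return papercount


print(buggy_step(3, True, '1', '2'))    # 6
print(guarded_step(3, True, '1', '2'))  # 3
```

Writing `else 0` in place of `else papercount`, or using an explicit `if` as in `guarded_step`, leaves the counter unchanged for rules that belong to other topics.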