Crawler code, Newbie Paradise, Technical Exchange, FishC Forum

爱笑的涛先森 posted on 2018-1-17 18:48:27

Crawler code

I want to crawl the page source of the JD Finance crowdfunding project pages. What I wrote is a bit rough (code attached).
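import random
import socket
import time
import urllib.request

# Set a global socket timeout (seconds); the sleep between requests is in run()
socket.setdefaulttimeout(20)

def url_open(url):
    req = urllib.request.Request(url)
    req.add_header("User-Agent", "……………………………………")  # I removed my User-Agent string

    # Proxy setup: pick one proxy at random for this request
    proxies = ['http://123.119.202.254:8118','http://113.67.164.198:8118','http://175.8.227.242:8118']
    proxy = random.choice(proxies)
    # Note: this mapping only applies to http:// URLs; https:// pages are fetched without the proxy
    proxy_support = urllib.request.ProxyHandler({'http':proxy})
    opener = urllib.request.build_opener(proxy_support)
    urllib.request.install_opener(opener)

    # Open the request only after the header and proxy are in place
    response = urllib.request.urlopen(req)
    html = response.read()
    response.close()
    return html

# Page through the project numbers and save each page's source
# e.g. 85337.html
def run():
    url = "https://z.jd.com/project/details/"
    start_page = 85337
    p = 0
    while p < 20000:
        page_num = start_page + p
        page_url = url + str(page_num) + ".html"

        print('==' + str(page_num) + '==start==')
        result = url_open(page_url)
        txt = str(p) + '.html'
        # Write and close each file inside the loop
        with open(txt, 'wb') as f:
            f.write(result)
        print('==' + str(p) + '====end==')
        p += 1
        time.sleep(5)  # pause between requests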
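
One thing the loop above does not handle is a page that fails to load: a single timeout or HTTP error would stop the whole run. Below is a minimal sketch of one way to keep going, not a definitive fix. The names `page_url`, `crawl`, `out_dir` and the `fetch` parameter are made up for this illustration and are not part of the original post; `fetch` stands for any function that takes a URL and returns bytes, such as `url_open`.

```python
import os

BASE_URL = "https://z.jd.com/project/details/"


def page_url(page_num):
    """Build the detail-page URL for one project number."""
    return BASE_URL + str(page_num) + ".html"


def crawl(fetch, start=85337, count=20000, out_dir="."):
    """Fetch `count` consecutive pages, save each one, and return the numbers that failed.

    `fetch` is any callable that takes a URL and returns the page as bytes.
    """
    failed = []
    for p in range(count):
        page_num = start + p
        try:
            data = fetch(page_url(page_num))
        except OSError as exc:
            # urllib.error.URLError, HTTPError and socket timeouts are all
            # subclasses of OSError, so one failed page is logged and skipped.
            print("==" + str(page_num) + "==failed==", exc)
            failed.append(page_num)
            continue
        # Same file naming as the post: 0.html, 1.html, ...
        with open(os.path.join(out_dir, str(p) + ".html"), "wb") as f:
            f.write(data)
    return failed
```

With the function from the post this could be called as `crawl(url_open)`, and the returned list of page numbers can be retried later.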

爱笑的涛先森 posted on 2018-1-17 18:50:44

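I set it up to imitate a browser (through the User-Agent header) and also configured proxy IPs (I have not verified whether the IPs actually work).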
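
Since the proxy IPs are unverified, one option is to test each one before using it. The following is only a rough sketch under some assumptions: `proxy_works` and `working_proxies` are hypothetical helper names, and `http://example.com/` is just a placeholder test page, not something from the post. It only tests plain http, matching the `{'http':proxy}` mapping in the posted code.

```python
import http.client
import urllib.request


def proxy_works(proxy, test_url="http://example.com/", timeout=5):
    """Return True if `test_url` can be fetched through `proxy` with status 200."""
    handler = urllib.request.ProxyHandler({"http": proxy})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Refused connections, timeouts, HTTP errors and malformed replies
        # all count as "this proxy is not usable".
        return False


def working_proxies(proxies, **kwargs):
    """Keep only the proxies that pass proxy_works(), preserving their order."""
    return [p for p in proxies if proxy_works(p, **kwargs)]
```

For example, `proxies = working_proxies(proxies)` before starting the crawl would drop the dead entries. Free proxies can stop working at any time, so a check like this only reflects the moment it ran.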
