As everyone knows, the ooxx page can't be fetched directly, so I took the chance to learn how to make Python masquerade as a browser.
import urllib.request

url = "http://www.ppmsg.net/siwameitui/201710/29335.html"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
           'Accept': 'image/webp,image/*,*/*;q=0.8',
           'Accept-Encoding': 'gzip, deflate, sdch',
           'Accept-Language': 'zh-CN,zh;q=0.8',
           'Connection': 'keep-alive',
           # Note: this Host does not match the URL above and can itself cause the server to reject the request
           'Host': 'pagead2.googlesyndication.com',
           'Referer': 'http://googleads.g.doubleclick.net/pagead/ads?client=ca-pub-4352453996606420&output=html&h=250&slotname=2477784167&adk=914185825&adf=3407270570&w=300&loeid=38893312&format=300x250&url=http%3A%2F%2Fjandan.net%2Fooxx&ea=0&flash=27.0.0&avail_w=336&wgl=1&adsid=NT&dt=1516093127008&bpp=9&bdt=56&fdt=11&idt=211&shv=r20180108&cbv=r20170110&saldr=aa&correlator=4715549173527&frm=23&ga_vid=2083610090.1516088230&ga_sid=1516093127&ga_hid=514950986&ga_fc=0&pv=2&iag=63&icsg=2&nhd=3&dssz=2&mdo=0&mso=0&u_tz=480&u_his=4&u_java=0&u_h=768&u_w=1366&u_ah=728&u_aw=1366&u_cd=24&u_nplug=13&u_nmime=30&adx=0&ady=0&biw=1309&bih=603&isw=336&ish=280&ifk=1482810460&scr_x=0&scr_y=0&eid=21061122%2C38893302%2C191880502%2C389613001%2C370204012&oid=3&nmo=1&zm=1.04&ref=http%3A%2F%2Fjandan.net%2Fooxx&rx=0&eae=2&fc=528&brdim=0%2C0%2C0%2C0%2C1366%2C0%2C1366%2C728%2C336%2C280&vis=1&rsz=%7C%7CaeE%7C&abl=CA&ppjl=f&pfx=0&fu=12&bc=1&ifi=1&dtd=342'}

# Pass the headers to Request directly; `addheaders` is an attribute of
# OpenerDirector, not of Request, so assigning it to a Request object
# silently does nothing.
req = urllib.request.Request(url, headers=headers)
data = urllib.request.urlopen(req).read()
print(data)
That's the code. Changing the 'Referer' value to None also works, and I've tried it on another site where it worked fine, but ooxx still returns 403. If I absolutely must crawl ooxx, how do I get around this?
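One thing worth checking first is whether the headers are actually being attached: `urllib.request.Request` only accepts headers through its constructor or `add_header()`, and it capitalizes the header names internally. A minimal sketch for verifying this (no request is sent; the Referer value here, pointing at the site itself, is just an assumption about what an anti-hotlinking check might want):

```python
import urllib.request

url = "http://jandan.net/ooxx/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
    "Referer": "http://jandan.net/",  # assumed: a same-site Referer
}

# Headers go in the constructor; Request stores keys capitalized,
# so look them up as "User-agent", not "User-Agent".
req = urllib.request.Request(url, headers=headers)

print(req.get_header("User-agent"))
print(req.has_header("Referer"))
# data = urllib.request.urlopen(req).read()  # the actual fetch; may still 403
```

If the printed values are `None`/`False`, the headers never made it into the request, and the server only ever saw urllib's default `Python-urllib` User-Agent.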
import urllib.request
import re
from bs4 import BeautifulSoup

url = 'http://jandan.net/ooxx/page-472#comments'
req = urllib.request.Request(url)
req.add_header('User-Agent', "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0")
response = urllib.request.urlopen(req)

html = response.read().decode('utf-8')
soup = BeautifulSoup(html, "lxml")
#_r = r'<a href="(.*?#comments)"'
#_result = re.findall(_r, html)
print(soup.img)