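I was bored today, so I tried scraping Douban. I came across https://www.douban.com/photos/album/1649942160/, an album of black-and-white comics that looked interesting, and I wanted to download the images in it.
Then I ran into this problem: 'UCS-2' codec can't encode characters in position 40276-40276: Non-BMP character not supported in Tk
I found the fix rather interesting, so I am posting it here for discussion. If anyone knows the principle behind it, please tell me in the comments. Thanks.
Here is the code: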
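import os
import re
import sys

import requests

# Map every code point above U+FFFF (outside the BMP) to U+FFFD, the
# replacement character, so the page text can be printed in a Tk-based shell.
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)

url = 'https://www.douban.com/photos/album/1649942160/'


def url_open(url):
    # Send a browser-like User-Agent and a Douban Referer with every request.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0',
        'Referer': 'https://www.douban.com/',
    }
    response = requests.get(url, headers=headers)
    return response


# Replace the non-BMP characters in the page source before using it.
html = url_open(url).text.translate(non_bmp_map)
# print(html)

# Capture the src of every 130-pixel-wide .jpg thumbnail on the page.
p = r'<img width="130" src="([^"]+\.jpg)"'
img_addrs = re.findall(p, html)
print(img_addrs)

# exist_ok=True keeps a second run from failing when the folder already exists.
os.makedirs("douban", exist_ok=True)
os.chdir("douban")

# Save the images as 1.jpg, 2.jpg, ... in the order they were found.
for x, each in enumerate(img_addrs, start=1):
    file = str(x) + ".jpg"
    with open(file, "wb") as f:
        img = url_open(each).content
        f.write(img)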
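
On the principle, as far as I understand it: the error message mentions Tk, so it most likely comes from printing the page source in a Tk-based shell such as IDLE, not from requests. Older Tk builds can only display characters in the Basic Multilingual Plane (U+0000 to U+FFFF), and the page apparently contains something above that range, such as an emoji. `str.translate` takes a dict that maps code points to replacements, so `non_bmp_map` swaps every character above U+FFFF for U+FFFD. Here is a minimal sketch of just that step; the `sample` string is made up for illustration and is not taken from the Douban page.

```python
import sys

# Same table as in the script: every code point above U+FFFF -> U+FFFD.
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xFFFD)

# Hypothetical sample text containing one non-BMP character (U+1F600).
sample = "manga \U0001F600 page"
cleaned = sample.translate(non_bmp_map)

# ascii() shows the escapes, so this is safe to print in any shell.
print(ascii(sample))
print(ascii(cleaned))

# After translation no character lies outside the BMP.
print(all(ord(ch) <= 0xFFFF for ch in cleaned))
```

If that reading is right, the translate call only matters when the text is printed in such a shell. The regex and the downloads should work on the untranslated text as well.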