Could someone check what is wrong with this crawler and help optimize it?, Python Discussion, Technical Discussion, 鱼C论坛

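chenyiyun posted on 2023-5-14 21:22:37

Could someone check what is wrong with this crawler and help optimize it?

A program that uses webdriver to scrape images from a dynamically loaded page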
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import requests
import os
PATH = os.getcwd()+"\\images\\崩坏：星穹铁道\\"

def init(browser):
    tmp=int(input("How many times to scroll down? (This affects how many posts are crawled; enter a number): "))
    # The browser object is created by the caller (main_script)

    # Open the search page
    url = 'https://www.miyoushe.com/sr/search?keyword=%E5%90%8C%E4%BA%BA%E5%9B%BE'
    browser.get(url)
    time.sleep(5)
    for i in range(tmp):  # each scroll to the bottom loads more posts
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)
    html=browser.page_source
    soup=BeautifulSoup(html,'html.parser')
    webs=soup.find_all('a',{"class": "mhy-router-link mhy-article-card__link"})
    for i in range(len(webs)):  # turn each relative link into an absolute URL
        webs[i]='https://www.miyoushe.com'+str(webs[i]['href'])
    return webs
# Wait for the page to load (this is the part that fails)
def wait1(browser):
    try:
        time.sleep(0.1)
        html=browser.page_source
        soup=BeautifulSoup(html,'html.parser')
        tmp=soup.find('div',{'class':'mhy-layout mhy-main-page mhy-article-page'}).find_all('img',{"class":""})
        print(tmp)
        return soup
    except:
        print('unaccept')
        wait1(browser)  # the result of this recursive call is discarded (see the replies)

def get_into_each_url(browser,url):
    browser.get(url)
    soup=wait1(browser)
    tmp1=str(soup.find('title').text)
    # Replace characters that are not allowed in Windows file names
    for bad,good in ((':','：'),('<','《'),('>','》'),('?','？'),('|','-')):
        tmp1=tmp1.replace(bad,good)
    # Assumes the first <span> in the update-time block holds the time text
    tmp2=str(soup.find('div',{'class':'mhy-article-page-updatetime'}).find('span').text)
    file_name=tmp1[:tmp1.find('-崩坏')]+' '+tmp2+'\\'
    print('Crawling images from post "'+tmp1[:tmp1.find('-崩坏')]+'"')
    imgs=soup.find_all('img',{"class":""})
    # Keep only image URLs from upload-bbs.miyoushe.com that carry a query string
    i=0
    while i<len(imgs):
        imgs[i]=imgs[i].get('src','')
        if (not("https://upload-bbs.miyoushe.com" in imgs[i])) or (not("?" in imgs[i])):
            del imgs[i]
            continue
        i+=1
    path=PATH+file_name
    if not os.path.exists(path):
        os.makedirs(path)
    for i in range(len(imgs)):
        res = requests.get(imgs[i])
        # The file extension is taken as the text after the last ',' in the URL
        j=len(imgs[i])-1
        while imgs[i][j]!=',':
            j-=1
        with open(path+str(i+1)+'.'+imgs[i][j+1:],'wb') as img:
            img.write(res.content)



def main_script():
    option=webdriver.ChromeOptions()
    option.add_argument('--headless')
    browser = webdriver.Chrome(options=option)
    urls=init(browser)
    for i in range(len(urls)):
        if 'https://www.miyoushe.com/bh3/404' in urls[i]:
            continue
        print(urls[i],end=':')
        get_into_each_url(browser,urls[i])
    browser.quit()
if __name__=="__main__":
    main_script()

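isdkz posted on 2023-5-14 21:24:21

There are several problems with this crawler:

1. In wait1, when an exception occurs the function calls itself recursively, but the result of that call is not returned. The outer call therefore returns None, and the code that uses the result fails. Add a return statement in front of the recursive call to wait1.

2. In get_into_each_url, the filter on image links is too loose. It should check whether the link starts with https://upload-bbs.miyoushe.com, not whether the link merely contains that string.

3. In get_into_each_url, every post gets its own folder named after the post title and update time, so crawling many posts produces many long folder names that are hard to manage. Consider a simpler, clearer naming scheme for the folders.

4. In get_into_each_url, images are named by their position on the page (1, 2, 3, ...), so files can be overwritten when two posts end up with the same folder name or when the crawler is run again. Consider naming each file by the MD5 digest of its content so that file names stay unique.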
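
Points 2 and 4 above can be sketched roughly as follows. This is only an illustration, not the original poster's code: the helper names is_post_image and image_file_name are hypothetical, and the check for "?" is carried over from the original filter.

```python
import hashlib

# Prefix taken from the original filter; the trailing slash is an added
# assumption so that look-alike hosts do not pass the check.
IMAGE_PREFIX = "https://upload-bbs.miyoushe.com/"


def is_post_image(url):
    """Return True if the URL starts with IMAGE_PREFIX and has a query string."""
    return url.startswith(IMAGE_PREFIX) and "?" in url


def image_file_name(content, extension):
    """Name a file after the MD5 digest of its bytes (hypothetical helper)."""
    return hashlib.md5(content).hexdigest() + "." + extension


if __name__ == "__main__":
    good = "https://upload-bbs.miyoushe.com/upload/a.jpg?x-oss-process=image/format,jpg"
    bad = "https://example.com/?next=https://upload-bbs.miyoushe.com/a.jpg"
    print(is_post_image(good), is_post_image(bad))
    print(image_file_name(b"example bytes", "jpg"))
```

With content-based names, downloading the same image twice writes the same file again instead of creating a second copy, and two different images get different names.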

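Optimization suggestions:

1. Where the data can be fetched directly over HTTP, use the requests library to send the requests instead of webdriver. This avoids the overhead of running a browser and makes crawling faster.

2. When crawling pages, use multithreading or asynchronous I/O to fetch several pages concurrently.

3. When downloading images, use a thread pool or coroutines to download several images concurrently.

4. When downloading images, use resumable downloads so that an interrupted download can continue instead of starting over.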
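
Suggestion 3 could look something like the sketch below, which uses the standard library's ThreadPoolExecutor. The function name download_all and its fetch parameter are hypothetical; fetch stands for any callable that takes a URL and returns the downloaded bytes.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, max_workers=8):
    """Fetch every URL in a thread pool and return the results in input order.

    `fetch` is a caller-supplied callable: fetch(url) -> bytes.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the order of `urls`, so result i belongs to urls[i]
        return list(pool.map(fetch, urls))


if __name__ == "__main__":
    # Stand-in fetcher so the sketch runs without network access
    def fake_fetch(url):
        return url.encode()

    print(download_all(["a", "b", "c"], fake_fetch, max_workers=2))
```

In the crawler, fetch could be something like lambda u: requests.get(u, timeout=10).content, with the returned list then written to disk in the existing loop. If fetch raises an exception for any URL, it propagates when the results are collected.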

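chenyiyun posted on 2023-5-14 21:43:23

isdkz posted on 2023-5-14 21:24
There are several problems with this crawler:

1. In wait1, when an exception occurs the function calls itself recursively, but does not return ...

Can the requests library really be used here?
The site refreshes its pages dynamically with JavaScript, so webdriver should be the simplest option, right?

I don't know how to do multithreading... (I'm a beginner with crawlers and haven't gone deep into them; I just suddenly felt like writing one.)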

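chenyiyun posted on 2023-5-14 21:44:50

isdkz posted on 2023-5-14 21:24
There are several problems with this crawler:

1. In wait1, when an exception occurs the function calls itself recursively, but does not return ...

The return value is supposed to be the variable soup, but for some reason it doesn't come back? I get NoneType...
Do I need to make it global?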
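
A minimal sketch of why a recursive call without return yields None, using two made-up functions rather than the crawler itself. It suggests that no global is needed, only a return in front of the recursive call.

```python
def countdown_without_return(n):
    if n == 0:
        return "done"
    # The inner call does return "done", but that value is thrown away here,
    # so this outer call falls off the end of the function and returns None.
    countdown_without_return(n - 1)


def countdown_with_return(n):
    if n == 0:
        return "done"
    # Passing the inner result back up makes every level return "done".
    return countdown_with_return(n - 1)


if __name__ == "__main__":
    print(countdown_without_return(0))  # done
    print(countdown_without_return(2))  # None
    print(countdown_with_return(2))     # done
```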

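Threebody1 posted on 2023-5-14 22:50:08

What is the error message?

歌者文明清理员 posted on 2023-5-14 22:51:48

It is probably failing to find .find_all(div, {'class': ''}),
so it recurses and fails again,
ending in a maximum recursion depth error.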
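
One way to avoid hitting the recursion limit is to poll in a loop with a timeout instead of recursing. The sketch below is an assumption about how that could look, not code from the thread: wait_until and its parameters are hypothetical names, and check stands for any function that returns None while the page is not ready.

```python
import time


def wait_until(check, timeout=10.0, interval=0.1):
    """Call check() until it returns something other than None.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = check()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


if __name__ == "__main__":
    attempts = {"n": 0}

    def ready_on_third_try():
        attempts["n"] += 1
        return "ready" if attempts["n"] >= 3 else None

    print(wait_until(ready_on_third_try, timeout=1.0, interval=0.01))
```

In the crawler, check could parse browser.page_source and return the soup once the article div is present, and None otherwise. Selenium also ships WebDriverWait for waiting on page conditions, which may be a better fit than a hand-written loop.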

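chenyiyun posted on 2023-5-15 18:58:48

歌者文明清理员 posted on 2023-5-14 22:51
It is probably failing to find .find_all(div, {'class': ''}),
so it recurses and fails again,
ending in a maximum recursion depth error.

May I ask which line you mean?

歌者文明清理员 posted on 2023-5-15 19:00:50

chenyiyun posted on 2023-5-15 18:58
May I ask which line you mean?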

line 31

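chenyiyun posted on 2023-5-15 19:14:45

chenyiyun posted on 2023-5-14 21:44
The return value is supposed to be the variable soup, but for some reason it doesn't come back? I get NoneType...
Do I need to make it global?

Ah, my mistake. I actually forgot to add the return.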
