Target URL: http://quotes.toscrape.com/js/
Target content: the quotes on the page, e.g. “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” by Albert Einstein
Problem: after stripping the leading part of the string, I cannot get valid JSON data.
- import requests
- import json
- from bs4 import BeautifulSoup
- url = "http://quotes.toscrape.com/js/"
- r = requests.get(url)
- soup = BeautifulSoup(r.text,"lxml")
- print(r.status_code)
- a = soup.findAll("script")[1].get_text()
- b = a[a.find("{"):a.find("];")-1]
- print(b)
- json_data = json.loads(b)
The error message is as follows:
- 200
- Traceback (most recent call last):
- File "F:\01_Python\try.py", line 1878, in <module>
- json_data = json.loads(b)
- File "C:\Python36\lib\json\__init__.py", line 354, in loads
- return _default_decoder.decode(s)
- File "C:\Python36\lib\json\decoder.py", line 342, in decode
- raise JSONDecodeError("Extra data", s, end)
- json.decoder.JSONDecodeError: Extra data: line 14 column 6 (char 450)
Printing b gives output in the following format (several entries removed for brevity):
- {
- "tags": [
- "change",
- "deep-thoughts",
- "thinking",
- "world"
- ],
- "author": {
- "name": "Albert Einstein",
- "goodreads_link": "/author/show/9810.Albert_Einstein",
- "slug": "Albert-Einstein"
- },
- "text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"
- },
- {
- "tags": [
- "abilities",
- "choices"
- ],
- "author": {
- "name": "J.K. Rowling",
- "goodreads_link": "/author/show/1077326.J_K_Rowling",
- "slug": "J-K-Rowling"
- },
- "text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d"
- },
- {
- "tags": [
- "humor",
- "obvious",
- "simile"
- ],
- "author": {
- "name": "Steve Martin",
- "goodreads_link": "/author/show/7103.Steve_Martin",
- "slug": "Steve-Martin"
- },
- "text": "\u201cA day without sunshine is like, you know, night.\u201d"
- }
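So I'm stuck on turning b into valid JSON data. Could anyone help? Thanks!
I have already tried a few simple checks, including inspecting the head and tail of b and replacing \n and spaces, but the error persists.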
Last edited by sky on 2018-1-19 10:36
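The data is a list of dictionaries.
Your string b starts at the first "{" and stops before the closing "]", so it holds several dictionaries separated by commas with nothing enclosing them.
json.loads parses the first dictionary and then reports "Extra data" at the comma that follows, so b cannot be loaded as it stands.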
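To see where "Extra data" comes from, here is a small sketch that uses a shortened, made-up string in place of the real page content (the names `fragment` and `wrapped` are only for illustration). The parser reads the first object completely, then finds more text where it expects the end of the input.

```python
import json

# Two JSON objects separated by a comma, with no enclosing [ ]:
# the same shape as b in the question, only shorter (illustrative data).
fragment = '{"name": "Albert Einstein"}, {"name": "J.K. Rowling"}'

try:
    json.loads(fragment)
    error_message = None
except json.JSONDecodeError as exc:
    # The first object parses fine; everything after it is "extra".
    error_message = exc.msg

print(error_message)

# Wrapping the same text in [ ] turns it into a valid JSON array.
wrapped = json.loads("[" + fragment + "]")
print(len(wrapped), wrapped[1]["name"])
```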
- b = a[a.find("["):a.find("];")+1]
Try it with the data in its original format, square brackets included.
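Putting the suggestion together, a minimal sketch of the extraction might look like the following. The `extract_quotes` helper and the `sample_script` text are hypothetical: the sample only imitates the `var data = [...];` layout that the question's output suggests, with abbreviated entries, so the code runs without a network request. With the real page you would pass in `soup.findAll("script")[1].get_text()` instead.

```python
import json

def extract_quotes(script_text):
    """Return the list stored as a JSON array inside the script text."""
    start = script_text.find("[")   # opening bracket of the array
    end = script_text.find("];")    # closing bracket followed by ';'
    if start == -1 or end == -1:
        raise ValueError("no JSON array found in script text")
    # end + 1 keeps the closing ']' so the slice is a complete array.
    return json.loads(script_text[start:end + 1])

# Stand-in for the script contents (assumed layout, abbreviated data).
sample_script = r'''
var data = [
    {
        "tags": ["change", "world"],
        "author": {"name": "Albert Einstein"},
        "text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"
    },
    {
        "tags": ["humor", "obvious"],
        "author": {"name": "Steve Martin"},
        "text": "\u201cA day without sunshine is like, you know, night.\u201d"
    }
];
for (var i in data) { /* page rendering code */ }
'''

quotes = extract_quotes(sample_script)
for quote in quotes:
    print(quote["text"], "by", quote["author"]["name"])
```

One caveat with this approach: `find("];")` returns the first match, so it would cut the array short if a quote's text ever contained the characters `];`.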