当网站有文本时,BeautifulSoup返回一个空字符串
在这里考虑这个网站:https : //dlnr.hawaii.gov/dsp/parks/oahu/ahupuaa-o-kahana-state-park/
我正在寻找右侧标题下的内容。这是我的示例代码,它应该返回内容列表但返回空字符串:
import requests as req
from bs4 import BeautifulSoup as bs
r = req.get('https://dlnr.hawaii.gov/dsp/parks/oahu/ahupuaa-o-kahana-state-park/').text
soup = bs(r)
par = soup.find('h3', text= 'Facilities')
for sib in par.next_siblings:
print(sib)
这将返回:
<ul>
<div></div>
</ul>
该网站不显示该类的任何 div 元素。此外,未捕获列表项。
回答
该框架中的设施和其他信息由 动态加载JavaScript,因此bs4在源中看不到它们,HTML因为它们根本不存在。
但是,您可以查询端点并获取所需的所有信息。
就是这样:
import json
import re
import time
import requests
headers = {
"user-agent": "Mozilla/5.0 (X11; Linux x86_64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/90.0.4430.93 Safari/537.36",
"referer": "https://dlnr.hawaii.gov/",
}
endpoint = f"https://stateparksadmin.ehawaii.gov/camping/park-site.json?parkId=57853&_={int(time.time())}"
response = requests.get(endpoint, headers=headers).text
data = json.loads(re.search(r"callback((.*));", response).group(1))
print("n".join(f for f in data["park info"]["facilities"]))
输出:
Boat Ramp
Campsites
Picnic table
Restroom
Showers
Trash Cans
Water Fountain
这是整个JSON:
{
"park info": {
"name": "Ahupuau02bba u02bbO Kahana State Park",
"id": 57853,
"island": "Oahu",
"activities": [
"Beachgoing",
"Camping",
"Dogs on Leash",
"Fishing",
"Hiking",
"Hunting",
"Sightseeing"
],
"facilities": [
"Boat Ramp",
"Campsites",
"Picnic table",
"Restroom",
"Showers",
"Trash Cans",
"Water Fountain"
],
"prohibited": [
"No Motorized Vehicles/ATV's",
"No Alcoholic Beverages",
"No Open Fires",
"No Smoking",
"No Commercial Activities"
],
"hazards": [],
"photos": [],
"location": {
"latitude": 21.556086,
"longitude": -157.875579
},
"hiking": [
{
"name": "Nakoa Trail",
"id": 17,
"activities": [
"Dogs on Leash",
"Hiking",
"Hunting",
"Sightseeing"
],
"facilities": [
"No Drinking Water"
],
"prohibited": [
"No Bicycles",
"No Open Fires",
"No Littering/Dumping",
"No Camping",
"No Smoking"
],
"hazards": [
"Flash Flood"
],
"photos": [],
"location": {
"latitude": 21.551087,
"longitude": -157.881228
},
"has_google_street": false
},
{
"name": "Kapau2018eleu2018ele Trail",
"id": 18,
"activities": [
"Dogs on Leash",
"Hiking",
"Sightseeing"
],
"facilities": [
"No Drinking Water",
"Restroom",
"Trash Cans"
],
"prohibited": [
"No Bicycles",
"No Open Fires",
"No Littering/Dumping",
"No Camping",
"No Smoking"
],
"hazards": [],
"photos": [],
"location": {
"latitude": 21.554744,
"longitude": -157.876601
},
"has_google_street": false
}
]
}
}
- Your initial attempt in the post above doesn't indicate anything about scrapy. You should not come up with any new requirement once your question is answered. However, you can always create a new post describing any new issue @maverick.