Scraping page links with Selenium always returns a limited number of links

I want to scrape all the match links from this page, "https://m.aiscore.com/basketball/20210610", but I only get a limited number of matches. I tried this code:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless") 
driver = webdriver.Chrome(executable_path=r"C:/chromedriver.exe", options=options)

url = 'https://m.aiscore.com/basketball/20210610'
driver.get(url)

driver.maximize_window()
driver.implicitly_wait(60) 

driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")    

soup = BeautifulSoup(driver.page_source, 'html.parser')

links = [i['href'] for i in soup.select('.w100.flex a')]
links_length = len(links)  # always returns 16
driver.quit()

When I run the code, I always get only 16 match links, but the page has 35 matches. I need to get all the match links on the page.

Answer

Since the site loads content as you scroll, I tried scrolling one screen at a time until the height we need to scroll to is larger than the total scroll height of the page.

I used a set to store the match links, to avoid adding links that are already there.

When I ran this, I was able to find all the links. Hope this works for you too.

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless") 
driver = webdriver.Chrome(executable_path=r"C:\Users\User\Downloads\chromedriver.exe", options=options)

url = 'https://m.aiscore.com/basketball/20210610'
driver.get(url)
# Wait till the webpage is loaded
time.sleep(2)

# Seconds to wait after each scroll
scroll_wait = 1

# Gets the screen height
screen_height = driver.execute_script("return window.screen.height;")
driver.implicitly_wait(60) 

# Number of scrolls. Initially 1
ScrollNumber = 1

# Set to store all the match links
ans = set()

while True:
    # Scroll down one screen at a time
    driver.execute_script(f"window.scrollTo(0, {screen_height * ScrollNumber})")
    ScrollNumber += 1
    
    # Wait for some time after scroll
    time.sleep(scroll_wait)
    
    # Updating the scroll_height after each scroll
    scroll_height = driver.execute_script("return document.body.scrollHeight;")
    
    # Fetching the data that we need - links to matches
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for j in soup.select('.w100 .flex a'):
        # set.add() ignores duplicates, so no membership check is needed
        ans.add(j['href'])
    # Break when the height we need to scroll to is larger than the scroll height
    if screen_height * ScrollNumber > scroll_height:
        break

driver.quit()

print(f'Links found: {len(ans)}')
Output:

Links found: 61
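
As a side note, newer Selenium 4 releases deprecate the executable_path argument in favor of a Service object, and the stopping condition can also be written by comparing the page's scroll height before and after each scroll, so no screen-height arithmetic is needed. Below is a minimal sketch of that variant, assuming the same page and selector; the chromedriver path is a placeholder:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
options.add_argument("--headless")
# Selenium 4 style: pass the driver path via a Service object
driver = webdriver.Chrome(service=Service(r"C:\chromedriver.exe"), options=options)

driver.get('https://m.aiscore.com/basketball/20210610')
time.sleep(2)

links = set()
last_height = driver.execute_script("return document.body.scrollHeight;")
while True:
    # Jump to the current bottom and give lazy-loaded content time to render
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    links.update(a['href'] for a in soup.select('.w100 .flex a'))
    # Stop once scrolling no longer increases the page height
    new_height = driver.execute_script("return document.body.scrollHeight;")
    if new_height == last_height:
        break
    last_height = new_height

driver.quit()
print(f'Links found: {len(links)}')

The height-comparison loop keeps scrolling as long as new content keeps arriving, which makes it robust if the number of matches on the page changes.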

