How to get all links from a website using puppeteer

html5 • September 19, 2022, 2:00 AM • Q&A

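Okay, I want a way to use puppeteer and a for loop to get all the links on a site and add them to an array. In this case, the links I want are not the ones in HTML tags; they are the links that appear directly in the source code, such as JavaScript file links. I want something like this: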

// `links` stands for the collection of links found on the site
const array = [];
for (const link of links) {
    // The code should take all the links and add these links to the array
    array.push(link);
}

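But how do I get all references to JavaScript and style files, and all URLs, in a website's source code? I have only found posts and questions that show how to get links from tags, not how to get all the links from the source code.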

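Suppose you want to get all the tags on this page, for example:

view-source:https://www.nike.com/

How do I get all the script tags and print them to the console? I mention view-source:https://nike.com because the script tags are visible there. I don't know whether this can be done without displaying the source code, but displaying the source and grabbing the script tags from it was my idea. I just don't know how to do it.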

Answer

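All the links can be obtained from a URL using only node.js, without puppeteer.

There are two main steps:

1. Get the source code of the URL.
2. Parse the source code for links.

A simple implementation in node.js: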

// get-links.js

///
/// Step 1: Request the URL's html source.
///

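const axios = require('axios');
const promise = axios.get('https://www.nike.com');

// Extract html source from response, then process it:
promise.then(function(response) {
    const htmlSource = response.data;
    getLinksFromHtml(htmlSource);
}).catch(function(error) {
    // Report request failures (e.g. a network error or an "Access Denied" response):
    console.error(error.message);
});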

///
/// Step 2: Find links in HTML source.
///

// This function takes HTML (as a string) and outputs all the links within.
function getLinksFromHtml(htmlString) {
    // Regular expression that matches syntax for a link (/sf/answers/266660481/):
    const LINK_REGEX = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_+.~#?&\/=]*)/gi;

    // Use the regular expression from above to find all the links
    // (match() returns null when nothing matches, so fall back to an empty array):
    const matches = htmlString.match(LINK_REGEX) || [];

    // Output to console:
    console.log(matches);

    // Alternatively, return the array of links for further processing:
    return matches;
}

Example usage:

$ node get-links.js
[
    'http://www.w3.org/2000/svg',
    ...
    'https://s3.nikecdn.com/unite/scripts/unite.min.js',
    'https://www.nike.com/android-icon-192x192.png',
    ...
    'https://connect.facebook.net/',
... 658 more items
]
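
The output above lists every URL-like string in the page source, including non-file entries such as the SVG namespace. If only the JavaScript file references are wanted, as the question asks, the array can be post-processed. The following is a minimal sketch and not part of the original answer: the helper name `filterScriptLinks` and the `.js` extension check are assumptions made for illustration.

```javascript
// Hypothetical helper (not from the original answer): keeps unique links
// whose URL path ends in ".js", ignoring any query string or fragment.
function filterScriptLinks(links) {
    const unique = new Set(links || []);
    return [...unique].filter(function (link) {
        try {
            return new URL(link).pathname.toLowerCase().endsWith('.js');
        } catch (err) {
            return false; // skip strings that are not valid absolute URLs
        }
    });
}

// Example with links taken from the sample output above (one repeated on purpose):
const sample = [
    'http://www.w3.org/2000/svg',
    'https://s3.nikecdn.com/unite/scripts/unite.min.js',
    'https://s3.nikecdn.com/unite/scripts/unite.min.js',
    'https://www.nike.com/android-icon-192x192.png',
];
console.log(filterScriptLinks(sample));
// prints [ 'https://s3.nikecdn.com/unite/scripts/unite.min.js' ]
```

Checking the parsed path rather than the raw string means a link such as `app.js?v=2` still counts as a script. Scripts served from URLs without a `.js` extension would be missed by this check.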

Notes:

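I used the axios library for simplicity, and to avoid "Access Denied" errors from nike.com. Any other method can be used to get the HTML source, such as:
- The native node.js http/https libraries
- Puppeteer (see "Get complete web page source html with puppeteer - but some part always missing")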
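
Since the question asks specifically about puppeteer, here is a rough sketch of how the Puppeteer option might look. It is not from the original answer and makes several assumptions: the `puppeteer` package is installed, the function names `extractLinks` and `getLinksWithPuppeteer` are made up for illustration, and the regular expression is a simplified stand-in for the one used above. It collects URLs in two ways: by listening to the requests the page makes while loading, and by scanning the rendered HTML returned by `page.content()`.

```javascript
// Sketch only: assumes `npm install puppeteer` has been run.
// Function names are illustrative, not from the original answer.

// Pulls URL-like strings out of text (simplified pattern, an assumption).
function extractLinks(text) {
    const LINK_REGEX = /https?:\/\/[^\s"'<>()\\]+/gi;
    return text.match(LINK_REGEX) || [];
}

async function getLinksWithPuppeteer(url) {
    // Loaded lazily so extractLinks can be used without puppeteer installed.
    const puppeteer = require('puppeteer');
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        const links = new Set();

        // Record every http(s) URL the page requests while loading
        // (scripts, stylesheets, images, and so on).
        page.on('request', function (request) {
            if (/^https?:/i.test(request.url())) {
                links.add(request.url());
            }
        });

        await page.goto(url, { waitUntil: 'networkidle2' });

        // Also scan the rendered HTML for URLs that were never requested.
        const html = await page.content();
        for (const link of extractLinks(html)) {
            links.add(link);
        }
        return [...links];
    } finally {
        await browser.close();
    }
}

// Usage (requires network access):
// getLinksWithPuppeteer('https://www.nike.com').then(console.log).catch(console.error);
```

A `Set` is used so that a URL seen both as a request and in the HTML is reported once. Some sites may refuse headless browsers, so the result can differ from what the axios version returns.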
