在C#中有效地在大文件中搜索字符串

html5 • 2022年11月2日 pm9:44 • 问答

我正在构建一个通过比较哈希来扫描文件的应用程序。我需要搜索超过 1GB 的哈希值来获取文件的哈希值。我为此找到了其他解决方案，例如 Aho-Corasick，但它比File.ReadLines(file).Contains(str).

这是迄今为止最快的代码，使用File.ReadLines. 扫描一个文件大约需要 8 秒，而使用 Aho-Corasick 扫描一个文件大约需要 2 分钟。由于显而易见的原因，我无法将整个哈希文件读入内存。

IEnumerable<DirectoryInfo> directories = new DirectoryInfo(scanPath).EnumerateDirectories();
IEnumerable<FileInfo> files = new DirectoryInfo(scanPath).EnumerateFiles();

FileInfo hashes = new FileInfo(hashPath);
await Task.Run(() =>
{
    IEnumerable<string> lines = File.ReadLines(hashes.FullName);
    
    foreach (FileInfo file in files) {
        if (!AuthenticodeTools.IsTrusted(file.FullName))
        {
            string hash = getHash(file.FullName);
            if (lines.Contains(hash)) flaggedFiles.Add(file.FullName);
        }
        filesScanned += 1;
    }
});
foreach (DirectoryInfo directory in directories)
{
    await scan(directory.FullName, hashPath);
    directoriesScanned += 1;
}

编辑：根据请求，以下是文件内容的示例：

5c269c9ec0255bbd9f4e20420233b1a7
63510b1eea36a23b3520e2b39c35ef4e
0955924ebc1876f0b849b3b9e45ed49d

它们是 MD5 哈希值。

回答

由于哈希固定为 32 个十六进制数字（16 个字节），因此它们应该以二进制格式存储，没有空格。我们可以通过简单的乘法对每个散列进行直接查找。

如果我们然后按顺序对文件中的散列进行排序，我们可以通过对每个散列进行二分搜索来加快速度。

可以使用以下CompareHashes函数作为比较函数进行排序。

完成后，我们可以进行二分查找。

二分搜索是一种搜索排序列表的简单算法。它具有O (log ₂ n) 复杂度，因此，对于您拥有的散列数量，最多只需要大约 25 次查找。算法如下：

从中间开始。
如果我们正在寻找的项目在那里，那么很好。
如果更早，则将搜索的高点更改为在此之前的一个。将差值除以二，然后循环回到步骤 2。
如果较晚，则将搜索的低点更改为在此之后的一个。将差值除以二，然后循环回到步骤 2。
如果我们到达最后一个，那么我们将找不到该项目。

（ArraySortHelper为此，我已经从.Net Framework 中提取并修改了一些代码。）

public static bool ContainsHash(FileStream hashFile, byte[] hash)
{
    const long hashSize = 16;
    var curHash = new byte[hashSize];
    long lo = 0;
    long hi = hashFile.Length / hashSize - 1;
    while (lo <= hi)
    {
        long i = lo + ((hi - lo) >> 1);
        hashFile.Read(curHash, i * hashSize, hashSize);

        int order = CompareHashes(curHash, hash);
 
        if (order == 0) return true;
        if (order < 0)
        {
            lo = i + 1;
        }
        else
        {
            hi = i - 1;
        }
    }
    return false;
}

public static int CompareHashes(byte[] b1, byte[] b2)
{
    var comp = 0;
    for (int i = 0; i < b1.Length; i++)
    {
        comp = b1[i].CompareTo(b2[i]);
        if(comp != 0) return comp;
    }
    return comp;
}

我们只需要打开哈希文件一次，并将哈希传递给函数FileStream，加上一个哈希进行比较。

我可能有一些轻微的错误，因为我没有测试过。我希望其他人测试和编辑这个答案。

@Zer0 I could claim that Charlieface's solution is nothing more than a tiny home-made database, consisting of a single table with one field. If this answer is in the scope of the OP's question, then so are memory mapped files and [Disk Based Data Structures](https://archive.codeplex.com/?p=mmf) and whatnot.

以上是在C#中有效地在大文件中搜索字符串的全部内容。

THE END

二维码

css实现盒子水平垂直居中–3种常用方法

< <上一篇

向Discord频道发送消息

下一篇>>

搜索内容

在C#中有效地在大文件中搜索字符串

回答

目录

目录

推荐文章

最新文章