.Net的“Random”类中的错误?
c#
我正在看一个问题,该问题讨论的是 Fisher-Yates shuffle 算法的错误实现,但我对错误实现时存在偏差感到困惑。
这两个算法是:
private Random _random = new Random();
public int[] FisherYates(int[] source)
{
int[] output = source.ToArray();
for (var i = 0; i < output.Length; i++)
{
var j = _random.Next(i, output.Length);
(output[i], output[j]) = (output[j], output[i]);
}
return output;
}
public int[] FisherYatesBad(int[] source)
{
int[] output = source.ToArray();
for (var i = 0; i < output.Length; i++)
{
var j = _random.Next(0, output.Length);
(output[i], output[j]) = (output[j], output[i]);
}
return output;
}
一个非常微妙的不同,但足以引起巨大的偏见。
良好的实施:
错误的实现:
为了清楚这些图,我从数字 0 到 99 开始,使用任何算法创建 10_000_000 次随机播放,然后对每个随机播放中的值进行平均以获得一组数字。如果 shuffle 尝试随机,那么所有 100 个数字都属于相同的正态分布。
现在,一切都很好,但我想我会检查这些方法是否产生有效的结果:
public int[] OrderByRandomNext(int[] source) => source.OrderBy(x => _random.Next()).ToArray();
public int[] OrderByRandomNextDouble(int[] source) => source.OrderBy(x => _random.NextDouble()).ToArray();
两者都很好,但它们是公平的洗牌吗?
OrderByRandomNext:
OrderByRandomNextDouble:
请注意,每个中的1和100数字都明显较低?
好吧,我认为这可能是OrderBy工作原理的人工制品。所以我用另一个随机数生成器测试了它——Eric Lippert 在他改进的随机系列中使用的一个。
public int[] OrderByBetterRandomNextDouble(int[] source) => source.OrderBy(x => BetterRandom.NextDouble()).ToArray();
public static class BetterRandom
{
private static readonly ThreadLocal<RandomNumberGenerator> crng =
new ThreadLocal<RandomNumberGenerator>(RandomNumberGenerator.Create);
private static readonly ThreadLocal<byte[]> bytes =
new ThreadLocal<byte[]>(() => new byte[sizeof(int)]);
public static int NextInt()
{
crng.Value.GetBytes(bytes.Value);
return BitConverter.ToInt32(bytes.Value, 0) & int.MaxValue;
}
public static double NextDouble()
{
while (true)
{
long x = NextInt() & 0x001FFFFF;
x <<= 31;
x |= (long)NextInt();
double n = x;
const double d = 1L << 52;
double q = n / d;
if (q != 1.0)
return q;
}
}
}
好吧,这是图表:
没有偏见!
这是我生成数据的代码(在 LINQPad 中运行):
void Main()
{
var n = 100;
var s = 1000000;
var numbers = Enumerable.Range(0, n).ToArray();
var algorithms = new Func<int[], int[]>[]
{
FisherYates,
OrderByRandomNext,
OrderByRandomNextDouble,
OrderByBetterRandomNextDouble,
};
var averages =
algorithms
.Select(algorithm =>
Enumerable
.Range(0, numbers.Length)
.Select(x =>
Enumerable
.Range(0, s)
.Select(y => algorithm(numbers))
.Aggregate(0.0, (a, v) => a + (double)v[x] / s))
.ToArray())
.Select(x => new
{
averages = x,
distribution = Accord.Statistics.Distributions.Univariate.NormalDistribution.Estimate(x.Skip(1).SkipLast(1).ToArray()),
first = x.First(),
last = x.Last(),
})
.Select(x => new
{
x.averages,
x.distribution,
x.first,
x.last,
first_prob =x.distribution.DistributionFunction(x.first),
last_prob = x.distribution.DistributionFunction(x.last),
})
.ToArray();
var d =
averages.Dump();
}
private Random _random = new Random();
public int[] FisherYates(int[] source)
{
int[] output = source.ToArray();
for (var i = 0; i < output.Length; i++)
{
var j = _random.Next(i, output.Length);
(output[i], output[j]) = (output[j], output[i]);
}
return output;
}
public int[] OrderByRandomNext(int[] source) => source.OrderBy(x => _random.Next()).ToArray();
public int[] OrderByRandomNextDouble(int[] source) => source.OrderBy(x => _random.NextDouble()).ToArray();
public int[] OrderByBetterRandomNextDouble(int[] source) => source.OrderBy(x => BetterRandom.NextDouble()).ToArray();
public static class BetterRandom
{
private static readonly ThreadLocal<RandomNumberGenerator> crng =
new ThreadLocal<RandomNumberGenerator>(RandomNumberGenerator.Create);
private static readonly ThreadLocal<byte[]> bytes =
new ThreadLocal<byte[]>(() => new byte[sizeof(int)]);
public static int NextInt()
{
crng.Value.GetBytes(bytes.Value);
return BitConverter.ToInt32(bytes.Value, 0) & int.MaxValue;
}
public static double NextDouble()
{
while (true)
{
long x = NextInt() & 0x001FFFFF;
x <<= 31;
x |= (long)NextInt();
double n = x;
const double d = 1L << 52;
double q = n / d;
if (q != 1.0)
return q;
}
}
}
这是我生成的数据:
分布| 第一 | 最后 | first_prob | 最后一个问题 -------------------------------------------------- ------ | ------------------ | ------------------ | --------------- | --------------------- N(x; ? = 49.50267467345823, ?² = 0.0008896228453062147) | 49.505465999987585 | 49.49833699998965 | 0.5372807100387846 | 0.44218570467529394 N(x; ? = 49.50503062243786, ?² = 0.0009954477334487531) | 49.36330799998817 | 49.37124399998651 | 3.529550818615057E-06 | 1.115772521409486E-05 N(x; ? = 49.505720877539765, ?² = 0.0008257970106087029) | 49.37231699998847 | 49.386660999990106 | 1.7228855271333998E-06 | 1.712972513601141E-05 N(x; ? = 49.49994663264188, ?² = 0.0007518765247716318) | 49.50191999998847 | 49.474235999989205 | 0.5286859991636343 | 0.17421285127499514
这是我的问题。用什么的了System.Random,偏置它引入了?
回答
.NET 中(包括).NET 5 中的默认 RNG 具有已知的偏差和性能问题,大部分记录在此处https://github.com/dotnet/runtime/issues/23198:
- Donald E. Knuth 的减法随机数生成器实现中的一个错字,实际效果未知。
- 具有未知实际效果的不同模数(2^32-1 而不是 2 的幂)。
Next(0, int.MaxValue)有很大的偏见。NextDouble()只产生 2^31 个可能的值,它可以从大约 2^62 个不同的值。
这就是 .NET 6 实现更好算法 ( xoshiro256** ) 的原因。当您在new Random()没有种子的情况下实例化实例时,您将获得更好的 RNG 。这在https://github.com/dotnet/runtime/pull/47085 中有描述。不幸的是,在提供种子时替换旧的 RNG 并不容易,因为人们可能依赖当前的、有偏见的 RNG 的行为。
尽管 xoshiro256** 也有一些记录在案的缺陷(和反驳),但我发现它非常适合我的目的。我已经从 .NET 6复制了改进的实现并使用了它。
旁注:LINQ 查询是惰性求值的(又名“延迟执行”)。如果您在.OrderBylambda 中使用 RNG,如果您迭代多次,您可能会得到令人困惑的结果,因为每次都可能更改顺序。一些排序算法依赖于这样一个事实,即元素不会突然改变它们的相对顺序才能正常工作。返回不一致的排序值会破坏这种排序算法。当然,今天OrderBy在 LINQ-to-Objects 中的实现工作正常,但没有文件保证它必须使用“随机”变化的值。一个合理的选择是.OrderBy(e => HashCode.Combine(0x1337, e))。
- For those still not using .Net 6 (because it's not yet released :P), one can use the [XoshiroPRNG.Net](https://www.nuget.org/packages/XoshiroPRNG.Net/) package. It provides a performant implementation of several PRNGs in the "Xoshiro family". I am the maintainer of that package, and if you need an algo I haven't coded, feel free to ask.
- I think there might be some misunderstanding here about the practice of producing a shuffle by providing a random key. Though as the question notes, certain choices of RNG can produce bias, the fundamental technique of "associate a random key with each item and sort by that key" is fine in LINQ. The implementation generates each key exactly once and stores it; the key generating lambda is NOT called every time that the item is encountered during the sort.
- All that said, the more general point made in this answer is worth keeping in mind. Executing a query twice can unexpectedly produce different results, either because the underlying collection changed or because the lambdas in the query are impure. If your program logic relies upon queries producing the same results twice, then you need to do work to ensure that you are not in one of those situations.