对数样条产生不一致的结果
我需要我的 R 脚本始终为相同的输入生成相同的结果。但是,我注意到logspline包中的函数logspline在 Windows 操作系统和 ubuntu 操作系统上产生不同的输出。不过,两者都使用logspline包 2.1.16。
这是代码:
x <- c(0.3453205379,0.3497529927,0.3460179029,0.3433414591,0.3565925053,0.3318019585,0.3322091870,0.3314076990,0.3413768315,0.3305650805,0.3342775671,0.3362692445,0.3321054345,0.5982416984,0.5509602046,0.6600000000,0.3339818725,0.3307459063,0.3314632807,0.3356930476,0.3300000000,0.3324504116,0.3470739551,0.3441385006,0.3316070520,0.3399635743,0.3316989471,0.3308044524,0.3536822479,0.3315414656)
fnc <- logspline(x)
plot.logspline(fnc)
x <- c(0.3453205379,0.3497529927,0.3460179029,0.3433414591,0.3565925053,0.3318019585,0.3322091870,0.3314076990,0.3413768315,0.3305650805,0.3342775671,0.3362692445,0.3321054345,0.5982416984,0.5509602046,0.6600000000,0.3339818725,0.3307459063,0.3314632807,0.3356930476,0.3300000000,0.3324504116,0.3470739551,0.3441385006,0.3316070520,0.3399635743,0.3316989471,0.3308044524,0.3536822479,0.3315414656)
fnc <- logspline(x)
plot.logspline(fnc)
这是窗户上的结果 - 装有 7 节:
这是 ubuntu 上的结果 - 配备 5 节:
有人可以向我解释差异的原因吗?有没有办法强制函数在任何环境中产生一致的结果?
logspline(x, error.action=0) 在 Windows 上输出:
logspline(x, error.action=0) linux上的输出:
knots A(1)/D(2) loglik AIC minimum penalty maximum penalty
4 2 40.53 -70.85 113.83 Inf
5 2 97.44 -181.28 5.48 113.83
6 2 99.48 -181.96 NA NA
7 2 102.92 -185.44 -0.13 5.48
8 1 102.86 -181.90 NA NA
the present optimal number of knots is 7
penalty(AIC) was the default: BIC=log(samplesize): log( 30 )= 3.4
sessionInfo() 在 Windows 上输出:
knots A(1)/D(2) loglik AIC minimum penalty maximum penalty
4 2 87.33 -164.46 8.38 Inf
5 2 91.52 -169.44 2.07 8.38
6 1 91.95 -166.88 NA NA
7 1 93.60 -166.78 0.00 2.07
8 1 93.60 -163.38 NA NA
the present optimal number of knots is 5
penalty(AIC) was the default: BIC=log(samplesize): log( 30 )= 3.4
sessionInfo() linux上的输出:
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] logspline_2.1.16
loaded via a namespace (and not attached):
[1] compiler_4.1.0 tools_4.1.0
回答
Not a complete answer, but this was way too long for a comment.
I tried different systems and this is what I found:
- docker:
rocker/r-base:4.1.0returns 5 knots (gccandgfortranversion 10.2.1LAPACKopenblas-pthread/libopenblasp-r0.3.13) - docker:
rocker/r-base:3.6.3returns 5 knots (gccandgfortranversion 9.2.1-30LAPACKlapack/liblapack.so.3.9.0) - Fresh docker install Ubuntu 21.04 (alt 20.04) with 4.1.0 returns 5 knots (
gccandgfortranversion 10.3.0 (9.3.0 for ubuntu:20.04)LAPACKlapack/liblapack.so.3.9.0) - Windows 4.1.0 returns 7 knots (not sure which
gcc,gfortran,LAPACKversion was used) - Ubuntu 18.04 (docker)
ubuntu:18.04with R 4.1.0 returns 7 knots (gccandgfortranversion 7.5.0,LAPACKversion liblapack.so.3.7.1, alternatively tested with /atlas/liblapack.so.3.10.3, same results) - Ubuntu 18.04 with 4.0.2 returns 7 knots (
gccandgfortranversion 7.5.0LAPACKlibopenblasp-r0.2.20)
The package logspline uses multiple fortran and C functions, one of them being nlogcensor. The code below calculates many things, among others the nknots.
MWE for the internal code in logspline::logspline (line 483ff of R/logspline.R in version 0.2.8 (github), for the CRAN version (current 0.2.16) it is line 486ff).
logspline() calls in nlogcensor() a C function which is defined in src/nlsd.c at line 166ff (latest version line 167ff).
library(logspline)
intpars <- c(30L, 0L, 0L, 0L, 1L, 1L, -1L)
data <- c(0.33, 0.3305650805, 0.3307459063, 0.3308044524, 0.331407699,
0.3314632807, 0.3315414656, 0.331607052, 0.3316989471, 0.3318019585,
0.3321054345, 0.332209187, 0.3324504116, 0.3339818725, 0.3342775671,
0.3356930476, 0.3362692445, 0.3399635743, 0.3413768315, 0.3433414591,
0.3441385006, 0.3453205379, 0.3460179029, 0.3470739551, 0.3497529927,
0.3536822479, 0.3565925053, 0.5509602046, 0.5982416984, 0.66,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0)
dpars <- c(-1, 0, 0)
maxp <- 65
kts <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0)
# actual code
z <- .C("nlogcensor",
ip = as.integer(intpars),
coef = as.double(data),
dp = as.double(dpars),
logl = as.double(rep(0, maxp)),
ad = as.integer(rep(0, maxp)),
kts = as.double(kts),
PACKAGE = "logspline")
z$ip[2]
#> 7 some older systems, 5 for newer systems
This narrows down the source of error. And this is where my current journey ends. Unfortunately, I don't have much C experience and no FORTRAN experience.
I tried to follow the C routines, but the code is little formatted and hard to read for me. In line 328 of nlsd.c (version 0.2.16 line 335), the final nknots is assigned as intpars[1]=(*spc).nk; (C is zero indexed, R is 1 indexed).
Maybe you can reprodue the error using the nlogcensor code call and confirm different different Fortran or LAPACK toolchains.
Answer Attempt
As can be seen from the text above, there is no simple way to make the package fully reproducible. That is, set.seed(123) and controlling the version of R and the package itself does not help as I suspect the toolchain version is different and leads to the differences.
The results from the different tests above suggest, that if you have a liblapack version >= 3.9.0 (or libopenblasp >= r0.3.13), you get 5 knots. Older versions lead to 7 knots.
As you usually have little control over this, you might want to look into docker and use a specified version, e.g., rocker/r-base:4.1.0. This might be a bit of an overkill for this problem (not sure how or for what you use the code), but this is the only way I currently see you can achieve full reproducibility as you have control over the toolchain as well.
Hope that clears your problem a bit or points you to a potential solution.
Docker Scripts
In case you want to replicate the results:
For the rocker/r-base images, just use docker run --rm -it rocker/r-base:4.1.0 and then within R install.packages("logspline") and the MWE code from above.
The full install script for ubuntu is a bit more complicated, see below:
library(logspline)
intpars <- c(30L, 0L, 0L, 0L, 1L, 1L, -1L)
data <- c(0.33, 0.3305650805, 0.3307459063, 0.3308044524, 0.331407699,
0.3314632807, 0.3315414656, 0.331607052, 0.3316989471, 0.3318019585,
0.3321054345, 0.332209187, 0.3324504116, 0.3339818725, 0.3342775671,
0.3356930476, 0.3362692445, 0.3399635743, 0.3413768315, 0.3433414591,
0.3441385006, 0.3453205379, 0.3460179029, 0.3470739551, 0.3497529927,
0.3536822479, 0.3565925053, 0.5509602046, 0.5982416984, 0.66,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0)
dpars <- c(-1, 0, 0)
maxp <- 65
kts <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0)
# actual code
z <- .C("nlogcensor",
ip = as.integer(intpars),
coef = as.double(data),
dp = as.double(dpars),
logl = as.double(rep(0, maxp)),
ad = as.integer(rep(0, maxp)),
kts = as.double(kts),
PACKAGE = "logspline")
z$ip[2]
#> 7 some older systems, 5 for newer systems
and then use the code from the MWE above.