对数样条产生不一致的结果

我需要我的 R 脚本始终为相同的输入生成相同的结果。但是,我注意到logspline包中的函数logspline在 Windows 操作系统和 ubuntu 操作系统上产生不同的输出。不过,两者都使用logspline包 2.1.16。

这是代码:

x <- c(0.3453205379,0.3497529927,0.3460179029,0.3433414591,0.3565925053,0.3318019585,0.3322091870,0.3314076990,0.3413768315,0.3305650805,0.3342775671,0.3362692445,0.3321054345,0.5982416984,0.5509602046,0.6600000000,0.3339818725,0.3307459063,0.3314632807,0.3356930476,0.3300000000,0.3324504116,0.3470739551,0.3441385006,0.3316070520,0.3399635743,0.3316989471,0.3308044524,0.3536822479,0.3315414656)
fnc <- logspline(x)
plot.logspline(fnc)
x <- c(0.3453205379,0.3497529927,0.3460179029,0.3433414591,0.3565925053,0.3318019585,0.3322091870,0.3314076990,0.3413768315,0.3305650805,0.3342775671,0.3362692445,0.3321054345,0.5982416984,0.5509602046,0.6600000000,0.3339818725,0.3307459063,0.3314632807,0.3356930476,0.3300000000,0.3324504116,0.3470739551,0.3441385006,0.3316070520,0.3399635743,0.3316989471,0.3308044524,0.3536822479,0.3315414656)
fnc <- logspline(x)
plot.logspline(fnc)

这是窗户上的结果 - 装有 7 节:

这是 ubuntu 上的结果 - 配备 5 节:

有人可以向我解释差异的原因吗?有没有办法强制函数在任何环境中产生一致的结果?

logspline(x, error.action=0) 在 Windows 上输出

logspline(x, error.action=0) linux上的输出

 knots A(1)/D(2) loglik     AIC minimum penalty maximum penalty
     4         2  40.53  -70.85          113.83             Inf
     5         2  97.44 -181.28            5.48          113.83
     6         2  99.48 -181.96              NA              NA
     7         2 102.92 -185.44           -0.13            5.48
     8         1 102.86 -181.90              NA              NA
the present optimal number of knots is  7 
penalty(AIC) was the default: BIC=log(samplesize): log( 30 )= 3.4 

sessionInfo() 在 Windows 上输出

 knots A(1)/D(2) loglik     AIC minimum penalty maximum penalty
     4         2  87.33 -164.46            8.38             Inf
     5         2  91.52 -169.44            2.07            8.38
     6         1  91.95 -166.88              NA              NA
     7         1  93.60 -166.78            0.00            2.07
     8         1  93.60 -163.38              NA              NA
the present optimal number of knots is  5 
penalty(AIC) was the default: BIC=log(samplesize): log( 30 )= 3.4

sessionInfo() linux上的输出

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] logspline_2.1.16

loaded via a namespace (and not attached):
[1] compiler_4.1.0 tools_4.1.0  

回答

Not a complete answer, but this was way too long for a comment.

I tried different systems and this is what I found:

  • docker: rocker/r-base:4.1.0 returns 5 knots (gcc and gfortran version 10.2.1 LAPACK openblas-pthread/libopenblasp-r0.3.13)
  • docker: rocker/r-base:3.6.3 returns 5 knots (gcc and gfortran version 9.2.1-30 LAPACK lapack/liblapack.so.3.9.0)
  • Fresh docker install Ubuntu 21.04 (alt 20.04) with 4.1.0 returns 5 knots (gcc and gfortran version 10.3.0 (9.3.0 for ubuntu:20.04) LAPACK lapack/liblapack.so.3.9.0)
  • Windows 4.1.0 returns 7 knots (not sure which gcc, gfortran, LAPACK version was used)
  • Ubuntu 18.04 (docker) ubuntu:18.04 with R 4.1.0 returns 7 knots (gcc and gfortran version 7.5.0, LAPACK version liblapack.so.3.7.1, alternatively tested with /atlas/liblapack.so.3.10.3, same results)
  • Ubuntu 18.04 with 4.0.2 returns 7 knots (gcc and gfortran version 7.5.0 LAPACK libopenblasp-r0.2.20)

The package logspline uses multiple fortran and C functions, one of them being nlogcensor. The code below calculates many things, among others the nknots.

MWE for the internal code in logspline::logspline (line 483ff of R/logspline.R in version 0.2.8 (github), for the CRAN version (current 0.2.16) it is line 486ff).

logspline() calls in nlogcensor() a C function which is defined in src/nlsd.c at line 166ff (latest version line 167ff).

library(logspline)
intpars <- c(30L, 0L, 0L, 0L, 1L, 1L, -1L)
data <- c(0.33, 0.3305650805, 0.3307459063, 0.3308044524, 0.331407699, 
          0.3314632807, 0.3315414656, 0.331607052, 0.3316989471, 0.3318019585, 
          0.3321054345, 0.332209187, 0.3324504116, 0.3339818725, 0.3342775671, 
          0.3356930476, 0.3362692445, 0.3399635743, 0.3413768315, 0.3433414591, 
          0.3441385006, 0.3453205379, 0.3460179029, 0.3470739551, 0.3497529927, 
          0.3536822479, 0.3565925053, 0.5509602046, 0.5982416984, 0.66, 
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
          0, 0)
dpars <- c(-1, 0, 0)
maxp <- 65
kts <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
         0, 0, 0)

# actual code
z <- .C("nlogcensor",
        ip = as.integer(intpars),
        coef = as.double(data),
        dp = as.double(dpars),
        logl = as.double(rep(0, maxp)),
        ad = as.integer(rep(0, maxp)),
        kts = as.double(kts),
        PACKAGE = "logspline")

z$ip[2]
#> 7 some older systems, 5 for newer systems

This narrows down the source of error. And this is where my current journey ends. Unfortunately, I don't have much C experience and no FORTRAN experience.

I tried to follow the C routines, but the code is little formatted and hard to read for me. In line 328 of nlsd.c (version 0.2.16 line 335), the final nknots is assigned as intpars[1]=(*spc).nk; (C is zero indexed, R is 1 indexed).

Maybe you can reprodue the error using the nlogcensor code call and confirm different different Fortran or LAPACK toolchains.

Answer Attempt

As can be seen from the text above, there is no simple way to make the package fully reproducible. That is, set.seed(123) and controlling the version of R and the package itself does not help as I suspect the toolchain version is different and leads to the differences.

The results from the different tests above suggest, that if you have a liblapack version >= 3.9.0 (or libopenblasp >= r0.3.13), you get 5 knots. Older versions lead to 7 knots.

As you usually have little control over this, you might want to look into docker and use a specified version, e.g., rocker/r-base:4.1.0. This might be a bit of an overkill for this problem (not sure how or for what you use the code), but this is the only way I currently see you can achieve full reproducibility as you have control over the toolchain as well.

Hope that clears your problem a bit or points you to a potential solution.

Docker Scripts

In case you want to replicate the results:

For the rocker/r-base images, just use docker run --rm -it rocker/r-base:4.1.0 and then within R install.packages("logspline") and the MWE code from above.

The full install script for ubuntu is a bit more complicated, see below:

library(logspline)
intpars <- c(30L, 0L, 0L, 0L, 1L, 1L, -1L)
data <- c(0.33, 0.3305650805, 0.3307459063, 0.3308044524, 0.331407699, 
          0.3314632807, 0.3315414656, 0.331607052, 0.3316989471, 0.3318019585, 
          0.3321054345, 0.332209187, 0.3324504116, 0.3339818725, 0.3342775671, 
          0.3356930476, 0.3362692445, 0.3399635743, 0.3413768315, 0.3433414591, 
          0.3441385006, 0.3453205379, 0.3460179029, 0.3470739551, 0.3497529927, 
          0.3536822479, 0.3565925053, 0.5509602046, 0.5982416984, 0.66, 
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
          0, 0)
dpars <- c(-1, 0, 0)
maxp <- 65
kts <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
         0, 0, 0)

# actual code
z <- .C("nlogcensor",
        ip = as.integer(intpars),
        coef = as.double(data),
        dp = as.double(dpars),
        logl = as.double(rep(0, maxp)),
        ad = as.integer(rep(0, maxp)),
        kts = as.double(kts),
        PACKAGE = "logspline")

z$ip[2]
#> 7 some older systems, 5 for newer systems

and then use the code from the MWE above.


以上是对数样条产生不一致的结果的全部内容。
THE END
分享
二维码
< <上一篇
下一篇>>