Returning NaN for simple multiple linear regression case in 1.4.1 #21
Thanks, I'll look into it. |
In 1.4.1:
>> require 'statsample'
>> Statsample::VERSION
=> "1.4.1"
>> @a=[27.0, 12.0, 16.0, 25.0].to_vector(:scale)
>> @b=[10.0, 15.0, 19.0, 2.0].to_vector(:scale)
>> @y=[1, 1, 1, 1].to_vector(:scale)
>> ds={'a'=>@a,'b'=>@b,'y'=>@y}.to_dataset
>> lr=Statsample::Regression::Multiple::RubyEngine.new(ds,'y')
>> lr.r
=> NaN
>> lr.r2
=> NaN
>> lr.coeffs.each do |k, v|
| puts "#{k}: #{v}"
| end
a: NaN
b: NaN
=> {"a"=>NaN, "b"=>NaN}
And in 1.4.0:
>> gem "statsample", "=1.4.0"
>> require 'statsample'
>> Statsample::VERSION
=> "1.4.0"
>> @a=[27.0, 12.0, 16.0, 25.0].to_vector(:scale)
>> @b=[10.0, 15.0, 19.0, 2.0].to_vector(:scale)
>> @y=[1, 1, 1, 1].to_vector(:scale)
>> ds={'a'=>@a,'b'=>@b,'y'=>@y}.to_dataset
>> lr=Statsample::Regression::Multiple::RubyEngine.new(ds,'y')
>> lr.r
=> NaN
>> lr.r2
=> NaN
>> lr.coeffs.each do |k, v|
| puts "#{k}: #{v}"
| end
a: NaN
b: NaN
=> {"a"=>NaN, "b"=>NaN}
I just downloaded 1.4.1 and 1.4.0 from rubygems and got the same result regardless of the version used. I will have to look at how Regression::Multiple is implemented to get a better idea of what is happening. Thanks for creating a test, it'll be useful. Please post here if you find anything that can help. :) |
I tried to simplify as much as possible while still being able to see the bad behaviour, but obviously forgot to check whether the simplified version had the good behaviour in the previous version. I'll retrace my steps and find something simple where the error occurs only in 1.4.1 but not in 1.4.0. |
How did you find out about this behavior? Can you host the original code in a gist (if possible), so we can work from there? I made some stylistic changes to lib/statsample.rb and lib/statsample/{dataset,matrix,reliability}.rb, and I might have introduced a bug. Or there were changes in Claudio's repository that were introduced after the 1.4.0 release... anyway, I'll keep looking into it. :) |
Here's what I could reproduce:
With 1.4.0
rvm gemset create test140
rvm gemset use test140
gem install 'statsample' -v 1.4.0
irb
2.1.2 :001 > require 'statsample'
=> true
2.1.2 :002 > Statsample::VERSION
=> "1.4.0"
2.1.2 :003 >
2.1.2 :004 > regression_analysis = Statsample::Analysis.store(Statsample::Regression::Multiple) do
2.1.2 :005 > dataset_inputs = {
2.1.2 :006 > '4' => [27.0,12.0,16.0,25.0,0.0,13.0,14.0,28.0,1.0,18.0,24.0,19.0,7.0,27.0,17.0,17.0,16.0,24.0,21.0,22.0,16.0,24.0,22.0,13.0].to_vector(:scale),
2.1.2 :007 > '5' => [10.0,15.0,19.0,2.0,20.0,13.0,7.0,24.0,5.0,17.0,16.0,29.0,15.0,20.0,23.0,11.0,14.0,4.0,19.0,19.0,3.0,9.0,6.0,20.0].to_vector(:scale),
2.1.2 :008 > '6' => [0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001].to_vector(:scale)
2.1.2 :009?> }
2.1.2 :010?> ds = dataset(dataset_inputs)
2.1.2 :011?> lr(ds, '6')
2.1.2 :012?> end
=> #<Statsample::Analysis::Suite:0x000000038b72e0 @block=#<Proc:0x000000038b73a8@(irb):4>, @name=Statsample::Regression::Multiple, @attached=[], @output=#<IO:<STDOUT>>>
2.1.2 :013 >
2.1.2 :014 > results = regression_analysis.run
=> #<Statsample::Regression::Multiple::RubyEngine:0x000000038bcfd8 @matrix_cor=Matrix[[1.0, 0.009923807720864935, 0.0], [0.009923807720864935, 1.0, 0.0], [0.0, 0.0, 1.0]], @matrix_cov=Matrix[[1.0, 0.009923807720864935, 0.0], [0.009923807720864935, 1.0, 0.0], [0.0, 0.0, 1.0]], @no_covariance=true, @y_var="6", @fields=["4", "5"], @n_predictors=2, @predictors_n=2, @matrix_x=Matrix[[1.0, 0.009923807720864935], [0.009923807720864935, 1.0]], @matrix_x_cov=Matrix[[1.0, 0.009923807720864935], [0.009923807720864935, 1.0]], @matrix_y=Matrix[[0.0], [0.0]], @matrix_y_cov=Matrix[[0.0], [0.0]], @y_sd=4.25288070139903e-17, @x_sd={"4"=>7.540110136434694, "5"=>7.2631689494604075}, @cases=24, @x_mean={"4"=>17.625, "5"=>14.166666666666666}, @y_mean=0.07999200000000005, @name="Multiple reggresion of 4,5 on 6", @digits=3, @coeffs_stan=[0.0, 0.0], @coeffs=[0.0, 0.0], @valid_cases=24, @total_cases=24, @ds=#<Statsample::Dataset:29747360 @name=Dataset 1 @fields=[4,5,6] cases=24, @dy=Vector(type:scale, n:24)[0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001], @ds_valid=#<Statsample::Dataset:30012060 @name=Dataset 1 @fields=[4,5,6] cases=24, @ds_indep=#<Statsample::Dataset:30010980 @name=Dataset 1 @fields=[4,5] cases=24, @dep_columns=[[27.0, 12.0, 16.0, 25.0, 0.0, 13.0, 14.0, 28.0, 1.0, 18.0, 24.0, 19.0, 7.0, 27.0, 17.0, 17.0, 16.0, 24.0, 21.0, 22.0, 16.0, 24.0, 22.0, 13.0], [10.0, 15.0, 19.0, 2.0, 20.0, 13.0, 7.0, 24.0, 5.0, 17.0, 16.0, 29.0, 15.0, 20.0, 23.0, 11.0, 14.0, 4.0, 19.0, 19.0, 3.0, 9.0, 6.0, 20.0]]>
2.1.2 :015 > results.r
=> 0.0
2.1.2 :016 > results.r2
=> 0.0
With 1.4.1
rvm gemset create test141
rvm gemset use test141
gem install 'statsample' -v 1.4.1
irb
2.1.2 :001 > require 'statsample'
=> true
2.1.2 :002 > Statsample::VERSION
=> "1.4.1"
2.1.2 :003 >
2.1.2 :004 > regression_analysis = Statsample::Analysis.store(Statsample::Regression::Multiple) do
2.1.2 :005 > dataset_inputs = {
2.1.2 :006 > '4' => [27.0,12.0,16.0,25.0,0.0,13.0,14.0,28.0,1.0,18.0,24.0,19.0,7.0,27.0,17.0,17.0,16.0,24.0,21.0,22.0,16.0,24.0,22.0,13.0].to_vector(:scale),
2.1.2 :007 > '5' => [10.0,15.0,19.0,2.0,20.0,13.0,7.0,24.0,5.0,17.0,16.0,29.0,15.0,20.0,23.0,11.0,14.0,4.0,19.0,19.0,3.0,9.0,6.0,20.0].to_vector(:scale),
2.1.2 :008 > '6' => [0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001].to_vector(:scale)
2.1.2 :009?> }
2.1.2 :010?> ds = dataset(dataset_inputs)
2.1.2 :011?> lr(ds, '6')
2.1.2 :012?> end
=> #<Statsample::Analysis::Suite:0x00000002197ce0 @block=#<Proc:0x00000002197da8@(irb):4>, @name=Statsample::Regression::Multiple, @attached=[], @output=#<IO:<STDOUT>>>
2.1.2 :013 >
2.1.2 :014 > results = regression_analysis.run
=> #<Statsample::Regression::Multiple::RubyEngine:0x000000021915c0 @matrix_cor=Matrix[[1.0, 0.009923807720864888, NaN], [0.009923807720864888, 1.0, NaN], [NaN, NaN, 1.0]], @matrix_cov=Matrix[[1.0, 0.009923807720864888, NaN], [0.009923807720864888, 1.0, NaN], [NaN, NaN, 1.0]], @no_covariance=true, @y_var="6", @fields=["4", "5"], @n_predictors=2, @predictors_n=2, @matrix_x=Matrix[[1.0, 0.009923807720864888], [0.009923807720864888, 1.0]], @matrix_x_cov=Matrix[[1.0, 0.009923807720864888], [0.009923807720864888, 1.0]], @matrix_y=Matrix[[NaN], [NaN]], @matrix_y_cov=Matrix[[NaN], [NaN]], @y_sd=0.0, @x_sd={"4"=>7.540110136434694, "5"=>7.2631689494604075}, @cases=24, @x_mean={"4"=>17.625, "5"=>14.166666666666666}, @y_mean=0.07999200000000001, @name="Multiple reggresion of 4,5 on 6", @digits=3, @coeffs_stan=[NaN, NaN], @coeffs=[NaN, NaN], @valid_cases=24, @total_cases=24, @ds=#<Statsample::Dataset:17599380 @name=Dataset 1 @fields=[4,5,6] cases=24, @dy=Vector(type:scale, n:24)[0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001], @ds_valid=#<Statsample::Dataset:17258600 @name=Dataset 1 @fields=[4,5,6] cases=24, @ds_indep=#<Statsample::Dataset:17257420 @name=Dataset 1 @fields=[4,5] cases=24, @dep_columns=[[27.0, 12.0, 16.0, 25.0, 0.0, 13.0, 14.0, 28.0, 1.0, 18.0, 24.0, 19.0, 7.0, 27.0, 17.0, 17.0, 16.0, 24.0, 21.0, 22.0, 16.0, 24.0, 22.0, 13.0], [10.0, 15.0, 19.0, 2.0, 20.0, 13.0, 7.0, 24.0, 5.0, 17.0, 16.0, 29.0, 15.0, 20.0, 23.0, 11.0, 14.0, 4.0, 19.0, 19.0, 3.0, 9.0, 6.0, 20.0]]>
2.1.2 :015 > results.r
=> NaN
2.1.2 :016 > results.r2
=> NaN |
For your convenience, here's a version of the code to copy and paste into irb:
require 'statsample'
Statsample::VERSION
regression_analysis = Statsample::Analysis.store(Statsample::Regression::Multiple) do
dataset_inputs = {
'4' => [27.0,12.0,16.0,25.0,0.0,13.0,14.0,28.0,1.0,18.0,24.0,19.0,7.0,27.0,17.0,17.0,16.0,24.0,21.0,22.0,16.0,24.0,22.0,13.0].to_vector(:scale),
'5' => [10.0,15.0,19.0,2.0,20.0,13.0,7.0,24.0,5.0,17.0,16.0,29.0,15.0,20.0,23.0,11.0,14.0,4.0,19.0,19.0,3.0,9.0,6.0,20.0].to_vector(:scale),
'6' => [0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001].to_vector(:scale)
}
ds = dataset(dataset_inputs)
lr(ds, '6')
end
results = regression_analysis.run
results.r
results.r2 |
Another thing that might not be clearly visible from the above is that the dependent variable is always the same value, so the coefficients are resolved to be 0.0 and the constant to the value of the dependent variable. If one value of the dependent variable is changed, it works both in 1.4.1 and 1.4.0 |
Running Ruby 2.1.2p95:
>> gem "statsample", "= 1.4.0"
=> true
>> require "statsample"
=> true
>> Statsample::VERSION
=> "1.4.0"
>> regression_analysis = Statsample::Analysis.store(Statsample::Regression::Multiple) do
?> dataset_inputs = {
?> '4' => [27.0,12.0,16.0,25.0,0.0,13.0,14.0,28.0,1.0,18.0,24.0,19.0,7.0,27.0,17.0,17.0,16.0,24.0,21.0,22.0,16.0,24.0,22.0,13.0].to_vector(:scale),
?> '5' => [10.0,15.0,19.0,2.0,20.0,13.0,7.0,24.0,5.0,17.0,16.0,29.0,15.0,20.0,23.0,11.0,14.0,4.0,19.0,19.0,3.0,9.0,6.0,20.0].to_vector(:scale),
?> '6' => [0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001].to_vector(:scale)
>> }
>> ds = dataset(dataset_inputs)
>> lr(ds, '6')
>> end
=> #<Statsample::Analysis::Suite:0x007fb9a13fb800 @block=#<Proc:0x007fb9a13fb8c8@(irb):5>, @name=Statsample::Regression::Multiple, @attached=[], @output=#<IO:<STDOUT>>>
>>
?> results = regression_analysis.run
=> #<Statsample::Regression::Multiple::RubyEngine:0x007fb9a13c9c60 @matrix_cor=Matrix[[1.0, 0.009923807720864888, NaN], [0.009923807720864888, 1.0, NaN], [NaN, NaN, 1.0]], @matrix_cov=Matrix[[1.0, 0.009923807720864888, NaN], [0.009923807720864888, 1.0, NaN], [NaN, NaN, 1.0]], @no_covariance=true, @y_var="6", @fields=["4", "5"], @n_predictors=2, @predictors_n=2, @matrix_x=Matrix[[1.0, 0.009923807720864888], [0.009923807720864888, 1.0]], @matrix_x_cov=Matrix[[1.0, 0.009923807720864888], [0.009923807720864888, 1.0]], @matrix_y=Matrix[[NaN], [NaN]], @matrix_y_cov=Matrix[[NaN], [NaN]], @y_sd=0.0, @x_sd={"4"=>7.540110136434694, "5"=>7.2631689494604075}, @cases=24, @x_mean={"4"=>17.625, "5"=>14.166666666666666}, @y_mean=0.07999200000000001, @name="Multiple reggresion of 4,5 on 6", @digits=3, @coeffs_stan=[NaN, NaN], @coeffs=[NaN, NaN], @valid_cases=24, @total_cases=24, @ds=#<Statsample::Dataset:70217625390900 @name=Dataset 1 @fields=[4,5,6] cases=24, @dy=Vector(type:scale, n:24)[0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001], @ds_valid=#<Statsample::Dataset:70217624644040 @name=Dataset 1 @fields=[4,5,6] cases=24, @ds_indep=#<Statsample::Dataset:70217624642940 @name=Dataset 1 @fields=[4,5] cases=24, @dep_columns=[[27.0, 12.0, 16.0, 25.0, 0.0, 13.0, 14.0, 28.0, 1.0, 18.0, 24.0, 19.0, 7.0, 27.0, 17.0, 17.0, 16.0, 24.0, 21.0, 22.0, 16.0, 24.0, 22.0, 13.0], [10.0, 15.0, 19.0, 2.0, 20.0, 13.0, 7.0, 24.0, 5.0, 17.0, 16.0, 29.0, 15.0, 20.0, 23.0, 11.0, 14.0, 4.0, 19.0, 19.0, 3.0, 9.0, 6.0, 20.0]]>
>> results.r
=> NaN
>> results.r2
=> NaN
I get the same result with Ruby 2.1.5. Well, there's obviously something else different between our systems -- I can't make it work with Statsample 1.4.0 or 1.4.1, both installed via rubygems. What system are you using? I'm on a Mac OSX 10.10.2, with
It appears the line responsible for generating those |
Hi again, sorry, I didn't get notified of your reply (maybe because of the DDoS on github?) I'm running on:
According to Wikipedia R^2 is not defined for the case we have (where all the values of the dependent variable are equal). So NaN is actually acceptable for this case after all. The reason why I was actually getting a different value could be due to differences in rounding (?). So I would be OK to close this issue if you do not want to hunt down the difference further. If you did, I'd be happy to continue trying to narrow down the case in which it worked for me in 1.4.0. |
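A possible reading of the "differences in rounding" guess, based only on the inspected objects above: the run that returned 0.0 shows @y_sd=4.25288070139903e-17, while the runs that returned NaN show @y_sd=0.0. The sketch below is plain Ruby, not statsample's actual code; the `correlation` helper and the assumption that the covariance term is exactly 0.0 are hypothetical and only illustrate how a tiny nonzero standard deviation avoids the 0.0 / 0.0 division.

```ruby
# Hypothetical helper, not statsample's implementation: a correlation
# written as covariance divided by the product of the standard deviations.
def correlation(cov, sd_x, sd_y)
  cov / (sd_x * sd_y)
end

sd_x = 7.540110136434694 # @x_sd["4"] as shown in the inspected objects

# Assumption: the covariance with a constant y comes out as exactly 0.0.
# y_sd exactly zero (as in the NaN runs): 0.0 / 0.0 is NaN.
r_zero_sd = correlation(0.0, sd_x, 0.0)

# y_sd tiny but nonzero (as in the 0.0 run): 0.0 / tiny is 0.0.
r_tiny_sd = correlation(0.0, sd_x, 4.25288070139903e-17)

puts r_zero_sd.nan? # prints true
puts r_tiny_sd      # prints 0.0
```

If this is what happened, the 0.0 result would be an accident of rounding in the mean and standard deviation rather than a deliberate handling of the constant case.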
No problem, that DDoS was very problematic indeed. Anyway, I thought about it and yeah, R^2 doesn't make sense in this context. Can you link to the passage that says it is undefined in this situation? And I don't know if closing it is really a good idea. Maybe we should print a warning or raise an exception if we get to this situation? I certainly need to add this case to the documentation at the very least. |
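The two options mentioned here (warn or raise) could be sketched roughly as follows. This is plain Ruby under stated assumptions: `DegenerateRegressionError`, `check_response_variance!` and the `strict:` flag are hypothetical names for illustration and do not exist in statsample.

```ruby
# Hypothetical error class, for illustration only.
class DegenerateRegressionError < StandardError; end

# Returns true when y has at least two distinct values. Otherwise it either
# raises (strict: true) or prints a warning to stderr and returns false.
# Assumes y is a plain Array of numbers; uses exact equality via uniq.
def check_response_variance!(y, strict: false)
  return true if y.uniq.size > 1

  message = "dependent variable is constant; R^2 and related statistics are undefined"
  raise DegenerateRegressionError, message if strict

  warn message
  false
end

check_response_variance!([1, 1, 1, 1])   # warns on stderr, returns false
check_response_variance!([1, 2, 1, 1])   # returns true
```

Whether a warning or an exception is the better default is a design choice for the maintainers; the sketch only shows that both can share one check.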
Sure: At the end of the second paragraph it says that neither formula is defined for the case that y_1 = ... = y_n = average of y values. |
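To make the undefined case concrete, here is a minimal plain-Ruby sketch using the common definition R^2 = 1 - SS_res / SS_tot. The `r_squared` helper is hypothetical and is not how statsample computes it; it only shows that a constant y makes SS_tot zero, so the ratio becomes 0.0 / 0.0.

```ruby
# Hypothetical helper: R^2 = 1 - SS_res / SS_tot for observed values y
# and fitted values y_hat (both plain Arrays of Floats).
def r_squared(y, y_hat)
  mean = y.sum / y.size.to_f
  ss_tot = y.sum { |v| (v - mean)**2 }             # total sum of squares
  ss_res = y.zip(y_hat).sum { |v, f| (v - f)**2 }  # residual sum of squares
  1.0 - ss_res / ss_tot
end

y     = [1.0, 1.0, 1.0, 1.0] # the constant dependent variable from the first example
y_hat = [1.0, 1.0, 1.0, 1.0] # a perfect fit: every prediction equals the constant

puts r_squared(y, y_hat).nan? # prints true, because SS_tot is 0.0
```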
I actually gave it a bit more thought and it came to me that just because my application flow went to R^2 first, I never actually checked the coefficients and the constant. Although R^2 is fine as NaN (running the calculation as described in Wikipedia gave a
I think I might have hit upon a corner case in 1.4.0 that for some reason worked. If I replace the
The case in which the dependent variable is unaffected by any of the explaining variables is a special case which is obviously simplistic. Perhaps a good solution is to first check whether all the elements in the dependent vector are identical. If they are, the solution is trivial: coefficients are 0 and the constant is equal to the value of any dependent variable entry. |
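The check proposed above could look roughly like this. It is a sketch in plain Ruby, not a patch against statsample: `constant_response_fit` and the shape of the returned hash are hypothetical, and the comparison uses exact equality.

```ruby
# Hypothetical helper: if every entry of the dependent variable y is
# identical, return the trivial solution (all coefficients 0.0, constant
# equal to that value). Returns nil when y varies or is empty, so the
# caller can fall back to the normal regression.
def constant_response_fit(y, predictor_names)
  return nil unless y.uniq.size == 1

  {
    constant: y.first,
    coeffs: predictor_names.map { |name| [name, 0.0] }.to_h
  }
end

fit = constant_response_fit([1, 1, 1, 1], %w[a b])
puts fit[:constant]       # prints 1
puts fit[:coeffs].inspect # one 0.0 coefficient per predictor
```

A caller would try this shortcut first and only build the correlation matrix when it returns nil.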
Hi again, I've updated my fork with separate tests for GslEngine and RubyEngine. The issue is confined to RubyEngine; GslEngine works as expected (according to the new tests). I am uncertain about creating a pull request, because I haven't gotten many of the other tests in the suite to pass. |
Created a pull request to clbustos: |
Sorry for my absence. I finally got some time to work on open source. Can you send your pull request to this repository? We want to concentrate our efforts on the SciRuby forks so it's easier for someone to find our projects. |
And thanks for your explanation.
This seems like a good solution indeed. I'm thinking about how to check that the elements are identical when we're working with floating-point precision. I'll update this issue this week. |
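One way the floating-point question could be approached is a relative tolerance on the spread of the values. The sketch below is plain Ruby; `nearly_constant?` and the default `rel_tol` of 1e-12 are assumptions chosen for illustration, not values taken from statsample.

```ruby
# Hypothetical helper: treat values as identical when the gap between the
# smallest and largest is negligible relative to their magnitude.
# rel_tol is an assumed default and would need tuning for real data.
def nearly_constant?(values, rel_tol: 1e-12)
  return false if values.empty?

  min, max = values.minmax
  scale = [min.abs, max.abs].max
  (max - min) <= rel_tol * scale
end

y = [0.07999200000000001] * 24 # the dependent variable from the report
puts nearly_constant?(y)          # prints true
puts nearly_constant?([1.0, 1.1]) # prints false
```

Comparing the spread against the magnitude, rather than testing `sd == 0.0`, would also catch the case where rounding leaves a standard deviation on the order of 1e-17.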
I just ran a regression model with the same data in R, since it's nowadays the standard in statistical computing. It seems to use the same solution as the one proposed by einpaule: R also estimates the slope coefficients as 0 and the intercept (constant) as the value of y. All the computed statistics (like R^2, F-statistic, t, etc.) are NaN or NA, and warnings of the following kind are produced:
Warning message:
In summary.lm(mod) : essentially perfect fit: summary may be unreliable
Warning message:
In anova.lm(mod) :
ANOVA F-tests on an essentially perfect fit are unreliable |
I'm finding some unexpected behaviour in 1.4.1 that did not occur in 1.4.0. (In the example, I've tried to keep to the format of some of the tests in the test suite):
I've added this as a test on a fork: https://github.com/einpaule/statsample .
Can someone confirm this is an issue?