Doing the same statistical analysis in R, Julia, and Python

Analysis of variance in R, Julia, and Python

I took a dataset that I know how to analyse in R, and tried to do the same thing in Julia and Python. I failed. That may be because of my inexperience in Julia and Python.
Here is what happened.
I apologise in advance… parts of this are technical.

Analysis of expt1 collagen data with R

One starts by reading data from a .csv file into an R dataframe.

> coll.df <- read.table("expt1.csv", header=T,sep=",")
> coll.df[1:3,]
      Farm  Grade Sheep Field    Area       Red     Green      Blue Count
1 Glensloy smooth  3457     1 737.625 0.7345387 0.3044948 0.2639451   732
2 Glensloy smooth  3457     1  41.625 0.8706839 0.2332855 0.4815878    41
3 Glensloy smooth  3457     1 375.500 0.8332495 0.2858970 0.3113243   374

Listing the first few rows of a dataframe is a good way to check, but a better way is to use the str() function… the name 'str' stands for 'structure'… str() works on any R object, not just a dataframe.

> str(coll.df)
'data.frame':	127239 obs. of  9 variables:
 $ Farm : chr  "Glensloy" "Glensloy" "Glensloy" "Glensloy" ...
 $ Grade: chr  "smooth" "smooth" "smooth" "smooth" ...
 $ Sheep: int  3457 3457 3457 3457 3457 3457 3457 3457 3457 3457 ...
 $ Field: int  1 1 1 1 1 1 1 1 1 1 ...
 $ Area : num  737.6 41.6 375.5 49.5 88 ...
 $ Red  : num  0.735 0.871 0.833 0.849 0.829 ...
 $ Green: num  0.304 0.233 0.286 0.233 0.204 ...
 $ Blue : num  0.264 0.482 0.311 0.303 0.245 ...
 $ Count: int  732 41 374 49 87 1397 2200 205 10 1882 ...

The str() function describes each column of the dataframe.
For analysis, the columns Farm, Grade, Sheep and Field need to be 'factors'… that is, variables with a discrete number of levels. The variables to be analysed can be real numbers or integers. So

> coll.df$Farm <- factor(coll.df$Farm)
> coll.df$Grade <- factor(coll.df$Grade)
> coll.df$Sheep <- factor(coll.df$Sheep)
> coll.df$Field <- factor(coll.df$Field)
> str(coll.df)
'data.frame':	127239 obs. of  9 variables:
 $ Farm : Factor w/ 2 levels "Glensloy","Manton": 1 1 1 1 1 1 1 1 1 1 ...
 $ Grade: Factor w/ 2 levels "smooth","wrinkledbn": 1 1 1 1 1 1 1 1 1 1 ...
 $ Sheep: Factor w/ 36 levels "3448","3449",..: 10 10 10 10 10 10 10 10 10 10 ...
 $ Field: Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Area : num  737.6 41.6 375.5 49.5 88 ...
 $ Red  : num  0.735 0.871 0.833 0.849 0.829 ...
 $ Green: num  0.304 0.233 0.286 0.233 0.204 ...
 $ Blue : num  0.264 0.482 0.311 0.303 0.245 ...
 $ Count: int  732 41 374 49 87 1397 2200 205 10 1882 ...

Note it shows how many 'levels' each Factor column has.
Factors are the effects we analyse for… we particularly want to see if the 2 levels of Factor Grade differ.

We estimate effects by fitting a 'model' to data. In R a 'model' is described by a 'formula'. The formula for an initial simple model for Area in the coll.df dataset is

Area ~ 1 + Farm + Grade + Sheep

The '1' stands for the overall mean. Effects like Farm and Grade will be deviations from the mean. We leave out Field because that is the replication… i.e. the means of estimating error.
In R, the function to fit a fixed effects model is aov()

> coll.fit <- aov(Area ~ 1 + Farm + Grade, coll.df)
> summary(coll.fit)
                Df    Sum Sq   Mean Sq F value   Pr(>F)    
Farm             1 1.429e+10 1.429e+10 2022.32  < 2e-16 ***
Grade            1 3.778e+08 3.778e+08   53.46 2.66e-13 ***
Residuals   127236 8.991e+11 7.066e+06                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The summary() function gives an analysis of variance table, which shows that there was significant variation in Area of collagen bundles between Farms and between Grades. The significance testing in this table is not correct… Farm and Grade should be tested against Sheep, as error… but Sheep is not fitted separately from Residual.

We need to check if the chosen model is appropriate… for example there may be an interaction between Farm and Grade. To check this we try another model

> coll.fit2 <- aov(Area ~ 1 + Farm + Grade + Farm:Grade + Farm:Grade/Sheep, coll.df)
> summary(coll.fit2)
                     Df    Sum Sq   Mean Sq F value   Pr(>F)    
Farm                  1 1.429e+10 1.429e+10 2058.94  < 2e-16 ***
Grade                 1 3.778e+08 3.778e+08   54.43 1.62e-13 ***
Farm:Grade            1 2.928e+09 2.928e+09  421.87  < 2e-16 ***
Farm:Grade:Sheep     32 1.329e+10 4.154e+08   59.85  < 2e-16 ***
Residuals        127203 8.829e+11 6.941e+06                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

We now see there may be a significant interaction between Farm and Grade.

> model.tables(coll.fit2, type="means")
Tables of means
Grand mean
         
694.3989 

 Farm 
    Glensloy Manton
       470.5   1196
rep  87972.0  39267

 Grade 
     smooth wrinkledbn
      643.5      751.2
rep 67124.0    60115.0

 Farm:Grade 
          Grade
Farm       smooth wrinkledbn
  Glensloy   339    654     
  rep      51266  36706     
  Manton    1406   1054     
  rep      15858  23409     

So the difference between smooth and wrinkled is not the same on the two Farms.
I expected smooth skinned sheep to have smaller collagen bundles. That is only so on Glensloy Farm.

To fix the significance test I need to declare Sheep and Field-within-Sheep and Collagen-bundles-within-Field to be the 'error model'.

> coll.fit5 <- aov(Area ~ 1 + Farm + Grade + Farm:Grade + Error(Sheep + Sheep/Field) , coll.df)
> summary(coll.fit5)

Error: Sheep
           Df    Sum Sq   Mean Sq F value  Pr(>F)    
Farm        1 1.429e+10 1.429e+10  34.399 1.6e-06 ***
Grade       1 3.778e+08 3.778e+08   0.909  0.3474    
Farm:Grade  1 2.928e+09 2.928e+09   7.048  0.0123 *  
Residuals  32 1.329e+10 4.154e+08                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Error: Sheep:Field
           Df    Sum Sq  Mean Sq F value Pr(>F)
Residuals 144 7.558e+09 52489297               

Error: Within
              Df    Sum Sq Mean Sq F value Pr(>F)
Residuals 127059 8.753e+11 6889105               

The analysis now shows 3 nested error levels… Sheep, Fields within Sheep, and Within, which means variance between bundles within a microscope Field. The appropriate error level for testing Farm and Grade differences is Sheep.
Now the significance tests show that only the Farm effect is highly significant… the Farm:Grade interaction is dubious. We know the two Farms differ in wrinkle… it is not surprising that they also differ in collagen bundle size.
It seems the measurements were not sufficiently sensitive to detect the smaller difference between grades.

> model.tables(coll.fit5, type="means")
Tables of means
Grand mean
         
694.3989 

 Farm 
    Glensloy Manton
       470.5   1196
rep  87972.0  39267

 Grade 
     smooth wrinkledbn
      643.5      751.2
rep 67124.0    60115.0

 Farm:Grade 
          Grade
Farm       smooth wrinkledbn
  Glensloy   339    654     
  rep      51266  36706     
  Manton    1406   1054     
  rep      15858  23409     
> 

Those are the class means for the final fitted model. They are inconclusive, except for the difference between Farms. These means are the same as for fit4… changing the error model does not alter the class means.
The experiment was apparently not sensitive enough to detect the smaller difference between Grades within each Farm.


Analysis of the โ€œexpt1.csvโ€ dataset with Julia

One starts, the same as in R, by reading the .csv file into a DataFrame

julia> using Pkg
julia> using DataFrames,CSV
julia> df = CSV.read("expt1.csv",delim=",",DataFrame)
127239×9 DataFrame
    Row │ Farm      Grade       Sheep  Field  Area      Red       Green     Blue      Cou ⋯
        │ String15  String15    Int64  Int64  Float64   Float64   Float64   Float64   Int ⋯
────────┼──────────────────────────────────────────────────────────────────────────────────
      1 │ Glensloy  smooth       3457      1   737.625  0.734539  0.304495  0.263945    7 ⋯
      2 │ Glensloy  smooth       3457      1    41.625  0.870684  0.233286  0.481588
      3 │ Glensloy  smooth       3457      1   375.5    0.833249  0.285897  0.311324    3
      4 │ Glensloy  smooth       3457      1    49.5    0.84906   0.232893  0.303401
      5 │ Glensloy  smooth       3457      1    88.0    0.828623  0.204372  0.245076      ⋯
      6 │ Glensloy  smooth       3457      1  1411.12   0.820913  0.271683  0.280649   13
      7 │ Glensloy  smooth       3457      1  2214.38   0.794125  0.173276  0.18097    22
      8 │ Glensloy  smooth       3457      1   206.625  0.821176  0.28836   0.331382    2
      9 │ Glensloy  smooth       3457      1    10.375  0.807059  0.308627  0.359608      ⋯
     10 │ Glensloy  smooth       3457      1  1891.25   0.759536  0.189146  0.207795   18
     11 │ Glensloy  smooth       3457      1   752.875  0.750648  0.140119  0.152391    7
     12 │ Glensloy  smooth       3457      1     1.0    0.764706  0.317647  0.509804
     13 │ Glensloy  smooth       3457      1   257.25   0.756631  0.146349  0.140003    2 ⋯
     14 │ Glensloy  smooth       3457      1    26.375  0.767572  0.284163  0.281297
   ⋮    │    ⋮          ⋮         ⋮      ⋮       ⋮         ⋮         ⋮         ⋮        ⋮ ⋱
 127226 │ Manton    wrinkledbn   3523      5   137.25   0.838104  0.266462  0.309043    1
 127227 │ Manton    wrinkledbn   3523      5  1760.88   0.802461  0.181332  0.157286   17
 127228 │ Manton    wrinkledbn   3523      5  1613.25   0.819052  0.223463  0.201177   16 ⋯
 127229 │ Manton    wrinkledbn   3523      5  2246.38   0.828742  0.228816  0.20749    22
 127230 │ Manton    wrinkledbn   3523      5   152.25   0.818098  0.232644  0.236142    1
 127231 │ Manton    wrinkledbn   3523      5     8.0    0.769118  0.141176  0.110294
 127232 │ Manton    wrinkledbn   3523      5  2131.88   0.828878  0.259939  0.241366   21 ⋯
 127233 │ Manton    wrinkledbn   3523      5  1468.5    0.824338  0.216141  0.198868   14
 127234 │ Manton    wrinkledbn   3523      5  1428.88   0.812394  0.186982  0.179073   14
 127235 │ Manton    wrinkledbn   3523      5     3.0    0.827451  0.183007  0.205229
 127236 │ Manton    wrinkledbn   3523      5  1250.25   0.792318  0.163076  0.152771   12 ⋯
 127237 │ Manton    wrinkledbn   3523      5   126.875  0.807278  0.182776  0.180518    1
 127238 │ Manton    wrinkledbn   3523      5     1.0    0.847059  0.352941  0.356863
 127239 │ Manton    wrinkledbn   3523      5    17.875  0.849135  0.367474  0.36609
                                                           1 column and 127211 rows omitted

julia> 

Julia tells us the 'type' of each column of the DataFrame, in that shaded row under the column headers.

Julia has a package called AnovaBase.jl which does AOV for linear models. It has various companion packages for different types of models. The appropriate model-fitting package for my 'expt1' data is FixedEffectModels.jl.

(@v1.11) pkg> add FixedEffectModels
    Updating registry at `~/.julia/registries/General.toml`
   Resolving package versions...
   Installed Combinatorics ───── v1.0.3
   Installed FixedEffects ────── v2.4.1
   Installed GroupedArrays ───── v0.3.3
   Installed Vcov ────────────── v0.8.1
   Installed FixedEffectModels ─ v1.12.0
    Updating `~/.julia/environments/v1.11/Project.toml`
  [9d5cd8c9] + FixedEffectModels v1.12.0
    Updating `~/.julia/environments/v1.11/Manifest.toml`
  [861a8166] + Combinatorics v1.0.3
  [9d5cd8c9] + FixedEffectModels v1.12.0
  [c8885935] + FixedEffects v2.4.1
  [6407cd72] + GroupedArrays v0.3.3
  [ec2bfdc2] + Vcov v0.8.1
Precompiling project...
  5 dependencies successfully precompiled in 55 seconds. 342 already precompiled.

julia> using FixedEffectModels

That installs the package and makes it available to the current REPL environment.

Let's try a simple model

julia> fit1 = reg(df, @formula(Area ~  fe(Farm) + fe(Grade)))
                       FixedEffectModel                       
==============================================================
Number of obs:          127239  Converged:                true
dof (model):                 0  dof (residuals):        127235
Rยฒ:                      0.016  Rยฒ adjusted:             0.016
F-statistic:               NaN  P-value:                 0.000
Rยฒ within:               0.000  Iterations:                  4
==============================================================
  Estimate  Std. Error  t-stat  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────

==============================================================


julia> 

I don't see any AOV table, and I don't know how to get a table of means. The fit is poor (R^2 = 0.016).

There is another package called AnovaFixedEffectModels.jl.
We will try that

julia> Pkg.add("AnovaFixedEffectModels")
    Updating registry at `~/.julia/registries/General.toml`
   Resolving package versions...
   Installed PDMats ───────────────── v0.11.35
   Installed Distributions ────────── v0.25.119
   Installed AnovaFixedEffectModels ─ v0.2.5
   Installed AnovaBase ────────────── v0.8.0
    Updating `~/.julia/environments/v1.11/Project.toml`
  [e4965305] + AnovaFixedEffectModels v0.2.5
    Updating `~/.julia/environments/v1.11/Manifest.toml`
  [946dddda] + AnovaBase v0.8.0
  [e4965305] + AnovaFixedEffectModels v0.2.5
  [31c24e10] + Distributions v0.25.119
  [1a297f60] + FillArrays v1.13.0
  [90014a1f] + PDMats v0.11.35
  [1fd47b50] + QuadGK v2.11.2
  [4607b0f0] + SuiteSparse
Precompiling project...
  12 dependencies successfully precompiled in 31 seconds. 347 already precompiled.

So that adds the package and compiles it… then

julia> using AnovaFixedEffectModels

julia> fit1 = lfe(@formula(Area ~ fe(Farm) + fe(Grade)),df)
                       FixedEffectModel                       
==============================================================
Number of obs:          127239  Converged:                true
dof (model):                 0  dof (residuals):        127235
Rยฒ:                      0.016  Rยฒ adjusted:             0.016
F-statistic:               NaN  P-value:                 0.000
Rยฒ within:               0.000  Iterations:                  4
==============================================================
  Estimate  Std. Error  t-stat  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────

==============================================================

The same as before.
Apparently what I need to do is anova(fit1).
So I need the package "ANOVA".

julia> using ANOVA
 │ Package ANOVA not found, but a package named ANOVA is available from a registry. 
 │ Install package?
 │   (@v1.11) pkg> add ANOVA 
 └ (y/n/o) [y]: y
   Resolving package versions...
ERROR: Unsatisfiable requirements detected for package AnovaFixedEffectModels [e4965305]:
 AnovaFixedEffectModels [e4965305] log:
 ├─possible versions are: 0.1.0 - 0.2.5 or uninstalled
 ├─restricted to versions * by project [eca84257], leaving only versions: 0.1.0 - 0.2.5
 │ └─project [eca84257] log:
 │   ├─possible versions are: 0.0.0 or uninstalled
 │   └─project [eca84257] is fixed to version 0.0.0
 └─restricted by compatibility requirements with StatsModels [3eaba693] to versions: uninstalled — no versions left
   └─StatsModels [3eaba693] log:
     ├─possible versions are: 0.3.0 - 0.7.4 or uninstalled
     ├─restricted to versions * by project [eca84257], leaving only versions: 0.3.0 - 0.7.4
     │ └─project [eca84257] log: see above
     └─restricted by compatibility requirements with ANOVA [0825541b] to versions: 0.3.0 - 0.5.0
       └─ANOVA [0825541b] log:
         ├─possible versions are: 0.1.0 or uninstalled
         └─restricted to versions * by an explicit requirement, leaving only versions: 0.1.0
Stacktrace:
..... lots of lines

There is some version clash?
So I kill the REPL and start again clean…

julia> using ANOVA
Exactly the same problem?

I need to install the AnovaBase package first.

julia> using AnovaBase
 │ Package AnovaBase not found, but a package named AnovaBase is available from a registry. 
 │ Install package?
 │   (@v1.11) pkg> add AnovaBase 
 └ (y/n/o) [y]: y
   Resolving package versions...
    Updating `~/.julia/environments/v1.11/Project.toml`
  [946dddda] + AnovaBase v0.8.0
  No Changes to `~/.julia/environments/v1.11/Manifest.toml`

Now I can get anova to run without version errors

julia> using AnovaFixedEffectModels
julia> fit1 = lfe(@formula(Area ~ fe(Farm) + fe(Grade)),df)
julia> anova(fit1)
ERROR: ArgumentError: Invalid set of model specification for ANOVA; not enough variables provided.
Stacktrace:
 [1] FullModel(model::FixedEffectModel, type::Int64, null::Bool, test_intercept::Bool)
   @ AnovaBase ~/.julia/packages/AnovaBase/SGDXz/src/AnovaBase.jl:150
 [2] anova(::Type{FTest}, model::FixedEffectModel; type::Int64)
   @ AnovaFixedEffectModels ~/.julia/packages/AnovaFixedEffectModels/S6PMl/src/anova.jl:47
 [3] anova
   @ ~/.julia/packages/AnovaFixedEffectModels/S6PMl/src/anova.jl:47 [inlined]
 [4] anova(models::FixedEffectModel)
   @ AnovaFixedEffectModels ~/.julia/packages/AnovaFixedEffectModels/S6PMl/src/anova.jl:31
 [5] top-level scope
   @ REPL[12]:1

But anova() seems to need more arguments than just the 'fit' object?
The help page says this

help?> anova
search: anova anova_lfe anova_type anova_test anovatable

  anova(Test::Type{<: GoodnessOfFit}, <anovamodel>; <keyword arguments>)
  anova(<models>...; test::Type{<: GoodnessOfFit}, <keyword arguments>)
  anova(Test::Type{<: GoodnessOfFit}, <model>; <keyword arguments>)
  anova(Test::Type{<: GoodnessOfFit}, <models>...; <keyword arguments>)

  Analysis of variance.

  Return AnovaResult{M, Test, N}. See AnovaResult for details.

    •  anovamodel: a AnovaModel.

    •  models: RegressionModel(s). If multiple models are provided, they should be
       nested, fitted with the same data and the last one is the most complex.

    •  Test: test statistics for goodness of fit. Available tests are
       LikelihoodRatioTest (LRT) and FTest.

So maybe it wants a Test specified.
But this link
https://github.com/marcpabst/ANOVA.jl
explicitly gives an example

using RDatasets, GLM, ANOVA

data = dataset("datasets", "ToothGrowth")
categorical!(data, :Dose)

model = fit(LinearModel,
            @formula(Len ~ Supp + Dose),
            data,
            contrasts = Dict(:Supp => EffectsCoding(), :Dose => EffectsCoding()))
Anova(model)

ANOVA table for linear model

                 DF        SS      MSS        F        p
Supp            1.0    205.35   205.35  14.0166   0.0004
Dose            2.0   2426.43  1213.22  82.8109   <1e-16
Residuals      56.0   820.425  14.6504      0.0   <1e-99

This package spells it Anova() rather than anova(), and it takes only one argument… the model fit object. ???
It turns out there are at least 4 different versions of ANOVA.jl on GitHub.
https://github.com/marcpabst/ANOVA.jl
That is the one with the above example.
Then
https://github.com/ZaneMuir/ANOVA.jl
This one uses lower-case anova() and works directly on data, not from the fit object.
Then also
https://github.com/BioTurboNick/SimpleANOVA.jl
https://github.com/yufongpeng/AnovaMixedModels.jl
The last link also has an example

using AnovaFixedEffectModels

AnovaFixedEffectModels.jl supports FixedEffectModels.

lfe is the same as reg, but the order of arguments is closer to other modeling packages.

fem1 = lfe(@formula(gpa ~ fe(student) + occasion + job), gpa)
aovf = anova(fem1)

Analysis of Variance

Type 1 test / F test

gpa ~ :(fe(student)) + occasion + job

Table:
─────────────────────────────────────────────────────────
             DOF   Exp.SS  Mean Square   F value  Pr(>|F|)
─────────────────────────────────────────────────────────
occasion       1  39.56        39.56    717.6351    <1e-99
job            2   3.0710       1.5355   27.8545    <1e-11
(Residuals)  997  54.96         0.0551
─────────────────────────────────────────────────────────

None of them seems to correspond to the REPL anova help.
So which ANOVA do I have, and where is the documentation?

I cannot proceed. I have the example above, and it simply does not work.


Analysis of expt1 collagen data with Python

This will be an experience… I have never written a line of Python.

trinity:[nevj]:~/juliawork/dermis.collagen.images$ python
Python 3.13.3 (main, Apr  9 2025, 17:13:31) [GCC 14.2.1 20250207] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 

OK, it seems to have a working prompt, or a REPL, or something.
Now let's try to read a .csv file

>>> import pandas as pd
Traceback (most recent call last):
  File "<python-input-0>", line 1, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'

I guess I need to install pandas first?

$ pacman -Ss pandas
world/python-pandas 2.2.3-1
    High-performance, easy-to-use data structures and data analysis tools for
    Python
# pacman -S python-pandas
resolving dependencies...
(5/5) installing python-pandas                     [---------------------] 100%

Now try in Python to read a .csv file

Python 3.13.3 (main, Apr  9 2025, 17:13:31) [GCC 14.2.1 20250207] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> 
>>> df = pd.read_csv('expt1.csv')
>>> df
            Farm       Grade  Sheep  ...     Green      Blue  Count
0       Glensloy      smooth   3457  ...  0.304495  0.263945    732
1       Glensloy      smooth   3457  ...  0.233286  0.481588     41
2       Glensloy      smooth   3457  ...  0.285897  0.311324    374
3       Glensloy      smooth   3457  ...  0.232893  0.303401     49
4       Glensloy      smooth   3457  ...  0.204372  0.245076     87
...          ...         ...    ...  ...       ...       ...    ...
127234    Manton  wrinkledbn   3523  ...  0.183007  0.205229      3
127235    Manton  wrinkledbn   3523  ...  0.163076  0.152771   1244
127236    Manton  wrinkledbn   3523  ...  0.182776  0.180518    125
127237    Manton  wrinkledbn   3523  ...  0.352941  0.356863      1
127238    Manton  wrinkledbn   3523  ...  0.367474  0.366090     17

[127239 rows x 9 columns]

Brilliant. I seem to have a DataFrame in Python.

Now I have been advised that this is not a desirable way to proceed. I am using the preinstalled global version of Python that is intimately connected with the Linux OS.
I should start again and make a user-owned virtual environment which will contain my personal project-specific version of Python. I can then install all my required libraries in the virtual environment, and not interfere with the system.

So, starting again: make a project directory, use the uv utility to create a .venv subdirectory, install Python, and use pip to install the required libraries

uv venv
uv python install
uv pip install pandas
uv pip install scipy
uv pip install statsmodels

The directory .venv now contains

$ ls .venv
bin  CACHEDIR.TAG  lib  lib64  pyvenv.cfg

$ ls .venv/bin
activate      activate.fish  activate_this.py  numpy-config  python3
activate.bat  activate.nu    deactivate.bat    pydoc.bat     python3.13
activate.csh  activate.ps1   f2py              python

$ ls .venv/lib/py*/si*
dateutil                               pytz
numpy                                  pytz-2025.2.dist-info
numpy-2.2.6.dist-info                  scipy
numpy.libs                             scipy-1.15.3.dist-info
packaging                              scipy.libs
packaging-25.0.dist-info               six-1.17.0.dist-info
pandas                                 six.py
pandas-2.2.3.dist-info                 statsmodels
pandas.libs                            statsmodels-0.14.4.dist-info
patsy                                  tzdata
patsy-1.0.1.dist-info                  tzdata-2025.2.dist-info
__pycache__                            _virtualenv.pth
python_dateutil-2.9.0.post0.dist-info  _virtualenv.py

plus a few other things

I can now run a Python script with uv run scriptname.py, but if I want to start a Python REPL I have to activate the virtual environment

source .venv/bin/activate
python
>>>

Now try again to read a .csv file

$ python
>>> import pandas as pd
>>> 
>>> df = pd.read_csv('expt1.csv')
>>> df
            Farm       Grade  Sheep  ...     Green      Blue  Count
0       Glensloy      smooth   3457  ...  0.304495  0.263945    732
1       Glensloy      smooth   3457  ...  0.233286  0.481588     41
2       Glensloy      smooth   3457  ...  0.285897  0.311324    374
3       Glensloy      smooth   3457  ...  0.232893  0.303401     49
4       Glensloy      smooth   3457  ...  0.204372  0.245076     87
...          ...         ...    ...  ...       ...       ...    ...
127234    Manton  wrinkledbn   3523  ...  0.183007  0.205229      3
127235    Manton  wrinkledbn   3523  ...  0.163076  0.152771   1244
127236    Manton  wrinkledbn   3523  ...  0.182776  0.180518    125
127237    Manton  wrinkledbn   3523  ...  0.352941  0.356863      1
127238    Manton  wrinkledbn   3523  ...  0.367474  0.366090     17

[127239 rows x 9 columns]

It works. Now, how do I do an analysis of variance in Python?
It looks like you can use the ols() function

>>> model1 = ols('Area ~ C(Farm) + C(Grade)', df).fit()
Traceback (most recent call last):
  File "<python-input-6>", line 1, in <module>
    model1 = ols('Area ~ C(Farm) + C(Grade)', df).fit()
             ^^^
NameError: name 'ols' is not defined
>>> 

OK, I need to install or import something to get ols

>>> import numpy as np
>>> model1 = ols('Area ~ C(Farm) + C(Grade)', df).fit()
Traceback (most recent call last):
  File "<python-input-9>", line 1, in <module>
    model1 = ols('Area ~ C(Farm) + C(Grade)', df).fit()
             ^^^
NameError: name 'ols' is not defined

I need to import part of scipy and statsmodels to get ols()

>>> import scipy.stats as stats
>>> import statsmodels.formula.api as smf
>>> model1 =smf.ols('Area ~ C(Farm) + C(Grade)', df).fit()
>>> print(model1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   Area   R-squared:                       0.016
Model:                            OLS   Adj. R-squared:                  0.016
Method:                 Least Squares   F-statistic:                     1038.
Date:                Thu, 15 May 2025   Prob (F-statistic):               0.00
Time:                        21:23:53   Log-Likelihood:            -1.1839e+06
No. Observations:              127239   AIC:                         2.368e+06
Df Residuals:                  127236   BIC:                         2.368e+06
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                424.3224     10.964     38.701      0.000     402.833     445.812
C(Farm)[T.Manton]        705.7196     16.359     43.140      0.000     673.656     737.783
C(Grade)[T.wrinkledbn]   110.6674     15.136      7.312      0.000      81.001     140.334
==============================================================================
Omnibus:                   259203.996   Durbin-Watson:                   1.921
Prob(Omnibus):                  0.000   Jarque-Bera (JB):       2179334936.138
Skew:                          16.663   Prob(JB):                         0.00
Kurtosis:                     643.280   Cond. No.                         2.72
==============================================================================

OK, it has fitted a multiple regression… I wanted a fixed-effects model.
I have the wrong ols()

>>> from statsmodels.formula.api import ols
>>> from statsmodels.stats.anova import anova_lm
>>> model1 = ols('Area ~ C(Farm) + C(Grade)', df).fit()
>>>
>>> model1.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   Area   R-squared:                       0.016
Model:                            OLS   Adj. R-squared:                  0.016
Method:                 Least Squares   F-statistic:                     1038.
Date:                Thu, 15 May 2025   Prob (F-statistic):               0.00
Time:                        21:40:57   Log-Likelihood:            -1.1839e+06
No. Observations:              127239   AIC:                         2.368e+06
Df Residuals:                  127236   BIC:                         2.368e+06
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                424.3224     10.964     38.701      0.000     402.833     445.812
C(Farm)[T.Manton]        705.7196     16.359     43.140      0.000     673.656     737.783
C(Grade)[T.wrinkledbn]   110.6674     15.136      7.312      0.000      81.001     140.334
==============================================================================
Omnibus:                   259203.996   Durbin-Watson:                   1.921
Prob(Omnibus):                  0.000   Jarque-Bera (JB):       2179334936.138
Skew:                          16.663   Prob(JB):                         0.00
Kurtosis:                     643.280   Cond. No.                         2.72
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly 
specified.

That is just a regression… I want an anova table for a fixed-effects model. I need to do

>>> anova_lm(model1)
                df        sum_sq       mean_sq            F        PR(>F)
C(Farm)        1.0  1.429054e+10  1.429054e+10  2022.317244  0.000000e+00
C(Grade)       1.0  3.777585e+08  3.777585e+08    53.458256  2.656887e-13
Residual  127236.0  8.991031e+11  7.066421e+06          NaN           NaN
>>> 

If we compare this with the first model fitted in R, it is exactly the same. Success!
Now try a model with interaction

>>> model2 = ols('Area ~ C(Farm) + C(Grade) + C(Farm):C(Grade)', df).fit()
>>> model2.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   Area   R-squared:                       0.019
Model:                            OLS   Adj. R-squared:                  0.019
Method:                 Least Squares   F-statistic:                     832.8
Date:                Sat, 17 May 2025   Prob (F-statistic):               0.00
Time:                        21:55:48   Log-Likelihood:            -1.1837e+06
No. Observations:              127239   AIC:                         2.367e+06
Df Residuals:                  127235   BIC:                         2.367e+06
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
============================================================================================================
                                               coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------
Intercept                                  338.8571     11.721     28.909      0.000     315.883     361.831
C(Farm)[T.Manton]                         1067.4783     24.115     44.266      0.000    1020.213    1114.744
C(Grade)[T.wrinkledbn]                     315.4990     18.146     17.387      0.000     279.933     351.065
C(Farm)[T.Manton]:C(Grade)[T.wrinkledbn]  -668.2950     32.777    -20.389      0.000    -732.537    -604.053
==============================================================================
Omnibus:                   259289.282   Durbin-Watson:                   1.927
Prob(Omnibus):                  0.000   Jarque-Bera (JB):       2192884626.973
Skew:                          16.674   Prob(JB):                         0.00
Kurtosis:                     645.272   Cond. No.                         6.49
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly 
specified.


>>> anova_lm(model2)
                        df        sum_sq       mean_sq            F        PR(>F)
C(Farm)                1.0  1.429054e+10  1.429054e+10  2028.908896  0.000000e+00
C(Grade)               1.0  3.777585e+08  3.777585e+08    53.632501  2.431477e-13
C(Farm):C(Grade)       1.0  2.928109e+09  2.928109e+09   415.720047  2.930235e-92
Residual          127235.0  8.961750e+11  7.043463e+06          NaN           NaN
>>> 

That is the same as in R too, except I have not yet included a separate Sheep term.

Now, before adding Sheep to the model, let's see how to calculate means for the classes defined by the model.
I can get the fitted coefficients like this

>>> coefficients = model2.params
>>> print(coefficients)
Intercept                                    338.857130
C(Farm)[T.Manton]                           1067.478332
C(Grade)[T.wrinkledbn]                       315.499011
C(Farm)[T.Manton]:C(Grade)[T.wrinkledbn]    -668.294995
dtype: float64

but they are not means, they are regression equation coefficients. They are already listed by model2.summary().
The Intercept is the mean for the first Farm and the first Grade. The others are confusing slopes. I want class means for the fitted model… in R this is provided by the function model.tables()… Python does not seem to have that function.
The table I want is

| Farm     | smooth | wrinkledbn |
|----------|--------|------------|
| Glensloy |    339 |        654 |
| Manton   |   1406 |       1054 |

These values are calculable from the regression coefficients:
339 + 1067 = 1406
339 + 315 = 654
339 + 1067 + 315 - 668 = 1053

but I would like a general procedure that will do this for any model.
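For plain cell and marginal means, pandas can do this directly from the dataframe. A minimal sketch, assuming only pandas… a plain groupby gives raw means, which for the saturated Farm*Grade model are the same numbers as R's model.tables():

>>> # cell means and counts for every Farm x Grade combination
>>> df.groupby(['Farm', 'Grade'])['Area'].agg(['mean', 'count'])
>>> # marginal means for a single factor
>>> df.groupby('Farm')['Area'].mean()

That summarises the data rather than interrogating the fitted model, though, so it does not generalise to adjusted means for an arbitrary model.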

The most likely place to find something useful is in the formulae package

uv pip install formulae

Its documentation is here.

With that I can at least get a design matrix.

>>> from formulae import design_matrices
>>> dm = design_matrices("Farm + Grade",df)
>>> dm.common.as_dataframe().head()
   Intercept  Farm[Manton]  Grade[wrinkledbn]
0        1.0           0.0                0.0
1        1.0           0.0                0.0
2        1.0           0.0                0.0
3        1.0           0.0                0.0
4        1.0           0.0                0.0

That is for the model
Area ~ Intercept + Farm + Grade
If I want to compute cell means, I use a different model
Area ~ 0 + Farm*Grade
which says fit all combinations of Farm and Grade, but no grand mean

>>> dm = design_matrices("0 + Farm*Grade",df)
>>> dm.common.as_dataframe().head()
   Farm[Glensloy]  Farm[Manton]  Grade[wrinkledbn]  Farm[Manton]:Grade[wrinkledbn]
0             1.0           0.0                0.0                             0.0
1             1.0           0.0                0.0                             0.0
2             1.0           0.0                0.0                             0.0
3             1.0           0.0                0.0                             0.0
4             1.0           0.0                0.0                             0.0

Then I get a design matrix with 4 columns of 0s and 1s indicating which Farm and which Grade each row (= observation) belongs to.
I can use that matrix to calculate Farm and Grade means for Area from the dataframe.

>>> dmfg = dm.common.as_dataframe()
>>> dmfg.shape
(127239, 4)
>>> np.dot(dmfg["Farm[Glensloy]"],df["Area"])
np.float64(41390646.125)

That does a dot product of 2 vectors… the "Farm[Glensloy]" column of the design matrix and the "Area" column of the dataframe.
Now, that is a sum; I need to divide it by N to get a mean… N is the number of 1s in the "Farm[Glensloy]" column of dmfg.

>>> np.dot(dmfg["Farm[Glensloy]"],df["Area"])/np.dot(dmfg["Farm[Glensloy]"],dmfg["Farm[Glensloy]"])
np.float64(470.4979553153276)

470.5 is the wrong answer.
That is because the design matrix is wrong… I used the wrong model formula to make dmfg. I need the 'patsy' formula package and a different design matrix constructor.

>>> from patsy import ModelDesc
>>> dmat1 = dmatrices("Area ~ 0 + Farm:Grade",df)[1]
Traceback (most recent call last):
  File "<python-input-64>", line 1, in <module>
    dmat1 = dmatrices("Area ~ 0 + Farm:Grade",df)[1]
            ^^^^^^^^^
NameError: name 'dmatrices' is not defined

I need something to supply dmatrices()

from patsy import ModelDesc, Term, EvalFactor, dmatrices

Now try to make a design matrix

>>> dmat1 = dmatrices("Area ~ 0 + Farm:Grade",df)
>>> dmat1
(DesignMatrix with shape (127239, 1)
     Area
  737.625
   41.625
  375.500
   49.500
   88.000
 1411.125
 2214.375
  206.625
   10.375
 1891.250
  752.875
    1.000
  257.250
   26.375
  306.250
    6.250
 3233.875
  133.000
  838.625
   13.750
 2066.000
   73.250
   10.125
   56.750
   13.750
    2.000
   66.750
  233.875
20825.375
    2.000
[127209 rows omitted]
Terms:
'Area' (column 0)
(to view full data, use np.asarray(this_obj)), DesignMatrix with shape (127239, 4)
Columns:
['Farm[Glensloy]:Grade[smooth]', 'Farm[Manton]:Grade[smooth]', 'Farm[Glensloy]:Grade[wrinkledbn]', 'Farm[Manton]:Grade[wrinkledbn]']
Terms:
'Farm:Grade' (columns 0:4)
(to view full data, use np.asarray(this_obj)))

I don't get it. What sort of an object is dmat1?
This is different from the example given here.

Reading further

I can get rid of the 'Area' part of dmat1 by omitting it from the formula, and using dmatrix() instead of dmatrices()

>>> from patsy import dmatrices, dmatrix
>>> dmat1 = dmatrix("0 + Farm:Grade",df)
>>> dmat1
DesignMatrix with shape (127239, 4)
Columns:
['Farm[Glensloy]:Grade[smooth]', 'Farm[Manton]:Grade[smooth]', 'Farm[Glensloy]:Grade[wrinkledbn]', 'Farm[Manton]:Grade[wrinkledbn]']
Terms:
'Farm:Grade' (columns 0:4)
(to view full data, use np.asarray(this_obj))
>>> np.asarray(dmat1)
array([[1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       ...,
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.]], shape=(127239, 4))
>>> 

That looks right, it should have 4 columns.
I need the zero (0) in the formula to stop it making a design matrix column for the grand mean.

Now let's try to use 'dmat1' to compute some cell means

>>> np.dot(dmat1["Farm[Glensloy:Grade[smooth]"],df["Area"])/np.dot(dmat1["Farm[Glensloy]:Grade[smooth]"/np.dot(dmat1["Farm[Glensloy]:Grade[smooth]"],dmat1["Farm[Glensloy]:Grade[smooth]"])
...            hangs

>>> np.dot(dmat1["Farm[Glensloy:Grade[smooth]"],df["Area"])
Traceback (most recent call last):
  File "<python-input-62>", line 1, in <module>
    np.dot(dmat1["Farm[Glensloy:Grade[smooth]"],df["Area"])
           ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

>>> dmat1["Farm[Glensloy:Grade[smooth]"]
Traceback (most recent call last):
  File "<python-input-63>", line 1, in <module>
    dmat1["Farm[Glensloy:Grade[smooth]"]
    ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

The index is invalid?
Try a single quote
>>> dmat1['Farm[Glensloy:Grade[smooth]']
Traceback (most recent call last):
  File "<python-input-65>", line 1, in <module>
    dmat1['Farm[Glensloy:Grade[smooth]']
    ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

I don't get it, that is the column name, not an index.
np.asarray(dmat1['Farm[Glensloy:Grade[smooth]']) hangs, but np.asarray(dmat1) prints the array.

>>> dmat1.design_info
DesignInfo(['Farm[Glensloy]:Grade[smooth]', 'Farm[Manton]:Grade[smooth]', 'Farm[Glensloy]:Grade[wrinkledbn]', 'Farm[Manton]:Grade[wrinkledbn]'], factor_infos={EvalFactor('Farm'): FactorInfo(factor=EvalFactor('Farm'), type='categorical', state=<factor state>, categories=('Glensloy', 'Manton')), EvalFactor('Grade'): FactorInfo(factor=EvalFactor('Grade'), type='categorical', state=<factor state>, categories=('smooth', 'wrinkledbn'))}, term_codings=OrderedDict({Term([EvalFactor('Farm'), EvalFactor('Grade')]): [SubtermInfo(factors=(EvalFactor('Farm'), EvalFactor('Grade')), contrast_matrices={EvalFactor('Farm'): ContrastMatrix(array([[1., 0.],
       [0., 1.]]), ['[Glensloy]', '[Manton]']), EvalFactor('Grade'): ContrastMatrix(array([[1., 0.],
       [0., 1.]]), ['[smooth]', '[wrinkledbn]'])}, num_columns=4)]}))
>>> 

>>> dmat1.design_info.column_names
['Farm[Glensloy]:Grade[smooth]', 'Farm[Manton]:Grade[smooth]', 'Farm[Glensloy]:Grade[wrinkledbn]', 'Farm[Manton]:Grade[wrinkledbn]']
>>> 

It all looks right, I just can't access the array elements?
I think it is because dmat1 is a sparse matrix… it is all 0s and 1s.

>>> import scipy.sparse
>>> from scipy.sparse import issparse
>>> issparse(dmat1)
False

No, it is not that?

>>> type(dmat1)
<class 'patsy.design_info.DesignMatrix'>
>>> 
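In hindsight, a DesignMatrix is a numpy array underneath, so it only takes positional (integer) indices; the column names live in design_info. A sketch of pulling one column out by name, using the column_names list shown above (my reconstruction, not a line from the original session):

>>> # look up the column's position from its name, then index positionally
>>> j = dmat1.design_info.column_names.index('Farm[Glensloy]:Grade[smooth]')
>>> col = np.asarray(dmat1)[:, j]
>>> # cell mean = sum of Area over the flagged rows / number of 1s
>>> np.dot(col, df['Area']) / col.sum()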

(continued next reply)


(Python continuation)
Maybe I could convert it to a normal numpy array?

uv pip install dmatrix2np
uv pip install xgboost
>>> from dmatrix2np import dmatrix_to_numpy
>>> npdmat1 = dmatrix_to_numpy(dmat1)
Traceback (most recent call last):
  File "<python-input-33>", line 1, in <module>
    npdmat1 = dmatrix_to_numpy(dmat1)
  File "/home/nevj/Projects/collagenuv/.venv/lib/python3.13/site-packages/dmatrix2np/dmatrix_to_numpy.py", line 31, in dmatrix_to_numpy
    raise InvalidInput("Type error: input parameter is not DMatrix")
dmatrix2np.exceptions.InvalidInput: Type error: input parameter is not DMatrix

No, that DMatrix is a different type of object… dmatrix2np converts XGBoost DMatrix objects, not patsy ones.
So I still don't know how to handle an object produced by dmatrix().
It is of type DesignMatrix

>>> dmat1
DesignMatrix with shape (127239, 4)
Columns:
['Farm[Glensloy]:Grade[smooth]', 'Farm[Manton]:Grade[smooth]', 'Farm[Glensloy]:Grade[wrinkledbn]', 'Farm[Manton]:Grade[wrinkledbn]']
Terms:
'Farm:Grade' (columns 0:4)
(to view full data, use np.asarray(this_obj))
>>> np.asarray(dmat1)
array([[1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       ...,
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.]], shape=(127239, 4))
>>> 

OK, that converts it to an np array, but it loses the column names… it is just an array with unlabelled columns.

So retry the dot product

>>> np.dot(np.transpose(df["Area"]),np.asarray(dmat1))
array([17371849.625, 22301667.75 , 24018796.5  , 24662305.625])
>>> 

So it does all 4 design matrix columns at once.
Now divide by the number of 1s in each column

>>> np.dot(np.transpose(np.asarray(dmat1)),np.asarray(dmat1))
array([[51266.,     0.,     0.,     0.],
       [    0., 15858.,     0.,     0.],
       [    0.,     0., 36706.,     0.],
       [    0.,     0.,     0., 23409.]])

That doesn't look right… I want 4 numbers, equal to the sum of each column.
That is probably what is in the diagonal, and the others are zero because the effects are orthogonal.
There must be a better way to get column sums?

>>> np.diag(np.dot(np.transpose(np.asarray(dmat1)),np.asarray(dmat1)))

That is better… a vector of counts.
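In fact the full cross-product is overkill just for counts… summing each 0/1 column directly gives the same vector. A sketch (assumed, not from the original session):

>>> # column sums of the indicator matrix = observations per cell,
>>> # the same numbers as the diagonal above
>>> counts = np.asarray(dmat1).sum(axis=0)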

Now all together

>>> areasums = np.dot(np.transpose(df["Area"]),np.asarray(dmat1))
>>> counts = np.diag(np.dot(np.transpose(np.asarray(dmat1)),np.asarray(dmat1)))
>>> areasums/counts
array([ 338.85712997, 1406.3354616 ,  654.35614069, 1053.53947734])

Those are the correct class means, for the model with interaction.
I can get the Farm means like this

>>> dmat2 =  dmatrix("0 + Farm",df)
>>> areasums = np.dot(np.transpose(df["Area"]),np.asarray(dmat2))
>>> counts = np.diag(np.dot(np.transpose(np.asarray(dmat2)),np.asarray(dmat2)))
>>> areasums/counts
array([ 470.49795532, 1196.01633369])

It only works if there is one effect in the formula… otherwise constraints get in the way.

Now let's look at error levels.

>>> model3 = ols('Area ~ C(Farm) + C(Grade) + C(Farm):C(Grade) + C(Sheep) + C(Sheep)/C(Field)', df).fit()
>>> model3

(continued next reply)

(Python continued)
<statsmodels.regression.linear_model.RegressionResultsWrapper object at 0x7100801e1220>

>>> model3.summary
<bound method RegressionResults.summary of <statsmodels.regression.linear_model.OLSResults object at 0x7100801e10f0>>
>>> anova_lm(model3)
                         df        sum_sq       mean_sq            F         PR(>F)
C(Farm)                 1.0  1.429054e+10  1.429054e+10  2074.243257   0.000000e+00
C(Grade)                1.0  3.777585e+08  3.777585e+08    54.830877   1.321681e-13
C(Sheep)               35.0  1.643259e+10  4.695027e+08    68.147356   0.000000e+00
C(Farm):C(Grade)        1.0  4.318367e+06  4.318367e+06     0.626802   4.285322e-01
C(Sheep):C(Field)     144.0  7.566013e+09  5.254175e+07     7.626328  8.972476e-146
Residual           127059.0  8.753758e+11  6.889522e+06          NaN            NaN


That is not correct… the degrees of freedom are wrong. In R, Sheep within Farm:Grade had 32 df, not 35.

I did not want Sheep:Field, I wanted Sheep + Field-within-Sheep.
I tried omitting Sheep

>>> model3 = ols('Area ~ C(Farm) + C(Grade) + C(Farm):C(Grade) + C(Sheep)/C(Field)', df).fit()
>>> model3.summary
<bound method RegressionResults.summary of <statsmodels.regression.linear_model.OLSResults object at 0x7100801e15b0>>
>>> anova_lm(model3)
                         df        sum_sq       mean_sq            F         PR(>F)
C(Farm)                 1.0  1.429054e+10  1.429054e+10  2074.243257   0.000000e+00
C(Grade)                1.0  3.777585e+08  3.777585e+08    54.830877   1.321681e-13
C(Sheep)               35.0  1.643259e+10  4.695027e+08    68.147356   0.000000e+00
C(Farm):C(Grade)        1.0  4.318367e+06  4.318367e+06     0.626802   4.285322e-01
C(Sheep):C(Field)     144.0  7.566013e+09  5.254175e+07     7.626328  8.972476e-146
Residual           127059.0  8.753758e+11  6.889522e+06          NaN            NaN

Exactly the same. The MS for Sheep is wrong.

Actually patsy does have the nesting notation… C(Sheep)/C(Field) expands to C(Sheep) + C(Sheep):C(Field), which is exactly why the two fits above are identical. What Python lacks is R's Error() notation for declaring multiple error levels.
That makes it hard to analyse a repeated measures experiment.
There may be special packages in Python?
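One route statsmodels itself does offer is a linear mixed model, treating Sheep (and Field within Sheep) as random effects instead of Error() strata. A sketch of what that might look like… an untested assumption on my part, and mixedlm reports Wald tests on coefficients rather than an R-style F-table:

>>> import statsmodels.formula.api as smf
>>> # fixed effects: Farm, Grade and their interaction;
>>> # random effects: an intercept per Sheep (groups), plus a variance
>>> # component for Field nested within Sheep (vc_formula)
>>> md = smf.mixedlm("Area ~ C(Farm) * C(Grade)", df,
...                  groups=df["Sheep"],
...                  vc_formula={"Field": "0 + C(Field)"})
>>> mfit = md.fit()
>>> print(mfit.summary())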

A search for repeated measures analysis comes up with:
 - the AnovaRM() function in the Statsmodels package
 - the rm_anova() function in the Pingouin package
 - the MNE package

So have a look at the Statsmodels function first.

>>> AnovaRM(data=df, depvar='C(Farm) + C(Grade) + C(Farm):C(Grade)', subject='Area', within='Sheep').fit()
Traceback (most recent call last):
  File "…", line 1, in <module>
    AnovaRM(data=df, depvar='C(Farm) + C(Grade) + C(Farm):C(Grade)', subject='Area', within='Sheep').fit()
    ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nevj/Projects/collagenuv/.venv/lib/python3.13/site-packages/statsmodels/stats/anova.py", line 500, in __init__
    if not data.equals(data.drop_duplicates(subset=[subject] + within)):
                                                   ~~~~^~
TypeError: can only concatenate list (not "str") to list

It seems AnovaRM does not support model formulae… the traceback shows that within must be a list of column names, and my depvar and subject arguments are transposed as well: depvar should be the measured column and subject the sampling unit.
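For the record, a corrected call might look like this… an assumption from the documented signature, not a tested run. AnovaRM requires balanced data (hence the aggregate_func) and does not support between-subject factors such as Farm and Grade, so at best it tests the Field effect:

>>> from statsmodels.stats.anova import AnovaRM
>>> # depvar = measured column, subject = sampling unit,
>>> # within = list of within-subject factors;
>>> # aggregate_func='mean' averages the repeated bundles in each
>>> # Sheep x Field cell to satisfy the balance requirement
>>> res = AnovaRM(data=df, depvar='Area', subject='Sheep',
...               within=['Field'], aggregate_func='mean').fit()
>>> print(res)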

Next try pingouin

$ uv pip install pingouin

>>> import pingouin as pg
>>> rmanova = pg.rm_anova(dv=['Farm','Group'], within=['Sheep','Field'], subject='Area', data=df, detailed=True)
Traceback (most recent call last):
  File "…", line 1, in <module>
    rmanova = pg.rm_anova(dv=['Farm','Group'], within=['Sheep','Field'], subject='Area', data=df, detailed=True)
  File "/home/nevj/Projects/collagenuv/.venv/lib/python3.13/site-packages/pingouin/parametric.py", line 516, in rm_anova
    return rm_anova2(dv=dv, within=within, data=data, subject=subject, effsize=effsize)
  File "/home/nevj/Projects/collagenuv/.venv/lib/python3.13/site-packages/pingouin/parametric.py", line 681, in rm_anova2
    data = _check_dataframe(dv=dv, within=within, data=data, subject=subject, effects="within")
  File "/home/nevj/Projects/collagenuv/.venv/lib/python3.13/site-packages/pingouin/utils.py", line 368, in _check_dataframe
    if data[dv].dtype.kind not in "fi":
       ~~~~^^^^
  File "/home/nevj/Projects/collagenuv/.venv/lib/python3.13/site-packages/pandas/core/frame.py", line 4108, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
  File "/home/nevj/Projects/collagenuv/.venv/lib/python3.13/site-packages/pandas/core/indexes/base.py", line 6200, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
    ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nevj/Projects/collagenuv/.venv/lib/python3.13/site-packages/pandas/core/indexes/base.py", line 6252, in _raise_if_missing
    raise KeyError(f"{not_found} not in index")
KeyError: "['Group'] not in index"

It seems rm_anova() only accepts one dependent variable… i.e. it will only do a one-way anova with repeated measures. And again my arguments are transposed: dv should be 'Area' and subject should be 'Sheep'.
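A corrected call would look something like this… again an assumption, not a tested run, and it still cannot test the between-sheep effects Farm and Grade, so it does not answer the original question:

>>> # average the repeated bundles to one Area per Sheep x Field cell,
>>> # then run a one-way repeated-measures ANOVA on Field
>>> agg = df.groupby(['Sheep', 'Field'], as_index=False)['Area'].mean()
>>> pg.rm_anova(dv='Area', within='Field', subject='Sheep',
...             data=agg, detailed=True)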

Conclusion

It seems that only R can do the type of analysis I require without me having to write code.
Python got the AOV done, but required substantial fiddling to get the model class means, and failed to get the error levels for repeated measures.
Julia failed altogether… its anova routines are a mess of conflicting functions from different packages.

I would say that if you want to do statistical analysis you are better off with a mature environment like R, where all the necessary functions are integrated, than with a new environment like Julia, where all you have is a collection of functions, each following its own conventions. Python is somewhere in between.


Interesting. I tried to follow, but got lost and went to sleep… all this counting of sheep did not help…

Sorry Neville, could not resist the joke. Next time I will count the legs and divide by 4…

Or just use Excel (LibreOffice Calc) instead.


Excel does not cut the mustard. Maybe SAS?

I did warn it was technical.
I think the real conclusion is R is well coordinated, Python is on the way, and Julia is a mess of uncoordinated libraries… that is for statistical work. Julia was quite good for image analysis. I have not tried Python for image analysis.


There is the saying

Lies, damned lies and statistics.

I never got that deep into analysis… I never needed more than things like standard deviation, max, min and average, then graphing the results. As I always used the same tools to compare month on month, the results were comparable.

Mine was for hospital info: lengths of stay, admission rates, theatre times, waiting times and, sadly, deaths. This was used to chart one hospital against another. What made it more difficult was trying to find results of a meaningful nature. This became obvious when tracking a specialist service.


That dataset has 127239 observations.
Can you imagine that in a spreadsheet?
All 3 (R, Python, Julia) put the data in a dataframe object… it is in memory but can be swapped. Notice that when I listed it, they showed the first few rows and last few rows only.
They all coped with the data read-in quite speedily.
I have seen work done in R with millions of rows of data… e.g. DNA matching. It does not put all that in memory… it buffers it.

Yeah, Excel would be fine for what you wanted.


1,048,576 rows is Excel's limit, so no big deal if you have a 64-bit processor and memory to suit. I used one sheet for the data, then the calculations on sheet 2.

Did work on bigger data sets through linking and SQL, but it's so far back I cannot remember much more.
