Analysis of expt1 collagen data with Python
This will be an experience … I have never written a line of Python.
trinity:[nevj]:~/juliawork/dermis.collagen.images$ python
Python 3.13.3 (main, Apr 9 2025, 17:13:31) [GCC 14.2.1 20250207] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
OK, so it gives me an interactive prompt, a REPL or something.
Now let's try to read a .csv file
>>> import pandas as pd
Traceback (most recent call last):
File "<python-input-0>", line 1, in <module>
import pandas as pd
ModuleNotFoundError: No module named 'pandas'
I guess I need to install pandas first?
$ pacman -Ss pandas
world/python-pandas 2.2.3-1
High-performance, easy-to-use data structures and data analysis tools for
Python
# pacman -S python-pandas
resolving dependencies...
(5/5) installing python-pandas [---------------------] 100%
Now try in Python to read a .csv file
Python 3.13.3 (main, Apr 9 2025, 17:13:31) [GCC 14.2.1 20250207] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>>
>>> df = pd.read_csv('expt1.csv')
>>> df
Farm Grade Sheep ... Green Blue Count
0 Glensloy smooth 3457 ... 0.304495 0.263945 732
1 Glensloy smooth 3457 ... 0.233286 0.481588 41
2 Glensloy smooth 3457 ... 0.285897 0.311324 374
3 Glensloy smooth 3457 ... 0.232893 0.303401 49
4 Glensloy smooth 3457 ... 0.204372 0.245076 87
... ... ... ... ... ... ... ...
127234 Manton wrinkledbn 3523 ... 0.183007 0.205229 3
127235 Manton wrinkledbn 3523 ... 0.163076 0.152771 1244
127236 Manton wrinkledbn 3523 ... 0.182776 0.180518 125
127237 Manton wrinkledbn 3523 ... 0.352941 0.356863 1
127238 Manton wrinkledbn 3523 ... 0.367474 0.366090 17
[127239 rows x 9 columns]
Brilliant. I seem to have a DataFrame in Python.
Now I have been advised that this is not a desirable way to proceed. I am using the preinstalled global version of Python that is intimately connected with the Linux OS.
I should start again and make a user-owned virtual environment which will contain my personal project-specific version of Python. I can then install all my required libraries in the virtual environment, and not interfere with the system.
So, starting again: make a project directory, use the uv utility to create a .venv
subdirectory, install Python, and use pip to install the required libraries
uv venv
uv python install
uv pip install pandas
uv pip install scipy
uv pip install statsmodels
The directory .venv now contains
$ ls .venv
bin CACHEDIR.TAG lib lib64 pyvenv.cfg
$ ls .venv/bin
activate activate.fish activate_this.py numpy-config python3
activate.bat activate.nu deactivate.bat pydoc.bat python3.13
activate.csh activate.ps1 f2py python
$ ls .venv/lib/py*/si*
dateutil pytz
numpy pytz-2025.2.dist-info
numpy-2.2.6.dist-info scipy
numpy.libs scipy-1.15.3.dist-info
packaging scipy.libs
packaging-25.0.dist-info six-1.17.0.dist-info
pandas six.py
pandas-2.2.3.dist-info statsmodels
pandas.libs statsmodels-0.14.4.dist-info
patsy tzdata
patsy-1.0.1.dist-info tzdata-2025.2.dist-info
__pycache__ _virtualenv.pth
python_dateutil-2.9.0.post0.dist-info _virtualenv.py
plus a few other things
I can now run a python script with uv run scriptname.py
but if I want to start a Python REPL, I first have to activate the virtual environment
source .venv/bin/activate
python
>>>
Now try again to read a .csv file
$ python
>>> import pandas as pd
>>>
>>> df = pd.read_csv('expt1.csv')
>>> df
Farm Grade Sheep ... Green Blue Count
0 Glensloy smooth 3457 ... 0.304495 0.263945 732
1 Glensloy smooth 3457 ... 0.233286 0.481588 41
2 Glensloy smooth 3457 ... 0.285897 0.311324 374
3 Glensloy smooth 3457 ... 0.232893 0.303401 49
4 Glensloy smooth 3457 ... 0.204372 0.245076 87
... ... ... ... ... ... ... ...
127234 Manton wrinkledbn 3523 ... 0.183007 0.205229 3
127235 Manton wrinkledbn 3523 ... 0.163076 0.152771 1244
127236 Manton wrinkledbn 3523 ... 0.182776 0.180518 125
127237 Manton wrinkledbn 3523 ... 0.352941 0.356863 1
127238 Manton wrinkledbn 3523 ... 0.367474 0.366090 17
[127239 rows x 9 columns]
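Before modelling, a couple of quick sanity checks on the loaded DataFrame are worthwhile. A minimal sketch, using a toy in-memory CSV since expt1.csv itself is not reproduced here:

```python
import io
import pandas as pd

# A toy stand-in for expt1.csv (the real file is not reproduced here)
csv = io.StringIO(
    "Farm,Grade,Area\n"
    "Glensloy,smooth,737.625\n"
    "Glensloy,smooth,41.625\n"
    "Manton,wrinkledbn,17.0\n"
)
df = pd.read_csv(csv)

# Quick sanity checks after loading
print(df.shape)             # (3, 3): rows x columns
print(df.dtypes)            # Farm/Grade are object (strings), Area is float64
print(df["Farm"].unique())  # ['Glensloy' 'Manton']
```

The same calls (df.shape, df.dtypes, df["Farm"].unique()) work unchanged on the real 127239-row frame.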
It works. Now, how do I do an analysis of variance in Python?
It looks like you can use the ols() function
>>> model1 = ols('Area ~ C(Farm) + C(Grade)', df).fit()
Traceback (most recent call last):
File "<python-input-6>", line 1, in <module>
model1 = ols('Area ~ C(Farm) + C(Grade)', df).fit()
^^^
NameError: name 'ols' is not defined
>>>
OK, I need to install or import something to get ols
>>> import numpy as np
>>> model1 = ols('Area ~ C(Farm) + C(Grade)', df).fit()
Traceback (most recent call last):
File "<python-input-9>", line 1, in <module>
model1 = ols('Area ~ C(Farm) + C(Grade)', df).fit()
^^^
NameError: name 'ols' is not defined
It turns out ols() lives in statsmodels.formula.api (the scipy.stats import below is not actually needed for this)
>>> import scipy.stats as stats
>>> import statsmodels.formula.api as smf
>>> model1 =smf.ols('Area ~ C(Farm) + C(Grade)', df).fit()
>>> print(model1.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Area R-squared: 0.016
Model: OLS Adj. R-squared: 0.016
Method: Least Squares F-statistic: 1038.
Date: Thu, 15 May 2025 Prob (F-statistic): 0.00
Time: 21:23:53 Log-Likelihood: -1.1839e+06
No. Observations: 127239 AIC: 2.368e+06
Df Residuals: 127236 BIC: 2.368e+06
Df Model: 2
Covariance Type: nonrobust
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                424.3224     10.964     38.701      0.000     402.833     445.812
C(Farm)[T.Manton]        705.7196     16.359     43.140      0.000     673.656     737.783
C(Grade)[T.wrinkledbn]   110.6674     15.136      7.312      0.000      81.001     140.334
==============================================================================
Omnibus: 259203.996 Durbin-Watson: 1.921
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2179334936.138
Skew: 16.663 Prob(JB): 0.00
Kurtosis: 643.280 Cond. No. 2.72
==============================================================================
OK, it has fitted a multiple regression … I wanted an ANOVA table for a fixed-effects model.
(It turns out ols() was the right function all along; the missing piece is anova_lm().)
>>> from statsmodels.formula.api import ols
>>> from statsmodels.stats.anova import anova_lm
>>> model1 = ols('Area ~ C(Farm) + C(Grade)', df).fit()
>>>
>>> model1.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: Area R-squared: 0.016
Model: OLS Adj. R-squared: 0.016
Method: Least Squares F-statistic: 1038.
Date: Thu, 15 May 2025 Prob (F-statistic): 0.00
Time: 21:40:57 Log-Likelihood: -1.1839e+06
No. Observations: 127239 AIC: 2.368e+06
Df Residuals: 127236 BIC: 2.368e+06
Df Model: 2
Covariance Type: nonrobust
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                424.3224     10.964     38.701      0.000     402.833     445.812
C(Farm)[T.Manton]        705.7196     16.359     43.140      0.000     673.656     737.783
C(Grade)[T.wrinkledbn]   110.6674     15.136      7.312      0.000      81.001     140.334
==============================================================================
Omnibus: 259203.996 Durbin-Watson: 1.921
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2179334936.138
Skew: 16.663 Prob(JB): 0.00
Kurtosis: 643.280 Cond. No. 2.72
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
That is just a regression summary … for the ANOVA table of the fixed-effects model I need to do
>>> anova_lm(model1)
df sum_sq mean_sq F PR(>F)
C(Farm) 1.0 1.429054e+10 1.429054e+10 2022.317244 0.000000e+00
C(Grade) 1.0 3.777585e+08 3.777585e+08 53.458256 2.656887e-13
Residual 127236.0 8.991031e+11 7.066421e+06 NaN NaN
>>>
If we compare this with the first model fitted in R, it is exactly the same. Success!
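For reference, the whole workflow so far condenses into one short script. This is a sketch with made-up toy data (expt1.csv is not included here), but the imports and calls are the ones used above:

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Toy data standing in for expt1.csv: two farms, two grades
df = pd.DataFrame({
    "Farm":  ["Glensloy"] * 4 + ["Manton"] * 4,
    "Grade": ["smooth", "smooth", "wrinkledbn", "wrinkledbn"] * 2,
    "Area":  [10.0, 12.0, 14.0, 15.0, 20.0, 22.0, 25.0, 27.0],
})

# Fit the fixed-effects model, then print the ANOVA table
model = ols("Area ~ C(Farm) + C(Grade)", df).fit()
print(anova_lm(model))
```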
Now try a model with interaction
>>> model2 = ols('Area ~ C(Farm) + C(Grade) + C(Farm):C(Grade)', df).fit()
>>> model2.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: Area R-squared: 0.019
Model: OLS Adj. R-squared: 0.019
Method: Least Squares F-statistic: 832.8
Date: Sat, 17 May 2025 Prob (F-statistic): 0.00
Time: 21:55:48 Log-Likelihood: -1.1837e+06
No. Observations: 127239 AIC: 2.367e+06
Df Residuals: 127235 BIC: 2.367e+06
Df Model: 3
Covariance Type: nonrobust
============================================================================================================
                                               coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------
Intercept                                  338.8571     11.721     28.909      0.000     315.883     361.831
C(Farm)[T.Manton]                         1067.4783     24.115     44.266      0.000    1020.213    1114.744
C(Grade)[T.wrinkledbn]                     315.4990     18.146     17.387      0.000     279.933     351.065
C(Farm)[T.Manton]:C(Grade)[T.wrinkledbn]  -668.2950     32.777    -20.389      0.000    -732.537    -604.053
==============================================================================
Omnibus: 259289.282 Durbin-Watson: 1.927
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2192884626.973
Skew: 16.674 Prob(JB): 0.00
Kurtosis: 645.272 Cond. No. 6.49
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
>>> anova_lm(model2)
df sum_sq mean_sq F PR(>F)
C(Farm) 1.0 1.429054e+10 1.429054e+10 2028.908896 0.000000e+00
C(Grade) 1.0 3.777585e+08 3.777585e+08 53.632501 2.431477e-13
C(Farm):C(Grade) 1.0 2.928109e+09 2.928109e+09 415.720047 2.930235e-92
Residual 127235.0 8.961750e+11 7.043463e+06 NaN NaN
>>>
That is the same as in R too, except I have not yet included a separate Sheep term.
Now, before adding Sheep to the model, let's see how to calculate means for the classes defined by the model.
I can get the fitted coefficients like this
>>> coefficients = model2.params
>>> print(coefficients)
Intercept 338.857130
C(Farm)[T.Manton] 1067.478332
C(Grade)[T.wrinkledbn] 315.499011
C(Farm)[T.Manton]:C(Grade)[T.wrinkledbn] -668.294995
dtype: float64
but they are not means; they are regression coefficients, and they are already listed by model2.summary().
The Intercept is the mean for the first Farm and the first Grade; the others are contrasts (differences from that baseline), which is confusing. I want class means for the fitted model … in R this is provided by the function model.tables() … Python does not seem to have that function.
The table I want is

| Farm     | Smooth | Wrinkled |
|----------|--------|----------|
| Glensloy |    339 |      654 |
| Manton   |   1406 |     1054 |
These values are calculable from the (rounded) regression coefficients:
339 + 1067 = 1406
339 + 315 = 654
339 + 1067 + 315 - 668 = 1053 (≈ 1054 with the unrounded coefficients)
but I would like a general procedure that will do this for any model.
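As an aside: for plain cell means (as opposed to means derived from fitted coefficients), pandas can build the whole table directly with groupby. A minimal sketch, using the four rounded cell means as toy data:

```python
import pandas as pd

# Toy data standing in for expt1.csv
df = pd.DataFrame({
    "Farm":  ["Glensloy", "Glensloy", "Manton", "Manton"],
    "Grade": ["smooth", "wrinkledbn", "smooth", "wrinkledbn"],
    "Area":  [339.0, 654.0, 1406.0, 1054.0],
})

# Mean of Area for every Farm x Grade cell, pivoted into a two-way table
cell_means = df.groupby(["Farm", "Grade"])["Area"].mean().unstack()
print(cell_means)
```

On the real data this gives the Farm-by-Grade table of raw cell means in one line; the design-matrix route explored below is still needed for model-adjusted means.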
The most likely place to find something useful is in the formulae package
uv pip install formulae
Its documentation is here.
With that I can at least get a design matrix.
>>> from formulae import design_matrices
>>> dm = design_matrices("Farm + Grade",df)
>>> dm.common.as_dataframe().head()
Intercept Farm[Manton] Grade[wrinkledbn]
0 1.0 0.0 0.0
1 1.0 0.0 0.0
2 1.0 0.0 0.0
3 1.0 0.0 0.0
4 1.0 0.0 0.0
That is for the model
Area ~ Intercept + Farm + Grade
If I want to compute cell means, I use a different model
Area ~ 0 + Farm*Grade
which says fit all combinations of Farm and Grade, but no grand mean
>>> dm = design_matrices("0 + Farm*Grade",df)
>>> dm.common.as_dataframe().head()
Farm[Glensloy] Farm[Manton] Grade[wrinkledbn] Farm[Manton]:Grade[wrinkledbn]
0 1.0 0.0 0.0 0.0
1 1.0 0.0 0.0 0.0
2 1.0 0.0 0.0 0.0
3 1.0 0.0 0.0 0.0
4 1.0 0.0 0.0 0.0
Then I get a design matrix with 4 columns of 0's and 1's indicating which Farm and which Grade each row (= observation) belongs to.
I can use that matrix to calculate Farm and Grade means for Area from the dataframe.
>>> dmfg = dm.common.as_dataframe()
>>> dmfg.shape
(127239, 4)
>>> np.dot(dmfg["Farm[Glensloy]"],df["Area"])
np.float64(41390646.125)
That does a dot product of two vectors … the 'Farm[Glensloy]' column of the design matrix and the 'Area' column of the dataframe.
Now, that is a sum; I need to divide it by N to get a mean … N is the number of 1's in the 'Farm[Glensloy]' column of dmfg.
>>> np.dot(dmfg["Farm[Glensloy]"],df["Area"])/np.dot(dmfg["Farm[Glensloy]"],dmfg["Farm[Glensloy]"])
np.float64(470.4979553153276)
470.5 is the wrong answer: the 'Farm[Glensloy]' column of this design matrix flags every Glensloy row regardless of Grade, so I have computed the overall Glensloy mean, not a cell mean.
The design matrix is wrong because I used the wrong model formula to make dmfg. I need the 'patsy' formula package and a different design-matrix constructor
>>> from patsy import ModelDesc
>>> dmat1 = dmatrices("Area ~ 0 + Farm:Grade",df)[1]
Traceback (most recent call last):
File "<python-input-64>", line 1, in <module>
dmat1 = dmatrices("Area ~ 0 + Farm:Grade",df)[1]
^^^^^^^^^
NameError: name 'dmatrices' is not defined
I need another import to supply dmatrices()
from patsy import ModelDesc, Term, EvalFactor, dmatrices
Now try to make a design matrix
>>> dmat1 = dmatrices("Area ~ 0 + Farm:Grade",df)
>>> dmat1
(DesignMatrix with shape (127239, 1)
Area
737.625
41.625
375.500
49.500
88.000
1411.125
2214.375
206.625
10.375
1891.250
752.875
1.000
257.250
26.375
306.250
6.250
3233.875
133.000
838.625
13.750
2066.000
73.250
10.125
56.750
13.750
2.000
66.750
233.875
20825.375
2.000
[127209 rows omitted]
Terms:
'Area' (column 0)
(to view full data, use np.asarray(this_obj)), DesignMatrix with shape (127239, 4)
Columns:
['Farm[Glensloy]:Grade[smooth]', 'Farm[Manton]:Grade[smooth]', 'Farm[Glensloy]:Grade[wrinkledbn]', 'Farm[Manton]:Grade[wrinkledbn]']
Terms:
'Farm:Grade' (columns 0:4)
(to view full data, use np.asarray(this_obj)))
I don't get it. What sort of an object is dmat1?
This is different from the example given here
Reading further
I can get rid of the 'Area' part of dmat1 by omitting it from the formula and using dmatrix() instead of dmatrices()
>>> from patsy import dmatrices, dmatrix
>>> dmat1 = dmatrix("0 + Farm:Grade",df)
>>> dmat1
DesignMatrix with shape (127239, 4)
Columns:
['Farm[Glensloy]:Grade[smooth]', 'Farm[Manton]:Grade[smooth]', 'Farm[Glensloy]:Grade[wrinkledbn]', 'Farm[Manton]:Grade[wrinkledbn]']
Terms:
'Farm:Grade' (columns 0:4)
(to view full data, use np.asarray(this_obj))
>>> np.asarray(dmat1)
array([[1., 0., 0., 0.],
[1., 0., 0., 0.],
[1., 0., 0., 0.],
...,
[0., 0., 0., 1.],
[0., 0., 0., 1.],
[0., 0., 0., 1.]], shape=(127239, 4))
>>>
That looks right; it should have 4 columns.
I need the zero (0) in the formula to stop patsy making a design-matrix column for the grand mean (the intercept).
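This design matrix suggests a general recipe that works for any 0/1 cell-indicator matrix X and response vector y: the vector of cell means is X^T y (the cell sums) divided elementwise by the column sums of X (the cell counts). A numpy-only sketch with a made-up two-cell matrix (in the real analysis X would come from the formula machinery):

```python
import numpy as np

# A 0/1 cell-indicator design matrix: each row flags exactly one cell.
X = np.array([
    [1, 0],   # observation in cell 0
    [1, 0],   # observation in cell 0
    [0, 1],   # observation in cell 1
])
y = np.array([10.0, 14.0, 30.0])

# Cell sums divided by cell counts give all the cell means at once
means = (X.T @ y) / X.sum(axis=0)
print(means)  # [12. 30.]
```

This is the "general procedure for any model" I was after, reduced to one line of linear algebra.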
Now let's try to use dmat1 to compute some cell means
>>> np.dot(dmat1["Farm[Glensloy:Grade[smooth]"],df["Area"])/np.dot(dmat1["Farm[Glensloy]:Grade[smooth]"/np.dot(dmat1["Farm[Glensloy]:Grade[smooth]"],dmat1["Farm[Glensloy]:Grade[smooth]"])
... hangs (in fact the parentheses are unbalanced, so the REPL is silently waiting for more input)
>>> np.dot(dmat1["Farm[Glensloy:Grade[smooth]"],df["Area"])
Traceback (most recent call last):
File "<python-input-62>", line 1, in <module>
np.dot(dmat1["Farm[Glensloy:Grade[smooth]"],df["Area"])
~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
>>> dmat1["Farm[Glensloy:Grade[smooth]"]
Traceback (most recent call last):
File "<python-input-63>", line 1, in <module>
dmat1["Farm[Glensloy:Grade[smooth]"]
~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
The index is invalid?
Try single quotes
>>> dmat1['Farm[Glensloy:Grade[smooth]']
Traceback (most recent call last):
File "<python-input-65>", line 1, in <module>
dmat1['Farm[Glensloy:Grade[smooth]']
~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
I don't get it; that is the column name, not an index. (Well, almost: I dropped a ']' after Glensloy, but even the exact name fails the same way.)
np.asarray(dmat1['Farm[Glensloy:Grade[smooth]']) fails too (my first attempt was missing its closing parenthesis, which just left the REPL waiting and looked like a hang), but
np.asarray(dmat1) prints the array.
>>> dmat1.design_info
DesignInfo(['Farm[Glensloy]:Grade[smooth]', 'Farm[Manton]:Grade[smooth]', 'Farm[Glensloy]:Grade[wrinkledbn]', 'Farm[Manton]:Grade[wrinkledbn]'], factor_infos={EvalFactor('Farm'): FactorInfo(factor=EvalFactor('Farm'), type='categorical', state=<factor state>, categories=('Glensloy', 'Manton')), EvalFactor('Grade'): FactorInfo(factor=EvalFactor('Grade'), type='categorical', state=<factor state>, categories=('smooth', 'wrinkledbn'))}, term_codings=OrderedDict({Term([EvalFactor('Farm'), EvalFactor('Grade')]): [SubtermInfo(factors=(EvalFactor('Farm'), EvalFactor('Grade')), contrast_matrices={EvalFactor('Farm'): ContrastMatrix(array([[1., 0.],
[0., 1.]]), ['[Glensloy]', '[Manton]']), EvalFactor('Grade'): ContrastMatrix(array([[1., 0.],
[0., 1.]]), ['[smooth]', '[wrinkledbn]'])}, num_columns=4)]}))
>>>
>>> dmat1.design_info.column_names
['Farm[Glensloy]:Grade[smooth]', 'Farm[Manton]:Grade[smooth]', 'Farm[Glensloy]:Grade[wrinkledbn]', 'Farm[Manton]:Grade[wrinkledbn]']
>>>
It all looks right; I just can't access the array elements by column name.
I think it is because dmat1 is a sparse matrix … it is all 0's and 1's.
>>> import scipy.sparse
>>> from scipy.sparse import issparse
>>> issparse(dmat1)
False
No, it is not that.
>>> type(dmat1)
<class 'patsy.design_info.DesignMatrix'>
>>>
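A hint at the resolution: patsy's DesignMatrix is (per patsy's documentation) a subclass of numpy's ndarray, so it has no name-based column indexing at all. The column names live in dmat1.design_info.column_names, and a column is selected by its integer position. A sketch, assuming patsy is installed and using toy data in place of expt1.csv:

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Toy data standing in for expt1.csv
df = pd.DataFrame({
    "Farm":  ["Glensloy", "Glensloy", "Manton"],
    "Grade": ["smooth", "wrinkledbn", "smooth"],
    "Area":  [339.0, 654.0, 1406.0],
})

dmat1 = dmatrix("0 + Farm:Grade", df)

# DesignMatrix is an ndarray subclass: no string indexing, but
# design_info.column_names maps names to column positions.
names = dmat1.design_info.column_names
col = np.asarray(dmat1)[:, names.index("Farm[Glensloy]:Grade[smooth]")]

# Now the dot-product trick works: cell sum of Area / cell count
mean = np.dot(col, df["Area"]) / col.sum()
print(mean)  # 339.0
```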
(continued next reply)