I have this Julia program
using DataFrames, CSV

# Labels for this subdirectory (hard coded)
farm = "Glensloy"
grade = "smooth"

for arg in ARGS
    df = CSV.read(arg, DataFrame; delim=",")
    # File names look like 3457_1.jpg.csv: sheep number, then field number
    sheep = arg[1:4]
    field = arg[6]
    # Add the labelling columns
    df.Farm .= farm
    df.Grade .= grade
    df.Sheep .= sheep
    df.Field .= field
    # Put the label columns first
    df = select(df, [:Farm, :Grade, :Sheep, :Field, :Area, :Red, :Green, :Blue, :Count])
    # cols = [:Farm, :Grade, :Sheep, :Field];
    # df = select(df, cols, Not(cols))
    outfilename = string("lab", arg)
    CSV.write(outfilename, df)
end
which I use as follows
$ cd ~/juliawork/dermis.collagen.images/Glensloy/smooth
$ julia /home/nevj/juliawork/csvlabgs.jl *.csv
to add some data labelling columns to every .csv file in a directory.
Here is the result: a new lab3457_1.jpg.csv file corresponding to each 3457_1.jpg.csv file.
The labelled .csv files look like this
$ head -3 lab3448_1.jpg.csv
Farm,Grade,Sheep,Field,Area,Red,Green,Blue,Count
Glensloy,wrinkledbn,3448,1,181.0,0.8033233749664436,0.20620361517714916,0.22953362628228247,177
Glensloy,wrinkledbn,3448,1,132.0,0.8204894467841747,0.21903025353139685,0.2688250569987667,129
....
Note: I can use this program in a directory other than ~/juliawork
because it does not use an environment.
The trouble is that this version only works for the ~/juliawork/dermis.collagen.images/Glensloy/smooth directory… the labels 'Glensloy' and 'smooth' are hard coded into it.
It is too much trouble to have it find the labels by parsing the directory name… so I will just make a version for each of the 6 subdirectories.
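For the record, here is a rough sketch of what parsing the directory name might look like in the shell. It assumes the layout is always .../dermis.collagen.images/<farm>/<grade...>, and labels_from_dir is a made-up helper name, not something csvlabgs.jl uses. The grade comes out as the raw path (for example wrinkled/between), so it would still need mapping to whatever label I actually want in the Grade column.

```shell
#!/bin/sh
# Sketch only: derive farm and grade labels from a directory path.
# Assumes the layout .../dermis.collagen.images/<farm>/<grade...>
# labels_from_dir is a hypothetical helper, not part of csvlabgs.jl.
labels_from_dir() {
    dir=$1
    # Keep only the part after the dermis.collagen.images directory
    rel=${dir##*/dermis.collagen.images/}
    farm=${rel%%/*}     # first path component, e.g. Glensloy
    grade=${rel#*/}     # everything after it, e.g. smooth
    printf '%s %s\n' "$farm" "$grade"
}

labels_from_dir "$HOME/juliawork/dermis.collagen.images/Glensloy/smooth"
labels_from_dir "$HOME/juliawork/dermis.collagen.images/Manton/wrinkled/between"
```

The Julia program would still have to be changed to accept the two labels instead of hard coding them, which is the part I am skipping for now.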
So I do the same for the other 5 subdirectories; then I can start pooling all the individual .csv files to make one dataset.
Two steps
- Pool all the lab*.csv files within each subdirectory
cd ~/juliawork/dermis.collagen.images/Manton/smooth
head -n 1 lab3506_1.jpg.csv > all && tail -n+2 -q lab*.csv >> all
where lab3506_1.jpg.csv is the first labelled .csv file in that subdirectory… head copies the header line, then tail copies the data rows of every file, so the header line is not duplicated.
Repeat for each of 6 subdirectories
- Combine the ‘all’ files. I will first omit the on-wrinkle results (I will compare on-wrinkle and between-wrinkle samples later)
cd ~/juliawork/dermis.collagen.images
head -1 Glensloy/smooth/all > expt1.csv && tail -n+2 -q Glensloy/smooth/all >> expt1.csv && tail -n+2 -q Glensloy/wrinkled/between/all >> expt1.csv && tail -n+2 -q Manton/smooth/all >> expt1.csv && tail -n+2 -q Manton/wrinkled/between/all >> expt1.csv
$ ls -l expt1.csv
-rw-r--r-- 1 nevj nevj 11598408 May 6 21:32 expt1.csv
Yes, it is about 11.6 MB… not huge, but quite a bit of data.
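Both pooling steps follow the same pattern (header from the first file, data rows from every file), so they could be wrapped in one small shell function. This is only a sketch: pool_csv is a name I made up, and it assumes every input file has the same one-line header. It loops over the files one at a time, so it does not depend on the -q option of tail.

```shell
#!/bin/sh
# Sketch only: pool CSV files that all share the same header line.
# pool_csv is a hypothetical helper; usage: pool_csv OUTPUT INPUT...
pool_csv() {
    out=$1
    shift
    # Header line from the first input file only
    head -n 1 "$1" > "$out" || return 1
    # Data rows (line 2 onward) from every input file
    for f in "$@"; do
        tail -n +2 "$f" >> "$out" || return 1
    done
}

# Step 1, inside one subdirectory:
#   pool_csv all lab*.csv
# Step 2, from ~/juliawork/dermis.collagen.images:
#   pool_csv expt1.csv Glensloy/smooth/all Glensloy/wrinkled/between/all \
#       Manton/smooth/all Manton/wrinkled/between/all
```

The output name must not match the input pattern (all does not match lab*.csv), otherwise the output file would be read as one of its own inputs.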
Now I am ready for an analysis.
Note: there are better methods of pooling .csv files. One is to use the program csvstack from the package csvkit. I will be writing about that separately.