Linux utilities for managing .csv files

I was processing some image data recently. It was in the form of .csv files, which have data values separated by commas. They look like this:

Farm,Grade,Sheep,Field,Area,Red,Green,Blue,Count
Glensloy,smooth,3465,5,1.0,0.7843137383460999,0.5764706134796143,0.5137255191802979,1
Glensloy,smooth,3465,5,23.375,0.7505541495654894,0.18942882217790769,0.25200341577115265,23
Glensloy,smooth,3465,5,16.25,0.8262255005538464,0.21200980711728334,0.27990197017788887,16
Glensloy,smooth,3465,5,23.125,0.7923351282423193,0.22834225337613712,0.3222816484895619,22
....

They usually start with a line of field headings, followed by lines of data fields.

I found that .csv files do not process readily with the normal Linux utilities, but I discovered two packages specifically for processing .csv files. They are called csvkit and miller.

Installation

The packages csvkit and miller are available in Debian based and Arch based distros.
Let's install them in my MX:

apt install csvkit csvkit-doc
apt install miller

Simple. What you get is a whole set of new commands specifically for .csv files:

$ man -k csv
csv2rec (1)          - csv to rec converter
csvclean (1)         - manual page for csvclean 1.0.7
csvcut (1)           - manual page for csvcut 1.0.7
csvformat (1)        - manual page for csvformat 1.0.7
csvgrep (1)          - manual page for csvgrep 1.0.7
csvjoin (1)          - manual page for csvjoin 1.0.7
csvjson (1)          - manual page for csvjson 1.0.7
csvlook (1)          - manual page for csvlook 1.0.7
csvpy (1)            - manual page for csvpy 1.0.7
csvsort (1)          - manual page for csvsort 1.0.7
csvsql (1)           - manual page for csvsql 1.0.7
csvstack (1)         - manual page for csvstack 1.0.7
csvstat (1)          - manual page for csvstat 1.0.7
in2csv (1)           - manual page for in2csv 1.0.7
mlr (1)              - - like awk, sed, cut, join, and sort for name-indexed data such as CSV and t...
pedigree (1)         - create a TeX file for pedigree from a csv file
rec2csv (1)          - rec to csv converter
sql2csv (1)          - manual page for sql2csv 1.0.7
Text::CSV (3pm)      - comma-separated values manipulator (using XS or PurePerl)
Text::CSV_PP (3pm)   - Text::CSV_XS compatible pure-Perl module
Text::CSV_XS (3pm)   - comma-separated values manipulation routines

mlr comes from the miller package; the others are from csvkit. mlr provides subcommands emulating awk, sed, cut, join, and sort for .csv files (and also JSON).

The csvkit commands

There is no general man page for csvkit. There are individual man pages for each of the csvkit utilities listed above… in2csv, sql2csv, csvclean, csvcut, csvformat, csvgrep, csvjoin, csvjson, csvlook, csvpy, csvsort, csvsql, csvstack, csvstat. (The csv2rec, rec2csv, pedigree and Text::CSV entries come from other packages.)

There is also an .html document installed by the package csvkit-doc. You can find it here:
file:///usr/share/doc/csvkit/html/index.html
It is a comprehensive document, and includes a tutorial.

There is no point in my duplicating it here.
The utility I was originally interested in was csvstack, which will concatenate .csv files without duplicating the heading line:

csvstack *.csv > combinedfile
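
If you want to keep track of which file each row came from, csvstack can also add a grouping column (the group labels and file names below are only placeholders):

csvstack -g run1,run2 -n source expt1.csv expt2.csv > combinedfile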

I also noticed csvstat, which emulates what the R summary() function does:

$ csvcut -c Area expt1.csv | csvstat
/usr/lib/python3/dist-packages/agate/table/from_csv.py:74: RuntimeWarning: Error sniffing CSV dialect: Could not determine delimiter
  1. "Area"

	Type of data:          Number
	Contains null values:  False
	Unique values:         21937
	Smallest value:        1
	Largest value:         199,468.625
	Sum:                   88,354,619.5
	Mean:                  694.399
	Median:                22.625
	StDev:                 2,679.849
	Most common values:    1 (29493x)
	                       2 (9181x)
	                       3 (2491x)
	                       3.125 (1983x)
	                       4 (1290x)

Row count: 127239

Seems OK. I am not sure what the Python message means; it looks like a harmless warning, probably because the single piped column leaves no delimiter for the dialect sniffer to detect.
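
A couple of the other csvkit tools chain together nicely. Something like this should pick out the smooth-grade rows and show a few columns as a table (the column names come from the sample header above, so adjust them for your own file):

$ csvgrep -c Grade -m smooth expt1.csv | csvcut -c Farm,Area,Count | csvlook | head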

Read the document… it is comprehensive.

The ‘mlr’ commands

There is no .html document for ‘mlr’, but the man page is extensive and has examples.
There is a help facility:

$ mlr help topics
Type 'mlr help {topic}' for any of the following:
Essentials:
  mlr help topics
  mlr help basic-examples
  mlr help file-formats
Flags:
  mlr help flags
  mlr help list-separator-aliases
  mlr help list-separator-regex-aliases
  mlr help comments-in-data-flags
  mlr help compressed-data-flags
  mlr help csv/tsv-only-flags
  mlr help file-format-flags
  mlr help flatten-unflatten-flags
  mlr help format-conversion-keystroke-saver-flags
  mlr help json-only-flags
  mlr help legacy-flags
  mlr help miscellaneous-flags
  mlr help output-colorization-flags
  mlr help pprint-only-flags
  mlr help profiling-flags
  mlr help separator-flags
Verbs:
  mlr help list-verbs
  mlr help usage-verbs
  mlr help verb
Functions:
  mlr help list-functions
  mlr help list-function-classes
  mlr help list-functions-in-class
  mlr help usage-functions
  mlr help usage-functions-by-class
  mlr help function
Keywords:
  mlr help list-keywords
  mlr help usage-keywords
  mlr help keyword
Other:
  mlr help auxents
  mlr help mlrrc
  mlr help output-colorization
  mlr help type-arithmetic-info
Shorthands:
  mlr -g = mlr help flags
  mlr -l = mlr help list-verbs
  mlr -L = mlr help usage-verbs
  mlr -f = mlr help list-functions
  mlr -F = mlr help usage-functions
  mlr -k = mlr help list-keywords
  mlr -K = mlr help usage-keywords
Lastly, 'mlr help ...' will search for your exact text '...' using the sources of
'mlr help flag', 'mlr help verb', 'mlr help function', and 'mlr help keyword'.
Use 'mlr help find ...' for approximate (substring) matches, e.g. 'mlr help find map'
for all things with "map" in their names.

The most important thing seems to be ‘verbs’… these are the commands, e.g.
mlr cut -f field1,field2
extracts field1 and field2 from all records.
‘cut’ is the verb.
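
As a slightly bigger sketch, something like this should print the mean and total Area for each Farm in the sample data above (‘stats1’ is the verb; --icsv and --opprint say the input is CSV and the output should be pretty-printed):

mlr --icsv --opprint stats1 -a mean,sum -f Area -g Farm expt1.csv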

So, read the man page. It has everything.

Apps which use .csv files

R and Julia and Python can read and write .csv files.
R and Julia read them into DataFrames.
You can also read them into a spreadsheet:

$ gnumeric lab3459_5.jpg.csv

Gnumeric is my favorite spreadsheet.

Some File Managers, when you ask to display a .csv file, will display it using a spreadsheet. I find that annoying. I need to see the raw file.

Conclusion

Anyone who manages any kind of data files could find these utilities useful.

7 Likes

I prefer LibreOffice Calc. I have never had a problem, and I have done some really big files and data. So big that I had to split the file into two parts, as it would not fit on one sheet.

2 Likes

Too many buttons to press. I can start gnumeric from the CLI.

4 Likes

Me, too, Paul. A csv file is picked up by calc pretty easily. I can start it with a right click, Nev.

3 Likes

Hi Neville, :waving_hand:

thanks for providing an extensive information section for csvkit and miller.
Very interesting. :+1:

Mine too, actually.

I've been using it since my Lubuntu days. I believe the OS came with gnumeric preinstalled, with a focus on lightness.
For Linux Lite I had to install it. I like it very much.

The same with me. I hardly ever start any application any other way than with the help of CLI.
Partly due to the fact that I sandbox almost all apps with firejail, partly because I got so accustomed to the terminal.

Many greetings from Rosika :slightly_smiling_face:

1 Like

Great minds think alike.
It works much better on a laptop too… instead of using the clumsy touchpad.
I don't want to start a mouse versus keyboard war… Bill and Paul can have their Mouseworld.

2 Likes

Hi Neville, :waving_hand:

What a nice compliment. I am very flattered. :blush: :heart:

As I have fish set as my default shell, accessing all the commands I've ever used can be done in a jiffy. Only some (tiny) part of the command is needed. Enter it and then “arrow up” until you find it.
No need to define aliases.

The bottom line is: I have those commands at my disposal pretty much right away, instead of wasting time clicking through graphical menus and submenus… :blush:

Cheers from Rosika. :slightly_smiling_face:

2 Likes

You should join it some time, it's picturesque! And only a click away.

1 Like

If you right-click on your csv file, it selects Open With and, by default, Open Office Calc.
It asks: is this a csv file?
Does it have a header record describing each column, yes or no?
Click next and it's imported, plus separated into columns.
Save it in another format so you don't have to do it again.

OK, 4 perhaps 5 clicks and it's done… RSI from mouse clicks? Perhaps.

2 Likes

Python can read CSV into data frames too.

3 Likes

We can reserve a chair and a nice drink for them, Paul.

2 Likes

I did not know Python had DataFrames.
I need to get some Python practice.

3 Likes

There is a CSV library or module you can import from the standard library for importing or exporting or converting purposes.

The dataframe support typically comes from the pandas or polars libraries. I believe the performance-critical parts of Pandas are written in C, and Polars is written in Rust. So you get the speed of C or Rust where it counts, but the ease of development of Python for orchestrating the rest of the program.
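
As a minimal sketch (assuming pandas is installed, and borrowing the expt1.csv file from the examples above):

$ python3 -c "import pandas as pd; df = pd.read_csv('expt1.csv'); print(df['Area'].describe())"

describe() prints count, mean, std, min, quartiles and max, much like the csvstat output earlier in the thread.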

3 Likes

I’ll have to check some of these out - I use CSV files reasonably often…

I agree - it’s less sophisticated than LibreOffice Calc, but it works better with CSV files… If I open a CSV file in LibreOffice Calc, it does nasty stuff like reformatting numbers that look like dates (e.g. treating YYYYmmdd as mm/dd/YYYY, or worse, adding a time in a.m. or p.m.) - Gnumeric does a much better job…

I have a shell script that uses “column” to display CSV files:

#!/usr/bin/env bash
# read a csv in the terminal
# column -s, -t < eni-linux-list.csv | less -#2 -N -S
# expect an input file name as the only argument
PROG=$(basename "$0")
CSVFILE="$1"
if [ "$#" -lt 1 ] ; then
        echo "need argument - expecting a CSV file..."
        echo "e.g. : $PROG filename.csv"
        exit 1
fi
# column -s, -t < $CSVFILE|grep -v \# | less -#2 -N -S
# column -s, -t < $CSVFILE|grep -v \# | less -#2 -N -S
# drop lines containing '#', align the comma-separated fields, and page the result
grep -v '#' "$CSVFILE" | column -s, -t | less -#2 -N -S

Yeah - I have a nasty habit of leaving commented out lines in my bash scripts…

I’ve never really had to use CSV-specific utilities though - but my CSV files are probably a lot smaller than yours… usually cat, less, column and sed are enough for my use case…

BTW - I have both libre and gnumeric installed in Pop!_OS - and I think they both came pre-installed - Gnumeric is the default in Nautilus when double-clicking on CSV files…

1 Like

I do that everywhere. It is a memory aid for me, but it tends to confuse others.

I think you could pipe to gnumeric… must try it.

3 Likes

I had a problem recently with a .csv file and gLabels, where Calc corrupted elements of the file, adding strange characters in place of the ampersand (typically, but not exclusively). In the end, I just opened the .csv address file in a terminal, edited it, saved it back, and there has not been an issue since.
I also tried Gnumeric but that didn’t solve the issue - indeed, it exacerbated it by adding unwanted fields in the file, to the extent that additional unwanted columns were introduced into Gnumeric.
This was in Debian 12.10 Wayland, LO Flatpak v 25.2.3.2 locale: en_GB.UTF-8 UI: en-GB
I tried asking the question in GNOME Discourse at the time but didn’t really get anywhere so, as with the constipated mathematician, I worked it with a pencil by going back to basics.

1 Like

Yes, you always have the option of editing the raw file, because .csv is a text file.

These programs, like Calc and gnumeric, make too many assumptions. I think csvkit is safe in that respect. Not sure about mlr. I have not used gLabels.

2 Likes

It is very unlikely that gLabels caused the issue, tbh. It just reads the .csv file and parses it into the fields derived from the header line in the .csv file. If you have a need for printing labels, gLabels is my go-to. Setting it up for things like business cards or address labels is a breeze. I did try doing it in Calc but it just didn’t cut the mustard. gLabels has an extensive database of different commercially available standard labels (Avery et al) and setting up a non-standard size label is quite intuitive. Very little wastage too, as the print routines take account of singles and “widowed” labels on a sheet.
The other alternative to Calc for editing .csv files is Google Sheets, where the file can be imported (File ⇒ Import ⇒ Upload ⇒ Browse, or drag a file) and changes are saved dynamically.

1 Like

" The csvclean command is a utility from the csvkit suite designed to clean and standardize CSV files. It automatically detects and corrects common errors, inconsistencies, and formatting issues that can hinder data analysis. It helps ensure data integrity by addressing encoding problems, normalizing delimiters, removing stray whitespace, and converting invalid characters. By providing a consistent and well-formatted CSV file, csvclean streamlines subsequent data processing steps like importing into databases or using data analysis tools. Using it helps to minimize errors and maximize the efficiency of your data workflows."

You may find csvclean useful, instead of trying to fix a .csv file with hand edits.
I would also try csvformat and csvlook.

Sometimes using a spreadsheet just to fix a file is overkill.
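
Something along those lines (the file name is just a placeholder, and if I remember right the -n flag makes csvclean do a dry run that only reports problems instead of writing new files):

csvclean -n addresses.csv
csvlook addresses.csv | less -S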

2 Likes

Does csvkit also include csvclean when downloaded or are they separate packages?

1 Like