- Having downloaded only
- "subreddits24/askscience_comments.zst"
- from
- "Subreddit comments/submissions 2005-06 to 2024-12" on Academic Torrents
- ( https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4/tech&filelist=1 ),
- extracted it using 7-Zip, and split it into 10,000-line chunks using GNU `split`
- (`split -d -l 10000 askscience_comments askscience`),
- proceed as follows (this is a bash shell on Linux or WSL):
-
- There are now files "askscience{00..89}" and "askscience{9000..9563}": GNU split reserves numeric suffixes beginning with 9 for widening the suffix, so when the two-digit range runs out at 89 it jumps straight to the four-digit 9000.
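-
- If the mixed-width names are a nuisance, a fixed suffix width avoids them entirely. A hypothetical variant using GNU split's `-a`, under which the 90 + 564 = 654 chunks would come out as askscience0000 through askscience0653:

```shell
# Hypothetical alternative to the split above: force a 4-digit suffix
# so every chunk gets a uniform name (askscience0000, askscience0001, ...).
if [ -f askscience_comments ]; then
  split -d -a 4 -l 10000 askscience_comments askscience
fi
```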
-
- In the directory where the "askscience" files live,
-
- $ mkdir with-year sorted-year
-
- For each comment in each file, take the creation timestamp (.created_utc), convert it to a year, and write the comment to the same filename in with-year/ with the year prepended to the line:
-
- $ seq -f "askscience%02g" 0 89 | xargs -P 6 -I {} bash -c 'jq -r '\''(.created_utc | tonumber | gmtime | strftime("%Y")) + ":" + (. | tostring)'\'' < "{}" > "with-year/{}"'
- $ seq -f "askscience%04g" 9000 9563 | xargs -P 6 -I {} bash -c 'jq -r '\''(.created_utc | tonumber | gmtime | strftime("%Y")) + ":" + (. | tostring)'\'' < "{}" > "with-year/{}"'
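-
- The two invocations differ only in the suffix range; a sketch that handles both in one pass by globbing the chunk files instead (the glob skips askscience_comments because "_" is not a digit):

```shell
# Tag every comment with its UTC year, one output file per chunk.
# The glob matches askscience00..89 and askscience9000..9563 alike;
# the -f test skips the literal pattern if the glob matches nothing.
mkdir -p with-year
for f in askscience[0-9]*; do
  [ -f "$f" ] || continue
  printf '%s\n' "$f"
done | xargs -r -P 6 -I {} bash -c \
  'jq -r '\''(.created_utc | tonumber | gmtime | strftime("%Y")) + ":" + (. | tostring)'\'' < "{}" > "with-year/{}"'
```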
-
- Grep the files in with-year/ for each year and write the results to files named as each year in sorted-year/:
-
- $ for i in {2010..2024} ; do grep -hE "^${i}:" with-year/* > sorted-year/$i ; done
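-
- Each grep pass rereads every tagged file, so this is 15 passes over the data. A one-pass sketch, assuming awk can keep ~15 output files open at once (GNU awk and any modern awk can), that routes each line by the year before its first colon:

```shell
# One pass over all tagged files: the text before the first ":" is the year,
# and each matching line (year prefix included, as with grep) goes to its file.
mkdir -p with-year sorted-year
find with-year -type f -exec cat {} + |
  awk -F: '$1 ~ /^20[0-9][0-9]$/ { print > ("sorted-year/" $1) }'
```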
-
- Find the total number of comments for each year:
-
- $ wc -l sorted-year/*
-
- Find the number of comments in each year that contain an em dash:
-
- $ grep -cE "`printf '\xE2\x80\x94'`|—|—|—|\\\u2014|\\\2014|&mdash;|&#8212;|&#x2014;|—" sorted-year/*
-
- Find the total number of em dashes for each year:
-
- $ for year in {2010..2024} ; do echo -n "$year: " ; grep -oE "`printf '\xE2\x80\x94'`|—|—|—|\\\u2014|\\\2014|&mdash;|&#8212;|&#x2014;|—" sorted-year/$year | wc -l ; done
-
- The rest was done in Excel with (very) simple formulas, indicated in row 3 of the screenshot.
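-
- For reference, the final table can also be assembled without Excel. A sketch under the assumption that the spreadsheet formulas simply divide the em-dash comment counts by the per-year totals, using a reduced version of the pattern above (raw UTF-8 em dash plus the HTML entities):

```shell
# Assumed output columns: year, total comments, comments containing an
# em dash, and their ratio (the ratio is a guess at the Excel formulas).
dash=$(printf '\xE2\x80\x94')
for year in {2010..2024}; do
  f=sorted-year/$year
  [ -f "$f" ] || continue
  total=$(wc -l < "$f")
  # grep -c exits 1 when the count is 0, hence the "|| true"
  with=$(grep -cE "$dash|&mdash;|&#8212;|&#x2014;" "$f" || true)
  awk -v y="$year" -v t="$total" -v w="$with" \
      'BEGIN { printf "%s\t%d\t%d\t%.4f\n", y, t, w, (t ? w/t : 0) }'
done
```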