Having downloaded only "subreddits24/askscience_comments.zst" from "Subreddit comments/submissions 2005-06 to 2024-12" on Academic Torrents ( https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4/tech&filelist=1 ), extracted it using 7zip, and split it using GNU `split`:

$ split -d -l 10000 askscience_comments askscience

proceed as follows (in a bash shell on Linux or WSL). There are now files "askscience{00..89}" and "askscience{9000..9563}" (GNU `split` skips two-digit suffixes beginning with 9 so it can switch to longer suffixes without breaking sort order).

In the directory where the "askscience" files live:

$ mkdir with-year sorted-year

Get the creation timestamp (.created_utc) of each comment in each file as a year, and write the comments to the same filenames in with-year/ with the year prepended to each line:

$ seq -f "askscience%02g" 0 89 | xargs -P 6 -I {} bash -c 'jq -r '\''(.created_utc | tonumber | gmtime | strftime("%Y")) + ":" + (. | tostring)'\'' < "{}" >> "with-year/{}"'

$ seq -f "askscience%04g" 9000 9563 | xargs -P 6 -I {} bash -c 'jq -r '\''(.created_utc | tonumber | gmtime | strftime("%Y")) + ":" + (. | tostring)'\'' < "{}" >> "with-year/{}"'

Grep the files in with-year/ for each year and write the results to files named after each year in sorted-year/:

$ for i in {2010..2024} ; do grep -hE "^${i}:" with-year/* >> sorted-year/$i ; done

Find the total number of comments for each year:

$ wc -l sorted-year/*

Find the number of comments in each year that contain an em dash (matching the raw UTF-8 character plus common mojibake, escaped, and HTML-entity forms):

$ grep -cE "`printf '\xE2\x80\x94'`|—|—|—|\\\u2014|\\\2014|&mdash;|&#8212;|&#x2014;|—" sorted-year/*

Find the total number of em dashes for each year:

$ for year in {2010..2024} ; do grep -oE "`printf '\xE2\x80\x94'`|—|—|—|\\\u2014|\\\2014|&mdash;|&#8212;|&#x2014;|—" sorted-year/$year | wc -l ; done

The rest was done in Excel as (very) simple formulas, indicated in row 3 of the screenshot.
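The jq filter maps each comment's Unix timestamp (.created_utc) to a UTC year via gmtime/strftime. As a quick sanity check, GNU `date` performs the same conversion; the timestamp here is a made-up example, not taken from the dataset:

```shell
# 1609459200 is 2021-01-01 00:00:00 UTC, so jq's
# (tonumber | gmtime | strftime("%Y")) and GNU date -u agree on the year.
date -u -d @1609459200 +%Y   # prints 2021
```

(`date -d @N` is a GNU extension; it works on Linux and WSL as assumed here, but not on BSD/macOS `date`.)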
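The last two commands count different things: `grep -c` counts matching lines (comments containing at least one em dash), while `grep -o ... | wc -l` counts every individual match. A minimal sketch with made-up sample data (sample.txt is hypothetical, not part of the dataset):

```shell
# Three sample lines: two contain em dashes, and one of those contains two.
printf 'one \xE2\x80\x94 dash\ntwo \xE2\x80\x94 and \xE2\x80\x94 dashes\nno dash here\n' > sample.txt

# -c counts matching LINES: prints 2
grep -c "$(printf '\xE2\x80\x94')" sample.txt

# -o emits each match on its own line, so wc -l counts OCCURRENCES: prints 3
grep -o "$(printf '\xE2\x80\x94')" sample.txt | wc -l
```

This is why the per-year "comments containing an em dash" figures can be smaller than the "total em dashes" figures.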