Having downloaded only
"subreddits24/askscience_comments.zst"
from
"Subreddit comments/submissions 2005-06 to 2024-12" on Academic Torrents
( https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4/tech&filelist=1 ),
extracted it using 7-Zip, and split it using GNU `split`
(`split -d -l 10000 askscience_comments askscience`),
proceed as follows (this is a bash shell on Linux or WSL):
There are now files "askscience{00..89}" and "askscience{9000..9563}" (GNU split exhausts its two-digit numeric suffixes at 89 and then automatically widens them, starting the longer series at 9000).
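The suffix jump from 89 to 9000 can be reproduced on a small scale; a minimal sketch (demo file names are made up, and the widening behavior assumes GNU coreutils split, not BSD split):

```shell
# Sketch: demonstrate GNU split's numeric suffix widening, which explains
# the 00-89 / 9000-9563 file names above.
cd "$(mktemp -d)"
seq 1 95 > demo
split -d -l 1 demo part      # 95 one-line pieces
ls part* | head -n 2         # first pieces: part00 part01
ls part* | tail -n 2         # widened suffixes: part9003 part9004
```

With `-d`, split stops the two-digit series at 89 and reserves suffixes beginning with 9 for the next, longer series.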
In the directory where the "askscience" files live,
$ mkdir with-year sorted-year
Get the creation timestamp (.created_utc) as a year from each comment in each file, and write the comments to the same filenames in with-year/ with the year prepended to each line:
$ seq -f "askscience%02g" 0 89 | xargs -P 6 -I {} bash -c 'jq -r '\''(.created_utc | tonumber | gmtime | strftime("%Y")) + ":" + (. | tostring)'\'' < "{}" >> "with-year/{}"'
$ seq -f "askscience%04g" 9000 9563 | xargs -P 6 -I {} bash -c 'jq -r '\''(.created_utc | tonumber | gmtime | strftime("%Y")) + ":" + (. | tostring)'\'' < "{}" >> "with-year/{}"'
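To preview what the jq filter emits, here is a one-comment sample (the JSON fields are invented; real dump lines carry many more fields):

```shell
# Hypothetical minimal comment; the filter prepends "YYYY:" to the raw JSON.
echo '{"created_utc":1589000000,"body":"example"}' \
  | jq -r '(.created_utc | tonumber | gmtime | strftime("%Y")) + ":" + (. | tostring)'
# → 2020:{"created_utc":1589000000,"body":"example"}
```

`tonumber` makes the filter tolerant of dumps where created_utc is stored as a string.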
Grep the files in with-year/ for each year and write the results to files named after each year in sorted-year/:
$ for i in {2010..2024} ; do grep -hE "^${i}:" with-year/* >> sorted-year/$i ; done
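The same split can also be done in a single pass with awk rather than one grep per year; this is an alternative sketch, not the method used above (the demo input is made up):

```shell
# One-pass alternative: route each line to sorted-year/<year> by its prefix.
cd "$(mktemp -d)"
mkdir with-year sorted-year
printf '2019:{"a":1}\n2020:{"b":2}\n2020:{"c":3}\n' > with-year/demo
awk -F: '$1 ~ /^20(1[0-9]|2[0-4])$/ { print > ("sorted-year/" $1) }' with-year/*
wc -l sorted-year/*
```

Like the grep version, this keeps the `YYYY:` prefix on each output line.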
Find the total number of comments for each year:
$ wc -l sorted-year/*
Find the number of comments in each year that contain an em dash (each line is one comment, so `grep -c` counts comments with at least one match):
$ grep -cE "`printf '\xE2\x80\x94'`|&#8212;|&#x2014;|&mdash;|\\\u2014|\\\2014|&amp;mdash;|&amp;#8212;|&amp;#x2014;|—" sorted-year/*
Find the total number of em dashes for each year:
$ for year in {2010..2024} ; do grep -oE "`printf '\xE2\x80\x94'`|&#8212;|&#x2014;|&mdash;|\\\u2014|\\\2014|&amp;mdash;|&amp;#8212;|&amp;#x2014;|—" sorted-year/$year | wc -l ; done
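A quick sanity check that the alternation catches both raw and HTML-escaped em dashes; the sample lines are invented and the pattern is abridged to a few of the alternatives above for readability:

```shell
# Sketch: count lines (i.e. comments) containing any em-dash form.
pat='—|&#8212;|&#x2014;|&mdash;|&amp;mdash;'
printf '2020:{"body":"a—b"}\n2020:{"body":"a&mdash;b"}\n2020:{"body":"plain"}\n' \
  | grep -cE "$pat"
# → 2
```

Swapping `-cE` for `-oE ... | wc -l` counts every occurrence instead of every matching line, which is the difference between the two commands above.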
The rest was done in Excel with (very) simple formulas, indicated in row 3 of the screenshot.