- Having downloaded only
- "subreddits24/askscience_comments.zst"
- from
- "Subreddit comments/submissions 2005-06 to 2024-12" on Academic Torrents
- ( https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4/tech&filelist=1 ),
- extracted it using 7-Zip, and split it into 10,000-line chunks using GNU `split`
- (`split -d -l 10000 askscience_comments askscience`),
- proceed as follows (this is a bash shell on Linux or WSL):
-
- There are now files "askscience{00..89}" and "askscience{9000..9563}": GNU split reserves numeric suffixes beginning with 9 for widening the suffix, so when the two-digit range runs out at 89 it jumps straight to the four-digit 9000.
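-
- If the mixed-width names are a nuisance, a fixed suffix width avoids them entirely. A hypothetical variant using GNU split's `-a`, under which the 90 + 564 = 654 chunks would come out as askscience0000 through askscience0653:

```shell
# Hypothetical alternative to the split above: force a 4-digit suffix
# so every chunk gets a uniform name (askscience0000, askscience0001, ...).
if [ -f askscience_comments ]; then
  split -d -a 4 -l 10000 askscience_comments askscience
fi
```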
-
- In the directory where the "askscience" files live,
-
- $ mkdir with-year sorted-year
-
- For each comment in each file, take the creation timestamp (.created_utc), convert it to a year, and write the comment to the same filename in with-year/ with the year prepended to the line:
-
- $ seq -f "askscience%02g" 0 89 | xargs -P 6 -I {} bash -c 'jq -r '\''(.created_utc | tonumber | gmtime | strftime("%Y")) + ":" + (. | tostring)'\'' < "{}" > "with-year/{}"'
- $ seq -f "askscience%04g" 9000 9563 | xargs -P 6 -I {} bash -c 'jq -r '\''(.created_utc | tonumber | gmtime | strftime("%Y")) + ":" + (. | tostring)'\'' < "{}" > "with-year/{}"'
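-
- The two invocations differ only in the suffix range; a sketch that handles both in one pass by globbing the chunk files instead (the glob skips askscience_comments because "_" is not a digit):

```shell
# Tag every comment with its UTC year, one output file per chunk.
# The glob matches askscience00..89 and askscience9000..9563 alike;
# the -f test skips the literal pattern if the glob matches nothing.
mkdir -p with-year
for f in askscience[0-9]*; do
  [ -f "$f" ] || continue
  printf '%s\n' "$f"
done | xargs -r -P 6 -I {} bash -c \
  'jq -r '\''(.created_utc | tonumber | gmtime | strftime("%Y")) + ":" + (. | tostring)'\'' < "{}" > "with-year/{}"'
```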
-
- Grep the files in with-year/ for each year and write the results to files named as each year in sorted-year/:
-
- $ for i in {2010..2024} ; do grep -hE "^${i}:" with-year/* > sorted-year/$i ; done
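-
- Each grep pass rereads every tagged file, so this is 15 passes over the data. A one-pass sketch, assuming awk can keep ~15 output files open at once (GNU awk and any modern awk can), that routes each line by the year before its first colon:

```shell
# One pass over all tagged files: the text before the first ":" is the year,
# and each matching line (year prefix included, as with grep) goes to its file.
mkdir -p with-year sorted-year
find with-year -type f -exec cat {} + |
  awk -F: '$1 ~ /^20[0-9][0-9]$/ { print > ("sorted-year/" $1) }'
```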
-
- Find the total number of comments for each year:
-
- $ wc -l sorted-year/*
-
- Find the number of comments in each year that contain an em dash:
-
- $ grep -cE "`printf '\xE2\x80\x94'`|—|—|—|\\\u2014|\\\2014|&mdash;|&#8212;|&#x2014;|—" sorted-year/*
-
- Find the total number of em dashes for each year:
-
- $ for year in {2010..2024} ; do echo -n "$year: " ; grep -oE "`printf '\xE2\x80\x94'`|—|—|—|\\\u2014|\\\2014|&mdash;|&#8212;|&#x2014;|—" sorted-year/$year | wc -l ; done
-
- The rest was done in Excel with (very) simple formulas, indicated in row 3 of the screenshot.
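-
- For reference, the final table can also be assembled without Excel. A sketch under the assumption that the spreadsheet formulas simply divide the em-dash comment counts by the per-year totals, using a reduced version of the pattern above (raw UTF-8 em dash plus the HTML entities):

```shell
# Assumed output columns: year, total comments, comments containing an
# em dash, and their ratio (the ratio is a guess at the Excel formulas).
dash=$(printf '\xE2\x80\x94')
for year in {2010..2024}; do
  f=sorted-year/$year
  [ -f "$f" ] || continue
  total=$(wc -l < "$f")
  # grep -c exits 1 when the count is 0, hence the "|| true"
  with=$(grep -cE "$dash|&mdash;|&#8212;|&#x2014;" "$f" || true)
  awk -v y="$year" -v t="$total" -v w="$with" \
      'BEGIN { printf "%s\t%d\t%d\t%.4f\n", y, t, w, (t ? w/t : 0) }'
done
```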