EA Forum: Data analysis and deep learning

post by anishazaveri · 2020-05-12T17:39:41.884Z · score: 67 (40 votes) · EA · GW · 3 comments

Contents

  1. Scraping
  2. Exploratory Data Analysis
    2.1 Number of yearly posts
    2.2 Posts with the most comments
    2.3 Posts with the most karma
    2.4 Authors with the most posts
    2.5 Authors with the highest mean karma
  3. Word Clouds
  4. GPT2

(Cross-posted from my blog)

Here's a fun project I undertook this month:

  1. Scrape all posts from the Effective Altruism (EA) Forum
  2. Explore overall trends in the data, e.g. posts with the most comments, authors with the most posts, etc.
  3. Build a wordcloud to visualize the most used words
  4. Fine-tune GPT2 on the EA Forum text corpus and generate text. Here's a preview of the text GPT2 produced:

GITC's Vaccination Prevention Research Project This is the first post of a three part series on the development of effective vaccines. This series will start with a list of possible vaccines that can be developed by the GPI team

Code and data for this project are available at this GitHub repo.

1. Scraping

The robots.txt file of the EA Forum disallows crawling/scraping data from forum.effectivealtruism.org/allPosts. To get around this, I did the following:

I cleaned the data and restricted subsequent analyses to posts published between 2013-01-01 and 2020-04-15, since more recent posts were unlikely to have accumulated comments yet.
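The original list of workaround steps isn't reproduced above. As a minimal sketch of one plausible route (not necessarily the author's), the forum exposes a public GraphQL endpoint; the query shape, field names, and pagination below are assumptions:

```python
import pandas as pd
import requests

# Hypothetical sketch: paginate through posts via the EA Forum's public
# GraphQL endpoint. Field names are assumed from the forum's schema and
# may differ from what the author actually used.
GRAPHQL_URL = "https://forum.effectivealtruism.org/graphql"
QUERY = """
query($offset: Int) {
  posts(input: {terms: {limit: 100, offset: $offset}}) {
    results { title postedAt baseScore commentCount user { username } }
  }
}
"""

rows, offset = [], 0
while True:
    resp = requests.post(GRAPHQL_URL,
                         json={"query": QUERY, "variables": {"offset": offset}})
    batch = resp.json()["data"]["posts"]["results"]
    if not batch:
        break
    rows.extend(batch)
    offset += 100

df = pd.DataFrame(rows)
df["author"] = df["user"].map(lambda u: u["username"] if u else None)
df["postedAt"] = pd.to_datetime(df["postedAt"], utc=True)

# Restrict to the analysis window used in the post (2013-01-01 to 2020-04-15).
start = pd.Timestamp("2013-01-01", tz="UTC")
end = pd.Timestamp("2020-04-15", tz="UTC")
df = df[(df["postedAt"] >= start) & (df["postedAt"] <= end)]
```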

2. Exploratory Data Analysis

2.1 Number of yearly posts
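As a sketch of how the yearly counts could be computed and plotted (reusing the `df` from the scraping example above, so the column name is an assumption):

```python
import matplotlib.pyplot as plt

# Posts per calendar year (reuses the `df` from the scraping sketch).
yearly = df["postedAt"].dt.year.value_counts().sort_index()
yearly.plot(kind="bar")
plt.xlabel("Year")
plt.ylabel("Number of posts")
plt.show()
```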

2.2 Posts with the most comments

date | title | author | num_comments
4/23/2019 | Long-Term Future Fund: April 2019 grant recommendations | Habryka | 240
10/26/2017 | Why & How to Make Progress on Diversity & Inclusion in EA | Kelly_Witwicki | 235
11/15/2019 | I'm Buck Shlegeris, I do research and outreach at MIRI, AMA | Buck | 231
10/24/2016 | Concerns with Intentional Insights | Jeff_Kaufman | 186
2/26/2019 | After one year of applying for EA jobs: It is really, really hard to get hired by an EA organisation | EA applicant | 182
1/16/2020 | Growth and the case against randomista development | HaukeHillebrandt | 168
9/15/2014 | Open Thread | RyanCarey | 163
11/11/2017 | An Exploration of Sexual Violence Reduction for Effective Altruism Potential | Kathy_Forth | 156
10/22/2014 | Should Giving What We Can change its Pledge? | Michelle_Hutchinson | 144
9/3/2019 | Are we living at the most influential time in history? | William_MacAskill | 140

2.3 Posts with the most karma

date | title | author | num_karma
2/26/2019 | After one year of applying for EA jobs: It is really, really hard to get hired by an EA organisation | EA applicant | 285
1/16/2020 | Growth and the case against randomista development | HaukeHillebrandt | 269
1/13/2020 | EAF's ballot initiative doubled Zurich's development aid | Jonas Vollmer | 254
9/26/2019 | Some personal thoughts on EA and systemic change | Carl_Shulman | 183
9/3/2019 | Are we living at the most influential time in history? | William_MacAskill | 174
6/2/2019 | Is EA Growing? EA Growth Metrics for 2018 | Peter_Hurford | 168
3/7/2019 | SHIC Will Suspend Outreach Operations | cafelow | 165
8/20/2019 | List of ways in which cost-effectiveness estimates can be misleading | saulius | 155
6/20/2019 | Information security careers for GCR reduction | ClaireZabel | 153
8/14/2019 | Ask Me Anything! | William_MacAskill | 150

2.4 Authors with the most posts

author | num_posts
Aaron Gertler | 87
Milan_Griffes | 83
Peter_Hurford | 74
RyanCarey | 66
Tom_Ash | 58

2.5 Authors with the highest mean karma

Authors with fewer than two posts were excluded.

author | mean_post_karma
Buck | 92.2
Jonas Vollmer | 77.0
Luisa_Rodriguez | 74.7
saulius | 73.5
sbehmer | 73.0
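Both author tables (2.4 and 2.5) reduce to simple pandas aggregations. A sketch, assuming `author` and `num_karma` columns as named in the tables above:

```python
# Authors with the most posts (section 2.4).
post_counts = df["author"].value_counts()
print(post_counts.head())

# Mean karma per author, excluding authors with fewer than two posts
# (section 2.5). `num_karma` is assumed to hold each post's karma.
by_author = df.groupby("author")["num_karma"].agg(["mean", "count"])
top_mean = (by_author[by_author["count"] >= 2]
            .sort_values("mean", ascending=False))
print(top_mean.head())
```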

3. Word Clouds

My next goal was to make a word cloud representing the most commonly used words in the EA Forum. I preprocessed the post content as follows:

The resulting word cloud was built using the Python word_cloud package on ~2.6 million tokens.
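The post's preprocessing list isn't reproduced above; a sketch of one typical pipeline (the cleaning steps and the `body` column are assumptions, not the author's confirmed choices):

```python
import re

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Assumed preprocessing: lowercase, strip non-letters, drop stopwords.
def preprocess(text):
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(t for t in text.split() if t not in STOPWORDS)

corpus = preprocess(" ".join(df["body"].dropna()))  # `body` column assumed

cloud = WordCloud(width=1200, height=600,
                  background_color="white").generate(corpus)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```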

The most common words appeared to be 'one' and 'work'. I thought it would be instructive to see whether these were over-represented in the EA Forum specifically or are generally over-represented in blogs/forums. As a control, I scraped all posts from Slate Star Codex (SSC) and performed identical text preprocessing to generate ~1.4 million tokens.

Using R's wordcloud package, I built a "comparative" word cloud showing words over-represented in the EA Forum versus SSC and vice versa.
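The comparison cloud itself was built in R. As a rough Python analogue of the underlying idea, one could rank words by smoothed frequency ratios (R's comparison.cloud instead uses each word's deviation from its mean frequency across corpora); `ea_tokens` and `ssc_tokens` are assumed token lists from the preprocessing step:

```python
from collections import Counter

ea_counts, ssc_counts = Counter(ea_tokens), Counter(ssc_tokens)
ea_total, ssc_total = sum(ea_counts.values()), sum(ssc_counts.values())

def ratio(word):
    # Add-one smoothing so words absent from one corpus don't divide by zero.
    ea_freq = (ea_counts[word] + 1) / ea_total
    ssc_freq = (ssc_counts[word] + 1) / ssc_total
    return ea_freq / ssc_freq

vocab = set(ea_counts) | set(ssc_counts)
overrep_in_ea = sorted(vocab, key=ratio, reverse=True)[:25]
overrep_in_ssc = sorted(vocab, key=ratio)[:25]
```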

What about words that were common between the EA Forum and SSC?

4. GPT2

Finally, I used the text corpus from the EA Forum to fine-tune GPT2. GPT2 is a text-generating language model trained on 8 million web pages. You can play around with it on Talk to Transformer, where the model completes a prompt you give it.

For fine-tuning I used this very convenient template Colab notebook made with gpt2-simple. I fine-tuned both the "small" (124M parameters) and "medium" (355M parameters) GPT2 models and tried a few temperature settings for text generation.
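A sketch of what that fine-tuning looks like with gpt-2-simple (the corpus filename and step count below are illustrative, not taken from the notebook):

```python
import gpt_2_simple as gpt2

model_name = "124M"  # the "small" model; use "355M" for "medium"
gpt2.download_gpt2(model_name=model_name)

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="ea_forum_corpus.txt",  # hypothetical corpus file
              model_name=model_name,
              steps=1000)                     # illustrative step count

# Lower temperatures give safer, more repetitive text; higher ones give
# more surprising (and often less coherent) text.
gpt2.generate(sess, prefix="Effective altruism",
              temperature=0.7, nsamples=3)
```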

GPT2 can work with or without a prompt, and typically produced "good" text in roughly 1 of every 25 tries. I've cherry-picked some of the outputs I obtained from the fine-tuned models.

3 comments

Comments sorted by top scores.

comment by matthewp · 2020-05-12T18:26:10.400Z · score: 28 (19 votes) · EA(p) · GW(p)
TL;DR's for the EA Forum/Welcome: "Effective altruists are trying to figure out how to build a more effective AI, using paperclips, but we're not really sure how it's possible to do so."

Ouch.

comment by Ben Pace · 2020-05-13T05:45:10.618Z · score: 3 (4 votes) · EA(p) · GW(p)

This is awesome.

comment by agent18 · 2020-05-18T18:27:02.628Z · score: 1 (1 votes) · EA(p) · GW(p)

Would it be possible to make 2-grams and 3-grams as well? Maybe that would provide more insight.

P.S. It takes more time and more RAM to generate them; it does not scale linearly with the n-gram size.