Fitting probability models for author detection

Using a custom Python script and Google Colab, students extract word frequencies from raw text files from Project Gutenberg and analyze them in an attempt to identify a mystery author.

Overview

In their 1963 paper "Inference in an Authorship Problem," Frederick Mosteller and David L. Wallace use the frequencies of common words (like "and" or "it") to detect authorship. Specifically, they use common probability models (the Poisson and binomial random variables) to model word frequencies in texts, and then compare the results for an uncredited text to those for texts with known authors.
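To make the modeling idea concrete, here is a minimal sketch of fitting a Poisson model to word counts and using it to score an uncredited text. It is an illustration only, not the method from the paper or from the course notebook: the function names, the block length, and all of the counts are hypothetical.

```python
import math
from collections import Counter


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def fit_poisson(counts):
    """Estimate the Poisson rate as the sample mean of the counts."""
    return sum(counts) / len(counts)


def expected_vs_observed(counts):
    """Map each count value k to (observed blocks, Poisson-expected blocks)."""
    lam = fit_poisson(counts)
    observed = Counter(counts)
    n = len(counts)
    return {
        k: (observed.get(k, 0), n * poisson_pmf(k, lam))
        for k in range(max(counts) + 1)
    }


def log_likelihood(counts, lam):
    """Log-likelihood of the counts under a Poisson model with rate lam (lam > 0)."""
    return sum(math.log(poisson_pmf(k, lam)) for k in counts)


# Hypothetical data: occurrences of "and" in consecutive 200-word blocks.
author_a_counts = [5, 7, 6, 4, 8, 6, 5, 7]
author_b_counts = [2, 3, 1, 2, 4, 3, 2, 3]
mystery_counts = [6, 5, 7, 6]

lam_a = fit_poisson(author_a_counts)
lam_b = fit_poisson(author_b_counts)

# The author whose fitted rate gives the mystery text the higher
# log-likelihood is the better match for this one word.
score_a = log_likelihood(mystery_counts, lam_a)
score_b = log_likelihood(mystery_counts, lam_b)
print("Author A" if score_a > score_b else "Author B")
```

A single word is weak evidence on its own; a fuller analysis would repeat the comparison over several common words and check with `expected_vs_observed` whether the Poisson model fits each word reasonably well.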

This approach to text analysis offers students and researchers a method of exploring an author's "writing style" as it is determined by the frequency with which the author uses certain words. The same word-frequency data could also support other projects.



MTH225 Probability, Fall 2018


Notes

In this assignment, I taught students to use a custom Python script in Google Colab to collect word frequency data from raw text files. Each student was assigned an author and emailed text files for books written by that author. In addition, each student received a text from a "mystery author." To complete the assignment, students needed to (1) use the Python script and Google Colab to collect frequency data for their assigned author as well as the mystery author; (2) generate probability models and visuals using Google Sheets; and (3) write a paper arguing whether their assigned author is the same as the mystery author, using evidence collected from their analyses.
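The data-collection step can be sketched roughly as follows. This is not the script distributed in pythonNotebook.zip; it is a hypothetical stand-in that assumes the text is split into fixed-size blocks of words and that the target words are counted in each block. The names `tokenize` and `block_counts` and the block size are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its words, keeping inner apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def block_counts(text, targets, block_size=200):
    """Count each target word in consecutive blocks of block_size words.

    A trailing partial block is dropped so that every row covers the
    same number of words.
    """
    words = tokenize(text)
    rows = []
    for start in range(0, len(words) - block_size + 1, block_size):
        block = Counter(words[start:start + block_size])
        rows.append({word: block[word] for word in targets})
    return rows


# Small inline sample so the sketch runs without any files. For a real
# text, read it first, e.g. open("JA_all.txt", encoding="utf-8").read().
sample = (
    "It is a truth universally acknowledged, that a single man in "
    "possession of a good fortune, must be in want of a wife."
)
rows = block_counts(sample, targets=["a", "in", "of"], block_size=10)
for row in rows:
    print(row)
```

Each row is one block's counts, so the rows can be pasted into a spreadsheet as one line per block and one column per word.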

Outcome summary

Positive outcomes: Most students adapted quickly to the technology and had little trouble using it. The assignment provided many opportunities for prompted critical thinking and evidence-based argument. Lastly, students saw an interesting application of probability theory to the humanities.

Practical problems: Google Colab had a few small issues when students used Safari, so I recommend using Chrome as the browser. The project applied probability models that we had covered during the semester, separately from the project; without this background, the project would take more class time to implement. However, the analysis can easily be modified to avoid discussion of probability models and look only at differences in word frequencies between authors. Some students are put off by the "coding" component, so extra time is required to make it seem approachable.

Materials

Handouts

  • ExampleGuide.pdf (708 KB): A guide for uploading an .ipynb file to Google Colab
  • JA_all.txt (1.3 MB): Three Jane Austen books in one text file
  • pythonNotebook.zip (16.6 KB): Zipped folder containing the Python notebook to upload to Colab