A comparison of authors' writing using text analysis

Students write a report comparing the writing of two authors using word unigram and bigram analysis and submit a Jupyter Notebook showing the code used to extract the data used in the analysis.


Students pick two authors who each have at least three works available in plain text. They should be familiar enough with the authors to form a hypothesis about the similarities and differences in the two authors' language. The assignment asks students to extract unigrams (single words) and bigrams (adjacent word pairs) from the authors' works and, by ranking frequency ratios, find which of those each author uses more often than the other. For example, the ratio for "therefore" is the number of times the first author uses it divided by the number of times the second author uses it. A ratio greater than 1.0 means the first author uses it more, a ratio around 1.0 means both authors use it about equally, and a ratio less than 1.0 means the second author uses it more. In a report, students analyze the unigram and bigram data to evaluate their hypothesis about the two authors' use of language.
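The extraction and ranking described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code students were given: the function names (`tokenize`, `ngram_counts`, `ranked_ratios`), the simple regular-expression tokenizer, the `min_count` parameter, and the two short sample texts are all hypothetical. Because the ratio divides one raw count by another, the sketch keeps only n-grams that both authors use, which avoids dividing by zero.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase the text and treat runs of letters and
    # apostrophes as words. Sentence boundaries are ignored, so a bigram
    # may span the end of one sentence and the start of the next.
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens, n):
    # Count every run of n adjacent tokens (n=1: unigrams, n=2: bigrams).
    # Keys are tuples, e.g. ("therefore",) or ("we", "go").
    return Counter(zip(*(tokens[i:] for i in range(n))))


def ranked_ratios(counts_a, counts_b, min_count=1):
    # Raw count for author A divided by raw count for author B, restricted
    # to n-grams both authors use at least min_count times. Keep min_count
    # at 1 or higher so the divisor is never zero.
    shared = [g for g in counts_a
              if counts_a[g] >= min_count and counts_b[g] >= min_count]
    ratios = {g: counts_a[g] / counts_b[g] for g in shared}
    # Highest ratios first: the n-grams most indicative of author A.
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)


# Made-up texts standing in for each author's pooled works.
author_a = "Therefore we go. Therefore we stay. We go home."
author_b = "We go home, therefore we rest. We go on."

tokens_a = tokenize(author_a)
tokens_b = tokenize(author_b)

unigram_ranking = ranked_ratios(ngram_counts(tokens_a, 1),
                                ngram_counts(tokens_b, 1))
bigram_ranking = ranked_ratios(ngram_counts(tokens_a, 2),
                               ngram_counts(tokens_b, 2))

for ngram, ratio in unigram_ranking:
    print(" ".join(ngram), round(ratio, 2))
```

Reading the ranking from the bottom up gives the n-grams most indicative of the second author. When the two authors' collections differ greatly in length, dividing each count by that author's total token count before taking the ratio may give a fairer comparison; that normalization is a possible refinement rather than part of the assignment as described.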

Learning outcomes
By the end of the assignment, students should be able to:

  1. Demonstrate extracting and ranking word unigrams and bigrams indicative of one author over another.
  2. Demonstrate critical inquiry of the extracted data through hypothesis formulation and assessment.
This assignment assumes that students know the writings of two authors, whose work is available in plain text format, well enough to form a hypothesis about how similar or different the authors' language will be. They also need Python programming skills (or similar) sufficient to extract unigram and bigram frequency ratios from plain text versions of the authors' works. The assignment assumes familiarity with Jupyter Notebooks, although a source file in any programming language could be substituted.


CSC440 Data Mining & Visualization, Spring 2019

python (4)


This is the first run of the assignment. The course covered text analysis at the very end of the semester, and the lessons overlapped with the time frame given for the assignment. The purpose of the course is to teach methods for extracting and visualizing data from datasets. As such, we did not cover how to form hypotheses or integrate analyses into writing. No models (good or poor) were provided to students.

Outcome summary

Overall, the assignment went as expected. Students generally did a good job forming hypotheses and analyzing them. Many also included a discussion of limitations, although that wasn't explicitly required. The presentation of data was weaker: most students relied on screenshots of output from the Python code rather than reformatting the results into clean tables directly in their word processor. The next time I use this assignment, I would have students formulate and submit their hypotheses a week before the project and give them models of good and poor hypotheses and analyses. I would also make the directions more explicit, including guidelines on presentation, and require a section on limitations.


