I have created a visualization of word frequencies using the "big data" that is the corpus of Tolstoy’s novel War and Peace. My goal was to determine whether analyzing the frequency of words in the novel can convey its meaning. Please interact with my visualization below and see what you think the answer is!
This was a learning process for me! I chose War and Peace because it is my favorite classic work and has a large, publicly accessible corpus available for analysis. First, I uploaded the digital corpus (copied from Project Gutenberg) into Voyant Tools for textual analysis. Voyant Tools determined the frequencies of the 300 most used words; see the Cirrus visualization of those 300 words below. Since I operate more comfortably in Excel, I exported this data into a spreadsheet for cleaning and further analysis.
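For readers who prefer to stay in code, the Voyant step could be approximated in R roughly as follows. This is a sketch rather than what I actually did: the Gutenberg URL, the very naive tokenizer, and the output file name are illustrative, and Voyant's own tokenization and stopword handling will differ.

```r
# Rough R approximation of the Voyant Tools step: count word frequencies
# in the Project Gutenberg plain-text edition and keep the 300 most frequent.
# (Illustrative only; Voyant tokenizes and filters differently.)
text <- readLines("https://www.gutenberg.org/files/2600/2600-0.txt", encoding = "UTF-8")
text <- tolower(paste(text, collapse = " "))

tokens <- unlist(strsplit(text, "[^a-z']+"))   # crude tokenizer: keep letters and apostrophes only
tokens <- tokens[tokens != ""]

top300 <- head(sort(table(tokens), decreasing = TRUE), 300)
write.csv(data.frame(word = names(top300), freq = as.integer(top300)),
          "voyant_top300.csv", row.names = FALSE)   # hand off to a spreadsheet for cleaning
```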
Within Excel, I first divided the most commonly used words into word-type categories: noun, verb, adjective, etc. Then I categorized the words by sentiment or topic (feeling words, military words, people, etc.). This was sometimes difficult to determine: for example, the word “state,” used in the text 139 times, could refer either to a place (e.g., the state of Russia or France) or to a state of being. The word “just” also gave me trouble: it would be difficult to perform sentiment analysis on this word without having read the text, and I considered eliminating it from my analysis altogether since it might carry different connotations throughout the text. Initially, based on the content and tone of the novel, I assumed this word belonged in the “adjective” category, connoting a sense of justice. However, I went through the text to make sure: the first twenty uses of the word were as an adverb, meaning “exactly” or “in the immediate past” (as in “just now”). This illustrates one of the key complications of sentiment analysis. If the meaning of a word was not explicit, I did not categorize it.

I also consolidated data that used the same origin word (e.g., woman/women, Russian/Russians). This consolidation produced a different frequency ranking than the one I started with: the word “man,” for example, moved into the #2 most commonly used spot once its frequency was combined with “men.” When I finished my cleaning, including consolidation and removal of stopwords, the original list of 300 words had been pared down to 267.
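I did this cleaning by hand in Excel, but for completeness, here is roughly how the same consolidation and stopword removal could look in R. The variant-to-origin-word mapping and the stopword list below are small illustrative samples, not my full spreadsheet, and the column names are assumed.

```r
library(dplyr)

# Assumes the exported frequency table has columns word and freq (illustrative names).
freqs <- read.csv("voyant_top300.csv", stringsAsFactors = FALSE)

cleaned <- freqs %>%
  # Fold variant forms into a single origin word (sample mappings only).
  mutate(word = recode(word, "men" = "man", "women" = "woman", "russians" = "russian")) %>%
  # Drop stopwords (sample list only).
  filter(!word %in% c("said", "one", "now", "will")) %>%
  # Re-sum frequencies so consolidated words get a combined count.
  group_by(word) %>%
  summarise(freq = sum(freq), .groups = "drop") %>%
  arrange(desc(freq))
```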
I wanted to visualize this data as a circular treemap so that the viewer could understand the frequency of words within a particular topic or sentiment. I researched how to do this as a form of “circular packing,” a visualization that can be generated with R code. Full disclosure: I am not an expert (or even proficient) in R, so I collaborated with a data analyst on this phase. Together, we copied a basic R script that generates circular treemaps and inserted the data from Excel into the code through RStudio. Within RStudio, we had to format the data into groups and subgroups so that the treemap would be interactive and allow us to visualize sentiment subgroups and the frequencies of words within each group type. We encountered some issues as we went along. I wanted to have subgroups for some categories, with some words fitting into the main group and others belonging to a subgroup, but we found that the program only allowed subgroups if every word in the group belonged to one. Therefore, I had to refine my sentiment categories further so that each word fit into a subgroup. Some of the subgroups we submitted were not accepted and had to be omitted entirely. One problem that can still be seen in the final product involves the words “position” and “count,” which the program did not accept as written; they are still there, but with a “2” added after the word.
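For anyone curious, below is a minimal sketch of the kind of script we adapted. It is not our exact code: the file name, the column names (group, subgroup, word, freq), and the choice of the circlepackeR and data.tree packages are assumptions based on the publicly available circular-packing examples we started from.

```r
# Minimal sketch of an interactive circular-packing (circular treemap) chart in R.
# Assumes a cleaned spreadsheet exported as CSV with columns:
# group, subgroup, word, freq (these names are illustrative).
library(data.tree)     # builds the group/subgroup/word hierarchy
library(circlepackeR)  # install with: devtools::install_github("jeromefroe/circlepackeR")

words <- read.csv("war_and_peace_frequencies.csv", stringsAsFactors = FALSE)

# Every word must sit at the end of a full group/subgroup path --
# this mirrors the constraint that every word in a group had to belong to a subgroup.
words$pathString <- paste("corpus", words$group, words$subgroup, words$word, sep = "/")

# Convert the flat table into a tree of nested circles.
tree <- as.Node(words)

# Draw the zoomable circle-packing diagram, sized by word frequency.
circlepackeR(tree, size = "freq")
```

If the hierarchy is indeed built with data.tree, the issue with “position” and “count” has a plausible explanation: both are reserved field names on data.tree nodes, so words spelled that way cannot be used as node names as-is and may be renamed, which would account for the “2” suffix in the final visualization.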
There was a lot of trial and error with this approach because it relies on code that someone else wrote and made public, rather than an app into which we could just plug our data. However, I learned a lot about R and about how to categorize data with a programming language. Since I am proficient in Excel, a lot of it actually made sense to me, as it groups data in much the same way. I was very lucky to have a programmer assist me with code writing and troubleshooting, though.
I deliberated quite a bit over how to present this information. I really wanted to show the frequency of words in this text and, further, the frequency of words that conveyed a particular meaning or sentiment within the corpus. War and Peace is my favorite book because of its emotional universality, and this textual analysis really demonstrates that: I do think that meaning can be discerned through an analysis of word frequencies. Even someone who has not read the novel can come to understand a lot about its content (beyond character and place names) by viewing this visualization of word frequency. Exploring Big Historical Data: The Historian's Macroscope says of topic modeling: "The topic model generates hypotheses, new perspectives, and new questions: not simple answers, let alone some sort of mythical 'truth.' The act of visualization of the results too is as much art as it is science, introducing new layers of interpretation and engagement."[1] Clustering words together to form meaning within a novel is considered art, but perhaps this data visualization should be too! Maybe this sounds silly, but I was actually very moved by navigating through the bubbles from highest to lowest frequency. It conveys a lot about Tolstoy’s philosophical focus, and it makes me want to give War and Peace another read.
[1] Shawn Graham, Ian Milligan, and Scott Weingart, Exploring Big Historical Data: The Historian's Macroscope (Imperial College Press), Kindle edition, p. 157.