Part 4. Digital Tools Explained

4.5 Topic Modeling Tool

What is Topic Modeling Tool?


Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. This tool uses a Latent Dirichlet Allocation (LDA) algorithm to assign the text in a document to particular topics. Topic models provide a simple way to analyze large volumes of unlabeled text. A “topic” consists of a cluster of words that frequently occur together.


Overview of Topic Modeling Tool

Teddy Roland offers a fantastic overview in “Topic Modeling: What Humanists Actually Do With It.” In this overview, he points out that “Computers make excellent statisticians and this can be leveraged toward the kind of textual synthesis that initiates higher-order inquiry.” Even though computers cannot interpret meaning, “The computer is well able to recognize unique strings of characters like words and can perform tasks like locating or counting these strings throughout a document.”

Roland cautions readers that “Despite its algorithmic nature, it would be a gross mischaracterization to claim that topic modeling is somehow objective or absent interpretation.” He continues, “I will simply emphasize that human evaluative decisions and textual assumptions are encoded in each step of the process, including text selection and topic scope.”

Scholars might consider using topic modeling as a way to guide close readings and as a technique that examines overarching linguistic patterns in a collection of texts. Topic modeling provides us with methods to organize, understand and summarize large collections of textual information. Ultimately, this technical approach to interpreting texts highlights hidden topical patterns that are present across the collection that might not be seen when performing a traditional read.

This chapter offers an overview of how to use Topic Modeling Tool to explore Toni Morrison’s Sula.

Getting Started with Topic Modeling Tool

What is topic modeling?

Topic modeling is a method of text analysis that looks for clusters of words, called “topics”, in a collection of texts. Topic modeling allows us to examine a body of texts from a distance, find which words tend to cluster together, and examine the general trends among those clusters.
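To get a feel for what "words that cluster together" means, here is a minimal Python sketch that counts how many documents each pair of words shares. This is plain co-occurrence counting, not the LDA algorithm that Topic Modeling Tool uses, and the three short documents are invented for illustration only.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how many documents each pair of distinct words appears in together."""
    pair_counts = Counter()
    for text in documents:
        # Lowercase the text and keep alphabetic tokens only.
        # Using a set means a word counts once per document.
        words = sorted(set(re.findall(r"[a-z]+", text.lower())))
        # Sorting first means each pair is stored in alphabetical order.
        pair_counts.update(combinations(words, 2))
    return pair_counts


# Hypothetical toy documents, invented for this example.
docs = [
    "the river and the bridge",
    "a bridge over the river",
    "the house on the hill",
]
counts = cooccurrence_counts(docs)
print(counts[("bridge", "river")])  # number of documents containing both words
```

Pairs with high counts are words that tend to appear in the same texts. LDA goes further by estimating topics and their proportions statistically, but the underlying intuition is similar.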

How do I do it?

In this tutorial, we will be using Topic Modeling Tool, a desktop application that runs on Mac and PC.

You will also need a body of texts that you want to examine. These can be chapters from a novel, a collection of journal articles, or any other collection of texts. These texts should be related to one another in some way; otherwise, the results from the topic modeling will be meaningless!

These texts each need to be saved as separate .txt files in order for the program to work. You can copy and paste your texts into any word processor and save the file as a .txt file. Create a new folder and save all your text files into that folder.
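If your text is one long file, copying and pasting each chapter by hand can be tedious. The following is a minimal Python sketch of one way to split a text into separate .txt files. It assumes each chapter begins with a line starting with the word CHAPTER; that marker, the function name save_chapters, and the file names chapter_01.txt and so on are all hypothetical choices, so adjust them to match your own text.

```python
import re
import tempfile
from pathlib import Path


def save_chapters(full_text, output_dir, marker=r"^CHAPTER\b.*$"):
    """Split full_text at lines matching marker and save each piece as a .txt file."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Split wherever a whole line matches the (assumed) chapter marker.
    pieces = [p.strip() for p in re.split(marker, full_text, flags=re.MULTILINE)]
    # Drop empty pieces. Note that any text before the first marker
    # (for example, a title page) becomes the first file.
    pieces = [p for p in pieces if p]
    paths = []
    for number, piece in enumerate(pieces, start=1):
        path = out / f"chapter_{number:02d}.txt"
        path.write_text(piece + "\n", encoding="utf-8")
        paths.append(path)
    return paths


# Hypothetical sample text, written to a temporary folder for demonstration.
sample = "CHAPTER 1\nFirst text.\nCHAPTER 2\nSecond text.\n"
with tempfile.TemporaryDirectory() as folder:
    files = save_chapters(sample, folder)
    print([f.name for f in files])
```

In practice, you would read your own file with Path("your_file.txt").read_text(encoding="utf-8") and pass the name of your input folder instead of a temporary one.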

How do I install Topic Modeling Tool?

You can download Topic Modeling Tool from GitHub. Go to this link in your web browser.

Click on the green “Code” button in the top right, and then click on “Download ZIP”.

On Mac, open the file “TopicModelingTool.dmg” and then drag the program into your Applications folder in the window that opens. You may have to give your computer permission to open the program in your security settings.

How do I use Topic Modeling Tool?

Open Topic Modeling Tool. You should see a window that looks like this:

This image is from the home console on Topic Modeling Tool.
Figure 4.5.1

You will need two folders. The first folder should contain all of the texts you want the program to look at. This is your input. Click on the button labeled “Input Dir…” and then navigate to your input folder. Click on your chosen folder so that it is highlighted blue, and then click “Choose”.

This image is the file upload function in Topic Modeling Tool.
Figure 4.5.2

Your second folder should be a blank folder. This is your output. This is where the computer will store the results of the topic analysis. Click on the button labeled “Output Dir…” and then navigate to your designated output folder. Click on your chosen folder so that it is highlighted blue, and then click “Choose”.
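Before you run the tool, you may want to confirm that both folders are set up as described. The Python sketch below is one possible check, not part of Topic Modeling Tool itself. The function name check_folders and the folder names are hypothetical, and the script ignores hidden files (names starting with a period) on the assumption that they are system files.

```python
import tempfile
from pathlib import Path


def check_folders(input_dir, output_dir):
    """Return a list of problems; an empty list means the folders look ready."""
    problems = []
    inp, out = Path(input_dir), Path(output_dir)

    if not inp.is_dir():
        problems.append(f"Input folder not found: {inp}")
    else:
        # Ignore hidden files such as .DS_Store (an assumption, see above).
        files = [p for p in inp.iterdir() if p.is_file() and not p.name.startswith(".")]
        txt = [p for p in files if p.suffix.lower() == ".txt"]
        other = [p.name for p in files if p.suffix.lower() != ".txt"]
        if not txt:
            problems.append("Input folder contains no .txt files")
        if other:
            problems.append(f"Input folder has non-.txt files: {sorted(other)}")

    if not out.is_dir():
        problems.append(f"Output folder not found: {out}")
    elif any(not p.name.startswith(".") for p in out.iterdir()):
        problems.append("Output folder is not empty")

    return problems


# Demonstration with temporary folders and one hypothetical text file.
with tempfile.TemporaryDirectory() as root:
    input_dir = Path(root) / "input"
    output_dir = Path(root) / "output"
    input_dir.mkdir()
    output_dir.mkdir()
    (input_dir / "chapter_01.txt").write_text("Sample text.\n", encoding="utf-8")
    print(check_folders(input_dir, output_dir))  # prints []
```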

You have two optional files you can include. You can add these to your analysis by clicking on the “optional settings” button. The “metadata” file is useful for any numerical data you have related to your various text documents. This file should be saved as a .csv file. The “stopword” file lists words that will be excluded from the analysis. By default, Topic Modeling Tool excludes common words in the English language. If you choose to add additional words to the stopword list, list one word per line in a .txt file. Both of these files are optional, and you can perform topic analysis without them.
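A stopword file is simple enough to type by hand, but if you have a long list of words, a short script can write it for you. This Python sketch writes one word per line, as described above. The function name write_stopword_file, the file name extra_stopwords.txt, and the example words are hypothetical. Lowercasing the words is an assumption, based on the lowercase topic words that appear in this chapter's example results.

```python
import tempfile
from pathlib import Path


def write_stopword_file(words, path):
    """Write one word per line, skipping blanks and duplicates."""
    cleaned_words = []
    for word in words:
        # Lowercasing is an assumption about how the tool compares words.
        cleaned = word.strip().lower()
        if cleaned and cleaned not in cleaned_words:
            cleaned_words.append(cleaned)
    Path(path).write_text("\n".join(cleaned_words) + "\n", encoding="utf-8")
    return cleaned_words


# Hypothetical extra stopwords: character names you might want to exclude.
with tempfile.TemporaryDirectory() as folder:
    stop_path = Path(folder) / "extra_stopwords.txt"
    written = write_stopword_file(["Sula", "Nel", "sula", " "], stop_path)
    print(stop_path.read_text(encoding="utf-8"))
```

Excluding character names is only one possible choice. Whether to remove them depends on the questions you are asking of the text.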

You can specify how many topics you want the program to generate in the “Number of Topics” box. You may have to run the program several times, trying different numbers of topics, to obtain an analysis to your liking.

When you are ready to run the program, click “Learn Topics”. The process can take a few minutes. You should see text being generated in the console area. When the program is done, the console will display “—”.

How do I access the results?

Once the program is finished running, it will create two folders in your chosen “output” folder: output_csv and output_html.

output_csv will contain .csv files. CSV stands for comma-separated values. These files are typically opened in spreadsheet programs like Excel. You will find three CSV files in this folder:

  • docs-in-topics.csv – this csv file gives you a list of topics and shows you which documents correspond to which topics
  • topics_words.csv – this csv file gives you a numbered list of topics
  • topics-in-docs.csv – this csv file gives you a list of documents and lists which topics correspond to them
  • topics-metadata – if you included the optional metadata file, you will have this fourth file

output_html will contain html files, which will open in your web browser:

  • all-topics.html – this is the main html document. It contains a list of each topic, and by clicking on each topic you can access the texts which most correspond to that topic
  • Docs – this folder contains an HTML page for each text in your input
  • Topics – this folder contains an HTML page for each generated topic
  • malletgui.css – this code instructs your web browser how to display your html page

What do the results mean?

topics_words.csv is a good place to begin understanding the results of your topic analysis. As an example, the results below are from a chapter-by-chapter topic modeling of Toni Morrison’s Sula. Each chapter of Sula was one text in the input folder, and six topic groups were generated. Each of the six topics is a list of words that you can interpret and name as a category. The csv file will look like this:

This is a screenshot from Topic Modeling Tool. This example shows the six topics from an example using Toni Morrison's Sula.
Figure 4.5.3

After coming up with various categories for the strings of words, topics-in-docs.csv is the next place to look to understand the results of your topic analysis. The csv file will look like this:

This image is from another return from Topic Modeling Tool. This sample is from "topic_in_docs" and a sample from an experiment with Sula.
Figure 4.5.4

Each row in topics-in-docs.csv corresponds to one of the texts in your input. Column A provides an identification number for each document, and column B lists the filename of that document. Columns C, E, G, and so on list the “toptopics”, the topics to which each text most corresponds. These topics are listed left to right from most to least relevant for each text. Columns D, F, H, and so on tell you what percentage of the document each topic makes up. These percentages are listed as decimals.

 

This image is from another return from Topic Modeling Tool. This sample is from "topic_in_docs" and a sample from an experiment with Sula. This image is a close up of the previous file.
Figure 4.5.5

For example, in the picture above, document number 0, in row number 2, has a 63.07% match to topic 21. The document also has a 6.42% match to topic number 0. And so on.
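If you would rather read this file with a script than a spreadsheet, the column layout described above can be sketched in Python as follows. This assumes the file has one header row, that column A is the document number and column B the filename, and that the remaining columns alternate between a topic number and its decimal share. The header text and the filename in the sample are hypothetical; the topic numbers and shares are the ones quoted in the example above.

```python
import csv
import io


def parse_topics_in_docs(csv_file):
    """Yield (doc_id, filename, [(topic, share), ...]) for each data row."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row (assumed to be present)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        doc_id, filename, cells = row[0], row[1], row[2:]
        pairs = []
        # Walk the remaining cells two at a time: topic number, then share.
        for i in range(0, len(cells) - 1, 2):
            topic, share = cells[i].strip(), cells[i + 1].strip()
            if topic and share:
                pairs.append((int(topic), float(share)))
        yield doc_id, filename, pairs


# Sample data: hypothetical header and filename, shares from the example above.
sample = "docId,filename,toptopics\n0,chapter_01.txt,21,0.6307,0,0.0642\n"
rows = list(parse_topics_in_docs(io.StringIO(sample)))
for doc_id, filename, pairs in rows:
    for topic, share in pairs:
        # The .2% format turns a decimal such as 0.6307 into 63.07%.
        print(f"Document {doc_id} ({filename}): topic {topic} = {share:.2%}")
```

To read a real file, open it with open("topics-in-docs.csv", newline="", encoding="utf-8") and pass the file object to the function in place of the sample.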

What do the topics mean? How do I name them?

We can understand the topics better by searching through the all-topics.html page. Go to your output folder > output_html > all-topics.html. When you open the document, you should see a page similar to the spreadsheet file. By clicking on a specific topic, you can see the extent to which a given topic is dispersed across the corpus. Below, the example corresponds to Topic (5): eyes, house, woman, mother, children, back, helene, looked, window, colored.

This image is from another return from Topic Modeling Tool. This example is from the html file that shows the distribution of topics by chapters.
Figure 4.5.6

Each row is a list of words associated with that topic. Clicking on a row will take you to an HTML page for that specific topic, which lists the texts associated with that topic. Reading through the texts which have a high correspondence to that topic might give you an idea of what that topic represents.

 

This image is from another return from Topic Modeling Tool. This example shows the specific returns from chapter 9 from Toni Morrison's Sula.
Figure 4.5.7

Given these results, combined with our prior knowledge of Toni Morrison’s Sula, we might name the groups as follows:

This image is a revised document where categories/titles have been assigned to each topic or cluster of words.
Figure 4.5.8

Rationale for Topics

  1. Division: This theme deals with the expectations and intersectionality of the community, highlighting how easily relationships break and become scarred. Important scenes that are examples of this theme include the return of Sula from college, the dissolution of Nel and Jude’s marriage, Eva’s divisive relationship with her children, and the description of the development of the Bottom into space for vacation homes. Chapters dealing with this theme might explore binary terms such as black and white or women and boys, as well as people or things that might be dividers, such as road and place.
  2. Matriarchy: Sula is told entirely through the eyes of women living in the Bottom, so a pertinent theme is the power of mothers. The women-centered narrative allows Morrison to offer insights into the lives of Black women and the formation of their roles; the chapters in which this theme is dominant explain the choices made by mothers in the narrative. One example of this theme in the text is the conversation in Chapter 5 in which Hannah asks Eva if she ever loved her children.
  3. Orientation: Much of Sula is interested in exploring the comings and goings of characters—both literally and figuratively—as they learn where they belong in relation to one another. Chapters dealing with this topic might offer prepositional words such as left, good, stood, home, turned, and closed. Major events that exemplify this theme include Eva’s physical and emotional separation from her relatives, as well as Sula and Nel’s separation via marriage, life, and death.
  4. Social Currency: Sula explores the workings of a community and their operations. As Trudier Harris explains in “The Worlds that Toni Morrison Made,” the novel “highlight[s] a world in which people who try to go against the flow ultimately suffer for it—even as they seem to have an ultimately positive impact upon the communities.” Chapters dealing with this theme might repeat words such as people, looked, and wanted, highlighting methods through which the community dealt reputations and expectations to others.
  5. Repression: Trudier Harris explains that much of Morrison’s work explores the inner feelings of characters to highlight racism, race, and relations and their effects on communities and peoples: Sula does the same. Scenes that uphold the theme of repression include Shadrack’s invention of National Suicide Day, Helene’s breaking into the middle class while trying to repress her origins, and Sula’s reconciliation of her distrust of Eva. Chapters dealing with this theme might offer words such as eyes, house, back, looked, window and colored; words such as these emphasize the interiority of this theme.
  6. Streets: Within Sula, the physicality of the community is important. Its roads matter because they carry and connect individuals. Journeying is important, both physically and figuratively, in writing by Black women, as Deborah E. McDowell explains in “New Directions for Black Feminist Criticism.” Chapters dealing with this theme might offer words such as ice, Shadrack, and watched and include events that existed in the traffic of the street: deaths, parades, deliveries, work, and so on.

I later revised the “TopicsInDocs” file after creating the rationale for the six topics. I replaced “column B” with simplified chapter titles. Also, I changed the numbers for each topic to specific categories. And, finally, I placed all of the information in a single row to make it easier to sort through the various chapters, topics, and percentages.
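These revisions were made by hand in a spreadsheet, but a similar rearrangement can be sketched in Python. The example below replaces topic numbers with names and filenames with chapter titles, then lists one chapter and topic pair per line. It is only an approximation of the revised file, not a reproduction of it. The pairing of topic 5 with “Repression” follows the word list shown earlier for Topic (5); every other value here, including the filename, chapter title, second topic name, and percentages, is hypothetical.

```python
import csv
import io


def to_long_rows(wide_rows, topic_names, chapter_titles):
    """Turn (filename, [(topic, share), ...]) into one row per chapter and topic."""
    long_rows = []
    for filename, pairs in wide_rows:
        # Fall back to the filename if no simplified title was supplied.
        title = chapter_titles.get(filename, filename)
        for topic, share in pairs:
            # Fall back to a generic label for topics that were not named.
            name = topic_names.get(topic, f"Topic {topic}")
            long_rows.append((title, name, share))
    return long_rows


# Topic 5 = Repression follows the chapter; all other values are hypothetical.
topic_names = {5: "Repression", 1: "Matriarchy"}
chapter_titles = {"chapter_01.txt": "Chapter 1"}
wide_rows = [("chapter_01.txt", [(5, 0.50), (1, 0.25), (3, 0.10)])]

long_rows = to_long_rows(wide_rows, topic_names, chapter_titles)

# Write the rearranged rows as CSV text that a spreadsheet program can open.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["chapter", "topic", "share"])
writer.writerows(long_rows)
print(buffer.getvalue())
```

With one chapter and topic pair per line, you can sort or filter by chapter, topic, or share in any spreadsheet program.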

This image is from another return from Topic Modeling Tool. This sample is from "topic_in_docs" and a sample from an experiment with Sula.
Figure 4.5.9
This image is from another return from Topic Modeling Tool. This sample is from "topic_in_docs" and a sample from an experiment with Sula. This is a revised version with the topics and percentages arranged vertically.
Figure 4.5.10

What if I don’t like my results?

If your topic analysis results do not satisfy you, try changing your inputs, since different inputs can produce very different results. You can group your texts in a different way, or you can change the number of topics.

 

By John Merritt


This chapter was adapted from “Very basic strategies for interpreting results from the Topic Modeling Tool” by Andy Wallace

 

Media Attributions

  • Figure 4.5.1 – Topics – Topic Modeling Homescreen © Kenton Rambsy
  • Figure 4.5.2 – Topics – Topic Modeling Open Screen © Kenton Rambsy
  • Figure 4.5.3 – Topics – Excel File © Kenton Rambsy
  • Figure 4.5.4 – Topics – Topic Modeling Return © Kenton Rambsy
  • Figure 4.5.5 – Topics – Topic Modeling Return – 2 © Kenton Rambsy
  • Figure 4.5.6 – Topics – Repression © Kenton Rambsy
  • Figure 4.5.7 – Topics – Chapter 9 © Kenton Rambsy
  • Figure 4.5.8 – Topics – Sula Topics © Kenton Rambsy
  • Figure 4.5.9 – Topics – Topic in Docs
  • Figure 4.5.10 – Topics – Revised Topic in Docs © Kenton Rambsy

License


The Data Notebook by Peace Ossom-Williamson and Kenton Rambsy is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.
