Quantitative Text Analysis I
Prof. Madrid-Morales is one of the greatest teachers I have ever had, explaining difficult concepts in an accessible way and demonstrating procedures step-by-step with great energy and enthusiasm. — participant from Canada
This course offers students an introduction to quantitative text analysis methods used to systematically extract information from texts. It starts with an overview of traditional approaches, such as manually coded content analysis, before moving on to computational methods that treat text as data. After reviewing relevant concepts in content analysis, participants learn and practice the basics of hand-coded approaches to text analysis. During the second half of the course, focused on computer-assisted text analysis, participants are introduced to text-processing techniques (tokenization, stemming and lemmatization), followed by dictionary-based approaches and simple machine learning techniques for document classification. The course combines lectures with hands-on labs that allow participants to practice and apply their newly acquired skills.
This course is the first part in a two-course sequence. Part two (cf. Quantitative Text Analysis II) covers more advanced topics, such as unsupervised machine learning, word embeddings and large language models (LLMs).
Dates
This one-week, 17.5-hour course runs Monday-Friday, July 1-5, 2024. The course is scheduled for 1:30-5:00 pm.
Classroom Location
Faculty of Arts and Social Science, AS1 02-08
Instructor
Dani Madrid-Morales, University of Sheffield
Detailed Description
This course provides an accessible and applied introduction to the fundamentals of quantitative text analysis, focusing on widely used methods to systematically extract insights from text data. Designed for participants with basic knowledge of the R programming language, the course guides learners from basic text processing to more structured analysis techniques for making sense of large collections of texts.
The main learning objectives of the course are to:
- Understand the principles of computational text analysis and its relevance in social science research.
- Preprocess textual data (e.g., tokenization, stemming, stop-word removal).
- Use dictionaries for sentiment and thematic analysis.
- Build and validate simple text classification models.
- Leverage APIs and LLMs to improve dictionary creation and text classification.
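To give a sense of what the preprocessing objective involves, here is a minimal sketch of tokenization, stop-word removal and stemming. It is written in plain Python for brevity, whereas the course labs use R. The stop-word list, the suffix list and the function names are illustrative assumptions, and the stemmer is deliberately crude rather than a real algorithm such as Porter's.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "was", "were"}

# Hypothetical suffix list for a deliberately crude stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run the three steps in order: tokenize, filter stop words, stem."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


example = "The ministers were debating new policies in the parliament."
print(preprocess(example))
# prints ['minister', 'debat', 'new', 'polici', 'parliament']
```

Stems such as "debat" and "polici" are not dictionary words. That is expected: stemming only aims to map related word forms to a shared key, while lemmatization returns proper base forms.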
The course begins by introducing foundational principles of content analysis, focusing on how to systematically and objectively analyze text data. Participants will explore the differences between manual and computational content analysis, understanding traditional workflows and discussing the advantages and disadvantages of each approach for large-scale analysis. The first two sessions highlight practical strategies for formulating research questions, defining populations, and selecting appropriate sampling methods, laying the groundwork for automated approaches.
The course then transitions to computer-assisted text analysis methods, introducing participants to dictionary-based approaches for efficiently classifying and tagging large text collections. Students will also learn the basics of supervised machine learning, such as building simple classifiers to categorize text into predefined categories. The course concludes by exploring how contemporary tools like APIs and Large Language Models (LLMs) can enhance traditional approaches, improve dictionary creation, and support more robust classification tasks.
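The dictionary-based approach mentioned above can be sketched in a few lines: each category is a list of terms, and a document is tagged with the category whose terms it mentions most often. The example is in plain Python for brevity, whereas the course labs use R. The two-category dictionary, the sample documents and the function names are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical two-category dictionary; the terms are illustrative only.
DICTIONARY = {
    "economy": {"tax", "taxes", "inflation", "jobs", "trade"},
    "health": {"hospital", "hospitals", "vaccine", "doctors", "nurses"},
}


def count_categories(text, dictionary=DICTIONARY):
    """Count how many tokens in the text match each category's term list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, terms in dictionary.items():
            if token in terms:
                counts[category] += 1
    return counts


def tag_document(text, dictionary=DICTIONARY):
    """Return the category with the most matches, or None if nothing matches.

    Ties go to whichever tied category was matched first in the text.
    """
    counts = count_categories(text, dictionary)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


docs = [
    "Inflation and taxes dominated the debate on trade.",
    "Nurses and doctors asked for more hospital funding.",
    "The match ended in a draw.",
]
for doc in docs:
    print(tag_document(doc), "|", doc)
```

A real dictionary would need validation against hand-coded documents, since exact term matching misses synonyms, negation and context.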
This course is predominantly practical and applied, and it is structured in such a way that participants learn how to use these methods in their own research. It combines theoretical sessions with practical hands-on labs that allow participants to immediately apply what they learn in individual and team exercises.
This course is the first part in a two-course sequence. More advanced techniques, such as supervised and semi-supervised machine learning for text analysis and word embeddings, are covered in Quantitative Text Analysis II.
Prerequisites
While there are no formal prerequisites, it would be beneficial if participants were familiar with basic statistical concepts (cf. Regression Analysis) and had some basic knowledge of the R programming language.
Requirements
Participants are expected to have access to an internet-connected computer. Access to data, temporary licenses for the course software, and installation support will be provided by the Methods School.
Core Readings
Van Atteveldt, Wouter, Damian Trilling, and Carlos Arcila Calderón. 2022. Computational Analysis of Communication. Hoboken: Wiley.
Suggested Readings
Grimmer, Justin, Margaret E. Roberts, and Brandon M. Stewart. 2022. Text as Data. Princeton and Oxford: Princeton University Press.
Silge, Julia, and David Robinson. 2017. Text Mining with R. Sebastopol: O’Reilly Media.