Quantitative Text Analysis I

Prof. Madrid-Morales is one of the greatest teachers I have ever had, explaining difficult concepts in an accessible way and demonstrating procedures step-by-step with great energy and enthusiasm. — participant from Canada

This course provides participants with an introduction to quantitative text analysis methods used to systematically extract information from texts. It starts with an overview of traditional approaches, such as manually coded content analysis, before quickly moving on to computational methods that treat text as data. After reviewing relevant concepts in content analysis (e.g., content validity and inter-coder reliability), participants learn and practice the basics of hand-coded approaches to text analysis. The second half of the course focuses on computer-assisted text analysis: participants are introduced to text-processing techniques (tokenization, stemming and lemmatization), followed by dictionary-based approaches, including sentiment analysis. The course combines lectures with hands-on labs that allow participants to practice and apply their newly acquired skills.

This course is the first part in a two-course sequence. Part two (cf. Quantitative Text Analysis II) covers more advanced topics, such as supervised and semi-supervised machine learning, and word embeddings.

Dates

This one-week, 17.5-hour course runs Monday-Friday, July 1-5, 2024. The course is scheduled for 1:30-5:00 pm.

Classroom Location

Faculty of Arts and Social Science, AS1 02-08

Instructor

Dani Madrid-Morales, University of Sheffield

Detailed Description

This course provides participants with an applied introduction to basic methods of quantitative text analysis that are widely used to systematically extract information from texts. The course starts by covering traditional approaches, such as manual coding, but quickly moves on to recent advances in social science methods that treat text as data and use computer-assisted techniques in their analysis.

The course begins with a review of important concepts in content analysis, such as inter-coder reliability and content validity. It then takes a closer look at manual coding approaches, which have been used for decades in well-known research projects such as the Comparative Manifesto Project, where human coders reduce the content of a wide variety of texts to predefined categories. In addition, participants will be introduced to crowd-coding platforms that are increasingly being used to label and tag texts.
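To make inter-coder reliability concrete, here is a minimal sketch of Cohen's kappa, one common chance-corrected agreement measure for two coders. The description above does not say which reliability statistic the course uses, so the choice of kappa, the function name `cohens_kappa`, and the example labels are all illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders (Cohen's kappa).

    Hypothetical helper for illustration; not taken from the course materials.
    """
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must label the same non-empty set of units.")

    n = len(coder_a)

    # Observed agreement: share of units given the same category by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

    # Expected agreement by chance, from each coder's marginal category counts.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both coders used a single identical category; agreement is trivial.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels for four text units, assigned by two hypothetical coders.
labels_a = ["economy", "economy", "welfare", "welfare"]
labels_b = ["economy", "welfare", "welfare", "welfare"]
kappa = cohens_kappa(labels_a, labels_b)
```

In this toy case the coders agree on three of four units (0.75), but half of that agreement would be expected by chance given how often each coder uses each category, so kappa comes out at 0.5.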

From there, the course moves to computer-assisted, dictionary-based text analysis techniques, which use computers to code large amounts of text by relying on previously built codebooks that assign individual words to specific thematic categories. Next, participants are introduced to various refinements to the dictionary approach, such as sentiment analysis and Wordscores. While the former allows for the study of attitudes or emotions in texts, the latter allows social scientists to automatically extract policy positions from documents, such as election manifestos or speeches. Special attention will be paid to understanding validation techniques, which are a required last step in any computational analysis of text.
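The dictionary approach described above can be sketched in a few lines. The example below is only an illustration under stated assumptions: the two-category dictionary, the regular-expression tokenizer, and the function names `tokenize` and `code_document` are all hypothetical and are not drawn from the course software or from any published dictionary.

```python
import re
from collections import Counter

# Hypothetical toy dictionary mapping categories to words (assumption).
DICTIONARY = {
    "economy": {"tax", "taxes", "budget", "jobs"},
    "environment": {"climate", "emissions", "pollution"},
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def code_document(text, dictionary):
    """Count how many tokens fall into each dictionary category.

    Returns the per-category counts and the total number of tokens,
    so that counts can be normalized by document length if needed.
    """
    tokens = tokenize(text)
    counts = Counter()
    for category, words in dictionary.items():
        counts[category] = sum(1 for token in tokens if token in words)
    return counts, len(tokens)


# Made-up sentence standing in for a manifesto passage.
doc = "We will cut taxes, protect jobs and reduce emissions."
counts, n_tokens = code_document(doc, DICTIONARY)
```

A sentiment dictionary works the same way, with categories such as "positive" and "negative" in place of policy topics. Whatever the dictionary, the resulting counts still need to be validated against human judgment, as the paragraph above stresses.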

This is an applied course for beginners and intermediate users of content analysis. It provides both an overview of the theoretical foundations of quantitative text analysis and a thorough introduction to computer-assisted techniques. The course is predominantly practical and is structured so that participants learn how to use these methods in their own research. It combines theoretical sessions with hands-on labs that allow participants to immediately apply what they learn in individual and team exercises.

This course is the first part in a two-course sequence. More advanced techniques, such as supervised and semi-supervised machine learning for text analysis and word embeddings, are covered in Quantitative Text Analysis II.

Prerequisites

While there are no formal prerequisites, participants will benefit from familiarity with basic statistical concepts (cf. Regression Analysis) and some experience with the statistical software R. However, participants unfamiliar with these concepts and tools will still be able to participate effectively in the course.

Requirements

Participants are expected to have access to an internet-connected computer. Access to data, temporary licenses for the course software, and installation support will be provided by the Methods School.

Core Readings

Van Atteveldt, Wouter, Damian Trilling, and Carlos Arcila Calderón. 2022. Computational Analysis of Communication. Hoboken: Wiley.