Quantitative Text Analysis II

Prof. Madrid-Morales is an outstanding instructor, extremely passionate and helpful. I hope he will continue to offer this great course. — participant from Hong Kong

This course covers advanced techniques and methods in computational quantitative text analysis used to systematically extract information from texts. It combines lectures with hands-on labs in which participants can practice their newly acquired skills. Students will work on a research project throughout the week, covering all stages from data collection and acquisition, to processing, analysis and visualization. The course begins with a brief overview of automated data collection techniques commonly used by social scientists working with text data (for example, web scraping), before discussing the use of supervised, semi-supervised and unsupervised machine learning techniques. While the focus is on practical applications of these methods, the course also offers an introduction to the mathematical and statistical rationale behind them. In the latter part of the course, participants are introduced to novel approaches in quantitative text analysis such as semantic network analysis and word embeddings. Students will also learn basic techniques to visualize quantitative text analysis data.

This course is the second part in a two-course sequence. It requires participants to be familiar with the material covered by the introductory Quantitative Text Analysis I or to have prior experience with basic computational text analysis.

 

Dates

This one-week, 17.5-hour course runs Monday-Friday, July 8-12, 2024. The course is scheduled for 1:30-5:00 pm.

 

Classroom Location 

Faculty of Arts and Social Science, AS1 02-08

Instructor

Dani Madrid-Morales, University of Sheffield

 

Detailed Description

Building on the material covered by the first course in the two-course quantitative text analysis sequence (cf. Quantitative Text Analysis I), this course teaches advanced techniques in computational quantitative text analysis and provides participants with skills that can be immediately applied to systematically extract and analyze information from text.

The course starts with a review of different techniques used for gathering and cleaning data (web scraping, API use, and corpus building), and offers an introduction to different tools commonly used in natural language processing (NLP). After this, the course moves on to discussing supervised, semi-supervised and unsupervised machine learning models commonly used in computational text analysis and explores their basic mathematical and statistical foundations. Students will also be introduced to other novel methodological approaches to studying textual content quantitatively, including semantic network analysis and word embeddings.

Through detailed tutorials and hands-on sessions, participants are taught how to use state-of-the-art open-source R packages, like quanteda, tidytext and ggplot2, to complete a wide array of tasks, including basic text-as-data visualization techniques. In the latter part of the course, participants also learn how best to incorporate and use the results of applied quantitative text analysis in further statistical analyses.

Based on the specific research projects, interests and needs of participants, which are discussed at the beginning of the week, the course also offers practical solutions to problems related to acquiring, pre-processing, and storing large amounts of text; it teaches complementary advanced topics, such as data scraping, part-of-speech (POS) tagging and/or multi-language text analysis; and it offers guidance on how to learn and apply more advanced methods and techniques.

 

Prerequisites

We strongly encourage participants to combine this course with the introductory Quantitative Text Analysis I. Alternatively, participants should have prior experience with quantitative text analysis and some familiarity with the statistical software R. Participants may contact the instructor with questions regarding the knowledge required to follow this course.

 

Requirements

Participants are expected to have access to an internet-connected computer. Access to data, temporary licenses for the course software, and installation support will be provided by the Methods School.

 

Core Readings

Van Atteveldt, Wouter; Trilling, Damian, and Arcila Calderón, Carlos. 2022. Computational Analysis of Communication. Hoboken: Wiley.

 

Suggested Readings

Grimmer, Justin; Roberts, Margaret E., and Stewart, Brandon M. 2022. Text as Data. Princeton and Oxford: Princeton University Press.

Jurafsky, Daniel, and Martin, James H. 2023. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Stanford.