Introduction to Data Science

This course introduces participants to the foundational skills required for data science using the programming language R. Quantitative data are increasingly used in policymaking and analysis, so learning how to analyze and effectively communicate these data is ever more important. In this course, participants will learn how to use R to (1) clean data to prepare it for analysis, (2) summarize, aggregate, and transform data, and (3) visualize data. Mastering these skills will allow participants to explore patterns and trends in the data they use and to communicate these effectively through graphs. While this course will not cover statistical analysis, the skills it teaches form a useful basis for further study in statistics or advanced data science methods such as machine learning. All sessions are hands-on and practical.

 

Dates

This one-week, 17.5-hour course runs Monday-Friday, July 1-5, 2024, from 9:00 am to 12:30 pm each day.

 

Classroom Location 

Faculty of Arts and Social Science, AS1 03-04

 

Instructor

Nina Obermeier, University of Pennsylvania

 

Detailed Description

This course is intended to provide participants with the basic practical skills necessary for exploring, summarizing, and communicating quantitative data. Drawing insights from data often requires learning how to manage, transform, and visualize it.

First, participants will learn the basics of the RStudio/Posit environment and how to import and export data. Second, they will learn key, commonly used data wrangling skills, including merging data sets, creating new variables, assigning values based on conditions, grouping data, summarizing data, excluding missing values, reshaping data, and filtering data sets. Third, they will learn how to create different types of graphs for data visualization using the R package ggplot2. This is a particularly useful skill not only for exploring trends and patterns in the data but also for effectively communicating the data in question.
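As a rough illustration of what these data wrangling skills can look like in practice, here is a minimal sketch. It assumes the dplyr and tidyr packages, which the course description does not name explicitly, and it uses made-up toy data; the object and column names (scores, regions, level, and so on) are hypothetical.

```r
# A minimal sketch of common data wrangling steps.
# Assumption: the dplyr and tidyr packages are installed.
# All data and names below are invented for illustration.
library(dplyr)
library(tidyr)

scores <- data.frame(
  country = c("A", "A", "B", "B", "C"),
  year    = c(2020, 2021, 2020, 2021, 2020),
  score   = c(10, 12, NA, 8, 5)
)

regions <- data.frame(
  country = c("A", "B", "C"),
  region  = c("North", "North", "South")
)

merged <- scores %>%
  left_join(regions, by = "country") %>%                  # merge two data sets
  filter(!is.na(score)) %>%                               # exclude missing values
  mutate(level = if_else(score >= 10, "high", "low"))     # assign values based on a condition

by_region <- merged %>%
  group_by(region) %>%                                    # group the data
  summarise(mean_score = mean(score), n = n())            # summarize each group

wide <- scores %>%
  pivot_wider(names_from = year, values_from = score)     # reshape from long to wide

print(by_region)
print(wide)
```

The same steps could also be written in base R; the pipe-based style is used here only because each step reads as one line.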

Statistical analysis is not covered in this course. However, the skills participants will learn in this course will help prepare them to conduct statistical analyses in the future. Similarly, they form a useful foundation for more advanced data science tools such as machine learning.

The focus of the course will be on practical, hands-on exercises and demonstrations. On completion of the course, participants will be able to load raw data into RStudio/Posit and prepare it for analysis. They will be able to explore patterns and trends in the data by grouping, filtering, aggregating, and summarizing the data. Finally, they will be able to create clean, simple graphical representations of the data.
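The workflow described above (load raw data, prepare it, summarize it, and plot it) might look roughly like the following sketch. The CSV contents, column names, and plot labels are hypothetical and exist only to keep the example self-contained; the only package assumed beyond base R is ggplot2.

```r
# A minimal sketch: load raw data, clean it, summarize it, and plot it.
# Assumption: the ggplot2 package is installed. The data are invented.
library(ggplot2)

# Write a small hypothetical raw file to a temporary location.
raw_path <- tempfile(fileext = ".csv")
writeLines(c(
  "year,group,value",
  "2020,a,3",
  "2020,b,5",
  "2021,a,4",
  "2021,b,NA",
  "2022,a,6",
  "2022,b,7"
), raw_path)

raw <- read.csv(raw_path)                  # load the raw data
clean <- raw[!is.na(raw$value), ]          # drop rows with missing values

# Average value per year across groups.
yearly <- aggregate(value ~ year, data = clean, FUN = mean)

# A clean, simple line graph of the yearly averages.
p <- ggplot(yearly, aes(x = year, y = value)) +
  geom_line() +
  geom_point() +
  labs(title = "Hypothetical yearly average", x = "Year", y = "Mean value") +
  theme_minimal()

print(p)
```

In RStudio/Posit the graph appears in the Plots pane; a call such as ggsave() can be added to write it to a file.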

Participants will learn how to complete these tasks using the programming language R. For this course, participants will use RStudio/Posit, which is a free software application for using R. RStudio/Posit provides a more intuitive environment for data analysis using R, particularly for individuals without a programming background.

Participants are not assumed to have any prior knowledge of data analysis or R.

 

Prerequisites

There are no prerequisites for this course.

 

Requirements

Participants are expected to have access to an internet-connected computer. Instructions on downloading the required free software (RStudio/Posit) will be made available to participants prior to the beginning of the course.

 

Core Readings

We will be drawing on the relevant sections of the following free ebooks:

Michael Franke. 2021. An Introduction to Data Analysis. https://michael-franke.github.io/intro-data-analysis/index.html.

Roger D. Peng. 2022. R Programming for Data Science. https://bookdown.org/rdpeng/rprogdatascience/.

Hadley Wickham and Garrett Grolemund. 2017. R for Data Science. https://r4ds.had.co.nz/.