While the base R package includes many useful functions and data structures that you can use to accomplish a wide variety of data science tasks, the third-party tidyverse package supports a comprehensive data science workflow as illustrated in the diagram below. The tidyverse ecosystem includes many sub-packages designed to address specific components of the workflow.
The tidyverse is a coherent system of packages for importing, tidying, transforming, exploring, and visualizing data. Most of the packages in the tidyverse ecosystem were originally developed by Hadley Wickham, and they are now maintained and expanded by a number of contributors. Tidyverse packages are intended to make statisticians and data scientists more productive by guiding them through workflows that facilitate communication and result in reproducible work products. Fundamentally, the tidyverse is about the connections between the tools that make the workflow possible.
Let’s briefly discuss the core packages that are part of tidyverse. We’ll use these tools extensively throughout the book, and we’ll take a deeper dive into the specifics of each package as we go.
The goal of readr is to facilitate the import of file-based data into a structured data format. The readr package includes seven functions for importing file-based datasets. The supported formats are csv, tsv, general delimited, fixed width, white space separated, and web log files.
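As a minimal sketch of what an import looks like, the example below writes a tiny csv file to a temporary location and reads it back with read_csv(). The file contents and the column names (city, population) are made up purely for illustration.

```r
library(readr)

# Create a small, hypothetical csv file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("city,population", "Austin,950000", "Boise,230000"),
  csv_path
)

# Read the file. Column types are stated explicitly here; if col_types is
# omitted, readr guesses the types from the data instead.
cities <- read_csv(
  csv_path,
  col_types = cols(
    city       = col_character(),
    population = col_double()
  )
)

cities
```

Stating the column types up front is optional, but it keeps the import predictable if the contents of a file change later.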
Data is imported into a data structure called a tibble. Tibbles are the tidyverse implementation of a data frame. They behave much like base R data frames, but there are some important differences. Tibbles never change the type of an input; for example, strings are never converted to factors. They never change the names of variables or create row names. Tibbles also have a refined print method that shows only the first 10 rows and all the columns that will fit on the screen, and they print the type of each column along with its name. We’ll refer to tibbles as data frames throughout the remainder of the book to keep things simple, but keep in mind that you’re actually going to be working with tibble objects. In the next chapter you’ll learn how to use the read_csv() function to load csv files into a tibble object.
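The short comparison below illustrates one of these differences, the handling of variable names. It is only a sketch; the column names and values are invented for the example.

```r
library(tibble)

# A base R data frame repairs non-syntactic column names by default.
df <- data.frame(`unit price` = c(1.5, 2.25), item = c("a", "b"))
names(df)
# "unit.price" "item"

# A tibble keeps the names exactly as they were supplied.
tb <- tibble(`unit price` = c(1.5, 2.25), item = c("a", "b"))
names(tb)
# "unit price" "item"

# Printing a tibble shows its dimensions and the type of each column.
tb
```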
Data tidying is a consistent way of organizing data in R, and the tidyr package provides tools for doing it. There are three rules that make a dataset tidy. First, each variable must have its own column. Second, each observation must have its own row. Finally, each value must have its own cell.
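To make the three rules concrete, here is a small sketch using pivot_longer() from tidyr. The table is hypothetical: its year variable is stored in the column names rather than in a column of its own, so it breaks the first rule.

```r
library(tibble)
library(tidyr)

# Hypothetical untidy table: the values of "year" are spread across
# the column names instead of living in their own column.
untidy <- tibble(
  country = c("A", "B"),
  `2019`  = c(10, 20),
  `2020`  = c(12, 25)
)

# Gather the year columns into two variables: year and cases.
tidy <- pivot_longer(
  untidy,
  cols      = c(`2019`, `2020`),
  names_to  = "year",
  values_to = "cases"
)

tidy
```

After reshaping, each row is one observation (a country in a given year) and each of the three variables has its own column.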
The dplyr package is a central part of tidyverse. It includes five key functions for transforming your data: filter() picks rows, arrange() reorders rows, select() picks columns, mutate() creates new columns, and summarize() collapses rows into summary values. All of these functions work closely with the group_by() function, which makes them operate group by group. The five functions also share a common form. The first argument is the data frame you’re operating on, and the remaining arguments describe what to do with it, referring to its variables by their unquoted names. The result of each call is a new data frame that is a transformed version of the data frame passed to the function. We’ll cover the specifics of each function in a later chapter.
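The sketch below applies the five functions, plus group_by(), one step at a time to the mtcars dataset that ships with R. The intermediate object names and the cutoff of 100 horsepower are arbitrary choices made for illustration.

```r
library(dplyr)

# Each call takes a data frame as its first argument and returns a new one.
powerful   <- filter(mtcars, hp > 100)                 # keep rows
trimmed    <- select(powerful, mpg, cyl, hp, wt)       # keep columns
with_ratio <- mutate(trimmed, hp_per_wt = hp / wt)     # add a column
by_cyl     <- group_by(with_ratio, cyl)                # define groups
summary_tbl <- summarize(                              # one row per group
  by_cyl,
  mean_mpg = mean(mpg),
  n_cars   = n()
)
final <- arrange(summary_tbl, desc(mean_mpg))          # reorder rows

final
```

Because every function returns a data frame, the output of one step can always be passed as the first argument of the next.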
The ggplot2 package is a data visualization package for R. It was created by Hadley Wickham in 2005 as an implementation of Leland Wilkinson’s Grammar of Graphics.
Grammar of Graphics is a term used to express the idea of creating individual blocks that are combined into a graphical display. The building blocks used in ggplot2 to implement the Grammar of Graphics include data, aesthetic mapping, geometric objects, statistical transformations, scales, coordinate systems, position adjustments, and faceting.
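As a rough sketch of how these building blocks combine, the example below assembles a plot from the mpg dataset that is bundled with ggplot2. The particular variables, axis label, and theme are illustrative choices rather than recommendations.

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +  # data and aesthetic mapping
  geom_point(aes(colour = class), position = "jitter") +      # geometric object, position adjustment
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +   # statistical transformation
  scale_y_continuous(name = "Highway miles per gallon") +     # scale
  coord_cartesian(ylim = c(10, 45)) +                         # coordinate system
  facet_wrap(~ drv) +                                         # faceting
  theme_minimal()                                             # theme

# Printing the object draws the plot on the active graphics device.
p
```

Each component is added with +, so a block can be swapped or removed without rewriting the rest of the plot.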
Using ggplot2 you can create many different kinds of charts and graphs, including bar charts, box plots, violin plots, scatterplots, regression lines, and more. There are a number of advantages to using ggplot2 over other visualization techniques available in R. These include a consistent style for defining graphics, a high level of abstraction for specifying plots, flexibility, a built-in theming system for controlling plot appearance, a mature and complete graphics system, and a large community of other ggplot2 users to turn to for support.
Other tidyverse packages
The tidyverse ecosystem includes a number of other supporting packages, including stringr, purrr, forcats, and others. In this book we’ll focus primarily on the packages already described, but to round out your knowledge of tidyverse you can reference tidyverse.org.