In a previous article we introduced you to the tidyverse ecosystem, and discussed some of the primary packages that fall within this excellent data science resource for R. This post will be a gentle introduction to some of the capabilities provided by tidyverse so that you’ll have a better understanding of how these functions operate within a basic data analysis workflow. We’ll examine the patterns in rental real estate values in major Texas metropolitan areas measured through a price per square foot value generated on a monthly basis. The dataset for this simple analysis is provided by Zillow Research through their data page.
What you’ll need to complete this “Learn by Example”
- RStudio – You can download RStudio if you don’t already have it. The free version will work just fine.
- Install the tidyverse package. It’s beyond the scope of this article to go into any detail on how this ecosystem of packages is installed, but there are many good resources that you can find online that will describe how to install tidyverse with RStudio. For purposes of this exercise though we’ll assume that you have already installed tidyverse.
- Download a file called City_MedianRentalPricePerSqft_AllHomes.csv from Zillow Research. On the data page for Zillow Research scroll down to Rental Listings and select Median Rent List Price ($ Per Square Foot) SFR, Condo/Co-op and then City as the Geography. You can see this in the screenshot below. You can save the file wherever you’ll like, just remember where you put it. Ideally it should be placed into a folder created specifically for this and other “Learn by Example” R exercises.
Let’s get started
In this “Learn by Example” you’ll read the csv file into a data frame object (technically a tibble, which is a new and improved version of a data frame), define the columns you want to use, filter the data to include only major metropolitan areas in Texas, perform some basic data tidying, group the data by metropolitan area, summarize the data, and finally create a plot of the data. This sounds like a lot of work, but tidyverse is particularly adept at this type of task, and as you’ll see it only takes a few lines of code to complete this workflow.
- Before we start writing code to accomplish this workflow let’s take a look at the structure of the csv file. If you have Excel or some other software that reads csv files go ahead and open the file now. If not, you can just take a look at the screenshot below to get a good feel for the structure of the data. In addition to columns for the region, state, metro, and county, you’ll see that most of the columns contain price per square foot values by year and month. For example, the column 2010-12 refers to December, 2010. These values are rental price per square foot.
- Open RStudio and set the working directory. The working directory is a folder on your computer that serves as the “home” folder. In this case it should be the folder where you placed the City_MedianRentalPricePerSqft_AllHomes.csv file that you downloaded earlier. If you don’t know how to set the working directory you can reference this help file provided by RStudio help.
- Create a new R script and save it to your working directory with a name of Analysis1_CityPriceSqFt.R.
- In the new script, load the readr, dplyr, ggplot2, and tidyr libraries. You can load these scripts by typing in the code that you see below at the top of the script or you can load them by clicking the check boxes next to their names in the Packages tab in RStudio. It’s generally preferable to type them into your script in the event that the script is run outside the RStudio environment.
library(readr) library(dplyr) library(ggplot2) library(tidyr)
- Use the read_csv() function from the readr package to load the City_MedianRentalPricePerSqft_AllHomes.csv into a dataframe with a variable name of cityDF.
cityDF <- read_csv("City_MedianRentalPricePerSqft_AllHomes.csv", col_names = TRUE)
- The select() function, found in the dplyr package, can be used to specify the columns that should be included in the dataframe. You probably noticed that there are a lot of columns of information – one for each month of every year since 2010. We’ll just use a single month from each year, and to keep it simple we’ll just use the month of December from each year. Therefore, we’ll select the 12th month from each year.As with all the dplyr functions we’ll use in this exercise, the first parameter passed into the function will be the dataframe to operate on (cityDF). It can also be used to rename columns. Add the code you see below. Notice that the RegionName column is being renamed to Region using the syntax New Column Name = Old Column Name.
cityDF <- select(cityDF, Region = RegionName, State, Metro, `2010-12`, `2011-12`, `2012-12`, `2013-12`, `2014-12`, `2015-12`, `2016-12`, `2017-12`, `2018-12`)
- Next we’ll filter the dataframe to include only the major metropolitan areas in Texas. Add the line of code you see below to call the filter() function, passing in the dataframe (cityDF), along with where clauses that identify the State and Metro areas. Technically we could leave off the parameter that defines TX as the state, but I want to illustrate one way that you can pass in multiple conditions to the filter() function.
cityDF <- filter(cityDF, State == 'TX', Metro %in% c('Houston-The Woodlands-Sugar Land', 'San Antonio-New Braunfels', 'Austin-Round Rock', 'Dallas-Fort Worth-Arlington'))
- Now there are a couple things that we’ll do to tidy up the dataset. The first thing we’ll do is move all the separate columns for dates into a single column called Year. The gather() function in the tidyr package is used to accomplish this operation. There are three parameters of the gather() function. The first is the set of columns that represent what should be values and not variables. These would be the date columns in this case. Next, you’ll need to name the variable of the new column. This is also called the key, and in this case will the Year variable. Finally, you’ll need to provide the value, which is the name of the variable whose values are spread over the cells.
cityDF <- gather(cityDF, `2010-12`, `2011-12`, `2012-12`, `2013-12`, `2014-12`, `2015-12`, `2016-12`, `2017-12`, `2018-12`, key='Year', value="PriceSqFt")
- We don’t really need the month identifier as part of the date; the year alone will suffice. For this we can use the dplyr mutate() function along with the built-in R substring() function to ensure that the contents of the Year column will contain only the year and not the month as well.
cityDF <- mutate(cityDF, Year = substring(cityDF$Year, 1,4))
- In this next step we’ll group the dataframe. The grouping of data in a data frame facilitates the split-apply-combine paradigm. This paradigm first splits the data into groups, using the group_by() function in dplyr, then applies analysis to the group, and then combines the results. The group_by() function handles the split portion of the paradigm by creating groups of data using one or more columns.
cityDF <- group_by(cityDF, Metro, Year)
- We can now use the summarize() function on the grouped data to get a mean price per square foot for each group. The na.rm parameter is used to leave out any null values.
cityDF <- summarize(cityDF, PriceSqFt = mean(PriceSqFt, na.rm = TRUE))
- The final step is to plot the results using the ggplot() function from the ggplot2 package. In this case we’ll use geom_point() to plot the points for each year along with geom_line() to draw a line in between each point.
ggplot(data=cityDF, aes(x=Year, y=PriceSqFt, group=Metro, color=Metro)) + geom_line() + geom_point() + ggtitle("Texas Rental Prices Per Sq Foot since 2010") + ylab("Price Per SqFt")
- Save and then run the script. If everything has been coded correctly you should see a plot in the Plot tab that looks like the screenshot below.
Improving the code
As was the case in this exercise, it is not unusual to run a series of dplyr functions as part of a larger processing routine. As you’ll recall, each dplyr function returns a new data frame, and this data frame is typically used as the input to the next dplyr function in the series. In this case we were recreating the cityDF object with each dplyr function call. These data frames are intermediate datasets not needed beyond the current step. However, as the programmer you are still required to name and code each of these datasets. Piping is a more efficient way of handling these temporary, intermediate datasets. In sum, piping is an efficient way of sending the output of one function to another function without creating an intermediate dataset, and is most useful when you have a series of functions to run. The syntax for piping is to use the %>% characters at the end of each statement that you want to pipe.
Comment out the code that you have written thus far, except for the calls to load the tidyverse packages. Add the code you see below to take advantage of piping. There are a few things to note about the syntax to use with piping. First, note the use of %>% characters at the end of each statement that you want to be piped. Next, notice that it is no longer necessary to pass a dataframe as the first parameter to the tidyverse functions. Running this code should result in the exact same output plot, but piping is a more efficient way of coding.
cityDF <- read_csv("City_MedianRentalPricePerSqft_AllHomes.csv", col_names = TRUE) cityDF %>% select(Region = RegionName, State, Metro, `2010-12`, `2011-12`, `2012-12`, `2013-12`, `2014-12`, `2015-12`, `2016-12`, `2017-12`, `2018-12`) %>% filter(State == 'TX', Metro %in% c('Houston-The Woodlands-Sugar Land', 'San Antonio-New Braunfels', 'Austin-Round Rock', 'Dallas-Fort Worth-Arlington')) %>% gather(`2010-12`, `2011-12`, `2012-12`, `2013-12`, `2014-12`, `2015-12`, `2016-12`, `2017-12`, `2018-12`, key='Year', value="PriceSqFt") %>% mutate(Year = substring(Year, 1,4)) %>% group_by(Metro, Year) %>% summarize(PriceSqFt = mean(PriceSqFt, na.rm = TRUE)) %>% ggplot(aes(x=Year, y=PriceSqFt, group=Metro, color=Metro)) + geom_line() + geom_point() + ggtitle("Major Texas Metropolitan Areas - Rental Prices Per Sq Foot since 2010") + ylab("Price Per SqFt")