Volume 1

Iterating and Quarto

By Millie O. Symns in R

September 30, 2022

Introduction

I wouldn’t call myself a pro when using R, even though I have been using it since ~2018. I know how to Google my way through problems like the best and apply them to my specific needs and situations. As I have been teaching myself through work and some personal projects, looking at my code from when I started compared to now is drastically different. Even so, I often forget the things I learned and think back to a project where I used a function I needed for a project and have to think back to what that project was and the file name. So since my brain works that way, I have been thinking of making my own sort of cheat sheet for the things I always do or need and manage to forget 😅 While I figure out how I want that to look, I thought I would share some of these tips in a mini-series.

Disclaimer: There may be even easier or better ways to do any of the things I share here (I would bet money there is). These work for me for what I need to do now, but I always love more efficient ways of working. Share your suggestions with me if you have something even better!

Data we are working with

The sample dataset comes from the City of Austin (Texas) open data portal. Quick shoutout to Two Sigma’s Data Clinic for creating scout, making it very easy for me to find my sample dataset. For my tips, I slightly modified how the data looks, but you can see the whole dataset and original format on the website.

Scenarios

Every moment I am trying something new is because some outside force has inspired it. Here are some pretty real situations for me that you as an analyst or R learner have probably found yourself in once or twice before 🙃

I have created a github repo to hold all the files so you can try it out for yourself.

Scenario #1

Someone sent you an excel workbook with multiple sheets, each with the same format of data (i.e., same columns in each). You need to take all the data and put it into one data frame to work with for your project, and you’re not interested in figuring out how to work with lists. You also need to know what sheet the data came from for your analysis.

The before times

One way you could tackle this problem is by reading each sheet as a data frame, creating a variable with the sheet name, and then binding all the data frames together.

# Creating a df reading the same excel file multiple times
# Reading each individual sheet separately
# Creating a new column in each called "sheet_name"
july2022 <- read_excel("Data/sample_animal_shelter_data.xlsx", sheet =  "July 2022") |>
  mutate(sheet_name = "July 2022")
august2022 <- read_excel("Data/sample_animal_shelter_data.xlsx", sheet =  "August 2022") |>
  mutate(sheet_name = "August 2022")
september2022 <- read_excel("Data/sample_animal_shelter_data.xlsx", sheet =  "September 2022") |>
  mutate(sheet_name = "September 2022")

# Bindng all the sheets together into one new dataframe
before_times_complete_data <- bind_rows(july2022, august2022, september2022)

And this probably would be fine if it is just a couple of sheets, but it becomes a little unruly when you have so many sheets in one file. I had 15 sheets to deal with in one excel workbook recently, so I wasn’t about to do that. And handy-dandy Google helped me out with finding this solution ☺️

Any time there is an instance that you are copying and pasting syntax and slightly modifying it is probably an opportunity to iterate and do something more efficiently.

The after times


# Getting all the sheet names from the excel workbook
sheets <- excel_sheets("Data/sample_animal_shelter_data.xlsx")

# Iterating the process above through map_df by getting the path with the sheet name and creating the new column all at once
after_times_complete_data <- purrr::map_df(sheets, ~dplyr::mutate(readxl::read_excel("Data/sample_animal_shelter_data.xlsx", sheet = .x),
                                                   sheet_name = .x))

I am going from 7 to 2 lines of code to get the same outcome - success!

Scenario #2

You have some text or value you need to write out in multiple places in your code. The value will probably need to change a few times as you work through your code for your results. And you are not sure if you want to create a parameterized report, but you are testing out the waters.

For this scenario, you could write out the text value in all the places, and any time you need to change it, you can scan your code and manually make the change or do a “find all and replace” situation.

The before times


version_a_data_edits <- after_times_complete_data |>
  janitor::clean_names() |> # adding this just so that column names are in snake case
  mutate(status = case_when(date_of_birth < ymd(20220715) ~ "Too soon", 
                            date_of_birth > ymd(20220715) ~ "Too late", 
                            date_of_birth == ymd(20220715) ~ "Winner"), 
         extract_date = ymd(20220715))

As you are working, that could get you where you need to go, but you could also create a value that you change once that replaces all the other text that needs to be changed.

The after times

Here you are adding one more line of code, but future you will appreciate not having the headache of finding all the spots where you need to change the text.

# This pauses your code to input a value. The prompt is whatever text you want to pop up in your console
date_for_processing <- readline(prompt="Enter the winning date in ymd format with no dashes (e.g., 20220715): ")

# The same code as before, but not with the value you just created instead of the actual text 
version_b_data_edits <- after_times_complete_data |>
  janitor::clean_names() |> # adding this just so that column names are in snake case
  mutate(status = case_when(date_of_birth < ymd(date_for_processing) ~ "Too soon", 
                            date_of_birth > ymd(date_for_processing) ~ "Too late", 
                            date_of_birth == ymd(date_for_processing) ~ "Winner"), 
         extract_date = ymd(date_for_processing))

The downside of this process is that you don’t know what the value was unless you add a line to print the value or check your environment to see what was the last input. An even better way to do this would be by adding a parameter field at the top of your file (but we will save that journey for another day).

Scenario #3

You have created a beautiful report in HTML format, but you need to provide PDF and Word documents to other folks. The folks in question like your HTML document but have trouble accessing it, and it would be simpler if they just had a simple static PDF or Word document.

The before times

Before the magic of Quarto, I was strictly working in Rmarkdowns. I love having code in chunks and being able to produce HTML documents. But once someone needed something from me where I couldn’t just share an HTML file, and they wanted it formatted in a PDF document, it was torture. The errors were endless, with LaTex issues galore! 😩 And the word documents would never render quite right. So, I would do my analysis in Rmarkdown, copy necessary pieces into a word document, do some formatting and then save it as a PDF. 😓

The after times

Now with Quarto, I don’t have this struggle anymore! 🤩 I can start with whatever format I want, and if it needs to be something else, I can go to the terminal and write out commands to take my quarto file and render it to whatever format I need.

You can try this out by using scenario_2.qmd file. In RStudio, find your terminal (might be a tab right next to your console), and type either of the following:

# for a pdf doc
quarto render scenario_2.qmd --to pdf

# for a word doc
quarto render scenario_2.qmd --to docx

See you next time!

It is a crazy time to be alive, people! I hope these mini tips make your work a little easier (or at least inspire you to find more efficient ways to do your projects).