The journey from STATA to R

My opinions on the worth of transitioning projects to open source code

By Millie O. Symns in R Python Reflections

April 26, 2022

Once upon a time, I was an avid SPSS user and then a STATA user. It was between 2015 and 2017, in grad school learning statistics and research methods. These were the only statistical tools that professors introduced to me at the time. This makes sense – that is what they used in their labs for academic research and for publications in peer-reviewed social science journals. Why would there be any other option out there? It wasn’t until my first job out of grad school that I learned that there is a whole world of possibilities in the open-source space.

The R logo with pink hearts on the side.

Artwork by Allison Horst

I was pretty reluctant at first. I worked on and led interdisciplinary teams that had projects in SPSS, STATA, and R. I was already struggling to get used to STATA, the tool I had the least experience with at the time, to do the job. Why would I add another language to that?! I was also trying to get familiar enough with SQL to query data. It felt like a lot, but as time went on, by making small opportunities to work in R with teammates who only worked in that language, I began to see the benefits and the potential for projects, even outside of data analysis.

It wasn’t all rainbows and sunshine, though, don’t get me wrong. It was indeed a battle some days, but using R in RStudio became my go-to and the only tool I was interested in using for projects moving forward. And so it went for the rest of the office, with research analysts learning on the job and transitioning their projects to R.

Everyone in the office was working at a different pace on this transition, depending on the projects in their portfolios. Some elected not to transition their projects because of time or other priorities. So naturally, with a good mix of people working on teams with various projects, there was going to come a time when I needed to translate a process from SPSS and STATA to R.

Here are some of my takeaways and reflections about this time:

Reasons why I made the switch:

Developing in-demand skills: If I ever wanted to work outside of academic research, I needed to learn R or Python since many job descriptions include these languages. Translating the syntax also helped me understand some things in R and the tidyverse. I learned ways we could process data even more efficiently because of the possibilities with R.

Learn something beyond data analysis: It took me a while to get the open-source concept, but once I did, I saw what else I could do. You can do data visualizations, create maps, make art, build a website, create slide decks, etc. These things are not accessible if you only work in SPSS or STATA, let alone if you don’t have a license (or can’t afford one) outside of school or work.

Reasons why your office should transition to open-source languages:

It is the way of the future: More companies and organizations are using open-source languages.

Cheaper and more sustainable: You need to maintain a license with proprietary software. Depending on your needs as an organization, you can install and work in open-source languages for free. Other costs may come into play if you want special servers or IDEs; however, working in R and Python gives you much more flexibility to do your work than other statistical software. The benefits may well outweigh the costs, given how much farther your money goes when you invest time and effort into open-source languages.

Creating reproducible processes and reports: If you tend to do regular reporting (such as monthly or annually), with R or Python you can write reproducible scripts to do your data cleaning and reporting all from one place. You can even automate this process.

A larger pool of future employees: Finding talent is a task, so having more potential highly qualified candidates who work in open-source languages could be a huge bonus.

I can go on, but I feel like you get the point :)
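To make the reproducible-reporting point concrete, here is a minimal sketch in Python. The data and column names (`school`, `tests_administered`) are made up for illustration: the idea is that one script reads the raw rows, aggregates them, and prints a summary table, so the same script can be rerun on each new extract with no manual steps.

```python
import io
import pandas as pd

# Hypothetical raw data standing in for a monthly extract
raw = io.StringIO(
    "school,tests_administered\n"
    "Lincoln High,120\n"
    "Roosevelt Middle,85\n"
    "Lincoln High,60\n"
)
df = pd.read_csv(raw)

# Aggregate to the school level -- rerunning this on a fresh extract
# reproduces the report with no copy-paste or point-and-click steps
summary = df.groupby("school", as_index=False)["tests_administered"].sum()
print(summary.to_string(index=False))
```

In practice you would point the script at a file path instead of an in-memory string, and could schedule it to run on its own.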

An example

Here is a concrete example I can share of translating a data process from STATA to R and Python.

A friend of mine shared this really cool resource on COVID-19 testing in schools (you all should check it out). I noticed that there was some sample STATA syntax to gather all the Excel files and merge them at the school or district level. Being out of higher education (for the time being), I don’t have access to STATA, and I only really use R as my first choice for data cleaning and exploration. I needed to figure out the process outlined in STATA and do the same (or something similar) in R. I added the extra challenge of seeing if I could figure it out in Python, since I had just been learning that language in my fellowship with Data Science for All.

You can find the STATA syntax here: https://www.covidschooldatahub.com/for_researchers

The STATA syntax looks like it does some extra steps to clean or format some columns. However, the overall main idea (from what I can tell) is to get all the files into one dataframe, so that is what I focused on. Of course, there might be better ways to get it done than what I chose in my syntax, but it worked for me. 🙃

Here is what I did in R.


# Calling the libraries needed to run the script below
library(dplyr)
library(readr)

# Creates a prompt for the user to enter the folder path
# When you see it, paste your file path and press Enter to continue running the rest of the script
my_path <- readline(prompt = "Enter path to folder: ")

# List all the CSV files in the folder where the data is located
raw_data <- list.files(path = my_path,  
                       pattern = "*.csv", full.names = TRUE) %>% 
  # Apply read_csv to each file and collect the results in a list
  lapply(function(i){
    # Specifying col_types since each file has a slightly different guess for certain columns; 
    # dates are read as characters to avoid losing data and failing to parse
    read_csv(i, col_types = "ccccccccccccdcccccccccccddddd") 
  }) %>%  
  # Combine the data into one dataframe
  bind_rows() %>% 
  # Get the dates formatted correctly
  mutate(TimePeriodStart = as.Date(TimePeriodStart, "%m/%d/%y"),
         TimePeriodEnd = as.Date(TimePeriodEnd, "%m/%d/%y"))

And here is what I did in Python:


# Necessary packages to import
import glob
import pandas as pd

# Locate all the CSV files and bind them together 
# NOTE: Replace 'yourfilepath' with the file path to the folder
df = pd.concat(map(pd.read_csv, glob.glob('yourfilepath/*.csv')))

# Reformatting date fields
df['TimePeriodStart'] = pd.to_datetime(df['TimePeriodStart'], format = "%m/%d/%y", errors = 'ignore')
df['TimePeriodEnd'] = pd.to_datetime(df['TimePeriodEnd'], format = "%m/%d/%y", errors = 'ignore')
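If you want to sanity-check the glob-and-concat pattern above without the real data, here is a self-contained sketch that writes two tiny synthetic CSVs to a temporary folder (the `district` and `cases` columns are invented for illustration) and stacks them the same way:

```python
import glob
import os
import tempfile
import pandas as pd

# Write two tiny synthetic CSVs to a temporary folder
tmpdir = tempfile.mkdtemp()
pd.DataFrame({"district": ["A"], "cases": [3]}).to_csv(
    os.path.join(tmpdir, "week1.csv"), index=False)
pd.DataFrame({"district": ["B"], "cases": [5]}).to_csv(
    os.path.join(tmpdir, "week2.csv"), index=False)

# Same pattern as the snippet above: find every CSV and stack the rows
combined = pd.concat(
    map(pd.read_csv, glob.glob(os.path.join(tmpdir, "*.csv"))),
    ignore_index=True,
)
print(combined.shape)
```

One row per file goes in, so the combined dataframe should have two rows and two columns; `ignore_index=True` keeps the row index from repeating across files.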


So for my academics and non-profits out there doing fantastic work, consider this an invitation and a sign to start updating your processes. It will take some time to figure out where to begin and with which projects, but I think it is worth the effort in the long run. 😁
