Reading files from other software

Keywords: #files

Your colleagues may work with other data analysis software, such as SPSS, Stata, or SAS. Each program has its own format for saving and sharing data sets. For example, in R, we have the RData format. The basic information stored in all these formats is the same. Still, how they organize data sets, the types and classes they use, and how they handle some kinds of information (e.g., dates and positions) can differ considerably.

One way to share data sets between programs is to use an intermediate open format such as CSV, but part of the information can be lost (for example, the levels of a factor). The haven package, which is part of the tidyverse, addresses this problem by reading and writing the other programs' formats directly. Let’s see some examples with the popular data set iris.
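As a first, minimal illustration of the information loss with CSV, the sketch below writes iris to a temporary CSV file and reads it back. The object names (`iris_tbl`, `csv_path`, `iris_from_csv`) are only illustrative, and `col_types = cols()` is used just to let readr guess the column types without printing a message.

```r
library(tidyverse)

iris_tbl <- as_tibble(iris)

# Write to a temporary CSV file and read it back
csv_path <- tempfile(fileext = ".csv")
write_csv(iris_tbl, csv_path)
iris_from_csv <- read_csv(csv_path, col_types = cols())

# Species is a factor before the round trip ...
class(iris_tbl$Species)
# ... and a plain character column afterwards, because CSV
# stores only text and has no notion of factor levels
class(iris_from_csv$Species)
```

The values themselves survive, but the class of the column does not, so the factor would have to be rebuilt by hand after reading.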

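The first step is to load the tidyverse collection of packages and the haven package (haven is installed with the tidyverse but has to be loaded explicitly), and then to store iris as a tibble.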

library(tidyverse)
library(haven)

data(iris)
my_iris <- iris %>% as_tibble()
my_iris
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # … with 140 more rows

Let’s start with R’s own format for a single object (“rds”).

# R
saveRDS(my_iris, file = "iris.rds")
my_new_iris <- readRDS("iris.rds")
my_new_iris
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # … with 140 more rows

The file iris.rds is now in the working directory. We can also see that the data set we have loaded is the same one (with the same features) that we saved. Next, let’s do the same with the standard format of the SAS software.
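Before moving on, the claim that the rds round trip keeps the same features can also be checked programmatically rather than by eye. The following is a small sketch that uses a temporary file instead of the working directory; the object names are illustrative.

```r
library(tibble)

iris_tbl <- as_tibble(iris)

# Save to a temporary rds file and load it again
rds_path <- tempfile(fileext = ".rds")
saveRDS(iris_tbl, rds_path)
iris_back <- readRDS(rds_path)

# Compare the two objects, including column classes and attributes
all.equal(iris_tbl, iris_back)

# The factor and its levels are still there
is.factor(iris_back$Species)
levels(iris_back$Species)
```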

# SAS
write_sas(my_iris, "iris.sas7bdat")
my_new_iris <- read_sas("iris.sas7bdat")

If we run the previous code, we get the following error: “Failed to create column Sepal.Length: A provided name contains an illegal character”. Many programs do not accept variable names that contain the dot symbol (“.”), so we need to replace the dots. Let’s fix the problem and try again to save and read the data set in the SAS format.
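Before renaming, it can help to list which column names are likely to cause trouble. The sketch below flags every name that contains something other than letters, digits, or underscores. That character set is a conservative assumption made for illustration; the exact naming rules differ between programs and versions.

```r
col_names <- names(iris)

# TRUE for names containing any character outside letters, digits, and "_"
# (an assumed, conservative rule, not the official rule of any one program)
has_illegal <- grepl("[^A-Za-z0-9_]", col_names)

# The four measurement columns contain a dot; Species does not
col_names[has_illegal]
```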

my_iris <- my_iris %>% 
  set_names(~ str_to_lower(.) %>% # Convert the names to lower case
              str_replace_all("\\.", "_") ) # Replace '.' with '_'

# SAS
write_sas(my_iris, "iris.sas7bdat")
my_new_iris <- read_sas("iris.sas7bdat")
my_new_iris
## # A tibble: 150 x 5
##    sepal_length sepal_width petal_length petal_width species
##           <dbl>       <dbl>        <dbl>       <dbl>   <dbl>
##  1          5.1         3.5          1.4         0.2       1
##  2          4.9         3            1.4         0.2       1
##  3          4.7         3.2          1.3         0.2       1
##  4          4.6         3.1          1.5         0.2       1
##  5          5           3.6          1.4         0.2       1
##  6          5.4         3.9          1.7         0.4       1
##  7          4.6         3.4          1.4         0.3       1
##  8          5           3.4          1.5         0.2       1
##  9          4.4         2.9          1.4         0.2       1
## 10          4.9         3.1          1.5         0.1       1
## # … with 140 more rows

Now the problem is fixed, and we can check what has happened to the factor variable species. The SAS file has stored only the numeric codes of the factor, not its labels. Finally, we can do the same with the standard formats for SPSS and Stata.
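Before that, one possible workaround for the lost labels is to convert the factor to a character column before exporting, so that the text itself is stored. The sketch below does this and writes the SAS transport (xpt) format with `write_xpt()`. This format is swapped in here because recent haven releases mark `write_sas()` as deprecated, which is worth checking against your installed version. The object names and the temporary file path are illustrative.

```r
library(haven)
library(tibble)

iris_chr <- as_tibble(iris)

# Lower-case names with "_" instead of ".", as in the fix above
names(iris_chr) <- gsub("\\.", "_", tolower(names(iris_chr)))

# Store the species as text instead of as factor codes
iris_chr$species <- as.character(iris_chr$species)

# Write and read the SAS transport format (assumption: xpt is an
# acceptable substitute for sas7bdat in your workflow)
xpt_path <- file.path(tempdir(), "iris.xpt")
write_xpt(iris_chr, xpt_path)
iris_from_xpt <- read_xpt(xpt_path)

# The species column comes back as text
class(iris_from_xpt$species)
```

The trade-off is that the column is no longer a factor in R either, so the ordering of the levels has to be restored after reading if it matters.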

# SPSS
write_sav(my_iris, "iris.sav")
my_new_iris <- read_sav("iris.sav")
my_new_iris
## # A tibble: 150 x 5
##    sepal_length sepal_width petal_length petal_width    species
##           <dbl>       <dbl>        <dbl>       <dbl>  <dbl+lbl>
##  1          5.1         3.5          1.4         0.2 1 [setosa]
##  2          4.9         3            1.4         0.2 1 [setosa]
##  3          4.7         3.2          1.3         0.2 1 [setosa]
##  4          4.6         3.1          1.5         0.2 1 [setosa]
##  5          5           3.6          1.4         0.2 1 [setosa]
##  6          5.4         3.9          1.7         0.4 1 [setosa]
##  7          4.6         3.4          1.4         0.3 1 [setosa]
##  8          5           3.4          1.5         0.2 1 [setosa]
##  9          4.4         2.9          1.4         0.2 1 [setosa]
## 10          4.9         3.1          1.5         0.1 1 [setosa]
## # … with 140 more rows
# Stata
write_dta(my_iris, "iris.dta")
my_new_iris <- read_dta("iris.dta")
my_new_iris
## # A tibble: 150 x 5
##    sepal_length sepal_width petal_length petal_width    species
##           <dbl>       <dbl>        <dbl>       <dbl>  <dbl+lbl>
##  1          5.1         3.5          1.4         0.2 1 [setosa]
##  2          4.9         3            1.4         0.2 1 [setosa]
##  3          4.7         3.2          1.3         0.2 1 [setosa]
##  4          4.6         3.1          1.5         0.2 1 [setosa]
##  5          5           3.6          1.4         0.2 1 [setosa]
##  6          5.4         3.9          1.7         0.4 1 [setosa]
##  7          4.6         3.4          1.4         0.3 1 [setosa]
##  8          5           3.4          1.5         0.2 1 [setosa]
##  9          4.4         2.9          1.4         0.2 1 [setosa]
## 10          4.9         3.1          1.5         0.1 1 [setosa]
## # … with 140 more rows

In the last two cases, the SPSS and Stata formats have stored two pieces of information for the factor variable species: the numeric code and its label.
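Since the labels are kept, the labelled column can be turned back into an ordinary R factor. A minimal sketch with haven's `as_factor()`, using a temporary SPSS file and illustrative object names:

```r
library(haven)
library(tibble)

iris_tbl <- as_tibble(iris)
names(iris_tbl) <- gsub("\\.", "_", tolower(names(iris_tbl)))

# Round trip through the SPSS format
sav_path <- tempfile(fileext = ".sav")
write_sav(iris_tbl, sav_path)
iris_spss <- read_sav(sav_path)

# Convert every labelled column back to a factor, using the stored labels
iris_factor <- as_factor(iris_spss)

class(iris_spss$species)     # labelled numeric column
levels(iris_factor$species)  # factor levels rebuilt from the labels
```

If only the numeric codes are needed, `zap_labels()` drops the labels instead.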