Package: tidyr


Function: separate()


  • Note: Since writing this, tidyr::separate() has been superseded by separate_wider_position() and separate_wider_delim(). See this link for more info.


1. Split a character variable (name) into first name and last name

Review the data (d1)

# A tibble: 3 × 3
  name            age test_score
  <chr>         <dbl>      <dbl>
1 Marge Simpson    30        3.6
2 Homer Simpson    34        8.5
3 Ned Flanders     35       NA  

Split name into two variables

  • Note: The values for the into argument need to be provided as a character vector.

  • Note: The original variable will be removed unless you add the argument remove=FALSE

  • Note: If no separator is indicated, the function will split values wherever a non-alphanumeric character appears (for example, a space). If you want to be clear where to separate, add the argument sep= and insert a regular expression such as sep=” “.

d1 %>% 
  tidyr::separate(name, into=c("first_name", "last_name"))
# A tibble: 3 × 4
  first_name last_name   age test_score
  <chr>      <chr>     <dbl>      <dbl>
1 Marge      Simpson      30        3.6
2 Homer      Simpson      34        8.5
3 Ned        Flanders     35       NA  
  • Note: The number of new variables must equal the number of pieces you are separating. So for example if some names had a middle initial, we would only receive the first name and middle initial for those names
# A tibble: 3 × 3
  name               age test_score
  <chr>            <dbl>      <dbl>
1 Marge Simpson       30        3.6
2 Homer J. Simpson    34        8.5
3 Ned Flanders        35       NA  
d2 %>% 
  tidyr::separate(name, into = c("first_name", "last_name"))
# A tibble: 3 × 4
  first_name last_name   age test_score
  <chr>      <chr>     <dbl>      <dbl>
1 Marge      Simpson      30        3.6
2 Homer      J            34        8.5
3 Ned        Flanders     35       NA  

You can make decisions on how to handle these anomalies by adding the argument extra= and add the option “warn”, “drop” or “merge”, with the default being “warn”. To learn more type ?separate in the console.

Here is an example using extra=“merge”

d2 %>% 
  tidyr::separate(name, into = c("first_name", "last_name"), extra="merge")
# A tibble: 3 × 4
  first_name last_name    age test_score
  <chr>      <chr>      <dbl>      <dbl>
1 Marge      Simpson       30        3.6
2 Homer      J. Simpson    34        8.5
3 Ned        Flanders      35       NA  
  • Note: Similarly since the number of new variables must equal the number of pieces you are separating, if some names only have a first or last name, you will have missing data in your new variables.
# A tibble: 3 × 3
  name            age test_score
  <chr>         <dbl>      <dbl>
1 Marge Simpson    30        3.6
2 Simpson          34        8.5
3 Ned Flanders     35       NA  
d3 %>% 
  tidyr::separate(name, into = c("first_name", "last_name"))
# A tibble: 3 × 4
  first_name last_name   age test_score
  <chr>      <chr>     <dbl>      <dbl>
1 Marge      Simpson      30        3.6
2 Simpson    <NA>         34        8.5
3 Ned        Flanders     35       NA  

You can make decisions on how to handle these anomalies by adding the argument fill= and add the option “warn”, “right” or “left”, with the default being “warn”. To learn more type ?separate in the console.

Here is an example using fill=“left”

d3 %>% 
  tidyr::separate(name, into = c("first_name", "last_name"), fill="left")
# A tibble: 3 × 4
  first_name last_name   age test_score
  <chr>      <chr>     <dbl>      <dbl>
1 Marge      Simpson      30        3.6
2 <NA>       Simpson      34        8.5
3 Ned        Flanders     35       NA  

2. Split only some of the cells in a character column (school_name) and keep only one column of the split

Review the data (d6)

# A tibble: 4 × 2
  school_name                  city          
  <chr>                        <chr>         
1 David H. Hickman High School Columbia      
2 Muriel W. Battle High School Columbia      
3 Rock Bridge Sr. High         Columbia      
4 Webster Groves High School   Webster Groves

In this example, I want to clean up the school names by removing any first name and middle initials from school names. Most schools only go by the last name. So I want to remove David H. and Muriel W.

With only a few values, I could easily do a recode. However, if there were hundreds of schools with similar first name and middle initial patterns, that would take too long. So here we use tidyr::separate().

  • Note: We use regular expressions (regex) here to denote we are separating on any capitalized letter “[A-Z]” followed by a period “\.” and last followed by a white space “\s”. Notice the use of “\” as an escape to tell the regex to match the period exactly, rather than using it for its special behavior which is to match any character. You can also put the period in [] as an escape used to match characters within the brackets, similar to the “[A-Z]” above.

  • Note: I added the argument fill=“left” to bring any value that did not separate into our new school_name column from the left side to the right column “school_name”.

d6 %>% 
  tidyr::separate(school_name, sep = "[A-Z]\\.\\s", into = c("name1", "school_name"), fill="left")
# A tibble: 4 × 3
  name1     school_name                city          
  <chr>     <chr>                      <chr>         
1 "David "  Hickman High School        Columbia      
2 "Muriel " Battle High School         Columbia      
3  <NA>     Rock Bridge Sr. High       Columbia      
4  <NA>     Webster Groves High School Webster Groves

Notice now that we have an extra column name1 that I don’t want. I could either remove this using a dplyr::select() statement OR I could name this “name1” column NA in my function above and the column would drop automatically.

d6 %>% 
  tidyr::separate(school_name, sep = "[A-Z]\\.\\s", into = c(NA, "school_name"), fill="left")
# A tibble: 4 × 2
  school_name                city          
  <chr>                      <chr>         
1 Hickman High School        Columbia      
2 Battle High School         Columbia      
3 Rock Bridge Sr. High       Columbia      
4 Webster Groves High School Webster Groves

3. Split a numeric variable (stid_schid) into state_id and school_id

Review the data (d7)

# A tibble: 3 × 2
  stid_schid students
       <dbl>    <dbl>
1     123456      300
2     123678      340
3     123789      351

Split stid_schid into 2 variables

  • Note: If working with numeric columns you can separate based on character position.
d7 %>% 
  tidyr::separate(stid_schid, sep = 3, into=c("state_id", "school_id"))
# A tibble: 3 × 3
  state_id school_id students
  <chr>    <chr>        <dbl>
1 123      456            300
2 123      678            340
3 123      789            351

Function: separate_wider_delim


1. Split a variable into multiple variables at the commas

Review the data (d8)

# A tibble: 3 × 2
      id var_a   
   <dbl> <chr>   
1 123456 a,b,c, x
2 123678 d,e     
3 123789 y, z    

In this case, as is seen in Qualtrics exports, when there is a multi-select option item, all responses will export into the same column, with each selected option separated by a comma (with no space between options). If there is a space after a comma, that means it is part of the response selection (and should not be split). So in this example, I only want to split values into a new column if there is no space after the comma.

To do this, we can use tidyr::separate_wider_delim, along with regex, to get the pattern we need. I also don’t know how many new columns we will get, so I will use the arguments names_sep and too_few to not predefine my columns.

d8 %>%
  tidyr::separate_wider_delim(var_a, 
                              delim = stringr::regex(",(?!\\s)"),
                              names_sep = "",
                              too_few = "align_start")
# A tibble: 3 × 4
      id var_a1 var_a2 var_a3
   <dbl> <chr>  <chr>  <chr> 
1 123456 a      b      c, x  
2 123678 d      e      <NA>  
3 123789 y, z   <NA>   <NA>  

Return to Separate