separate()tidyr::separate() has been
superseded by separate_wider_position() and
separate_wider_delim(). See this link for
more info.1. Split a character variable (name) into first
name and last name
Review the data (d1)
# A tibble: 3 × 3
name age test_score
<chr> <dbl> <dbl>
1 Marge Simpson 30 3.6
2 Homer Simpson 34 8.5
3 Ned Flanders 35 NA
Split name into two variables
Note: The values for the into argument need to be provided as a character vector.
Note: The original variable will be removed unless you add the argument remove=FALSE
Note: If no separator is indicated, the function will split values wherever a non-alphanumeric character appears (for example, a space). If you want to be clear where to separate, add the argument sep= and insert a regular expression such as sep=” “.
d1 %>%
tidyr::separate(name, into=c("first_name", "last_name"))
# A tibble: 3 × 4
first_name last_name age test_score
<chr> <chr> <dbl> <dbl>
1 Marge Simpson 30 3.6
2 Homer Simpson 34 8.5
3 Ned Flanders 35 NA
# A tibble: 3 × 3
name age test_score
<chr> <dbl> <dbl>
1 Marge Simpson 30 3.6
2 Homer J. Simpson 34 8.5
3 Ned Flanders 35 NA
d2 %>%
tidyr::separate(name, into = c("first_name", "last_name"))
# A tibble: 3 × 4
first_name last_name age test_score
<chr> <chr> <dbl> <dbl>
1 Marge Simpson 30 3.6
2 Homer J 34 8.5
3 Ned Flanders 35 NA
You can make decisions on how to handle these anomalies by adding the argument extra= and add the option “warn”, “drop” or “merge”, with the default being “warn”. To learn more type ?separate in the console.
Here is an example using extra=“merge”
d2 %>%
tidyr::separate(name, into = c("first_name", "last_name"), extra="merge")
# A tibble: 3 × 4
first_name last_name age test_score
<chr> <chr> <dbl> <dbl>
1 Marge Simpson 30 3.6
2 Homer J. Simpson 34 8.5
3 Ned Flanders 35 NA
# A tibble: 3 × 3
name age test_score
<chr> <dbl> <dbl>
1 Marge Simpson 30 3.6
2 Simpson 34 8.5
3 Ned Flanders 35 NA
d3 %>%
tidyr::separate(name, into = c("first_name", "last_name"))
# A tibble: 3 × 4
first_name last_name age test_score
<chr> <chr> <dbl> <dbl>
1 Marge Simpson 30 3.6
2 Simpson <NA> 34 8.5
3 Ned Flanders 35 NA
You can make decisions on how to handle these anomalies by adding the argument fill= and add the option “warn”, “right” or “left”, with the default being “warn”. To learn more type ?separate in the console.
Here is an example using fill=“left”
d3 %>%
tidyr::separate(name, into = c("first_name", "last_name"), fill="left")
# A tibble: 3 × 4
first_name last_name age test_score
<chr> <chr> <dbl> <dbl>
1 Marge Simpson 30 3.6
2 <NA> Simpson 34 8.5
3 Ned Flanders 35 NA
2. Split only some of the cells in a character column
(school_name) and keep only one column of the
split
Review the data (d6)
# A tibble: 4 × 2
school_name city
<chr> <chr>
1 David H. Hickman High School Columbia
2 Muriel W. Battle High School Columbia
3 Rock Bridge Sr. High Columbia
4 Webster Groves High School Webster Groves
In this example, I want to clean up the school names by removing any first name and middle initials from school names. Most schools only go by the last name. So I want to remove David H. and Muriel W.
With only a few values, I could easily do a recode. However, if there
were hundreds of schools with similar first name and middle initial
patterns, that would take too long. So here we use
tidyr::separate().
Note: We use regular expressions (regex) here to denote we are separating on any capitalized letter “[A-Z]” followed by a period “\.” and last followed by a white space “\s”. Notice the use of “\” as an escape to tell the regex to match the period exactly, rather than using it for its special behavior which is to match any character. You can also put the period in [] as an escape used to match characters within the brackets, similar to the “[A-Z]” above.
Note: I added the argument fill=“left” to bring any
value that did not separate into our new school_name column
from the left side to the right column “school_name”.
d6 %>%
tidyr::separate(school_name, sep = "[A-Z]\\.\\s", into = c("name1", "school_name"), fill="left")
# A tibble: 4 × 3
name1 school_name city
<chr> <chr> <chr>
1 "David " Hickman High School Columbia
2 "Muriel " Battle High School Columbia
3 <NA> Rock Bridge Sr. High Columbia
4 <NA> Webster Groves High School Webster Groves
Notice now that we have an extra column name1 that I
don’t want. I could either remove this using a
dplyr::select() statement OR I could name this “name1”
column NA in my function above and the column would drop
automatically.
d6 %>%
tidyr::separate(school_name, sep = "[A-Z]\\.\\s", into = c(NA, "school_name"), fill="left")
# A tibble: 4 × 2
school_name city
<chr> <chr>
1 Hickman High School Columbia
2 Battle High School Columbia
3 Rock Bridge Sr. High Columbia
4 Webster Groves High School Webster Groves
3. Split a numeric variable (stid_schid) into
state_id and school_id
Review the data (d7)
# A tibble: 3 × 2
stid_schid students
<dbl> <dbl>
1 123456 300
2 123678 340
3 123789 351
Split stid_schid into 2 variables
d7 %>%
tidyr::separate(stid_schid, sep = 3, into=c("state_id", "school_id"))
# A tibble: 3 × 3
state_id school_id students
<chr> <chr> <dbl>
1 123 456 300
2 123 678 340
3 123 789 351
separate_wider_delim1. Split a variable into multiple variables at the commas
Review the data (d8)
# A tibble: 3 × 2
id var_a
<dbl> <chr>
1 123456 a,b,c, x
2 123678 d,e
3 123789 y, z
In this case, as is seen in Qualtrics exports, when there is a multi-select option item, all responses will export into the same column, with each selected option separated by a comma (with no space between options). If there is a space after a comma, that means it is part of the response selection (and should not be split). So in this example, I only want to split values into a new column if there is no space after the comma.
To do this, we can use tidyr::separate_wider_delim,
along with regex, to get the pattern we need. I also don’t know how many
new columns we will get, so I will use the arguments names_sep
and too_few to not predefine my columns.
d8 %>%
tidyr::separate_wider_delim(var_a,
delim = stringr::regex(",(?!\\s)"),
names_sep = "",
too_few = "align_start")
# A tibble: 3 × 4
id var_a1 var_a2 var_a3
<dbl> <chr> <chr> <chr>
1 123456 a b c, x
2 123678 d e <NA>
3 123789 y, z <NA> <NA>
Return to Separate