I am working on a project involving presidential candidates and am struggling to clean the data due to the variations in the name spelling and abbreviations. For example, for Hillary Clinton, the data might include:

"clinton", "clinton, hillary rodham", "clinton, hilary”, clinton, hil”

Is there a way to convert all of the names into the format: last name, first initial? (i.e.: clinton, h.)

So far I have converted all of the entries to lowercase and have tried the following for replacement:

for (i in 1:nrow(df)){if ((df$can_name %like% c("clinton, hilary", "hillary", "clinton, hillary rodham", "clinton, hilary"))==TRUE){df$can_name <- ‘clinton, h.'}}

However, I have over 100,000 observations, so this would get extremely tedious to do for every candidate. In the past I have used grep to split up and replace strings, but I am not sure how to deal with middle names and misspellings in this case. I am not very familiar with regex, so any advice would be greatly appreciated!

EDIT:I ended up using a variation of Onyambu's answer. For my final solution, I decided to just use the candidates' last names and was able to replace multiple variations using the | (and) opperator. Here is the code:

sub(".*clinton.*|.*hilary.*|.*hillary.*", "clinton", df$can_name )
3

Best Answer


You can use sub.

 sub(".*Clinton.*","Clinton, h.", df$can_name)

You can use apply functions. They are vectorized and offer much faster way of iterating. You can do something like this:

df <- data.frame(vec = c('c','b','a','clinton', 'clinton, hillary rodham', 'clinton, hilary', 'clinton, hil'))namesp <- c('clinton', 'clinton, hillary rodham', 'clinton, hilary', 'clinton, hil')df$can_name <- sapply(df$vec, function(x) if(x %in% namesp) return('clinton, h.') else return('not found'))print(df)vec can_name1: c not found2: b not found3: a not found4: clinton clinton, h.5: clinton, hillary rodham clinton, h.6: clinton, hilary clinton, h.7: clinton, hil clinton, h.

Save time by creating a "lookup table" with perfect data and merging it by an easily extracted field in your main (messy) dataframe.

# Your Main Data (dirty)df <- tibble(name = c("clinton", "clinton, hillary rodham","clinton, hillary","clinton, hil") )# Your Perfect Lookup Data lookup_table <- tibble(first_name = "Hillary",middle_name = "Rodham",last_name = "Clinton")

Extract something easy from your main data and merge. Always good practice to keep your "raw" data column to spot check for errors.

my.regex = "([a-zA-Z]+)"df %>% mutate(name = str_to_title(name)) %>% extract(name, into = "last_name", my.regex, remove = FALSE) %>% left_join( lookup_table)#> # A tibble: 4 x 4#> name last_name first_name middle_name#> <chr> <chr> <chr> <chr> #> 1 Clinton Clinton Hillary Rodham #> 2 Clinton, Hillary Rodham Clinton Hillary Rodham #> 3 Clinton, Hillary Clinton Hillary Rodham #> 4 Clinton, Hil Clinton Hillary Rodham

This technique assures you of perfect data and is scalable.

Note: The regex above only extracts the last name from the examples given. Not sure whether/how you intend on discriminating between the candidates Bill/Hillary Clinton or Jeb/George/George Bush if given only a single name ("Bush", "Clinton").

Created on 2018-07-28 by the reprex package (v0.2.0).