I am reading a csv file "dopers" in R.

dopers <- read.csv(file = "generalDoping_alldata2.csv", header = TRUE, sep = ",")

After reading the file, I have to do some data cleanup. For instance in the country column if it says

"United States" or "United State"

I would like to replace it with "USA"

I want to make sure that my code still works if the value is " United States " or "United State ". In other words, even if there are extra characters before or after "United States", it should be replaced with "USA". I understand that the sub() function can be used for this. I was looking online and found the following, but I do not understand what "^", "$", "*" and "." do. Can someone please explain?

dopers$Country = sub("^UNITED STATES.*$", "USA", dopers$Country)

Best Answer


Given your examples,

s <- c(" United States", " United States ", "United States ")

You can define a regular expression pattern that matches them by

pat <- "^.*United State.*$"

Here, ^ marks the beginning of the string and $ the end, while . stands for any single character and * means zero or more repetitions of the preceding element. Together, .* matches any run of characters, including an empty one. You can experiment with modified patterns, such as

pat <- "^[ ]*United States?[ ]*$" # ignores only surrounding spaces; s? makes the final "s" optional
pat <- "^.*(United State|USA).*$" # also matches " USA" etc.
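
To see what each metacharacter contributes, it can help to test a few patterns with grepl(), which returns TRUE wherever a pattern matches. This is only a small sketch; the extra "Canada" entry is an assumed example value added for contrast, not something from your data.

```r
s <- c(" United States", " United States ", "United States ", "Canada")

# No anchors: the text may appear anywhere in the string
anywhere <- grepl("United State", s)
# TRUE TRUE TRUE FALSE

# ^ pins the match to the start, so a leading space prevents a match
at_start <- grepl("^United State", s)
# FALSE FALSE TRUE FALSE

# .* after ^ absorbs any leading characters (including none)
any_prefix <- grepl("^.*United State", s)
# TRUE TRUE TRUE FALSE

# $ pins the match to the end; the trailing "s" in "States" prevents a match
at_end <- grepl("United State$", s)
# FALSE FALSE FALSE FALSE
```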

The substitution is then performed by

gsub(pat, "USA", s)
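
Applied to a data frame column, the same idea might look like the sketch below. The small dopers data frame here is a hypothetical stand-in for your CSV, and combining trimws() with ignore.case = TRUE is just one possible approach, not the only one.

```r
# Hypothetical stand-in for the data read from the CSV file
dopers <- data.frame(
  Country = c(" United States", "United State ", "UNITED STATES", "Canada"),
  stringsAsFactors = FALSE
)

# trimws() strips leading and trailing whitespace (base R >= 3.2.0),
# so the pattern itself no longer needs to allow for spaces.
# "States?" makes the final "s" optional; ignore.case = TRUE also
# catches upper-case spellings such as "UNITED STATES".
pat <- "^United States?$"
dopers$Country <- sub(pat, "USA", trimws(dopers$Country), ignore.case = TRUE)

dopers$Country
# [1] "USA"    "USA"    "USA"    "Canada"
```

Note that trimws() is applied to the whole column here, so stray spaces around other country names are removed as well. Because the pattern is anchored with ^ and $, it can match at most once per string, so sub() and gsub() give the same result.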