I want to understand conceptually why str(Series).split() and Series.str.split() give different output when applied to a Series object. I was trying to split the dates on punctuation: str(Series).split() didn't give me the desired output, while Series.str.split() did, but I have heard that using the [dot] accessor is frowned upon. I've searched the net and did not find a satisfactory answer. Can anyone shed some light on this? I was using the following Series object.
df5 = pd.Series(["04/20/2009", "04/20/09", "4/20/09", "4/3/09", "Mar-20-2009", "Mar 20, 2009", "March 20, 2009", "Mar. 20, 2009", "Mar 20 2009", "20 Mar 2009","20 March 2009", "20 Mar. 2009", "20 March, 2009", "Mar 20th, 2009", "Mar 21st, 2009", "Mar 22nd, 2009", "Feb 2009", "Sep 2009", "Oct 2010", "6/2008","12/2009", "2009", "2010"])
Answer
str(series).split() first converts the whole Series object into a single string, namely its printed representation (index labels, values and the trailing dtype line), and then splits that one string on the given delimiter. Since no delimiter is passed here, it splits on whitespace.
On the other hand, series.str.split() works like mapping the split function over each string in the Series, which gives you a new Series holding a list of strings for each string in the original.
See the official documentation for series.str.split() for more info.
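As a rough illustration of the difference, here is a minimal sketch on a shortened version of the series from the question. The variable names (dates, tokens, per_element, by_slash) are only chosen for this example.

```python
import pandas as pd

# A shortened version of the series from the question.
dates = pd.Series(["04/20/2009", "Mar 20, 2009", "20 March 2009"])

# str(series) renders the whole Series as ONE string: index labels,
# values and the trailing "dtype: ..." line. Calling .split() on it is
# plain Python str.split, so it cuts that text on whitespace.
tokens = str(dates).split()

# series.str.split() applies split to each element separately and
# returns a new Series whose values are lists of strings.
per_element = dates.str.split()

# A delimiter can be passed to split each element on punctuation.
by_slash = dates.str.split("/")

print(tokens)       # one flat list, mixed with index labels and dtype text
print(per_element)  # a Series with one list per original string
print(by_slash[0])
```

The first result is a single flat Python list polluted with index labels and the dtype line, which is why it is not useful for splitting dates. The second and third keep one result per row.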
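Also, the dot operator is generally frowned upon only when it is used to access a DataFrame column (as in df.column_name), because it won't work if the column name contains whitespace. The .str in series.str.split() is pandas' string accessor, not column access, so that concern does not apply to it.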
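To make the distinction concrete, here is a small sketch. The DataFrame and its column names ("date" and "raw date") are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical frame: one column name is a valid identifier, one has a space.
df = pd.DataFrame({
    "date": ["04/20/2009", "4/3/09"],
    "raw date": ["Mar 20, 2009", "Feb 2009"],
})

# Dot access works for "date" and returns the same column as brackets.
same = df.date.equals(df["date"])

# "raw date" cannot be written as df.raw date (a syntax error),
# so bracket access is the only option for it.
spaced = df["raw date"]

# .str here is the string accessor on a Series, not column access.
parts = df["date"].str.split("/")

print(same)
print(spaced.tolist())
print(parts.tolist())
```

Using brackets for columns and .str for string methods keeps the two uses of the dot clearly apart.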