I have two columns in a Pandas data frame that are dates.
I want to subtract one column from the other and get the difference in number of days as an integer.
A peek at the data:
    df_test.head(10)
    Out[20]:
      First_Date Second Date
    0 2016-02-09  2015-11-19
    1 2016-01-06  2015-11-30
    2        NaT  2015-12-04
    3 2016-01-06  2015-12-08
    4        NaT  2015-12-09
    5 2016-01-07  2015-12-11
    6        NaT  2015-12-12
    7        NaT  2015-12-14
    8 2016-01-06  2015-12-14
    9        NaT  2015-12-15
I have created a new column successfully with the difference:
    df_test['Difference'] = df_test['First_Date'].sub(df_test['Second Date'], axis=0)
    df_test.head()
    Out[22]:
      First_Date Second Date  Difference
    0 2016-02-09  2015-11-19     82 days
    1 2016-01-06  2015-11-30     37 days
    2        NaT  2015-12-04         NaT
    3 2016-01-06  2015-12-08     29 days
    4        NaT  2015-12-09         NaT
However, I am unable to get a numeric version of the result:
    df_test['Difference'] = df_test[['Difference']].apply(pd.to_numeric)
    df_test.head()
    Out[25]:
      First_Date Second Date    Difference
    0 2016-02-09  2015-11-19  7.084800e+15
    1 2016-01-06  2015-11-30  3.196800e+15
    2        NaT  2015-12-04           NaN
    3 2016-01-06  2015-12-08  2.505600e+15
    4        NaT  2015-12-09           NaN
Best Answer
How about:

    df_test['Difference'] = (df_test['First_Date'] - df_test['Second Date']).dt.days

This will return the difference as int if there are no missing values (NaT) and as float if there are.
pandas has rich documentation on Time series / date functionality and Time deltas.
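As a rough illustration of the int-versus-float point above, here is a minimal sketch on made-up sample data (the three rows are hypothetical, only the column names come from the question). It also shows one possible way to keep integers despite missing dates, by casting to the nullable "Int64" dtype; that cast is an addition of this sketch, not part of the answer.

```python
import pandas as pd

# Hypothetical sample data using the question's column names.
df_test = pd.DataFrame({
    "First_Date": pd.to_datetime(["2016-02-09", "2016-01-06", None]),
    "Second Date": pd.to_datetime(["2015-11-19", "2015-11-30", "2015-12-04"]),
})

# Subtracting two datetime columns yields a timedelta column;
# .dt.days extracts the whole-day component of each value.
days = (df_test["First_Date"] - df_test["Second Date"]).dt.days

# Rows with NaT come back as NaN, which forces a float result.
# The nullable "Int64" dtype stores integers and marks missing rows as <NA>.
df_test["Difference"] = days.astype("Int64")
print(df_test)
```

If a plain NumPy integer column is needed instead, the missing rows would have to be dropped or filled first.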
You can divide a column of dtype timedelta by np.timedelta64(1, 'D'), but the output is float, not int, because of the NaN values:
    df_test['Difference'] = df_test['Difference'] / np.timedelta64(1, 'D')
    print(df_test)
      First_Date Second Date  Difference
    0 2016-02-09  2015-11-19        82.0
    1 2016-01-06  2015-11-30        37.0
    2        NaT  2015-12-04         NaN
    3 2016-01-06  2015-12-08        29.0
    4        NaT  2015-12-09         NaN
    5 2016-01-07  2015-12-11        27.0
    6        NaT  2015-12-12         NaN
    7        NaT  2015-12-14         NaN
    8 2016-01-06  2015-12-14        23.0
    9        NaT  2015-12-15         NaN
See Frequency conversion in the pandas Time deltas documentation.
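To make the difference between dividing by one day and taking .dt.days concrete, here is a small sketch on a hypothetical timedelta Series (the values are invented for illustration). It may also help explain the large numbers in the question: they look like nanosecond counts, since 82 days × 86,400 s × 10^9 ns is 7.0848e15.

```python
import numpy as np
import pandas as pd

# Hypothetical timedelta values; the second one includes a half day.
diff = pd.Series(pd.to_timedelta(["82 days", "37 days 12 hours", None]))

# Division by one day keeps the fractional part and returns floats;
# the missing value (NaT) becomes NaN.
as_days = diff / np.timedelta64(1, "D")

# .dt.days keeps only the whole-day component of each timedelta.
whole_days = diff.dt.days

print(as_days.tolist())
print(whole_days.tolist())
```

Which one is appropriate depends on whether partial days should count; for date-only columns the two give the same numbers.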
You can use the datetime module to help here. Also, as a side note, a simple date subtraction works, as shown below:
    import datetime as dt
    import numpy as np
    import pandas as pd

    # Assume we have df_test:
    In [222]: df_test
    Out[222]:
       first_date second_date
    0  2016-01-31  2015-11-19
    1  2016-02-29  2015-11-20
    2  2016-03-31  2015-11-21
    3  2016-04-30  2015-11-22
    4  2016-05-31  2015-11-23
    5  2016-06-30  2015-11-24
    6         NaT  2015-11-25
    7         NaT  2015-11-26
    8  2016-01-31  2015-11-27
    9         NaT  2015-11-28
    10        NaT  2015-11-29
    11        NaT  2015-11-30
    12 2016-04-30  2015-12-01
    13        NaT  2015-12-02
    14        NaT  2015-12-03
    15 2016-04-30  2015-12-04
    16        NaT  2015-12-05
    17        NaT  2015-12-06

    In [223]: df_test['Difference'] = df_test['first_date'] - df_test['second_date']

    In [224]: df_test
    Out[224]:
       first_date second_date  Difference
    0  2016-01-31  2015-11-19     73 days
    1  2016-02-29  2015-11-20    101 days
    2  2016-03-31  2015-11-21    131 days
    3  2016-04-30  2015-11-22    160 days
    4  2016-05-31  2015-11-23    190 days
    5  2016-06-30  2015-11-24    219 days
    6         NaT  2015-11-25         NaT
    7         NaT  2015-11-26         NaT
    8  2016-01-31  2015-11-27     65 days
    9         NaT  2015-11-28         NaT
    10        NaT  2015-11-29         NaT
    11        NaT  2015-11-30         NaT
    12 2016-04-30  2015-12-01    151 days
    13        NaT  2015-12-02         NaT
    14        NaT  2015-12-03         NaT
    15 2016-04-30  2015-12-04    148 days
    16        NaT  2015-12-05         NaT
    17        NaT  2015-12-06         NaT
Now, change the type to datetime.timedelta, and then read the .days attribute on the valid timedelta objects.
    In [226]: df_test['Diffference'] = df_test['Difference'].astype(dt.timedelta).map(lambda x: np.nan if pd.isnull(x) else x.days)

    In [227]: df_test
    Out[227]:
       first_date second_date  Difference  Diffference
    0  2016-01-31  2015-11-19     73 days           73
    1  2016-02-29  2015-11-20    101 days          101
    2  2016-03-31  2015-11-21    131 days          131
    3  2016-04-30  2015-11-22    160 days          160
    4  2016-05-31  2015-11-23    190 days          190
    5  2016-06-30  2015-11-24    219 days          219
    6         NaT  2015-11-25         NaT          NaN
    7         NaT  2015-11-26         NaT          NaN
    8  2016-01-31  2015-11-27     65 days           65
    9         NaT  2015-11-28         NaT          NaN
    10        NaT  2015-11-29         NaT          NaN
    11        NaT  2015-11-30         NaT          NaN
    12 2016-04-30  2015-12-01    151 days          151
    13        NaT  2015-12-02         NaT          NaN
    14        NaT  2015-12-03         NaT          NaN
    15 2016-04-30  2015-12-04    148 days          148
    16        NaT  2015-12-05         NaT          NaN
    17        NaT  2015-12-06         NaT          NaN
Hope that helps.
I feel that the answers above do not handle dates that 'wrap' around a year boundary. Handling this is useful when what matters is how close two dates are by day of year, regardless of the year. To do this as a row operation, I did the following (I used it in a business setting for renewing customer subscriptions).
    def get_date_difference(row, x, y):
        try:
            # Calculating the smallest date difference between the start and the close date
            # There's some tricky logic in here for determining the date difference
            # the other way around (Dec -> Jan is 1 month rather than 11)
            sub_start_date = int(row[x].strftime('%j'))  # day of year (1-366)
            close_date = int(row[y].strftime('%j'))  # day of year (1-366)
            later_date_of_year = max(sub_start_date, close_date)
            earlier_date_of_year = min(sub_start_date, close_date)
            days_diff = later_date_of_year - earlier_date_of_year
            # Calculates the difference going across the next year (December -> Jan)
            days_diff_reversed = (365 - later_date_of_year) + earlier_date_of_year
            return min(days_diff, days_diff_reversed)
        except ValueError:
            return None
Then the function can be applied row by row:

    dfAC_Renew['date_difference'] = dfAC_Renew.apply(get_date_difference, x='customer_since_date', y='renewal_date', axis=1)
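The same wrap-around idea can probably be expressed without a row-wise apply. The following is a sketch only, not the author's code: the function name circular_day_of_year_diff and the sample dates are hypothetical, and it keeps the same fixed 365-day year as the function above, so leap years are handled only approximately.

```python
import numpy as np
import pandas as pd

def circular_day_of_year_diff(a: pd.Series, b: pd.Series, year_length: int = 365) -> pd.Series:
    """Smallest gap in days between two day-of-year values, allowing wrap-around."""
    # Direct gap between the two day-of-year numbers.
    direct = (a.dt.dayofyear - b.dt.dayofyear).abs()
    # Gap going the other way around the calendar (e.g. December to January).
    wrapped = year_length - direct
    # Element-wise minimum; rows with a missing date stay NaN.
    return np.minimum(direct, wrapped)

# Hypothetical sample dates.
start = pd.Series(pd.to_datetime(["2015-12-30", "2016-03-01", None]))
close = pd.Series(pd.to_datetime(["2016-01-02", "2015-02-01", "2015-06-01"]))
result = circular_day_of_year_diff(start, close)
print(result)
```

In the first row, 30 December and 2 January are 362 day-of-year numbers apart directly but only 3 apart across the year boundary, so the smaller value is returned.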
Create a vectorized function:
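    import numpy as np
    from pandas.tseries.frequencies import to_offset

    def calc_xb_minus_xa(df):
        # Map the offset's base name to a NumPy timedelta unit.
        time_dict = {
            '<Minute>': 'm',
            '<Hour>': 'h',
            '<Day>': 'D',
            '<Week>': 'W',
            '<Month>': 'M',
            '<Year>': 'Y',
        }
        # Infer the unit from the difference in the first row.
        time_delta = df.at[df.index[0], 'end_time'] - df.at[df.index[0], 'open_time']
        offset_base_name = str(to_offset(time_delta).base)
        time_term = time_dict.get(offset_base_name)
        # Vectorized: divide the whole timedelta column by one unit.
        result = (df.end_time - df.open_time) / np.timedelta64(1, time_term)
        return result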
Then in your df do:

    df['x'] = calc_xb_minus_xa(df)
This will work for minutes, hours, days, weeks, months and years; the unit is inferred from the difference in the first row. Replace open_time and end_time with the column names in your own df.
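If inferring the unit from the first row is more than you need, a simpler variant is to pass the unit explicitly. This is a sketch under that assumption: the helper name elapsed and the sample timestamps are hypothetical. It divides by pd.Timedelta, which accepts only fixed-length units such as "D", "h" or "s", so months and years are not covered here.

```python
import pandas as pd

def elapsed(df: pd.DataFrame, start: str, end: str, unit: str = "D") -> pd.Series:
    """Return df[end] - df[start] as a float count of the given fixed-length unit."""
    return (df[end] - df[start]) / pd.Timedelta(1, unit=unit)

# Hypothetical sample data using the column names from the answer above.
df = pd.DataFrame({
    "open_time": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 12:00"]),
    "end_time": pd.to_datetime(["2024-01-02 00:00", "2024-01-01 18:00"]),
})
df["days"] = elapsed(df, "open_time", "end_time", unit="D")
df["hours"] = elapsed(df, "open_time", "end_time", unit="h")
print(df)
```

Passing the column names as arguments avoids hard-coding open_time and end_time inside the function.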
What worked perfectly for me is this. I am on pandas version 2.0.2.
    import pandas as pd

    df['new_col'] = (pd.to_datetime(df['col1'])).sub(pd.to_datetime(df['col2'])).dt.days
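When the columns hold strings rather than datetimes, some values may not parse. A minimal sketch of one way to handle that, assuming hypothetical sample data and that turning unparseable values into NaT is acceptable (errors="coerce" is a choice made for this sketch, not something the answer uses):

```python
import pandas as pd

# Hypothetical string columns; one value is not a valid date.
df = pd.DataFrame({
    "col1": ["2016-02-09", "2016-01-06", "not a date"],
    "col2": ["2015-11-19", "2015-11-30", "2015-12-04"],
})

# errors="coerce" converts unparseable strings to NaT instead of raising.
first = pd.to_datetime(df["col1"], errors="coerce")
second = pd.to_datetime(df["col2"], errors="coerce")

# Rows where either date is NaT end up as NaN in the day difference.
df["new_col"] = first.sub(second).dt.days
print(df)
```

Without errors="coerce", pd.to_datetime raises on the invalid string, which may be preferable when bad input should not pass silently.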