I would like to add a cumulative sum column to my Pandas dataframe so that:

    name  day        no
    Jack  Monday     10
    Jack  Tuesday    20
    Jack  Tuesday    10
    Jack  Wednesday  50
    Jill  Monday     40
    Jill  Wednesday  110

becomes:

    Jack | Monday    | 10  | 10
    Jack | Tuesday   | 30  | 40
    Jack | Wednesday | 50  | 90
    Jill | Monday    | 40  | 40
    Jill | Wednesday | 110 | 150

I tried various combos of df.groupby and df.agg(lambda x: cumsum(x)) to no avail.
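For anyone who wants to reproduce the answers below, here is a minimal sketch that builds the frame from the table in the question. The column names name, day and no are taken from that table; the variable name df is the one the answers use.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "name": ["Jack", "Jack", "Jack", "Jack", "Jill", "Jill"],
    "day": ["Monday", "Tuesday", "Tuesday", "Wednesday", "Monday", "Wednesday"],
    "no": [10, 20, 10, 50, 40, 110],
})
print(df)
```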


Best Answer


This should do it; you need groupby() twice:

    df.groupby(['name', 'day']).sum() \
      .groupby(level=0).cumsum().reset_index()

Explanation:

    print(df)
       name        day   no
    0  Jack     Monday   10
    1  Jack    Tuesday   20
    2  Jack    Tuesday   10
    3  Jack  Wednesday   50
    4  Jill     Monday   40
    5  Jill  Wednesday  110

    # sum per name/day
    print( df.groupby(['name', 'day']).sum() )
                     no
    name day
    Jack Monday      10
         Tuesday     30
         Wednesday   50
    Jill Monday      40
         Wednesday  110

    # cumulative sum per name/day
    print( df.groupby(['name', 'day']).sum() \
             .groupby(level=0).cumsum() )
                     no
    name day
    Jack Monday      10
         Tuesday     40
         Wednesday   90
    Jill Monday      40
         Wednesday  150

The dataframe resulting from the first sum is indexed by 'name' and by 'day'. You can see it by printing

df.groupby(['name', 'day']).sum().index 

When computing the cumulative sum, you want to do so by 'name', corresponding to the first index (level 0).

Finally, use reset_index to have the names repeated.

    df.groupby(['name', 'day']).sum().groupby(level=0).cumsum().reset_index()

       name        day   no
    0  Jack     Monday   10
    1  Jack    Tuesday   40
    2  Jack  Wednesday   90
    3  Jill     Monday   40
    4  Jill  Wednesday  150
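The two steps above can also be written with named intermediates, which makes it easier to inspect each stage. This is only a sketch of the same idea: the names daily, running and result are mine, and I select the 'no' column explicitly so that other numeric columns (if your frame has any) are not summed.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jack", "Jack", "Jack", "Jack", "Jill", "Jill"],
    "day": ["Monday", "Tuesday", "Tuesday", "Wednesday", "Monday", "Wednesday"],
    "no": [10, 20, 10, 50, 40, 110],
})

# Step 1: one total per (name, day); the result is a Series with a MultiIndex.
daily = df.groupby(["name", "day"])["no"].sum()

# Step 2: running total within each name (the first index level).
running = daily.groupby(level="name").cumsum()

# Step 3: turn the index levels back into ordinary columns.
result = running.reset_index()
print(result)
```

Note that groupby sorts the group keys by default. The day names here happen to sort alphabetically in calendar order; for other labels, passing sort=False to the first groupby keeps groups in order of first appearance.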

Modification to @Dmitry's answer. This is simpler and works in pandas 0.19.0:

    print(df)
       name        day   no
    0  Jack     Monday   10
    1  Jack    Tuesday   20
    2  Jack    Tuesday   10
    3  Jack  Wednesday   50
    4  Jill     Monday   40
    5  Jill  Wednesday  110

    df['no_csum'] = df.groupby(['name'])['no'].cumsum()

    print(df)
       name        day   no  no_csum
    0  Jack     Monday   10       10
    1  Jack    Tuesday   20       30
    2  Jack    Tuesday   10       40
    3  Jack  Wednesday   50       90
    4  Jill     Monday   40       40
    5  Jill  Wednesday  110      150

This works in pandas 0.16.2

    In[23]: print df
       name        day   no
    0  Jack     Monday   10
    1  Jack    Tuesday   20
    2  Jack    Tuesday   10
    3  Jack  Wednesday   50
    4  Jill     Monday   40
    5  Jill  Wednesday  110

    In[24]: df['no_cumulative'] = df.groupby(['name'])['no'].apply(lambda x: x.cumsum())

    In[25]: print df
       name        day   no  no_cumulative
    0  Jack     Monday   10             10
    1  Jack    Tuesday   20             30
    2  Jack    Tuesday   10             40
    3  Jack  Wednesday   50             90
    4  Jill     Monday   40             40
    5  Jill  Wednesday  110            150
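A note for newer pandas: as far as I know, since pandas 2.0 the apply-based form prepends the group key to the result's index, so assigning it straight back to a column can fail with an index mismatch. A minimal sketch of one workaround, assuming the same column names as above, is to pass group_keys=False so the result keeps the original row index:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jack", "Jack", "Jack", "Jack", "Jill", "Jill"],
    "day": ["Monday", "Tuesday", "Tuesday", "Wednesday", "Monday", "Wednesday"],
    "no": [10, 20, 10, 50, 40, 110],
})

# group_keys=False keeps the original row labels on the result,
# so it aligns with df when assigned as a new column.
df["no_cumulative"] = (
    df.groupby("name", group_keys=False)["no"].apply(lambda s: s.cumsum())
)
print(df)
```

Calling .cumsum() directly on the grouped column, as in the answer above this one, avoids the issue altogether.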

For a running total over the whole column (it does not restart for each name), you can use

df['cum_no'] = df.no.cumsum()

http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.DataFrame.cumsum.html
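To make the difference concrete, here is a small sketch comparing a plain column cumsum with a grouped one on the question's data. The column names cum_no_by_name is hypothetical, chosen only for this comparison.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jack", "Jack", "Jack", "Jack", "Jill", "Jill"],
    "day": ["Monday", "Tuesday", "Tuesday", "Wednesday", "Monday", "Wednesday"],
    "no": [10, 20, 10, 50, 40, 110],
})

# Running total over the entire column: Jill's rows continue from Jack's total.
df["cum_no"] = df["no"].cumsum()

# Running total that restarts for each name.
df["cum_no_by_name"] = df.groupby("name")["no"].cumsum()

print(df)
```

On this data the plain cumsum ends at 240 on Jill's last row, whereas the grouped version ends at 150, which is what the question asks for.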

Another way of doing it:

    import pandas as pd

    df = pd.DataFrame({'C1' : ['a','a','a','b','b'],
                       'C2' : [1,2,3,4,5]})
    df['cumsum'] = df.groupby(by=['C1'])['C2'].transform(lambda x: x.cumsum())
    df


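Instead of df.groupby(by=['name','day']).sum().groupby(level=[0]).cumsum() (see above), you could also do df.set_index(['name', 'day']).groupby(level=0, as_index=False).cumsum()

  • df.groupby(by=['name','day']).sum() just moves both columns to a MultiIndex when every (name, day) pair is unique; if a pair is repeated (like Jack's two Tuesday rows in the question), it also sums those rows, whereas set_index keeps them separate
  • as_index=False is meant to save you from calling reset_index afterwards; the pandas documentation states that it has no effect on transformations such as cumsum(), so check the index of the result on your version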
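Here is a sketch that runs both variants on the question's data so you can compare them. The names per_row and per_day are mine; I left out as_index=False since it should not change what cumsum() returns.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jack", "Jack", "Jack", "Jack", "Jill", "Jill"],
    "day": ["Monday", "Tuesday", "Tuesday", "Wednesday", "Monday", "Wednesday"],
    "no": [10, 20, 10, 50, 40, 110],
})

# set_index variant: no aggregation, so both Tuesday rows for Jack survive.
per_row = df.set_index(["name", "day"]).groupby(level=0).cumsum()

# groupby/sum variant: the two Tuesday rows are summed before the cumsum.
per_day = df.groupby(["name", "day"]).sum().groupby(level=0).cumsum()

print(per_row)
print(per_day)
```

Both results are indexed by (name, day), so reset_index() is still needed if you want name and day back as columns.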

data.csv:

    name,day,no
    Jack,Monday,10
    Jack,Tuesday,20
    Jack,Tuesday,10
    Jack,Wednesday,50
    Jill,Monday,40
    Jill,Wednesday,110

Code:

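    import pandas as pd

    df = pd.read_csv('data.csv')
    print(df)

    # one row per (name, day), with the 'no' values summed
    df = df.groupby(['name', 'day'])['no'].sum().reset_index()
    print(df)

    # running total within each name; cumsum() on the grouped column
    # returns values aligned with df's row index
    df['cumsum'] = df.groupby(['name'])['no'].cumsum()
    print(df)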

Output:

       name        day   no
    0  Jack     Monday   10
    1  Jack    Tuesday   20
    2  Jack    Tuesday   10
    3  Jack  Wednesday   50
    4  Jill     Monday   40
    5  Jill  Wednesday  110

       name        day   no
    0  Jack     Monday   10
    1  Jack    Tuesday   30
    2  Jack  Wednesday   50
    3  Jill     Monday   40
    4  Jill  Wednesday  110

       name        day   no  cumsum
    0  Jack     Monday   10      10
    1  Jack    Tuesday   30      40
    2  Jack  Wednesday   50      90
    3  Jill     Monday   40      40
    4  Jill  Wednesday  110     150

As of version 1.0, pandas has a new API for window functions.

Specifically, what was achieved earlier with

df.groupby(['name'])['no'].apply(lambda x: x.cumsum()) 

or

df.set_index(['name', 'day']).groupby(level=0, as_index=False).cumsum()

now becomes

df.groupby(['name'])['no'].expanding().sum()

I find it more intuitive for all window-related functions than groupby+level operations, although learning to use groupby is useful in general.

See the docs: https://pandas.pydata.org/docs/user_guide/window.html
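One thing to watch for with the expanding() form: the result comes back as floats and is indexed by (name, original row label), so it does not drop straight into a new column. A minimal sketch of how it could be aligned, assuming the question's column names (no_csum is just an illustrative column name):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jack", "Jack", "Jack", "Jack", "Jill", "Jill"],
    "day": ["Monday", "Tuesday", "Tuesday", "Wednesday", "Monday", "Wednesday"],
    "no": [10, 20, 10, 50, 40, 110],
})

# Expanding (cumulative) sum within each name.
# The result has a two-level index: (name, original row label).
expanding_sum = df.groupby("name")["no"].expanding().sum()

# Drop the 'name' level so the values align with df's rows,
# and cast back to int because window sums are returned as floats.
df["no_csum"] = expanding_sum.reset_index(level=0, drop=True).astype(int)
print(df)
```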

If you want to write a one-liner (perhaps because you want to pass the methods into a pipeline), first set the as_index parameter of groupby to False so that the aggregation step returns a dataframe, then use assign() to add a new column to it (the cumulative sum for each person).

These chained methods return a new dataframe, so you'll need to assign it to a variable (e.g. agg_df) to be able to use it later on.

    agg_df = (
        # aggregate df by name and day
        df.groupby(['name','day'], as_index=False)['no'].sum()
        .assign(
            # assign the cumulative sum of each name as a new column
            cumulative_sum=lambda x: x.groupby('name')['no'].cumsum()
        )
    )
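If the chain is going into a longer pipeline, the assign step can be wrapped in a small function and passed to pipe(). This is only a sketch: add_cumulative_sum and its group_col/value_col parameters are hypothetical names I made up, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jack", "Jack", "Jack", "Jack", "Jill", "Jill"],
    "day": ["Monday", "Tuesday", "Tuesday", "Wednesday", "Monday", "Wednesday"],
    "no": [10, 20, 10, 50, 40, 110],
})


def add_cumulative_sum(frame, group_col="name", value_col="no"):
    """Return a copy of frame with a per-group running total column."""
    return frame.assign(
        cumulative_sum=lambda x: x.groupby(group_col)[value_col].cumsum()
    )


agg_df = (
    # aggregate by name and day, keeping both as ordinary columns
    df.groupby(["name", "day"], as_index=False)["no"].sum()
    # then add the running total per name
    .pipe(add_cumulative_sum)
)
print(agg_df)
```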
