I recently saw a question about pandas.DataFrame.apply and realized that when I first started using pandas, I would often reach for apply when a vectorized solution was what I should have been using instead. Let’s say that you have an existing function that takes scalar arguments and calculates the present value of an investment, and you also have a DataFrame of investments, perhaps loaded from a CSV file or a database.
PV = FV / (1 + i) ** n
def present_value(fv, i_rate, n_periods):
    return fv / (1 + i_rate) ** n_periods
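As a quick sanity check of the formula, the sketch below evaluates the function for a single investment. The inputs (1000 due after 12 periods at 5% per period) are illustrative values, chosen to match the first row of the example data used further on.

```python
def present_value(fv, i_rate, n_periods):
    """Discount a future value back n_periods at a per-period rate."""
    return fv / (1 + i_rate) ** n_periods


# Illustrative inputs: 1000 received after 12 periods at 5% per period.
pv = present_value(1000, 0.05, 12)
print(round(pv, 2))  # roughly 556.84
```

Because the body uses only arithmetic operators, nothing in it is specific to scalars, which matters once a vectorized version comes into play.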
If someone has given us this function, we might be tempted to just use it on our data. So here’s what a DataFrame might look like with some values.
import pandas as pd

df = pd.DataFrame(
    [(1000, 0.05, 12), (1000, 0.07, 12), (1000, 0.09, 12), (500, 0.02, 24)],
    columns=['fv', 'i_rate', 'n_periods'],
)
One way to apply a function to a DataFrame is to manually iterate over the rows in the frame and apply the function to each one.
for index, row in df.iterrows():
    df.loc[index, 'pv'] = present_value(row.fv, row.i_rate, row.n_periods)
Another way to reuse that existing function is to use apply on the DataFrame, using axis=1 to apply it to each row (instead of each column).
df['pv'] = df.apply(lambda r: present_value(r['fv'], r['i_rate'], r['n_periods']), axis=1)
The problem with this technique is that it isn’t vectorized. We are forcing the present_value function to be evaluated once for each row in the DataFrame, which will be much more expensive than a similar vectorized solution. In fact, apply even evaluates the function twice on the first row (in the implementation current at the time of writing), since it can choose an optimized path based on the result, so the function being applied should not have side effects.
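One way to see the per-row evaluation for yourself is to record every call the function receives. The sketch below is only an illustration: counted_present_value and call_log are hypothetical names introduced here, and the exact number of calls may depend on the pandas version installed.

```python
import pandas as pd

call_log = []  # hypothetical log, used only for this demonstration


def counted_present_value(fv, i_rate, n_periods):
    """Same formula as present_value, but records each invocation."""
    call_log.append((fv, i_rate, n_periods))
    return fv / (1 + i_rate) ** n_periods


df = pd.DataFrame(
    [(1000, 0.05, 12), (1000, 0.07, 12), (1000, 0.09, 12), (500, 0.02, 24)],
    columns=['fv', 'i_rate', 'n_periods'],
)

pv = df.apply(
    lambda r: counted_present_value(r['fv'], r['i_rate'], r['n_periods']),
    axis=1,
)

# There is at least one Python-level call per row. Some pandas versions
# make an extra call on the first row, so no exact count is assumed here.
print(len(call_log), "calls for", len(df), "rows")
```

The cost of those Python-level calls grows linearly with the number of rows, which is why the approach becomes slow on large frames.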
So in this case, we should consider a vectorized solution.
df['pv2'] = df['fv'] / (1 + df['i_rate']) ** df['n_periods']
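Before trusting the faster version, it is worth confirming that it produces the same numbers. The following is a minimal sketch of such a check. The variable names are assumptions made for illustration, and numpy.allclose is used rather than exact equality because the two paths may differ by floating-point rounding.

```python
import numpy as np
import pandas as pd


def present_value(fv, i_rate, n_periods):
    return fv / (1 + i_rate) ** n_periods


df = pd.DataFrame(
    [(1000, 0.05, 12), (1000, 0.07, 12), (1000, 0.09, 12), (500, 0.02, 24)],
    columns=['fv', 'i_rate', 'n_periods'],
)

# Row-by-row version: one Python-level call per row.
pv_apply = df.apply(
    lambda r: present_value(r['fv'], r['i_rate'], r['n_periods']), axis=1
)

# Vectorized version: arithmetic on whole columns at once.
pv_vectorized = df['fv'] / (1 + df['i_rate']) ** df['n_periods']

# The scalar function uses only arithmetic operators, so passing the
# columns straight in also yields a vectorized computation.
pv_reused = present_value(df['fv'], df['i_rate'], df['n_periods'])

same_as_apply = np.allclose(pv_apply, pv_vectorized)
same_as_reused = np.allclose(pv_reused, pv_vectorized)
print(same_as_apply, same_as_reused)
```

The last variant suggests that an existing scalar function can sometimes be reused unchanged on whole columns, provided its body sticks to operations that pandas Series support elementwise.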
If we time these two versions, we can see the vectorized version is more than twice as fast. Here’s the full result.
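To run a comparison like this on your own machine, a sketch using the standard library's timeit module might look like the following. The frame size (the four example rows repeated 1,000 times), the run count, and the helper names with_apply and vectorized are all assumptions. The ratio you observe will depend on the pandas version, the hardware, and especially the number of rows.

```python
import timeit

import pandas as pd


def present_value(fv, i_rate, n_periods):
    return fv / (1 + i_rate) ** n_periods


# Assumed benchmark data: the four example rows repeated to 4,000 rows.
rows = [(1000, 0.05, 12), (1000, 0.07, 12), (1000, 0.09, 12), (500, 0.02, 24)]
df = pd.DataFrame(rows * 1000, columns=['fv', 'i_rate', 'n_periods'])


def with_apply():
    # One Python-level function call per row.
    return df.apply(
        lambda r: present_value(r['fv'], r['i_rate'], r['n_periods']), axis=1
    )


def vectorized():
    # Whole-column arithmetic.
    return df['fv'] / (1 + df['i_rate']) ** df['n_periods']


# Total seconds for 5 runs of each; absolute numbers vary by machine.
apply_seconds = timeit.timeit(with_apply, number=5)
vectorized_seconds = timeit.timeit(vectorized, number=5)
print(f"apply: {apply_seconds:.4f}s  vectorized: {vectorized_seconds:.4f}s")
```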