I recently saw a question about pandas.DataFrame.apply and realized that when I first started using pandas I would often attempt to solve problems with apply when a vectorized solution was what I should have been using instead. Let’s say that you have an existing function to calculate the present value of an investment that takes scalar arguments and you also have a DataFrame of investments, perhaps loaded from a CSV file or database.
PV = FV / (1 + i) ** n
def present_value(fv, i_rate, n_periods):
    return fv / (1 + i_rate) ** n_periods
If someone has given us this function, we might be tempted to just use it on our data. So here’s what a DataFrame might look like with some values.
import pandas as pd
df = pd.DataFrame([(1000, 0.05, 12), (1000, 0.07, 12), (1000, 0.09, 12), (500, 0.02, 24)],
                  columns=['fv', 'i_rate', 'n_periods'])
One way to apply a function to a DataFrame is to manually iterate over the rows in the frame and apply the function to each one.
for (index, row) in df.iterrows():
    df.loc[index, 'pv'] = present_value(row.fv, row.i_rate, row.n_periods)
Another way to reuse that existing function is to use apply on the DataFrame, using axis=1 to apply it to each row (instead of each column).
df['pv'] = df.apply(lambda r: present_value(r['fv'], r['i_rate'], r['n_periods']), axis=1)
The problem with this technique is that it isn’t vectorized. We are going to force the present_value function to be evaluated once for each row in the DataFrame, and this will be much more expensive than a similar vectorized solution. In fact, in some pandas versions (those before 1.1) apply even evaluates the function twice on the first row, since it can choose an optimized path based on the result, so the function being applied should not have side effects.
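To make the per-row cost visible, here is a minimal sketch that counts how many times the function is actually invoked. The `calls` list and the `counted_present_value` wrapper are hypothetical helpers added only for illustration; the exact count may differ between pandas versions, so no specific number is claimed.

```python
import pandas as pd

def present_value(fv, i_rate, n_periods):
    return fv / (1 + i_rate) ** n_periods

df = pd.DataFrame(
    [(1000, 0.05, 12), (1000, 0.07, 12), (1000, 0.09, 12), (500, 0.02, 24)],
    columns=['fv', 'i_rate', 'n_periods'],
)

calls = []  # hypothetical helper: records the index label of every row passed in

def counted_present_value(row):
    calls.append(row.name)  # row.name is the index label of the current row
    return present_value(row['fv'], row['i_rate'], row['n_periods'])

pv = df.apply(counted_present_value, axis=1)

# There is at least one Python-level call per row; some pandas versions
# make an extra call on the first row.
print(len(calls), "calls for", len(df), "rows")
```

The wrapper itself has a side effect (appending to a list), which is exactly the kind of function whose behavior can change if apply calls it more often than expected.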
So in this case, we should consider a vectorized solution.
df['pv2'] = df['fv']/(1 + df['i_rate']) ** df['n_periods']
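Because present_value uses only arithmetic operators, it can often be reused without rewriting it: passing whole columns instead of scalars lets pandas evaluate each operator element-wise. The sketch below assumes the same example data as above and simply checks that the column-wise call agrees with the row-by-row apply; the variable names `pv_columns` and `pv_apply` are illustrative.

```python
import numpy as np
import pandas as pd

def present_value(fv, i_rate, n_periods):
    return fv / (1 + i_rate) ** n_periods

df = pd.DataFrame(
    [(1000, 0.05, 12), (1000, 0.07, 12), (1000, 0.09, 12), (500, 0.02, 24)],
    columns=['fv', 'i_rate', 'n_periods'],
)

# Pass whole columns: each operator inside the function works element-wise on the Series.
pv_columns = present_value(df['fv'], df['i_rate'], df['n_periods'])

# Row-by-row result, for comparison only.
pv_apply = df.apply(
    lambda r: present_value(r['fv'], r['i_rate'], r['n_periods']), axis=1
)

# Compare with a tolerance, since these are floating-point results.
print(np.allclose(pv_columns, pv_apply))
```

This only works when the function body is built from operations that Series support; a function with scalar-only logic such as an `if` on its arguments would need to be rewritten.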
If we time the apply version against the vectorized version, we can see the vectorized version is more than twice as fast. Here’s the full result.
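To measure the difference yourself, a rough sketch using the standard-library timeit module might look like the following. The larger frame `big`, its size, and its random values are assumptions made up for illustration, not the data behind the timing quoted above, and the numbers you get will depend on your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

def present_value(fv, i_rate, n_periods):
    return fv / (1 + i_rate) ** n_periods

# Hypothetical larger frame of made-up investments, so the timings are measurable.
rng = np.random.default_rng(0)
n_rows = 10_000
big = pd.DataFrame({
    'fv': rng.integers(500, 5000, size=n_rows),
    'i_rate': rng.uniform(0.01, 0.10, size=n_rows),
    'n_periods': rng.integers(1, 30, size=n_rows),
})

def with_apply():
    # One Python-level function call per row.
    return big.apply(
        lambda r: present_value(r['fv'], r['i_rate'], r['n_periods']), axis=1
    )

def vectorized():
    # Whole-column arithmetic, no per-row Python calls.
    return big['fv'] / (1 + big['i_rate']) ** big['n_periods']

t_apply = timeit.timeit(with_apply, number=3)
t_vec = timeit.timeit(vectorized, number=3)
print(f"apply: {t_apply:.4f}s, vectorized: {t_vec:.4f}s")
```

On a frame with only four rows, fixed overhead dominates both approaches, so the gap tends to widen as the number of rows grows.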