Writing efficient code, vectorization, and benchmarking.¶

Objectives:¶

Understand how to use Pandas and Numpy most efficiently.
Relate the notions of efficient programming to [historical benchmarks].
Use vectorized operations to make your code faster.

In [13]:

import pandas as pd
import numpy as np
from math import *
import matplotlib.pyplot as plt

Processing time for array creation.¶

Let's explore different ways of creating an N=1000 element array:

Below are some append() operations we've done many times to list and numpy arrays.

We can determine their runtime using the %%timeit magic variable in Jupyter. %%timeit runs multiple experiments or 'loops' to estimate the mean and standard deviation in the processing time.

In [14]:

%%timeit -n 100
# Appending items to a list array.
N = []
for n in range(1000):
    N.append(n)

32.7 μs ± 10.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Append() operation on a numpy arrays...¶

In [15]:

%%timeit -n 100
# Appending items to a Numpy array
N = np.array([]);
for n in range(1000):
    N = np.append(N,n)

964 μs ± 78 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [16]:

%%timeit -n 100
N = np.arange(0,1000,1)

The slowest run took 10.51 times longer than the fastest. This could mean that an intermediate result is being cached.
1.12 μs ± 1.45 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Results: Clearly, np.arange() is not using a for-loop, at least not in Python language, because this operation is much faster than even the base-python list array loop.

How does the time required for computation scale with the value of N?¶

Instead of using %%timeit, use the time() module to calculate the compute time.
Do this for multiple values of N.
Normalize or divide the compute time by the first value, so the compute time is unitless. This lets us compare with the benchmark curves below.

In [17]:

## Add the compute resources
import time
Ns = np.array([10,100,500,1000])

t = []
for i in Ns:
    ti = time.time()
    N = np.empty([0,i]);
    for n in range(i):
        N = np.append(N,n)
    t.append(time.time()-ti)

# Normalize the time by the initial value to get unitless quantity.
t = np.array(t)/t[0]
print(t)

t2 = []
for i in Ns:
    ti = time.time()
    N = np.arange(0,i,1)
    t2.append(time.time()-ti)
# Normalize the time by the initial value to get unitless quantity.
t2 = np.array(t2)/t2[0]
print(t2)

[ 1.          9.10367893  9.80267559 14.42474916]
[1.         0.29411765 0.17647059 0.52941176]

Algorithm efficiency benchmarks developed by Don Knuth.¶

Make graphs of the Benchmark scaling relationships between N and the compute time.
Compare these curves to the append() results.

In [18]:

N = np.arange(0,1000,1)    # Create an O(n) vector for graphing.
N = np.array(N,ndmin=2).T  
Ologn = np.log(N);         # Create an O(log(n))  vector for graphing.
Onlogn = N*np.log(N);      # Create an O(n*log(n)) vector for graphing.
On2 = N**2;                # Create an O(n*n) vector for graphing.
Oexp = 1.1**N              # Create an O(e^n) vector for graphing.
AN = np.concatenate((N,Ologn,Onlogn,On2,Oexp),axis=1)

plt.figure(figsize=(10,8))
plt.plot(N,AN);
plt.ylim(1e-1,1e5);
plt.yscale('log')
#plt.xscale('log')
    
plt.scatter(Ns,t,s=50,marker='s')
plt.scatter(Ns,t2)
plt.legend(['O(n)','O(logn)','O(nlogn)','O(n^2)','O(e^n)','np.append.loop','np.arange'])

plt.xlabel('N',fontsize=20)
plt.ylabel('Computing resources required',fontsize=20)

#plt.savefig('Compute_resources.png')

/var/folders/xk/37brlxn92jnf5_cft3dkbmbw0000gp/T/ipykernel_92244/1698766593.py:3: RuntimeWarning: divide by zero encountered in log
  Ologn = np.log(N);         # Create an O(log(n))  vector for graphing.
/var/folders/xk/37brlxn92jnf5_cft3dkbmbw0000gp/T/ipykernel_92244/1698766593.py:4: RuntimeWarning: divide by zero encountered in log
  Onlogn = N*np.log(N);      # Create an O(n*log(n)) vector for graphing.
/var/folders/xk/37brlxn92jnf5_cft3dkbmbw0000gp/T/ipykernel_92244/1698766593.py:4: RuntimeWarning: invalid value encountered in multiply
  Onlogn = N*np.log(N);      # Create an O(n*log(n)) vector for graphing.

Out[18]:

Text(0, 0.5, 'Computing resources required')

No description has been provided for this image

Results: The np.append() loop scales somewhere between O(log(n)) and O(n) suggesting it is relatively efficient, even though it is much slower than np.arange(), which actually decreases in compute time for larger values of N. This suggests that np.arange() is actually O(1) or has a constant compute time.

NOTE: The rest of this tutorial is copied directly from Sofia Heisler's excellent Pycon presentation on optimizing Pandas code.

What does efficient code in Pandas and Numpy look like?¶

In [19]:

# Load in the data file for computation
df = pd.read_csv('new_york_hotels.csv', encoding='cp1252')

In [20]:

df.head()

Out[20]:

	ean_hotel_id	name	address1	city	state_province	postal_code	latitude	longitude	star_rating	high_rate	low_rate
0	269955	Hilton Garden Inn Albany/SUNY Area	1389 Washington Ave	Albany	NY	12206	42.68751	-73.81643	3.0	154.0272	124.0216
1	113431	Courtyard by Marriott Albany Thruway	1455 Washington Avenue	Albany	NY	12206	42.68971	-73.82021	3.0	179.0100	134.0000
2	108151	Radisson Hotel Albany	205 Wolf Rd	Albany	NY	12205	42.72410	-73.79822	3.0	134.1700	84.1600
3	254756	Hilton Garden Inn Albany Medical Center	62 New Scotland Ave	Albany	NY	12208	42.65157	-73.77638	3.0	308.2807	228.4597
4	198232	CrestHill Suites SUNY University Albany	1415 Washington Avenue	Albany	NY	12206	42.68873	-73.81854	3.0	169.3900	89.3900

Haversine definition¶

Haversine function computes the great circle distance between two lat/lon locations on the globe.

In [21]:

def haversine(lat1, lon1, lat2, lon2):
    miles_constant = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1 
    dlon = lon2 - lon1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    mi = miles_constant * c
    return mi

1. Crudest looping¶

Using the for loop to iteratively access row-wise elements in the Pandas Dataframe does not take advantage of the vectorized capabilities that Pandas can achieve (through Numpy).

In [22]:

# Define a function to manually loop over all rows and return a series of distances
def haversine_looping(df):
    distance_list = []
    for i in range(0, len(df)):
        d = haversine(40.671, -73.985, df.iloc[i]['latitude'], df.iloc[i]['longitude'])
        distance_list.append(d)
    return distance_list

In [23]:

%%timeit
# Run the haversine looping function
df['distance'] = haversine_looping(df)

47.7 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

2. Iterrows Haversine¶

Pandas offers an row-wise iterator that is faster than indexing via a loop.

In [24]:

%%timeit
# Haversine applied on rows via iteration
haversine_series = []
for index, row in df.iterrows():
    haversine_series.append(haversine(40.671, -73.985,\
                                      row['latitude'], row['longitude']))
df['distance'] = haversine_series

24.8 ms ± 333 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

3. `Apply()` Haversine on rows¶

In [25]:

%timeit df['distance'] =\
df.apply(lambda row: haversine(40.671, -73.985,\
                               row['latitude'], row['longitude']), axis=1)

10.2 ms ± 136 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

4. Vectorized implementation of Haversine applied on Pandas series¶

In [26]:

# Vectorized implementation of Haversine applied on Pandas series
%timeit df['distance'] = haversine(40.671, -73.985,\
                                   df['latitude'], df['longitude'])

333 μs ± 4.37 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

5. Compute Haversine on rows in `Numpy` instead of `Pandas`¶

In [27]:

DF = df.iloc[:,[6,7]].values  # Convert Pandas columns into a Numpy array

%timeit [haversine(40.671, -73.985,row[0], row[1]) for row in DF]

5.87 ms ± 78.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

6. Vectorized implementation of Haversine applied on NumPy arrays¶

In [28]:

np_lat = df['latitude'].values
np_lon = df['longitude'].values

# Vectorized implementation of Haversine applied on NumPy arrays
%timeit df['distance'] = haversine(40.671, -73.985,\
                         df['latitude'].values, df['longitude'].values)

49 μs ± 627 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Cythonize that loop¶

Future: The fastest python code uses precompiled C code, which is available through the Cython module.

7. In-class assignment: Benchmark and graph haversine methods 1, 3 and 6.¶

Evaluate the haversine functions using N = 10, 100, 500, and 1000 as in the example above. Hint: subindex df to pass only N values, instead of the full array (which is 1600).
Use the ti = time.time(); t = time.time()-ti; code to capture the compute time for graphing.
Normalize t as above.
Recreate the graph above showing the curves for algorithm performance.
Graph the output as a function of N's, as above.

8. What to turn in?¶

.Png or .jpg image of the benchmark graph with your results from haversine methods 1, 3, and 6 included.

In [ ]: