## Part 2: A look at iterators in Python, using itertools

In my previous post (here), I went into a general discussion on the virtues of using iterators in your code and ran through some beginner-level examples in C++. Here, I am going to extend this introduction to iterators by looking at how they are implemented in Python and how you can use them to improve your code.

As a data scientist working in Python, you are guaranteed to come across various data types stored in a range of containers such as lists, tuples and dictionaries. In Python, these containers are all iterable objects, meaning we can obtain an iterator from them (strings are iterable too, just as they are in C++). When a Python for loop runs over one of these containers, it first calls the built-in **iter()** function to create an iterator and then calls **next()** repeatedly to advance through it [1]. Both built-ins are shown below, where **iter(name)** returns an iterator from a list and **next(it)** advances the iterator so we can print each element.

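```python
name = ["Ciaran", "Cooney"]

it = iter(name)    # create an iterator from the list
print(next(it))    # advance to the first element
print(next(it))    # advance to the second element

# output
# Ciaran
# Cooney
```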
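To make the mechanics concrete, here is a rough sketch of what a for loop does behind the scenes: it calls iter() once and then calls next() until StopIteration is raised. The helper name `manual_for` is hypothetical and exists only for this illustration.

```python
def manual_for(iterable):
    """Collect items in the order a for loop would visit them."""
    items = []
    it = iter(iterable)            # step 1: obtain an iterator
    while True:
        try:
            item = next(it)        # step 2: advance the iterator
        except StopIteration:      # step 3: stop once it is exhausted
            break
        items.append(item)
    return items


print(manual_for(["Ciaran", "Cooney"]))  # ['Ciaran', 'Cooney']
print(manual_for("abc"))                 # ['a', 'b', 'c']
```

The second call works because strings are iterable too, so the same two built-ins apply to them unchanged.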

Python comes with several built-in functions such as **zip** and **map** which facilitate iteration over data containers. These are very useful, time-saving tools once you have developed an intuition for when and how to use them. The zip function effectively calls iter() on each of its input arguments and then uses next() to advance through them in step, returning an iterator that yields tuples of the input elements sharing a common index.

```python
a = zip([1, 2, 3], ['a', 'b', 'c'])
print(list(a))

# output
# [(1, 'a'), (2, 'b'), (3, 'c')]
```
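As a rough illustration of that description, zip can be approximated in a few lines of pure Python. This is a simplified sketch, not the real implementation (the built-in is written in C). The name `simple_zip` is hypothetical, and the function uses `yield`, which turns it into a generator.

```python
def simple_zip(*iterables):
    """Simplified, illustrative stand-in for the built-in zip()."""
    iterators = [iter(it) for it in iterables]   # call iter() on every input
    while iterators:
        row = []
        for it in iterators:
            try:
                row.append(next(it))             # advance each input by one step
            except StopIteration:
                return                           # shortest input is exhausted, so stop
        yield tuple(row)


print(list(simple_zip([1, 2, 3], ['a', 'b', 'c'])))  # [(1, 'a'), (2, 'b'), (3, 'c')]
print(list(simple_zip([1, 2, 3], 'ab')))             # [(1, 'a'), (2, 'b')]
```

The second call shows that, like the built-in, the sketch stops as soon as the shortest input runs out.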

Map applies a function to each element of an iterable before advancing to the next. Here, iter() is called on the second argument, and each call to next() on the resulting map object applies the input function to the next element until the iterator is exhausted.

```python
b = map(len, ['hello', 'world'])
print(list(b))

# output
# [5, 5]
```
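One consequence worth seeing for yourself is that map is lazy: the function only runs when the iterator is advanced, and once the iterator is exhausted it stays empty. The sketch below makes this visible with a small hypothetical helper, `noisy_len`, which records every word it is called on.

```python
calls = []

def noisy_len(word):
    """Like len(), but remember which words were processed."""
    calls.append(word)
    return len(word)


lengths = map(noisy_len, ['hello', 'world'])
print(calls)           # []  (nothing has been computed yet)
print(next(lengths))   # 5
print(calls)           # ['hello']
print(list(lengths))   # [5]  (only the remaining element)
print(list(lengths))   # []   (the iterator is now exhausted)
```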

At some point in the future you may wish to make a class or object of your own into an iterator, perhaps to enhance the performance of a data processing pipeline. To do this, you will need to implement the **__iter__()** and **__next__()** methods. The __iter__() method returns the iterator object and the __next__() method performs the operation for each iteration (here, simply returning the next element). Be careful not to create an iterator that advances forever: here we use an if-else statement and raise **StopIteration** when the iterator has been exhausted.

```python
class MyClass:
    def __init__(self, container):
        self.container = container

    def __iter__(self):
        # reset the position and return the iterator object (this instance)
        self.count = 0
        return self

    def __next__(self):
        if self.count < len(self.container):
            x = self.container[self.count]
            self.count += 1
            return x
        else:
            # signal that the iterator is exhausted
            raise StopIteration


myclass = MyClass(["Hello", "my", "name", "is", "Ciaran"])
myiter = iter(myclass)

for x in myiter:
    print(x)

# output
# Hello
# my
# name
# is
# Ciaran
```
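As an aside, a common alternative is to write __iter__() as a generator method, so that Python builds the iterator and raises StopIteration for you. The sketch below is a hedged variation on the class above rather than a replacement for it, and the class name `MyContainer` is hypothetical.

```python
class MyContainer:
    def __init__(self, container):
        self.container = container

    def __iter__(self):
        # Using yield makes this a generator method: every call to iter()
        # returns a fresh iterator that keeps track of its own position.
        for element in self.container:
            yield element


words = MyContainer(["Hello", "my", "name", "is", "Ciaran"])

first = iter(words)
second = iter(words)
print(next(first))    # Hello
print(next(first))    # my
print(next(second))   # Hello  (independent of the first iterator)
print(list(words))    # ['Hello', 'my', 'name', 'is', 'Ciaran']
```

Because no position is stored on the instance, two loops over the same object do not interfere with each other.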

Although it’s always good to know what’s going on under the hood, the truth is that more often than not your interaction with iterators will be through the built-in functions and the itertools module. Itertools has a lot of great iterator tools, so it is well worth your time to have a rummage through the documentation to see what catches your eye.

One function I like is **dropwhile()**, which makes an iterator that drops elements from an iterable for as long as the predicate is true and then returns every remaining element. **groupby()** is a common iterator algorithm which returns consecutive keys and groups from the iterable. Another useful function in itertools is **permutations()**. As you might have guessed, this one returns permutations of the elements contained within the input iterable. The length of the permutations can be constrained by a second argument, *r* (see code below); otherwise the permutations will be the length of the input iterable. I have coded up some examples of using these functions:

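```python
from itertools import dropwhile, groupby, permutations

# drop elements while the predicate holds, then return everything that follows
print(list(dropwhile(lambda x: x <= 3, [1, 2, 3, 4, 5, 6, 7, 8, 9, 3])))
# output: [4, 5, 6, 7, 8, 9, 3]

# group consecutive equal elements and pair each group with its key
print([[list(g), k] for k, g in groupby([1, 2, 2, 2, 2, 3, 4, 4, 4, 4, 5, 5, 2, 1, 1, 1, 1])])
# output: [[[1], 1], [[2, 2, 2, 2], 2], [[3], 3], [[4, 4, 4, 4], 4], [[5, 5], 5], [[2], 2], [[1, 1, 1, 1], 1]]

# all full-length permutations
print(list(permutations([1, 2, 3])))
# output: [(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)]

# permutations of length r=2
print(list(permutations([1, 2, 3], 2)))
# output: [(1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2)]
```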
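Because groupby() only groups consecutive elements, a usual pattern is to sort the data by the same key first when you want one group per key. Here is a small sketch using a made-up list of words grouped by length.

```python
from itertools import groupby

words = ["apple", "kiwi", "grape", "plum", "fig"]   # illustrative data

# Unsorted input: equal keys that are not adjacent end up in separate groups.
unsorted_groups = [(k, list(g)) for k, g in groupby(words, key=len)]
print(unsorted_groups)
# [(5, ['apple']), (4, ['kiwi']), (5, ['grape']), (4, ['plum']), (3, ['fig'])]

# Sorting by the grouping key first gives one group per key.
sorted_groups = [(k, list(g)) for k, g in groupby(sorted(words, key=len), key=len)]
print(sorted_groups)
# [(3, ['fig']), (4, ['kiwi', 'plum']), (5, ['apple', 'grape'])]
```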

The term “iterator algebra” is used in the itertools documentation to describe the general concept of combining iterator tools to improve overall code efficiency and performance [2]. Combining itertools functions can take a bit of thought at first and can get quite advanced pretty quickly, but for this post I am going to show you one simple example of how using itertools can speed up processing time.

Let’s consider a simple example where we want to take two lists containing positive integers, determine all possible combinations of elements across lists (not within lists) and return the sum of each combination. Below, I have implemented a typical function with a couple of for loops to run over the lists and perform the summing operations.

```python
a = [1, 2, 3]
b = [4, 5, 6]

def sum_combinations(a, b):
    """Return every (i, j) pair across the two lists and the sum of each pair."""
    combinations, results = [], []
    for i in a:
        for j in b:
            combinations.append((i, j))
            results.append(sum((i, j)))
    return combinations, results

combs, res = sum_combinations(a, b)
print(combs, res)

# output
# [(1, 4), (1, 5), (1, 6), (2, 4), (2, 5), (2, 6), (3, 4), (3, 5), (3, 6)]
# [5, 6, 7, 6, 7, 8, 7, 8, 9]
```

This is fine for the 3-element lists I used in this example. But what happens if we expand the inputs to contain 10,000 integers each? To test this, I imported the **time** module to see how long the function would run for on my admittedly less than special laptop:

```python
import time

import numpy as np

# two arrays of 10,000 random integers in the range [0, 5)
a = np.random.randint(5, size=10000)
b = np.random.randint(5, size=10000)

start = time.time()
combs, res = sum_combinations(a, b)
stop = time.time()
print(f"time: {stop-start}")

# output
# time: 108.07000184059143
```

Okay, **108 s** seems like a fairly long time to have to wait for some basic operations. Fortunately, we have an alternative: iterator algebra!

Here I use the itertools function **product()** along with the **map** function mentioned above. product() gives us the Cartesian product of the input iterables, much like a pair of nested for loops, and map then applies the **sum** function to each pair as we iterate through the result.

```python
import itertools

start = time.time()
# sum every pair produced by the Cartesian product of a and b
res_1 = list(map(sum, itertools.product(a, b, repeat=1)))
stop = time.time()
print(f"time: {stop-start}")

# output
# time: 34.44488835334778
```

Look at the time difference here: **108 s** with the standard nested-loop function versus **34 s** with itertools and iterator algebra, roughly a threefold speed-up on my laptop. If you take nothing else from this post, at least notice the potential time gains iterators can offer as the size of your data grows.
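If you would like to try the comparison yourself without waiting minutes, here is a smaller sketch. It rests on a few assumptions of mine: plain Python lists of 200 random integers instead of NumPy arrays, the standard-library `timeit` module instead of `time`, and the hypothetical function names `sum_with_loops` and `sum_with_product`. The loop version returns only the sums so that both functions do the same work. Timings depend heavily on the machine and the input type, so treat any numbers you get as indicative only.

```python
import itertools
import random
import timeit

random.seed(0)                           # reproducible inputs
a = random.choices(range(5), k=200)
b = random.choices(range(5), k=200)

def sum_with_loops(a, b):
    """Nested for loops, collecting the sum of every cross-list pair."""
    results = []
    for i in a:
        for j in b:
            results.append(sum((i, j)))
    return results

def sum_with_product(a, b):
    """The same sums via itertools.product and map."""
    return list(map(sum, itertools.product(a, b)))

# Both approaches visit the pairs in the same order, so the results match.
assert sum_with_loops(a, b) == sum_with_product(a, b)

loops_time = timeit.timeit(lambda: sum_with_loops(a, b), number=20)
product_time = timeit.timeit(lambda: sum_with_product(a, b), number=20)
print(f"nested loops: {loops_time:.3f} s, product + map: {product_time:.3f} s")
```

The equality check also confirms that product() walks the pairs in the same order as the nested loops.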

I hope this post will prove useful for some of you venturing into the world of iterators for more efficient data processing.

All of the Python examples and any additional functions I have used are available here: https://github.com/cfcooney/medium_posts

[1] w3schools.com, “Python Iterators,” *w3schools.com*, 2020. [Online]. Available: https://www.w3schools.com/python/python_iterators.asp.

[2] S. Jaiswal, “Python Iterator Tutorial,” *DataCamp*, 2018. [Online]. Available: https://www.datacamp.com/community/tutorials/python-iterator-tutorial?utm_source=adwords_ppc&utm_campaignid=898687156&utm_adgroupid=48947256715&utm_device=c&utm_keyword=&utm_matchtype=b&utm_network=g&utm_adpostion=&utm_creative=332602034343&utm_targetid=a.