Preamble
Python allows for prototyping to be done extremely rapidly.
Need a common data structure sorted?
>>> sorted([1,3,5,4,2])
RETURNS: [1,2,3,4,5]
Need your list to be unique?
>>> set([1,2,2,3,4])
RETURNS: {1, 2, 3, 4} # convert back to a list by casting: list(set([1,2,2,3,4])); note that sets do not preserve order
Need to count elements?
>>> from collections import Counter
>>> Counter([1,2,3,3,3,3,4])
RETURNS: Counter({3: 4, 1: 1, 2: 1, 4: 1})
# a dictionary whose keys are the unique elements in the list and whose values are their occurrence counts
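Counter also comes with a handy most_common method for pulling out the most frequent elements, and it casts straight back to a plain dict:

```python
from collections import Counter

counts = Counter([1, 2, 3, 3, 3, 3, 4])

# most_common(n) returns the n most frequent elements as (element, count) pairs
print(counts.most_common(1))  # → [(3, 4)]

# converting back to an ordinary dictionary is a simple cast
plain = dict(counts)
```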
We can often abuse lists and dictionaries as the be-all and end-all of our solutions, chasing completion of the problem instead of planning for efficiency and good structure. For example, lists in Python are extremely flexible (a significantly more pleasant form of C++'s std::vector), and we can fall into the (bad) habit of always reaching for lists without considering tuples or deques.
To be fair, oftentimes we are not handling a large enough amount of data to be punished for handling it inefficiently. We get away with using lists (even when we never append to them later, i.e. never actually use their mutability), and that can be perfectly fine in Python's world, where we want that boosted productivity and to avoid wasting thought and time. Not prematurely optimizing can be a good thing (preventing wasted time) and can sometimes be what separates mid-level from senior developers.
However, I would personally say that instinctively choosing an appropriate common/provided data structure (lists, dictionaries, deques, sets, and tuples being some common ones) marks a developer as at least mid-level rather than junior, for two blatant reasons:
- They have a structure in mind, planning ahead for how the variable will be manipulated. They don't rely on step-by-step thinking when tackling a problem; they have a picture of how their program will be modeled.
- They are aware of the structures the language provides, showing better fluency with their craft. That either indicates they were willing to learn beyond just getting a job done (striving to do it as well as it can be done) or that they've been in situations where they had to try other data structures to squeeze out performance.
This applies to the data structures the language provides natively.
When it comes to leveraging more custom classes (like tries), there is definitely a larger hiccup in implementation time, as setup code is needed to create a custom data structure instead of focusing on a simpler algorithm. So it's a little more understandable that developers skirt around these (probably considering them closer to premature optimization).
I suspect that when job requirements ask for Python programmers who have written programs instead of scripts, they're really asking for programmers who maintain and improve upon (first-draft) scripts: not necessarily requiring custom data structures, but at least using properly selected built-in ones.
Now… to find out just how much more efficient the common data structures are at what they do best.
Performance Benchmarking Test Cases
Instead of analyzing the common data structures via disassembly, I'll focus on benchmarks based on timing. I'll recreate the same testing environment I used in my first post (reducing entropy in testing). I intend to benchmark lists, tuples, dicts, deques, and sets. The tests performed are based on common use cases:
We’ll be working with containers that are typically 1,000,000 elements long.
- Access random elements by index (or key)
- Access elements by value (searching)
- Appending elements into the data structure (re-assignment required in tuples)
- Finding the different elements between two structures (of the same type)
The same Timing Decorator will be reused for these tests.
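For completeness, here is a minimal sketch of what such a decorator might look like. The actual time_this from the earlier post may differ; the output format here is an assumption.

```python
import time
from functools import wraps

def time_this(func):
    """Hypothetical stand-in for the earlier post's decorator:
    prints the wall-clock time a single call takes."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__}: {elapsed:.5f} s")
        return result
    return wrapper
```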
Code Used in Test Cases Outlined
Section 1 : Access Random Elements
from random import shuffle
from random import randint
from collections import deque
size = 1_000_000
queries = [randint(0, size-1) for x in range(size)]
num_list = [x for x in range(size)]
shuffle(num_list)
num_tup = tuple(num_list)
num_dict = {i:x for i,x in enumerate(num_list)}
num_deque = deque(num_list)
num_set = set(num_list)
@time_this
def list_access():
    return sum([num_list[query] for query in queries])

@time_this
def tup_access():
    return sum([num_tup[query] for query in queries])

@time_this
def dict_access():
    return sum([num_dict[query] for query in queries])

# sets cannot be accessed by index, so we time membership tests instead
@time_this
def set_access():
    return sum([query if query in num_set else 0 for query in queries])

@time_this
def deque_access():
    return sum([num_deque[query] for query in queries])
list_access()
tup_access()
dict_access()
set_access()
deque_access()
Section 2 : Find elements by value (searching)
from random import shuffle
from random import randint
from collections import deque
size = 1_000_000
queries = [randint(0, size*2) for _ in range(1000)]
num_list = [x for x in range(size)]
shuffle(num_list)
num_tup = tuple(num_list)
num_dict = {x:i for i,x in enumerate(num_list)}
num_deque = deque(num_list)
num_set = set(num_list)
@time_this
def list_search():
    return sum([query if query in num_list else 0 for query in queries])

@time_this
def tup_search():
    return sum([query if query in num_tup else 0 for query in queries])

@time_this
def dict_search():
    total = 0  # avoid shadowing the built-in sum()
    for query in queries:
        try:
            num_dict[query]
            total += query
        except KeyError:
            pass
    return total

@time_this
def set_search():
    return sum([query if query in num_set else 0 for query in queries])

@time_this
def deque_search():
    return sum([query if query in num_deque else 0 for query in queries])
list_search()
tup_search()
dict_search()
set_search()
deque_search()
Section 3 : Appending Elements
from random import shuffle
from random import randint
from collections import deque
size = 100_000
samples = [num for num in range(size)]
shuffle(samples)
@time_this
def list_append():
    num_list = []
    for x in samples:
        num_list.append(x)
    return num_list

@time_this
def tup_append():
    num_tup = ()
    for x in samples:
        num_tup = num_tup + (x,)
    return num_tup

@time_this
def dict_append():
    num_dict = {}
    for x in samples:
        num_dict[x] = 1
    return num_dict

@time_this
def set_append():
    num_set = set()
    for x in samples:
        num_set.add(x)
    return num_set

@time_this
def deque_append():
    num_deq = deque()
    for x in samples:
        num_deq.append(x)
    return num_deq
list_append()
tup_append()
dict_append()
set_append()
deque_append()
Section 4 : Finding the different elements
from random import shuffle
from random import randint
from collections import deque
list1 = [num for num in range(0, 75000)]
list2 = [num for num in range(50000, 100000)]
shuffle(list1)
shuffle(list2)
tup1 = tuple(list1)
tup2 = tuple(list2)
dict1 = {x:1 for x in list1}
dict2 = {x:1 for x in list2}
set1 = set(list1)
set2 = set(list2)
deq1 = deque(list1)
deq2 = deque(list2)
@time_this
def list_diff():
    return [x for x in list1 if x not in list2] + [x for x in list2 if x not in list1]

@time_this
def tup_diff():
    return tuple(x for x in tup1 if x not in tup2) + tuple(x for x in tup2 if x not in tup1)

@time_this
def dict_diff():
    diff_dict = {x: 1 for x in dict1.keys() if x not in dict2.keys()}
    for x in dict2.keys():
        if x not in dict1.keys():
            diff_dict[x] = 1
    return diff_dict

@time_this
def set_diff():
    return set1.symmetric_difference(set2)

@time_this
def deque_diff():
    diff_deq = deque()
    for x in deq1:
        if x not in deq2:
            diff_deq.append(x)
    for x in deq2:
        if x not in deq1:
            diff_deq.append(x)
    return diff_deq
list_diff()
tup_diff()
dict_diff()
set_diff()
deque_diff()
Results
Section 1: Access Random Elements
For data structures of size 1,000,000 and 1,000,000 queries (of random integers), the following access speeds were observed. All results, in all sections, are in seconds.
| List | Tuple | Dictionary | Set | Deque |
|---|---|---|---|---|
| 0.41142 | 0.41287 | 0.71069 | - | 50.04806 |
Surprisingly, we find that tuples and lists perform approximately the same when accessing by index! This goes against the common belief that, because tuples are immutable, they must be inherently faster. Many commenters have inspected the disassembly of tuples and lists and concluded that tuples are better in every way, while others have observed otherwise. Our observations align with the view that tuples offer no performance gain in common use cases, and that their immutability therefore serves more to structure code (preventing misuse of a design).
Accessing dictionary entries by key was only modestly slower: for a dictionary of 1 million elements, performing 1 million key accesses took about 0.71 seconds (roughly 0.3 seconds slower than a list).
Membership tests on a set actually proved fastest of all; however, sets cannot be indexed (they are not array-like), so this value was discarded from the comparison.
Deques align with our expectations. Double-ended queues are meant to be fast when inserting at the start or appending at the end; the fastest operations happen at the ends of the queue. Once random access is required, all hell breaks loose, as seen by the 50 seconds it takes to access random elements throughout the deque.
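You can see this gap yourself with a quick timeit comparison (absolute numbers will vary by machine, but the gap widens as the container grows, since deque indexing must walk toward the requested position):

```python
from collections import deque
from timeit import timeit

size = 100_000
num_list = list(range(size))
num_deque = deque(num_list)
mid = size // 2  # the worst spot for a deque: farthest from both ends

# indexing into the middle: O(1) for a list, O(n) for a deque
t_list = timeit(lambda: num_list[mid], number=1_000)
t_deque = timeit(lambda: num_deque[mid], number=1_000)
print(f"list  middle access: {t_list:.5f} s")
print(f"deque middle access: {t_deque:.5f} s")
```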
Section 2: Finding elements by value (searching)
While I kept the container size at 1 million, I had to reduce the number of queries to 1,000. As it turns out, searching by value is an extremely expensive operation for most of the common data types.
| List | Tuple | Dictionary | Set | Deque |
|---|---|---|---|---|
| 69.81036 | 69.96138 | 0.00061 | 0.00030 | 72.55674 |
Lists and tuples score the same again! It really seems that a tuple behaves just like a list, except that we cannot modify or append to it (we can reassign it, but as Section 3 shows, that is extremely expensive).
Dictionaries can be a tricky situation. You have to structure the dictionary so that the value you're searching for is a key in order to make the search fast. **If the element being searched for is not a key but a value, performance changes massively.** When searching the values (via .values(), which yields a view over the dictionary's values), the search for 1,000 random elements takes 56.32167 seconds. So if you're performing searches on a dictionary, it's integral to structure and design your code so that what you're searching for is typically in the KEYS, to achieve incredibly fast search speeds.
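The keys-versus-values point can be sketched directly. A plain membership test on the dict (which does the same hash lookup as indexing, without the try/except) hits the fast path; membership on .values() is the slow path:

```python
num_dict = {x: 1 for x in range(1_000_000)}

# searching the KEYS: a constant-time (average) hash lookup per query
def key_search(queries):
    return sum(q for q in queries if q in num_dict)

# searching the VALUES: .values() is a view (not a list) that must be
# scanned linearly - this is the slow path measured above
def value_search(queries):
    return sum(q for q in queries if q in num_dict.values())

print(key_search([5, 2_000_000]))  # → 5 (2_000_000 is not a key)
```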
Sets are apparently the fastest way to find out whether they contain an element (they are very dictionary-like after all, just lacking the value part of the key:value pairs).
Deques performed a little worse than lists and tuples, but very close. This can be attributed to the fact that deques aren't necessarily bad at iterating through their structure in search of something. They may not perform well at random access (as seen in Section 1), but they perform within reason (compared to a list) here.
It would seem that casting a list, tuple, or deque to a set and searching that is blisteringly fast compared to searching within the large list, tuple, or deque itself.
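That observation suggests a simple pattern whenever many membership tests hit the same large container: pay the one-time O(n) cast, then run every lookup against the set.

```python
num_list = list(range(1_000_000))
queries = [0, 500_000, 2_000_000]

# one-time O(n) conversion...
lookup = set(num_list)

# ...then each membership test is O(1) on average
hits = [q for q in queries if q in lookup]
print(hits)  # → [0, 500000]
```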
Section 3 : Appending Elements
In this section we measure how long it takes to grow each data structure over time. We grow the containers from 0 to 100,000 elements. Below are the observed results.
| List | Tuple | Dictionary | Set | Deque |
|---|---|---|---|---|
| 0.01823 | 43.11323 | 0.02127 | 0.02575 | 0.01529 |
We knew that tuples would take incredibly long to build, as they are immutable: there is no method to append items to them (they were not made to be used this way). We grow a tuple by concatenating the original tuple with another tuple, then overwriting the original with the result. This turns out to be incredibly expensive and required reducing the test from 1 million elements to 100,000. Tuple appending does not scale linearly (1,000,000 appends takes significantly longer than 430 seconds).
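The non-linear scaling follows from how the growth is done: each num_tup + (x,) allocates a brand-new tuple and copies every existing element into it, so n appends copy on the order of n²/2 elements in total, while list.append is amortized O(1). A small sketch of the two growth patterns:

```python
def grow_tuple(samples):
    # each '+' copies all prior elements into a new tuple: O(n^2) overall
    num_tup = ()
    for x in samples:
        num_tup = num_tup + (x,)
    return num_tup

def grow_list(samples):
    # list.append is amortized O(1), so the whole loop is O(n)
    num_list = []
    for x in samples:
        num_list.append(x)
    return num_list
```

Both produce the same elements; only the cost differs.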
When it comes to appending onto the end of a container, deques finally had their chance to shine (they are optimized for operations at either end), performing the best of the lot, with lists coming in a very close second.
Section 4 : The difference in elements between 2 structures
Sets have had many moments to shine in these tests, and here they swipe another victory: their mathematical set operations delivered the best times. The size was kept small (on the order of 100,000 elements rather than 1 million), as most data types struggled with this evaluation.
| List | Tuple | Dictionary | Set | Deque |
|---|---|---|---|---|
| 99.00318 | 100.86129 | 0.06924 | 0.00503 | 128.11499 |
Lists and tuples once again show that tuples are not inherently better than lists just because they're immutable. Deques had the lowest performance, as their optimizations apply only at the ends of the structure and are lackluster elsewhere.
Dictionary keys are surprisingly fast to search for existence, and this benefits the process of finding differing elements: dictionaries were almost 1,500 times faster than lists and tuples.
This was an obvious situation where sets would shine (there are methods that solve exactly this evaluation, compared to the loops and checks the rest required). Sets found the differences roughly 20,000 times faster than lists could. It should be noted that sets enforce unique elements; where elements are not unique, dictionaries may provide a suitable general-purpose way of tracking differing elements (Counter, from collections, is based on dictionaries and may further ease this).
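For inputs with duplicates, where a set would silently collapse repeats, Counter supports multiset arithmetic. One sketch of a symmetric difference that preserves counts:

```python
from collections import Counter

a = Counter([1, 1, 2, 3])
b = Counter([2, 3, 3, 4])

# (a - b) keeps elements over-represented in a; (b - a) the reverse.
# Their sum is a multiset symmetric difference.
diff = (a - b) + (b - a)
print(diff)  # → Counter({1: 2, 3: 1, 4: 1})
```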
Deques performed within our expectations (reiterating that they are only optimized for actions at the ends of the structure).
Conclusion
Surprisingly, deques did not offer a significantly large performance gain over lists when it came to repeatedly appending items onto the end.
Tuples did not outperform lists; the two remained mostly on par as far as performance is concerned. This reinforces the view that tuples are meant to enforce design choices (through immutability), not to optimize performance. Tuples may be more advantageous in memory, but that remains to be tested and is outside this post's scope.
Dictionaries and sets are incredibly fast at searching, even beyond our expectations, given that this is typically their use case. The difference in performance versus the other containers makes it clear that special attention should be given to design wherever sets and dictionaries can be used (reducing minutes of searching to under a second).
It would seem you can get away with using lists in most situations (over tuples and deques). But developers should be wary and find the areas where they can optimize their designs with dictionaries and sets. Given the vast improvement these two data structures offer over lists, using them can hardly be considered premature optimization, and it should not be time-consuming either, since they are provided by the standard library. This should be incorporated into your coding and design habits without inhibiting the speed of development Python provides.