Sorting, Filtering Data and Search#
There are some operations we will frequently need when working with data sets. Sorting data is useful both for increasing the readability or accessibility and for highlighting certain aspects of the data. Filtering data is useful for removing irrelevant data and is especially important with large data sets.
Sorting Data#
When presenting data to the user, sorting data by the right key/property can be important. For example, if you’re looking for a court decision by date, a list ordered by title isn’t helpful.
Say we have a list of judges that we want to sort.
judges = ['Síofra O’Leary, President', 'Mārtiņš Mits', 'Stéphanie Mourou-Vikström', 'Lətif Hüseynov',
'Jovan Ilievski', 'Ivana Jelić', 'Anne Grøstad, ad hoc judge', 'Victor Soloveytchik, Section Registrar']
We can sort lists in two ways. Firstly, we can make a new, sorted list, while also keeping the old one.
judges_sorted = sorted(judges)
print(judges_sorted)
['Anne Grøstad, ad hoc judge', 'Ivana Jelić', 'Jovan Ilievski', 'Lətif Hüseynov', 'Mārtiņš Mits', 'Stéphanie Mourou-Vikström', 'Síofra O’Leary, President', 'Victor Soloveytchik, Section Registrar']
Secondly, if we don’t need the original list, it’s more efficient to sort the existing list in place:
print(judges)
judges.sort()
print(judges)
['Síofra O’Leary, President', 'Mārtiņš Mits', 'Stéphanie Mourou-Vikström', 'Lətif Hüseynov', 'Jovan Ilievski', 'Ivana Jelić', 'Anne Grøstad, ad hoc judge', 'Victor Soloveytchik, Section Registrar']
['Anne Grøstad, ad hoc judge', 'Ivana Jelić', 'Jovan Ilievski', 'Lətif Hüseynov', 'Mārtiņš Mits', 'Stéphanie Mourou-Vikström', 'Síofra O’Leary, President', 'Victor Soloveytchik, Section Registrar']
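A common slip when switching between the two forms is to assign the result of sort() to a variable. The following sketch, which uses a shortened, illustrative version of the list, shows the difference: sorted() returns a new list, while sort() changes the list itself and returns None.

```python
# Illustrative list (a shortened version of the judges list above)
names = ['Mārtiņš Mits', 'Ivana Jelić', 'Jovan Ilievski']

# sorted() returns a new list and leaves the original untouched
names_sorted = sorted(names)
print(names)          # original order is unchanged
print(names_sorted)   # ['Ivana Jelić', 'Jovan Ilievski', 'Mārtiņš Mits']

# sort() changes the list in place and returns None
result = names.sort()
print(result)         # None
print(names)          # the original list is now sorted
```

So writing names = names.sort() would replace the list with None, which is rarely what we want.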
Reversed Sorting
By default, Python sorts the items from smallest to largest.
For text, this means lexicographic order from A to Z, based on the Unicode code points of the characters.
We can reverse the sort order by giving sort() the argument reverse=True.
judges.sort(reverse=True)
print(judges)
['Victor Soloveytchik, Section Registrar', 'Síofra O’Leary, President', 'Stéphanie Mourou-Vikström', 'Mārtiņš Mits', 'Lətif Hüseynov', 'Jovan Ilievski', 'Ivana Jelić', 'Anne Grøstad, ad hoc judge']
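Because strings are compared character by character using Unicode code points, the default order is not quite a dictionary order. The sketch below uses made-up words (not part of the judges data) to show that uppercase letters sort before lowercase ones and that accented letters sort after the plain letters a to z. It also shows that sorted() accepts reverse=True just like sort().

```python
# Made-up words to illustrate the default string ordering
words = ['banana', 'Apple', 'cherry', 'Zebra', 'éclair']

# Uppercase letters come before lowercase ones,
# and accented letters come after plain a-z
ascending = sorted(words)
print(ascending)    # ['Apple', 'Zebra', 'banana', 'cherry', 'éclair']

# sorted() accepts reverse=True as well
descending = sorted(words, reverse=True)
print(descending)   # ['éclair', 'cherry', 'banana', 'Zebra', 'Apple']
```

This explains why 'Stéphanie Mourou-Vikström' appears before 'Síofra O’Leary' in the sorted list of judges: the code point of é is smaller than that of í.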
Filtering Data#
Let’s say we want to make a list of only the ad hoc judges. We can do that by iterating over the list:
adhoc_judges = []
for judge in judges:
    if 'ad hoc' in judge:
        adhoc_judges.append(judge)
print(adhoc_judges)
['Anne Grøstad, ad hoc judge']
This works, but it can be written more concisely with a list comprehension.
adhoc_judges = [judge for judge in judges if 'ad hoc' in judge]
print(adhoc_judges)
['Anne Grøstad, ad hoc judge']
Modifying Data#
We can also use list comprehensions to modify each item in a list. For example, we can remove the titles from the list of judges. We split on the comma and use element 0 from the result.
judges = [judge.split(',')[0] for judge in judges]
print(judges)
['Victor Soloveytchik', 'Síofra O’Leary', 'Stéphanie Mourou-Vikström', 'Mārtiņš Mits', 'Lətif Hüseynov', 'Jovan Ilievski', 'Ivana Jelić', 'Anne Grøstad']
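Filtering and modifying can be combined in a single list comprehension. As a small sketch, assuming a shortened list in the same format as the original judges list, we can collect only the titles of the judges that have one:

```python
# Shortened list in the same 'name, title' format as above
judges_with_titles = ['Síofra O’Leary, President', 'Mārtiņš Mits',
                      'Anne Grøstad, ad hoc judge']

# Keep only entries that contain a comma, then take the part after it.
# strip() removes the space that follows the comma.
titles = [judge.split(',')[1].strip()
          for judge in judges_with_titles if ',' in judge]
print(titles)   # ['President', 'ad hoc judge']
```

The condition at the end is evaluated first, so split(',')[1] is never applied to an entry without a comma.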
Sets — Avoiding Duplicates#
Let’s say we want to collect a list of all the judges that appear in a collection of cases. We could do it like this:
import json

def read_json_file(filename):
    with open(filename, 'r') as file:
        text_data = file.read()
        json_data = json.loads(text_data)
    return json_data

cases = read_json_file('cases-5.json')
all_judges = []
for case in cases:
    for judge in case['decision_body']:
        all_judges.append(judge['name'])
print(all_judges)
['Helena Jäderblom', 'Branko Lubarda', 'Helen Keller', 'Dmitry Dedov', 'Pere Pastor Vilanova', 'Georgios A. Serghides', 'Jolien Schukking', 'Stephen Phillips', 'Luis López Guerra', 'Helena Jäderblom', 'Helen Keller', 'Dmitry Dedov', 'Branko Lubarda', 'Pere Pastor Vilanova', 'Georgios A. Serghides', 'Stephen Phillips', 'MrJ. Hedigan', 'MrB.M. Zupančič', 'MrC. Bîrsan', 'MrV. Zagrebelsky', 'MrsA. Gyulumyan', 'MrDavid Thór Björgvinsson', 'MrsI. Ziemele', 'Mr V. Berger', 'Nina Vajić', 'Anatoly Kovler', 'Khanlar Hajiyev', 'Mirjana Lazarova Trajkovska', 'Julia Laffranque', 'Linos-Alexandre Sicilianos', 'Erik Møse', 'Søren Nielsen', 'Françoise Tulkens', 'Ireneu Cabral Barreto', 'Vladimiro Zagrebelsky', 'Danutė Jočienė', 'András Sajó', 'Nona Tsotsoria', 'Işıl Karakaş', 'Sally Dollé']
Unfortunately, we get duplicates because the same judges appear in different cases.
We need to remove these duplicates.
The simplest way is to use set() instead of a list, because sets only store each item once.
all_judges = set()
for case in cases:
    for judge in case['decision_body']:
        all_judges.add(judge['name'])
print(all_judges)
{'Erik Møse', 'Françoise Tulkens', 'Søren Nielsen', 'Nona Tsotsoria', 'Vladimiro Zagrebelsky', 'Linos-Alexandre Sicilianos', 'Danutė Jočienė', 'MrDavid Thór Björgvinsson', 'Ireneu Cabral Barreto', 'Julia Laffranque', 'Anatoly Kovler', 'Georgios A. Serghides', 'Stephen Phillips', 'Işıl Karakaş', 'Jolien Schukking', 'Nina Vajić', 'Sally Dollé', 'MrB.M. Zupančič', 'MrV. Zagrebelsky', 'MrC. Bîrsan', 'MrsA. Gyulumyan', 'MrsI. Ziemele', 'Mr V. Berger', 'Luis López Guerra', 'MrJ. Hedigan', 'Helen Keller', 'Mirjana Lazarova Trajkovska', 'Khanlar Hajiyev', 'Dmitry Dedov', 'András Sajó', 'Pere Pastor Vilanova', 'Branko Lubarda', 'Helena Jäderblom'}
Note
A set is a collection of unique items. Sets are different from lists in two ways: order doesn’t matter, and each item can occur only once.
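Just as a loop that fills a list can be written as a list comprehension, a loop that fills a set can be written as a set comprehension, using curly braces instead of square brackets. The sketch below uses a small hypothetical sample (sample_cases) in the same shape as the case data, so that it can be run without the JSON file.

```python
# Hypothetical sample in the same shape as the case data used above
sample_cases = [
    {'decision_body': [{'name': 'Helen Keller'}, {'name': 'Dmitry Dedov'}]},
    {'decision_body': [{'name': 'Helen Keller'}, {'name': 'Nina Vajić'}]},
]

# Set comprehension: curly braces instead of square brackets
unique_judges = {judge['name']
                 for case in sample_cases
                 for judge in case['decision_body']}

print(len(unique_judges))                # 3, 'Helen Keller' is stored once
print('Helen Keller' in unique_judges)   # True
```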
Removing duplicates#
If we have an existing list with duplicates, we can remove the duplicates by making a set from the list.
judges = ['Síofra O’Leary, President',
'Mārtiņš Mits',
'Mārtiņš Mits', ]
judges = set(judges)
print(judges)
{'Mārtiņš Mits', 'Síofra O’Leary, President'}
Changing Order
Notice that the order of the items might change when we make a set from a list. Sets don’t keep track of the order of the items, so they will be printed in arbitrary order.
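If the order matters, there are two simple options, sketched below on an illustrative list. We can sort the set to get a predictable order, or we can use dict.fromkeys(), which relies on the fact that dictionary keys are unique and keep their insertion order (guaranteed since Python 3.7).

```python
judges_with_duplicates = ['Síofra O’Leary, President',
                          'Mārtiņš Mits',
                          'Mārtiņš Mits']

# Option 1: sort the set to get a predictable (alphabetical) order
judges_sorted = sorted(set(judges_with_duplicates))
print(judges_sorted)     # ['Mārtiņš Mits', 'Síofra O’Leary, President']

# Option 2: dict keys are unique and keep insertion order,
# so this keeps the first occurrence of each item in its original position
judges_in_order = list(dict.fromkeys(judges_with_duplicates))
print(judges_in_order)   # ['Síofra O’Leary, President', 'Mārtiņš Mits']
```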
Tuples#
In addition to lists and sets, Python has a third collection type for storing items: the tuple. Tuples are like lists that cannot be modified. Like lists, but unlike sets, their elements have a fixed order and can be repeated. We can make tuples by listing the items in parentheses:
items = (1, 2, 3)
Tuples can be useful for returning a collection of items from a function and for making lists of items that belong together. A tuple with two items is called a pair.
We can make a list of pairs, where each pair is a judge and their (optional) title:
judges = [('Síofra O’Leary', 'President'),
('Anne Grøstad', 'ad hoc judge'),
('Victor Soloveytchik', 'Section Registrar')]
print(judges)
[('Síofra O’Leary', 'President'), ('Anne Grøstad', 'ad hoc judge'), ('Victor Soloveytchik', 'Section Registrar')]
List vs. set vs. tuple
The differences between Python’s three main one-dimensional container types are summarized in the table Container comparison.
| | Modifiable | Allows duplicates | Ordered |
|---|---|---|---|
| List | ✔️ | ✔️ | ✔️ |
| Set | ✔️ | ❌ | ❌ |
| Tuple | ❌ | ✔️ | ✔️ |
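The properties in the comparison can be checked directly in Python. The following is a minimal sketch with made-up numbers; it shows which containers keep duplicates, which can be modified, and which support access by position.

```python
numbers_list = [3, 1, 3]
numbers_set = {3, 1, 3}
numbers_tuple = (3, 1, 3)

print(len(numbers_list))    # 3: lists keep duplicates
print(len(numbers_set))     # 2: sets drop duplicates
print(len(numbers_tuple))   # 3: tuples keep duplicates

numbers_list.append(2)      # lists can be modified
numbers_set.add(2)          # sets can be modified

try:
    numbers_tuple[0] = 5    # tuples cannot be modified
except TypeError as error:
    print('TypeError:', error)

# Lists and tuples are ordered, so items can be accessed by position.
# Sets have no positions, so numbers_set[0] would raise a TypeError.
print(numbers_list[0])      # 3
print(numbers_tuple[0])     # 3
```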
Unpacking Tuples#
Python has a shortcut for assigning multiple variables at once. This is useful for unpacking the values in a tuple. Some libraries, for example in machine learning, return multiple values by wrapping (storing) them in a tuple.
Let’s say we have a pair:
judge = ('Síofra O’Leary', 'President')
We can get the individual items like this:
judge_name = judge[0]
judge_title = judge[1]
But we can also assign both variables at once:
judge_name, judge_title = judge
This is called unpacking the tuple. The number of variables on the left-hand side must match the number of elements in the tuple on the right-hand side; otherwise Python raises a ValueError.
Unpacking is especially useful for immediately splitting a tuple that is returned by a function into separate variables, for example:
judge_name, judge_title = get_judge(case)
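The function get_judge() is not defined in this text. To make the idea concrete, here is a sketch with a hypothetical version of it, together with a hypothetical case record that has a 'title' key (an assumption made only for this example). It also shows unpacking in a for loop and what happens when the number of variables does not match.

```python
def get_judge(case):
    """Hypothetical helper: return the first judge of a case as a (name, title) pair."""
    first = case['decision_body'][0]
    return first['name'], first['title']   # the comma makes this a tuple

# Hypothetical case record; the 'title' key is an assumption for this sketch
case = {'decision_body': [{'name': 'Síofra O’Leary', 'title': 'President'}]}

judge_name, judge_title = get_judge(case)
print(judge_name)    # Síofra O’Leary
print(judge_title)   # President

# Unpacking also works in a for loop over a list of pairs
judges = [('Síofra O’Leary', 'President'), ('Anne Grøstad', 'ad hoc judge')]
for name, title in judges:
    print(name, '-', title)

# The number of variables must match the number of items
try:
    name, title = ('Anne Grøstad', 'ad hoc judge', 'extra')
except ValueError as error:
    print('ValueError:', error)
```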
Sorting by a key#
Sometimes we want to sort items in a list in a different way than the default sort order produced by .sort() or sorted().
For example, we can sort strings by their length.
We can do that by giving the sort function a key for sorting.
The sorting key is a function that is applied to each of the items that are to be sorted.
The return value of this function is used instead of the item itself when doing the sorting.
To sort strings by their length, we use the argument key=len.
names = ['Alicia', 'Jane', 'Joe', 'Abdul']
names.sort(key=len)
print(names)
['Joe', 'Jane', 'Abdul', 'Alicia']
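Any function that takes one item and returns a value can serve as a key. As a further sketch, with a made-up list of names in mixed case, the string method str.casefold can be used as a key to sort text while ignoring the difference between uppercase and lowercase. The last line illustrates that Python's sort is stable: items with the same key keep their original relative order.

```python
names = ['alicia', 'Jane', 'joe', 'Abdul', 'Zoe']

# Default order: uppercase letters sort before lowercase ones
by_default = sorted(names)
print(by_default)      # ['Abdul', 'Jane', 'Zoe', 'alicia', 'joe']

# key=str.casefold compares lowercased versions, so case is ignored
ignoring_case = sorted(names, key=str.casefold)
print(ignoring_case)   # ['Abdul', 'alicia', 'Jane', 'joe', 'Zoe']

# 'joe' and 'Zoe' have the same length; 'joe' stays first
# because it came first in the original list (stable sort)
by_length = sorted(names, key=len)
print(by_length)       # ['joe', 'Zoe', 'Jane', 'Abdul', 'alicia']
```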
Sorting Cases#
Let’s say we want to order our cases by their name or date. First, we make a list that contains the names (docnames) and dates as pairs. We can do this with a list comprehension.
cases_date = [(case['docname'], case['judgementdate']) for case in cases]
First, we will try sorting by the case name. For the key, we need a function that extracts the item at index 0 from the pair. We can write this function ourselves:
def get_title(pair):
    return pair[0]

cases_date.sort(key=get_title)
However, if we create functions like this, our code will become littered with short functions that are only used once. Therefore, Python supports making anonymous helper functions called lambda expressions. We can use a lambda expression as sorting key.
Lambda Expressions#
We make anonymous functions with lambda expressions.
cases_date.sort(key=lambda pair: pair[0])
print(cases_date)
[('CASE OF ARAT AND OTHERSv. TURKEY', '13/01/2009 00:00:00'), ('CASE OF OBERWALDER v. SLOVENIA', '18/01/2007 00:00:00'), ('CASE OF RAKHMONOV v. RUSSIA', '16/10/2012 00:00:00'), ('CASE OF SKLYAR v. RUSSIA', '18/07/2017 00:00:00'), ('CASE OF YABLOKO RUSSIAN UNITED DEMOCRATIC PARTY AND OTHERS v. RUSSIA', '08/11/2016 00:00:00')]
Lambda Expressions Syntax
A lambda expression can only contain a single expression, not a sequence of statements like a regular function. The value of that expression is the return value. The general form of a lambda expression is:
lambda parameters: expression
An equivalent, regular function definition would be something like:
def <lambda>(parameters):
    return expression
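To see this equivalence in practice, the sketch below defines the same small function both ways. The lambda is given a name here only so that the two can be compared; normally a lambda is passed directly as an argument. The second part shows that the single expression may be a tuple, which is a simple way to sort by several criteria at once (the names are made up).

```python
# A lambda expression and the equivalent regular function
get_second = lambda pair: pair[1]   # named only for this comparison

def get_second_def(pair):
    return pair[1]

pair = ('Anne Grøstad', 'ad hoc judge')
print(get_second(pair))                           # ad hoc judge
print(get_second(pair) == get_second_def(pair))   # True

# The expression can be a tuple: sort by length first,
# then alphabetically among names of equal length
names = ['Joe', 'Jane', 'Ann', 'Abdul']
names.sort(key=lambda name: (len(name), name))
print(names)                                      # ['Ann', 'Joe', 'Jane', 'Abdul']
```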
Itemgetter#
Because sorting by a list element is such a common operation, Python’s standard library provides the function itemgetter() in the operator module for this.
We can use itemgetter() to get the element at a given index:
from operator import itemgetter
cases_date.sort(key=itemgetter(0))
print(cases_date)
[('CASE OF ARAT AND OTHERSv. TURKEY', '13/01/2009 00:00:00'), ('CASE OF OBERWALDER v. SLOVENIA', '18/01/2007 00:00:00'), ('CASE OF RAKHMONOV v. RUSSIA', '16/10/2012 00:00:00'), ('CASE OF SKLYAR v. RUSSIA', '18/07/2017 00:00:00'), ('CASE OF YABLOKO RUSSIAN UNITED DEMOCRATIC PARTY AND OTHERS v. RUSSIA', '08/11/2016 00:00:00')]
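The function itemgetter() also accepts several indexes, in which case it returns a tuple of the selected items. The following sketch uses an illustrative list of (name, title) pairs to sort by title first and by name second.

```python
from operator import itemgetter

# Illustrative pairs of (name, title)
judges = [('Anne Grøstad', 'ad hoc judge'),
          ('Síofra O’Leary', 'President'),
          ('Victor Soloveytchik', 'Section Registrar')]

get_title = itemgetter(1)     # behaves like: lambda pair: pair[1]
first_title = get_title(judges[0])
print(first_title)            # ad hoc judge

# With several indexes, itemgetter returns a tuple of the selected items,
# so this sorts by title first and by name second
judges.sort(key=itemgetter(1, 0))
print(judges)
```

Note that 'ad hoc judge' ends up last, because lowercase letters sort after uppercase ones.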
Sorting by Date#
If we want to sort by date, we must use the element with index 1.
cases_date.sort(key=lambda pair: pair[1])
print(cases_date)
[('CASE OF YABLOKO RUSSIAN UNITED DEMOCRATIC PARTY AND OTHERS v. RUSSIA', '08/11/2016 00:00:00'), ('CASE OF ARAT AND OTHERSv. TURKEY', '13/01/2009 00:00:00'), ('CASE OF RAKHMONOV v. RUSSIA', '16/10/2012 00:00:00'), ('CASE OF OBERWALDER v. SLOVENIA', '18/01/2007 00:00:00'), ('CASE OF SKLYAR v. RUSSIA', '18/07/2017 00:00:00')]
Unfortunately, because the dates are stored as text, Python sorts them in lexicographic order, so they are ordered by day first rather than by year. We need to convert the textual date to an object that Python can compare as a date.
We can do this with the Python library function datetime.strptime().
This function takes a parameter that specifies the date format, which looks a bit messy.
from datetime import datetime
cases_date.sort(key=lambda pair: datetime.strptime(pair[1], '%d/%m/%Y %H:%M:%S'))
print(cases_date)
[('CASE OF OBERWALDER v. SLOVENIA', '18/01/2007 00:00:00'), ('CASE OF ARAT AND OTHERSv. TURKEY', '13/01/2009 00:00:00'), ('CASE OF RAKHMONOV v. RUSSIA', '16/10/2012 00:00:00'), ('CASE OF YABLOKO RUSSIAN UNITED DEMOCRATIC PARTY AND OTHERS v. RUSSIA', '08/11/2016 00:00:00'), ('CASE OF SKLYAR v. RUSSIA', '18/07/2017 00:00:00')]
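When the key becomes this long, it can be easier to read as a small named function. The sketch below is one possible way to do that; the names DATE_FORMAT and parse_date are chosen for this example, and the list contains just two of the pairs from the output above. It also shows how to list the newest case first and how strftime() turns a date back into text.

```python
from datetime import datetime

# %d = day, %m = month, %Y = four-digit year, %H:%M:%S = time
DATE_FORMAT = '%d/%m/%Y %H:%M:%S'

# Two (name, date) pairs taken from the output above
cases_date = [('CASE OF SKLYAR v. RUSSIA', '18/07/2017 00:00:00'),
              ('CASE OF OBERWALDER v. SLOVENIA', '18/01/2007 00:00:00')]

def parse_date(pair):
    """Turn the date string in a (name, date) pair into a datetime object."""
    return datetime.strptime(pair[1], DATE_FORMAT)

print(parse_date(cases_date[0]))   # 2017-07-18 00:00:00

# Newest case first
newest_first = sorted(cases_date, key=parse_date, reverse=True)
print(newest_first[0][0])          # CASE OF SKLYAR v. RUSSIA

# strftime() converts a datetime back to text; text in year-month-day
# form sorts correctly even as a plain string
iso_date = parse_date(cases_date[1]).strftime('%Y-%m-%d')
print(iso_date)                    # 2007-01-18
```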