Sorting, Filtering Data and Search#
There are some operations we will frequently need when working with data sets. Sorting data is useful both for increasing the readability or accessibility and for highlighting certain aspects of the data. Filtering data is useful for removing irrelevant data and is especially important with large data sets.
Sorting Data#
When presenting data to the user, sorting data by the right key/property can be important. For example, if you’re looking for a court decision by date, a list ordered by title isn’t helpful.
Say we have a list of judges that we want to sort.
judges = ['Síofra O’Leary, President', 'Mārtiņš Mits', 'Stéphanie Mourou-Vikström', 'Lətif Hüseynov',
'Jovan Ilievski', 'Ivana Jelić', 'Anne Grøstad, ad hoc judge', 'Victor Soloveytchik, Section Registrar']
We can sort lists in two ways. Firstly, we can make a new, sorted list, while also keeping the old one.
judges_sorted = sorted(judges)
print(judges_sorted)
['Anne Grøstad, ad hoc judge', 'Ivana Jelić', 'Jovan Ilievski', 'Lətif Hüseynov', 'Mārtiņš Mits', 'Stéphanie Mourou-Vikström', 'Síofra O’Leary, President', 'Victor Soloveytchik, Section Registrar']
Secondly, if we don’t need the original list, it’s more efficient to sort the existing list in place:
print(judges)
judges.sort()
print(judges)
['Síofra O’Leary, President', 'Mārtiņš Mits', 'Stéphanie Mourou-Vikström', 'Lətif Hüseynov', 'Jovan Ilievski', 'Ivana Jelić', 'Anne Grøstad, ad hoc judge', 'Victor Soloveytchik, Section Registrar']
['Anne Grøstad, ad hoc judge', 'Ivana Jelić', 'Jovan Ilievski', 'Lətif Hüseynov', 'Mārtiņš Mits', 'Stéphanie Mourou-Vikström', 'Síofra O’Leary, President', 'Victor Soloveytchik, Section Registrar']
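The difference between the two approaches can be checked directly. The following is a minimal sketch with a shortened list of surnames (chosen for illustration); note that .sort() returns None, so its result should not be assigned back to the variable.

```python
# A shortened example list (illustration only).
names = ['Mits', 'Jelić', 'Ilievski']

# sorted() returns a new list and leaves the original untouched.
names_sorted = sorted(names)
print(names)
print(names_sorted)

# .sort() changes the list itself and returns None.
result = names.sort()
print(result)
print(names)
```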
Reversed Sorting
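By default, Python sorts the items from smallest to largest.
For text, this means lexicographic order by Unicode code point: roughly A to Z, but with all uppercase letters before the lowercase ones, and accented letters such as “í” after the unaccented letters.
We can reverse the sort order by giving sort() the argument reverse=True.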
judges.sort(reverse=True)
print(judges)
['Victor Soloveytchik, Section Registrar', 'Síofra O’Leary, President', 'Stéphanie Mourou-Vikström', 'Mārtiņš Mits', 'Lətif Hüseynov', 'Jovan Ilievski', 'Ivana Jelić', 'Anne Grøstad, ad hoc judge']
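Python compares strings character by character using their Unicode code points, which is why 'Stéphanie' ends up before 'Síofra' in the sorted output above. The sketch below, using a few made-up words, shows the effect; ord() returns the code point of a character.

```python
words = ['rule', 'Article', 'article', 'Rule']

# Uppercase letters have lower code points than lowercase letters.
ascending = sorted(words)
descending = sorted(words, reverse=True)
print(ascending)   # ['Article', 'Rule', 'article', 'rule']
print(descending)  # ['rule', 'article', 'Rule', 'Article']

# 't' has a lower code point than 'í', so 'St...' sorts before 'Sí...'.
print(ord('t'), ord('í'))  # 116 237
```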
Filtering Data#
Let’s say we want to make a list of only the ad hoc judges. We can do that by iterating over the list:
adhoc_judges = []
for judge in judges:
    if 'ad hoc' in judge:
        adhoc_judges.append(judge)
print(adhoc_judges)
['Anne Grøstad, ad hoc judge']
This works, but it can be written more concisely with a list comprehension.
adhoc_judges = [judge for judge in judges if 'ad hoc' in judge]
print(adhoc_judges)
['Anne Grøstad, ad hoc judge']
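The condition in a list comprehension can be any expression that is true or false, including negations and combinations with and/or. A small sketch, using a shortened copy of the list under the assumed name entries:

```python
entries = ['Síofra O’Leary, President',
           'Mārtiņš Mits',
           'Anne Grøstad, ad hoc judge',
           'Victor Soloveytchik, Section Registrar']

# Keep only the entries without a title, i.e. without a comma.
without_title = [entry for entry in entries if ',' not in entry]
print(without_title)

# Combine two conditions with "and".
others = [entry for entry in entries
          if 'President' not in entry and 'Registrar' not in entry]
print(others)
```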
Modifying Data#
We can also use list comprehensions to modify each item in a list. For example, we can remove the titles from the list of judges. We split on the comma and use element 0 from the result.
judges = [judge.split(',')[0] for judge in judges]
print(judges)
['Victor Soloveytchik', 'Síofra O’Leary', 'Stéphanie Mourou-Vikström', 'Mārtiņš Mits', 'Lətif Hüseynov', 'Jovan Ilievski', 'Ivana Jelić', 'Anne Grøstad']
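To see why element 0 is always safe to use, it helps to look at what .split(',') returns. In this sketch (the variable names are only for illustration), an entry without a comma yields a list with a single element, and the title part keeps its leading space unless we strip it.

```python
entries = ['Anne Grøstad, ad hoc judge', 'Mārtiņš Mits']

for entry in entries:
    # An entry without a comma gives a one-element list.
    print(entry.split(','))

names = [entry.split(',')[0] for entry in entries]
print(names)

# The part after the comma starts with a space; .strip() removes it.
title = entries[0].split(',')[1].strip()
print(title)
```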
Sets — Avoiding Duplicates#
Let’s say we want to collect a list of all the judges that appear in a collection of cases. We could do it like this:
import json

def read_json_file(filename):
    # Read the whole file as text and parse it as JSON.
    with open(filename, 'r') as file:
        text_data = file.read()
    json_data = json.loads(text_data)
    return json_data

cases = read_json_file('cases-5.json')

# Collect the name of every judge in every case.
all_judges = []
for case in cases:
    for judge in case['decision_body']:
        all_judges.append(judge['name'])
print(all_judges)
['Helena Jäderblom', 'Branko Lubarda', 'Helen Keller', 'Dmitry Dedov', 'Pere Pastor Vilanova', 'Georgios A. Serghides', 'Jolien Schukking', 'Stephen Phillips', 'Luis López Guerra', 'Helena Jäderblom', 'Helen Keller', 'Dmitry Dedov', 'Branko Lubarda', 'Pere Pastor Vilanova', 'Georgios A. Serghides', 'Stephen Phillips', 'MrJ. Hedigan', 'MrB.M. Zupančič', 'MrC. Bîrsan', 'MrV. Zagrebelsky', 'MrsA. Gyulumyan', 'MrDavid Thór Björgvinsson', 'MrsI. Ziemele', 'Mr V. Berger', 'Nina Vajić', 'Anatoly Kovler', 'Khanlar Hajiyev', 'Mirjana Lazarova Trajkovska', 'Julia Laffranque', 'Linos-Alexandre Sicilianos', 'Erik Møse', 'Søren Nielsen', 'Françoise Tulkens', 'Ireneu Cabral Barreto', 'Vladimiro Zagrebelsky', 'Danutė Jočienė', 'András Sajó', 'Nona Tsotsoria', 'Işıl Karakaş', 'Sally Dollé']
Unfortunately, we get duplicates because the same judges appear in different cases.
We need to remove these duplicates.
The simplest way is to use set() instead of a list, because sets only store each item once.
all_judges = set()
for case in cases:
    for judge in case['decision_body']:
        all_judges.add(judge['name'])
print(all_judges)
{'Mirjana Lazarova Trajkovska', 'MrC. Bîrsan', 'Linos-Alexandre Sicilianos', 'MrJ. Hedigan', 'MrsA. Gyulumyan', 'Søren Nielsen', 'András Sajó', 'Işıl Karakaş', 'Khanlar Hajiyev', 'Anatoly Kovler', 'MrsI. Ziemele', 'Vladimiro Zagrebelsky', 'Georgios A. Serghides', 'Nina Vajić', 'Stephen Phillips', 'Danutė Jočienė', 'Erik Møse', 'MrDavid Thór Björgvinsson', 'Sally Dollé', 'Mr V. Berger', 'Julia Laffranque', 'Pere Pastor Vilanova', 'Jolien Schukking', 'MrV. Zagrebelsky', 'Helen Keller', 'Nona Tsotsoria', 'MrB.M. Zupančič', 'Helena Jäderblom', 'Luis López Guerra', 'Françoise Tulkens', 'Branko Lubarda', 'Ireneu Cabral Barreto', 'Dmitry Dedov'}
Note
A set is a collection of unique items. Sets are different from lists in two ways: order doesn’t matter, and each item can occur only once.
Removing duplicates#
If we have an existing list with duplicates, we can remove the duplicates by making a set from the list.
judges = ['Síofra O’Leary, President',
'Mārtiņš Mits',
'Mārtiņš Mits', ]
judges = set(judges)
print(judges)
{'Mārtiņš Mits', 'Síofra O’Leary, President'}
Changing Order
Notice that the order of the items might change when we make a set from a list. Sets don’t keep track of the order of the items, so they will be printed in arbitrary order.
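If the order matters, there are two common ways to get a predictable result. This is a sketch rather than the only option: sorted() turns the set back into an alphabetically sorted list, while dict.fromkeys() removes duplicates and keeps the order in which items first appeared (dictionaries preserve insertion order in Python 3.7 and later).

```python
judges = ['Helena Jäderblom', 'Branko Lubarda', 'Helena Jäderblom',
          'Helen Keller', 'Branko Lubarda']

# Option 1: remove duplicates, then sort alphabetically.
alphabetical = sorted(set(judges))
print(alphabetical)

# Option 2: remove duplicates but keep the order of first appearance.
first_seen = list(dict.fromkeys(judges))
print(first_seen)
```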
Tuples#
In addition to lists and sets, Python has a third collection type for storing items, called tuples. Tuples are like lists, except that they cannot be modified after they are created. Like lists, but unlike sets, their elements have a fixed order and can be repeated. We can make tuples by listing the items in parentheses:
items = (1, 2, 3)
Tuples can be useful for returning a collection of items from a function and for making lists of items that belong together. A tuple with two items is called a pair.
We can make a list of pairs, where each pair is a judge and their (optional) title:
judges = [('Síofra O’Leary', 'President'),
('Anne Grøstad', 'ad hoc judge'),
('Victor Soloveytchik', 'Section Registrar')]
print(judges)
[('Síofra O’Leary', 'President'), ('Anne Grøstad', 'ad hoc judge'), ('Victor Soloveytchik', 'Section Registrar')]
List vs. set vs. tuple
The differences between Python’s three main one-dimensional container types are summarized in the table Container comparison.
| | Modifiable | Allows duplicates | Ordered |
|---|---|---|---|
| List | ✔️ | ✔️ | ✔️ |
| Set | ✔️ | ❌ | ❌ |
| Tuple | ❌ | ✔️ | ✔️ |
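The three properties in the table can be tried out with a few lines of code. This is a small sketch with made-up numbers:

```python
as_list = [3, 1, 3]
as_set = {3, 1, 3}     # the duplicate 3 is stored only once
as_tuple = (3, 1, 3)

# Lists and sets can be modified after they are created.
as_list.append(2)
as_set.add(2)

# Tuples cannot: changing an element raises a TypeError.
try:
    as_tuple[0] = 9
except TypeError as error:
    print('Tuples cannot be modified:', error)

print(len(as_list), len(as_set), len(as_tuple))  # 4 3 3
```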
Unpacking Tuples#
Python has a shortcut for assigning multiple variables at once. This is useful for unpacking the values in a tuple. Some libraries, for example for Machine Learning, return multiple values by wrapping (storing) them in a tuple.
Let’s say we have a pair:
judge = ('Síofra O’Leary', 'President')
We can get the individual items like this:
judge_name = judge[0]
judge_title = judge[1]
But we can also assign both variables at once:
judge_name, judge_title = judge
This is called unpacking the tuple. The left-hand side must have exactly as many variables as the tuple on the right-hand side has elements; otherwise, Python raises a ValueError.
Unpacking is especially useful for immediately splitting a tuple that is returned by a function into separate variables, for example:
judge_name, judge_title = get_judge(case)
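As a self-contained sketch of this pattern, the hypothetical helper split_judge() below (not part of any library) returns a name and a title as a pair, which we unpack immediately. The last part shows what happens when the number of variables does not match.

```python
def split_judge(entry):
    """Hypothetical helper: split 'Name, title' into a (name, title) pair."""
    # str.partition() always returns three parts: before, separator, after.
    name, _, title = entry.partition(',')
    return name.strip(), title.strip()

judge_name, judge_title = split_judge('Síofra O’Leary, President')
print(judge_name)
print(judge_title)

# Unpacking into the wrong number of variables raises a ValueError.
try:
    first, second, third = split_judge('Mārtiņš Mits')
except ValueError as error:
    print(error)
```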
Sorting by a key#
Sometimes we want to sort items in a list in a different way than the default sort order produced by .sort() or sorted().
For example, we can sort strings by their length.
We can do that by giving the sort function a key for sorting.
The sorting key is a function that is applied to each item that is to be sorted.
The return value of this function is used instead of the item itself when doing the sorting.
To sort strings by their length, we use the argument key=len.
names = ['Alicia', 'Jane', 'Joe', 'Abdul']
names.sort(key=len)
print(names)
['Joe', 'Jane', 'Abdul', 'Alicia']
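It can help to look at the keys themselves. In the sketch below (with one extra made-up name), items with the same key keep their original relative order, because Python's sort is stable. Any function that takes one argument can serve as a key, for example str.lower for sorting without regard to case.

```python
names = ['Alicia', 'Jane', 'Joe', 'Abdul', 'Amy']

# These are the values that are actually compared.
keys = [len(name) for name in names]
print(keys)  # [6, 4, 3, 5, 3]

# 'Joe' and 'Amy' both have length 3 and keep their original order.
by_length = sorted(names, key=len)
print(by_length)

# Compare lowercase versions of the strings instead of the strings themselves.
ignoring_case = sorted(['rule', 'Article', 'annex'], key=str.lower)
print(ignoring_case)
```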
Sorting Cases#
Let’s say we want to order our cases by their name or date. First, we make a list that contains the names (docnames) and dates as pairs. We can do this with a list comprehension.
cases_date = [(case['docname'], case['judgementdate']) for case in cases]
First, we will try sorting by the case name. For the key, we need a function that extracts the item with an index 0 from the pair. We can write this function ourselves:
def get_title(pair):
    return pair[0]

cases_date.sort(key=get_title)
However, if we create functions like this, our code will become littered with short functions that are only used once. Therefore, Python supports making anonymous helper functions called lambda expressions. We can use a lambda expression as sorting key.
Lambda Expressions#
We make anonymous functions with lambda expressions.
cases_date.sort(key=lambda pair: pair[0])
print(cases_date)
[('CASE OF ARAT AND OTHERSv. TURKEY', '13/01/2009 00:00:00'), ('CASE OF OBERWALDER v. SLOVENIA', '18/01/2007 00:00:00'), ('CASE OF RAKHMONOV v. RUSSIA', '16/10/2012 00:00:00'), ('CASE OF SKLYAR v. RUSSIA', '18/07/2017 00:00:00'), ('CASE OF YABLOKO RUSSIAN UNITED DEMOCRATIC PARTY AND OTHERS v. RUSSIA', '08/11/2016 00:00:00')]
Lambda Expressions Syntax
Because lambda expressions are expressions, they can only contain a single expression, not a list of statements like a regular function. The general form of a lambda expression is:
lambda parameters: expression
An equivalent, regular function definition would be something like:
def <lambda>(parameters):
    return expression
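The equivalence can be verified with a concrete pair. In this sketch the lambda is assigned to a name only so that we can call it for comparison; normally it would be passed directly as the key argument.

```python
def get_date(pair):
    return pair[1]

# The same function written as a lambda expression.
get_date_lambda = lambda pair: pair[1]

pair = ('CASE OF SKLYAR v. RUSSIA', '18/07/2017 00:00:00')
print(get_date(pair))
print(get_date_lambda(pair))
```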
Itemgetter#
Because sorting by a list element is such a common operation, Python’s standard library has a function itemgetter() for this, in the operator module.
We can use itemgetter() to get the element at a given index:
from operator import itemgetter
cases_date.sort(key=itemgetter(0))
print(cases_date)
[('CASE OF ARAT AND OTHERSv. TURKEY', '13/01/2009 00:00:00'), ('CASE OF OBERWALDER v. SLOVENIA', '18/01/2007 00:00:00'), ('CASE OF RAKHMONOV v. RUSSIA', '16/10/2012 00:00:00'), ('CASE OF SKLYAR v. RUSSIA', '18/07/2017 00:00:00'), ('CASE OF YABLOKO RUSSIAN UNITED DEMOCRATIC PARTY AND OTHERS v. RUSSIA', '08/11/2016 00:00:00')]
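itemgetter() also accepts several indices, in which case it returns a tuple of the selected elements. As a sketch, this can be used to sort by one element first and by another to break ties; the titles in the example list are only illustrative.

```python
from operator import itemgetter

judges = [('Mārtiņš Mits', 'judge'),
          ('Ivana Jelić', 'judge'),
          ('Síofra O’Leary', 'President')]

# With one index, itemgetter returns that element.
get_name = itemgetter(0)
print(get_name(judges[0]))

# With two indices, it returns a tuple: here (title, name).
get_title_and_name = itemgetter(1, 0)
print(get_title_and_name(judges[0]))

# Sort by title first, then by name within the same title.
judges.sort(key=get_title_and_name)
print(judges)
```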
Sorting by Date#
If we want to sort by date, we must use the element with index 1.
cases_date.sort(key=lambda pair: pair[1])
print(cases_date)
[('CASE OF YABLOKO RUSSIAN UNITED DEMOCRATIC PARTY AND OTHERS v. RUSSIA', '08/11/2016 00:00:00'), ('CASE OF ARAT AND OTHERSv. TURKEY', '13/01/2009 00:00:00'), ('CASE OF RAKHMONOV v. RUSSIA', '16/10/2012 00:00:00'), ('CASE OF OBERWALDER v. SLOVENIA', '18/01/2007 00:00:00'), ('CASE OF SKLYAR v. RUSSIA', '18/07/2017 00:00:00')]
Unfortunately, the dates are stored as text, so Python sorts them in lexicographic order, comparing the day first, then the month and only then the year. We will need to convert the textual date to an object that Python understands.
We can do this with the Python library function datetime.strptime().
This function takes a parameter that specifies the date format, which looks a bit messy.
from datetime import datetime
cases_date.sort(key=lambda pair: datetime.strptime(pair[1], '%d/%m/%Y %H:%M:%S'))
print(cases_date)
[('CASE OF OBERWALDER v. SLOVENIA', '18/01/2007 00:00:00'), ('CASE OF ARAT AND OTHERSv. TURKEY', '13/01/2009 00:00:00'), ('CASE OF RAKHMONOV v. RUSSIA', '16/10/2012 00:00:00'), ('CASE OF YABLOKO RUSSIAN UNITED DEMOCRATIC PARTY AND OTHERS v. RUSSIA', '08/11/2016 00:00:00'), ('CASE OF SKLYAR v. RUSSIA', '18/07/2017 00:00:00')]
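To make the format string less mysterious, the sketch below parses one of the dates on its own and then compares text sorting with chronological sorting. In the format, %d is the day, %m the month, %Y the four-digit year, and %H:%M:%S the time.

```python
from datetime import datetime

date_format = '%d/%m/%Y %H:%M:%S'
dates = ['08/11/2016 00:00:00', '13/01/2009 00:00:00', '18/01/2007 00:00:00']

# Parse a single date and inspect its parts.
parsed = datetime.strptime(dates[0], date_format)
print(parsed.year, parsed.month, parsed.day)  # 2016 11 8

# Sorting the text compares the day first.
as_text = sorted(dates)

# Sorting the parsed dates gives chronological order.
chronological = sorted(dates, key=lambda text: datetime.strptime(text, date_format))
print(as_text)
print(chronological)
```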
Regular Expressions#
Regular expressions (regexes) are powerful tools for searching text.
We have already seen different ways to search for text.
If we want to find the word “Article” in the string document, we can use one of the following methods:
if 'Article' in document:
position = document.find('Article')
document.startswith('Article') (to check only the beginning of the string)
With regular expressions, we can look for patterns, not just specific strings.
To use regular expressions, we must import Python’s re module:
import re
We will need some text to search, so we will use the first page of a recent ECtHR case as an example.
document = '''NORWEGIAN CONFEDERATION OF TRADE UNIONS (LO) AND NORWEGIAN
TRANSPORT WORKERS’ UNION (NTF) v. NORWAY JUDGMENT
In the case of Norwegian Confederation of Trade Unions (LO) and
Norwegian Transport Workers’ Union (NTF) v. Norway,
The European Court of Human Rights (Fifth Section), sitting as a
Chamber composed of:
Síofra O’Leary, President,
Mārtiņš Mits,
Stéphanie Mourou-Vikström,
Lətif Hüseynov,
Jovan Ilievski,
Ivana Jelić, judges,
Anne Grøstad, ad hoc judge,
and Victor Soloveytchik, Section Registrar,
Having regard to:
the application against the Kingdom of Norway lodged with the Court
under Article 34 of the Convention for the Protection of Human Rights and
Fundamental Freedoms (“the Convention”) by two Norwegian associations,
the Norwegian Confederation of Trade Unions (Landsorganisasjonen i
Norge (“LO”)) and the Norwegian Transport Workers’ Union (Norsk
transportarbeiderforbund (“NTF”)) (“the applicant unions”), on 15 June
2017;
the withdrawal of Arnfinn Bårdsen, the judge elected in respect of
Norway, from sitting in the case (Rule 28 § 3 of the Rules of Court) and the
decision of the President of the Section to appoint Anne Grøstad to sit as an
ad hoc judge (Article 26 § 4 of the Convention and Rule 29 § 1(a));
the decision to give notice to the Norwegian Government (“the
Government”) of the complaint concerning Article 11 of the Convention
and to declare inadmissible the remainder of the application;
the observations submitted by the respondent Government and the
observations in reply submitted by the applicants;
the comments submitted by the European Trade Union Confederation
(ETUC), which was granted leave to intervene by the President of the
Section;
Having deliberated in private on 18 May 2021,
Delivers the following judgment, which was adopted on that date:
INTRODUCTION
1. The case concerns the alleged violation of Article 11 of the
Convention in relation to a decision by the Norwegian Supreme Court to
declare unlawful an announced boycott by a trade union which was planned
in order to pressure a Norwegian subsidiary of a Danish company to enter
into a Norwegian collective agreement applicable to dockworkers.'''
We can look for the word “Article” followed by any digit.
\d matches a single digit:
matches = re.findall(r'Article \d', document)
print(matches)
['Article 3', 'Article 2', 'Article 1', 'Article 1']
Raw Strings
Backslashes have special meaning in Python strings.
They introduce escape sequences for special characters, such as \t for tab and \n for newline.
To get an actual backslash, we must escape it with another backslash:
'Article \\d'
We can avoid this by using raw strings. They are prefixed with an “r”, like f-strings are prefixed with an “f”:
r'Article \d'
This is especially helpful for patterns with many backslashes.
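The difference is easy to check by looking at the length of the strings. A minimal sketch:

```python
tab = '\t'       # escape sequence: one tab character
raw_tab = r'\t'  # raw string: a backslash followed by the letter t
print(len(tab), len(raw_tab))  # 1 2

# Both spellings produce exactly the same pattern string.
escaped = 'Article \\d'
raw = r'Article \d'
print(escaped == raw)  # True
```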
Quantifiers#
We can repeat characters or patterns with quantifiers:
? matches 0 or 1 instance of the preceding item. The expression Article ?5 only matches the two strings “Article5” and “Article 5”.
* matches 0 or more instances of the preceding item. For example, the expression Article *5 matches “Article5”, “Article 5” and so on, with any amount of space between the word and the digit.
+ matches 1 or more instances of the preceding item. The expression Article +5 matches “Article 5” and all variants with at least one space before the 5.
To match numbers with multiple digits, we use the expression \d+.
print(re.findall(r'Article \d+', document))
['Article 34', 'Article 26', 'Article 11', 'Article 11']
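The three quantifiers can be compared side by side on a made-up test string that contains zero, one and three spaces between the word and the digit:

```python
import re

samples = 'Article5, Article 5, Article   5'

optional = re.findall(r'Article ?5', samples)     # 0 or 1 space
any_number = re.findall(r'Article *5', samples)   # 0 or more spaces
at_least_one = re.findall(r'Article +5', samples) # 1 or more spaces

print(optional)      # ['Article5', 'Article 5']
print(any_number)    # ['Article5', 'Article 5', 'Article   5']
print(at_least_one)  # ['Article 5', 'Article   5']
```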
Character Classes#
Our document mentions both Articles and Rules, so we can expand our expression to match both. This can be done in different ways. First, let’s look at the different character classes we can match.
\w matches a single word character. This includes the letters a to z and letters from other languages and alphabets, as well as digits and the underscore.
\s matches a single whitespace character, including space, tab, and newline.
\d matches a single digit, as we have seen.
. (period) matches any single character except a newline.
You can also make your own character class, by listing the characters in brackets. For example, [abc] matches a single a, b or c.
Now, let’s try to match any word followed by any number.
print(re.findall(r'\w+ \d+', document))
['Article 34', 'on 15', 'Rule 28', 'Article 26', 'Rule 29', 'Article 11', 'on 18', 'May 2021', 'Article 11']
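A few short experiments on made-up strings illustrate the character classes. This is only a sketch of the default behaviour of Python's re module for text strings:

```python
import re

# \w covers letters (including non-English ones), digits and the underscore.
words = re.findall(r'\w+', 'Rule 29, Article_11, Grøstad')
print(words)  # ['Rule', '29', 'Article_11', 'Grøstad']

# A custom class: an uppercase or lowercase r, followed by "ule".
rules = re.findall(r'[Rr]ule', 'Rule 28 and rule 29')
print(rules)  # ['Rule', 'rule']

# The period matches any character except a newline.
dots = re.findall(r'a.c', 'abc a c a\nc')
print(dots)  # ['abc', 'a c']
```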
Or#
As we can see, this also matches other substrings such as “on 15”.
Instead, we should specify just the words we want to look for.
We can do that with a |, which is used as or in regular expressions.
print(re.findall(r'Article|Rule \d+', document))
['Article', 'Rule 28', 'Article', 'Rule 29', 'Article', 'Article']
Grouping#
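Now, the digits are interpreted as belonging with “Rule” only, because | has the lowest precedence in regular expressions: the pattern means either “Article” or “Rule \d+”. We need to group the alternatives with parentheses: (?: | ). The ?: makes the group non-capturing, so that re.findall() still returns the whole match rather than only the text inside the parentheses.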
print(re.findall(r'(?:Article|Rule) \d+', document))
['Article 34', 'Rule 28', 'Article 26', 'Rule 29', 'Article 11', 'Article 11']
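The ?: at the start of the group matters when we use re.findall(). With plain parentheses, the group becomes a capturing group, and findall() returns only the captured part. A short sketch on a made-up string:

```python
import re

text = 'Rule 28 and Article 26'

# Capturing group: findall() returns only what is inside the parentheses.
captured = re.findall(r'(Article|Rule) \d+', text)
print(captured)  # ['Rule', 'Article']

# Non-capturing group: findall() returns the whole match.
whole = re.findall(r'(?:Article|Rule) \d+', text)
print(whole)  # ['Rule 28', 'Article 26']
```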
We can also include the paragraph number:
print(re.findall(r'(?:Article|Rule) \d+ § \d+', document))
['Rule 28 § 3', 'Article 26 § 4', 'Rule 29 § 1']
However, now we only find occurrences that include a paragraph number.
We can make that part optional by putting it in a group, (?: ), followed by ?:
print(re.findall(r'(?:Article|Rule) \d+(?: § \d+)?', document))
['Article 34', 'Rule 28 § 3', 'Article 26 § 4', 'Rule 29 § 1', 'Article 11', 'Article 11']
Ignoring Case#
Our current regular expression only matches strings containing “Article” or “Rule” exactly,
with uppercase initials and the rest of the letters in lowercase.
We might also want to ignore the case of the letters,
so that we also match “article” and “rule” written in other case variants.
To do that we include the flag re.IGNORECASE.
print(re.findall(r'(?:Article|Rule) \d+(?: § \d+)?', document, flags=re.IGNORECASE))
['Article 34', 'Rule 28 § 3', 'Article 26 § 4', 'Rule 29 § 1', 'Article 11', 'Article 11']
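Our example document happens to use the same capitalization throughout, so the result does not change. The effect of the flag becomes visible on a made-up sentence with other case variants:

```python
import re

text = 'See ARTICLE 11, article 34 and rule 28 § 3.'
pattern = r'(?:Article|Rule) \d+(?: § \d+)?'

# Without the flag, none of the variants match.
case_sensitive = re.findall(pattern, text)
print(case_sensitive)  # []

# With the flag, all case variants match.
case_insensitive = re.findall(pattern, text, flags=re.IGNORECASE)
print(case_insensitive)  # ['ARTICLE 11', 'article 34', 'rule 28 § 3']
```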
Anchors#
Earlier, we have seen the string search methods .startswith() and .endswith(), which only look for matches at the beginning or end of strings.
We can do this with regular expressions by using anchors.
For example, the tabular ECHR-OD data has several column names that contain the word “article”.
The columns with names article=* give the relevant articles for each case.
On the other hand, the columns with names ccl_article=* have the conclusion for each article.
We want to filter the columns, selecting only the ones giving the relevant articles, i.e. starting with “article”.
We can try matching on article-\d:
print(re.findall(r'article-\d', '''article-5
ccl_article-6'''))
['article-5', 'article-6']
This also matches “article-6”, which is incorrect.
To fix this, we can use a ^ anchor:
print(re.findall(r'^article-\d', '''article-5
ccl_article-6'''))
['article-5']
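This expression now matches only at the very beginning of the string, so “ccl_article-6” on the second line is no longer found. To match at the beginning of every line instead, we can pass the flag re.MULTILINE. The table Regex Anchors lists the anchors we can use in regular expressions.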
| Expression | Matches |
|---|---|
| ^ | Beginning of string |
| $ | End of string |
| \b | Word boundaries |
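The sketch below tries the anchors on the column names from the example above, this time with the wanted name on the second line. By default, ^ matches only at the start of the whole string; with the flag re.MULTILINE it also matches at the start of each line. When the column names are stored in a list (an assumption made for this example), each name is a string of its own, so a plain ^ is enough.

```python
import re

columns = 'ccl_article-6\narticle-5'

# Default: ^ matches only at the start of the string.
start_of_string = re.findall(r'^article-\d', columns)
print(start_of_string)  # []

# With re.MULTILINE, ^ also matches at the start of every line.
start_of_line = re.findall(r'^article-\d', columns, flags=re.MULTILINE)
print(start_of_line)  # ['article-5']

# \b needs a word boundary; "_" is a word character, so "ccl_article" fails.
word_boundary = re.findall(r'\barticle-\d', columns)
print(word_boundary)  # ['article-5']

# Filtering a list of column names, one string per name.
column_names = ['article-5', 'ccl_article-6', 'article-11']
selected = [name for name in column_names if re.search(r'^article-\d', name)]
print(selected)  # ['article-5', 'article-11']
```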
Further Reading#
Regular expressions have many more features. You can read more about them in, for example, Python’s Regular Expression HOWTO or the Wikipedia article Regular expression.