Information Extraction#

We often need to extract specific data from documents. For example, most documents contain dates and names of people or places. We can collect this information for statistics or for further processing.

Regular Expressions#

Regular expressions (regexes) are powerful tools for searching text. We have already seen different ways to search for text. If we want to find the word “Article” in the string document, we can use one of the following methods:

if 'Article' in document:
position = document.find('Article')
document.startswith('Article') (to check only the beginning of the string)
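As a quick comparison, the three methods can be tried on a short string. This is a minimal sketch; the string `sample` is made up for illustration and is not the case text used later.

```python
# A short sample string (hypothetical, not the case text used later).
sample = 'Article 11 of the Convention'

# 1. Membership test: True if the substring occurs anywhere in the string.
contains = 'Article' in sample

# 2. find() returns the index of the first occurrence, or -1 if absent.
position = sample.find('Convention')
missing = sample.find('Rule')

# 3. startswith() only checks the beginning of the string.
starts = sample.startswith('Article')

print(contains, position, missing, starts)  # True 18 -1 True
```

All three look for one fixed string, which is the limitation that regular expressions remove.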

With regular expressions, we can look for patterns, not just specific strings. To use regular expressions, we must import Python’s re module:

import re

We will need some text to search, so we will use the first page of a recent ECtHR case as an example.

document = '''NORWEGIAN CONFEDERATION OF TRADE UNIONS (LO) AND NORWEGIAN
TRANSPORT WORKERS’ UNION (NTF) v. NORWAY JUDGMENT

In the case of Norwegian Confederation of Trade Unions (LO) and
Norwegian Transport Workers’ Union (NTF) v. Norway,
The European Court of Human Rights (Fifth Section), sitting as a
Chamber composed of:
Síofra O’Leary, President,
Mārtiņš Mits,
Stéphanie Mourou-Vikström,
Lətif Hüseynov,
Jovan Ilievski,
Ivana Jelić, judges,
Anne Grøstad, ad hoc judge,
and Victor Soloveytchik, Section Registrar,
Having regard to:
the application against the Kingdom of Norway lodged with the Court
under Article 34 of the Convention for the Protection of Human Rights and
Fundamental Freedoms (“the Convention”) by two Norwegian associations,
the Norwegian Confederation of Trade Unions (Landsorganisasjonen i
Norge (“LO”)) and the Norwegian Transport Workers’ Union (Norsk
transportarbeiderforbund (“NTF”)) (“the applicant unions”), on 15 June
2017;
the withdrawal of Arnfinn Bårdsen, the judge elected in respect of
Norway, from sitting in the case (Rule 28 § 3 of the Rules of Court) and the
decision of the President of the Section to appoint Anne Grøstad to sit as an
ad hoc judge (Article 26 § 4 of the Convention and Rule 29 § 1(a));
the decision to give notice to the Norwegian Government (“the
Government”) of the complaint concerning Article 11 of the Convention
and to declare inadmissible the remainder of the application;
the observations submitted by the respondent Government and the
observations in reply submitted by the applicants;
the comments submitted by the European Trade Union Confederation
(ETUC), which was granted leave to intervene by the President of the
Section;
Having deliberated in private on 18 May 2021,
Delivers the following judgment, which was adopted on that date:

INTRODUCTION

1. The case concerns the alleged violation of Article 11 of the
Convention in relation to a decision by the Norwegian Supreme Court to
declare unlawful an announced boycott by a trade union which was planned
in order to pressure a Norwegian subsidiary of a Danish company to enter
into a Norwegian collective agreement applicable to dockworkers.'''

We can look for the word “Article” followed by any digit. \d matches a single digit:

matches = re.findall(r'Article \d', document)
print(matches)

['Article 3', 'Article 2', 'Article 1', 'Article 1']

Raw Strings

Backslashes have a special meaning in Python strings. They introduce escape sequences, which stand for special characters such as tab, \t, and newline, \n. To get an actual backslash, we must escape it with another backslash:

'Article \\d'

We can avoid this by using raw strings. They are prefixed with an “r”, like f-strings are prefixed with an “f”:

r'Article \d'

This is especially helpful for patterns with many backslashes.
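The following sketch checks that the two spellings really are the same string, and shows how a raw string treats \n. The variable names and the short test sentence are chosen only for this illustration.

```python
import re

# Both spellings produce the same characters: "Article ", a backslash and a d.
escaped = 'Article \\d'
raw = r'Article \d'
same = escaped == raw

# In a normal string '\n' is one newline character; in a raw string it is
# two characters, a backslash followed by the letter n.
normal_length = len('\n')
raw_length = len(r'\n')

print(same, normal_length, raw_length)  # True 1 2

# The raw string works as a pattern exactly like the escaped one.
found = re.findall(raw, 'Article 5 and Article 6')
print(found)  # ['Article 5', 'Article 6']
```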

Exercise: Find Norway

Copy the text above into your Jupyter Notebook. Make a regular expression that finds all occurrences of the string “Norway” in the text.

Quantifiers#

We can repeat characters or patterns with quantifiers:

? matches 0 or 1 instance of the preceding item. The expression Article ?5 only matches the two strings “Article5” and “Article 5”.
* matches 0 or more instances of the preceding item. For example, the expression Article *5 matches “Article5”, “Article 5” and so on, with any amount of space between the word and the digit.
+ matches 1 or more instances of the preceding item. The expression Article +5 matches “Article 5” and all variants with at least one space before the 5.

To match numbers with multiple digits, we use the expression \d+.

print(re.findall(r'Article \d+', document))

['Article 34', 'Article 26', 'Article 11', 'Article 11']
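To compare the three quantifiers side by side, we can test them against a few made-up strings. This sketch uses re.fullmatch, which succeeds only if the entire string fits the pattern; the list `samples` is an assumption for illustration.

```python
import re

# Hypothetical test strings with zero, one and three spaces before the 5.
samples = ['Article5', 'Article 5', 'Article   5']

results = {}
for pattern in [r'Article ?5', r'Article *5', r'Article +5']:
    # Keep the samples that the pattern matches from start to end.
    results[pattern] = [s for s in samples if re.fullmatch(pattern, s)]
    print(pattern, '->', results[pattern])
```

Only the * variant accepts all three strings: ? rejects the string with three spaces, and + rejects the string without any space.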

Exercise: Find words starting with “Norw”

Work with the same text as in the previous exercise. Make a regular expression that finds all words that start with the string “Norw” in the text.

Character Classes#

Our document mentions both Articles and Rules, so we can expand our expression to match both. This can be done in different ways. First, let’s look at the different character classes we can match.

\w matches a single word character. This includes the letters a to z, digits and the underscore, but also letters from other languages and alphabets.
\s matches a single whitespace character, including space, tab, newline and others.
\d matches a single digit, as we have seen.
. (period) matches any single character except a newline.

You can also make your own character class, by listing the characters in brackets. For example, [abc] matches a single a, b or c.
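A small sketch of the character classes in action, using a short string in the style of the citations in our document (the string `sample` is chosen here for illustration):

```python
import re

sample = 'Rule 29 § 1(a)'

digits = re.findall(r'\d', sample)     # every single digit
words = re.findall(r'\w+', sample)     # runs of word characters
custom = re.findall(r'[abc]', sample)  # our own class: a, b or c
spaces = re.findall(r'\s', sample)     # every whitespace character

print(digits)  # ['2', '9', '1']
print(words)   # ['Rule', '29', '1', 'a']
print(custom)  # ['a']
print(spaces)  # [' ', ' ', ' ']
```

Note that the section sign § and the parentheses are not word characters, so \w+ skips them.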

Now, let’s try to match any word followed by any number.

print(re.findall(r'\w+ \d+', document))

['Article 34', 'on 15', 'Rule 28', 'Article 26', 'Rule 29', 'Article 11', 'on 18', 'May 2021', 'Article 11']

Exercise: Find Years

Work with the same text as in the previous exercise. Find all mentions of years in the document. For this exercise, we define a year as any four digit number.

Or#

As we can see, the expression \w+ \d+ also matches other substrings such as “on 15”. Instead, we should specify just the words we want to look for. We can do that with a |, which means or in regular expressions.

print(re.findall(r'Article|Rule \d+', document))

['Article', 'Rule 28', 'Article', 'Rule 29', 'Article', 'Article']

Exercise: Find Numbers

Work with the same text as in the previous exercise. Some older documents contain years with only two digits, like 85 instead of 1985. Find all numbers with two or four digits in the document. Does this work well for finding years?

Grouping#

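Now “Article” is matched without its number. The | has the lowest precedence of all regex operators, so the expression means either “Article” or “Rule \d+”, and the digits belong with “Rule” only. We need to group the alternatives with parentheses: (?:Article|Rule). The ?: makes the group non-capturing. We need that because re.findall returns only the content of the group if the pattern contains a capturing group.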

print(re.findall(r'(?:Article|Rule) \d+', document))

['Article 34', 'Rule 28', 'Article 26', 'Rule 29', 'Article 11', 'Article 11']
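The difference between a non-capturing group (?:...) and a plain capturing group (...) is easiest to see with re.findall. This is a minimal sketch on a made-up string:

```python
import re

sample = 'Rule 28 and Article 26'

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r'(?:Article|Rule) \d+', sample)

# Capturing group: findall returns only what the group matched.
capturing = re.findall(r'(Article|Rule) \d+', sample)

print(non_capturing)  # ['Rule 28', 'Article 26']
print(capturing)      # ['Rule', 'Article']
```

Capturing groups are useful when we want only one part of each match, but for grouping alternatives the non-capturing form is usually what we want.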

We can also include the paragraph number:

print(re.findall(r'(?:Article|Rule) \d+ § \d+', document))

['Rule 28 § 3', 'Article 26 § 4', 'Rule 29 § 1']

However, now we only find occurrences that include a paragraph number. We can make that part optional by wrapping it in a non-capturing group and adding the ? quantifier after it:

print(re.findall(r'(?:Article|Rule) \d+(?: § \d+)?', document))

['Article 34', 'Rule 28 § 3', 'Article 26 § 4', 'Rule 29 § 1', 'Article 11', 'Article 11']

Ignoring Case#

Our current regular expression only matches strings containing “Article” or “Rule” exactly, with uppercase initials and the rest of the letters in lowercase. We might also want to ignore the case of the letters, so that we also match “article” and “rule” written in other case variants. To do that we pass the flag re.IGNORECASE. Our sample page happens to contain no other case variants, so the result below is unchanged.

print(re.findall(r'(?:Article|Rule) \d+(?: § \d+)?', document, flags=re.IGNORECASE))

['Article 34', 'Rule 28 § 3', 'Article 26 § 4', 'Rule 29 § 1', 'Article 11', 'Article 11']
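To see the flag make a difference, we can try the same pattern on a string with mixed case. The sentence in `sample` is invented for this sketch.

```python
import re

sample = 'See article 11 and RULE 29 § 1.'
pattern = r'(?:Article|Rule) \d+(?: § \d+)?'

# Without the flag, neither "article" nor "RULE" matches.
strict = re.findall(pattern, sample)

# With the flag, both case variants are found.
relaxed = re.findall(pattern, sample, flags=re.IGNORECASE)

print(strict)   # []
print(relaxed)  # ['article 11', 'RULE 29 § 1']
```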

Anchors#

Earlier, we saw the string methods .startswith() and .endswith(), which only look for matches at the beginning or end of a string. We can do this with regular expressions by using anchors.

For example, the tabular ECHR-OD data has several column names that contain the word “article”. The columns named article=* give the relevant articles for each case, whereas the columns named ccl_article=* hold the conclusion for each article.

We want to filter the columns, selecting only the ones giving the relevant articles, i.e. starting with “article”. We can try matching on article-\d:

print(re.findall(r'article-\d', '''article-5
                             ccl_article-6'''))

['article-5', 'article-6']

This also matches the “article-6” inside “ccl_article-6”, which is incorrect. To fix this, we can use a ^ anchor:

print(re.findall(r'^article-\d', '''article-5
                             ccl_article-6'''))

['article-5']

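This expression matches “article” only at the beginning of the string. By default, ^ does not match at the start of each line; for that we would pass the flag re.MULTILINE. The table Regex Anchors lists the anchors we can use in regular expressions.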

Table 3 Regex Anchors#
Expression	Matches
^	Beginning of string
$	End of string
\b	Word boundaries
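The sketch below tries each anchor from the table. The column names in `columns` are hypothetical, written in the same style as the example above, and the other strings are made up for illustration.

```python
import re

# Hypothetical column names in the style of the example above.
columns = ['article-5', 'ccl_article-6', 'article-10', 'particle-1']

# ^ and $ together require the whole name to fit the pattern.
relevant = [c for c in columns if re.search(r'^article-\d+$', c)]
print(relevant)  # ['article-5', 'article-10']

# \b matches at word boundaries, so "Rule" inside "Rules" is skipped.
whole_words = re.findall(r'\bRule\b', 'Rule 28 of the Rules of Court')
print(whole_words)  # ['Rule']

# By default ^ matches only at the start of the string. With
# re.MULTILINE it also matches at the start of every line.
lines = 'article-5\narticle-6'
default = re.findall(r'^article-\d', lines)
multiline = re.findall(r'^article-\d', lines, flags=re.MULTILINE)
print(default)    # ['article-5']
print(multiline)  # ['article-5', 'article-6']
```

Filtering a list one name at a time, as in the first step, avoids the question of line starts entirely, because each name is its own string.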

Named Entity Recognition#

We can use Named Entity Recognition (NER) to extract names of people, companies, places and other entities. Most NER systems also extract numbers. We will use the NER module in spaCy. spaCy is a library for Natural Language Processing in Python.

We might need to install the spacy package.

pip install spacy

We import the spaCy library and its module displaCy.

import spacy
from spacy import displacy

Incompatible numpy

On some computers, you might get an error message about numpy when you try to load spaCy. In that case, you need to install an older version of numpy like this:

pip install numpy==1.26.4

spaCy can use models for many different languages. The first time we use NER in spaCy we must download the data files for the English language.

!python -m spacy download en_core_web_sm

We load the text document LO-NTF-v-Norway.txt to run the Named Entity Recognizer on.

filename = 'LO-NTF-v-Norway.txt'
with open(filename, 'r', encoding='utf-8') as file:
    text = file.read()

We load the English NLP model:

nlp = spacy.load("en_core_web_sm")

Next, we process the text with the NLP model.

document = nlp(text)

We can extract the entities and their labels:

entities = [(ent.text, ent.label_) for ent in document.ents]

Let’s look at the data:

for entity in entities:
    print(f"Entity: {entity[0]}, Label: {entity[1]}")

Entity: NORWEGIAN CONFEDERATION OF TRADE UNIONS, Label: ORG
Entity: LO, Label: ORG
Entity: NORWEGIAN, Label: NORP
Entity: Norwegian Confederation of Trade Unions, Label: ORG
Entity: LO, Label: ORG
Entity: Norwegian Transport Workers’ Union, Label: ORG
Entity: Norway, Label: GPE
Entity: The European Court of Human Rights, Label: ORG
Entity: Chamber, Label: ORG
Entity: Mārtiņš Mits, Label: PERSON
Entity: Stéphanie Mourou-Vikström, Label: PERSON
Entity: Lətif Hüseynov, Label: PERSON
Entity: Jovan Ilievski, Label: PERSON
Entity: Ivana Jelić, Label: PERSON
Entity: Anne Grøstad, Label: PERSON
Entity: Victor Soloveytchik, Label: PERSON
Entity: Section Registrar, Label: PERSON
Entity: the Kingdom of Norway, Label: GPE
Entity: Court, Label: ORG
Entity: Article 34 of the Convention for the Protection of Human Rights, Label: LAW
Entity: Fundamental Freedoms, Label: ORG
Entity: two, Label: CARDINAL
Entity: Norwegian, Label: NORP
Entity: the Norwegian Confederation of Trade Unions, Label: ORG
Entity: LO, Label: WORK_OF_ART
Entity: the Norwegian Transport Workers’ Union, Label: ORG
Entity: 15 June
2017, Label: DATE
Entity: Arnfinn Bårdsen, Label: PERSON
Entity: Norway, Label: GPE
Entity: Rule 28 § 3 of the Rules of Court, Label: LAW
Entity: Anne Grøstad, Label: PERSON
Entity: Article 26 § 4 of the Convention and Rule 29 §, Label: LAW
Entity: the Norwegian Government, Label: ORG
Entity: the
Government, Label: WORK_OF_ART
Entity: Article 11 of the Convention, Label: LAW
Entity: the European Trade Union Confederation, Label: ORG
Entity: ETUC, Label: ORG
Entity: 18 May 2021, Label: DATE
Entity: 1, Label: CARDINAL
Entity: Article 11, Label: LAW
Entity: the Norwegian Supreme Court, Label: ORG
Entity: Norwegian, Label: NORP
Entity: Danish, Label: NORP
Entity: Norwegian, Label: NORP
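Because each entity is a (text, label) pair, ordinary Python tools are enough to summarise the result. The sketch below counts labels and collects the unique names of people. To keep it runnable without spaCy, it uses a short hand-copied sample of the pairs above instead of the full `entities` list; with the real list, the same lines should work unchanged.

```python
from collections import Counter

# A few (text, label) pairs copied by hand from the output above.
sample_entities = [
    ('Norway', 'GPE'),
    ('Mārtiņš Mits', 'PERSON'),
    ('Anne Grøstad', 'PERSON'),
    ('18 May 2021', 'DATE'),
    ('Anne Grøstad', 'PERSON'),
]

# Count how many entities carry each label.
label_counts = Counter(label for _, label in sample_entities)

# Collect the unique names of people; the set removes the duplicate.
people = sorted({text for text, label in sample_entities if label == 'PERSON'})

print(label_counts)
print(people)  # ['Anne Grøstad', 'Mārtiņš Mits']
```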

We can also get the entity types:

entity_types = set(ent.label_ for ent in document.ents)
print(f"Entity types: {entity_types}")

Entity types: {'GPE', 'NORP', 'PERSON', 'LAW', 'DATE', 'WORK_OF_ART', 'CARDINAL', 'ORG'}

Exercise: Find Organizations

Download the file LO-NTF-v-Norway.txt.

Use spaCy to do NER on this document. Then make a Python set of all the organizations, that is, all entities that spaCy labels ORG.

(Sets are collections without duplicates. Therefore, they are efficient for finding unique elements.)

Finally, we can display the tagged text.

# Visualize text with named entities as tags
displacy.render(document, style="ent", jupyter=True)

NORWEGIAN CONFEDERATION OF TRADE UNIONS ORG ( LO ORG ) AND NORWEGIAN NORP
TRANSPORT WORKERS’ UNION (NTF) v. NORWAY JUDGMENT

In the case of Norwegian Confederation of Trade Unions ORG ( LO ORG ) and
Norwegian Transport Workers’ Union ORG (NTF) v. Norway GPE ,
The European Court of Human Rights ORG (Fifth Section), sitting as a
Chamber ORG composed of:
Síofra O’Leary, President,
Mārtiņš Mits PERSON ,
Stéphanie Mourou-Vikström PERSON ,
Lətif Hüseynov PERSON ,
Jovan Ilievski PERSON ,
Ivana Jelić PERSON , judges,
Anne Grøstad PERSON , ad hoc judge,
and Victor Soloveytchik PERSON , Section Registrar PERSON ,
Having regard to:
the application against the Kingdom of Norway GPE lodged with the Court ORG
under Article 34 of the Convention for the Protection of Human Rights LAW and
Fundamental Freedoms ORG (“the Convention”) by two CARDINAL Norwegian NORP associations,
the Norwegian Confederation of Trade Unions ORG (Landsorganisasjonen i
Norge (“ LO WORK_OF_ART ”)) and the Norwegian Transport Workers’ Union ORG (Norsk
transportarbeiderforbund (“NTF”)) (“the applicant unions”), on 15 June 2017 DATE ;
the withdrawal of Arnfinn Bårdsen PERSON , the judge elected in respect of
Norway GPE , from sitting in the case ( Rule 28 § 3 of the Rules of Court LAW ) and the
decision of the President of the Section to appoint Anne Grøstad PERSON to sit as an
ad hoc judge ( Article 26 § 4 of the Convention and Rule 29 § LAW 1(a));
the decision to give notice to the Norwegian Government ORG (“ the Government WORK_OF_ART ”) of the complaint concerning Article 11 of the Convention LAW
and to declare inadmissible the remainder of the application;
the observations submitted by the respondent Government and the
observations in reply submitted by the applicants;
the comments submitted by the European Trade Union Confederation ORG
( ETUC ORG ), which was granted leave to intervene by the President of the
Section;
Having deliberated in private on 18 May 2021 DATE ,
Delivers the following judgment, which was adopted on that date:

INTRODUCTION

1 CARDINAL . The case concerns the alleged violation of Article 11 LAW of the
Convention in relation to a decision by the Norwegian Supreme Court ORG to
declare unlawful an announced boycott by a trade union which was planned
in order to pressure a Norwegian NORP subsidiary of a Danish NORP company to enter
into a Norwegian NORP collective agreement applicable to dockworkers.
