Files and Exceptions#

We often need to read data from files. With Python we can read many different formats, for example Word documents, PDF documents, or tabular data in Excel or CSV format. Some of these file formats require third-party libraries. Here, we will look at reading from and writing to plain text files.

Opening Files#

We can open a file containing the introduction to a case from the European Court of Human Rights (ECtHR): LO-NTF-v-Norway.txt

filename = 'LO-NTF-v-Norway.txt'

Python has the function open() for opening files. We should always specify the text encoding, which is often UTF-8. If we leave it out, Python uses a default that can differ between systems.

We could use this function directly and assign the result to a variable.

file = open(filename, encoding='UTF-8')

However, open files consume system resources. Therefore, we must always remember to close files when we are finished with them. A large program or web application that keeps opening files without closing them will eventually run out of file handles or memory and crash.

Python can automatically close files for us if we use the with statement.

with open(filename, encoding='UTF-8') as file:
    print(file)
<_io.TextIOWrapper name='LO-NTF-v-Norway.txt' mode='r' encoding='UTF-8'>

Caution

Always use with when opening files.

Note

Notice that the print() statement above prints a description of the file object, not the file content. We must use methods of this object to get the content.
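
To see what the with statement does for us, the sketch below compares it with closing the file by hand. It is only an illustration: the file demo.txt is a hypothetical throwaway file created in a temporary directory, and the method read(), which returns the whole content as one string, is used just to have something to do with the open file.

```python
import tempfile
from pathlib import Path

# Create a small throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / 'demo.txt'  # hypothetical file name
    demo_path.write_text('first line\n', encoding='UTF-8')

    # Manual approach: we must remember to close the file ourselves,
    # even if reading fails, hence the try/finally.
    manual_file = open(demo_path, encoding='UTF-8')
    try:
        manual_text = manual_file.read()
    finally:
        manual_file.close()

    # With statement: the file is closed automatically when the block ends.
    with open(demo_path, encoding='UTF-8') as file:
        with_text = file.read()

print(manual_file.closed)  # True
print(file.closed)         # True
```

Both approaches leave the file closed, but the with version is shorter and harder to get wrong.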

Reading the File Content#

We can use the method readline() to read single lines from the file.

with open(filename, encoding='UTF-8') as file:
    print(file.readline())
NORWEGIAN CONFEDERATION OF TRADE UNIONS (LO) AND NORWEGIAN
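
Each call to readline() continues where the previous one stopped. The sketch below illustrates this with io.StringIO, which behaves like an open text file but lives in memory; the two-line content is made up for the example.

```python
import io

# io.StringIO behaves like an open text file, so no file on disk is needed.
file = io.StringIO('first line\nsecond line\n')

first = file.readline()
second = file.readline()
end = file.readline()

print(repr(first))   # 'first line\n'
print(repr(second))  # 'second line\n'
print(repr(end))     # '' (an empty string means there is nothing left to read)
```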

We can also use a for loop to process the file line by line.

with open(filename, encoding='UTF-8') as file:
    for line in file:
        print(line)
NORWEGIAN CONFEDERATION OF TRADE UNIONS (LO) AND NORWEGIAN

TRANSPORT WORKERS’ UNION (NTF) v. NORWAY JUDGMENT



In the case of Norwegian Confederation of Trade Unions (LO) and

Norwegian Transport Workers’ Union (NTF) v. Norway,

The European Court of Human Rights (Fifth Section), sitting as a

Chamber composed of:

Síofra O’Leary, President,

Mārtiņš Mits,

Stéphanie Mourou-Vikström,

Lətif Hüseynov,

Jovan Ilievski,

Ivana Jelić, judges,

Anne Grøstad, ad hoc judge,

and Victor Soloveytchik, Section Registrar,

Having regard to:

the application against the Kingdom of Norway lodged with the Court

under Article 34 of the Convention for the Protection of Human Rights and

Fundamental Freedoms (“the Convention”) by two Norwegian associations,

the Norwegian Confederation of Trade Unions (Landsorganisasjonen i

Norge (“LO”)) and the Norwegian Transport Workers’ Union (Norsk

transportarbeiderforbund (“NTF”)) (“the applicant unions”), on 15 June

2017;

the withdrawal of Arnfinn Bårdsen, the judge elected in respect of

Norway, from sitting in the case (Rule 28 § 3 of the Rules of Court) and the

decision of the President of the Section to appoint Anne Grøstad to sit as an

ad hoc judge (Article 26 § 4 of the Convention and Rule 29 § 1(a));

the decision to give notice to the Norwegian Government (“the

Government”) of the complaint concerning Article 11 of the Convention

and to declare inadmissible the remainder of the application;

the observations submitted by the respondent Government and the

observations in reply submitted by the applicants;

the comments submitted by the European Trade Union Confederation

(ETUC), which was granted leave to intervene by the President of the

Section;

Having deliberated in private on 18 May 2021,

Delivers the following judgment, which was adopted on that date:



INTRODUCTION



1. The case concerns the alleged violation of Article 11 of the

Convention in relation to a decision by the Norwegian Supreme Court to

declare unlawful an announced boycott by a trade union which was planned

in order to pressure a Norwegian subsidiary of a Danish company to enter

into a Norwegian collective agreement applicable to dockworkers.

Removing Whitespace#

When we print the file content above, we get a blank line between each line. This is because each line we read ends with a newline, \n, and print() also inserts a newline. To avoid this, we can remove leading and trailing whitespace with the string method strip().

with open(filename, encoding='UTF-8') as file:
    for line in file:
        line = line.strip()
        print(line)
NORWEGIAN CONFEDERATION OF TRADE UNIONS (LO) AND NORWEGIAN
TRANSPORT WORKERS’ UNION (NTF) v. NORWAY JUDGMENT

In the case of Norwegian Confederation of Trade Unions (LO) and
Norwegian Transport Workers’ Union (NTF) v. Norway,
The European Court of Human Rights (Fifth Section), sitting as a
Chamber composed of:
Síofra O’Leary, President,
Mārtiņš Mits,
Stéphanie Mourou-Vikström,
Lətif Hüseynov,
Jovan Ilievski,
Ivana Jelić, judges,
Anne Grøstad, ad hoc judge,
and Victor Soloveytchik, Section Registrar,
Having regard to:
the application against the Kingdom of Norway lodged with the Court
under Article 34 of the Convention for the Protection of Human Rights and
Fundamental Freedoms (“the Convention”) by two Norwegian associations,
the Norwegian Confederation of Trade Unions (Landsorganisasjonen i
Norge (“LO”)) and the Norwegian Transport Workers’ Union (Norsk
transportarbeiderforbund (“NTF”)) (“the applicant unions”), on 15 June
2017;
the withdrawal of Arnfinn Bårdsen, the judge elected in respect of
Norway, from sitting in the case (Rule 28 § 3 of the Rules of Court) and the
decision of the President of the Section to appoint Anne Grøstad to sit as an
ad hoc judge (Article 26 § 4 of the Convention and Rule 29 § 1(a));
the decision to give notice to the Norwegian Government (“the
Government”) of the complaint concerning Article 11 of the Convention
and to declare inadmissible the remainder of the application;
the observations submitted by the respondent Government and the
observations in reply submitted by the applicants;
the comments submitted by the European Trade Union Confederation
(ETUC), which was granted leave to intervene by the President of the
Section;
Having deliberated in private on 18 May 2021,
Delivers the following judgment, which was adopted on that date:

INTRODUCTION

1. The case concerns the alleged violation of Article 11 of the
Convention in relation to a decision by the Norwegian Supreme Court to
declare unlawful an announced boycott by a trade union which was planned
in order to pressure a Norwegian subsidiary of a Danish company to enter
into a Norwegian collective agreement applicable to dockworkers.
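
Note that strip() removes all leading and trailing whitespace, including indentation. If the indentation matters, one alternative is rstrip('\n'), which removes only trailing newlines. A small sketch with a made-up line:

```python
line = '    indented line  \n'  # made-up example line

stripped = line.strip()           # removes whitespace on both sides
newline_only = line.rstrip('\n')  # removes only the trailing newline

print(repr(line))          # '    indented line  \n'
print(repr(stripped))      # 'indented line'
print(repr(newline_only))  # '    indented line  '
```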

Splitting Strings#

Sometimes we need to process text word by word. To do this, we can use the string method split(), which splits a string on whitespace by default. We can also specify some other character to split on.

with open(filename, encoding='UTF-8') as file:
    line = file.readline()
    line = line.strip()
    words = line.split()
    print(words)
['NORWEGIAN', 'CONFEDERATION', 'OF', 'TRADE', 'UNIONS', '(LO)', 'AND', 'NORWEGIAN']
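
As a sketch of splitting on another character, we can split the filename on hyphens. The second part shows how split() with no argument differs from split(' ') when a string contains two spaces in a row; the string 'Rule  28' is made up for the illustration.

```python
filename = 'LO-NTF-v-Norway.txt'
parts = filename.split('-')
print(parts)  # ['LO', 'NTF', 'v', 'Norway.txt']

text = 'Rule  28'  # note the two spaces
on_whitespace = text.split()       # runs of whitespace count as one separator
on_single_space = text.split(' ')  # every single space is a separator

print(on_whitespace)    # ['Rule', '28']
print(on_single_space)  # ['Rule', '', '28']
```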

Joining Strings#

When we have processed the information in the list, we can join() the items into a new string. We could use a new separator for joining the items. For example, in filenames we might use underscores instead of spaces.

line = '_'.join(words)
print(line)
NORWEGIAN_CONFEDERATION_OF_TRADE_UNIONS_(LO)_AND_NORWEGIAN
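
The method join() only accepts strings. If the list contains other types, such as numbers, we can convert each item with str() first. A minimal sketch with a made-up list:

```python
items = ['Rule', 28, 'of', 'the', 'Rules', 'of', 'Court']  # 28 is an integer

# '_'.join(items) would raise a TypeError because of the integer,
# so we convert every item to a string while joining.
joined = '_'.join(str(item) for item in items)
print(joined)  # Rule_28_of_the_Rules_of_Court
```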

Extracting Information#

We want to extract the list of judges from the case into a Python list. The list of judges starts with the President, and ends with the Registrar. We can use these cues to extract the list.

found_start = False
judges = []

with open(filename, encoding='UTF-8') as file:
    for line in file:
        line = line.strip()
        if not found_start:
            if 'president' in line.lower():
                found_start = True
                judges.append(line)
        else:
            judges.append(line)
            if 'registrar' in line.lower():
                break

print(judges)
['Síofra O’Leary, President,', 'Mārtiņš Mits,', 'Stéphanie Mourou-Vikström,', 'Lətif Hüseynov,', 'Jovan Ilievski,', 'Ivana Jelić, judges,', 'Anne Grøstad, ad hoc judge,', 'and Victor Soloveytchik, Section Registrar,']

Here, we use the break statement to stop the loop as soon as we find the registrar. This code still has room for improvement. For example, the extracted names contain commas. This is left as an exercise.

Note

We could easily extract this information by hand from a single document. But with Python code, we can extract the information from thousands of documents in a short time.
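
To reuse the extraction on many documents, one option is to wrap the same logic in a function. The sketch below is one possible way to do this; the name extract_judges is our own choice, and the function accepts any iterable of lines, such as an open file or a list. The sample list is a shortened, hand-made excerpt of the case.

```python
def extract_judges(lines):
    """Collect the lines from the President up to and including the Registrar."""
    judges = []
    found_start = False
    for line in lines:
        line = line.strip()
        if not found_start:
            if 'president' in line.lower():
                found_start = True
                judges.append(line)
        else:
            judges.append(line)
            if 'registrar' in line.lower():
                break
    return judges


# Shortened sample; an open file object could be passed in the same way.
sample = ['Chamber composed of:\n',
          'Síofra O’Leary, President,\n',
          'Mārtiņš Mits,\n',
          'and Victor Soloveytchik, Section Registrar,\n',
          'Having regard to:\n']

sample_judges = extract_judges(sample)
print(sample_judges)
```

Like the original code, this sketch keeps the trailing commas and role descriptions.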

Writing Files#

We can also write data to files. Let’s store the list of judges in a text file.

output_file_name = 'judges.txt'

When we want to open a file for writing, we need to specify write mode by passing 'w' as the mode argument. Note that opening an existing file in 'w' mode erases its previous content.

The mode has the default value 'r' for reading, but for consistency we can specify this parameter even when reading.

with open(output_file_name, 'w', encoding='UTF-8') as outfile:
    pass

Pass Statements

We use pass statements to do nothing in the code block. The with statement and all other statements expecting an indented code block must contain at least one statement to be valid.

Once the file has been opened, we can write to it with a print() statement. We must give print() a file parameter to send the text to a file instead of the console.

with open(output_file_name, 'w', encoding='UTF-8') as outfile:
    print(judges, file=outfile)
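
The code above writes the whole list on a single line, including brackets and quotes. If we would rather have one judge per line, we can loop over the list. The sketch below does this and reads the file back to check the result. It uses a shortened sample list and writes to a temporary directory so that it can run on its own.

```python
import tempfile
from pathlib import Path

judges = ['Síofra O’Leary, President,', 'Mārtiņš Mits,']  # shortened sample list

with tempfile.TemporaryDirectory() as tmp:
    output_path = Path(tmp) / 'judges.txt'

    # Write one judge per line; print() adds the newline for us.
    with open(output_path, 'w', encoding='UTF-8') as outfile:
        for judge in judges:
            print(judge, file=outfile)

    # Read the file back and remove the newlines again.
    with open(output_path, 'r', encoding='UTF-8') as infile:
        lines_read = [line.strip() for line in infile]

print(lines_read)
```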

Exceptions#

When something goes wrong in a program, an exception is raised. An exception is a “signal” that an error has occurred and must be handled. For example, exceptions can occur when user input doesn’t match the expectations. We should handle exceptions that might occur.

For example, trying to open a file that doesn’t exist raises an exception:

filename = 'non-existing-file.txt' # often from user input
with open(filename) as file:
    print(file)
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[13], line 2
      1 filename = 'non-existing-file.txt' # often from user input
----> 2 with open(filename) as file:
      3     print(file)

File /opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/IPython/core/interactiveshell.py:310, in _modified_open(file, *args, **kwargs)
    303 if file in {0, 1, 2}:
    304     raise ValueError(
    305         f"IPython won't let you open fd={file} by default "
    306         "as it is likely to crash IPython. If you know what you are doing, "
    307         "you can use builtins' open."
    308     )
--> 310 return io_open(file, *args, **kwargs)

FileNotFoundError: [Errno 2] No such file or directory: 'non-existing-file.txt'

In this case, we get a FileNotFoundError exception. Unhandled exceptions make the program stop or crash.

Handling Exceptions#

We can handle exceptions with try and except statements.

try:
    with open(filename) as file:
        print(file)
except FileNotFoundError:
    print('no such file:', filename)
no such file: non-existing-file.txt

Now, instead of crashing, the program will keep running.
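
As a sketch of how this can be used in practice, the hypothetical function read_first_line() below returns the first line of a file, or None if the file does not exist. The function name and the two file names are made up for the example, and the files live in a temporary directory.

```python
import tempfile
from pathlib import Path


def read_first_line(filename):
    """Return the first line of the file, or None if the file is missing."""
    try:
        with open(filename, encoding='UTF-8') as file:
            return file.readline().strip()
    except FileNotFoundError:
        print('no such file:', filename)
        return None


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / 'existing.txt'
    existing.write_text('hello\n', encoding='UTF-8')
    missing = Path(tmp) / 'non-existing-file.txt'

    found = read_first_line(existing)
    not_found = read_first_line(missing)  # prints a message, does not crash

print(found)      # hello
print(not_found)  # None
```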

Handling Multiple Exceptions#

We can also handle multiple exceptions, and handle each kind of exception differently.

try:
    with open(filename) as file:
        print(file)
except FileNotFoundError:
    print('no such file:', filename)
except IOError as e:
    print('Error opening file:', e)
no such file: non-existing-file.txt

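We handle more specific exceptions first, then more general exceptions, because Python runs the first except block that matches. In Python 3, IOError is another name for OSError. The most general exception we normally catch is just called Exception.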

try:
    with open(filename) as file:
        print(file)
except FileNotFoundError:
    print('no such file:', filename)
except IOError as e:
    print('Error opening file:', e)
except Exception as e:
    print('Exception:', e)
no such file: non-existing-file.txt
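
The order of the except blocks matters because exceptions form a hierarchy. The sketch below checks a few of these relationships with issubclass() and uses a small helper function, describe(), which is our own invention for the illustration, to show which except block is chosen.

```python
# FileNotFoundError is a special kind of OSError, which is a kind of Exception.
print(issubclass(FileNotFoundError, OSError))  # True
print(issubclass(OSError, Exception))          # True
print(IOError is OSError)                      # True


def describe(error):
    """Raise the given exception and report which except block handles it."""
    try:
        raise error
    except FileNotFoundError:
        return 'file not found'
    except OSError:
        return 'other operating system error'
    except Exception:
        return 'some other exception'


print(describe(FileNotFoundError()))  # file not found
print(describe(PermissionError()))    # other operating system error
print(describe(ValueError()))         # some other exception
```

If the OSError block came first, it would also catch FileNotFoundError, and the more specific block would never run.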

F-strings#

Printing strings that contain many variables can be cumbersome. For example, we could print names and phone numbers like this:

name = 'Peder Ås'
phone = 5367
print('Please call', name, 'at phone number', phone, 'ASAP')
Please call Peder Ås at phone number 5367 ASAP

There are a lot of quotes and commas that need to be in the right place. Some prefer using f-strings, short for formatted strings. F-strings start with an “f” before the first quote:

print(f'Please call {name} at phone number {phone} ASAP')
Please call Peder Ås at phone number 5367 ASAP

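The curly brackets can contain expressions, for example a lookup in a dictionary. The example below assumes a dictionary called clients that maps names to phone numbers, like the one defined in the next section: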

print(f'Please call {name} at phone number {clients[name]} ASAP')
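
A self-contained version of this example could look like the sketch below. The dictionary content is an assumption made for the illustration, and the second f-string shows that method calls are also allowed inside the curly brackets.

```python
clients = {'Peder Ås': 5367}  # assumed dictionary of names and phone numbers
name = 'Peder Ås'

message = f'Please call {name} at phone number {clients[name]} ASAP'
print(message)  # Please call Peder Ås at phone number 5367 ASAP

# Any expression works, for example a method call.
shouted = f'Please call {name.upper()}'
print(shouted)  # Please call PEDER ÅS
```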

Raising Exceptions#

We can raise exceptions ourselves if something goes wrong. We signal an exception with a raise statement. The exception then needs to be handled by some other part of our program.

Say we have a function that is called with some user input as an argument.

clients = {'Peder Ås': 5664,
           'Marte Kirkerud': 8952}
input_from_user = 'Peder Ås'

We check if the user input is correct. If the input is incorrect, we need to signal back to the caller with an exception. In many functions, we can’t show an error message directly to the user because interaction with the user is handled by a different part of the program.

if input_from_user in clients:
    print(f'{input_from_user} has number {clients[input_from_user]}')
else:
    raise ValueError(f'{input_from_user} is not a client')
Peder Ås has number 5664
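
The sketch below shows how the check might look inside a function, and how the caller can handle the exception. The function name lookup_number and the name 'Lars Holm' are made up for the illustration.

```python
clients = {'Peder Ås': 5664,
           'Marte Kirkerud': 8952}


def lookup_number(name):
    """Return the phone number of a client, or raise ValueError."""
    if name not in clients:
        raise ValueError(f'{name} is not a client')
    return clients[name]


# The caller decides how to present the error to the user.
try:
    number = lookup_number('Lars Holm')
    print(f'The number is {number}')
except ValueError as e:
    error_message = str(e)
    print('Invalid input:', error_message)
```

Since 'Lars Holm' is not in the dictionary, this prints "Invalid input: Lars Holm is not a client".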

Opening Multiple Files#

Some data sets store all the data in a single file. But the data can also be split into many smaller files. Then, we will need to iterate over the files to open them. Python has the standard library module pathlib for working with directories (folders) and file names. We use Path from this module.

First, we must import it:

from pathlib import Path

We make a new Path object for our directory “data”.

directory = Path('data')

From the object “directory”, we can make an iterator over the files.

file_iterator = directory.iterdir()

We loop over this iterator to open the files.

for filename in file_iterator:
    with open(filename, encoding='UTF-8') as file:
        print(file)
<_io.TextIOWrapper name='data/case-4.json' mode='r' encoding='UTF-8'>
<_io.TextIOWrapper name='data/case-1.json' mode='r' encoding='UTF-8'>
<_io.TextIOWrapper name='data/case-8.json' mode='r' encoding='UTF-8'>
<_io.TextIOWrapper name='data/case-7.json' mode='r' encoding='UTF-8'>
<_io.TextIOWrapper name='data/case-0.json' mode='r' encoding='UTF-8'>
<_io.TextIOWrapper name='data/case-3.json' mode='r' encoding='UTF-8'>
<_io.TextIOWrapper name='data/case-5.json' mode='r' encoding='UTF-8'>
<_io.TextIOWrapper name='data/LO-NTF-v-Norway.txt' mode='r' encoding='UTF-8'>
<_io.TextIOWrapper name='data/case-9.json' mode='r' encoding='UTF-8'>
<_io.TextIOWrapper name='data/case-6.json' mode='r' encoding='UTF-8'>
<_io.TextIOWrapper name='data/case-2.json' mode='r' encoding='UTF-8'>

Inside the loop, we can handle each file individually.
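
Note that iterdir() also returns subdirectories, and that the order of the files is arbitrary, as the output above shows. The sketch below is one way to get a predictable order and skip anything that is not a file. It builds a small temporary directory with made-up file names so that it can run on its own.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)
    # Made-up content: two files and one subdirectory.
    (directory / 'case-1.json').write_text('{}', encoding='UTF-8')
    (directory / 'case-0.json').write_text('{}', encoding='UTF-8')
    (directory / 'subfolder').mkdir()

    opened = []
    for path in sorted(directory.iterdir()):  # sorted() gives a stable order
        if not path.is_file():                # skip subdirectories
            continue
        with open(path, encoding='UTF-8') as file:
            opened.append(path.name)

print(opened)  # ['case-0.json', 'case-1.json']
```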

Opening Files by Type#

If we have a directory that contains multiple file types, we might want to open only some of the files. We can filter on the filename extension using the method .glob(pattern).

filenames = directory.glob('*.txt')
print(list(filenames))
[PosixPath('data/LO-NTF-v-Norway.txt')]
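
The method glob() returns a generator, which can only be looped over once. If we need the matches more than once, we can store them in a list first. The sketch below illustrates the difference, using a temporary directory with made-up file names.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)
    for name in ['case-0.json', 'notes.txt', 'LO-NTF-v-Norway.txt']:
        (directory / name).write_text('', encoding='UTF-8')

    matches = directory.glob('*.txt')
    first_pass = sorted(path.name for path in matches)
    second_pass = sorted(path.name for path in matches)  # generator is used up

print(first_pass)   # ['LO-NTF-v-Norway.txt', 'notes.txt']
print(second_pass)  # []
```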

Filename Extensions

Filenames usually have two parts, the stem and the extension, separated by a period. For example, the file “article.docx” has the stem “article” and the extension “docx”. The extension identifies the file type.
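
Path objects give access to these parts through the attributes name, stem and suffix. Note that suffix includes the period. A short sketch:

```python
from pathlib import Path

path = Path('data/LO-NTF-v-Norway.txt')

print(path.name)    # LO-NTF-v-Norway.txt
print(path.stem)    # LO-NTF-v-Norway
print(path.suffix)  # .txt
```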

Opening Only the First Few Files#

If we have a large data set, it can take time to process. During development, we might want to run our code with only the first few files. We can use range() together with next() for this. Below, we open only the first five files.

file_iterator = directory.iterdir()

for index in range(5):
    filename = next(file_iterator)
    with open(filename, encoding='UTF-8') as file:
        print(file)
<_io.TextIOWrapper name='data/case-4.json' mode='r' encoding='UTF-8'>
<_io.TextIOWrapper name='data/case-1.json' mode='r' encoding='UTF-8'>
<_io.TextIOWrapper name='data/case-8.json' mode='r' encoding='UTF-8'>
<_io.TextIOWrapper name='data/case-7.json' mode='r' encoding='UTF-8'>
<_io.TextIOWrapper name='data/case-0.json' mode='r' encoding='UTF-8'>
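
If the directory contains fewer than five files, next() raises a StopIteration exception. One alternative is islice() from the standard library module itertools, which simply stops early when the iterator runs out. The sketch below uses a plain list of made-up file names in place of directory.iterdir().

```python
from itertools import islice

# Stand-in for directory.iterdir(); only three names are available.
file_names = ['case-4.json', 'case-1.json', 'case-8.json']

# Ask for at most five items; islice() stops after three without an error.
first_five = list(islice(file_names, 5))
print(first_five)  # ['case-4.json', 'case-1.json', 'case-8.json']
```

With a real directory, the same idea would be written as islice(directory.iterdir(), 5).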