(files_exceptions)=
# Files and Exceptions

We often need to read data from files. With Python we can read many different formats, for example Word documents, PDF documents, or tabular data in Excel or CSV format.
Some of these file formats require third-party libraries.
Here, we will look at reading from and writing to plain text files.

## Opening Files
We can open a file containing the introduction to an ECtHR case: 
{download}`LO-NTF-v-Norway.txt<./LO-NTF-v-Norway.txt>`

In [None]:
filename = 'LO-NTF-v-Norway.txt'

Python has the function `open()` for opening files.
We should specify the text encoding, which is often UTF-8. If we leave it out, Python falls back to a default that can differ between computers.

We could use this function directly and assign the result to a variable.

In [None]:
file = open(filename, encoding='UTF-8')

However, open files consume system resources. Therefore, we must always remember to close files when we are finished with them.
A large program or web application that keeps opening files without closing them will eventually hit the operating system's limit on open files and fail.

Python can automatically close files for us if we use the `with` statement.

In [None]:
with open(filename, encoding='UTF-8') as file:
    print(file)

```{caution}
*Always* use `with` when opening files.
```

```{note}
Notice that the `print()` call above prints a description of the file object, not the file content.
We must use methods of this object to get the content.
```
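
The following is a minimal sketch of what `with` does for us, using the `closed` attribute of the file object. The temporary directory, the name `example_path`, and the file content are made up for this illustration so that the example runs anywhere.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Set-up only: create a small example file (not the ECtHR case file)
    example_path = os.path.join(tmp_dir, 'example.txt')
    with open(example_path, 'w', encoding='UTF-8') as outfile:
        outfile.write('first line\n')

    with open(example_path, encoding='UTF-8') as file:
        open_inside = not file.closed   # the file is open inside the block

    closed_after = file.closed          # the file was closed when the block ended

print(open_inside, closed_after)   # True True
```

The variable `file` still exists after the `with` block, but the file it refers to has been closed.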

## Reading the File Content

We can use the method `readline()` to read single lines from the file.

In [None]:
with open(filename, encoding='UTF-8') as file:
    print(file.readline())
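
As a small sketch of how `readline()` behaves when called repeatedly: each call returns the next line, and an empty string signals the end of the file. The example file and its two lines are made up for this illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Set-up only: create a two-line example file
    example_path = os.path.join(tmp_dir, 'example.txt')
    with open(example_path, 'w', encoding='UTF-8') as outfile:
        outfile.write('first line\nsecond line\n')

    with open(example_path, encoding='UTF-8') as file:
        first = file.readline()    # 'first line\n'
        second = file.readline()   # 'second line\n'
        at_end = file.readline()   # '' because nothing is left to read

# repr() shows the newline characters explicitly
print(repr(first), repr(second), repr(at_end))
```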

We can also use a `for` loop to process the file line by line.

In [None]:
with open(filename, encoding='UTF-8') as file:
    for line in file:
        print(line)

## Removing Whitespace

When we print the file content above, we get a blank line between each line.
This is because each line we read ends with a newline, `\n`, and `print()` also inserts a newline.
To avoid this, we should remove *leading* and *trailing* whitespace with the string method `strip()`.

In [None]:
with open(filename, encoding='UTF-8') as file:
    for line in file:
        line = line.strip()
        print(line)
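
To see exactly what is removed, here is a short sketch comparing `strip()` with the related string method `rstrip()`. The example string is made up; `repr()` is used so that the whitespace becomes visible.

```python
example_line = '    indented text  \n'   # made-up line with whitespace on both sides

both_sides = example_line.strip()        # removes leading and trailing whitespace
right_side = example_line.rstrip()       # removes only trailing whitespace
only_newline = example_line.rstrip('\n') # removes only trailing newlines

print(repr(both_sides))    # 'indented text'
print(repr(right_side))    # '    indented text'
print(repr(only_newline))  # '    indented text  '
```

If the indentation of a line carries meaning, `rstrip('\n')` may be the safer choice, since `strip()` also removes leading spaces.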

## Splitting Strings

Sometimes we need to process text word by word.
To do this, we can use the string method `split()`, which splits a string on whitespace by default.
We can also specify some other character to split on.

In [None]:
with open(filename, encoding='UTF-8') as file:
    line = file.readline()
    line = line.strip()
    words = line.split()
    print(words)
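
A brief sketch of splitting on another character, here a semicolon. The record below is made up for illustration.

```python
record = 'Peder Ås;5664;Oslo'   # made-up, semicolon-separated example

fields = record.split(';')
print(fields)   # ['Peder Ås', '5664', 'Oslo']

# The default split treats any run of whitespace as one separator ...
default_parts = 'one   two three'.split()
print(default_parts)   # ['one', 'two', 'three']

# ... while an explicit single space does not, which gives empty strings
space_parts = 'one   two three'.split(' ')
print(space_parts)   # ['one', '', '', 'two', 'three']
```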

## Joining Strings
When we have processed the information in the list, we can `join()` the items into a new string.
We could use a new separator for joining the items.
For example, in filenames we might use underscores instead of spaces.

In [None]:
line = '_'.join(words)
print(line)

## Extracting Information

We want to extract the list of judges from the case into a Python list.
The list of judges starts with the President, and ends with the Registrar.
We can use these cues to extract the list.

In [None]:
found_start = False
judges = []

with open(filename, encoding='UTF-8') as file:
    for line in file:
        line = line.strip()
        if not found_start:
            if 'president' in line.lower():
                found_start = True
                judges.append(line)
        else:
            judges.append(line)
            if 'registrar' in line.lower():
                break

print(judges)

Here, we use the statement `break` to stop the loop as soon as we find the registrar.
This code still has room for improvement. For example, the extracted names contain commas.
This is left as an exercise.

```{note}
We could easily extract this information by hand from a single document.
But with Python code, we can extract the information from *thousands* of documents in a short time.
```

## Writing Files

We can also write data to files. Let's store the list of judges in a text file.

In [None]:
output_file_name = 'judges.txt'

When we want to open a file for writing, we need to specify *writing mode*, with the mode parameter `'w'`.

The mode has the default value `'r'` for reading, but for consistency we can specify this parameter even when reading.

In [None]:
with open(output_file_name, 'w', encoding='UTF-8') as outfile:
    pass

```{admonition} Pass Statements
We use `pass` statements to do nothing in the code block.
The `with` statement and all other statements expecting an indented code block must contain at least one statement to be valid.
```

Once the file has been opened, we can write to it with a `print()` statement.
We must give `print()` a `file` parameter to send the text to a file instead of the console.

In [None]:
with open(output_file_name, 'w', encoding='UTF-8') as outfile:
    print(judges, file=outfile)
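
Printing a whole list writes it in Python's list notation, with brackets and quotes. If we would rather have one item per line, one possible approach is to loop over the list. The sketch below uses placeholder names and a temporary file, both made up for illustration, and reads the file back to check the result.

```python
import os
import tempfile

example_names = ['Name One', 'Name Two', 'Name Three']   # placeholder data

with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, 'names.txt')

    # Write one item per line
    with open(example_path, 'w', encoding='UTF-8') as outfile:
        for item in example_names:
            print(item, file=outfile)

    # Read the file back, removing the newline from each line
    lines_read = []
    with open(example_path, encoding='UTF-8') as infile:
        for line_read in infile:
            lines_read.append(line_read.strip())

print(lines_read)   # ['Name One', 'Name Two', 'Name Three']
```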

## Exceptions

When something goes wrong in a program, an *exception* is *raised*.
An exception is a "signal" that an error has occurred and must be handled.
For example, exceptions can occur when user input doesn't match the expectations.
We should handle exceptions that might occur.

For example, trying to open a file that doesn't exist raises an exception:

In [None]:
filename = 'non-existing-file.txt' # often from user input
with open(filename) as file:
    print(file)

In this case, we get a `FileNotFoundError` exception.
Unhandled exceptions make the program stop or crash.

## Handling Exceptions
We can handle exceptions with `try` and `except` statements.

In [None]:
try:
    with open(filename) as file:
        print(file)
except FileNotFoundError:
    print('no such file:', filename)

Now, instead of crashing, the program will keep running.

## Handling Multiple Exceptions

We can also handle multiple exceptions, and we can handle each kind of exception differently.

In [None]:
try:
    with open(filename) as file:
        print(file)
except FileNotFoundError:
    print('no such file:', filename)
except IOError as e:
    print('Error opening file:', e)

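We handle more specific exceptions first, then more general ones, because Python uses the first `except` clause that matches. `FileNotFoundError` is a special case of `IOError` (another name for `OSError` in Python 3), which in turn is a special case of `Exception`, the general class that almost all errors derive from.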

In [None]:
try:
    with open(filename) as file:
        print(file)
except FileNotFoundError:
    print('no such file:', filename)
except IOError as e:
    print('Error reading from file:', e)
except Exception as e:
    print('Exception:', e)
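
The ordering matters because exceptions form a hierarchy. The sketch below checks that hierarchy with `issubclass()` and wraps the pattern above in a hypothetical helper function, `describe_open_error()`, which is not part of Python or of this chapter's code. Trying to open a directory as if it were a file is used here to provoke an error other than `FileNotFoundError`.

```python
import os
import tempfile

# FileNotFoundError is a special kind of OSError, which is a special kind of Exception
print(issubclass(FileNotFoundError, OSError))   # True
print(issubclass(OSError, Exception))           # True
print(IOError is OSError)                       # True: two names for the same class

def describe_open_error(path):
    """Hypothetical helper: report what happens when we try to open path."""
    try:
        with open(path, encoding='UTF-8') as file:
            return 'opened'
    except FileNotFoundError:      # most specific first
        return 'no such file'
    except OSError:                # more general: any other operating system error
        return 'other operating system error'

with tempfile.TemporaryDirectory() as tmp_dir:
    missing = describe_open_error(os.path.join(tmp_dir, 'missing.txt'))
    directory_result = describe_open_error(tmp_dir)   # a directory, not a file

print(missing)            # no such file
print(directory_result)   # other operating system error
```

If the `except OSError` clause came first, it would also catch the `FileNotFoundError`, and the more specific clause would never run.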

## F-strings
Printing strings that contain many variables can be cumbersome.
For example, we could print names and phone numbers like this:

In [None]:
name = 'Peder Ås'
phone = 5367
print('Please call', name, 'at phone number', phone, 'ASAP')

There are a lot of quotes and commas that need to be in the right place.
Some prefer using f-strings, short for formatted strings.
F-strings start with an "f" before the first quote:

In [None]:
print(f'Please call {name} at phone number {phone} ASAP')

The curly brackets can contain expressions, for example a dictionary lookup:

In [None]:
clients = {'Peder Ås': 5664, 'Marte Kirkerud': 8952}  # dictionary of names and phone numbers
print(f'Please call {name} at phone number {clients[name]} ASAP')

## Raising Exceptions
We can also raise exceptions ourselves when our code detects that something has gone wrong.
We signal an exception with a `raise` statement.
The exception needs to be handled by some other part of our program.

Say we have a function that is called with some user input as an argument.

In [None]:
clients = {'Peder Ås': 5664,
           'Marte Kirkerud': 8952}
input_from_user = 'Peder Ås'

We check if the user input is correct.
If the input is incorrect, we need to signal back to the caller with an exception.
In many functions, we can't show an error message directly to the user because interaction with the user is handled by a different part of the program.


In [None]:
if input_from_user in clients:
    print(f'{input_from_user} has number {clients[input_from_user]}')
else:
    raise ValueError(f'{input_from_user} is not a client')
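
To illustrate the division of labour between the function that raises and the caller that handles, here is a minimal sketch. The function names `lookup_phone()` and `handle_request()`, the variable `example_clients`, and the name 'Lars Holm' are hypothetical and chosen only for this example.

```python
example_clients = {'Peder Ås': 5664,
                   'Marte Kirkerud': 8952}

def lookup_phone(client_name, clients):
    """Return the phone number, or raise ValueError for unknown names."""
    if client_name not in clients:
        raise ValueError(f'{client_name} is not a client')
    return clients[client_name]

def handle_request(client_name):
    """The caller decides how an error is presented to the user."""
    try:
        number = lookup_phone(client_name, example_clients)
    except ValueError as e:
        return f'Error: {e}'
    return f'{client_name} has number {number}'

print(handle_request('Peder Ås'))    # Peder Ås has number 5664
print(handle_request('Lars Holm'))   # Error: Lars Holm is not a client
```

Here `lookup_phone()` never prints anything itself. It only signals the problem, and `handle_request()` turns that signal into a message.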

## Opening Multiple Files
Some data sets store all the data in a single file.
But the data can also be split into many smaller files.
Then, we will need to iterate over the files to open them.
Python has the library `pathlib` for working with directories (folders) and file names.
We use `Path` from this library.

First, we must import it:


In [None]:
from pathlib import Path

We make a new `Path` object for our directory "data".

In [None]:
directory = Path('data')


From the object "directory", we can make an iterator over the files.

In [None]:
file_iterator = directory.iterdir()

We loop over this iterator to open the files.

In [None]:
for filename in file_iterator:
    with open(filename, encoding='UTF-8') as file:
        print(file)

Inside the loop, we can handle each file individually.
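
One thing to be aware of is that `iterdir()` also yields subdirectories, and trying to `open()` a directory raises an exception. A possible way to guard against this is the `Path` method `is_file()`. The sketch below builds a small temporary directory with made-up file names so that it can run anywhere.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Set-up only: two text files and one subdirectory
    example_dir = Path(tmp_dir)
    (example_dir / 'a.txt').write_text('first file\n', encoding='UTF-8')
    (example_dir / 'b.txt').write_text('second file\n', encoding='UTF-8')
    (example_dir / 'subfolder').mkdir()

    first_lines = {}
    # sorted() gives a predictable order; iterdir() itself promises none
    for path in sorted(example_dir.iterdir()):
        if path.is_file():   # skip subdirectories
            with open(path, encoding='UTF-8') as file:
                first_lines[path.name] = file.readline().strip()

print(first_lines)   # {'a.txt': 'first file', 'b.txt': 'second file'}
```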

## Opening Files by Type
If we have a directory that contains multiple file types, we might want to open only some of the files.
We can filter on the *filename extension* using the method `.glob(pattern)`.


In [None]:
filenames = directory.glob('*.txt')
print(list(filenames))


:::{admonition} Filename Extensions
Filenames usually have two parts, the *stem* and the *extension*, separated by a period.
For example, the file "article.docx" has the stem "article" and the extension "docx".
The extension identifies the file type.
:::
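
`Path` objects give direct access to these parts. A short sketch, using the file name from the box above (the file does not need to exist):

```python
from pathlib import Path

path = Path('article.docx')

print(path.name)     # article.docx
print(path.stem)     # article
print(path.suffix)   # .docx
```

Note that `suffix` includes the leading period.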

## Opening Only the First Few Files
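If we have a large data set, it can take time to process.
During development, we might want to run our code with only the first few files.
We can use `range()` together with the built-in function `next()`, which fetches the next item from an iterator.
Below, we open only the first five files. Note that `next()` raises a `StopIteration` exception if the directory contains fewer than five files.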


In [None]:
file_iterator = directory.iterdir()

for index in range(5):
    filename = next(file_iterator)
    with open(filename, encoding='UTF-8') as file:
        print(file)
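
An alternative, sketched here under the assumption that we are free to use the standard library module `itertools`, is the function `islice()`. It stops after at most five items and simply ends early if there are fewer. The temporary directory and file names are made up for illustration.

```python
import tempfile
from itertools import islice
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Set-up only: a directory with three files, fewer than five
    example_dir = Path(tmp_dir)
    for number in range(3):
        (example_dir / f'file{number}.txt').write_text('text\n', encoding='UTF-8')

    opened = []
    # islice() yields at most 5 items and does not fail when there are fewer
    for path in islice(sorted(example_dir.iterdir()), 5):
        with open(path, encoding='UTF-8') as file:
            opened.append(path.name)

print(opened)   # ['file0.txt', 'file1.txt', 'file2.txt']
```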