DS2000 (Spring 2019, NCH) :: Lecture 6a

0. Administrivia

  1. Wednesday (in Practicum): in-class quiz (via Blackboard; no book/notes/Python)
    • Review PCQ 3/4
    • Dropping lowest ICQ
  2. Due Friday @ 9pm: HW5 (submit via Blackboard)
  3. No HW10 work (doubling best HW grade)

1. Why Files?

A file is a named unit of information stored on your computer that persists even when the computer has no power.

  • Anything in a Python variable is lost when the program ends (unless it was hard-coded into the source code)
  • Typing large amounts of data via input is tedious and error-prone
  • Files can be transmitted from other computers (later in the course we'll talk about another way, an API, that allows two programs to transfer information)

1a. Where Do Files Come From?

Files can be created by a user (e.g., in Atom, File -> New File) or as the output of another program.

The files we primarily cover in this class contain text and numbers (and will usually have file extensions like txt=text, csv=comma-separated values, and data). Other files contain binary data (e.g., images), but we don't cover those.

2. A File is a Resource; Resources are Opened, Used, then Closed

When working with files, the common pattern is to first open them, then use them (read and/or write), then close them (so that they can be used by other people/programs).

In this class we focus on reading from a file (i.e., getting information out of it), but writing to a file (i.e., adding/changing information) is quite similar (and covered in the book).

In [2]:
# Open a file by providing its name and a "mode" (for us, r=read)
myfile = open("myfile.txt", "r")

# The myfile variable now lets you interact with the contents
# of the file via various functions
filecontents = myfile.read()

# CLOSE the file
myfile.close()

# Continue
print(filecontents)
hello there
this is the content of a file
1 2 3
:)
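
Writing follows the same open/use/close pattern. The following is only a minimal sketch: the file name notes.txt and the temporary directory are made up for illustration, and the "w" (write) and "a" (append) modes are standard options of Python's open function that these notes do not otherwise cover.

```python
import os
import tempfile

# Hypothetical file name, placed in a temporary directory
# so that nothing on your computer is overwritten
path = os.path.join(tempfile.mkdtemp(), "notes.txt")

# "w" = write: creates the file (or erases it first if it exists)
outfile = open(path, "w")
outfile.write("first line\n")
outfile.close()

# "a" = append: adds to the end of the existing contents
outfile = open(path, "a")
outfile.write("second line\n")
outfile.close()

# Read the file back to confirm what was stored
infile = open(path, "r")
written = infile.read()
infile.close()

print([written])  # ['first line\nsecond line\n']
```

Note that write does not add a line break for you; each "\n" has to be included explicitly.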

Since it is quite easy to forget to close a file, or for your program to stop (e.g., because of an error) before it reaches the close function, a safer way to code is to use the following...

In [3]:
# The "with" promises to close myfile
# once Python completes the next code block
# for any reason
with open("myfile.txt", "r") as myfile:
    filecontents = myfile.read()

# The file is now closed
print(filecontents)
hello there
this is the content of a file
1 2 3
:)
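
To see that "with" really does close the file "for any reason", here is a small sketch in which the code block fails partway through. The file sample.txt and its contents are made up for illustration; the closed attribute is a standard property of Python file objects.

```python
import os
import tempfile

# Hypothetical sample file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as sample:
    sample.write("12\nnot a number\n")

try:
    with open(path, "r") as myfile:
        for line in myfile:
            # int() raises a ValueError on the second line
            value = int(line)
except ValueError:
    print("Could not convert a line to an integer")

# Even though the block ended with an error, the file was closed
print(myfile.closed)  # True
```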

2a. File Paths

If the file is in the directory your program is running from (typically the same directory as your source file), simply provide its name. Otherwise, you will have to supply a path (i.e., the sequence of directories that leads to the file).

For example... open("/dir1/dir2/myfile.txt", "r")
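
Paths can also be built in code. This is a sketch under assumptions: the dir1/dir2 layout mirrors the example above but is created inside a temporary directory, and os.path.join and os.path.exists come from Python's standard os module rather than from the lecture.

```python
import os
import tempfile

# Hypothetical layout: <temporary directory>/dir1/dir2/myfile.txt
base = tempfile.mkdtemp()
folder = os.path.join(base, "dir1", "dir2")
os.makedirs(folder)

# os.path.join inserts the correct separator for your operating system
path = os.path.join(folder, "myfile.txt")

with open(path, "w") as myfile:
    myfile.write("hello there\n")

# Check whether a file exists before trying to read it
print(os.path.exists(path))                                 # True
print(os.path.exists(os.path.join(folder, "missing.txt")))  # False

with open(path, "r") as myfile:
    path_contents = myfile.read()

print([path_contents])  # ['hello there\n']
```

Trying to open a file for reading at a path that does not exist stops the program with a FileNotFoundError, so a wrong path is usually easy to spot.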

3. Reading Data

There are multiple ways to get data out of a file...

In [6]:
# Read the entire file as a giant string
with open("myfile.txt", "r") as myfile:
    filecontents = myfile.read()

# Note that all line breaks in the file are actually in the variable
# (wrapping the string in a list makes print show them as \n)
print([filecontents])

# Once in a variable, we can treat this as any other string
print(filecontents.split())
['hello there\nthis is the content of a file\n1 2 3\n:)\n']
['hello', 'there', 'this', 'is', 'the', 'content', 'of', 'a', 'file', '1', '2', '3', ':)']
In [15]:
# Sometimes we care about data line-by-line (think spreadsheets)
# There are a few ways to do this...

# Loop over all the lines
with open("myfile.txt", "r") as myfile:
    for line in myfile:
        # Again, notice the newline
        print([line])

print()
        
# Read all the lines into a list of strings
with open("myfile.txt", "r") as myfile:
    lines = myfile.readlines()

print(lines)
# for line in lines:
#     print(line)

print()

# Read a line at a time (e.g., if line contents are different)
with open("myfile.txt", "r") as myfile:
    firstline = myfile.readline()
    print(['FIRST', firstline])
    
    for nextline in myfile:
        print([nextline])
['hello there\n']
['this is the content of a file\n']
['1 2 3\n']
[':)\n']

['hello there\n', 'this is the content of a file\n', '1 2 3\n', ':)\n']

['FIRST', 'hello there\n']
['this is the content of a file\n']
['1 2 3\n']
[':)\n']

Notice that, just as with the input function, all data comes in as a string.
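
Because everything arrives as a string, numbers have to be converted before doing arithmetic. A minimal sketch, assuming a made-up file numbers.txt whose lines contain only whitespace-separated numbers:

```python
import os
import tempfile

# Hypothetical data file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "numbers.txt")
with open(path, "w") as datafile:
    datafile.write("1 2 3\n4.5 6\n")

row_totals = []
with open(path, "r") as myfile:
    for line in myfile:
        # split() breaks the line on whitespace and drops the newline
        pieces = line.split()

        row_total = 0
        for piece in pieces:
            # Each piece is still a string (e.g., "4.5"),
            # so convert it before adding
            row_total += float(piece)

        row_totals.append(row_total)

print(row_totals)  # [6.0, 10.5]
```

Without the conversion, "1" + "2" would produce the string "12" rather than the number 3.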

4. Examples

Let's sum all the digits in a file (here, digits that stand alone, separated from other text by whitespace).

In [18]:
import string

with open("myfile.txt", "r") as myfile:
    contents = myfile.read().split()

print(contents)

# Convert string of 0-9 into
# list of individual digits
digits = list(string.digits)
print(digits)

# Use "total" rather than "sum", since sum is
# the name of a built-in Python function
total = 0
for element in contents:
    if element in digits:
        total += int(element)

print("Sum of all digits in the file: {}".format(total))
['hello', 'there', 'this', 'is', 'the', 'content', 'of', 'a', 'file', '1', '2', '3', ':)']
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Sum of all digits in the file: 6
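
The example above only counts a digit when it stands alone between whitespace, so a token such as 12 or abc7 contributes nothing. A variant sketch, not from the lecture, that instead looks at every character; the file mixed.txt and its contents are made up for illustration:

```python
import os
import string
import tempfile

# Hypothetical file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "mixed.txt")
with open(path, "w") as datafile:
    datafile.write("room 12\n1 2 3\nabc7\n")

with open(path, "r") as myfile:
    text = myfile.read()

digit_total = 0
# Looping over a string visits one character at a time
for ch in text:
    if ch in string.digits:
        digit_total += int(ch)

print(digit_total)  # 1+2 + 1+2+3 + 7 = 16
```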

How about using the dictionary file from HW, along with a variant of the Hamming distance from earlier in the semester, to make a quick spell checker?

In [16]:
# Let's say that the distance
# between two words is the difference
# in their lengths + number
# of different letters
def word_dist(w1, w2):
    dist = abs(len(w1) - len(w2))
    # zip stops at the end of the shorter word, so only
    # the overlapping letters are compared
    for l1, l2 in zip(w1, w2):
        if l1 != l2:
            dist += 1
    return dist

print(word_dist('apple', 'apple'))
print(word_dist('apple', 'orange'))
print(word_dist('apple', 'appie!'))

def find_closest(w, words_fname):
    with open(words_fname, "r") as wfile:
        candidate = wfile.readline().strip()
        closest = [candidate]
        closest_dist = word_dist(w, candidate)
        
        for candidate in wfile:
            candidate = candidate.strip()
            dist = word_dist(w, candidate)
            if dist < closest_dist:
                closest = [candidate]
                closest_dist = dist
            elif dist == closest_dist:
                closest.append(candidate)
    
    return closest, closest_dist

def spellcheck(w):
    closest, closest_dist = find_closest(w.lower(), 'words_alpha.txt')
    if closest_dist == 0:
        return True
    else:
        return closest
    
print(spellcheck('apple'))
print(spellcheck('appie'))
print(spellcheck('craezee'))
0
6
2
True
['apple', 'eppie']
['broker', 'brose', 'broses', 'groser', 'proser']
['craizey']
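
The spell checker above re-reads the word file on every call. One possible variation, sketched here under assumptions, reads the file once with a helper and then searches the resulting list. The names load_words and find_closest_in_list are hypothetical, and a tiny made-up word list stands in for words_alpha.txt.

```python
import os
import tempfile

def word_dist(w1, w2):
    # Same distance as above: difference in lengths
    # plus number of differing letters
    dist = abs(len(w1) - len(w2))
    for l1, l2 in zip(w1, w2):
        if l1 != l2:
            dist += 1
    return dist

def load_words(words_fname):
    # Read the file once and keep the words in a list
    words = []
    with open(words_fname, "r") as wfile:
        for line in wfile:
            word = line.strip()
            if word != "":
                words.append(word)
    return words

def find_closest_in_list(w, words):
    closest = []
    closest_dist = None
    for candidate in words:
        dist = word_dist(w, candidate)
        if closest_dist is None or dist < closest_dist:
            closest = [candidate]
            closest_dist = dist
        elif dist == closest_dist:
            closest.append(candidate)
    return closest, closest_dist

# Hypothetical tiny word list standing in for words_alpha.txt
path = os.path.join(tempfile.mkdtemp(), "words_small.txt")
with open(path, "w") as wfile:
    wfile.write("apple\neppie\norange\n")

words = load_words(path)
print(find_closest_in_list("apple", words))  # (['apple'], 0)
print(find_closest_in_list("appie", words))  # (['apple', 'eppie'], 1)
```

The trade-off is memory: the whole word list stays in a variable, which is fine for a dictionary file but may not be for very large files.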