# DS2500 Lesson4

Jan 24, 2023

Content:
- defaultdict
- imports
    - random.choices
- numpy & arrays

Admin:
- lab
    - due: weds @ 11:59 PM
    - still stuck? see lab digest session this weds 1030-1130
- hw resource: office hours & piazza



# `defaultdict`

A `defaultdict` lets you specify a default value that is used for any key which is not yet in the dictionary.


In [1]:
# ordinarily, if you try to lookup a key not in the dictionary -> KeyError
normal_dict = {'a': 3, 'b': 65}
normal_dict['c']


KeyError: 'c'

In [2]:
x = dict()

In [4]:
# a defaultdict allows you to specify a default value 
# this default is used if one attempts to access a key not in the dictionary
from collections import defaultdict

def_dict_variable = defaultdict(lambda: 10)

# look, there isn't an error ... even though 'a' isn't a key
def_dict_variable['a']


10

In [6]:
# notice that after we access key 'a', it's stored
def_dict_variable


defaultdict(<function __main__.<lambda>()>, {'a': 10})

In [9]:
def_dict_test = defaultdict(int)
def_dict_test['a']

0

In [12]:
int()

0

In [8]:
# default dictionaries are useful for counting (hw0 hint ...)
char_count = defaultdict(lambda: 0)

for char in 'aosuifasduiasdfuasduf':
    # add 1 to total number of times character is seen
    char_count[char] = char_count[char] + 1
    
char_count


defaultdict(<function __main__.<lambda>()>,
            {'a': 4, 'o': 1, 's': 4, 'u': 4, 'i': 2, 'f': 3, 'd': 3})
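
The same idea works with other default factories. As a small sketch (the word list and the name `words_by_letter` are made up for illustration), `defaultdict(list)` starts every missing key as an empty list, which is handy for grouping items. Note also that `defaultdict(int)` behaves like `defaultdict(lambda: 0)`, since `int()` returns `0`.

```python
from collections import defaultdict

# every missing key starts out as an empty list
words_by_letter = defaultdict(list)

for word in ['apple', 'avocado', 'banana', 'blueberry', 'cherry']:
    # group each word under its first letter
    words_by_letter[word[0]].append(word)

print(dict(words_by_letter))
# {'a': ['apple', 'avocado'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}
```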

# Imports

`import` statements allow us to use code stored in another file (our own local file, or maybe some module we installed).


## `import`ing a local file

Let's first make a `.py` file, adjacent to this `.ipynb`, which has some code in it:


In [13]:
s_file_contents = '''
some_secret_api_key = 18451982

def print_greeting(name, language='english'):
    """ prints a greeting in english or spanish
    
    Args:
        name (str): name to greet
        language (str): 'english' or 'spanish'
    """
    
    str_greet_dict = {'english': 'hello {name}!',
                      'spanish': 'hola {name}!'}
    
    # print message
    str_greet = str_greet_dict[language]
    print(str_greet.format(name=name))
'''

# this will print the string above into the file "greet.py" (overwrite if file exists)
with open('greet.py', 'w') as f:
    print(s_file_contents, file=f)


In [15]:
# run the local file greet.py, put its contents in the variable "greet"
import greet

# you can access the contents of greet with a period character
greet.some_secret_api_key


18451982

In [16]:
greet.print_greeting('sal')


hello sal!


In [17]:
greet.print_greeting('sal', language='spanish')


hola sal!


# `import`ing is "lazy"

Within a single Python session, a module's code runs only the first time it is imported.  

For example, the `import greet` below does not re-run `greet.py`; it just re-uses the same `greet` module created by the first import.  

(note to self: modify greet.py to demonstrate)


In [1]:
import greet

greet.some_secret_api_key


'this is a new secret api key'
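
Here is a minimal, self-contained sketch of this caching behavior. It does not use `greet.py`; instead it writes a throwaway module (the name `demo_settings` and its `value` variable are hypothetical) into a temporary folder. It also assumes `importlib.reload` is an acceptable way to force a re-run, which is one option alongside restarting the kernel.

```python
import importlib
import os
import sys
import tempfile

# assumption for this demo: skip .pyc caching so the rewritten file is always re-read
sys.dont_write_bytecode = True

# make a temporary folder importable and put a tiny module in it
tmp_dir = tempfile.mkdtemp()
sys.path.insert(0, tmp_dir)
module_path = os.path.join(tmp_dir, 'demo_settings.py')

with open(module_path, 'w') as f:
    f.write('value = 1\n')

importlib.invalidate_caches()
import demo_settings
first = demo_settings.value

# change the file on disk ...
with open(module_path, 'w') as f:
    f.write('value = 2222\n')

# ... a second import does NOT re-run the file, it re-uses the cached module
import demo_settings
second = demo_settings.value

# importlib.reload explicitly re-runs the module's code
importlib.reload(demo_settings)
third = demo_settings.value

print(first, second, third)   # 1 1 2222
```

In a notebook, restarting the kernel has a similar effect: the next `import` starts from a clean session and runs the file again.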

# `import`ing a module somebody else wrote

We use `import` to access code from some python module.

For example, we can import the random module and its function [random.choices](https://docs.python.org/3/library/random.html#random.choices) (useful for HW0!)


In [6]:
import random

population = 'red pill', 'blue pill'
random.choices(population)


['red pill']

### tip: skim the documentation for other keyword arguments ... they're often helpful

- `k`: how many random samples we draw
- `weights`: how likely each item in population is to be drawn
    - `weights[0]` is the weight of choosing `population[0]`
    - weights are relative, so they need not sum to 1; passing a probability distribution (e.g. `[.9, .1]`) does what you expect


In [7]:
random.choices(population, k=4)



['red pill', 'blue pill', 'red pill', 'blue pill']

In [9]:
population

('red pill', 'blue pill')

In [8]:
# red pill occurs 90% of the time, blue pill 10% of the time
random.choices(population, k=10, weights=[.9, .1])



['red pill',
 'blue pill',
 'blue pill',
 'red pill',
 'red pill',
 'red pill',
 'red pill',
 'red pill',
 'red pill',
 'red pill']
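
With only 10 draws it is hard to see whether the weights are respected. The sketch below draws many more samples and counts them; the seed value and the sample size of 10,000 are arbitrary choices for illustration. It also uses the relative weights `[9, 1]`, which describe the same 90% / 10% split as `[.9, .1]`.

```python
import random
from collections import Counter

random.seed(0)  # fix the seed so the run is repeatable

population = ['red pill', 'blue pill']

# relative weights: red pill is 9 times as likely as blue pill
draws = random.choices(population, weights=[9, 1], k=10_000)

# tally how often each item was drawn
counts = Counter(draws)
frac_red = counts['red pill'] / len(draws)

print(counts)
print(frac_red)   # should land close to 0.9
```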

# `import`

Convenience syntaxes for shortening code

- `from random import choices`
- `import numpy as np`


In [19]:
import random

x = ['red', 'blue', 'green']
random.shuffle(x)
x

['blue', 'green', 'red']

In [10]:
# by importing with <from module import item> we can call item directly
from random import choices

choices(['a', 'b', 'c'], k=10)


['a', 'a', 'b', 'a', 'b', 'c', 'a', 'b', 'c', 'c']

In [21]:
# imports numpy and binds it to the shorter name "np"
import numpy as np

# make a numpy array (we'll see this again shortly)
np.array([[1, 2],
          [3, 4]])


array([[1, 2],
       [3, 4]])

# In Class Activity A

In a particular game a die is rolled such that:
- a player earns 10 points 1/2 of the time
- a player earns 20 points 1/3 of the time
- a player earns 30 points 1/6 of the time

Create a single call to `random.choices` which simulates 4 players being assigned points as above.  (Your output should be a list of 4 items)


In [28]:
import random

population = [10, 20, 30]
weights = [1/2, 1/3, 1/6]
k = 4

random.choices(population=population, weights=weights, k=k)


[30, 20, 10, 10]

## Rows vs Columns

<img src="https://learnenglishfunway.com/wp-content/uploads/2021/07/Row-vs-Column.jpg" width=700 />


# Why do we make such a fuss to represent data as arrays?

It's often convenient to consider a dataset as a two dimensional array (see below), where:
- every row corresponds to a particular **sample**
    - e.g. a penguin
- every column corresponds to a particular **feature**
    - e.g. how heavy all penguins are
- the intersection of a row and column contains the feature corresponding to a particular sample:
    - e.g. how heavy a particular penguin is

    
<img src="https://imgur.com/orZWHly.png" width=300 />



In [16]:
# (we'll cover this code next lesson, for today I just want us all to
# look at a dataset together)

import seaborn as sns

# data source: https://github.com/mwaskom/seaborn-data/blob/master/penguins.csv
df_penguin = sns.load_dataset('penguins')
df_penguin.head()


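|   | species | island    | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex    |
|---|---------|-----------|----------------|---------------|-------------------|-------------|--------|
| 0 | Adelie  | Torgersen | 39.1           | 18.7          | 181.0             | 3750.0      | Male   |
| 1 | Adelie  | Torgersen | 39.5           | 17.4          | 186.0             | 3800.0      | Female |
| 2 | Adelie  | Torgersen | 40.3           | 18.0          | 195.0             | 3250.0      | Female |
| 3 | Adelie  | Torgersen | NaN            | NaN           | NaN               | NaN         | NaN    |
| 4 | Adelie  | Torgersen | 36.7           | 19.3          | 193.0             | 3450.0      | Female |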
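
To make the rows-as-samples, columns-as-features idea concrete, here is a rough sketch that stores the four numeric features of the first three penguins above in a NumPy array. The variable names are invented for illustration, and the indexing syntax used here is covered later in the lesson.

```python
import numpy as np

# rows = samples (penguins), columns = features
# columns: bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g
penguins = np.array([[39.1, 18.7, 181.0, 3750.0],
                     [39.5, 17.4, 186.0, 3800.0],
                     [40.3, 18.0, 195.0, 3250.0]])

n_samples, n_features = penguins.shape

one_penguin = penguins[1]     # a row: every feature of the second penguin
all_masses = penguins[:, 3]   # a column: body mass of every penguin
one_mass = penguins[1, 3]     # row meets column: body mass of the second penguin

print(n_samples, n_features)  # 3 4
print(one_mass)               # 3800.0
```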


## **NumPy** (**Numerical Python**) Library
* First appeared in 2006 and is the **preferred Python array implementation**.
* High-performance, richly functional **_n_-dimensional array** type called **`ndarray`**. 
* **Written in C** and **up to 100 times faster than lists**.
* Critical in big-data processing, AI applications and much more. 
* According to `libraries.io`, **over 450 Python libraries depend on NumPy**. 
* Many popular data science libraries such as Pandas, SciPy (Scientific Python) and Keras (for deep learning) are built on or depend on NumPy. 

Big Question:
```
What is an array?  (and how is it different from a list or a list of lists?)
```

| Array                                 | List (Python: Dynamic Array)                         |
|---------------------------------------|------------------------------------------------------|
| Size is static (contiguous memory)    | Size can be modified quickly (non-contiguous memory) |
| Quick to compute (esp Linear Algebra) | Slower to compute (and clumsy looking code)          |
| Contains 1 data type                  | May contain different data types                     |

In summary, arrays are faster but more restrictive than lists.
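
A quick sketch of two of the differences in the table (variable names are illustrative): arithmetic on an array acts on every element, while the same operator on a list means something else, and an array converts its items to one shared data type.

```python
import numpy as np

a_list = [1, 2, 3]
an_array = np.array(a_list)

# multiplying a list repeats it ...
list_times_two = a_list * 2
# ... multiplying an array multiplies each element
array_times_two = an_array * 2

print(list_times_two)    # [1, 2, 3, 1, 2, 3]
print(array_times_two)   # [2 4 6]

# a list can mix ints and floats, an array stores a single dtype
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)       # float64
```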


### Initializing arrays:
- 1d from list / tuple
- 2d from list / tuple


In [32]:
import numpy as np

# x is a 1d array with shape (3,)
x = np.array((1, 2, 3))
x


array([1, 2, 3])

In [36]:
# y is a 2d array with shape (3, 3)
y = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
y


array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

### Building some special matrices
- zeros
    - dtype
    - shape
- ones
    - dtype
    - shape
- full 
    - dtype
    - shape
    - fill_value
- eye
    - dtype
    - N



<img src="https://learnenglishfunway.com/wp-content/uploads/2021/07/Row-vs-Column.jpg" width=200 />

#### Convention: Rows First!
- we describe array shape as `(n_rows, n_cols)`
- we index into an array as `x[row_idx, col_idx]`


In [37]:
# shape = (n_rows, n_cols)
# shape = (height, width)
z = np.zeros((2, 5)) # wide array: 2 rows, 5 columns
z


array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [44]:
one_array = np.ones((2, 5), dtype=int)
one_array


array([[1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1]])

In [42]:
# np.full(shape=(2,5), fill_value=2)
two_array = np.full(shape=(2, 5), fill_value=2.284938983)
two_array


array([[2.28493898, 2.28493898, 2.28493898, 2.28493898, 2.28493898],
       [2.28493898, 2.28493898, 2.28493898, 2.28493898, 2.28493898]])

In [22]:
# identity matrix
# square matrix with 1's on the diagonal, 0s elsewhere
np.eye(3)


array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

## Building arrays from a range of values:
- `arange()`
- `linspace()`
- `geomspace()`
- `logspace()`


In [47]:
# np.arange(start (inclusive), stop (exclusive), step)
np.arange(0, 10, 2)


array([0, 2, 4, 6, 8])

In [51]:
# linearly spaced values np.linspace(start (inclusive), stop (inclusive), num)
np.linspace(0, 1, 5)


array([0.  , 0.25, 0.5 , 0.75, 1.  ])

In [52]:
# geometrically spaced values np.geomspace(start (inclusive), stop (inclusive), num)
np.geomspace(1, 27, 4)


array([ 1.,  3.,  9., 27.])

In [53]:
# log spaced values np.logspace(start_exp, stop_exp, num)
# start = 10^start_exp, stop = 10^stop_exp
np.logspace(0, 6, 7)


array([1.e+00, 1.e+01, 1.e+02, 1.e+03, 1.e+04, 1.e+05, 1.e+06])
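
The four functions overlap quite a bit. As a short sketch (variable names are illustrative): `arange` is specified by a step while `linspace` is specified by a count, and `logspace` gives the same values as `geomspace` once the endpoints are written as powers of 10.

```python
import numpy as np

# arange: choose the step, the stop value is excluded
by_step = np.arange(0, 1, 0.25)
# linspace: choose how many values, the stop value is included
by_count = np.linspace(0, 1, 5)

print(by_step)    # [0.   0.25 0.5  0.75]
print(by_count)   # [0.   0.25 0.5  0.75 1.  ]

# logspace takes exponents, geomspace takes the endpoints themselves
log_spaced = np.logspace(0, 6, 7)
geom_spaced = np.geomspace(1, 10**6, 7)
same = np.allclose(log_spaced, geom_spaced)
print(same)       # True
```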

### Array Attributes
- shape
- size
- ndim

NumPy can build arrays out of many different number types (bool, int, float).  ([see also](https://numpy.org/doc/stable/user/basics.types.html))
- dtype
    - astype
- nbytes


In [58]:
x = np.array([[1, 2, 3],
              [4, 5, 6]], dtype=np.uint8) 


In [59]:
x.dtype


dtype('uint8')

In [60]:
x.ndim


2

In [61]:
x.shape


(2, 3)

In [63]:
# size is total number of elements
x.size


6

In [64]:
x.nbytes


6
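
The `astype` method from the list above is not shown in the cells here, so the following is a small sketch of how it relates to `nbytes`. It relies on the `itemsize` attribute (bytes per element), which is not covered above: `nbytes` is the number of elements times the bytes per element.

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6]], dtype=np.uint8)

# astype returns a new array with the requested dtype (x itself is unchanged)
x_float = x.astype(np.float64)

# uint8 uses 1 byte per element, float64 uses 8
print(x.itemsize, x.nbytes)               # 1 6
print(x_float.itemsize, x_float.nbytes)   # 8 48
```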

## Manipulating array shape

### Diagonal

The diagonal of each array is shaded below; the unshaded elements are not on the diagonal of the matrix:

$$ \begin{bmatrix}
\blacksquare & \square & \square\\
\square & \blacksquare & \square\\
\square & \square & \blacksquare\\
\square & \square & \square\\
\end{bmatrix} 
\hspace{2cm}
\begin{bmatrix}
\blacksquare & \square & \square & \square & \square\\
\square & \blacksquare & \square& \square & \square\\
\square & \square & \blacksquare& \square & \square\end{bmatrix}
\hspace{2cm}
\begin{bmatrix}
\blacksquare & \square & \square\\
\square & \blacksquare & \square\\
\square & \square & \blacksquare
\end{bmatrix} 
$$
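
As a small sketch of the picture above, `np.diag` pulls the diagonal out of a 2d array, including non-square ones. The arrays `tall` and `wide` are made up for illustration, with the same shapes as the first two matrices shown.

```python
import numpy as np

# a tall (4, 3) array, like the first matrix above
tall = np.array([[0, 1, 2],
                 [3, 4, 5],
                 [6, 7, 8],
                 [9, 10, 11]])

# a wide (3, 5) array, like the second matrix above
wide = np.array([[0, 1, 2, 3, 4],
                 [5, 6, 7, 8, 9],
                 [10, 11, 12, 13, 14]])

# the diagonal holds the entries whose row index equals their column index
diag_tall = np.diag(tall)
diag_wide = np.diag(wide)

print(diag_tall)   # [0 4 8]
print(diag_wide)   # [ 0  6 12]
```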

### Numpy methods
- transpose
- .reshape()
    - order of reshape (row or column first?)


In [65]:
x = np.array([[1, 2, 3],
              [4, 5, 6]]) 
x


array([[1, 2, 3],
       [4, 5, 6]])

In [66]:
# transpose: flip across the diagonal
y = x.T
y


array([[1, 4],
       [2, 5],
       [3, 6]])

In [67]:
x


array([[1, 2, 3],
       [4, 5, 6]])

In [68]:
# reshape allows us to change shape of matrix
# (new matrix must have same total number of elements)
x.reshape((1, 6))


array([[1, 2, 3, 4, 5, 6]])

In [72]:
# x.reshape((1, 8))  # ValueError: x has 6 elements, which cannot fill a (1, 8) array


In [73]:
z = np.arange(0, 12)
z


array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [74]:
z.reshape((3, 4))


array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [75]:
# -1 may be used at most once in the shape argument
# its value will be chosen to ensure the output array has the same number of elements
z.reshape((3, -1))


array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [77]:
# be mindful: -1 must be replaceable by some integer which keeps the same number of elements
# z.reshape((5, -1))  # ValueError: 12 elements cannot be split into 5 equal rows


In [78]:
# we can fill the array across the rows first (order='C') ...
z.reshape((3, 4), order='C')


array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [79]:
# or down columns first (order='F') ...
z.reshape((3, 4), order='F')


array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])
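
One way to build intuition for `order='F'` is the sketch below (variable names are illustrative): filling a `(3, 4)` array down the columns gives the same result as filling a `(4, 3)` array across the rows and then transposing it.

```python
import numpy as np

z = np.arange(12)

# fill down the columns first
col_first = z.reshape((3, 4), order='F')

# fill a (4, 3) array across the rows, then flip across the diagonal
via_transpose = z.reshape((4, 3)).T

same = np.array_equal(col_first, via_transpose)
print(same)   # True

# flattening with the default order reads the array across the rows
flat = col_first.reshape(-1)
print(flat)   # [ 0  3  6  9  1  4  7 10  2  5  8 11]
```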

## In Class Activity B
1. Build an array by:
    - getting 100 equally spaced values from 11 to 42
    - reshaping it into an array with 5 columns
2. How much memory does the computer use to store the array above if ...
    - ... each item in array is a `float`
    - ... each item in array is an 8 bit unsigned integer `np.uint8`
        - is anything lost in this representation? (explain in comment please)
3. (++) Build an `11x11` checkerboard matrix.  A `3x3` checkerboard is shown below for reference:
$$ \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix} $$
    - hint: try `[0, 1] * 3`, how could you use this?
    - hint: you can slice matrices just like tuples / lists
        - `x[1:3]` gets the second and third items in an array


In [91]:
import numpy as np

x = np.linspace(11, 42, 100, dtype=float).reshape((-1, 5))
x.nbytes

800

In [92]:
# equivalent ways of changing the data type of an array
# (casting float -> uint8 drops the fractional part, so precision is lost)
# np.uint8(x)
x.astype(np.uint8)

array([[11, 11, 11, 11, 12],
       [12, 12, 13, 13, 13],
       [14, 14, 14, 15, 15],
       [15, 16, 16, 16, 16],
       [17, 17, 17, 18, 18],
       [18, 19, 19, 19, 20],
       [20, 20, 21, 21, 21],
       [21, 22, 22, 22, 23],
       [23, 23, 24, 24, 24],
       [25, 25, 25, 26, 26],
       [26, 26, 27, 27, 27],
       [28, 28, 28, 29, 29],
       [29, 30, 30, 30, 31],
       [31, 31, 31, 32, 32],
       [32, 33, 33, 33, 34],
       [34, 34, 35, 35, 35],
       [36, 36, 36, 36, 37],
       [37, 37, 38, 38, 38],
       [39, 39, 39, 40, 40],
       [40, 41, 41, 41, 42]], dtype=uint8)

In [93]:
x = np.linspace(11, 42, num=100, dtype=np.uint8)
x.nbytes


100
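
The cells above cover parts 1 and 2. For part 3, here are two possible sketches of the checkerboard, not the only ways to do it. The first assigns 1s with `start:stop:step` slices; the second follows the `[0, 1] * 3` hint and only works because 11 is odd, so each new row starts on the opposite value.

```python
import numpy as np

n = 11

# approach 1: start from zeros, then set alternating positions to 1
board = np.zeros((n, n), dtype=int)
board[::2, 1::2] = 1   # even rows, odd columns
board[1::2, ::2] = 1   # odd rows, even columns

# approach 2: repeat [0, 1], keep the first n * n values, reshape
flat = np.array([0, 1] * (n * n))[:n * n]
board_from_hint = flat.reshape((n, n))

print(board[:3, :3])
# [[0 1 0]
#  [1 0 1]
#  [0 1 0]]
print(np.array_equal(board, board_from_hint))   # True
```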

## Array Indexing (slicing)

You can index arrays: everything we've previously shown about `start:stop:step` indexing works for arrays too!


In [94]:
x = np.arange(11)
x


array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [95]:
x[5]


5

In [48]:
x[2:6]


array([2, 3, 4, 5])

In [49]:
x[-3:]


array([ 8,  9, 10])

In [50]:
x[:5]


array([0, 1, 2, 3, 4])

A two dimensional array requires two indices to get a value: `x[row_idx, col_idx]`

(Just like our convention for rows first in shape, the row index comes first as we index into the array)


In [96]:
x = np.arange(20).reshape((4, 5))
x


array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

In [52]:
# row_idx=1 (second row since python starts counting at 0)
# col_idx=2 (third column since python starts counting at 0)
x[1, 2]


7

In [53]:
# we can start:stop:step slice either index

# get a slice of rows and a constant column
x[0:2, 2]


array([2, 7])

In [54]:
# get a slice of columns and a constant row
x[2, 0:2]


array([10, 11])

## Super useful slice syntax on arrays:
(so useful it deserves its own title)


In [55]:
# by default, the slice indexing chooses start:stop to give the entire object
x = np.array([1, 2, 3])
x[:]


array([1, 2, 3])

In [56]:
# we can use this to get entire rows or columns as needed
x = np.arange(20).reshape((4, 5))
x


array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

In [57]:
# get the first column
x[:, 0]


array([ 0,  5, 10, 15])

In [58]:
# get the second row
x[1, :]


array([5, 6, 7, 8, 9])

In [59]:
# get the last two columns
x[:, -2:]


array([[ 3,  4],
       [ 8,  9],
       [13, 14],
       [18, 19]])
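
The examples above use `start:stop`, but the `step` part works on each axis too. A brief sketch on the same `x` (the result names are illustrative):

```python
import numpy as np

x = np.arange(20).reshape((4, 5))

# every other row (rows 0 and 2), all columns
every_other_row = x[::2, :]

# rows 0 and 2, columns 1 and 3
both_steps = x[::2, 1::2]

# a negative step walks backwards: the rows in reverse order
rows_reversed = x[::-1, :]

print(both_steps)
# [[ 1  3]
#  [11 13]]
```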

## In Class Activity C

Using a single array slice for each, extract all entries of the matrix `x` which equal each value below:
- 1
- 2
- 3
- 4
- 5

- extract the last column of x
- extract the last row of x
- extract the first three elements of the last column of x


In [102]:
x = np.array([[0., 1., 3., 0., 5., 5.],
              [0., 0., 3., 0., 0., 0.],
              [0., 2., 3., 0., 0., 0.],
              [0., 0., 3., 4., 4., 4.],
              [0., 0., 3., 4., 4., 4.],
              [0., 0., 3., 4., 4., 4.]])


In [103]:
x[0, 1]

1.0

In [104]:
x[2, 1]

2.0

In [105]:
x[:, 2]

array([3., 3., 3., 3., 3., 3.])

In [106]:
x[3:, 3:]

array([[4., 4., 4.],
       [4., 4., 4.],
       [4., 4., 4.]])

In [107]:
x[0, -2:]

array([5., 5.])

In [108]:
# last column
x[:, -1]

array([5., 0., 0., 4., 4., 4.])

In [109]:
# last row
x[-1, :]

array([0., 0., 3., 4., 4., 4.])

In [112]:
# first 3 elements of last column
x[:3, -1]

array([5., 0., 0.])

In [113]:
some_dict = {'a': 1, 'b': 2}

In [115]:
for key in some_dict:
    print(key)

a
b
