DS2000 (Spring 2019, NCH) :: Lecture 6b

0. Administrivia

  1. Due today @ 9pm: HW5 (submit via Blackboard)
  2. Next week: NOTHING -- Reading Week!! :)
  3. Due before the following Monday's lecture: pre-class quiz (via Blackboard; feel free to use book/notes/Python)
  4. Following Wednesday (in practicum): in-class quiz 4 (Files)
  5. Due following Friday @ 9pm: HW6 (submit via Blackboard)

Dataset: Political Speeches

Our goal: analyze and compare the top word counts from two politicians (Obama and Trump)

Bonus: Reading in Files in Other Directories

Windows separates directories with \, while Mac and Linux use /, so we use the os module to write platform-independent code...

In [2]:
import os

transcript_directory = "political-speech-files"

def get_speech_path(politician):
    return os.path.join(transcript_directory, "{}-speeches.txt".format(politician))

print(get_speech_path("obama"))
print(get_speech_path("trump"))
political-speech-files/obama-speeches.txt
political-speech-files/trump-speeches.txt
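
Note: os.path.join uses whichever separator the code is actually running on, so on Windows these same two calls would print the paths with backslashes instead.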

Get All Words

In [3]:
# If you haven't used nltk before, run the following
import nltk

nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /Users/nate/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Out[3]:
True
In [4]:
import string
from nltk.corpus import stopwords

def preprocess(word):
    # lowercase, then strip punctuation and digits -- but keep apostrophes,
    # so contractions like "can't" survive
    bad_letters = string.punctuation + string.digits
    word = word.lower()
    return "".join([letter for letter in word if (letter == "'" or letter not in bad_letters)])

def get_all_words(politician):
    with open(get_speech_path(politician), 'r', encoding='utf8') as f:
        all_words = f.read().split()
    preprocessed = [preprocess(word) for word in all_words]
    
    # common English words (plus stray Unicode dashes) we don't care about
    stop_words = stopwords.words('english') + ['—', '–']
    
    return [word for word in preprocessed if word and word not in stop_words]

obama_words = get_all_words('obama')
trump_words = get_all_words('trump')

print(len(obama_words))
print(len(trump_words))
366775
81665
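
As a quick sanity check (not a cell from the original notebook), here's what preprocess returns for a few sample tokens; the expected values are shown as comments. One quirk worth knowing: curly apostrophes (’) are not in string.punctuation, so tokens like "don’t" pass through untouched -- that's why both "that's" and "that’s" show up as separate words in the counts below.

print(preprocess("Hello,"))   # hello
print(preprocess("Can't!"))   # can't   (straight apostrophe kept)
print(preprocess("2019"))     # ''      (all digits stripped; get_all_words drops empty strings)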
In [5]:
print(obama_words[:30])
['chip', 'kathy', 'nancy', 'graciously', 'shared', 'father', 'nation', 'loved', "walter's", 'friends', 'colleagues', 'protégés', 'considered', 'hero', 'men', 'intrepid', 'gathered', 'today', 'honored', 'pay', 'tribute', 'life', 'times', 'man', 'chronicled', 'time', 'know', 'mr', 'cronkite', 'personally']
In [6]:
print(trump_words[:30])
['speech', 'thank', 'much', "that's", 'nice', 'great', 'guy', 'get', 'fair', 'press', 'get', 'fair', 'tell', "i'm", 'strongly', 'great', 'respect', 'steve', 'king', 'great', 'respect', 'likewise', 'citizens', 'united', 'david', 'everybody', 'tremendous', 'resect', 'tea', 'party']

Get Some Words

Sometimes it's useful to look at a random sample of words (e.g., to get a feel for the dataset, or to try out a function that might take a long time on the full data)...

In [7]:
import random

def sample_words(words, k, seed=None):
    if seed is not None:
        random.seed(seed)
        
    return random.sample(words, k)
In [8]:
print(sample_words(obama_words, 100, 322))
['economies', 'freedoms', 'people', "life's", 'values', 'bridged', 'menace', 'us', 'years', 'always', 'skills', 'important', 'price', 'educate', 'us', 'constituents', 'instruments', 'mobilize', 'billion', 'civil', 'structure', 'killed', 'needed', 'want', 'men', 'help', 'others', 'across', 'bill', 'like', 'world', 'democracies', 'laden', 'doors', 'remember', 'put', 'work', 'spectrum', 'suggestions', 'afford', 'folks', 'barbed', 'journalism', 'may', 'education', 'electricity', 'prefer', 'times', 'work', 'heating', 'caused', 'empowered', 'keeping', 'effective', 'isolating', 'look', 'suspected', 'law', 'decision', 'somehow', 'america', 'frightening', 'responsibility', 'palestinians', 'jobs', 'honor', 'charged', 'act', 'degree', 'future', 'facing', 'us', 'want', 'creed', 'first', 'security', 'present', 'system', 'note', 'pronounced', 'disagreements', 'james', 'sees', 'goal', 'mayor', 'broken', 'learns', 'go', 'hour', 'improve', 'negotiations', 'minority', 'could', 'vast', 'grew', 'inclusive', 'opened', 'justice', 'godgiven', 'arguments']
In [9]:
print(sample_words(trump_words, 100, 322))
['gotten', 'sponsor', 'proud', 'way', 'time', 'run', 'play', 'another', 'going', 'there’s', 'maybe', 'it’s', 'got', 'certain', 'money', 'heard', 'finish', 'great', 'inversions', 'horrible', 'long', 'shots', 'business', 'probably', 'killing', 'guy', 'guy', 'know', 'freaking', 'way', 'pay', 'nevada', 'help', 'nobody', 'hear', 'sitting', 'bragging', 'friends', 'people', 'saw', 'made', 'never', 'lost', 'killed', 'take', 'great', 'right', 'he’s', 'illegal', 'us', 'ford', 'unemployment', 'iran', 'build', 'straightened', 'tell', 'cruz', 'love', 'night', 'love', 'giving', 'hundreds', 'rude', 'he’s', 'great', 'call', 'don’t', 'john', 'bridge', 'melissa', 'schedule', "we're", 'florida', 'said', 'happen', 'couple', 'offense', 'differences', 'direct', 'take', 'it’s', 'properties', 'shown', 'truck', 'second', 'numbers', 'tremendous', 'like', 'i’m', 'country', 'think', 'immigrants', 'champion', 'camera', "we're", 'people', 'security', 'mine', 'we’re', 'probably']
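
Because sample_words seeds the generator whenever a seed is passed, the same seed always reproduces the same sample. A quick check (hypothetical; this wasn't run in the original notebook):

first = sample_words(trump_words, 5, seed=7)
second = sample_words(trump_words, 5, seed=7)
print(first == second)   # True: same seed, same sample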

Counting Words

Our goal will be to produce a list of lists, where the inner lists have two elements (word, count).

Note: our next topic will make this MUCH faster.

In [12]:
def add_word_to_count(word, counts):
    # look for an existing [word, count] pair and bump its count
    for pair in counts:
        if pair[0] == word:
            pair[1] += 1
            return

    # no existing pair: start this word's count at 1
    counts.append([word, 1])


def word_count(words):
    counts = []
    
    for word in words:
        add_word_to_count(word, counts)
    
    return counts

obama_count = word_count(obama_words) # ~1 minute
trump_count = word_count(trump_words) # ~10 seconds
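
Those timings are slow because add_word_to_count re-scans the whole counts list for every single word, so the work grows quadratically with vocabulary size. As a preview of the "next topic" promised above (a sketch assuming that topic is dictionaries; not the version the lecture ran):

def word_count_fast(words):
    # a dictionary maps each word directly to its count, so there's no
    # linear scan per word -- this should finish in well under a second
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    # convert back to the list-of-lists format used above
    return [[word, count] for word, count in counts.items()]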
In [13]:
print(obama_count[:10])
[['chip', 9], ['kathy', 5], ['nancy', 24], ['graciously', 2], ['shared', 153], ['father', 143], ['nation', 826], ['loved', 131], ["walter's", 3], ['friends', 229]]
In [14]:
print(trump_count[:10])
[['speech', 54], ['thank', 228], ['much', 321], ["that's", 49], ['nice', 147], ['great', 687], ['guy', 183], ['get', 632], ['fair', 34], ['press', 85]]
In [15]:
# Let's sort these counts
def get_count(count_pair):
    return count_pair[1]

def sort_counts(counts):
    return sorted(counts, key=get_count, reverse=True)

obama_sorted_counts = sort_counts(obama_count)
trump_sorted_counts = sort_counts(trump_count)
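
The named helper get_count keeps the sort call readable; an equivalent inline version (an alternative, not what the lecture ran) passes a lambda as the key:

sorted(trump_count, key=lambda pair: pair[1], reverse=True)   # same result as sort_counts(trump_count)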
In [16]:
print(obama_sorted_counts[:100])
[['people', 3110], ['us', 2506], ['one', 1830], ['america', 1692], ['new', 1657], ['know', 1583], ['make', 1567], ['world', 1546], ['time', 1477], ['american', 1400], ['going', 1386], ['work', 1380], ['like', 1355], ['that’s', 1332], ['country', 1305], ['every', 1287], ['want', 1266], ['states', 1249], ['united', 1238], ['would', 1229], ['years', 1198], ['also', 1184], ['today', 1182], ['it’s', 1154], ['get', 1066], ['must', 1063], ['president', 1060], ['americans', 1060], ['need', 1041], ['right', 1019], ['many', 989], ['security', 978], ['we’re', 977], ['even', 958], ['government', 909], ["that's", 900], ['way', 896], ['care', 886], ['thank', 879], ['together', 871], ['health', 840], ['nation', 826], ['first', 807], ['help', 805], ['war', 798], ['come', 797], ['we’ve', 795], ['take', 777], ['future', 775], ['think', 775], ['got', 760], ['jobs', 755], ['made', 754], ['economy', 753], ['back', 734], ['said', 730], ['good', 727], ['year', 706], ['last', 691], ['well', 677], ['families', 676], ['day', 661], ['could', 660], ['see', 653], ['believe', 649], ['much', 639], ['still', 632], ['change', 629], ['congress', 628], ['two', 616], ['keep', 597], ['say', 594], ['go', 593], ['great', 574], ['children', 574], ['nations', 573], ['across', 572], ['better', 571], ['sure', 565], ['never', 560], ['young', 560], ['support', 559], ['let', 555], ['i’m', 545], ['may', 542], ['part', 540], ['around', 537], ['home', 537], ["we're", 528], ['don’t', 523], ['give', 520], ['working', 516], ['women', 515], ['place', 513], ['peace', 511], ['system', 509], ['lives', 506], ['men', 501], ['next', 498], ['done', 494]]
In [17]:
print(trump_sorted_counts[:100])
[['going', 2055], ['people', 1328], ['know', 1314], ['it’s', 1103], ['we’re', 982], ['don’t', 888], ['said', 771], ['i’m', 769], ['want', 760], ['great', 687], ['they’re', 676], ['get', 632], ['like', 626], ['think', 625], ['one', 588], ['country', 527], ['say', 510], ['right', 501], ['that’s', 496], ['look', 400], ['go', 391], ['money', 390], ['lot', 377], ['got', 367], ['many', 366], ['good', 355], ['make', 348], ['us', 342], ['really', 338], ['back', 337], ['way', 329], ['mean', 327], ['much', 321], ['would', 318], ['even', 314], ['take', 307], ['he’s', 305], ['see', 290], ['never', 288], ['tell', 286], ['time', 280], ['win', 279], ['i’ve', 278], ['love', 274], ['trump', 269], ['well', 269], ['you’re', 265], ['big', 259], ['things', 243], ['thing', 240], ['come', 237], ['can’t', 236], ['didn’t', 234], ['believe', 231], ['thank', 228], ['everybody', 228], ['world', 219], ['ever', 215], ['deal', 211], ['years', 209], ['president', 203], ['trade', 203], ['okay', 197], ['china', 194], ['something', 188], ['happen', 188], ['jobs', 187], ['need', 186], ['million', 186], ["we're", 185], ['guy', 183], ['could', 181], ['america', 177], ['wall', 171], ['talk', 170], ['done', 167], ['bad', 165], ['actually', 161], ['i’ll', 161], ['better', 159], ['let', 158], ['ago', 156], ['new', 156], ['came', 156], ['hillary', 155], ['mexico', 153], ['care', 153], ['oh', 152], ['number', 152], ['every', 148], ['states', 148], ['nice', 147], ['nobody', 145], ['little', 145], ['give', 145], ['incredible', 142], ['first', 139], ['we’ve', 138], ['what’s', 137], ['folks', 136]]
In [18]:
# Analysis: which of each politician's top-k words don't appear in the other's top k?
k = 200

top_k_obama = [pair[0] for pair in obama_sorted_counts[:k]]
top_k_trump = [pair[0] for pair in trump_sorted_counts[:k]]

most_obama = [word for word in top_k_obama if word not in top_k_trump]
most_trump = [word for word in top_k_trump if word not in top_k_obama]
In [19]:
print(most_obama)
['must', 'americans', 'security', 'government', "that's", 'together', 'health', 'nation', 'help', 'war', 'future', 'economy', 'families', 'still', 'congress', 'children', 'nations', 'across', 'sure', 'young', 'support', 'may', 'part', 'around', 'home', 'working', 'women', 'peace', 'system', 'lives', 'men', 'next', 'energy', 'history', 'law', 'iraq', 'insurance', 'rights', 'life', "we've", 'without', 'stand', 'power', 'businesses', 'continue', 'companies', 'education', 'community', 'making', 'god', 'leaders', 'economic', 'already', 'tax', 'forward', 'nuclear', 'able', 'act', 'reform', 'cannot', 'opportunity', 'clear', "i'm", 'plan', 'another', 'state', 'progress', 'responsibility', 'challenges', 'face', 'past', 'human', 'hope', 'means', 'free', 'including', 'whether', 'since', 'citizens', 'democracy', 'freedom', 'effort', 'century', 'live', 'afghanistan', 'house', 'national', 'international', 'workers', 'forces', 'family', 'meet', 'generation', 'washington', 'times', 'efforts', 'true', 'small', 'college', 'region', 'common']
In [20]:
print(most_trump)
['money', 'really', 'mean', 'he’s', 'tell', 'win', 'love', 'trump', 'you’re', 'big', 'can’t', 'didn’t', 'ever', 'deal', 'trade', 'okay', 'china', 'happen', 'million', 'guy', 'wall', 'talk', 'bad', 'actually', 'i’ll', 'hillary', 'mexico', 'oh', 'number', 'nice', 'nobody', 'little', 'incredible', 'what’s', 'remember', 'talking', 'run', 'saying', 'person', 'bring', 'probably', 'problem', 'tremendous', 'billion', 'somebody', 'iowa', 'doesn’t', 'coming', 'maybe', 'there’s', 'amazing', 'happened', 'nothing', 'anything', 'thousands', 'clinton', 'went', 'poll', 'deals', 'obama', 'use', 'you’ve', 'wanted', 'immigration', 'getting', 'second', 'guys', 'else', 'saw', 'she’s', 'smart', 'anybody', 'tough', 'happening', 'anymore', 'press', 'everything', 'isis', 'iran', 'won’t', 'numbers', 'whole', 'they’ve', 'dollars', 'border', 'thought', 'trillion', 'call', 'gave', 'israel', 'building', 'politicians', 'started', 'totally', 'worst', 'wouldn’t', 'away', 'real', 'horrible', 'left', 'south']