DS2000 (Spring 2019, NCH) :: Lecture 7b

0. Administrivia

  1. Due today @ 9pm: HW6 (submit via Blackboard)
  2. Due before Monday's lecture: pre-class quiz (via Blackboard; feel free to use book/notes/Python)
  3. I'm in London next week!
  4. Due next Friday @ 9pm: HW7 (submit via Blackboard)

Dataset: IMDB Reviews

Our goal: sentiment analysis (categorize movie reviews as positive or negative based on the positivity/negativity of the words used in each review)

Vocabulary

Let's get a list of all the distinct words (aclImdb/imdb.vocab)

In [2]:
import os

def read_vocab():
    with open(os.path.join('aclImdb', 'imdb.vocab'), 'r', encoding='utf8') as vocab:
        return vocab.read().split()

vocab = read_vocab()
print(vocab[:30])
['the', 'and', 'a', 'of', 'to', 'is', 'it', 'in', 'i', 'this', 'that', 'was', 'as', 'for', 'with', 'movie', 'but', 'film', 'on', 'not', 'you', 'he', 'are', 'his', 'have', 'be', 'one', '!', 'all', 'at']

Word Weights

Let's get a corresponding weight for each word (aclImdb/imdbEr.txt).

Where did these come from? More later in the class when we talk about machine learning :)

In [3]:
def read_weights():
    with open(os.path.join('aclImdb', 'imdbEr.txt'), 'r', encoding='utf8') as weights:
        return [float(w) for w in weights.read().split()]

weights = read_weights()
print(weights[:30])
[0.0490972013402, 0.201363575849, 0.0333946807184, 0.099837669572, -0.0790210365788, 0.188660139871, 0.00712569582356, 0.109215821589, -0.154986397986, -0.222690363917, -0.0772307310155, -0.291845817772, 0.266363416394, 0.0212741184666, 0.0800135132377, -0.246354038618, -0.065874838866, 0.147885815777, -0.0649772327953, -0.178636415473, 0.0215810241282, 0.235710718431, 0.0387731958409, 0.458060460618, -0.284436793822, -0.175227962402, 0.0871135855367, -0.134838837245, -0.0338426829751, -0.185783731249]
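The two files are parallel: the i-th weight in imdbEr.txt belongs to the i-th word in imdb.vocab. A quick sanity check on that pairing, sketched here with short hypothetical stand-ins for the two lists:

```python
# hypothetical stand-ins for the real vocab/weights lists read above
sample_vocab = ['the', 'great', 'terrible']
sample_weights = [0.049, 1.095, -2.181]

# the files are parallel: weight i belongs to word i,
# so the two lists must have the same length
assert len(sample_vocab) == len(sample_weights)
print(list(zip(sample_vocab, sample_weights))[:2])  # [('the', 0.049), ('great', 1.095)]
```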

Let's Explore!

In [4]:
min_weight = min(weights)
min_index = weights.index(min_weight)
print(min_weight, min_index, vocab[min_index])
-4.5 18730 perú
In [5]:
for weight_index, weight in enumerate(weights):
    if weight < -4.1:
        print("{}: {}".format(weight, vocab[weight_index]))
-4.5: perú
-4.5: ixteen
-4.5: jouissance
-4.5: wrathful
-4.5: daneliuc
-4.5: wurb
-4.5: art-film
-4.5: ruge
-4.5: orange-tinted
-4.5: moovies
-4.5: dirks
-4.5: zombiez
-4.5: soid
-4.5: ruptured
-4.5: jgar
-4.5: bucke
-4.5: rakes
-4.26680988826: unwatchably
-4.5: compost
-4.5: turn-on
-4.5: bloodsurfing
-4.5: dedede
-4.5: boogers
-4.5: caca
-4.5: lemmya
-4.5: qm
-4.5: tracee
-4.5: gwynyth
-4.5: sorrowfully
-4.5: one-tenth
-4.5: agis
-4.5: jinxed
-4.5: lummox
-4.19828976708: shelli
-4.5: castmates
-4.21144529414: scientologists
-4.5: culty
-4.5: mad-dog
-4.5: doorknobs
-4.11833618643: denigrates
-4.5: sumi
-4.5: sudbury
-4.5: sences
-4.5: expresion
-4.5: zzzz
-4.5: tripple
-4.5: gh
-4.5: zukovic
-4.5: evilmaker
-4.5: commericals
-4.5: crucifies
-4.5: distended
-4.5: al-quada
-4.5: knee-deep
-4.5: pleb
-4.5: subspace
-4.5: onside
-4.5: duwayne
-4.23526246887: sickles
-4.5: schnass
-4.5: rectangle
-4.5: linch
-4.5: line-dancing
-4.5: amature
-4.26416283788: pursuant
-4.5: sheeple
-4.5: slo
-4.5: hicksville
-4.5: taqueria
-4.5: bobbins
-4.5: shysters
-4.5: revolta
-4.5: cliché-driven
-4.5: atlantica
-4.5: zomg
In [6]:
max_weight = max(weights)
max_index = weights.index(max_weight)
print(max_weight, max_index, vocab[max_index])
4.5 11613 xica
In [7]:
for weight_index, weight in enumerate(weights):
    if weight > 4.1:
        print("{}: {}".format(weight, vocab[weight_index]))
4.5: xica
4.32672195298: danelia
4.5: filone
4.5: bazza
4.31334181282: nwh
4.5: alekos
4.5: riedelsheimer
4.5: citizenx
4.5: telemundo
4.5: machi
4.5: englebert
4.5: horstachio
4.5: cloudkicker
4.5: cybersix
4.5: graaff
4.5: swatch
4.5: kimiko
4.23142584: katzir
4.5: rangi
4.5: ashwar
4.5: limbic
4.5: pocasni
4.5: trce
4.5: petitiononline
4.5: machesney
4.5: maratonci
4.5: super-hot
4.5: soso
4.5: gouald
4.5: psa
4.5: dmd
4.5: winkelman
4.5: mini-dv
4.21139785593: mumtaz
4.5: skyward
4.5: sheeba
4.5: neat-freak
4.5: wallah
4.21139785593: jole
4.18540530171: sherbert
4.5: naudets
4.5: transcribed
4.18107523608: zeman
4.5: animates
4.5: double-d
4.5: magorian
4.23142584: armourae
4.5: genii
4.5: weems
4.5: heffer
4.12035213723: holywell
4.12035213723: katona
4.5: linoleum
4.5: foabh
4.5: roseaux
4.5: willowy
4.5: runnin
4.5: telescoping
4.5: stuttgart
4.5: fun-bloodbath
4.5: fremantle
4.5: pufnstuff
4.5: gobo
4.5: psycho-analysis
4.5: authoritarianism
4.5: hogun
4.5: eyecandy
4.5: motoring
4.5: boneheads
4.5: cognition
4.36730500621: kf
4.12035213723: emsworth
4.5: ganster
4.5: bigger-than-life
4.5: veto
4.5: karnage
4.5: loire
4.5: pricing
4.5: lipper
4.5: hubbie
4.5: crimefighter
4.5: ovid
4.5: telstar
4.5: valderama
4.5: venger
4.26570039682: ralphy
4.23142584: homem
4.12035213723: country-boy
4.5: barcode
4.5: big-league
4.18540530171: raliegh
4.5: goddam
4.5: hyun-soo
4.12035213723: re-broadcast
4.18540530171: nagai
4.12035213723: hallo
4.5: megas
4.5: wrenchmuller
4.5: aku
4.18540530171: holocausts
4.5: non-offensive
4.12035213723: hrishita
4.5: legalize
4.5: kwame
4.12035213723: lbp
4.5: glamourise
4.12035213723: hermandad
4.5: katanas
4.5: zd
4.5: aparently
4.5: blazers
4.14363272335: goebels
4.5: sub-story
4.18107523608: sep
4.5: miku
4.5: hovis
4.5: nikolayev
4.5: middles
4.5: montanas
4.5: kleine
4.37621490892: meisner
4.5: lighter-than-air
4.5: hiya
4.5: stopkewich
4.29221698617: boylen
4.5: doozers
4.5: socking
4.5: grunner
4.5: transmutation
4.5: stuntpeople
4.5: store-owner
4.5: yasha
4.5: rahad
4.5: femur
4.5: subdivisions
4.5: ferencz
4.12035213723: bowzer
4.12035213723: ele
4.5: sumire
4.5: sholem
4.5: préjean
4.36730500621: nuttball
4.18540530171: scaredy-cat
4.18540530171: banjo-kazooie
4.5: sweetums
4.5: plonker
4.5: fanfilm
4.5: pelswick
4.12035213723: consummately
4.5: labeija
4.5: smartie
4.5: makowski
4.5: clavius
4.5: ring-wraiths
4.5: g-gundam
4.5: norge
4.5: tounge
4.5: work-out
4.12035213723: bania
4.5: havin
4.5: alegria
4.5: acrifice
4.5: truley
4.5: hangal
4.5: arterial
4.5: opiate
4.5: marzia
4.5: buh
4.29221698617: msties
4.5: moomins
4.12035213723: madres
4.5: ribsi
4.5: polt
4.5: labyrinths
4.2575088395: bhagyashree
4.5: chirp
4.5: asumi
4.5: unsurpassable
4.5: playwrite
4.5: quantitative
4.29221698617: mccreary
4.5: mastan
4.5: cadmus
4.5: debriefing
4.5: watcha
4.5: televise
4.5: tenderhearted
4.5: poeple
4.5: eréndira
4.5: hanneke
4.5: blackmore
4.5: john-rhys
4.5: counterstrike
4.18107523608: midwinter
4.26570039682: dike
4.5: mocha
4.18540530171: rostova
4.5: klembecker
4.5: nervosa
4.5: lyudmila
4.5: dipti
4.14866701295: alexanderplatz
4.23142584: disinherit
4.5: confucian
4.5: nonviolence
4.12035213723: asti
4.5: classism
4.5: bams
4.12035213723: sugarcoating
4.5: takemitsu
4.5: ex-christian
4.5: caprino
4.5: imbibe
4.18540530171: conker
4.5: nota
4.18107523608: screw-on
4.12035213723: sussanah
4.5: scfi
4.5: zoran
4.5: htv
4.5: commandoes
4.5: novac
4.23514100459: promenade
4.31334181282: sandcastles
4.5: drosselmeier
4.5: dosent
4.5: lawnmowerman
4.5: grout
4.5: seiko
4.5: augers
4.12035213723: daker
4.5: walmington-on-sea
4.5: filthiness
4.18540530171: klok
4.36730500621: snit
4.21139785593: gégauff
4.5: e-mailed
4.5: yuba
4.5: nyland
4.18540530171: wallaces
4.29221698617: shobha
4.5: tripled
4.5: pathar
4.5: eventuality
4.5: emeryville
4.5: magnifique
4.5: trestle
4.12035213723: interactivity
4.5: vexation
4.18540530171: girlies
4.5: dragonballz
4.5: untutored
4.35696725489: zavattini
4.5: aristophanes
4.12035213723: non-issue
4.23142584: splashdown
4.5: jurisprudence
4.18540530171: tomo
4.5: kiva
4.5: h-b
4.5: probationary
4.5: dapne
4.23142584: murli
4.5: hôtel
4.12035213723: kono
4.5: sublimate
4.5: horthy
4.5: crooke
4.5: gerri
4.5: earlobes
4.5: tsunehiko
4.5: hissed
4.5: brummie
4.18540530171: neighbourhoods
4.5: garbages
4.21473375157: dowdell
4.5: sweet-talking
4.18540530171: valery
4.13901368966: dango
4.5: travola
4.33056761792: kingly
4.5: hz
4.10980215752: crowell
4.12035213723: year-long
4.5: dolorous
4.5: doofenshmirtz
4.5: superbit
4.5: groovie
4.5: reuters
4.5: first-degree
4.5: re-occurring
4.5: harrassed
4.18540530171: mystically
4.12035213723: end-fight
4.5: senesh
4.23142584: wacthing
4.5: enabler
4.5: trumped-up
4.5: samuari
4.5: heth
4.5: mcreedy
4.5: berfield
4.5: heinkel
4.12035213723: meditates
4.5: quato
4.28263242929: inu
4.5: starblazers
4.18540530171: ikuru
4.5: tusk
4.5: fairview
4.5: commentating
4.5: ips
4.5: flåklypa
4.5: thyself
4.5: sages
4.5: razer
4.12035213723: leckie
4.5: xlr
4.2533118857: kats
4.5: heartstopping
4.5: croucher
4.5: lookinland
4.29221698617: non-reality
4.5: mim
4.5: tates
4.5: antonis
4.12035213723: whinge
4.18540530171: everbody
4.5: beachcombers
4.12035213723: serf
4.23645508902: scavo
4.5: harley-davidson
4.5: paganistic
4.5: conure
4.12035213723: crustacean
4.5: bodyline
4.5: blag
4.18540530171: pachelbel
4.5: esterhase
4.5: gorgs
4.18540530171: malnourished
4.5: transgenic
4.5: gravitation
4.12035213723: bear-like
4.5: shultz
4.5: dass
4.5: luddites
4.5: commode
4.29434969687: intellectually-challenged
4.5: donato
4.5: bie
4.5: replicates
4.18540530171: dionysian
4.5: super-sexy
4.5: mourir
4.2575088395: alka
4.26570039682: lodz
4.12035213723: film-lovers
4.5: brylcreem
4.5: ferro
4.12035213723: drop-off
4.5: ioffer
4.5: inflame
4.5: tal

Join the Vocabulary and Weights

Let's create a dictionary that associates words with weights

In [8]:
def make_word_weight_dict(vocab, weights):
    return {word: weight for word, weight in zip(vocab, weights)}

vocab_dict = make_word_weight_dict(vocab, weights)

print(vocab_dict['terrible'])
print(vocab_dict['boring'])
print(vocab_dict['great'])
print(vocab_dict['hilarious'])
-2.18077869986
-1.78837930632
1.09468232801
0.993189582589
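One wrinkle before we score whole reviews: reviews will contain words that never made it into the vocabulary, and indexing the dict with one of those raises a KeyError. The `.get` method with a default of 0 treats such words as neutral instead. A minimal sketch with a hypothetical two-word subset of `vocab_dict`:

```python
# hypothetical subset of vocab_dict, for illustration only
sample_weights = {'great': 1.09468232801, 'terrible': -2.18077869986}

# direct indexing raises KeyError for a word outside the vocabulary...
try:
    sample_weights['zzyzx']
except KeyError:
    print('KeyError!')

# ...while .get lets us treat unknown words as neutral (weight 0)
print(sample_weights.get('zzyzx', 0))  # 0
```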

Read Reviews

Let's read in all the positive/negative reviews

In [9]:
import random

def read_all_reviews_from_dir(dir_path):
    reviews = []
    
    for f in os.listdir(dir_path): # get a list of all files in a directory
        with open(os.path.join(dir_path, f), 'r', encoding='utf8') as review_file:
            reviews.append((f, review_file.read()))
    
    return reviews


def read_all_reviews():
    base_path = os.path.join('aclImdb', 'test')
    
    pos_reviews = read_all_reviews_from_dir(os.path.join(base_path, 'pos'))
    pos_labels = ['positive'] * len(pos_reviews)
    neg_reviews = read_all_reviews_from_dir(os.path.join(base_path, 'neg'))
    neg_labels = ['negative'] * len(neg_reviews)
    
    all_reviews = pos_reviews + neg_reviews
    all_labels = pos_labels + neg_labels
    
    reviews_and_labels = list(zip(all_reviews, all_labels))
    
    random.shuffle(reviews_and_labels)
    
    return reviews_and_labels


reviews = read_all_reviews()

print(reviews[0])
(('329_1.txt', 'I\'ve always loved horror flicks. From some of the usual well-known like "The Exorcist" to some of the more underrated like "Black Christmas" or "Just Before Dawn". But who are people kidding,even calling this trash a b-movie. It\'s straight up bottom-of-the-barrel Z-grade. The acting is the worst ever on film. Really,I\'ve seen better on an episode of the "Young and the Restless"...SPOILER...Lookout for when the woman comes to tell them about the legend of Jack-o. She pauses sometimes for a matter of seconds as if someone is flashing her cue cards and she\'s struggling to read her lines. A RIOT! <br /><br />Oh,and besides the bad acting,absolutely no gore or F/X. And Jack-o looked like a plastic lit pumpkin. Watch Linnea Quigley in "Night of the Demons",or "Silent Night,Deadly Night",far superior flicks.'), 'negative')
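Before scoring, it's worth confirming the class balance of what we just read. A small sketch with `collections.Counter`, using a hypothetical stand-in for the `((filename, text), label)` pairs returned by `read_all_reviews()`:

```python
from collections import Counter

# hypothetical stand-in for the ((filename, text), label) pairs
sample_reviews = [(('0_10.txt', 'loved it'), 'positive'),
                  (('1_2.txt', 'hated it'), 'negative'),
                  (('2_9.txt', 'great fun'), 'positive')]

# tally how many reviews carry each label
label_counts = Counter(label for _, label in sample_reviews)
print(label_counts)  # Counter({'positive': 2, 'negative': 1})
```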

Score Reviews

Now let's try to score a review based on the per-word weights, normalized by the number of words.

In [10]:
def score_review(review_text, vocab_dict):
    review_words = review_text.split()
    score = 0
    
    for word in review_words:
        score += vocab_dict.get(word, 0) # words missing from the vocabulary count as 0
    
    return score / len(review_words)


def review_summary(review_of_interest, reviews, vocab_dict):
    # each entry of reviews is ((filename, review_text), label)
    print(reviews[review_of_interest][0][1])
    print()
    print("Rating: {}".format(reviews[review_of_interest][1]))
    print("Score: {}".format(score_review(reviews[review_of_interest][0][1], vocab_dict)))


review_scores = [score_review(review[0][1], vocab_dict) for review in reviews]
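Note that `split()` leaves punctuation attached ("BAD!" won't match the vocab entry "bad"), so `score_review` silently skips such tokens. One possible fix, not part of the lecture code, is to normalize each token before the lookup; a minimal sketch with a hypothetical two-word weight dict:

```python
import string

def normalize(word):
    # lowercase and strip surrounding punctuation so "Great!" matches "great"
    return word.lower().strip(string.punctuation)

# hypothetical two-word subset of vocab_dict, for illustration only
sample_weights = {'great': 1.09, 'terrible': -2.18}

scores = [sample_weights.get(normalize(w), 0) for w in 'Great! Not terrible.'.split()]
print(scores)  # [1.09, 0, -2.18]
```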

Let's Explore!

In [11]:
min_review = min(review_scores)

print(min_review)
review_summary(review_scores.index(min_review), reviews, vocab_dict)
-0.39236446380089995
This was truly horrible. Bad acting, bad writing, bad effects, bad scripting, bad camera shots, bad filming, bad characters, bad music, bad editing, bad casting, bad storyline, bad ... well, you get the idea. It was just, just ... what's the word? Oh yeah ... BAD!

Rating: negative
Score: -0.39236446380089995
In [12]:
max_review = max(review_scores)

print(max_review)
review_summary(review_scores.index(max_review), reviews, vocab_dict)
0.36281647436158926
A surprisingly beautiful movie. Beautifully conceived, beautifully directed, beautifully acted, beautifully acted and most beautifully photographed.....the cinematography is nothing short of splendid. It is a war movie but is epic in it's scope and blends romance, tragedy and comedy into a story that is as harrowing as it is provoking.

Rating: positive
Score: 0.36281647436158926
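Taking this one step further: since positive reviews should score above zero and negative ones below, the sign of the score gives us a crude classifier. A hedged sketch, using hypothetical (score, true label) pairs in place of the real `review_scores`:

```python
def classify(score):
    # label a review by the sign of its normalized score
    return 'positive' if score > 0 else 'negative'

# hypothetical (score, true_label) pairs standing in for the real reviews
scored = [(0.36, 'positive'), (-0.39, 'negative'), (0.05, 'negative')]

# count how many reviews the sign rule labels correctly
correct = sum(classify(score) == label for score, label in scored)
print(correct, '/', len(scored))  # 2 / 3
```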