#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Felix Muzny
11/8/2022
DS 2000
Lecture 18 - sentiment analysis, sets, moving averages

Logistics:
- Homework 8 is out
    - this is your last HW for DS 2000
    - there is a significant amount of extra credit available
      (so start early if this is something that you need)
    - due 11/18
    - yes, you are doing sentiment analysis for this homework
- checking your grades:
    - most accurate is to calculate based on Gradescope grades
    - HW is 90% (total HW points you got / total HW points)
    - Quizzes are 10% (drop your lowest quiz)
    - you can look in Canvas, but know that this lags behind Gradescope
- No quiz this week
- No lecture on Friday (Veterans Day!)
    - I'll see you all next week!
- remote attendance (https://bit.ly/remote-ds2000-muzny)

Three ways to participate (please do one of these!)
1) via the PollEverywhere website: https://pollev.com/muzny
2) via text: text "muzny" to the number 22333 to join the session
3) via Poll Everywhere app (available for iOS or Android)
"""

"""
HW 7 reflection
---
Look at the sample speeds.pdf figure. (Or the one that you produced)

How do we feel about it?
A. looks good
B. I have questions
C. looks bad
D. That line at 0 makes me queasy
E. This graph is wrong?

Why does this look bad?
- tons of values at zero
- this doesn't make sense
    - means that speed was 0 for a ton of trips, we don't think that
      this many trips were just people getting a bike and not moving

How could we answer "what's going on with 0"?
- investigate if these are the trips that we don't have station data for
- speed is distance/duration -> look at both of these variables to find
  "where" this result comes from
- is this correct? Go investigate our calculation for bugs
- write a little code to display trips with 0 mph speed -> look at these
  with your eyes and see if you can find patterns
"""

"""
Sentiment
----
Is anti-hero (Taylor Swift) positive, negative, neutral?
A. positive
B. negative
C. neutral

Hypothesis: (15 pos, 80 neg, 5 neutral)

Are the following reviews:
A. positive
B. negative
C. neutral

The movie ____________ is - forgive the critical jargon - pretty good
(72 pos)
- most significant words are "pretty good" which is positive

Clearly, _________'s film, while riddled with glaringly awful mistakes,
is not bad at all.
(30 pos, 30 neg, 30 neutral)
- "glaringly awful" + "not bad" = neutral ?
- this seems hard because maybe sarcasm/something else?

You can't believe what you're looking at because it's so hideous to behold.
The best thing here is that it's at least under two hours
(75 neg)
- hideous
- the best thing is actually "a bad thing"

The film is a triple-decker weirdburger from the twitching ears to the
too-long tails that make the ensemble look like lemurs.
-> Cats (2019)

To measure sentiment: start with the easiest thing first:
-> count how many positive and negative words we have
-> download word lists from the internet/past research to load in those words
"""

"""
Sets
---
- a data structure
    -> like: lists, dictionaries
    -> these are for storing a collection of values
- a set is a collection of unique values
- it has no order (no indices, no keys)
- it is super fast to look things up in

# not so fast
if value in list:

# super fast
if value in set:
"""

# create a new set
s1 = set()

# sets don't have indices
# TypeError: 'set' object does not support item assignment
# s1[0] = 0

# can't append to a set
# AttributeError: 'set' object has no attribute 'append'
# s1.append(0)

# Add to a set!
s1.add(0)
print(s1)
s1.add(31)
print(s1)

# if I try to add a value again, no error, but doesn't re-add
s1.add(31)
print(s1)
s1.add(31)
s1.add(31)
print(s1)  # still {0, 31}

# make a set from a list
ls = [1, 1, 1, 2, 3]
s2 = set(ls)
print(s2)

# test to see if a value is in a set
print(1 in s2)
print(9 in s2)

"""
Moving Averages
---
"""

# from dataproc.py (Prof. Rachlin's version of
# data_utils.py)


def avg(L):
    """
    Compute the numerical average of a list of numbers.
    If list is empty, return 0.0
    """
    if len(L) > 0:
        return sum(L) / len(L)
    else:
        return 0.0


def get_window(L, idx, window_size=1):
    """
    Extract a window of values of specified size
    centered on the specified index
    L: List of values
    idx: Center index
    window_size: window size
    """
    minrange = max(idx - window_size // 2, 0)
    maxrange = idx + window_size // 2 + (window_size % 2)
    return L[minrange:maxrange]


def moving_average(L, window_size=1):
    """
    Compute a moving average over the list L
    using the specified window size
    L: List of values
    window_size - The window size (default=1)
    return - A new list with smoothed values
    """
    mavg = []
    for i in range(len(L)):
        window = get_window(L, i, window_size)
        mavg.append(avg(window))
    return mavg


ls = [1, 2, 3, 7, 10, 12]
print(ls)
print(avg(ls))  # 5.833333333333333

# a moving average smooths out variations due to
# daily fluctuation (or specific lines that are very pos/neg)
# [2, 3, 7]
print(get_window(ls, 2, window_size=3))

# [1.0, 2.0, 3.0, 7.0, 10.0, 12.0]
print(moving_average(ls, window_size=1))

# index: description
# 0: average of just 1
# 1: average of 1 and 2
# 2: average of 2 and 3
# 3: average of 3 and 7
# [1.0, 1.5, 2.5, 5.0, 8.5, 11.0]
print(moving_average(ls, window_size=2))

# what happens w/ big window sizes?
# if window_size was 10 (and the list was long enough), the windows near
# the start get cut off at index 0: index 0 averages 5 numbers,
# index 1 averages 6, etc., until index 5 gets the full 10
# (windows near the end of the list get cut off the same way)

"""
Next time:
- Jupyter Notebooks
- Classes and objects
"""
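
"""
Sketch: counting sentiment words with sets
---
The Sentiment notes suggest starting with the easiest thing first: count how
many positive and negative words a review has. Below is a minimal sketch of
that idea, not the homework solution.

Assumptions (none of this is from the lecture itself):
- the word lists are tiny, hand-picked examples; real lists would be
  downloaded and loaded from files
- the names POSITIVE_WORDS, NEGATIVE_WORDS, and count_sentiment_words
  are hypothetical

Sets are used for the word lists because "word in set" is fast to check.
"""
import string

# hypothetical, hand-picked word lists (NOT a real sentiment lexicon)
POSITIVE_WORDS = {"good", "best", "great"}
NEGATIVE_WORDS = {"bad", "awful", "hideous"}


def count_sentiment_words(text):
    """
    Count the positive and negative words in a string.
    text: the review text
    return - a tuple (number of positive words, number of negative words)
    """
    pos_count = 0
    neg_count = 0
    for word in text.lower().split():
        # strip punctuation from both ends so "good." matches "good"
        word = word.strip(string.punctuation)
        if word in POSITIVE_WORDS:
            pos_count += 1
        elif word in NEGATIVE_WORDS:
            neg_count += 1
    return pos_count, neg_count


review = "Riddled with glaringly awful mistakes, but not bad at all."
# (0, 2): "not bad" still counts "bad" as negative, which shows why
# plain word counting is only a starting point
print(count_sentiment_words(review))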