PYTHON 3.6.2


last hacked on Jul 22, 2017

The 2016 U.S. presidential election yielded an unexpected result for many - while an overwhelming majority of polls pointed to Hillary Clinton as the winner, Donald Trump eventually emerged the victor. In this project, we dive into the election poll data provided by [FiveThirtyEight](https://projects.fivethirtyeight.com/2016-election-forecast/) with the goal of visualizing the data to gain insight about polling accuracy leading up to the election.

## Table of Contents
1. Environment Setup
2. Cleaning Data
3. Plotting All Data
    - Scatter Plot
    - Continuous Graph with Error Bars
    - Bubble Charts
4. Plotting by Pollster Grade
    - Scatter Plot
    - Boxplot
    - Bubble Chart
5. Plotting by State
    - Categorical Scatter Plot
    - Choropleth Maps

## Environment Setup
This project was completed in an Anaconda virtual environment running Python 3.6. More information about setting up virtual environments can be found [here](https://github.com/junseo-park/election/blob/master/README.md).

## Cleaning Data
We begin by loading and cleaning the dataset.

```python
import numpy as np
import pandas as pd

df = pd.read_csv('http://projects.fivethirtyeight.com/general-model/president_general_polls_2016.csv')

# Keep only the columns used in this analysis.
categories = ['type', 'state', 'enddate', 'pollster', 'grade', 'samplesize', 'population',
              'adjpoll_clinton', 'adjpoll_trump', 'adjpoll_johnson', 'adjpoll_mcmullin', 'poll_id']
df2 = df.loc[:, categories]

# Keep only the rows whose type is 'polls-only'.
df_po = df2[df2.loc[:,'type']=='polls-only']
df_po = df_po.reset_index(drop=True)

# Parse the poll end dates into datetimes.
df_po['enddate'] = pd.to_datetime(df_po['enddate'])
```

## Plotting All Data
### Scatter Plot
We begin with a simple scatter plot of all points:

<iframe width="900" height="600" frameborder="0" scrolling="no" src="//plot.ly/~junseopark/14.embed"></iframe>

It looks like Trump generally began the election campaign with lower figures, but the data doesn't indicate anything conclusive towards the end of the campaign. One strange set of outliers stands out: several polls indicated Clinton would attract around 85% of the votes.

Since there was so much variation in the data, we decided to take a look at the overall trends, with error bars showing the standard deviation of the polls. (From this point on, we have excluded Johnson and McMullin due to their low impact on the results.)

### Continuous Graph with Error Bars
Continuous error graphs allow us to see the mean and standard deviation of all polls, depicted over time.
First we have to reorganize our dataset into bins in order to find standard deviations. Using only the national (`U.S.`) polls, we snap each poll's end date to the start of its five-day window (the 1st, 6th, 11th, 16th, 21st, or 26th of the month, with the last window running to the end of the month), then compute the mean and standard deviation within each window:

```python
# Keep national polls only, with the end date and the two major candidates.
df_date = df_po[df_po.loc[:,'state']=='U.S.']
df_date = df_date.loc[:,['enddate', 'adjpoll_clinton', 'adjpoll_trump']]
df_date = df_date.reset_index(drop=True)

# Snap each end date to the first day of its five-day bin.
for index in range(len(df_date)):
    d = df_date.loc[index, 'enddate']
    if d.day <= 5:
        df_date.loc[index, 'enddate'] = d.replace(day=1)
    elif d.day <= 10:
        df_date.loc[index, 'enddate'] = d.replace(day=6)
    elif d.day <= 15:
        df_date.loc[index, 'enddate'] = d.replace(day=11)
    elif d.day <= 20:
        df_date.loc[index, 'enddate'] = d.replace(day=16)
    elif d.day <= 25:
        df_date.loc[index, 'enddate'] = d.replace(day=21)
    else:
        df_date.loc[index, 'enddate'] = d.replace(day=26)

# Compute the mean and standard deviation per bin, removing each bin from
# df_date once it has been processed.
df_bins = pd.DataFrame(columns=('date', 'mean_c', 'mean_t', 'sd_c', 'sd_t'))
while len(df_date) != 0:
    d = df_date.iloc[0,:].loc['enddate']
    in_bin = df_date.loc[:,'enddate']==d
    mean_c = np.mean(df_date.loc[in_bin,'adjpoll_clinton'])
    mean_t = np.mean(df_date.loc[in_bin,'adjpoll_trump'])
    sd_c = np.std(df_date.loc[in_bin,'adjpoll_clinton'])
    sd_t = np.std(df_date.loc[in_bin,'adjpoll_trump'])
    df_bins.loc[df_bins.shape[0]] = [d, mean_c, mean_t, sd_c, sd_t]
    df_date = df_date[~in_bin]
    df_date = df_date.reset_index(drop=True)

df_bins.sort_values(by='date', inplace=True)
```

Since our data is now organized correctly, we're ready to plot.

<iframe width="900" height="600" frameborder="0" scrolling="no" src="//plot.ly/~junseopark/26.embed"></iframe>

As was reported in the media (and borne out by the election results), Clinton held a lead in the popular vote until the end. The error bars begin with a wide spread, but towards the end of the race they shrink, indicating that the polls agreed more closely with one another.
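
The binning and aggregation loop above can also be expressed with a pandas `groupby`. The following is a minimal sketch of that alternative, not the code used to produce the plot: `snap_to_bin` and `bin_polls` are hypothetical helper names, the sample rows are made up, and the standard deviation uses `ddof=0` to match the population standard deviation that `np.std` computes by default.

```python
import pandas as pd

def snap_to_bin(ts):
    """Snap a timestamp to day 1, 6, 11, 16, 21, or 26 of its month."""
    # Days 1-5 -> 1, 6-10 -> 6, ..., 26-31 -> 26 (the cap handles day 31).
    start_day = min((ts.day - 1) // 5 * 5 + 1, 26)
    return ts.replace(day=start_day)

def bin_polls(df_date):
    """Return per-bin means and population standard deviations."""
    binned = df_date.assign(date=df_date['enddate'].map(snap_to_bin))
    stats = binned.groupby('date').agg(
        mean_c=('adjpoll_clinton', 'mean'),
        mean_t=('adjpoll_trump', 'mean'),
        sd_c=('adjpoll_clinton', lambda s: s.std(ddof=0)),
        sd_t=('adjpoll_trump', lambda s: s.std(ddof=0)),
    )
    # groupby sorts by the bin date, so no extra sort is needed.
    return stats.reset_index()

# Made-up rows for illustration only; these are not real poll numbers.
sample = pd.DataFrame({
    'enddate': pd.to_datetime(['2016-07-03', '2016-07-05', '2016-07-31']),
    'adjpoll_clinton': [44.0, 46.0, 45.0],
    'adjpoll_trump': [40.0, 42.0, 43.0],
})
bins = bin_polls(sample)
print(bins)
```

With the real data, `bin_polls(df_date)` would be called on the national-poll frame before its dates are modified, since the helper does the snapping itself.
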
For context, the FBI issued its first statement regarding Clinton's private email server at the beginning of July (when we see Trump's poll ratings rise significantly). Other events that occurred but did not produce a significant impact:

- May 26, 2016: Trump secures Republican nomination
- June 6, 2016: Clinton secures Democratic nomination
- July 18-21, 2016: Republican National Convention
- July 25-28, 2016: Democratic National Convention
- September 26, 2016: First Presidential debate
- October 4, 2016: VP debate
- October 7, 2016: Trump's incriminating video is released
- October 9, 2016: Second Presidential debate
- October 19, 2016: Third Presidential debate
- October 28, 2016: FBI announces a continued investigation into Clinton's private email server

### Bubble Charts
Next, we look at the relation between sample size and poll results.

<iframe width="900" height="600" frameborder="0" scrolling="no" src="//plot.ly/~emmanduncan/260.embed"></iframe>

By isolating each candidate, we can see that as sample size increases, the poll results tend to converge. Very high and very low percentages come from smaller polls conducted at the state level. This further shows that the outcome of the election was difficult to predict, even with larger poll samples.

In addition, we see that sample size tends to increase as we get closer to the date of the election. This makes sense, since political awareness increases as the election date approaches.

## Plotting by Pollster Grade
The dataset has 10 different `grade` levels: A+, A, A-, B+, B, B-, C+, C, C-, and D. In addition, some polls do not have a grade. For simplicity, we recategorized these into six groups: A+, A, B, C, D, and NA.

### Scatter Plot

<iframe width="900" height="800" frameborder="0" scrolling="no" src="//plot.ly/~junseopark/10.embed"></iframe>

We see that there are not enough D and NA polls to draw conclusions about those two groups.
However, it is fascinating to see that, in general, A and A+ polls expected Trump to fare worse in the beginning, whereas B and C polls do not show any particular bias towards Clinton (aside from the aforementioned 85% Clinton polls).

### Boxplot
For further comparison by grade, we also plot each in boxplot form. This will allow us to see not only the change of data over time, but also the spread of the data.

<iframe width="900" height="600" frameborder="0" scrolling="no" src="//plot.ly/~junseopark/28.embed"></iframe>

We must be careful with boxplots, as we are grouping a dataset over time into one visualization. The lower grades (B and C) demonstrate a larger spread than the A+ and A polls. This is expected: logically, the polls with worse reputations should have more variability in their results. But even so, each of the grades still suggests that Clinton will emerge victorious. Perhaps we're still missing something here.

### Bubble Chart
Next, we look at sample size as it relates to the grade of each pollster.

<iframe width="900" height="800" frameborder="0" scrolling="no" src="//plot.ly/~emmanduncan/262.embed"></iframe>

As seen above, a larger sample size does not necessarily indicate a more accurate poll. Looking just at the U.S. polling data, polls graded 'B' or 'C' tend to have the largest sample sizes, and these grow as the election date approaches. This indicates that sample size may not have a direct influence on the pollster grade. The polls graded 'A' and 'A+' have smaller and fairly consistent sample sizes, none larger than 2,500. According to FiveThirtyEight, the calculation of these poll grades considers "type of election surveyed, [and] a poll’s sample size" ([source](https://fivethirtyeight.com/features/how-fivethirtyeight-calculates-pollster-ratings/)). Although sample size plays a role in predicting accuracy, perhaps a consistent sample size matters more than a large one.
For state polling data, sample sizes are smaller and depend on the population of each state.

## Plotting by State
What role did each state play in the election? We begin by looking at all poll results divided by state:

### Categorical Scatter Plot

<iframe width="900" height="600" frameborder="0" scrolling="no" src="//plot.ly/~emmanduncan/278.embed"></iframe>

In this plot, it is easy to identify which states are clearly red or clearly blue (where the data points are grouped by color and spread apart), as well as the swing states (where the data points are all grouped towards the center). It is interesting to note that even in the most polarized states, such as D.C., there are a few polls where Trump's and Clinton's percentages are almost equal.

### Choropleth Maps
To map the results, we first aggregate the polls for each state, weighting each poll by its sample size:

```python
state_dict = {'Alabama':'AL', 'Alaska':'AK', 'Arizona':'AZ', 'Arkansas':'AR', 'California':'CA',
              'Colorado':'CO', 'Connecticut':'CT', 'Delaware':'DE', 'District of Columbia':'DC',
              'Florida':'FL', 'Georgia':'GA', 'Hawaii':'HI', 'Idaho':'ID', 'Illinois':'IL',
              'Indiana':'IN', 'Iowa':'IA', 'Kansas':'KS', 'Kentucky':'KY', 'Louisiana':'LA',
              'Maine':'ME', 'Maryland':'MD', 'Massachusetts':'MA', 'Michigan':'MI',
              'Minnesota':'MN', 'Mississippi':'MS', 'Missouri':'MO', 'Montana':'MT',
              'Nebraska':'NE', 'Nevada':'NV', 'New Hampshire':'NH', 'New Jersey':'NJ',
              'New Mexico':'NM', 'New York':'NY', 'North Carolina':'NC', 'North Dakota':'ND',
              'Ohio':'OH', 'Oklahoma':'OK', 'Oregon':'OR', 'Pennsylvania':'PA',
              'Rhode Island':'RI', 'South Carolina':'SC', 'South Dakota':'SD', 'Tennessee':'TN',
              'Texas':'TX', 'Utah':'UT', 'Vermont':'VT', 'Virginia':'VA', 'Washington':'WA',
              'West Virginia':'WV', 'Wisconsin':'WI', 'Wyoming':'WY'}

# For each state, total the sample sizes and the sample-size-weighted poll percentages.
data_list = []
for state, code in state_dict.items():
    state_polls = df_po[df_po.loc[:,'state']==state]
    samplesize = state_polls.loc[:,'samplesize']
    clinton = state_polls.loc[:,'adjpoll_clinton']
    trump = state_polls.loc[:,'adjpoll_trump']

    dict1 = {}
    dict1['state'] = state
    dict1['code'] = code
    dict1['samplesize'] = samplesize.sum()
    dict1['clinton'] = (clinton * samplesize).sum()
    dict1['trump'] = (trump * samplesize).sum()
    data_list.append(dict1)

df_map = pd.DataFrame(data_list, columns=['code', 'state', 'samplesize', 'clinton', 'trump'])

# Dividing by the total sample size gives the weighted average percentage.
df_map.loc[:,'clinton_pct'] = df_map.loc[:,'clinton']/df_map.loc[:,'samplesize']
df_map.loc[:,'trump_pct'] = df_map.loc[:,'trump']/df_map.loc[:,'samplesize']
```

`df_map` now contains the aggregate percentages for each state. To look at the difference between the polls and the actual results by state, we read in [another dataset](https://github.com/junseo-park/election/blob/master/results.csv) ([source](https://simonrogers.net/2016/11/16/us-election-2016-how-to-download-county-level-results-data/)) containing the actual election results. We then subtracted the poll results from the actual results to see how much better (or worse) each candidate performed in each state.

```python
results = pd.read_csv('results.csv')

# The subtraction below pairs rows by index, so results must be in the
# same state order as df_map.
results = results.sort_values('state')
results = results.reset_index(drop=True)

# Positive values mean the candidate did better than the polls suggested.
df_map['clinton_diff'] = results['clinton'] - df_map['clinton_pct']
df_map['trump_diff'] = results['trump'] - df_map['trump_pct']
```

<iframe width="900" height="800" frameborder="0" scrolling="no" src="//plot.ly/~junseopark/30.embed"></iframe>

<iframe width="900" height="800" frameborder="0" scrolling="no" src="//plot.ly/~junseopark/32.embed"></iframe>

From these two graphs, we can see that both candidates generally performed better than their polls across the board. This can be attributed to partisanship; though a voter may not agree with the views of a particular candidate throughout the campaign, the voter is more likely to vote for a candidate simply to support the party and undermine the opposition. But when comparing these boosts for both candidates, we can see that Trump's percentages rose by a larger amount and had a bigger influence on the results.
This was particularly true in the swing states; while Clinton's results there rose by around 2 percentage points over her polls, Trump's rose by around 7, a substantial difference.
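
The poll-versus-result comparison pairs rows by position, which only works if both tables list the same states in the same order. As a minimal sketch of a more defensive variant, the weighted averages can be computed with `groupby` and joined to the results on the state name. The function names below are hypothetical, the numbers are made up, and the results table is assumed to have `state`, `clinton`, and `trump` columns holding state names and percentages; the actual layout of `results.csv` is not shown here and may differ.

```python
import pandas as pd

def weighted_state_polls(polls):
    """Sample-size-weighted mean poll percentage per state."""
    weighted = polls.assign(
        clinton_w=polls['adjpoll_clinton'] * polls['samplesize'],
        trump_w=polls['adjpoll_trump'] * polls['samplesize'],
    )
    totals = weighted.groupby('state')[['samplesize', 'clinton_w', 'trump_w']].sum()
    totals['clinton_pct'] = totals['clinton_w'] / totals['samplesize']
    totals['trump_pct'] = totals['trump_w'] / totals['samplesize']
    return totals[['clinton_pct', 'trump_pct']].reset_index()

def poll_vs_result(polls, results):
    """Join on state name, then subtract poll averages from actual results."""
    merged = weighted_state_polls(polls).merge(results, on='state', how='inner')
    merged['clinton_diff'] = merged['clinton'] - merged['clinton_pct']
    merged['trump_diff'] = merged['trump'] - merged['trump_pct']
    return merged

# Made-up numbers for illustration only; these are not real polls or results.
polls = pd.DataFrame({
    'state': ['Ohio', 'Ohio', 'Iowa'],
    'samplesize': [1000, 3000, 500],
    'adjpoll_clinton': [44.0, 48.0, 40.0],
    'adjpoll_trump': [46.0, 42.0, 45.0],
})
results = pd.DataFrame({
    'state': ['Ohio', 'Iowa'],  # deliberately not in alphabetical order
    'clinton': [45.0, 42.0],
    'trump': [50.0, 51.0],
})
diffs = poll_vs_result(polls, results)
print(diffs[['state', 'clinton_diff', 'trump_diff']])
```

Because the join is keyed on `state`, the row order of either table no longer matters, and a state missing from one table is dropped rather than silently paired with the wrong row.
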

