REPRODUCING A GRAPH FROM THE ECONOMIST

USING GGPLOT TO RECREATE A GRAPH FROM THE ECONOMIST

R

MEDIUM

last hacked on Jul 22, 2017

# Replicating a Graph from _The Economist_

## Data Scientists and Visualizations

If you have not already seen it, head over to _The Economist_ and check out their website. In particular, look at the graphs they created here:

http://www.economist.com/blogs/graphicdetail

The graphs they created are perhaps the paragon of data visualization. Their graphs are crisp, clean, and informative. Forget the article, the entire meat of the story is told through a graph that a five-year-old can read.

Every data scientist faces, at some point, the question, "What significant question do I answer through data visualization?" Data scientists can clean and manipulate data like no other person can. But, I would argue, they are not true data scientists if they cannot form questions that no one knew they wanted answers to.

## The Problem

The problem is I cannot teach you how to ask fascinating questions about the data you just cleaned. Actually, nobody can help you. There is no algorithm that can teach you what to extract from your data to answer a burning question. However, what you can do is become familiar with a wide variety of topics such as sports, politics, technology, finance, and, most importantly, _your_ interests. You do not have to be an expert in every topic, but learn the patterns and problems that those topics face. Only then will you begin to see the interconnectedness across all disciplines.

## Project Abstract

In this project, we will tackle __Data Visualization__ in `R` using the beautiful package `ggplot2`. In particular, we will try our best to replicate this graph created by _The Economist_.

![Economist](https://cloud.githubusercontent.com/assets/22850980/24850224/0dedd2e8-1d84-11e7-88e6-4137b3f662fb.jpg)

When I first looked at this graph, I knew plotting the data was going to be easy. I see a scatter plot layered on a line. Easy, right? Well, kind of. In this project, we will try to get as close as we can to the actual output from _The Economist_ using the `ggplot2` library.

## Inspiration and Documentation

__The Economist__

* How can you not be inspired by their graphs! Look at what they plotted and try to figure out why, say, a histogram was more useful and informative than, say, a scatterplot matrix.

http://www.economist.com/blogs/graphicdetail

__Harvard Workshop on `ggplot`__

* Harvard put together a workshop that introduced students to `ggplot` in the most Harvard way possible--"reproduce this graph from _The Economist_ given these tools we will discuss in the workshop." Talk about being thrown to the lions and coming out alive. My first `ggplot` was an attempt at a scatterplot. They give excellent information here, and the challenge was the exact same graph that we will replicate. Unfortunately, they did not give a solution, so most of this project was me doing trial and error and a ton of Google searches. I am sure there are better and more efficient solutions than my code, but that is your challenge.

http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html

# Reproducing a Graph from _The Economist_

## This is Our Goal

![Economist](https://cloud.githubusercontent.com/assets/22850980/24850224/0dedd2e8-1d84-11e7-88e6-4137b3f662fb.jpg)

## Getting Started

First, we need to load the required packages with `library()`.

```R
library(ggplot2)
library(ggrepel)
library(grid)
```

    Warning message:
    “package ‘ggplot2’ was built under R version 3.3.2”
    Warning message:
    “package ‘ggrepel’ was built under R version 3.3.2”

Next, we need to set the working directory, if necessary, and load in our data.

```R
getwd()
setwd("/Users/Macbook/Desktop/My_Projects")
economist <- read.csv("Rgraphics/dataSets/EconomistData.csv")
```

    '/Users/Macbook'

Before I start messing with the data at all, I always look at the `head()`.

```R
head(economist)
```

<table>
<thead><tr><th scope=col>X</th><th scope=col>Country</th><th scope=col>HDI.Rank</th><th scope=col>HDI</th><th scope=col>CPI</th><th scope=col>Region</th></tr></thead>
<tbody>
<tr><td>1</td><td>Afghanistan</td><td>172</td><td>0.398</td><td>1.5</td><td>Asia Pacific</td></tr>
<tr><td>2</td><td>Albania</td><td>70</td><td>0.739</td><td>3.1</td><td>East EU Cemt Asia</td></tr>
<tr><td>3</td><td>Algeria</td><td>96</td><td>0.698</td><td>2.9</td><td>MENA</td></tr>
<tr><td>4</td><td>Angola</td><td>148</td><td>0.486</td><td>2.0</td><td>SSA</td></tr>
<tr><td>5</td><td>Argentina</td><td>45</td><td>0.797</td><td>3.0</td><td>Americas</td></tr>
<tr><td>6</td><td>Armenia</td><td>86</td><td>0.716</td><td>2.6</td><td>East EU Cemt Asia</td></tr>
</tbody>
</table>

## Data Wrangling

All I did here was compute the fitted values for the $R^2$ trend line by regressing `HDI` on `log(CPI)`. I think this is what they did. I could be wrong.

```R
pred <- predict(lm(HDI ~ log(CPI), data = economist))
```

Looking at the graph and the head of our data, I see no clear indication of why they chose to label certain points. So I handpicked these and stored them in a vector called `pointsToLabel`. After that, I used `factor` to relabel the regions to match what _The Economist_ actually called them.

***
Example: From the Region column, "Asia Pacific" corresponds to "Asia & Oceania"
***

```R
pointsToLabel <- c("Russia", "Venezuela", "Iraq", "Myanmar", "Sudan",
                   "Afghanistan", "Congo", "Greece", "Argentina", "Brazil",
                   "India", "Italy", "China", "South Africa", "Spain",
                   "Botswana", "Cape Verde", "Bhutan", "Rwanda", "France",
                   "United States", "Germany", "Britain", "Barbados", "Norway",
                   "Japan", "New Zealand", "Singapore")

economist$Region <- factor(economist$Region,
                           levels = c("EU W. Europe", "Americas", "Asia Pacific",
                                      "East EU Cemt Asia", "MENA", "SSA"),
                           labels = c("OECD", "Americas", "Asia &\nOceania",
                                      "Central &\nEastern Europe",
                                      "Middle East &\nNorth Africa",
                                      "Sub-Saharan\nAfrica"))
```

## Plotting

Let's start off small and build our way up. The documentation for `ggplot` gives us this general outline.

***
` ggplot(data, aes(x, y, <other aesthetics like color, fill, etc.>))`
***

```R
ggplot(economist, aes(x = CPI, y = HDI))
```

![png](https://cloud.githubusercontent.com/assets/22850980/24988654/281fd320-1fbc-11e7-8fd9-71c6a0da56eb.png)

Perfect. This is what we wanted. We are setting ourselves up for greatness. Let's start adding our scatterplot and linear model.

```R
ggplot(economist, aes(x = CPI, y = HDI)) +
  geom_point() +
  geom_smooth(aes(y = pred))
```

    `geom_smooth()` using method = 'loess'

![png](https://cloud.githubusercontent.com/assets/22850980/24988658/28209526-1fbc-11e7-8381-89c25187e921.png)

This is good. But this looks so basic it hurts. I would rather look at the blank canvas from before, because at least there I can visualize a pretty plot. Let's take advantage of `ggplot` geometries to specify aesthetics like the shape, color, fill, line color, line weight, etc.

```R
ggplot(economist, aes(x = CPI, y = HDI)) +
  geom_point(aes(color = Region), shape = 1, fill = 4, stroke = 1.5, alpha = 1, size = 3) +
  geom_smooth(aes(y = pred), color = "red", linetype = 1, weight = 2, fullrange = TRUE)
```

    `geom_smooth()` using method = 'loess'

![png](https://cloud.githubusercontent.com/assets/22850980/24988656/282014fc-1fbc-11e7-8b6c-96cf595e53c0.png)

This is atrocious. The legend tells me nothing. The colors of the circles tell me nothing. If I saw this in a magazine, I would think they were plotting the Consumer Price Index against the Human Death Index, if that's a thing, colored according to region. Either you are from the Americas or you are not applicable.

Let's add some labels to see what's going on.

```R
ggplot(economist, aes(x = CPI, y = HDI)) +
  geom_point(aes(color = Region), shape = 1, fill = 4, stroke = 1.5, alpha = 1, size = 3) +
  geom_smooth(aes(y = pred), color = "red", linetype = 1, weight = 2, fullrange = TRUE) +
  geom_text_repel(aes(label = Country))
```

    `geom_smooth()` using method = 'loess'

![png](https://cloud.githubusercontent.com/assets/22850980/24988655/281ff012-1fbc-11e7-9b99-99de3b4925c6.png)

So, let's start off with the good.

__The Good__

* I now know that the dots correspond to countries.

__The Bad__

* Come up with your own insult. I'm done.

This is where we get on Google and start asking some questions. This is the perfect example of shooting yourself in the foot with too much information. Find a balance. The people at _The Economist_ probably had this exact same graph at some point. They did not have the foresight we have. They had no idea what the end product would be. This is where the real work starts.

## Where are we?

***

![Economist](https://cloud.githubusercontent.com/assets/22850980/24850224/0dedd2e8-1d84-11e7-88e6-4137b3f662fb.jpg)

## Fine Tuning

_Didn't we pick out certain countries to label?_

* Yes! Let's put that in there using `subset()`.

_I still don't know what the axis labels are._

* Let's put that in there using `scale_x_continuous` and `scale_y_continuous`.

_We have not broken up the axis yet either._

* Perfect timing, since we just used `scale_x_continuous` and `scale_y_continuous`. We can put in the optional arguments `breaks` and `limits`.

_Last request. I promise. But can we add a title? And can we get rid of the ugly gray background, if possible?_

* That was two requests, but yes to both of them. For the title, we can use `ggtitle()`. And, for the ugly gray background, `theme_minimal()`. Remember, for these two, we are layering. Therefore, we have to use the "+" for these.

```R
ggplot(economist, aes(x = CPI, y = HDI)) +
  geom_point(aes(color = Region), shape = 1, fill = 4, stroke = 1.5, alpha = 1, size = 3) +
  geom_text_repel(aes(label = Country),
                  data = subset(economist, Country %in% pointsToLabel),
                  force = 10) +
  geom_smooth(aes(y = pred), color = "red", linetype = 1, weight = 2, fullrange = TRUE) +
  scale_x_continuous(name = "Corruption Perceptions Index",
                     breaks = seq(1, 10, by = 1),
                     limits = c(1, 10)) +
  scale_y_continuous(name = "Human Development Index, 2011",
                     breaks = seq(0, 1.0, by = 0.1),
                     limits = c(0.2, 1.0)) +
  scale_color_manual(name = "",
                     values = c("#24576D", "#099DD7", "#28AADC",
                                "#248E84", "#F2583F", "#96503F")) +
  ggtitle("Corruption and Human development") +
  theme_minimal()
```

    `geom_smooth()` using method = 'loess'
    Warning message:
    “Removed 10 rows containing missing values (geom_smooth).”

![png](https://cloud.githubusercontent.com/assets/22850980/24988657/2820561a-1fbc-11e7-918d-955eca497f02.png)

This is leaps and bounds better than before. If you are satisfied, you can stop here. The last chunk is mainly trial and error when it comes to positioning the legend, title, axis attributes, and various other minor changes. I suggest playing around with different values to see how each one changes the last bits of the graph.

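
One thing worth checking before the final chunk: the `subset()` filter used for the labels silently drops any name that does not exactly match a value in the `Country` column. Here is a minimal sketch of how I would sanity-check that, using made-up data. The `toy` data frame, its values, and the `wanted` vector are hypothetical stand-ins, not part of the project.

```R
# Toy stand-in for the real data (values are made up for illustration).
toy <- data.frame(
  Country = c("France", "Spain", "Norway", "Japan"),
  CPI     = c(7.0, 6.2, 9.0, 8.0),
  stringsAsFactors = FALSE
)

# Hypothetical label list with one deliberate misspelling ("Frence").
wanted <- c("Frence", "Norway", "Japan")

# subset() keeps only the rows whose Country appears in `wanted`.
labelled <- subset(toy, Country %in% wanted)

# setdiff() lists the requested names that matched nothing in the data.
unmatched <- setdiff(wanted, toy$Country)

print(labelled$Country)
print(unmatched)
```

Running the same `setdiff()` call with `pointsToLabel` and `economist$Country` should list any label that will never show up on the plot.
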
```R
ggplot(economist, aes(x = CPI, y = HDI)) +
  geom_point(aes(color = Region), shape = 1, fill = 4, stroke = 1.5, alpha = 1, size = 3) +
  geom_text_repel(aes(label = Country),
                  data = subset(economist, Country %in% pointsToLabel),
                  force = 10) +
  geom_smooth(aes(y = pred), color = "red", linetype = 1, weight = 2, fullrange = TRUE) +
  scale_x_continuous(name = "Corruption Perceptions Index",
                     breaks = seq(1, 10, by = 1),
                     limits = c(1, 10)) +
  scale_y_continuous(name = "Human Development Index, 2011",
                     breaks = seq(0, 1.0, by = 0.1),
                     limits = c(0.2, 1.0)) +
  scale_color_manual(name = "",
                     values = c("#24576D", "#099DD7", "#28AADC",
                                "#248E84", "#F2583F", "#96503F")) +
  ggtitle("Corruption and Human Development") +
  theme_minimal() + # start with a minimal theme and add what we need
  theme(text = element_text(color = "gray20"),
        legend.position = "top", # position the legend in the upper left
        legend.direction = "horizontal",
        legend.justification = c(0.1, 0), # anchor point for legend.position
        legend.text = element_text(size = 11, color = "gray10"),
        axis.text = element_text(face = "italic"),
        axis.title.x = element_text(vjust = -1), # move title away from axis
        axis.title.y = element_text(vjust = 2), # move title away from axis
        axis.ticks.y = element_blank(), # element_blank() is how we remove elements
        axis.line = element_line(color = "gray40", size = 0.5),
        axis.line.y = element_blank(),
        panel.grid.major = element_line(color = "gray50", size = 0.5),
        panel.grid.major.x = element_blank()
  ) +
  guides(colour = guide_legend(nrow = 1)) # forces legend to be in a single line

# R-squared of the log-linear fit; the outer parentheses also print it
(mR2 <- summary(lm(HDI ~ log(CPI), data = economist))$r.squared)

grid.text("Sources: Transparency International; UN Human Development Report",
          x = .02, y = .03, just = "left", draw = TRUE)
grid.segments(x0 = 0.81, x1 = 0.825, y0 = 0.90, y1 = 0.90,
              gp = gpar(col = "red"), draw = TRUE)
grid.text(paste0("R² = ", as.integer(mR2 * 100), "%"),
          x = 0.835, y = 0.90, gp = gpar(col = "gray20"),
          draw = TRUE, just = "left")
```

# Full Code

```R
library(ggplot2)
library(plotly)
library(tidyr)
library(ggrepel)
library(ggthemes)
library(grid)

getwd()
setwd("/Users/Macbook/Desktop/My_Projects")
economist <- read.csv("Rgraphics/dataSets/EconomistData.csv")

pred <- predict(lm(HDI ~ log(CPI), data = economist))

head(economist)

pointsToLabel <- c("Russia", "Venezuela", "Iraq", "Myanmar", "Sudan",
                   "Afghanistan", "Congo", "Greece", "Argentina", "Brazil",
                   "India", "Italy", "China", "South Africa", "Spain",
                   "Botswana", "Cape Verde", "Bhutan", "Rwanda", "France",
                   "United States", "Germany", "Britain", "Barbados", "Norway",
                   "Japan", "New Zealand", "Singapore")

economist$Region <- factor(economist$Region,
                           levels = c("EU W. Europe", "Americas", "Asia Pacific",
                                      "East EU Cemt Asia", "MENA", "SSA"),
                           labels = c("OECD", "Americas", "Asia &\nOceania",
                                      "Central &\nEastern Europe",
                                      "Middle East &\nNorth Africa",
                                      "Sub-Saharan\nAfrica"))

ggplot(economist, aes(x = CPI, y = HDI)) +
  geom_point(aes(color = Region), shape = 1, fill = 4, stroke = 1.5, alpha = 1, size = 3) +
  geom_text_repel(aes(label = Country),
                  data = subset(economist, Country %in% pointsToLabel),
                  force = 10) +
  geom_smooth(aes(y = pred), color = "red", linetype = 1, weight = 2, fullrange = TRUE) +
  scale_x_continuous(name = "Corruption Perceptions Index",
                     breaks = seq(1, 10, by = 1),
                     limits = c(1, 10)) +
  scale_y_continuous(name = "Human Development Index, 2011",
                     breaks = seq(0, 1.0, by = 0.1),
                     limits = c(0.2, 1.0)) +
  scale_color_manual(name = "",
                     values = c("#24576D", "#099DD7", "#28AADC",
                                "#248E84", "#F2583F", "#96503F")) +
  ggtitle("Corruption and Human development") +
  theme_minimal() + # start with a minimal theme and add what we need
  theme(plot.background = element_rect(fill = "#FAEBD7"), # background color
        text = element_text(color = "gray20"),
        legend.position = "top", # position the legend in the upper left
        legend.direction = "horizontal",
        legend.justification = c(0.1, 0), # anchor point for legend.position
        legend.text = element_text(size = 11, color = "gray10"),
        axis.text = element_text(face = "italic"),
        axis.title.x = element_text(vjust = -1), # move title away from axis
        axis.title.y = element_text(vjust = 2), # move title away from axis
        axis.ticks.y = element_blank(), # element_blank() is how we remove elements
        axis.line = element_line(color = "gray40", size = 0.5),
        axis.line.y = element_blank(),
        panel.grid.major = element_line(color = "gray50", size = 0.5),
        panel.grid.major.x = element_blank()
  ) +
  guides(colour = guide_legend(nrow = 1)) # forces legend to be in a single line

# R-squared of the log-linear fit; the outer parentheses also print it
(mR2 <- summary(lm(HDI ~ log(CPI), data = economist))$r.squared)

grid.text("Sources: Transparency International; UN Human Development Report",
          x = .02, y = .02, just = "left", draw = TRUE)
grid.segments(x0 = 0.81, x1 = 0.825, y0 = 0.90, y1 = 0.90,
              gp = gpar(col = "red"), draw = TRUE)
grid.text(paste0("R² = ", as.integer(mR2 * 100), "%"),
          x = 0.835, y = 0.90, gp = gpar(col = "gray20"),
          draw = TRUE, just = "left")
```

# Final Product

![Economist](https://cloud.githubusercontent.com/assets/22850980/24860425/60d845ae-1da9-11e7-8809-ac02fa3a1dbd.jpg)
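
A footnote on the trend line and the R² annotation: both come from the same model, a linear regression of the y variable on the log of the x variable. The sketch below walks through that on a tiny synthetic data set. The `toy` data frame, its numbers, and the 0.529 value are made up for illustration and are not taken from the real data. It also shows that `as.integer()` truncates a percentage where `round()` would round it.

```R
# Synthetic data: y is roughly linear in log(x), plus small hand-picked noise.
toy <- data.frame(x = c(1, 2, 4, 8, 16))
toy$y <- 0.3 + 0.2 * log(toy$x) + c(0.01, -0.02, 0.015, -0.01, 0.005)

# Same model form as lm(HDI ~ log(CPI), ...) in the project code.
fit <- lm(y ~ log(x), data = toy)

# One fitted value per row; this is the quantity stored in `pred` above.
fitted_vals <- predict(fit)

# R-squared of the fit, as extracted for the annotation.
r2 <- summary(fit)$r.squared

# as.integer() drops the fractional part; round() goes to the nearest integer.
truncated <- as.integer(0.529 * 100)  # 52
rounded   <- round(0.529 * 100)       # 53

print(r2)
print(c(truncated, rounded))
```

If you want `ggplot2` to fit the curve itself rather than smoothing over precomputed predictions, `geom_smooth()` also accepts `method = "lm"` together with `formula = y ~ log(x)`. That is an alternative to what this project does, not a description of it.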

COMMENTS


Hey man. So I used this as a basis for a project I made to help me get used to ggplot. Great works on this and thanks for posting it.




