BIKES STOLEN IN ISLA VISTA (2010-2016)

EXPLORATORY ANALYSIS ON LOCATIONS OF BICYCLE THEFTS IN ISLA VISTA

This project analyzes the recorded bicycle thefts within **Isla Vista** during the years **2010** to **2016**, and serves as a guide to exploring a raw data set through the process known as *data wrangling*. Although we could not make many statistically sound conclusions or predictions due to the incomplete data we were dealt, we think this is a good example of working with real-world data, which often contains missing values and other imperfections.
# Load Packages

To start off, we load the appropriate packages into our R environment. As a side note, the features we used in our `ggplot2` plots only work after you install `devtools` and download `ggplot2` directly from *GitHub* using an **API** call. If this is not done, the `subtitle` feature will not appear on your plots.

**UPDATE** (5/23/2017): As of the newest **CRAN** release, this step is no longer necessary; `subtitle` is now a standard feature of `ggplot2`.

## Loading R Packages

```{r}
install.packages("devtools") # MUST BE RUN BEFORE 'devtools::install_github'
                             # OR THE FUNCTION WILL NOT RUN PROPERLY
library(devtools)
devtools::install_github("dkahle/ggmap")
devtools::install_github("hadley/ggplot2")
install.packages("data.table")
install.packages("plyr")

library(ggmap)
library(ggplot2)
library(data.table)
library(plyr)
```

# Get the Data

This data was obtained through the [Santa Barbara County Sheriff's Office](http://www.sbsheriff.org/). We received the **CSV** file and proceeded to clean it to fit our statistical analysis. To access the data, go [here](https://github.com/inertia7/spatial_data_bikesStolen_R) and download the file named `bikesStolen.csv`.

An important part of this step was manually collecting the **Longitude** and **Latitude**, since the data only featured the overarching addresses. Also note that for the addresses that were featured, we recorded the **Longitude** and **Latitude** of the **beginning of the street** (so instead of a specific address like **6574 Sabado Tarde**, the instance was recorded as **6500 Sabado Tarde**).

We begin by reading the **CSV** file into our **R** environment:

```{r}
bikesStolen <- read.csv("bikesStolen.csv", stringsAsFactors = F)
```

Take note of how we included `stringsAsFactors = F` when reading the **CSV** file. If we don't, every column containing strings will be misinterpreted as a factor.

For this project, we used the `data.table` package. We converted the object to a **Data Table** and output the first 6 rows to get an idea of what the data set looks like. (For a more in-depth look, the function `View(bikesStolen)` displays the entire data set in a separate window.)

```{r}
bikesStolen <- data.table(bikesStolen)
head(bikesStolen)
```

## Terminal Output

```
> head(bikesStolen)
     fromDate     toDate fromTime toTime  Brand   Model Speed Color               Location
1:  4/20/2014  4/20/2014      900   1200  URBAN CRUISER     1   WHI  1000 CAMINO PESCADERO
2:   3/8/2011   3/8/2011     1630   1930   TREK  ALLANT    NA   BLK 1000 EL EMBARCADERO RD
3:  6/10/2011  6/20/2011     1500   1000 PFFRAN  JULIUS     1   WHI 1000 EL EMBARCADERO RD
4:   8/4/2012   8/5/2012     2230    100 RETROS      58     1   SIL 1000 EL EMBARCADERO RD
5:   8/4/2012   8/5/2012     2230    100                    1   BLK 1000 EL EMBARCADERO RD
6: 10/22/2011 10/24/2011       30   1230     K2 ZED 2.0    21   GRY      6500 CERVANTES RD
```

As you can see, we have 9 variables corresponding to various potential factors in the bicycle thefts. `fromDate` and `toDate` indicate the estimated dates of theft, while `fromTime` and `toTime` indicate the estimated times. We have other variables relating to the brand, model, color, and speed of the bicycle, which we will go into later.

Next, we check the structure of the **Data Table** to see if the variables are in the desired format using `str`:

```{r}
str(bikesStolen)
```
## Terminal Output

```
> str(bikesStolen)
Classes 'data.table' and 'data.frame':  940 obs. of 9 variables:
 $ fromDate: chr  "4/20/2014" "3/8/2011" "6/10/2011" "8/4/2012" ...
 $ toDate  : chr  "4/20/2014" "3/8/2011" "6/20/2011" "8/5/2012" ...
 $ fromTime: int  900 1630 1500 2230 2230 30 900 400 400 1300 ...
 $ toTime  : int  1200 1930 1000 100 100 1230 900 500 500 915 ...
 $ Brand   : chr  "URBAN" "TREK" "PFFRAN" "RETROS" ...
 $ Model   : chr  "CRUISER" "ALLANT" "JULIUS" "58" ...
 $ Speed   : int  1 NA 1 1 1 21 12 24 1 7 ...
 $ Color   : chr  "WHI" "BLK" "WHI" "SIL" ...
 $ Location: chr  "1000 CAMINO PESCADERO " "1000 EL EMBARCADERO RD" "1000 EL EMBARCADERO RD" ...
```

# Data Wrangling

This is the most important section of our project. Here we walk through the process known as **data wrangling**, where we convert the data table into the appropriate format. Once this is completed, we write a new **CSV** file and do some exploratory analysis.

## Data Types

The variables and target data types are listed below (we go into more detail in the following sections):

**Variable** | **Data Type**
-------------|---------------
fromDate     | Date
toDate       | Date
Brand        | Factor
Model        | Factor
Speed        | Factor
Color        | Ordered Factor

## Converting Data Types

We saw above that the **Date** variables were interpreted as characters. We want them interpreted as dates, so we convert them using `as.Date`:

```{r}
bikesStolen$fromDate <- as.Date(bikesStolen$fromDate, format = "%m/%d/%Y")
bikesStolen$toDate <- as.Date(bikesStolen$toDate, format = "%m/%d/%Y")
```

Next we convert all the variables that are supposed to be factors using `as.factor` (note that each column is assigned back to itself):

```{r}
bikesStolen$Brand <- as.factor(bikesStolen$Brand)
bikesStolen$Model <- as.factor(bikesStolen$Model)
bikesStolen$Speed <- as.factor(bikesStolen$Speed)
```

Then we create an ordered factor for the color of the bicycles, because the colors are recorded as **Car Color Codes**. We looked up what each code meant, then used the `ordered` function to attach the appropriate label to each code and set the order of the colors:

```{r}
colorLabels <- c("Black","Blue","White","Green","Red", "Grey", "Silver",
                 "Yellow","Purple","Pink","Orange", "Brown","Light Green",
                 "Light Blue","Burgundy/Maroon","Turquoise", "Dark Green",
                 "Tan","Chrome","Gold","Diamond Blue", "Cream/Ivory","Beige",
                 "Not Given")

bikesStolen$Color <- ordered(bikesStolen$Color,
                             levels = c("BLK","BLU","WHI","GRN","RED","GRY",
                                        "SIL","YEL","PLE","PNK","ONG",
                                        "BRO","LGR","LBL","MAR","TRQ","DGR",
                                        "TAN","COM","GLD","DBL","CRM","BGE",
                                        ""),
                             labels = colorLabels)
```
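Before continuing, a quick sanity check never hurts. This snippet is our addition (not part of the original write-up): it re-inspects the converted columns, which should now show `Date` and `Factor` classes, and prints the color levels in the order we just imposed.

```{r}
# Our addition: verify the type conversions before moving on
str(bikesStolen[, c("fromDate", "toDate", "Brand", "Model", "Speed", "Color"),
                with = FALSE])
levels(bikesStolen$Color)  # should read Black ... Not Given, in our chosen order
```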
## GeoCoding

We currently have the addresses where the bikes were stolen, but we can't visualize them as is. A great way to do so is a spatial plot, and for that we need to turn our addresses into coordinates. To get the **Longitudes** and **Latitudes**, we use the `geocode` function in `ggmap`.

**NOTE**:

1. This process takes a while to run, and there is a limit to how many **API** calls you can make per day (2,500 addresses per IP address).
2. For one of the streets, the wrong address was interpreted (we were given a location in some other state). To prevent that from causing errors down the line, we manually added **Goleta** to that location within the **CSV** file.

```{r}
coords <- geocode(bikesStolen$Location)
bikesStolen <- data.table(bikesStolen, coords)
```

Here's a snippet of `ggmap` making the **API** calls:

## Terminal Output

```
> coords <- geocode(bikesStolen$Location)
Source : https://maps.googleapis.com/maps/api/geocode/json?address=1000%20CAMINO%20PESCADERO%20
Source : https://maps.googleapis.com/maps/api/geocode/json?address=1000%20EL%20EMBARCADERO%20RD
Source : https://maps.googleapis.com/maps/api/geocode/json?address=1000%20EL%20EMBARCADERO%20RD
Source : https://maps.googleapis.com/maps/api/geocode/json?address=1000%20EL%20EMBARCADERO%20RD
Source : https://maps.googleapis.com/maps/api/geocode/json?address=1000%20EL%20EMBARCADERO%20RD
Source : https://maps.googleapis.com/maps/api/geocode/json?address=6500%20CERVANTES%20RD
```

Let's output our data to make sure the coordinates came out correctly:

```{r}
head(bikesStolen)
```

## Terminal Output

```
> head(bikesStolen)
     fromDate     toDate fromTime toTime  Brand   Model Speed  Color               Location       lon      lat
1: 2014-04-20 2014-04-20      900   1200  URBAN CRUISER     1  White  1000 CAMINO PESCADERO -119.8587 34.41047
2: 2011-03-08 2011-03-08     1630   1930   TREK  ALLANT    NA  Black 1000 EL EMBARCADERO RD -119.8559 34.41022
3: 2011-06-10 2011-06-20     1500   1000 PFFRAN  JULIUS     1  White 1000 EL EMBARCADERO RD -119.8559 34.41022
4: 2012-08-04 2012-08-05     2230    100 RETROS      58     1 Silver 1000 EL EMBARCADERO RD -119.8559 34.41022
5: 2012-08-04 2012-08-05     2230    100                    1  Black 1000 EL EMBARCADERO RD -119.8559 34.41022
6: 2011-10-22 2011-10-24       30   1230     K2 ZED 2.0    21   Grey      6500 CERVANTES RD -119.8538 34.41642
```
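Since `geocode` is slow and capped at roughly 2,500 requests per day, it is worth caching its results. This is a sketch of ours rather than part of the original project, and the `coords_cache.csv` file name is our invention:

```{r}
# Our addition: only hit the geocoding API if we have no cached coordinates
if (!file.exists("coords_cache.csv")) {
  coords <- geocode(bikesStolen$Location)
  write.csv(coords, "coords_cache.csv", row.names = FALSE)
} else {
  coords <- read.csv("coords_cache.csv")
}
```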
Now that we have the data table in the format we want, our next step is identifying the problems we encounter with this data set. As stated earlier, the most obvious problem is missing data, and the next section goes into detail about the types of missing data we might encounter.

# Outputting CSV File

We include this section because some people might want to skip the data wrangling and go straight into exploratory analysis. It is good practice to keep both the raw file and the cleaned, user-friendly file once you have processed it, so that the original is preserved in case we need to refer back to it. Therefore, we output a **CSV** with the newly converted columns and data types (writing to a *new* file name so the raw file is not overwritten):

```{r}
write.csv(bikesStolen, file = "bikesStolen_clean.csv")
```

At this point, if you would like to skip ahead to the exploratory analysis section, you're more than welcome to do so!

# Dealing with Missing Data

When working with this data set, we have to be mindful of the impact missing data has on our analysis. It would be a *grave* mistake to take the results of our exploratory analysis and draw conclusions without first understanding why the data is missing. Missing data is part of a statistician's life: outside the classroom, no one hands you a clean data set. Whether you gather your own data or use data at your job, you will encounter missing values. Understanding why data is missing leads to statistically sound conclusions, makes you more reliable when you are responsible for a company's data analysis, and helps you identify and correct the data-collection practices that produce missing information in the first place.

## Types of Missing Data

There are **3** types of missing data within statistical analysis. Here is a brief overview of each:

* **Not Missing at Random (NMAR)**: often referred to as *non-ignorable*; missing data that follows a clear pattern. This type plays a crucial role in statistical analysis, and we have to be especially careful of it.
* **Missing Completely at Random (MCAR)**: missing data with no relationship to anything else in the data set; the values are missing purely by chance. This is highly unlikely to encounter in the real world, but it is the ideal form of missing data: since no bias is present, removing the data or *imputing* it still yields sound statistical findings.
* **Missing at Random (MAR)**: this one can be confusing because of its name (it sounds very similar to **MCAR**), but **MAR** refers to missingness that is related to *other variables* within the data set. For example, suppose the reason stolen bicycle models went unreported is related to the location of the theft: it won't hold for every missing model value, but perhaps 90% of owners on Del Playa figured their bicycle model is too common in the area to ever be recovered, which led them to forgo reporting it. The missing model values would then depend on the location variable.

## Finding Statistical Significance of Missing Data

On the question of *how much missing data is too much*, a common rule of thumb is that if a feature has over **5% missing data**, we should not include it in our statistical analysis. So we calculate the share of missing data for the `Color` variable in our **bikesStolen** `data.table` as such:

```{r}
(length(bikesStolen$Color[bikesStolen$Color == "Not Given"]) /
   length(bikesStolen$Color)) * 100
```

## Terminal Output

```
[1] 13.72
```

About **14%** of the colors are missing, so we conclude that we would have to **delete** this feature from the `data.table` if we were to do any predictive statistical analysis (but we're not).
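The same check can be run over every column at once. The sketch below is our addition (not in the original write-up); it treats `NA`, empty strings, and the "Not Given" color label as missing:

```{r}
# Our addition: percent of missing values per column, applying the
# 5% rule of thumb across the whole data set in one pass
missing_pct <- sapply(bikesStolen, function(col) {
  col <- as.character(col)  # so Date and factor columns compare safely
  mean(is.na(col) | col %in% c("", "Not Given")) * 100
})
round(missing_pct, 2)
```

Any column printing above 5 here would be a poor candidate for predictive work under the rule of thumb above.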
# Exploratory Analysis

## Converting Date and Time Sections

For this section, we combine the `to` and `from` columns to better represent the time frames. We edit the hour columns into the `%H:%M` format, then combine the two sections and compute the mean and standard deviation of the resulting intervals.

This first block creates a vector with the appropriate number of 0's to pad the time columns (the same padding can also be done in one step with `sprintf("%04d", bikesStolen$fromTime)`):

```{r}
fromTime_zeros <- mapply(function(x, y) paste0(rep(x, y), collapse = ""),
                         0, 4 - nchar(bikesStolen$fromTime))

fromTime_temp <- paste0(fromTime_zeros, bikesStolen$fromTime)
fromTime_temp <- format(strptime(fromTime_temp, format = "%H%M"), format = "%H:%M")
fromTime_temp[fromTime_temp == '00:01'] <- '01:00'
fromTime_temp <- data.table(fromTime_temp)

fromTime_date <- format(strptime(fromTime_temp$fromTime_temp, format = "%H:%M"),
                        format = "%H:%M")
```

We output the result in the correct format, but as a character vector, since converting to a date-time at this point would attach the wrong date.

The same process applies to the `to` columns, after which we concatenate the dates and times:

```{r}
# TO TIME STUFF
toTime_zeros <- mapply(function(x, y) paste0(rep(x, y), collapse = ""),
                       0, 4 - nchar(bikesStolen$toTime))

toTime_temp <- paste0(toTime_zeros, bikesStolen$toTime)
toTime_temp <- format(strptime(toTime_temp, format = "%H%M"), format = "%H:%M")
toTime_temp[toTime_temp == '00:01'] <- '01:00'
toTime_temp <- data.table(toTime_temp)

# CONCATENATING THE DATE AND TIME VALUES
# FROM STUFF
fromStuff <- paste0(bikesStolen$fromDate, ' ', fromTime_date)
fromStuffNew <- strptime(fromStuff, format = '%Y-%m-%d %H:%M')

# TO STUFF
toStuff <- paste0(bikesStolen$toDate, ' ', toTime_temp$toTime_temp)
toStuffNew <- strptime(toStuff, format = '%Y-%m-%d %H:%M')
```

Here we combine the two new columns, drop incomplete cases, and convert the differences to days:

```{r}
combine_df <- data.table(fromStuffNew, toStuffNew)
new_combine_df <- combine_df[complete.cases(combine_df$toStuffNew), ]
new_combine_df_hrs <- (new_combine_df$toStuffNew - new_combine_df$fromStuffNew) / (3600*24)
new_combine_df_hrs <- data.table(new_combine_df_hrs)
```

## Mean of Time Difference

Here we calculate the mean for the new `data.table`:

```{r}
mean_time <- mean(new_combine_df_hrs$new_combine_df_hrs)
mean_time <- as.vector(mean_time)
sprintf('The average time interval for bicycle thefts is: %.4s days', mean_time)
```

## Terminal Output

```
[1] "The average time interval for bicycle thefts is: 3.16 days"
```

## Standard Deviation of Time Difference

```{r}
std_time <- sd(new_combine_df_hrs$new_combine_df_hrs)
sprintf('The standard deviation for time intervals of bicycle thefts is: %.5s', std_time)
```

## Terminal Output

```
[1] "The standard deviation for time intervals of bicycle thefts is: 12.96"
```

## Distribution of Time Difference

For this section we created an interactive visual showing the distribution of the time differences. There are some outliers, but generally people report the thefts to have occurred within a short time span:

```{r}
ggplot(new_combine_df_hrs, aes(new_combine_df_hrs)) +
  geom_histogram(bins = 773, fill = 'white', colour = 'turquoise4') +
  labs(x = 'Days', y = 'Count') +
  ggtitle('Distribution of Time Difference',
          subtitle = 'Difference in "to" and "from" time') +
  theme_minimal()
```

<iframe width="900" height="800" frameborder="0" scrolling="no" src="//plot.ly/~raviolli77/78.embed"></iframe>
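One caveat about the interval arithmetic above: subtracting two `strptime` results yields a `difftime` whose default units depend on the size of the gap, so dividing by `3600*24` only converts to days when those units happen to be seconds. A more explicit sketch (our addition, reusing `fromStuffNew` and `toStuffNew` from above):

```{r}
# Our addition: pin the difftime units explicitly so no implicit
# conversion can sneak in, then express the interval in days
interval_days <- as.numeric(difftime(toStuffNew, fromStuffNew,
                                     units = "secs")) / (60 * 60 * 24)
interval_days <- interval_days[!is.na(interval_days)]

mean(interval_days)  # average reporting window, in days
sd(interval_days)    # spread of the reporting window, in days
```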
## Visual Representation of Missing Values for Color

Next, we dissect the color column further to visually show the impact that missing values have on the total count of reported bicycles and their colors.

```{r}
colors <- data.table(table(bikesStolen$Color))
# WE MANUALLY ADD THE COLUMN NAMES BECAUSE THE DATA TABLE
# WILL OTHERWISE NAME THE COLUMNS V1 AND N
colnames(colors) <- c("Color", "Counts")

# NEXT WE RUN THE 'ordered' FUNCTION AGAIN TO MAKE SURE
# THE COLORS ARE OUTPUT IN THE ORDER WE WOULD LIKE!
colors$Color <- ordered(colors$Color, levels = colorLabels)
colors
```

## Terminal Output

```
> colors
              Color Counts
 1:           Black    176
 2:            Blue    126
 3:           White     98
 4:           Green     73
 5:             Red     68
 6:            Grey     58
 7:          Silver     41
 8:          Yellow     34
 9:          Purple     31
10:            Pink     24
11:          Orange     17
12:           Brown      9
13:     Light Green      9
14:      Light Blue      8
15: Burgundy/Maroon      8
16:       Turquoise      6
17:      Dark Green      5
18:             Tan      5
19:          Chrome      4
20:            Gold      4
21:    Diamond Blue      3
22:     Cream/Ivory      2
23:           Beige      2
24:       Not Given    129
              Color Counts
```

We can see from the terminal output that the count of bicycle colors not reported is second highest, behind only **black**. This suggests a critical failure in data entry with respect to this variable when bicycles are reported.

Here, we plot the colors using `ggplot2` to show how significant the amount of missing colors is among the reported stolen bicycles:

```{r}
bC <- ggplot(colors, aes(Color, Counts, fill = Color)) +
  geom_bar(stat = 'identity', colour = "turquoise2") +
  labs(title = "Color of Bikes Stolen within Isla Vista (2011-2016)") +
  theme(panel.background = element_rect(fill = "gray98"),
        axis.text.x = element_text(angle = 90, hjust = 1),
        legend.position = "none") +
  scale_fill_manual(values = c("#000000", "#1f23af", "#ffffff", "#03c11d",
                               "#dd1616", "#808080", "#C0C0C0", "#ffff00",
                               "#800080", "#FFC0CB", "#FF8C00", "#8B4513",
                               "#76EE00", "#ADD8E6", "#800000", "turquoise4",
                               "#006400", "#D2B48C", "#a8a8a8", "#D4AF37",
                               "#0EBFE9", "#FCFBE3", "#f5f5dc", "#2d423f"))
bC
```

<img src="https://i.imgur.com/D7BGLuz.png">

As we can see from the plot, for a large percentage of bicycles the color wasn't reported. This plays a significant role in our statistical analysis because we cannot make predictions or assumptions based on the given data: we *could* remove these values and claim that a **black** bicycle is the most likely to be stolen, but that would be a risky assumption. If we recovered the values of the non-reported colors, they could greatly shift the distribution we assumed from the given data. Due to the number of missing values, we cannot say which bicycle color leads to more thefts. This pitfall is all too common when handling real data, and we note it because you will experience it whenever your data doesn't come as a neat, pre-made data set.

## Creating a Plotting Function

Before we continue to the next segment, we create a function that captures the differences in bicycle colors across the years by producing one bar plot per year:

```{r}
byYearColor <- function(yr){
  temp <- bikesStolen[fromDate <= sprintf("%s-12-31", yr) &
                        fromDate >= sprintf("%s-01-01", yr)]
  ggplot(temp, aes(Color, fill = Color)) +
    geom_bar(stat = 'count', colour = "turquoise2") +
    labs(title = sprintf("Color of Bikes Stolen within Isla Vista (%s)", yr),
         x = 'Color', y = 'Count') +
    theme(panel.background = element_rect(fill = "gray98"),
          axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = "none") +
    scale_fill_manual(values = c("#000000", "#1f23af", "#ffffff", "#03c11d",
                                 "#dd1616", "#808080", "#C0C0C0", "#ffff00",
                                 "#800080", "#FFC0CB", "#FF8C00", "#8B4513",
                                 "#76EE00", "#ADD8E6", "#800000", "turquoise4",
                                 "#006400", "#D2B48C", "#a8a8a8", "#D4AF37",
                                 "#0EBFE9", "#FCFBE3", "#f5f5dc", "#2d423f"),
                      limits = levels(temp$Color))
}
```

Running this function returns a bar plot for the year you pass in (e.g., `byYearColor(2013)`). We then employed a for loop to save all the visuals and created a GIF to better show the results:

<img src='https://i.imgur.com/Gn410q3.gif'>

```{r}
years <- c(2011, 2012, 2013, 2014, 2015, 2016)

# IF YOU WISH TO SAVE THE PLOTS, RUN THIS FOR LOOP
for (i in years){
  myPlots <- byYearColor(i)
  ggsave(myPlots, filename = paste("images/bikesStolenColorYear", i, ".png", sep = ""))
}
```
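The write-up doesn't show the GIF step itself. One way to stitch the saved PNGs together is the `magick` package; the sketch below is our addition and assumes the file names produced by the loop above:

```{r}
# Our addition: combine the yearly bar charts into a single GIF
library(magick)

png_files <- sprintf("images/bikesStolenColorYear%s.png", years)
frames    <- image_read(png_files)
animation <- image_animate(image_join(frames), fps = 1)  # one frame per second
image_write(animation, "images/bikesStolenColorByYear.gif")
```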
Once we determine that the color of the reported stolen bicycles is the only variable with more than **5%** missing data, we move on to the data we *can* make statistical inferences on, starting with the location of the bicycle thefts. This form of statistical analysis is referred to as **Spatial Analysis**. We utilize `ggplot2` and `ggmap` to produce the following plots.

## Spatial Maps

For this next step, we include plots that visually show the frequency of stolen bicycles across the whole data set. We did this using the `ggmap` package, and we encourage people to research the process of creating maps from spatial data; it gives good insight into what a data set is showing visually. Below is the code for preparing the data set and creating the plots. (It can look daunting, but once you understand the parts, it becomes simple!)

```{r}
density <- ddply(bikesStolen, .(Location), "nrow")
colnames(density) <- c("Location", "count")
DF <- merge(bikesStolen, density, by = "Location")
```

Now that we have the newly transformed data frame, we can go about creating plots. We include a few variations and discuss which is best at conveying useful information.

```{r}
islaVistaMap <- qmap("Isla Vista", zoom = 15, maptype = "hybrid", source = "google")
circle_scale_amt = .4

islaVistaMap +
  geom_point(data = DF, mapping = aes(x = lon, y = lat, colour = Location),
             na.rm = TRUE, alpha = 1/40, size = DF$count * circle_scale_amt) +
  theme(legend.position = "none") +
  scale_size_continuous(range = range(DF$count)) +
  ggtitle("Frequency of Bikes Stolen in Isla Vista",
          subtitle = "Encompassing 2010-2016")
```

<img src = "https://i.imgur.com/2ejqw4C.png">

Unfortunately, there are limitations to how we can create these plots, and interactivity is one of them. We are not able to create interactive plots like we have before, but we see these plots as a starting point for how we want to visualize our data.

Next, we plot it again, but with a single color for every point. We use **red**:

```{r}
islaVistaMap +
  geom_point(data = DF, mapping = aes(x = lon, y = lat),
             na.rm = TRUE, alpha = 1/40, colour = 'red',
             size = DF$count * circle_scale_amt) +
  theme(legend.position = "none") +
  scale_size_continuous(range = range(DF$count)) +
  labs(title = "Frequency of Bikes Stolen in Isla Vista",
       subtitle = "Encompassing 2011-2016")
```

<img src = "https://i.imgur.com/1C2Ra8u.png">

This plot is a step up, but it again shows the limitations of **Spatial Data** plots. From the overall reports, stolen bicycles are either more likely to be stolen on **Embarcadero Del Norte** or more likely to be *reported* on this street. There are also more stolen bicycles on **6600 Del Playa** and **6600 Sueno Road**! It is important to repeat, however, that a fair number of locations are missing, and these play a large role in our analysis. It would also be worthwhile to create these plots for every year, which we exclude from our analysis to keep it easier to grasp.
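As an aside on the interactivity limitation: the `leaflet` package can render the same points on a pannable, zoomable map. This is our addition, not part of the original analysis, and it simply reuses the `DF` table and `circle_scale_amt` defined above:

```{r}
# Our addition: an interactive alternative to the static ggmap plots
library(leaflet)

leaflet(DF) %>%
  addTiles() %>%  # default OpenStreetMap basemap
  addCircleMarkers(lng = ~lon, lat = ~lat,
                   radius = ~count * circle_scale_amt,
                   color = "red", stroke = FALSE, fillOpacity = 0.3)
```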
## Density Plots

Next we include **density plots**. This plot covers the entire data set, so it will look very similar to the two previous plots, but it has more statistical merit than using `geom_point`. For these plots, we use the `stat_density2d` function available in `ggplot2`:

```{r}
islaVistaMap +
  stat_density2d(aes(x = lon, y = lat, fill = ..level..),
                 alpha = .15, size = 2, bins = 4, data = DF,
                 geom = "polygon", na.rm = TRUE) +
  labs(title = "Frequency of Bikes Stolen in Isla Vista",
       subtitle = "Encompassing 2010-2016") +
  theme(legend.position = "none")
```

<img src = "https://i.imgur.com/DqS0nlO.png">

### Spatial Plot for Each Year

For the following plots, presented in **GIF** form, we follow the same process but split the **Data Table** by year. We include the values for **2010** within the **2011 Data Table**, because 2010 doesn't include many reported bicycles. We show the *format* here, but omit repeating the process for every year, since it would take up a lot of space within this project. Simply pass the respective year to the function (e.g., `byYear(2015)` plots the year **2015**):

```{r}
byYear <- function(yr){
  temp <- DF[fromDate <= sprintf("%s-12-31", yr) &
               fromDate >= sprintf("%s-01-01", yr)]
  islaVistaMap +
    stat_density2d(aes(x = lon, y = lat, fill = ..level.., alpha = ..level..),
                   size = 2, bins = 4, data = temp,
                   geom = "polygon", na.rm = TRUE) +
    labs(title = "Frequency of Bikes Stolen in Isla Vista",
         subtitle = sprintf("Encompassing the year %s", yr)) +
    theme(legend.position = "none")
}
```

We call this function six times and then create a GIF of the six plots. You can find the entire code in the **GitHub** repo, and if you have any suggestions for automating this, please let us know!

<img src="https://i.imgur.com/yuG2LZt.gif">

This GIF gives us a better understanding of the location and amount of theft per year, though before making assumptions from these plots, keep in mind the missing-data issues discussed earlier. To save all the pictures easily, we run the function in a for loop with `ggsave`, just as we did for the color bar charts:

```{r}
for (i in years){
  myPlots <- byYear(i)
  ggsave(myPlots, filename = paste("images/bikesStolenYear", i, ".png", sep = ""))
}
```

# Conclusions

The biggest takeaway is that we cannot make any statistical inferences with respect to the color of the bicycles, due to the amount of missing values. But we can see some interesting trends in the locations of thefts through the years in Isla Vista.

# Works Cited

## Citations for Packages Used in Project

Citations created using the `citation()` function in **R**:

+ D. Kahle and H. Wickham. **ggmap: Spatial Visualization with ggplot2**. The R Journal, 5(1), 144-161. http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf.
+ H. Wickham. **ggplot2: Elegant Graphics for Data Analysis**. Springer-Verlag New York, 2009.
+ H. Wickham (2011). **plyr: The Split-Apply-Combine Strategy for Data Analysis**. Journal of Statistical Software, 40(1), 1-29. http://www.jstatsoft.org/v40/i01/.
+ H. Wickham and W. Chang (2016). **devtools: Tools to Make Developing R Packages Easier**. R package version 1.12.0. https://CRAN.R-project.org/package=devtools.
+ M. Dowle, A. Srinivasan, T. Short, and S. Lianoglou, with contributions from R. Saporta and E. Antonyan (2015). **data.table: Extension of Data.frame**. R package version 1.9.6. https://CRAN.R-project.org/package=data.table.
+ R. Brownrigg (R version, 2016). Original S code by Richard A. Becker and Allan R. Wilks. **mapdata: Extra Map Databases**. R package version 2.2-6.
https://CRAN.R-project.org/package=mapdata
