DATA WAREHOUSING (PART 2)

BECOMING ACQUAINTED WITH BASIC SQL COMMANDS
0

SQL

EASY

last hacked on Aug 30, 2017

# Becoming Acquainted with Basic SQL Commands # Data Definition Language **DDL** statements are used to create and modify tables within your database. An example and most used is the `CREATE TABLE` method which as its name states will create a table. For this example I will use the famous [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) to create the table called `iris_table`. # Creating Database For this example we will be creatig a test database to show you the workflow of creating and inputting data into a **MySQL** database. Database Tables: + iris_table - Contains the famous *Fisher's* iris dataset along with some added features + phys_features - Contains some physical features relating to the two types of *angiosperms* + sci_names - Contains a sample of 10 *angiosperms* with scientific groupings and duration ## Database Creation SHOW DATABASES; You should see all databases (or none if you haven't created any yet!) Let's go ahead and create the database for this example. We will call it `iris_database` CREATE DATABASE IF NOT EXISTS iris_database; Now we're going to go ahead and enter/use the database we just created. USE iris_database; Next we'll take a look the data set we want to use and some important **SQL** syntax ## Entity Relationship Models Here we created a model diagram for our database where we have three relational tables about the iris data set. <img src='https://raw.githubusercontent.com/raviolli77/SQL_Workshop/master/data/iris/databaseDesign.png'> ## Preview of our data set | Sepal Length | Sepal Width | Petal Length | Petal Width | Class | Symbol | Angiosperms | |-----|-----|-----|-----|-------------|------|------| | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | IRSE | Dicot | | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | IRSE | Dicot | | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | IRSE | Dicot | | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | IRSE | Dicot | ### Create Table CREATE TABLE IF NOT EXISTS iris_table( id INT NOT NULL AUTO_INCREMENT, sepal_length FLOAT, sepal_width FLOAT, petal_length FLOAT, petal_width FLOAT, class TEXT, symbol VARCHAR(6) NOT NULL, angiosperms VARCHAR(7) NOT NULL, PRIMARY KEY(id) ); ### Loading Data In this example we will be using a *csv* file to load in our data. Although the import wizard is a nice feature within the *MySQL Workbench*, we want to be able import our database using **SQL** context. For this first section we won't do it the *right* way just to have you work through certain functionalities of **SQL**. We will then show the correct way after we struggle through this methodology. LOAD DATA LOCAL INFILE '~/myProjects/approp/wd/iris.csv' INTO TABLE iris_table FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n' IGNORE 1 LINES (sepal_length, sepal_width, petal_length, petal_width, class, symbol, angiosperms); **IMPORTANT TO NOTE**: the section `\r\n` works for *Windows*, the appropriate method for *Macs* is `\r` For this demonstration we have 3 tables but are only showing the creation of one to save time. ### Primary Keys For this dataset, there were no original *primary keys*, but for the sake of this example we created an index from 1 to 149 (the dataset has 150 observations but we omitted the last observation for demonstration purposes) for each instance of the iris flower. If you recall when we create our table we utilized `AUTO_INCREMENT` this creates a primary key that does increments of 1 starting with 1, useful when a relational table doesn't have a primary key already set. So now we'll preview how our data set looks like (We'll show the *SQL* query on how to get this in the next section). ### Preview of Dataset with Pk | id | Sepal Length | Sepal Width | Petal Length | Petal Width | Class | Symbol | Angiosperms | |-----|-----|-----|-----|-----|-------------|------|------| | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | IRSE | Dicot | | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | IRSE | Dicot | | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | IRSE | Dicot | | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | IRSE | Dicot | ## Adding Foreign Key Here we will add foreign keys to the `iris_table`. ALTER TABLE iris_table ADD CONSTRAINT fk_sci_names FOREIGN KEY (symbol) REFERENCES sci_names(symbol); Recall `symbol` is a *primary key* for the table `sci_names` table. ALTER TABLE iris_table ADD CONSTRAINT fk_phys_features FOREIGN KEY (angiosperms) REFERENCES phys_features(angiosperms); Recall `angiosperms` is a *primary key* for the table `phys_features` table. ## Most effective way Here is the way you do it if you know all foreign keys and primary keys for the database: CREATE TABLE IF NOT EXISTS iris_table( id INT NOT NULL AUTO_INCREMENT, sepal_length FLOAT, sepal_width FLOAT, petal_length FLOAT, petal_width FLOAT, class TEXT, symbol VARCHAR(6) NOT NULL, angiosperms VARCHAR(7) NOT NULL, FOREIGN KEY (symbol) REFERENCES sci_names, FOREIGN KEY (angiosperms) REFERENCES phys_features, PRIMARY KEY (id) ); # Data Manipulation Language For this example say you received a new input that you wanted to add to your table: ### Data needed to be inputted Important to note that with the `AUTO_INCREMENT` feature we **do not** have to specify the `id` column when inserting records | Sepal Length | Sepal Width | Petal Length | Petal Width | Class | Symbol | Angiosperms | |-----|-----|-----|-----|-------------|------|-----| | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica | IRVI | Dicot | INSERT INTO iris_table (sepal_length, sepal_width, petal_length, petal_width, class, symbol, angiosperms) VALUES (5.9, 3.0, 5.1, 1.8, 'Iris-virginica', 'IRVI', 'Dicot'); Now that we have all our data ready we are ready to start making queries that are more data analysis related! # Data Selection This is the first and most common query you should be acquainted: SELECT * FROM iris_table; We are essentially telling **SQL** to choose everything (with use of the *) from the table `iris_table`. This will return the entire 150 rows and 7 columns into your workbench. The beauty and simplicity of *SQL* as you will see that the queries read much like plain english, so you can deduce what is being done by readin the query at hand, although queries do get more complicated when starting off the learning curve is not too steep. Often we will be utilizing tables with enormous amounts of data so we can limit the output (similar to the `head` in **R** and **Python**): SELECT * FROM iris_table LIMIT 5; ### Output | id | sepal_length | sepal_width | petal_length | petal_width | class | symbol | angiosperms | |---|-----|-----|-----|-----|-------------|------|-------| | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | IRSE | Dicot | | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | IRSE | Dicot | | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | IRSE | Dicot | | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | IRSE | Dicot | | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | IRSE | Dicot | ## Distinct The data in `sci_names` is a very small subset of the actual domain of plant classification, but say we had a much larger dataset that included many more groupings as defined by [Natural Resource Conservation Service](https://plants.usda.gov/java/). A useful way of finding all available groups would be using the `DISTINCT` functionality as such: SELECT DISTINCT angiosperms FROM sci_names; ### Output | angiosperms | |-------| | Monocot | | Dicot | ## Count One of the most fundamental features of **SQL** is the `COUNT` function. This function will output the number of rows in the query, but when used along side the `GROUP BY` function, this will help create useful groupings that help give insight to our data. SELECT class, COUNT(*) FROM iris_table GROUP BY class; ### Output | class | COUNT(*) | |-------|----------| | Iris-setosa | 50 | | Iris-versicolor | 50 | | Iris-virginica | 50 | It goes without saying more complicated and useful queries could be made with the `COUNT` function, but for the sake of this example we will leave it as such. I do recommend using another database to see variable interactions using the `COUNT` function on more than one column. ## Having Normally I would go about showing the `WHERE` feature but the previous query is a good transition for the `HAVING` functionality SELECT angiosperms, COUNT(*) FROM sci_names GROUP BY group HAVING COUNT(*) > 4; | angiosperms | COUNT(*) | |---------|---| | Monocot | 5 | ## Where The `WHERE` function is another important and crucial function needed to do selectional conditioning in **SQL**. In this query we are searching for all iris flowers that have a sepal length greater than 3 (Should be 67 flowers). We didn't include the output because of space but with the function you can start doing more interesting queries to gain important insight into your data. SELECT * FROM iris_table WHERE sepal_width > 3.0; ## And/Or + `AND` - this will display all instances when both conditions on seperated by the `AND` are true + `OR` - this will display all instances when at least one of the conditions seperated by the `OR` are true Building on the last example we can add another condition to further gain more insight to our data. SELECT * FROM iris_table WHERE sepal_width > 3.0 AND petal_length < 4.0; Here you should get a query of size 42. These examples should show that this is where the money is when it comes to doing analysis in **SQL**. When we do these conditional selections we are looking into our data from a certain dimension, thus we can use the `WHERE` function to exact information from a large database and use this for our analysis using **R** or **Python** (more on this later). Important to note: the `NOT` function is also an important tool which helps negate a conditional selection. You can try this functionality for yourself since we will not be including a query in this tutorial. ## Order By This function is useful because it can help with organization of ones queries. A simple example is as follows, say we want to view the `sci_names` table by alphabetic order according to the scienctific name. The following query will do so: SELECT * FROM sci_names ORDER BY class ASC; ### Output | symbol | class | angiospems| duration | |------|----------------|------------|-----------| | ABER | Abutilon-eremitopetalum | Dicot | Perennial | | ACVI6 | Acrocomia-vinifera | Monocot | NotGiven | | CASQ2 | Carex-squarrosa | Monocot | Perennial | | ERIOP2 | Eriophyllum-lanatum | Dicot | NotGiven | | IRSE | Iris-setosa | Monocot | Perennial | | IRVE2 | Iris-versicolor | Monocot | Perennial | | IRVI | Iris-virginica | Monocot | Perennial | | SUVI | Suksdorfia-violacea | Dicot | Perennial | | TORA2 | Toxicodendron-radicans | Dicot | Perennial | # Joins The bread and butter of **SQL** is `JOINS` SELECT iris_table.sepal_length, iris_table.sepal_width, sci_names.duration, phys_features.num_of_pores FROM iris_table INNER JOIN sci_names ON iris_table.symbol = sci_names.symbol INNER JOIN phys_features ON iris_table.angiosperms = phys_features.angiosperms; ### Output | sepal_length | sepal_width | duration | num_of_pores | |-----|-----|-----------|-----| | 5.1 | 3.5 | Perennial | 3 | | 4.9 | 3 | Perennial | 3 | | 4.7 | 3.2 | Perennial | 3 | | 4.6 | 3.1 | Perennial | 3 | | 5 | 3.6 | Perennial | 3 | ## Left/Right Join Say we were interested in seeing which flowers were not included in our `iris_table` on a larger scale, just looking at the tables would not suffice so we would would do what is called a *Left Join*. *Left* and *Right Joins* help create queries where we are trying to find data that is not inclusive to either both or more tables. This is best seen through example: SELECT sci_names.* FROM sci_names LEFT JOIN iris_table ON sci_names.symbol= iris_table.symbol WHERE iris_table.symbol IS NULL; ### Output | symbol | class | angiosperms | duration | |------|-----------------------------|-----------|-----------| | ABER | Abutilon-eremitopetalum | Dicot | Perennial | | ACVI6 | Acrocomia-vinifera | Monocot | NotGiven | | CASQ2 | Carex-squarrosa | Monocot | Perennial | | ERIOP2 | Eriophyllum-lanatum | Dicot | NotGiven | | SUVI | Suksdorfia-violacea | Dicot | Perennial | | TORA2 | Toxicodendron-radicans | Dicot | Perennial | Notice how none of our *iris* flowers are present in the query! # Alias ## Variable Alias SELECT iris_table.sepal_length AS sp, ir.petal_length AS pl FROM iris_table LIMIT 5; ### Output | sp | pl | |----|----| | 5.1 | 1.4 | | 4.9 | 1.4 | | 4.7 | 1.3 | | 4.6 | 1.5 | | 5 | 1.4 | ## Table Alias SELECT ir.petal_length AS pl, ir.petal_width AS pw, sn.duration AS dur, pf.num_of_pores AS pores, pf.num_of_cotyledons AS coty FROM iris_table ir INNER JOIN sci_names sn ON ir.symbol = sn.symbol INNER JOIN phys_features pf ON ir.angiosperms = pf.angiosperms; ### Output | pl | pw | dur | pores | coty | |-------|-------|-----------|---|---| | 1.4 | 0.2 | Perennial | 3 | 2 | | 1.4 | 0.2 | Perennial | 3 | 2 | | 1.3 | 0.2 | Perennial | 3 | 2 | | 1.5 | 0.2 | Perennial | 3 | 2 | | 1.4 | 0.2 | Perennial | 3 | 2 | ## Like We use the `LIKE` functionality to search for patterns in strings. So say for example we were trying to find the scientific name of a flower but we couldn't remember the actual name (I mean who would be able to), but we do know it starts with *i* so using the `LIKE` functionality helps us narrow our search. SELECT * FROM sci_names WHERE class LIKE 'i%'; ### Output | symbol | class | angiospems| duration | |------|----------------|------------|-----------| | IRSE | Iris-setosa | Monocot | Perennial | | IRVE2 | Iris-versicolor | Monocot | Perennial | | IRVI | Iris-virginica | Monocot | Perennial | Another pattern searching I use quite often at work is `%pattern%` which finds all instances of this pattern. So in this case say we were looking for how many species of iris there were present in significantly larger database we would use the following syntax: SELECT * FROM sci_names WHERE class LIKE '%iris%'; You would get the same result as the last query, but on a larger scale you would narrow your search down that much more! # Conclusions That was a quick run through, again this isn't an exhaustive overview, but it does help give context as to basic and integral functionalities of **SQL**. If you would like to contribute/proofread my shit feel free to let me know and I can add you as a contributor. ## Click [Here](https://www.inertia7.com/projects/70) for the **SQL** workshop Guidelines

COMMENTS







keep exploring!

back to all projects