last hacked on Jul 22, 2017

Notes on NLP using the book written by Steven Bird, Ewan Klein, and Edward Loper.
# Natural Language Processing With Python

First things first: make sure the package `nltk` is installed in your **Python** environment. You can check by running

    $ python

or `python3`, whichever command launches **Python 3** on your machine.

### Terminal Output (varies by distribution; mine is built on Anaconda)

    Python 3.5.2 |Anaconda custom (64-bit)| (default, Jul  2 2016, 17:53:06)
    [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>>

Once you are greeted with this, run the following at the prompt:

    >>> import nltk

If there is no error message you are good to go. If there is one, do `quit()` to exit the Python instance, then pip install it from the terminal as follows:

    pip3 install nltk

Next we're going to download the books available in the `nltk` package. Do so by typing these at the prompt:

    >>> import nltk
    >>> nltk.download()

Once you run this a window will pop up; download the books available, then load them with:

    >>> from nltk.book import *

You should be greeted with the following in the terminal.

### Terminal Output

    *** Introductory Examples for the NLTK Book ***
    Loading text1, ..., text9 and sent1, ..., sent9
    Type the name of the text or sentence to view it.
    Type: 'texts()' or 'sents()' to list the materials.
    text1: Moby Dick by Herman Melville 1851
    text2: Sense and Sensibility by Jane Austen 1811
    text3: The Book of Genesis
    text4: Inaugural Address Corpus
    text5: Chat Corpus
    text6: Monty Python and the Holy Grail
    text7: Wall Street Journal
    text8: Personals Corpus
    text9: The Man Who Was Thursday by G . K . Chesterton 1908

For this follow-through of the book I will be using **The Book of Genesis** (`text3`) for all examples.

First we will grab the length of the text from start to finish, with all words and symbols present. We use the `len` function to do so.

    >>> len(text3)

### Terminal Output

    44764

Therefore we see that *Genesis* has 44,764 **tokens**.
**Token** is the term for any sequence of characters that we want to group together. The important thing to note is that this count includes repeated words, so if we wanted to ask *how many distinct words does this text contain* we would have to structure our script differently. We do so by using the function `set()`. This will give us all unique **tokens**.

    >>> sorted(set(text3))

Running this will give you a huge list of all unique **tokens**, starting with punctuation *characters*, then capitalized words, and finally lowercase words (each group in alphabetical order as well). Now if we wanted the count of this set we would do something similar to what we did earlier.

    >>> len(set(text3))

### Terminal Output

    2789

I skipped a large chunk of the book because it went over the basics of lists, indexing, etc., which I didn't really need to review (but if you do, I highly encourage you to look over those sections).

# Simple Statistics

## Frequency Distribution

Some basic statistics we would like to look at when doing **NLP** include which words appear the most in our body of text. In order to do this we will be using the `FreqDist` function.

    >>> fdist1 = FreqDist(text3)

So if you run `fdist1` your output will be as follows:

### Terminal Output

    FreqDist({',': 3681, 'and': 2428, 'the': 2411, 'of': 1358, '.': 1315, 'And': 1250, 'his': 651, 'he': 648, 'to': 611, ';': 605, ...})

This is important to note because you see that *and*, *the*, and *of* are some of the most common words. These usually don't tell us much and are commonly referred to as **stop words**; in other analyses I've seen relating to **NLP**, **stop words** are usually removed as part of pre-processing. In order to receive only the words and not the counts we will use the `.keys()` function.
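As an aside, if you want to play with these counting ideas without downloading a corpus, Python's built-in `collections.Counter` behaves much like `FreqDist` for simple counts. Here is a minimal sketch on a made-up token list (the sentence is my own, not from the book):

```python
from collections import Counter

# A tiny hand-made token list standing in for something like text3
tokens = ['the', 'dog', 'saw', 'the', 'cat', 'and', 'the', 'cat', 'ran', '.']

fdist = Counter(tokens)          # counts every token, like FreqDist
print(len(tokens))               # total tokens, repeats included -> 10
print(len(set(tokens)))          # distinct tokens -> 7
print(fdist.most_common(2))      # [('the', 3), ('cat', 2)]
```

In NLTK 3, `FreqDist` is actually built on top of `Counter`, which is why the two interfaces feel so similar (`most_common`, indexing by token, and so on).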
**IMPORTANT TO NOTE**: Since we are using **Python 3.x** we need to explicitly create a list (`.keys()` now returns a view object rather than the plain list it was in **Python 2.7**), so you'll get an error if you execute it the way the book has it.

    >>> vocabulary1 = list(fdist1.keys())

Now we can pull 10 words from the vocabulary of the **Book of Genesis** by indexing as such. (Note that these are *not* the 10 most frequent words: in **Python 3** the keys come back in no useful order. Use `fdist1.most_common(10)` if you want the top 10.)

    >>> vocabulary1[:10]

### Terminal Output

    ['handmaid', 'Beware', 'washed', 'breed', 'clo', 'beside', 'who', 'women', 'years', 'grisled']

## Finding Long Words

Say now we wanted to find words that are more than 10 characters long. We will use a shorthand notation that goes through the text and searches for words of that length.

**NOTE**: The book's exercise used 15 characters, but *Genesis* didn't have any, so it returned an empty list; therefore I changed the character length.

    >>> G = set(text3)

Now let's loop over the set and find the words:

    >>> long_words = [w for w in G if len(w) > 10]

Let's sort the results and see which words are longer than 10 characters:

    >>> sorted(long_words)

### Terminal Output

    ['Abelmizraim', 'Allonbachuth', 'Beerlahairoi', 'Canaanitish', 'Chedorlaomer',
    'EleloheIsrael', 'Girgashites', 'Hazarmaveth', 'Hazezontamar', 'Ishmeelites',
    'Jegarsahadutha', 'Jehovahjireh', 'Kirjatharba', 'Melchizedek', 'Mesopotamia',
    'Peradventure', 'Philistines', 'Zaphnathpaaneah', 'abomination', 'acknowledged',
    'buryingplace', 'circumcised', 'commandment', 'commandments', 'confederate',
    'continually', 'countenance', 'deceitfully', 'deliverance', 'established',
    'everlasting', 'exceedingly', 'generations', 'habitations', 'handmaidens',
    'imagination', 'inhabitants', 'inheritance', 'instruments', 'interpretation',
    'interpretations', 'interpreted', 'interpreter', 'leanfleshed', 'maidservants',
    'menservants', 'merchantmen', 'multiplying', 'peradventure', 'plenteousness',
    'possessions', 'progenitors', 'righteousness', 'ringstraked', 'seventeenth',
    'sheepshearers', 'shoelatchet', 'storehouses', 'strengthened', 'threshingfloor',
    'uncircumcised', 'womenservan', 'womenservants', 'yesternight']

So it looks like a lot of old-style names are longer than 10 characters. Go figure!

**Note**: The longer way of writing this, which is good practice for me, is as follows:

    >>> long_words = []
    >>> for w in G:
    ...     if len(w) > 10:
    ...         long_words.append(w)
    ...

This will output the same result once you use the `sorted()` function on `long_words`.

## Hapaxes

As we have seen in the returned list, many of the words are names that might only appear within the context of a specific story in *Genesis*. Words that appear only once in a corpus or an author's body of work are referred to as **hapaxes**. So now let's flip that around and find words that are longer than 7 characters *and* also have a higher frequency. The book uses a frequency of 7, but let's try 5!

    >>> sorted([w for w in G if len(w) > 7 and fdist1[w] > 5])

Pretty straightforward: we're using an inclusive *and* to add a condition to our previous loop, keeping only the tokens whose counts in `fdist1` exceed 5, and sorting right off the bat. The result is as follows:

### Terminal Output

    ['Abimelech', 'Aholibamah', 'Almighty', 'Arphaxad', 'Bashemath', 'Beersheba',
    'Benjamin', 'Canaanites', 'Egyptian', 'Egyptians', 'Gomorrah', 'Machpelah',
    'Manasseh', 'Padanaram', 'Peradventure', 'Philistines', 'Therefore', 'Wherefore',
    'according', 'answered', 'appeared', 'birthright', 'blessing', 'brethren',
    'buryingplace', 'children', 'circumcised', 'commanded', 'conceived', 'concerning',
    'covenant', 'creature', 'creepeth', 'creeping', 'daughter', 'daughters',
    'departed', 'establish', 'everlasting', 'exceedingly', 'families', 'favoured',
    'firmament', 'firstborn', 'fruitful', 'gathered', 'generations', 'grievous',
    'handmaid', 'hearkened', 'journeyed', 'mourning', 'multiply', 'multitude',
    'offering', 'peradventure', 'possession', 'presence', 'prevailed', 'returned',
    'righteous', 'ringstraked', 'servants', 'speckled', 'stranger', 'substance',
    'themselves', 'therefore', 'together', 'wherefore', 'wilderness', 'youngest']

Again, for my own practice, here's the long way of doing this process:

    >>> myList = []
    >>> for w in G:
    ...     if len(w) > 7 and fdist1[w] > 5:
    ...         myList.append(w)
    ...

Afterwards you would run `sorted(myList)` and receive the same results.

## Collocations and Bigrams

**Collocations** are series of words that appear together unusually often. *Dank memes* is a *collocation* whereas *nice memes* is not. An important trait is that substituting one word with something similar breaks the collocation: *chilly memes* doesn't work.

We'll start off with lists of word pairs known as **bigrams**. We can show this easily by running the `bigrams()` function as follows:

    >>> from nltk import bigrams
    >>> list(bigrams(['Somebody', 'once', 'told', 'me']))

Again, since we're on **Python 3.x** you must convert the *bigram* generator into a list to see the desired results.

### Terminal Output

    [('Somebody', 'once'), ('once', 'told'), ('told', 'me')]

Now if we ran this function on *Genesis* it would spit out a huge list of tuples (trust me, I tried it). Fortunately for us there is a `.collocations()` function that will give us the most common *bigrams* in our body of text.

    >>> text3.collocations()

### Terminal Output

    said unto; pray thee; thou shalt; thou hast; thy seed; years old;
    spake unto; thou art; LORD God; every living; God hath; begat sons;
    seven years; shalt thou; little ones; living creature; creeping thing;
    savoury meat; thirty years; every beast

As you can see we find some collocations that are reasonable considering it's *Genesis*, like `LORD God` and `thou shalt`, but we also find some oddly fascinating ones like `savoury meat`, which leaves a lot to the imagination.

Here they provided a table of useful functions and descriptions.
I thought it would be better for my understanding to show the actual output as opposed to the description.

| Example | Output | Description |
|---------|--------|-------------|
| `fdist1['LORD']` | `166` | count of times the word appears in the text |
| `fdist1.freq('LORD')` | `0.003708254216463755` | frequency (proportion) of the given word |
| `fdist1.N()` | `44764` | total number of tokens |
| `fdist1.most_common(n)` (where n = the number you want) | with n = 5: `[(',', 3681), ('and', 2428), ('the', 2411), ('of', 1358), ('.', 1315)]` | the n most common tokens |
| `fdist1.max()` | `','` | the most common token |
| `fdist1.tabulate()` | (prints a text table of tokens and counts) | tabulated frequency distribution |

The rest are messy, so they will not be included in this tutorial.

# Conditionals and Making Decisions

Here we'll show some examples of conditional methods using **relational operators**. I won't go into detail explaining them, just show examples.

| Operators | Relationships |
|-----------|---------------|
| < | less than |
| <= | less than or equal to |
| == | equal to |
| != | not equal to |
| > | greater than |
| >= | greater than or equal to |

Here we'll apply different operations to the same sentence (which was taken from the *Wall Street Journal*):

    >>> sent7 = ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']

Now run the following commands, which differ only in the operator used. I'm going to put each command and its output together to save space.
### Less than ( < )

    >>> [w for w in sent7 if len(w) < 4]
    [',', '61', 'old', ',', 'the', 'as', 'a', '29', '.']

### Less than or equal to ( <= )

    >>> [w for w in sent7 if len(w) <= 4]
    [',', '61', 'old', ',', 'will', 'join', 'the', 'as', 'a', 'Nov.', '29', '.']

### Equal to ( == )

    >>> [w for w in sent7 if len(w) == 4]
    ['will', 'join', 'Nov.']

### Not equal to ( != )

    >>> [w for w in sent7 if len(w) != 4]
    ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', '29', '.']

So remember this pattern: `[w for w in text if condition]`! It will be an important tool for you when doing **NLP**.

# Word Comparison Operators

There are other operators that will be important in your analysis, which we will cover here! We won't go over every single one, but we'll include examples as before.

I started with the `.endswith()` function and chose 'ou'; since I know *Genesis* uses *thou* a lot, I was curious what other words ended with 'ou'.

    >>> sorted(w for w in set(text3) if w.endswith('ou'))
    ['Thou', 'bou', 'fou', 'hou', 'mou', 'ou', 'sou', 'thou', 'you']

Interesting; I didn't know some of these were words (several look more like truncated tokens than real words).

The next operator is similar to the **SQL** syntax I am used to, so I will show a query to help bring this home for me.

    >>> sorted(term for term in set(text3) if 'sin' in term)
    ['blessing', 'blessings', 'business', 'purposing', 'sin', 'since', 'sinew', 'sinners', 'sinning']

**NOTE FOR SELF**: This is very similar to:

    SELECT term
    FROM text3
    WHERE term LIKE '%sin%'
    ORDER BY term ASC;

The next function picks up words that start with a capital letter (titlecased words).

    >>> sorted(item for item in set(text3) if item.istitle())
    ['A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech', 'Abr', 'Abrah', 'Abraham', 'Abram', 'Accad', 'Achbor', 'Adah', 'Adam', 'Adbeel', ...]
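These comparison patterns work on any list of strings, so here is a minimal, self-contained sketch you can run without downloading the corpus (the token list is my own invention, not from the book):

```python
# A tiny hand-made token list (my own example, not from the book)
words = ['Thou', 'shalt', 'not', 'covet', ',', 'thou', 'knowest', '.']

# The [w for w in text if condition] pattern with word comparison methods
print([w for w in words if w.endswith('ou')])  # ['Thou', 'thou']
print([w for w in words if 'ow' in w])         # ['knowest']
print([w for w in words if w.istitle()])       # ['Thou']

# Inside a function call the brackets can be dropped (generator expression)
print(sorted(w for w in set(words) if w.islower()))
# ['covet', 'knowest', 'not', 'shalt', 'thou']
```

Note that punctuation tokens like ',' and '.' fail both `istitle()` and `islower()`, since they contain no cased characters.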
**IMPORTANT TO NOTE**: In these calls we pass a bare *generator expression* to `sorted()`, which is why there are no square brackets. On its own, `w for w in ...` without brackets is a syntax error, so outside a function call you would wrap it in `[]` to build a list.

More examples. This one checks for tokens in all uppercase letters:

    >>> [item for item in set(text3) if item.isupper()]
    ['O', 'LORD', 'G', 'A', 'LO', 'I']

Likewise there's the `.islower()` function, but I'm not going to run it since the output would be huge. You get the picture.

# Operating on Every Element

Here we include some examples of tools useful in pre-processing for **NLP**. This one is especially important because people commonly lowercase all words before fitting models!

    >>> [w.lower() for w in text3]

Not going to show the output, but it's a really important bit of functionality. The next expression gives the length of each word in our text:

    >>> [len(w) for w in text3]

Notice how these have the form `[f(w) for ...]` or `[w.f() for ...]`. This is called *list comprehension*, an important idiom when becoming more *pythonic*.

With what we know now, let's go back to the earlier question of the length of the text. We'll try different iterations of what we learned.

This command counts all tokens in *Genesis*, including repeats and punctuation:

    >>> len(text3)
    44764

Recall that this next one eliminates exact repeats, but still counts pairs like 'This' and 'this' as two different tokens because they differ in case:

    >>> len(set(text3))
    2789

Next, to be a little more accurate, we go through the entire text and convert all letters to lowercase before building the set. Note that this doesn't eliminate the punctuation tokens: `.lower()` only affects letters, so tokens like ',' pass through unchanged and stay in the set.

    >>> len(set(word.lower() for word in text3))
    2628

This final step filters out the non-alphabetic items so we get the true count of unique words. Finally!
    >>> len(set(word.lower() for word in text3 if word.isalpha()))
    2615

## Nested Code Blocks

Basic nested code block examples. I don't think I need to go into too much detail.

    >>> word = 'shrek'
    >>> if len(word) < 7:
    ...     print('word length is less than 7')
    ...
    word length is less than 7

Simple enough; now let's look at *for loops*.

    >>> for word in ['Call', 'me', 'meme', 'lord', '.']:
    ...     print(word)
    ...
    Call
    me
    meme
    lord
    .

## Looping with Conditions

We can start combining these two statements to make effective **NLP** tools. We'll show some examples using the following sentence:

    >>> sent1 = ['They', 'don\'t', 'think', 'it', 'be', 'like', 'it', 'is', ',', 'But', 'it', 'do']

Let's start with a conditional loop looking for words that end in 't':

    >>> for word in sent1:
    ...     if word.endswith('t'):
    ...         print(word)
    ...
    don't
    it
    it
    But
    it

Next let's create what's called a *control flow chart* to test several conditions in order. In this case we will look for the following:

+ First, we will look for tokens that are all lowercase
+ Then, we will look for tokens that are titlecased
+ And lastly, we will look for tokens that are punctuation

    >>> for token in sent1:
    ...     if token.islower():
    ...         print(token, 'is a lowercase word')
    ...     elif token.istitle():
    ...         print(token, 'is a titlecase word')
    ...     else:
    ...         print(token, 'is a punctuation')
    ...
    They is a titlecase word
    don't is a lowercase word
    think is a lowercase word
    it is a lowercase word
    be is a lowercase word
    like is a lowercase word
    it is a lowercase word
    is is a lowercase word
    , is a punctuation
    But is a titlecase word
    it is a lowercase word
    do is a lowercase word

Next we will do a more complicated search using a similar technique as above, except with the `[... for w in ...]` syntax. We will be looking for words that contain 'ou'. Notice the parameter `end = ' '` in the print statement below: it prints each token with a space in between, as opposed to a newline break.
    >>> tricky = sorted(w for w in set(text3) if 'ou' in w)
    >>> for word in tricky:
    ...     print(word, end = ' ')
    ...
    Should Sojourn Thou about aloud besought bou boug bough bought bound
    brought cloud colours confound couch couched couching could counted
    countenance countries country devour devoured double doubled doubt
    drought enough favour favoured fou found fountain fountains four
    fourscore fourteen fourteenth fourth gracious graciously grievous
    ground honour honourable hou hous house household households journey
    journeyed journeys labour loud mou mount mountain mountains mourn
    mourned mourning mouth mouths nought nourish nourished ou ought our
    ours ourselves out plenteous plenteousness poured precious prosperous
    righteous righteousness roughly round rouse savour savoury should
    shoulder shoulders shouldest sojourn sojourned sojourner sou sought
    soul souls south southward storehouses thoroughly thou though thought
    thoughts thousand thousands through throughout touch touched toucheth
    touching troubled trough troughs without would wouldest wounding
    wrought you young younge younger youngest your yourselves youth

## Conclusion

This concludes the first chapter of the book and gives us a brief overview of what lies ahead when tackling **NLP**. This chapter is especially helpful for giving you context on some basic operations on **tokens**, as well as showing how to iterate over bodies of text, with and without conditional selection, which will be useful when you start pre-processing your data set for fitting models.
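As a send-off, several of the chapter's moves (lowercasing, filtering non-alphabetic tokens, frequency counting, hapaxes, and bigrams) can be chained together in plain Python. This is a minimal sketch on a made-up token list, not from the book; `Counter` stands in for `FreqDist` and `zip` for `nltk.bigrams`:

```python
from collections import Counter

# Made-up token list standing in for a corpus like text3
tokens = ['The', 'cat', 'saw', 'the', 'dog', ',', 'and', 'the', 'dog', 'ran', '.']

# Normalize case and drop non-alphabetic tokens, as in the len(set(...)) steps
words = [w.lower() for w in tokens if w.isalpha()]
print(len(set(words)))                 # distinct alphabetic words -> 6

# Frequency distribution (Counter plays the role of FreqDist)
fdist = Counter(words)
hapaxes = sorted(w for w in fdist if fdist[w] == 1)   # words appearing once
print(hapaxes)                         # ['and', 'cat', 'ran', 'saw']

# Bigrams: pair each word with its successor (like nltk.bigrams)
print(list(zip(words, words[1:]))[:3])
# [('the', 'cat'), ('cat', 'saw'), ('saw', 'the')]
```

Every piece here is the `[f(w) for w in text if condition]` idiom from this chapter, just applied end to end.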


keep exploring!
