Scrape A Webpage (Python)


Project Summary

Abstract

This project focuses on scraping the web using Python and is adapted from the book Automate the Boring Stuff with Python.

Web scraping, also known as web harvesting or web data extraction, is the process of extracting information from websites, typically by downloading and parsing their HTML. It is used to rank websites by traffic and to track and compare products and prices, and it can be useful to data scientists who want to model data from the web.
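
To make "extracting information from HTML" concrete, here is a minimal sketch that pulls the page title out of an HTML string using only the standard library's html.parser module. The class name TitleParser, the helper extract_title, and the sample page are assumptions made for illustration; they are not part of the project script, which only downloads the raw source.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the stripped title text of an HTML string, or '' if none."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


# Hypothetical sample page, hard-coded so the example runs offline
sample_html = "<html><head><title>Example Page</title></head><body><p>Hello</p></body></html>"
print(extract_title(sample_html))  # prints: Example Page
```

The same function could be applied to the text saved by the script in this project, once it has been read back in as a string.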

For this project we leverage Python 3, including the modules sys, requests, and webbrowser.

We encourage you to try replicating this project and make your own contributions!

1. The Code

Create a new file called scrapeSite.py. Then type the following code into your newly created file.

#! python3
#----------------------------------------------------------
# scrapeSite.py opens a site of your choosing in the browser
# and saves its source code to scrapeMent.txt
#----------------------------------------------------------
# execute in the command line:
#
#   python3 scrapeSite.py URL
#----------------------------------------------------------

# importing required modules
import sys
import webbrowser

import requests

# declaring command-line argument as variable
# this argument should be the URL of the webpage to be scraped
if len(sys.argv) < 2:
    sys.exit('Usage: python3 scrapeSite.py URL')
siteToScrape = sys.argv[1]

# opening webpage in default browser
webbrowser.open(siteToScrape)

# scraping source code for webpage
res = requests.get(siteToScrape)

# checking for errors (raises an exception on a 4xx or 5xx response)
res.raise_for_status()

# creating file to dump scraped source code;
# the with statement closes the file when the block ends
with open('scrapeMent.txt', 'wb') as playFile:
    # dumping scraped source code into file in "chunks" of 100000 bytes at a time
    for chunk in res.iter_content(100000):
        playFile.write(chunk)
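
The chunked write at the end of the script can be tried without a network connection. The sketch below is an illustration under stated assumptions: save_chunks is a hypothetical helper, and fake_chunks is a hard-coded list standing in for what res.iter_content(100000) would yield.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of bytes objects to path; return total bytes written."""
    total = 0
    # The with statement closes the file even if a write fails
    with open(path, "wb") as out_file:
        for chunk in chunks:
            total += out_file.write(chunk)
    return total


# Stand-in for res.iter_content(100000): any iterable of bytes works here
fake_chunks = [b"<html>", b"<body>hi</body>", b"</html>"]

with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "scrapeMent_demo.txt")
    written = save_chunks(fake_chunks, target)
    print(written, "bytes written")
```

Writing in chunks keeps memory use bounded, because only one chunk needs to be held at a time while the file grows on disk.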

Learn to install the required software and run your script in the following sections.

2. Install Python 3 and Required Modules

You will need to install Python 3 to run this project. Download it at python.org.

You will also need the requests module. The other two modules the script imports, sys and webbrowser, are part of the Python standard library and ship with Python 3, so they do not need to be installed separately. We recommend you use pip to install requests. If pip is not already available, search for how to install pip for Python 3 on your operating system. Then run the following command in your terminal:

pip3 install requests
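
Before running the script, it can help to check which of the three modules Python is able to find. The following is a minimal sketch using importlib.util.find_spec from the standard library; the helper name is_installed is an assumption made for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Report the status of each module the script imports
for name in ("sys", "webbrowser", "requests"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If requests is reported as missing, install it with pip and run the check again.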

3. Execute The Script

To execute the script, enter the following command into your terminal:

python3 scrapeSite.py URL

Make sure to replace URL with the URL of a website you would like to scrape. For example, if you'd like to scrape Inertia7.com, then you would enter:

python3 scrapeSite.py http://www.inertia7.com
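
Note that the URL includes the http:// prefix. requests.get raises a MissingSchema error when the scheme is left out, so a bare www.inertia7.com would fail. One possible way to tolerate that input is sketched below; normalize_url and its default_scheme parameter are hypothetical and not part of the original script.

```python
from urllib.parse import urlparse


def normalize_url(arg, default_scheme="http"):
    """Return arg unchanged if it has an http(s) scheme, else prepend one."""
    if urlparse(arg).scheme in ("http", "https"):
        return arg
    return f"{default_scheme}://{arg}"


print(normalize_url("www.inertia7.com"))         # prints: http://www.inertia7.com
print(normalize_url("http://www.inertia7.com"))  # prints: http://www.inertia7.com
```

In the script, the result of such a helper could be assigned to siteToScrape before the page is opened and requested.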

CONGRATULATIONS!

Congratulations on getting this far! We hope you enjoyed this project. Please reach out to us if you have any feedback or would like to publish your own project.

Fork this project on GitHub.

Try this project next:


Forecasting the Stock Market (R)

Time-Series Analysis of the S&P 500 Stock Index with R