How to Extract Links from an HTML Page Using Python
To extract links from an HTML document using Python, you can leverage the `requests` library together with `BeautifulSoup` from the `bs4` package.
This script automates what you would otherwise do by hand: opening a webpage, inspecting its source code to locate the hyperlinks (the `<a>` tags), and collecting every link found on the page.
```
import requests
from bs4 import BeautifulSoup

url = "https://www.andreaminini.com/python/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract all links
for link in soup.find_all('a', href=True):
    print(link['href'])
```
This is a basic example of web scraping, an automated technique for extracting information from a webpage. Let’s walk through how it works, line by line.
The script begins by importing the `requests` and `BeautifulSoup` libraries.
- `requests` — used to make HTTP requests. In the context of web scraping, it is commonly used to download the content of a webpage.
- `BeautifulSoup` — used to parse HTML and XML documents, allowing you to easily navigate and manipulate the DOM tree of a webpage.
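If you don't already have these libraries installed, both are available from PyPI. Note that `BeautifulSoup` ships in the `beautifulsoup4` package:

```
pip install requests beautifulsoup4
```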
The `url` variable holds the address of the webpage you want to analyze.
```
url = "https://www.andreaminini.com/python/"
```
The `requests.get(url)` call sends an HTTP GET request to that URL.
If the request is successful, the server responds with the content of the webpage, which is stored in the `response` variable.
```
response = requests.get(url)
```
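The script above assumes the request succeeds. As a defensive variation (not part of the original example), you can ask `requests` to fail loudly on HTTP errors and to give up after a timeout:

```
response = requests.get(url, timeout=10)  # give up after 10 seconds
response.raise_for_status()               # raise requests.HTTPError on 4xx/5xx responses
```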
Next, the HTML content stored in `response` is parsed using BeautifulSoup.
```
soup = BeautifulSoup(response.text, 'html.parser')
```
This command parses the HTML text using the specified parser (`'html.parser'` in this case), creating a `BeautifulSoup` object that represents the HTML structure in a format that can be easily navigated and manipulated.
The `response.text` attribute contains the raw HTML content returned by the HTTP request.
The result of the parsing operation is stored in the `soup` variable.
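Once the `soup` object exists, you can query any part of the document in the same way. For example (the exact results depend on the page's HTML, so treat this as an illustrative sketch):

```
# The <title> tag, if the page has one
print(soup.title.string if soup.title else "No <title> found")

# Count the paragraph (<p>) tags on the page
print(len(soup.find_all('p')))
```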
Finally, the `for` loop iterates over the matching elements in the `soup` object and extracts all the links.
```
for link in soup.find_all('a', href=True):
    print(link['href'])
```
The method `soup.find_all('a', href=True)` searches the HTML document for all `<a>` elements that have an `href` attribute, which are the hyperlinks.
For each `<a>` element found, it prints the value of the `href` attribute, which is the URL the link points to.
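One caveat: `href` values are often relative (for example, `/python/strings/`). A common refinement, sketched below rather than taken from the original script, uses `urljoin` from the standard library's `urllib.parse` to resolve each link against the page's base URL:

```
from urllib.parse import urljoin

for link in soup.find_all('a', href=True):
    # Resolve relative hrefs (e.g. "/about/") against the base URL
    print(urljoin(url, link['href']))
```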
This straightforward script extracts all the links present on the page specified by the `url` variable.
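If you'd rather get the links back as a list, as described at the start, instead of printing them one by one, the same logic fits in a small function. This is just a sketch, and the name `extract_links` is our own choice:

```
import requests
from bs4 import BeautifulSoup

def extract_links(url):
    """Return a list of every href value found on the page at `url`."""
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    return [link['href'] for link in soup.find_all('a', href=True)]

links = extract_links("https://www.andreaminini.com/python/")
print(links)
```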
And that's it.