How to Extract Links from an HTML Page Using Python
To extract links from an HTML document using Python, you can leverage the `requests` library together with `BeautifulSoup` from the `bs4` package.
This script automates what you would otherwise do by hand: opening a webpage, inspecting its source code to locate the hyperlinks (the `<a>` tags), and collecting every link found on the page.
```
import requests
from bs4 import BeautifulSoup

url = "https://www.andreaminini.com/python/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract all links
for link in soup.find_all('a', href=True):
    print(link['href'])
```
This is a basic example of web scraping, an automated technique for extracting information from a webpage. Let’s walk through how it works, line by line.
The script begins by importing the `requests` and `BeautifulSoup` libraries.
- `requests` — used to make HTTP requests. In the context of web scraping, it is commonly used to download the content of a webpage.
- `BeautifulSoup` — used to parse HTML and XML documents, allowing you to easily navigate and manipulate the DOM tree of a webpage.
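If you don't already have these libraries installed, both are available from PyPI. Note that `BeautifulSoup` ships in the `beautifulsoup4` package:

```
pip install requests beautifulsoup4
```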
The `url` variable holds the address of the webpage you want to analyze.
```
url = "https://www.andreaminini.com/python/"
```
The `requests.get(url)` call sends an HTTP GET request to that URL.
If the request is successful, the server responds with the content of the webpage, which is stored in the `response` variable.
```
response = requests.get(url)
```
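The script above assumes the request succeeds. As a defensive variation (not part of the original example), you can ask `requests` to fail loudly on HTTP errors and to give up after a timeout:

```
response = requests.get(url, timeout=10)  # give up after 10 seconds
response.raise_for_status()               # raise requests.HTTPError on 4xx/5xx responses
```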
Next, the HTML content stored in `response` is parsed using BeautifulSoup.
```
soup = BeautifulSoup(response.text, 'html.parser')
```
This command parses the HTML text using the specified parser (`'html.parser'` in this case), creating a `BeautifulSoup` object that represents the HTML structure in a format that can be easily navigated and manipulated.
The `response.text` attribute contains the raw HTML content returned by the HTTP request.
The result of the parsing operation is stored in the `soup` variable.
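Once the `soup` object exists, you can query any part of the document in the same way. For example (the exact results depend on the page's HTML, so treat this as an illustrative sketch):

```
# The <title> tag, if the page has one
print(soup.title.string if soup.title else "No <title> found")

# Count the paragraph (<p>) tags on the page
print(len(soup.find_all('p')))
```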
Finally, the `for` loop iterates over the matching elements in the `soup` object and extracts all the links.
```
for link in soup.find_all('a', href=True):
    print(link['href'])
```
The method `soup.find_all('a', href=True)` searches the HTML document for all `<a>` elements that have an `href` attribute, which are the hyperlinks.
For each `<a>` element found, it prints the value of the `href` attribute, which is the URL the link points to.
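One caveat: `href` values are often relative (for example, `/python/strings/`). A common refinement, sketched below rather than taken from the original script, uses `urljoin` from the standard library's `urllib.parse` to resolve each link against the page's base URL:

```
from urllib.parse import urljoin

for link in soup.find_all('a', href=True):
    # Resolve relative hrefs (e.g. "/about/") against the base URL
    print(urljoin(url, link['href']))
```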
This straightforward script extracts all the links present on the page specified by the `url` variable.
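If you'd rather get the links back as a list, as described at the start, instead of printing them one by one, the same logic fits in a small function. This is just a sketch, and the name `extract_links` is our own choice:

```
import requests
from bs4 import BeautifulSoup

def extract_links(url):
    """Return a list of every href value found on the page at `url`."""
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    return [link['href'] for link in soup.find_all('a', href=True)]

links = extract_links("https://www.andreaminini.com/python/")
print(links)
```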
And that's it.