How to Extract Links from an HTML Page Using Python

To extract links from an HTML document using Python, you can leverage the `requests` and `BeautifulSoup` libraries.

This script automates what you would otherwise do by hand: opening a webpage, inspecting its source code to locate the hyperlinks (the `<a>` tags), and printing every link found on the page.

import requests
from bs4 import BeautifulSoup

url = "https://www.andreaminini.com/python/"
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')

# Extract all links
for link in soup.find_all('a', href=True):
    print(link['href'])

This is a basic example of web scraping, an automated technique for extracting information from a webpage. Let’s walk through how it works, line by line.

The script begins by importing the `requests` and `BeautifulSoup` libraries.

  • requests
    This library is used to make HTTP requests. In the context of web scraping, `requests` is commonly used to download the content of a webpage.
  • BeautifulSoup
    This library is used to parse HTML and XML documents, allowing you to easily navigate and manipulate the DOM tree of a webpage.
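Neither library ships with Python's standard library, so both need to be installed first. Note that the PyPI package name for BeautifulSoup is `beautifulsoup4`, even though you import it as `bs4`:

```shell
pip install requests beautifulsoup4
```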

The `url` variable holds the address of the webpage you want to analyze.

url = "https://www.andreaminini.com/python/"

The `requests.get(url)` call sends an HTTP GET request to that URL.

If the request is successful, the server responds with the content of the webpage, which is stored in the `response` variable.

response = requests.get(url)
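In practice, it's worth checking that the request actually succeeded before parsing anything. Here's a minimal sketch of that pattern; the `fetch_html` helper name and the `timeout` value are illustrative choices, not part of the original script:

```python
import requests

def fetch_html(url: str) -> str:
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=10)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
    # so a failed download can't silently produce an empty parse.
    response.raise_for_status()
    return response.text

# Example (requires network access):
# html = fetch_html("https://www.andreaminini.com/python/")
```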

Next, the HTML content stored in `response` is parsed using BeautifulSoup.

soup = BeautifulSoup(response.text, 'html.parser')

This command parses the HTML text using the specified parser (`'html.parser'` in this case), creating a `BeautifulSoup` object that represents the HTML structure in a format that can be easily navigated and manipulated.

`response.text` contains the raw HTML content returned by the HTTP request.

The result of the parsing operation is stored in the `soup` variable.
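To see what the `soup` object gives you, here is a self-contained example that parses a small hand-written HTML snippet (the snippet itself is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<html>
  <head><title>Demo Page</title></head>
  <body>
    <p>Some text with a <a href="/python/">link</a>.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Once parsed, the DOM tree can be navigated by tag name
print(soup.title.string)       # Demo Page
print(soup.find('a')['href'])  # /python/
```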

Finally, the `for` loop iterates over the elements returned by `find_all()`, extracting all the links.

for link in soup.find_all('a', href=True):
    print(link['href'])

The method `soup.find_all('a', href=True)` searches the HTML document for all `<a>` elements that have an `href` attribute, which are the hyperlinks.

For each `<a>` element found, it prints the value of the `href` attribute, which is the URL the link points to.
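The `href=True` filter matters because not every `<a>` tag is a hyperlink: tags without an `href` attribute (for example, old-style named anchors) are skipped. A small self-contained check, using a made-up snippet:

```python
from bs4 import BeautifulSoup

snippet = '<a href="/a.html">A</a> <a name="top">not a link</a> <a href="/b.html">B</a>'
soup = BeautifulSoup(snippet, 'html.parser')

# Only <a> tags that carry an href attribute are returned
links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)  # ['/a.html', '/b.html']
```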

This straightforward script extracts all the links present on the page specified by the `url` variable.
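One caveat: the printed links are whatever the page's HTML contains, so many of them may be relative paths rather than full URLs. The standard library's `urllib.parse.urljoin` resolves them against the page address (the example paths below are made up for illustration):

```python
from urllib.parse import urljoin

base = "https://www.andreaminini.com/python/"

# Relative paths are resolved against the base URL;
# absolute URLs pass through unchanged.
print(urljoin(base, "variables/"))             # https://www.andreaminini.com/python/variables/
print(urljoin(base, "/contact"))               # https://www.andreaminini.com/contact
print(urljoin(base, "https://example.com/x"))  # https://example.com/x
```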

And that's it.
