Python - Extract URL from Text

Neha Kumawat

a year ago

URL extraction from a text file is done with a regular expression. The expression returns the text wherever it matches the pattern. Only the re module is needed for this.

Example

We can take an input file containing a few URLs and process it with the following program to extract the URLs. The findall() function is used to find all instances that match the regular expression.

Input File

The input file is shown below. It contains two URLs.

Nowadays you can learn almost anything by just visiting http://www.google.com. But if you are completely new to computers or the internet then first you need to learn those fundamentals. Next
you can visit a good e-learning site like - https://insideaiml.com to learn further on a variety of subjects.
Now, when we take the above input file and process it with the following program, we get the required output, which contains only the URLs extracted from the file.
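import re

# Raw strings keep the backslashes in the pattern and the path literal;
# without the r prefix, "\u" in the path is read as a Unicode escape.
url_pattern = re.compile(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+')

with open(r"path\url_example.txt") as file:
    for line in file:
        urls = url_pattern.findall(line)
        print(urls)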
When we run the above program we get the following output:
['http://www.google.com.']
['https://insideaiml.com']
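In the output above, the first match keeps the sentence-ending period, because the character class [-\w.] accepts a dot anywhere, including at the very end. Below is a minimal sketch of one way to trim it, assuming a trailing dot is never part of the URL you want. The helper name extract_urls and the sample string are illustrative and not part of the original program.

```python
import re

# Same pattern as in the example above: scheme, then host characters
# (letters, digits, underscore, hyphen, dot) or %XX escapes.
URL_PATTERN = re.compile(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+')


def extract_urls(text):
    """Return the URLs found in text, without a trailing period."""
    # The pattern can end on a dot when a URL closes a sentence,
    # so strip dots from the right-hand side of every match.
    return [url.rstrip('.') for url in URL_PATTERN.findall(text)]


sample = ("Nowadays you can learn almost anything by just visiting "
          "http://www.google.com. Next you can visit https://insideaiml.com "
          "to learn further.")
print(extract_urls(sample))
# prints: ['http://www.google.com', 'https://insideaiml.com']
```

Note that this pattern stops at the first slash after the host name, so paths and query strings are not captured.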

How does it work?

The urlextract library, used in the examples further below, takes a different approach from the regular expression above. It tries to find any occurrence of a TLD (top-level domain) in the given text. If a TLD is found, it starts from that position and expands the boundaries to both sides, searching for a "stop character" (usually whitespace, a comma, or a single or double quote).
A DNS check option is also available to reject invalid domain names.
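The boundary-expansion idea described above can be illustrated with a simplified sketch. This is not urlextract's actual implementation: the tiny TLDS tuple, the STOP_CHARS set, and the function name find_url_candidates are assumptions made only for illustration, and there is no DNS check.

```python
import re

# Hypothetical, tiny TLD list; a real tool works from the full TLD list.
TLDS = ("com", "cz", "name")
# Characters that end a URL candidate on either side.
STOP_CHARS = " \t\n\r,'\""

# Matches ".com", ".cz" or ".name" when followed by a word boundary.
TLD_PATTERN = re.compile(r"\.(?:" + "|".join(TLDS) + r")\b", re.IGNORECASE)


def find_url_candidates(text):
    """Find TLD occurrences and expand each one to the nearest stop characters."""
    found = []
    last_end = 0
    for match in TLD_PATTERN.finditer(text):
        if match.start() < last_end:
            continue  # this TLD lies inside the previous candidate
        left = match.start()
        while left > 0 and text[left - 1] not in STOP_CHARS:
            left -= 1
        right = match.end()
        while right < len(text) and text[right] not in STOP_CHARS:
            right += 1
        # Drop a sentence-ending period that was swept up by the expansion.
        found.append(text[left:right].rstrip("."))
        last_end = right
    return found


print(find_url_candidates("Let's have URL janlipovsky.cz as an example."))
# prints: ['janlipovsky.cz']
```

Because the search starts from the TLD and not from a scheme such as http://, this approach also finds bare domain names that the regular expression above would miss.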

Requirements

  • IDNA for converting links to IDNA format
  • uritools for domain name validation
  • appdirs for determining user’s cache directory
  • dnspython to cache DNS results
pip install idna
pip install uritools
pip install appdirs
pip install dnspython

Another Example:

You can see the command-line program at the end of urlextract.py. However, all you need to know is this:
from urlextract import URLExtract

extractor = URLExtract()
urls = extractor.find_urls("Text with URLs. Let's have URL janlipovsky.cz as an example.")
print(urls) # prints: ['janlipovsky.cz']
Or you can get a generator over the URLs in the text:
from urlextract import URLExtract

extractor = URLExtract()
example_text = "Text with URLs. Let's have URL janlipovsky.cz as an example."

for url in extractor.gen_urls(example_text):
    print(url) # prints: janlipovsky.cz
Or, if you just want to check whether there is at least one URL, you can do:
from urlextract import URLExtract

extractor = URLExtract()
example_text = "Text with URLs. Let's have URL janlipovsky.cz as an example."

if extractor.has_urls(example_text):
    print("Given text contains some URL")
If you want an up-to-date list of TLDs, you can use update():
from urlextract import URLExtract

extractor = URLExtract()
extractor.update()
or the update_when_older() method:
from urlextract import URLExtract

extractor = URLExtract()
extractor.update_when_older(7) # updates when the list is older than 7 days

Known Issues

Because a TLD can be not only an abbreviation but also a meaningful word, we may see "false matches" when looking for URLs in HTML pages. A false match can occur, for instance, in CSS or JS when you refer to an HTML element by its classes.
Example HTML code:
<p class="bold name">Jan</p>
<style>
  p.bold.name {
    font-weight: bold;
  }
</style>
If this HTML snippet is passed as input to find_urls(), it will return p.bold.name as a URL. This behavior of urlextract is correct, because .name is a valid TLD and urlextract simply sees that bold.name is a valid domain name and p is a valid subdomain.
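One possible way to reduce such false matches is to post-filter the results, keeping only candidates that carry an explicit scheme. The sketch below uses only the standard library; the function name keep_urls_with_scheme and the accepted schemes are assumptions for illustration, and this filter is not a feature of urlextract described in this article.

```python
from urllib.parse import urlparse

# Schemes accepted as "real" links in this sketch (an assumption).
ALLOWED_SCHEMES = ("http", "https")


def keep_urls_with_scheme(candidates):
    """Keep only candidates that start with an accepted URL scheme."""
    return [c for c in candidates if urlparse(c).scheme in ALLOWED_SCHEMES]


candidates = ["p.bold.name", "https://insideaiml.com"]
print(keep_urls_with_scheme(candidates))
# prints: ['https://insideaiml.com']
```

The trade-off is that legitimate bare domains such as janlipovsky.cz are dropped as well, so this filter only suits input where real links always include http:// or https://.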

License

This piece of code is licensed under The MIT License.
For more related articles and courses visit InsideAIML.
