Web Scraping - Processing Images and Videos

Nishant Rathi

4 months ago


Let’s first understand what web scraping is.

In theory, web scraping is the practice of gathering data through any means other
than a program interacting with an API (or, obviously, through a human using a web
browser). This is most commonly accomplished by writing an automated program
that queries a web server, requests data (usually in the form of the HTML and other
files that comprise web pages), and then parses that data to extract needed information.
In short, web scraping is the process of downloading, storing, and processing web content.
Figure: Web scraping process
In this article, let us understand how to process the content downloaded from the web.

Introduction

The media content that we obtain during scraping can be images, audio, and video files, as well as data files and other content that is not a web page. But can we trust the downloaded data, and in particular the file extension of the data we are about to store on our computer? This makes it essential to know the actual type of the data we are going to store locally.
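One way to check what a downloaded file really is, independent of its extension, is to look at its first few bytes. The following is a minimal sketch and not part of the original walkthrough: the function name sniff_image_type is hypothetical, and only three common image signatures are covered.

```python
# Well-known leading byte signatures ("magic numbers") of common image formats.
SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(data):
    """Return 'jpeg', 'png' or 'gif' based on the leading bytes, else None."""
    for signature, kind in SIGNATURES.items():
        if data.startswith(signature):
            return kind
    return None


# The first bytes of a JPEG file, padded with zeros for the demonstration.
print(sniff_image_type(b"\xff\xd8\xff\xe0" + b"\x00" * 16))  # jpeg
```

In practice the bytes would come from the downloaded content, for example the first few bytes of a saved file opened in 'rb' mode.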

How to get Media Content from Web Page using Python

In this section, we are going to learn how to download media content and identify its media type correctly, based on the information from the web server. We can do this with the help of the Python requests module.
As a first step, we import the necessary Python module −
import requests
This imports the requests package, a widely used Python library for sending HTTP requests, which makes it a common choice for web scraping.
In the next step, we provide the URL of the content that we want to download and store locally.
URL = "https://authoraditiagarwal.com/wpcontent/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg"
We create the HTTP response object using the following code:
r = requests.get(URL)
With the following lines of code, we can save the received content as a .png file.
with open("ThinkBig.png", 'wb') as f:
    f.write(r.content)
After running the above Python script, we will get a file named ThinkBig.png, which contains the downloaded image. Note that the file name does not change the format: the URL points to a .jpg image, so the saved bytes remain JPEG data even though the file is named ThinkBig.png.
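For large media files it can help to write the response to disk in chunks and to stop early on HTTP errors. The sketch below is one possible way to do that, assuming the response object was obtained from requests with stream=True; the helper name save_response and the chunk size are illustrative choices, not something the article prescribes.

```python
def save_response(response, dest_path, chunk_size=8192):
    """Write a streamed HTTP response to dest_path; return the bytes written.

    `response` is expected to behave like requests.Response, i.e. to offer
    raise_for_status() and iter_content().
    """
    # Fail on 4xx/5xx instead of saving an error page under an image name.
    response.raise_for_status()
    written = 0
    with open(dest_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            f.write(chunk)
            written += len(chunk)
    return written
```

With the URL from this section, it could be called as save_response(requests.get(URL, stream=True, timeout=10), "ThinkBig.png").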

Extracting Filename from URL

After downloading the content from a web page, we usually want to save it under the file name found in the URL. However, a URL can contain additional parts, such as directories, a query string, or a fragment, so we need to find the actual filename within the URL.
With the help of the following Python script, using urlparse, we can extract the filename from the URL −
import os
from urllib.parse import urlparse
url = "https://authoraditiagarwal.com/wpcontent/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg"
a = urlparse(url)
# a.path holds the full path; basename keeps only its last segment
os.path.basename(a.path)
The output is shown below −
'MetaSlider_ThinkBig-1080x180.jpg'
Once we run the above script, we get the filename from the URL.
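The same idea can be wrapped in a small helper that also copes with query strings, fragments, and percent-encoded characters. This is a sketch under assumptions: the function name filename_from_url, the fallback name "download", and the example.com URL are made up for illustration.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(url, default="download"):
    """Return the last path segment of url, ignoring query and fragment."""
    path = urlparse(url).path                  # drops "?query" and "#fragment"
    name = posixpath.basename(unquote(path))   # URL paths always use "/"
    return name or default                     # e.g. URLs ending in "/"


print(filename_from_url("https://example.com/media/photo%20one.jpg?size=large#top"))
# photo one.jpg
```

posixpath is used instead of os.path so that the result does not depend on the operating system the script runs on.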

Information about Type of Content from URL

While fetching content from a web server with a GET request, we can also inspect the information that the server sends along with it. With the help of the following Python script, we can determine what the web server reports as the type of the content −
First, we need to import the necessary Python module as follows −
import requests
Now, we need to provide the URL of the content we want to download and save locally.
URL = "https://authoraditiagarwal.com/wpcontent/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg"
The following line of code will create the HTTP response object.
r = requests.get(URL, allow_redirects=True)
Now we can list the header fields that the web server provides about the content.
for header in r.headers:
    print(header)
We can observe the output as shown below −
Date
Server
Upgrade
Connection
Last-Modified
Accept-Ranges
Content-Length
Keep-Alive
Content-Type
With the help of the following line of code, we can get one particular header, say Content-Type −
print(r.headers.get('content-type'))
We can observe the output as shown below −
image/jpeg
In the same way, we can ask for the ETag header −
print(r.headers.get('ETag'))
We can observe the output as shown below −
None
The result is None because this server did not send an ETag header; it does not appear in the list of headers above.
Now let’s observe the following command −
print(r.headers.get('content-length'))
We can observe the output as shown below −
12636
To get another particular header, say Server, we can do it as follows −
print(r.headers.get('Server'))
We can observe the output below, which gives the server type −
Apache
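Once the Content-Type value is known, it can be used to pick a file extension instead of trusting the URL. The following sketch assumes a small hand-written mapping; the names EXTENSIONS and extension_for, the list of types, and the ".bin" fallback are illustrative choices only.

```python
# Small, deliberately incomplete mapping from MIME type to file extension.
EXTENSIONS = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
    "video/mp4": ".mp4",
    "audio/mpeg": ".mp3",
}


def extension_for(content_type, default=".bin"):
    """Map a Content-Type header value to a file extension."""
    if not content_type:
        return default
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return EXTENSIONS.get(mime, default)


print(extension_for("image/jpeg"))  # .jpg
```

With the response from this section, the call would be extension_for(r.headers.get('content-type')).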

Generating Thumbnail for Images

A thumbnail is a small, reduced-size representation of an image. A user may want to save only the thumbnail of a large image, or save both the image and its thumbnail.
In this section, we are going to create a thumbnail of the image named ThinkBig.png downloaded in the previous section “How to get Media Content from Web Page using Python”.
For this Python script, we need to install the Python library named Pillow.
What is Pillow?
Pillow is a fork of the Python Imaging Library (PIL) that provides useful functions for manipulating images.
It can be installed as −
pip install pillow
The following Python script will create a thumbnail of the image and save it to the current directory, prefixing the thumbnail file name with Th_
import glob
from PIL import Image
for infile in glob.glob("ThinkBig.png"):
    img = Image.open(infile)
    # Image.LANCZOS is the current name of the filter formerly called Image.ANTIALIAS
    img.thumbnail((128, 128), Image.LANCZOS)
    # Skip files that are already thumbnails; the prefix is three characters long
    if infile[0:3] != "Th_":
        img.save("Th_" + infile, "png")
After running the script, check the current directory for the thumbnail file Th_ThinkBig.png.
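The thumbnail() call keeps the aspect ratio, so the result is usually smaller than the requested 128 x 128 box in one dimension. The sketch below approximates that calculation in plain Python so the expected size can be reasoned about without opening an image. The function name thumbnail_size is hypothetical, the 1080 x 180 input is an assumption taken from the file name in the URL, and Pillow's own rounding may differ slightly.

```python
def thumbnail_size(width, height, max_width, max_height):
    """Largest size that fits the box while keeping the aspect ratio."""
    # Never scale up: a factor above 1 would enlarge the image.
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(thumbnail_size(1080, 180, 128, 128))  # (128, 21)
```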

Screenshot from Website

In web scraping, one very common task is to take a screenshot of a website.
To implement this, we are going to use Selenium and its webdriver module.
The following Python script will take a screenshot of the website and save it to the current directory.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
# In a raw string, each backslash is written only once
path = r'C:\Users\neha\Desktop\Chromedriver'
# Selenium 4 takes the driver path through a Service object instead of executable_path
browser = webdriver.Chrome(service=Service(path))
browser.get('https://tutorialspoint.com/')
screenshot = browser.save_screenshot('screenshot.png')
# quit is a method and must be called to close the browser
browser.quit()
You can observe the output as shown below −
DevTools listening on ws://127.0.0.1:1456/devtools/browser/488ed704-9f1b-44f0-a571-892dc4c90eb7
After running the script, you can check your current directory for the screenshot.png file.

Thumbnail Generation for Video

Now suppose we have downloaded videos from a website and want to generate thumbnails for them, so that a user can pick a specific video based on its thumbnail.
To generate thumbnails for videos, we need a tool called ffmpeg, which can be downloaded from www.ffmpeg.org. After downloading, we need to install it according to the instructions for our OS.
The following Python script will generate a thumbnail of the video and save it to our local directory −
import subprocess
# A raw string keeps the backslashes in the Windows path from being read as escape sequences
video_MP4_file = r"C:\Users\neha\desktop\solar.mp4"
thumbnail_image_file = 'thumbnail_solar_video.jpg'
# -y (overwrite the output) is placed before the output file so that ffmpeg does not ignore it
subprocess.call(['ffmpeg', '-y', '-i', video_MP4_file, '-ss', '00:00:20.000',
                 '-vframes', '1', thumbnail_image_file])
After running the above script, we will get the thumbnail named thumbnail_solar_video.jpg saved in our local directory.
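When thumbnails are needed for many videos, it can be convenient to build the ffmpeg command in one place and to check that ffmpeg is actually installed before calling it. The following is a sketch under assumptions: the helper names thumbnail_command and make_thumbnail are hypothetical, -frames:v is used as the equivalent of -vframes, and -ss is placed before -i so that ffmpeg seeks in the input before decoding.

```python
import shutil
import subprocess


def thumbnail_command(video_path, image_path, timestamp="00:00:20.000"):
    """Build the ffmpeg argument list for grabbing one frame at timestamp."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file without asking
        "-ss", timestamp,     # seek to this position in the input
        "-i", video_path,
        "-frames:v", "1",     # write exactly one video frame
        image_path,
    ]


def make_thumbnail(video_path, image_path, timestamp="00:00:20.000"):
    """Run ffmpeg; raise if it is missing or exits with an error."""
    if shutil.which("ffmpeg") is None:
        raise FileNotFoundError("ffmpeg was not found on PATH")
    subprocess.run(thumbnail_command(video_path, image_path, timestamp), check=True)
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems with file names that contain spaces.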

Ripping an MP4 video to an MP3

Suppose we have downloaded some video files from a website, but we only need the audio from them. This can be done in Python with the help of a library called moviepy, which can be installed with the following command −
pip install moviepy
After successfully installing moviepy, we can convert an MP4 to MP3 with the help of the following script.
# This import applies to MoviePy 1.x; MoviePy 2.x uses "from moviepy import VideoFileClip"
import moviepy.editor as mp
clip = mp.VideoFileClip(r"C:\Users\gaurav\Desktop\1234.mp4")
clip.audio.write_audiofile("movie_audio.mp3")
We can observe the output as shown below −
[MoviePy] Writing audio in movie_audio.mp3
100%|██████████| 674/674 [00:01<00:00, 476.30it/s]
[MoviePy] Done.
It will save the MP3 audio file in the current working directory.
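If ffmpeg from the previous section is already installed, the audio can also be extracted without moviepy. This is a different technique from the one described above, shown only as a sketch: the helper name audio_command is hypothetical, and it assumes an ffmpeg build that can encode MP3.

```python
def audio_command(video_path, audio_path):
    """Build the ffmpeg argument list for extracting the audio track."""
    return [
        "ffmpeg",
        "-y",             # overwrite the output file without asking
        "-i", video_path,
        "-vn",            # drop the video stream, keep only the audio
        audio_path,       # the .mp3 extension selects the output format
    ]


print(audio_command("1234.mp4", "movie_audio.mp3"))
```

The resulting list can be passed to subprocess.run(..., check=True) in the same way as the thumbnail command.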
