Web Scraping using Python and Selenium (XPath)

Shubham Pandey · Published in Analytics Vidhya · Jun 24, 2021

The website used in this article is Ellen_show. This article is for educational purposes: it is a project to learn web scraping using Selenium. I chose the above-mentioned website because I love the Ellen DeGeneres Show… so!! The data extracted here belongs to the owners of the show. This article is a hands-on "let's do it"; for the theory, please read here.

Now let's start. First, install Selenium using pip, then import the following libraries in your notebook.
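In case it is not installed yet:

pip install selenium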

import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from IPython.display import Image
import datetime

Add the Chrome driver path. Before that, download the latest version of the Chrome driver from this link (or the version matching your Chrome browser). PATH is where the downloaded .exe file is located.

PATH ="C:\Program Files\chromedriver.exe"

The code below will open a Chrome browser window:

driver = webdriver.Chrome(PATH)
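Note: with Selenium 4, passing the driver path directly to webdriver.Chrome is deprecated; the path goes through a Service object instead. A minimal equivalent sketch:

from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service(PATH))  # Selenium 4 style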

Now add the link of the website we want to scrape:

driver.get('https://www.thetvdb.com/series/the-ellen-degeneres-show')

Note: if you are here just for the code, scroll down to the Final scraper code; otherwise, follow along with me.

[Screenshot of the show's page. Source: https://www.thetvdb.com/series/the-ellen-degeneres-show]

Scroll down on the website, and you will see this:

[Screenshot of the seasons list. Source: https://www.thetvdb.com/series/the-ellen-degeneres-show]

See the little arrow? That is the link to Season 1's data, and there is a list of 18 seasons in total.

I want my scraper to automatically go into each link in the seasons list, scrape the data (all the episodes) for each season, and append it all into one single table. This way, we will have one dataset of all seasons with their episodes. Sounds fun, right!!!

Right click -> Inspect -> in the panel's top bar, click the element-picker arrow next to the Elements tab -> then select Season 1 on the page -> you will see a portion of the HTML highlighted in blue -> right click on it -> Copy -> Copy XPath -> you will get the below-mentioned path; paste it into a cell and do not touch it for now.

Season 1 XPath: //*[@id="page-series"]/div[3]/div[2]/div[2]/div[6]/div[1]/ul/li[3]/h4/a

Do exactly what we did before, for Season 18 this time. You will get the following:

Season 18 XPath: //*[@id="page-series"]/div[3]/div[2]/div[2]/div[6]/div[1]/ul/li[20]/h4/a

So we want to iterate from Season 1's XPath to Season 18's XPath.

Now, what is XPath? XPath is a query language for selecting nodes from an XML document. Selenium uses this among many other ways to navigate through HTML or XML documents; the other ways to select a node are by id, class, etc., but I find XPath headache-free, if that's a word.

Now let's get back to the task: can you spot the difference between Season 1's XPath and Season 18's XPath?

If not, here's the answer: in the XPath, everything up to ul/ is the same, and after the li[...] part, h4/a is also common to both of them.

The main game here is li[3] to li[20]: the difference is 20 - 3 + 1 = 18, which is equal to the total number of seasons. So our loop will go from 3 to 21 (the end of the range is excluded in Python) and we will get each season.
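To make the pattern concrete, here is a quick sketch (prints only, no clicking) of how the loop index i maps to each season's XPath:

# li[3] is Season 1, li[4] is Season 2, ..., li[20] is Season 18
for i in range(3, 21):
    season_xpath = '//*[@id="page-series"]/div[3]/div[2]/div[2]/div[6]/div[1]/ul/li[' + str(i) + ']/h4/a'
    print('Season ' + str(i - 2) + ' -> ' + season_xpath)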

Okay, so we can access each of these season links. Great!! But now what?

Now our code should be able to visit each season's link and click on it to open the page containing all the data of that season.

Code snippet to click on web-page elements using their XPath:
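A minimal version of that snippet, using the same wait-then-click pattern as the final scraper below: wait (up to 20 seconds) until the element is visible, then click it.

# wait until the Season 1 link (li[3]) is visible, then click it
element = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="page-series"]/div[3]/div[2]/div[2]/div[6]/div[1]/ul/li[3]/h4/a')))
element.click()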

The code above can click on an element of the webpage (the path to that element is given using the XPath we saw before). So this can open each link one by one.

Can it?

No, it can't, because after visiting a season's link you have to get back to the main page again; only then can it select the next link in the seasons list.

Code snippet for looping through the seasons:
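A minimal sketch of that loop, taken from the final scraper below; the useful work (scraping the episodes) is filled in later:

for i in range(3, 21):  # li[3] .. li[20] = Seasons 1 .. 18
    # open the i-th season's page
    element = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="page-series"]/div[3]/div[2]/div[2]/div[6]/div[1]/ul/li[' + str(i) + ']/h4/a')))
    element.click()

    # do something useful here (scrape this season's episodes)

    # go back so the next season link can be selected
    element = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="page-season"]/div[3]/div/a[3]')))
    element.click()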


Now, just like before, click on the row of episode 1 and collect the XPaths for S01E01, Name, and First Aired, just like we did for the seasons, and copy them into a separate cell for reference.

episode 1:

  • Season index: //*[@id="page-season"]/div[4]/div[2]/div[1]/table/tbody/tr[1]/td[1]
  • Guests (Name): //*[@id="page-season"]/div[4]/div[2]/div[1]/table/tbody/tr[1]/td[2]/a
  • Date (First aired): //*[@id="page-season"]/div[4]/div[2]/div[1]/table/tbody/tr[1]/td[3]/div

episode 166:

  • Season index: //*[@id="page-season"]/div[4]/div[2]/div[1]/table/tbody/tr[166]/td[1]
  • Guests (Name): //*[@id="page-season"]/div[4]/div[2]/div[1]/table/tbody/tr[166]/td[2]/a
  • Date (First aired): //*[@id="page-season"]/div[4]/div[2]/div[1]/table/tbody/tr[166]/td[3]/div

Now let's spot the difference between the first episode's and the last episode's XPaths (Season index, Guests, Date):

  • The Season index XPath changes from tr[1] to tr[166]; everything else is the same
  • The Guests (Name) XPath changes from tr[1] to tr[166]; everything else is the same
  • The Date (First aired) XPath changes from tr[1] to tr[166]; everything else is the same

Not all seasons have 166 episodes, so we can loop over rows 1 to 199 (range(1, 200)) to be on the safe side for each season. Now you might be thinking the code will give an error once we go beyond the number of episodes in a season. Yes!! It will raise NoSuchElementException, but we will handle it using try-except: if the exception occurs, we continue (a Python keyword) to the next iteration of the same loop until it finishes all the iterations, without raising any error. You will understand more when you see the code.

Okay, so we can access each data column of each episode. Now, what's next?

Now our code should be able to read the text associated with each XPath so we can store it. Let's work on that too.

Code snippet to loop through all the episodes and read the associated text:
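A minimal example of the reading part: .text returns the visible text of the element located by the XPath (Selenium 3 style, as used throughout this article). The full loop over all rows appears in the final code below.

# read the season/episode index of the first row, e.g. 'S01E01'
first_episode = driver.find_element_by_xpath('//*[@id="page-season"]/div[4]/div[2]/div[1]/table/tbody/tr[1]/td[1]').text
print(first_episode)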

Final scraper code:

Date = []       # to store each episode's air date
Season_ep = []  # to store each episode's season/episode index (e.g. S01E01)
Guests = []     # to store the guests list

for i in range(3, 21):  # li[3] .. li[20] = Seasons 1 .. 18

    # open the i-th season's page
    element = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="page-series"]/div[3]/div[2]/div[2]/div[6]/div[1]/ul/li[' + str(i) + ']/h4/a')))
    element.click()

    for j in range(1, 200):  # more rows than any season has episodes
        try:
            # read all three cells first, then append, so the lists
            # never go out of sync if one cell is missing for a row
            season_ep = driver.find_element_by_xpath('//*[@id="page-season"]/div[4]/div[2]/div[1]/table/tbody/tr[' + str(j) + ']/td[1]').text
            guests = driver.find_element_by_xpath('//*[@id="page-season"]/div[4]/div[2]/div[1]/table/tbody/tr[' + str(j) + ']/td[2]/a').text
            date = driver.find_element_by_xpath('//*[@id="page-season"]/div[4]/div[2]/div[1]/table/tbody/tr[' + str(j) + ']/td[3]/div').text
        except NoSuchElementException:
            continue
        Season_ep.append(season_ep)
        Guests.append(guests)
        Date.append(date)

    print('Season ' + str(i - 2) + ' done')

    # navigate back so the next season link can be clicked
    element = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="page-season"]/div[3]/div/a[3]')))
    element.click()

Keep the Chrome browser launched by this notebook open in front of you, and you will see the magic of Selenium and Python.

Combining everything together and making the final dataframe:

d = zip(Season_ep,Date,Guests)
mapped = list(d)
df = pd.DataFrame(mapped, columns =['Season', 'Date','Guests'])

Let's verify the dataset and see whether it is accurate.

I remember that Deepika Padukone went on Ellen's show. If this is a complete dataset, as I believe, it should contain that entry. Let's also see who the other guests were that day.

df[df['Guests'].str.contains('Deepika Padukone')]

Season 14, Episode 84, on January 18, 2017, was around the release of her Hollywood debut movie XXX: Return of Xander Cage with Vin Diesel.

# Converting the date to a more favourable format
# (pd.to_datetime(df['Date'], format='%B %d, %Y') would also do this in one step)
df['Date'] = df['Date'].apply(lambda x: datetime.datetime.strptime(x, '%B %d, %Y').date())
df['Date'] = pd.to_datetime(df['Date'])
df.tail(10)

GitHub link: Code_and_Extracted_Dataset

As is the case with almost everything, the above results can be obtained in multiple ways; this is just one of them. In the end, I hope you learned something. This was my first article, and I am happy to be able to contribute to this community. Please do give your valuable feedback; I will surely improve on my mistakes.


Shubham Pandey
Analytics Vidhya

Machine Learning enthusiast; I post blogs on Deep Learning techniques, tips, and tricks, trying to make a name for myself in the community.