How to scrape Facebook with Soax Residential Proxy Service

Hey everyone! Have you ever come across a great Facebook page and wanted to grab the images posted on it? To be honest, my favourites are the meme pages on Facebook, and getting the images from those pages helps me share them with my friends across different socials 😆. So in this tutorial, we are going to build a Facebook scraper that visits the profile you specify, scrapes all the image URLs, and stores them in a list. (You can additionally append code for downloading them to your local computer.) Let's start!

Why use Soax as the proxy service

Hmm... this is a bit of an unusual step to start with, but it's an important one, so let me explain. Although this is a small application, we are creating a script that crawls the Facebook website, and for obvious reasons Facebook tries to prevent bots and other scripts from running on its platform. Also, when you run such scripts directly from your machine, you are exposed: every request carries your own IP address, and an attacker might use it to learn about your machine. This is where Soax comes into the picture. You can read more about Soax and its services here. Now that we know why we need Soax, let's go ahead and set it up.

Setting up Soax

  1. Sign up on the website, complete your authentication, and purchase a residential proxy plan (choose one according to your needs and requirements).
  2. Once you have signed up and have a package, it's time to set up your proxy server. There are two main methods for setting up a proxy server. Both are nicely explained here:
    • Setting up the proxy server by whitelisting your IP address
    • Setting up the proxy server using username and password authentication
  3. Once you have set up a proxy server, you are done with the first step, and your traffic can now be routed through the proxy instead of going out directly from your own IP address.
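The two setup methods above differ only in whether the proxy address carries credentials. Here is a minimal sketch of that difference using only the Python standard library. The host and port are the ones used later in this tutorial; the helper names `build_proxy_url` and `make_opener` and the credentials `my-user` / `p@ss:word` are made up for illustration, so swap in the values from your own Soax dashboard.

```python
from urllib.parse import quote
from urllib.request import ProxyHandler, build_opener


def build_proxy_url(host, port, username=None, password=None):
    """Return an http:// proxy URL, with credentials only if both are given."""
    if username and password:
        # Percent-encode credentials so characters like '@' or ':' stay valid.
        auth = f"{quote(username, safe='')}:{quote(password, safe='')}@"
    else:
        # IP-whitelist mode: the proxy recognises your IP, so no login is sent.
        auth = ""
    return f"http://{auth}{host}:{port}"


def make_opener(proxy_url):
    """Build a urllib opener that sends HTTP and HTTPS requests via the proxy."""
    return build_opener(ProxyHandler({"http": proxy_url, "https": proxy_url}))


# Method 1: whitelisted IP, no credentials in the URL.
whitelist_url = build_proxy_url("proxy.soax.com", 10002)

# Method 2: username and password (hypothetical values).
login_url = build_proxy_url("proxy.soax.com", 10002, "my-user", "p@ss:word")

# No request is made here; opener.open(url) would go through the proxy.
opener = make_opener(whitelist_url)
```

Calling `opener.open("https://example.com")` would then send the request through the proxy, which is a quick way to check your setup before involving a browser.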

Now that the proxy connection is ready, let's go ahead and build our application/script for scraping Facebook.

Creating Scraper Application

Great! Now that you have set up the Soax proxy, we are ready to create our scraper application, which scrapes a particular Facebook profile and saves all the image URLs into a list (in this case we are scraping Mark Zuckerberg's profile). We will be using Python packages like Selenium, together with chromedriver, to connect through the proxy, crawl the Facebook website, and scrape the data. (Although you have already set up and connected to the proxy, there is another method where you connect to the Soax proxy server directly from Python code.) Following is the code written for scraping the data; it is also on GitHub:

#imports here
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.wait import WebDriverWait
import time


# Soax Residential proxy connection
PROXY = "proxy.soax.com:10002" 

#specify the path to chromedriver and preferences
PATH = "./chromedriver"
prefs = {"profile.default_content_setting_values.notifications" : 2}

options = webdriver.ChromeOptions()
options.add_argument('--proxy-server={}'.format(PROXY)) #Adding proxy argument to the driver options
options.add_experimental_option("prefs",prefs)
driver = webdriver.Chrome(service=Service(PATH), options=options)


#opening facebook's login page
driver.get("http://www.facebook.com")

username = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='email']"))) #targeting username field
password = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='pass']"))) #targeting password field

username.clear()
username.send_keys("") # Add your account username
password.clear()
password.send_keys("") # Add your account password

#logging in to facebook (click() returns None, so there is nothing to store)
WebDriverWait(driver, 2).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[type='submit']"))).click()

time.sleep(5)
images = [] 

#iterate over both uploaded and tagged images respectively
for i in ["photos_all", "photos_of"]:
    #Scraping Mark Zuckerberg's profile for all photos
    driver.get("https://www.facebook.com/zuck/" + i + "/")
    time.sleep(5)

    #scrolling down the screen
    for j in range(0,1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(10)

    #targeting all the link elements on the page
    anchors = driver.find_elements(By.TAG_NAME, 'a')
    anchors = [a.get_attribute('href') for a in anchors]
    #filtering out image links only
    anchors = [a for a in anchors if str(a).startswith("https://www.facebook.com/photo")]

    print('Found ' + str(len(anchors)) + ' links to images')

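    for a in anchors:
        driver.get(a) #open the photo page before looking for its image
        time.sleep(5)
        img = driver.find_elements(By.TAG_NAME, "img")
        if len(img) > 1: #skip pages where the expected image is missing
            images.append(img[1].get_attribute("src")) #Can change in future to img[?]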

print('Scraped '+ str(len(images)) + ' images!')
if images: #avoid an IndexError when nothing was scraped
    print('Displaying image: ', images[0])
    driver.get(images[0])
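As mentioned at the start, you can append code to save the scraped images to your computer. The following is a minimal sketch, not part of the original script: the helper names `filename_for` and `download_images` and the `images` output folder are my own assumptions. It uses only the standard library, and it downloads directly rather than through the Soax proxy unless you pass in your own `fetch` function.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlopen


def filename_for(url, index):
    """Derive a local file name from the URL path, falling back to .jpg."""
    ext = os.path.splitext(urlparse(url).path)[1] or ".jpg"
    return f"image_{index:04d}{ext}"


def _fetch(url):
    """Default fetcher: plain urlopen, no proxy configured."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def download_images(urls, dest_dir="images", fetch=_fetch):
    """Save each URL into dest_dir and return the list of written paths."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        if not url:
            # get_attribute("src") can return None, so skip empty entries.
            continue
        path = os.path.join(dest_dir, filename_for(url, index))
        with open(path, "wb") as fh:
            fh.write(fetch(url))
        saved.append(path)
    return saved
```

With the script above you would call `download_images(images)` after the scraping loop. Keeping `fetch` as a parameter makes it easy to plug in a proxied opener or to try the function without touching the network.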

Scraping and testing

It's finally time to test and see how our application scrapes the data from Facebook and how its traffic gets logged on the proxy server. The terminal output shows the information scraped from the Facebook profile. [Image: terminal output]

Wondering how it got logged on the Soax proxy server? The Soax dashboard helps you analyze your application's monthly traffic. Additionally, you can set up IP rotation, which changes your IP after a specified time. Take a look at the images below. [Image: Monthly traffic spent] [Image: IP rotation]

Thank you so much for reading ❤️
