How to scrape Instagram with Soax Residential Proxy Service

Hey everyone! Love cats? What if I told you that you can get as many cat images as you want from Instagram? In this tutorial, we are going to build an Instagram scraper that visits the cats hashtag (you can use your own hashtag too), scrapes the images, and stores their URLs in a list for you. (You can additionally append code to download them to your local machine.) Let's start!

Why use Soax as a proxy service

Hmm, this is a bit of an unusual step to start with, but it's an important one, so let me explain. Although this is a small application, we are creating a script that crawls the Instagram website, and for obvious reasons Instagram tries to prevent bots and other scripts from running on its platform. Also, when you run such scripts directly from your machine, every request carries your own IP address, which leaves you more exposed, as an attacker might get access to information about your machine. This is where Soax comes into the picture. You can read more about Soax and its services here. Now that we know why we need Soax, let's go ahead and set it up.

Setting up Soax

  1. Sign up on the website, complete your authentication, and purchase a residential proxy plan (choose one that fits your needs and requirements).
  2. Once you have signed up and have a package, it's time to set up your proxy server. There are two main ways to do this, and both are nicely explained here:
    • Setting up the proxy server by whitelisting your IP address
    • Setting up the proxy server using username and password authentication
  3. Once you have set up a proxy server, you are done with the first step: your scraper's traffic can now go out through Soax's residential IPs instead of your own address.
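
As a rough sketch of how the two authentication methods differ on the client side, the helper below builds the proxy address for each case. The host and port are the ones used later in this tutorial; the function names `build_proxy_url` and `make_opener` and the credentials `my_user` / `my_pass` are hypothetical placeholders, not something taken from the Soax documentation.

```python
import urllib.request
from typing import Optional
from urllib.parse import quote


def build_proxy_url(host: str, port: int,
                    username: Optional[str] = None,
                    password: Optional[str] = None) -> str:
    """Return a proxy URL, embedding credentials only when both are given."""
    if username and password:
        # Percent-encode credentials so characters like '@' or ':' stay valid.
        return "http://{}:{}@{}:{}".format(
            quote(username, safe=""), quote(password, safe=""), host, port
        )
    # With IP whitelisting the proxy recognises your address, so no credentials.
    return "http://{}:{}".format(host, port)


def make_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Build a urllib opener that routes HTTP and HTTPS requests via the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Hypothetical values for illustration only.
whitelisted = build_proxy_url("proxy.soax.com", 10002)
authenticated = build_proxy_url("proxy.soax.com", 10002, "my_user", "my_pass")
print(whitelisted)
print(authenticated)
```

Keep in mind that Chrome's `--proxy-server` flag takes only a host and port, not credentials, so IP whitelisting is probably the simpler fit for a Selenium script, while the username and password form is handy for plain Python HTTP clients.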

Now that the proxy is set up, let's go ahead and build our script for scraping Instagram.

Creating the Scraper Application

Great, now that you have set up the Soax proxy, we are ready to create our scraper application, which scrapes images from a particular Instagram hashtag (in this case, the cats hashtag) and saves all the image URLs into a list. We will use the Selenium Python package together with chromedriver to connect through the proxy, crawl the Instagram website, and scrape the data. (Although you have already set up and connected to the proxy, there is also another method where you connect to the Soax proxy server directly from Python code.) The code for scraping the data follows; it is also available on GitHub.


#imports here
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.wait import WebDriverWait

# Soax Residential proxy connection
PROXY = "proxy.soax.com:10002" 

PATH = "./chromedriver"
options = webdriver.ChromeOptions()
# Route all browser traffic through the Soax proxy
options.add_argument('--proxy-server={}'.format(PROXY))
driver = webdriver.Chrome(service=Service(PATH), options=options)

#open the webpage
driver.get("http://www.instagram.com/login")

username = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='username']"))) #targeting username field
password = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='password']"))) #targeting password field

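#enter username and password (fill in your own Instagram credentials)
username.clear()
username.send_keys("")  # your Instagram username goes here
password.clear()
password.send_keys("")  # your Instagram password goes here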

#target the login button and click it
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[type='submit']"))).click()

#target the search input field
searchbox = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//input[@placeholder='Search']")))
searchbox.clear()

#search for the hashtag cat
keyword = "#cat"
searchbox.send_keys(keyword)

# Wait 3 seconds for the suggestions to load, then press Enter twice to open the hashtag page
time.sleep(3)
searchbox.send_keys(Keys.ENTER)
time.sleep(3)
searchbox.send_keys(Keys.ENTER)
time.sleep(3)

#scroll down to scrape more images
driver.execute_script("window.scrollTo(0, 4000);")

#target all images on the page
time.sleep(5)
images = driver.find_elements(By.TAG_NAME,'img')
images = [image.get_attribute('src') for image in images]
images = images[:-2]

# Number of image links scraped
print('Number of scraped images: ', len(images))

# Displaying one of the scraped links --> Can be downloaded too
# (guarded so the script does not crash when fewer than 5 images were found)
if len(images) > 4:
    print("Displaying image: ", images[4])
    driver.get(images[4])
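
The intro mentioned that you can also download the images to your machine. Here is a minimal sketch of how that could look, assuming the `images` list from the script above. The helper names `clean_urls`, `fetch_bytes`, and `download_images`, the `cat_images` folder, and the `.jpg` fallback extension are my own choices for illustration. Note that this version fetches the files directly with `urllib`, not through the browser session or the proxy.

```python
import os
import urllib.request
from urllib.parse import urlparse


def clean_urls(urls):
    """Drop empty values, non-HTTP sources and duplicates, keeping the order."""
    seen = set()
    cleaned = []
    for url in urls:
        if url and url.startswith("http") and url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned


def fetch_bytes(url):
    """Download one URL and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_images(urls, dest_dir="cat_images", fetch=fetch_bytes):
    """Save every usable URL into dest_dir and return the written file paths."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for index, url in enumerate(clean_urls(urls)):
        # Take the extension from the URL path, falling back to .jpg
        ext = os.path.splitext(urlparse(url).path)[1] or ".jpg"
        path = os.path.join(dest_dir, "image_{:03d}{}".format(index, ext))
        with open(path, "wb") as handle:
            handle.write(fetch(url))
        saved.append(path)
    return saved
```

Calling `download_images(images)` at the end of the scraper would then write the files into a `cat_images` folder next to the script. The `fetch` parameter is there so you can swap in a different downloader, for example one that goes through the proxy.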

Scraping and testing

It's finally time to test and see how our application scrapes the data from Instagram and how its traffic shows up on the proxy server. The terminal screenshot below shows the scraped information for the Instagram hashtag. [Image: terminal output of the scraper]

Wondering how this shows up on the Soax side? The Soax dashboard helps you analyze the monthly traffic used by your application. Additionally, you can set up IP rotation, which switches your IP address after a specified time. Take a look at the images below. [Image: Monthly traffic spent] [Image: IP rotation]

Thank you so much for reading ❤️
