Scraping with ChromeDriver in Python

Environment

What I want to do

To scrape web articles on sites that have anti-scraping measures, I want to scrape with headless Chrome, driven by ChromeDriver from Python.

Prerequisite knowledge

  1. About the browser driver. In short, a browser driver is the tool that lets a program control the browser, so the browser can be operated from the command line (CUI) instead of through its GUI.

  2. Relationship between the DNS server and the local hosts file. When you access a site by domain name in a browser, the PC asks a DNS server for the matching IP address, then uses that address to reach the website, and the site is displayed in the browser. However, if you write the domain and its IP address in the Mac's hosts file, the IP address is obtained without contacting the DNS server.
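The lookup described in item 2 can be observed from Python. This is only a minimal sketch using the standard library's socket.gethostbyname; the helper name resolve is made up for this illustration, and it assumes localhost is listed in the hosts file, as it is by default on a Mac.

```python
import socket


def resolve(hostname):
    """Return the IPv4 address the OS resolver gives for hostname, or None."""
    try:
        # The OS resolver is used here; on a typical Mac it consults
        # /etc/hosts before asking a DNS server.
        return socket.gethostbyname(hostname)
    except OSError:
        # socket.gaierror (a subclass of OSError) is raised when the name cannot be resolved.
        return None


if __name__ == '__main__':
    print(resolve('localhost'))
```

If you add a line for some domain to the hosts file, calling resolve with that domain should return the address you wrote there, not the one the DNS server would give.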

Reference articles

Selenium and Google Spreadsheets (4) "Until you start using Chrome Driver" (https://bitwave.showcase-tv.com/selenium%E3%81%A8google-spreadsheets4-%E3%80%8Cchrome-driver%E3%82%92%E4%BD%BF%E3%81%84%E3%81%AF%E3%81%98%E3%82%81%E3%82%8B%E3%81%BE%E3%81%A7%E7%B7%A8%E3%80%8D/)

For DNS servers, this article: "[Illustration] What is a DNS server? How to set / change and check"

For the hosts file, this article is recommended: "How to rewrite / edit hosts file on Mac! What should I do if it is not reflected?"

Preparation

Check the contents of the hosts file

Open the file.

$ sudo vi /etc/hosts

Next, check that the contents of the hosts file look like this.

##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting.  Do not change this entry.
##
127.0.0.1       localhost
255.255.255.255 broadcasthost
::1             localhost
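If you would rather check the entries from a script than by eye, the file format is simple enough to parse. The following is a rough sketch, not an official parser: parse_hosts and SAMPLE are names invented for this example, and SAMPLE is a shortened copy of the contents shown above.

```python
def parse_hosts(text):
    """Map each hostname in hosts-file text to the list of IP addresses listed for it."""
    mapping = {}
    for line in text.splitlines():
        # Drop comments (everything after '#') and surrounding whitespace.
        entry = line.split('#', 1)[0].strip()
        if not entry:
            continue
        # The first field is the IP address; the rest are hostnames for it.
        ip, *names = entry.split()
        for name in names:
            mapping.setdefault(name, []).append(ip)
    return mapping


# Shortened copy of the default hosts file shown above.
SAMPLE = """\
##
# Host Database
##
127.0.0.1       localhost
255.255.255.255 broadcasthost
::1             localhost
"""

if __name__ == '__main__':
    for name, ips in parse_hosts(SAMPLE).items():
        print(name, ips)
```

To check the real file, pass the text of /etc/hosts to parse_hosts in place of SAMPLE; reading the file does not need sudo.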

Also, install the ChromeDriver version that matches the version of Chrome installed in your Applications folder (in my case it was 78.0.3904.97). You can get it from the Selenium site: ChromeDriver - WebDriver for Chrome
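Checking that the two versions line up can be done with a small comparison. This is a sketch under assumptions: the function names are made up, the driver version strings are example inputs only, and comparing the first three components (major, minor, build) is my assumption about how strict the match should be. Set parts=1 if you only care about the major version.

```python
def version_prefix(version, parts=3):
    """Return the first `parts` dot-separated components of a version string as ints."""
    return tuple(int(p) for p in version.strip().split('.')[:parts])


def versions_match(chrome_version, driver_version, parts=3):
    """True if both version strings agree on their first `parts` components."""
    return version_prefix(chrome_version, parts) == version_prefix(driver_version, parts)


if __name__ == '__main__':
    # 78.0.3904.97 is the Chrome version mentioned in the text;
    # the driver versions below are example inputs for illustration.
    print(versions_match('78.0.3904.97', '78.0.3904.70'))
    print(versions_match('78.0.3904.97', '79.0.3945.36'))
```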

Source code

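# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

if __name__ == '__main__':
    url = "Scraped site url"
    options = Options()
    # Run Chrome in headless mode (no window)
    options.add_argument('--headless')
    # executable_path is the Selenium 3 style argument used when this was written;
    # newer Selenium 4 releases expect a Service object instead.
    driver = webdriver.Chrome(executable_path='Absolute path to the chromedriver executable', options=options)
    try:
        driver.get(url)
        # Encode the rendered page source as UTF-8 bytes
        html = driver.page_source.encode('utf-8')
    finally:
        # Always close the browser, even if loading the page fails
        driver.quit()
    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(html, 'html.parser')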
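Once soup exists, the actual extraction is ordinary BeautifulSoup work. As a small illustration of what could come next, the sketch below pulls the page title and every link out of a document. The HTML string is invented for the example and stands in for driver.page_source, so it runs without a browser.

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source; this HTML is invented for the example.
html = """
<html><head><title>Sample article</title></head>
<body>
  <h1>Sample article</h1>
  <a href="/page/1">First</a>
  <a href="/page/2">Second</a>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Text inside the <title> tag
title = soup.title.string
# The href of every <a> tag that has one, in document order
links = [a['href'] for a in soup.find_all('a', href=True)]

if __name__ == '__main__':
    print(title)
    print(links)
```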

I usually use urllib.request, but for sites with anti-scraping measures, using Selenium like this may solve the problem!
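One way to combine the two approaches is to try urllib.request first and only start a browser when the plain request is refused. This is just a sketch of that idea: fetch_with_urllib and fetch_html are hypothetical helper names, and browser_fetch is assumed to be any function you supply that loads the URL with Selenium and returns the page source.

```python
import urllib.error
import urllib.request


def fetch_with_urllib(url):
    """Plain HTTP fetch; raises urllib.error.URLError (including HTTPError) on failure."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def fetch_html(url, simple_fetch=fetch_with_urllib, browser_fetch=None):
    """Try the cheap fetch first; fall back to a browser-based fetch if it fails."""
    try:
        return simple_fetch(url)
    except urllib.error.URLError:
        if browser_fetch is None:
            # No fallback was supplied, so let the original error propagate.
            raise
        # e.g. a function that calls driver.get(url) and returns driver.page_source
        return browser_fetch(url)
```

Passing the fetchers in as arguments keeps the fallback logic separate from Selenium itself, which also makes it easy to try out with stand-in functions before pointing it at a real site.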
