[PYTHON] Introduction to Web Scraping

Scraping

A somewhat primitive method

When extracting a specific part of a web page

Python3


import requests

r = requests.get('https://nikkei225jp.com/chart/')
text = r.text  # The response body (HTML) as a string

# Take the text between each opening tag and the next </div>
date = text.split('<div class=wtimeT>')[1].split('</div>')[0]
nikkei = text.split('<div class=if_cur>')[1].split('</div>')[0].replace(',', '')
dau = text.split('<div class=if_cur>')[2].split('</div>')[0].replace(',', '')
kawase = text.split('<div class=if_cur>')[3].split('</div>')[0].replace(',', '')

print('Today is', date)
print('Nikkei Stock Average is', nikkei, 'yen')
print('Dow Jones Industrial Average is', dau, 'yen')
print('Currency dollar is', kawase, 'yen')

# Save the values to a CSV file; the with block closes the file automatically
with open('shares.csv', 'w') as a:
    a.write('Date and time,Nikkei Stock Average,Dow Jones Industrial Average,Currency dollar\n')
    a.write(date + ',' + nikkei + ',' + dau + ',' + kawase + '\n')

Result (command line)


Today is 2019/03/23
Nikkei Stock Average is 21627.34 yen
Dow Jones Industrial Average is 25502.32 yen
Currency dollar is 109.93 yen

The output should look like this.

Results (shares.csv)


Date and time,Nikkei Stock Average,Dow Jones Industrial Average,Currency dollar
2019/03/23,21627.34,25502.32,109.93

I confirmed that a file like this has been created.
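
The same kind of file could also be written with the standard csv module, which takes care of quoting if a value ever contains a comma. This is only a sketch of an alternative, not what the program above does: the values are hard-coded from the sample output, and the file name shares_csv_module.csv is a made-up name chosen so that the original file is not overwritten.

```python
import csv

header = ['Date and time', 'Nikkei Stock Average',
          'Dow Jones Industrial Average', 'Currency dollar']
row = ['2019/03/23', '21627.34', '25502.32', '109.93']

# newline='' lets the csv module control line endings itself.
with open('shares_csv_module.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerow(row)

# Read the file back to confirm what was written.
with open('shares_csv_module.csv', newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))
print(rows[1])
```

Reading the file back with csv.reader returns each line as a list of strings, which is handy if the values are processed again later.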

What the program did

From the Nikkei Stock Average page, the program extracted the ***date and time, Nikkei Stock Average, Dow Jones Industrial Average, and dollar exchange rate***, printed them, and saved them to a file.

[Screenshot 2019-03-23 2.37.50.png] Source: https://nikkei225jp.com/chart/

[Screenshot 2019-03-23 2.38.10.png] The program extracts the information in this part of the page.

For web scraping there are also convenient tools such as ***Beautiful Soup*** and ***Selenium***.

This time, we took the primitive approach of using only ***requests***.

The flow is as follows.

r = requests.get('URL of the page you want to scrape')

The response (page information) returned by this call is stored in the variable ***r***.

text = r.text

The body (HTML) of the response ***r*** is then stored as a string in ***text***.
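
As a rough sketch of the same flow with basic error handling, the request can be wrapped in a small function. The function name fetch_html and its get parameter are hypothetical and exist only for this illustration (the parameter makes it possible to try the function without network access); timeout and raise_for_status() are part of requests.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Fetch a page and return its HTML as a string, raising on HTTP errors."""
    # timeout keeps the script from hanging forever on an unresponsive server.
    response = get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

With this helper, text = fetch_html('https://nikkei225jp.com/chart/') would stand in for the two lines above.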

[Screenshot 2019-03-23 3.26.59.png]

For example, to extract the ***Nikkei Stock Average*** value, ***21,627.34***, select the information you want to extract as shown above and look it up with "**Inspect**" or "**View Page Source**" in your browser.

<div class="if_cur">21,627.34</div>

As you can see, the value is wrapped in a div with the class ***if_cur***.

nikkei   = text.split('<div class=if_cur>')[1].split('</div>')[0].replace(',','')

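This single line extracts the contents into ***nikkei***: the first split cuts the page at the opening tag, and the second split keeps only what comes before the next closing </div>.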
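
To see what this chained split does, here is a minimal sketch on a small hand-written HTML string. The sample HTML and the helper name extract_between are assumptions for illustration, not part of the original program; the marker uses the same unquoted attribute form that the program searches for.

```python
# Hypothetical sample standing in for r.text; the real page is much larger.
sample_html = (
    '<div class=wtimeT>2019/03/23</div>'
    '<div class=if_cur>21,627.34</div>'
    '<div class=if_cur>25,502.32</div>'
    '<div class=if_cur>109.93</div>'
)


def extract_between(text, start_marker, end_marker, occurrence=1):
    """Return the text between the n-th start_marker and the next end_marker."""
    parts = text.split(start_marker)
    if len(parts) <= occurrence:
        raise ValueError(f'{start_marker!r} occurs fewer than {occurrence} times')
    # parts[occurrence] begins right after the n-th marker, so cutting at the
    # first end_marker leaves only the element's contents.
    return parts[occurrence].split(end_marker)[0]


nikkei_raw = extract_between(sample_html, '<div class=if_cur>', '</div>', 1)
kawase_raw = extract_between(sample_html, '<div class=if_cur>', '</div>', 3)
print(nikkei_raw)  # 21,627.34
print(kawase_raw)  # 109.93
```

The explicit length check turns the bare IndexError you would otherwise get when the page layout changes into a more descriptive error.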

Example 2) Extracting the date and time [Screenshot 2020-04-14 18.27.54.png]

.replace(',', '') simply removes the commas (,), because a comma inside a number gets in the way. (If you later want to convert the value to a numeric type such as int or float and do arithmetic on it, the conversion fails while commas remain.)
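
As a small illustration of why the commas have to go, the sketch below converts a scraped string to a number. Because the values here contain a decimal point, it uses float rather than int; the variable names are only for illustration.

```python
raw = '21,627.34'  # the string as it appears on the page

try:
    float(raw)  # fails: float() does not accept thousands separators
except ValueError as err:
    print('conversion failed:', err)

# After removing the commas the string converts cleanly.
nikkei_value = float(raw.replace(',', ''))
print(nikkei_value > 20000)  # True
```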

The ***Dow Jones Industrial Average***, the ***dollar exchange rate***, and the ***date and time*** are extracted the same way.

This is the most primitive method, but it can handle a wide variety of patterns.

Smart method

This time, we **extract all the links (URLs) on the Nikkei Stock Average page**.

Writing code to extract them one by one as before would never end.

So instead, we use a pattern that finds all the ***a*** tags (the tags that carry URLs) at once and loops over them.

This is where ***Beautiful Soup*** comes in. Install it with:

$ pip install beautifulsoup4

The code below just adds the ***a*** tag extraction to the previous code.

import requests
from bs4 import BeautifulSoup

r = requests.get('https://nikkei225jp.com/chart/')
text = r.text
date = text.split('<div class=wtimeT>')[1].split('</div>')[0]
nikkei = text.split('<div class=if_cur>')[1].split('</div>')[0].replace(',', '')
dau = text.split('<div class=if_cur>')[2].split('</div>')[0].replace(',', '')
kawase = text.split('<div class=if_cur>')[3].split('</div>')[0].replace(',', '')

print('Today is', date)
print('Nikkei Stock Average is', nikkei, 'yen')
print('Dow Jones Industrial Average is', dau, 'yen')
print('Currency dollar is', kawase, 'yen')

with open('shares.csv', 'w') as a:
    a.write('Date and time,Nikkei Stock Average,Dow Jones Industrial Average,Currency dollar\n')
    a.write(date + ',' + nikkei + ',' + dau + ',' + kawase + '\n')


# The lines below are the addition
soup = BeautifulSoup(r.text, 'html.parser')

for a in soup.find_all('a'):
    href = a.get('href', '')  # '' when the tag has no href attribute
    if 'http' in href:  # This time, limit to a tags whose link contains http
        #print(a.text) #Contents of a tag (title)
        print(href)  # URL

Result (command line)


Today is 2019/03/23
Nikkei Stock Average is 21627.34 yen
Dow Jones Industrial Average is 25502.32 yen
Currency dollar is 109.93 yen
http://xn--u9jt60g57a227ciso.com/
http://quote.jpx.co.jp/jpx/template/quote.cgi?F=tmp/real_index&QCODE=155
http://klug-fx.jp/holiday/
https://jp.investing.com/holiday-calendar/
https://db.225225.jp/
https://nikkei225jp.com/chart/
https://nikkei225jp.com/nasdaq/
https://nikkei225jp.com/fx/
https://ch225.com/
https://225225.jp/
https://nikkei225jp.com/cme/
https://adr-stock.com/
http://fx.minkabu.jp/indicators/calendar
http://jp.reuters.com/investing/news/economic
http://www3.nhk.or.jp/news/html/20190323/k10011858101000.html
http://moneyzine.jp/article/detail/215915
http://feeds.reuters.com/~r/reuters/JPBusinessNews/~3/19cDqM88PGE/graphics-frb-idJPKCN1R30VK
http://feeds.reuters.com/~r/reuters/JPBusinessNews/~3/wqkbZgbeMMA/asia-companies-outlook-analysis-idJPKCN1R30Y2
http://www.asahi.com/articles/ASM3D3S9TM3DULFA00N.html?ref=rss
http://diamond.jp/articles/-/197806
http://www.asahi.com/articles/ASM3R1SPZM3RUHBI003.html?ref=rss
https://zai.diamond.jp/list/fxnews/detail?id=312805&utm_source=zaifxrss&utm_medium=rss&utm_term=zaifxnews&utm_campaign=zaifxrss
https://zai.diamond.jp/list/fxnews/detail?id=312804&utm_source=zaifxrss&utm_medium=rss&utm_term=zaifxnews&utm_campaign=zaifxrss
https://zai.diamond.jp/list/fxnews/detail?id=312803&utm_source=zaifxrss&utm_medium=rss&utm_term=zaifxnews&utm_campaign=zaifxrss
https://zai.diamond.jp/list/fxnews/detail?id=312802&utm_source=zaifxrss&utm_medium=rss&utm_term=zaifxnews&utm_campaign=zaifxrss
https://zai.diamond.jp/list/fxnews/detail?id=312801&utm_source=zaifxrss&utm_medium=rss&utm_term=zaifxnews&utm_campaign=zaifxrss
http://www3.nhk.or.jp/news/html/20190322/k10011857501000.html
http://diamond.jp/articles/-/197800
http://feeds.reuters.com/~r/reuters/JPMarketNews/~3/k7hVYUD0Rlw/usa-trump-russia-idJPL3N21949Q
http://feeds.reuters.com/~r/reuters/JPBusinessNews/~3/7SX2E12xQqA/ny-market-summary-0322-idJPKCN1R32TP
https://www.nikkei.com/article/DGXLASM7IAA05_T20C19A3000000/
http://feeds.reuters.com/~r/reuters/JPCompanyNews/~3/vP8IyPxDb_w/EU-HUAWEI-TECH--idJPL3N21946T
http://feeds.reuters.com/~r/reuters/JPBusinessNews/~3/dhPXy0bfxg8/ny-stx-us-idJPKCN1R32SJ
http://feeds.reuters.com/~r/reuters/JPBusinessNews/~3/f5sMkCorXO8/ny-forex-idJPKCN1R32SB
http://feeds.reuters.com/~r/reuters/JPMarketNews/~3/fYZ-Sat0U3Y/ny-markets-summary-idJPL3N2194BF
http://feeds.reuters.com/~r/reuters/JPCompanyNews/~3/unMMYgBSv38/ny-stx-us-idJPL3N21946R
http://feeds.reuters.com/~r/reuters/JPCompanyNews/~3/FIxyRRMByHY/pinterest-ipo-idJPL3N21949V
https://zai.diamond.jp/list/fxnews/detail?id=312800&utm_source=zaifxrss&utm_medium=rss&utm_term=zaifxnews&utm_campaign=zaifxrss
https://www.nikkei.com/article/DGXLASH2ICE01_T20C19A3000000/
http://www.traders.co.jp/foreign_stocks/market_s.asp#today
http://www.gaitame.com/market/yosoku.html
http://market.fisco.co.jp/update/index.jsp
http://www.traderswebfx.jp/news/default.aspx?ID=7#newslist
http://kabuyoho.ifis.co.jp/
http://www.tokyoipo.com/top/iposche/index.php?j_e=J
http://klug-fx.jp/holiday/
https://jp.investing.com/holiday-calendar/
http://world.honda.com/worldclock/
https://news.yahoo.co.jp/search?p=%E6%97%A5%E7%B5%8C%E5%B9%B3%E5%9D%87&ei=utf-8&fr=news_sw
https://www.google.co.jp/search?hl=ja&gl=jp&tbm=nws&authuser=0&q=%E6%97%A5%E7%B5%8C%E5%B9%B3%E5%9D%87&oq=%E6%97%A5%E7%B5%8C%E5%B9%B3%E5%9D%87&gs_l=news-cc.1.0.43j43i53.2284.2284.0.5545.1.1.0.0.0.0.56.56.1.1.0...0.0...1ac.1.oMorwBF68ss#q=%E6%97%A5%E7%B5%8C%E5%B9%B3%E5%9D%87&hl=ja&gl=jp&authuser=0&tbm=nws&tbs=sbd:1
http://chart.fisco.co.jp/fisco/cgi-bin/index.cgi
http://chart.fisco.co.jp/fisco/cgi-bin/index.cgi
https://www.dukascopy.jp/

You can see that the links on the page have been extracted.

#print(a.text) #Contents of a tag (title)

If you uncomment this line, you also get the title (link text) of each ***a*** tag.

Today is 2019/03/23
Nikkei Stock Average is 21627.34 yen
Dow Jones Industrial Average is 25502.32 yen
Currency dollar is 109.93 yen
World stock prices.com
http://xn--u9jt60g57a227ciso.com/
east
http://quote.jpx.co.jp/jpx/template/quote.cgi?F=tmp/real_index&QCODE=155
[Klug]
http://klug-fx.jp/holiday/
[Investing]
https://jp.investing.com/holiday-calendar/
Real-time market conditions Parts
https://db.225225.jp/
Nikkei Stock Average
https://nikkei225jp.com/chart/
Dow Jones Industrial Average
https://nikkei225jp.com/nasdaq/
Exchange dollar yen
https://nikkei225jp.com/fx/
World stock price
https://ch225.com/
Mobile phone
https://225225.jp/
CME
https://nikkei225jp.com/cme/
ADR
https://adr-stock.com/
Everyone's exchange
http://fx.minkabu.jp/indicators/calendar
Reuters
http://jp.reuters.com/investing/news/economic
Movement to strengthen life support services by expanding acceptance of foreign human resources
http://www3.nhk.or.jp/news/html/20190323/k10011858101000.html
Bank deposits, bank earnings cycle after the introduction of negative interest rates, which exceeded the same month last year for 149 consecutive months ...
http://moneyzine.jp/article/detail/215915
Angle:Fed dovish shift, positive impact on US households
http://feeds.reuters.com/~r/reuters/JPBusinessNews/~3/19cDqM88PGE/graphics-frb-idJPKCN1R30VK
focus:Capital investment by Asian companies to decline for the first time in 3 years due to slowdown in China
http://feeds.reuters.com/~r/reuters/JPBusinessNews/~3/wqkbZgbeMMA/asia-companies-outlook-analysis-idJPKCN1R30Y2
Subsidy system for nuclear power support Ministry of Economy, Trade and Industry aims to establish in 2020
http://www.asahi.com/articles/ASM3D3S9TM3DULFA00N.html?ref=rss
NY market fell sharply on 22nd-Latest stock news
http://diamond.jp/articles/-/197806
NY Dow plunges, 460 dollars depreciation fears of slowing global economy
http://www.asahi.com/articles/ASM3R1SPZM3RUHBI003.html?ref=rss
Risk-averse funds flow in from continued growth, low stock prices and high bond prices
https://zai.diamond.jp/list/fxnews/detail?id=312805&utm_source=zaifxrss&utm_medium=rss&utm_term=zaifxnews&utm_campaign=zaifxrss
NY gold futures continue to grow, risk-averse funds flow in from stock prices and bond prices
https://zai.diamond.jp/list/fxnews/detail?id=312804&utm_source=zaifxrss&utm_medium=rss&utm_term=zaifxnews&utm_campaign=zaifxrss
Concerns that NY crude oil futures will continue to fall and the deterioration of the world economy will worsen
https://zai.diamond.jp/list/fxnews/detail?id=312803&utm_source=zaifxrss&utm_medium=rss&utm_term=zaifxnews&utm_campaign=zaifxrss
NY market trends(End of transaction):Dow 460.19 dollars cheap(Breaking news), Crude oil futures 0.94 dollars cheap
https://zai.diamond.jp/list/fxnews/detail?id=312802&utm_source=zaifxrss&utm_medium=rss&utm_term=zaifxnews&utm_campaign=zaifxrss
Yen against world currencies:Against dollar 0.81%High, against Euro 1.43%High
https://zai.diamond.jp/list/fxnews/detail?id=312801&utm_source=zaifxrss&utm_medium=rss&utm_term=zaifxnews&utm_campaign=zaifxrss
Creating a text for acquiring a new status of residence for foreign human resources Industry group of restaurant companies
http://www3.nhk.or.jp/news/html/20190322/k10011857501000.html
ECB has no intention of issuing digital currency=Director of Melshu [Fisco Bitcoin New ...
http://diamond.jp/articles/-/197800
UPDATE 1-U.S. Special Prosecutor Submits Russian Suspicion Investigation Report No Further Prosecution Proposal
http://feeds.reuters.com/~r/reuters/JPMarketNews/~3/k7hVYUD0Rlw/usa-trump-russia-idJPL3N21949Q
NY Market Summary(The 22nd)
http://feeds.reuters.com/~r/reuters/JPBusinessNews/~3/7SX2E12xQqA/ny-market-summary-0322-idJPKCN1R32TP
NY yen, repulsion 1 dollar=109.90 yen?Ends at 110.00 yen, the yen strengthens for the first time in a month
https://www.nikkei.com/article/DGXLASM7IAA05_T20C19A3000000/
resend-EXCLUSIVE-European Commission to make data sharing proposal without eliminating Huawei from 5G=Seki ...
http://feeds.reuters.com/~r/reuters/JPCompanyNews/~3/vP8IyPxDb_w/EU-HUAWEI-TECH--idJPL3N21946T
US stock market plunges, global economic downturn intensifies
http://feeds.reuters.com/~r/reuters/JPBusinessNews/~3/dhPXy0bfxg8/ny-stx-us-idJPKCN1R32SJ
The dollar fell against the yen, and economic concerns increased due to the reversal of US long-term interest rates=NY market
http://feeds.reuters.com/~r/reuters/JPBusinessNews/~3/f5sMkCorXO8/ny-forex-idJPKCN1R32SB
NY Market Summary(The 22nd)
http://feeds.reuters.com/~r/reuters/JPMarketNews/~3/fYZ-Sat0U3Y/ny-markets-summary-idJPL3N2194BF
U.S. stock market=Sudden fall, global economic downturn Anxiety grows stronger
http://feeds.reuters.com/~r/reuters/JPCompanyNews/~3/unMMYgBSv38/ny-stx-us-idJPL3N21946R
UPDATE 1-US Image Sharing Pinterest Apply for IPO, $ 100 Million?
http://feeds.reuters.com/~r/reuters/JPCompanyNews/~3/FIxyRRMByHY/pinterest-ipo-idJPL3N21949V
NY Marquette Digest ・ 22nd Stocks fell sharply ・ Euro fell ・ Lira plunged
https://zai.diamond.jp/list/fxnews/detail?id=312800&utm_source=zaifxrss&utm_medium=rss&utm_term=zaifxnews&utm_campaign=zaifxrss
Chicago Japanese Equity Futures Overview 22nd
https://www.nikkei.com/article/DGXLASH2ICE01_T20C19A3000000/
Schedule
http://www.traders.co.jp/foreign_stocks/market_s.asp#today
Economic indicator schedule
http://www.gaitame.com/market/yosoku.html
Strength materials / notes
http://market.fisco.co.jp/update/index.jsp
VIP remarks
http://www.traderswebfx.jp/news/default.aspx?ID=7#newslist
Settlement schedule
http://kabuyoho.ifis.co.jp/
IPO schedule
http://www.tokyoipo.com/top/iposche/index.php?j_e=J
Market holiday
http://klug-fx.jp/holiday/
Market holiday
https://jp.investing.com/holiday-calendar/
World clock
http://world.honda.com/worldclock/
Yahoo!News "Nikkei 225"
https://news.yahoo.co.jp/search?p=%E6%97%A5%E7%B5%8C%E5%B9%B3%E5%9D%87&ei=utf-8&fr=news_sw
Google News "Nikkei 225"
https://www.google.co.jp/search?hl=ja&gl=jp&tbm=nws&authuser=0&q=%E6%97%A5%E7%B5%8C%E5%B9%B3%E5%9D%87&oq=%E6%97%A5%E7%B5%8C%E5%B9%B3%E5%9D%87&gs_l=news-cc.1.0.43j43i53.2284.2284.0.5545.1.1.0.0.0.0.56.56.1.1.0...0.0...1ac.1.oMorwBF68ss#q=%E6%97%A5%E7%B5%8C%E5%B9%B3%E5%9D%87&hl=ja&gl=jp&authuser=0&tbm=nws&tbs=sbd:1
FISCO
http://chart.fisco.co.jp/fisco/cgi-bin/index.cgi

http://chart.fisco.co.jp/fisco/cgi-bin/index.cgi

https://www.dukascopy.jp/

The URLs and titles are extracted like this.

Some ***a*** tags have an ***http*** link but no title text, which is why a few URLs above are preceded by an empty line.
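
One way to deal with such cases is to collect the links into a list and substitute a placeholder when the title is empty. The sketch below runs on a small hand-written HTML snippet rather than the live page; the snippet, the function name collect_links, and the "(no title)" placeholder are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for r.text.
sample_html = '''
<a href="https://nikkei225jp.com/chart/">Nikkei Stock Average</a>
<a href="https://www.dukascopy.jp/"><img src="banner.png"></a>
<a href="/relative/page.html">Relative link</a>
<a name="top">Anchor without href</a>
'''


def collect_links(html):
    """Return (title, url) pairs for every a tag whose href starts with http."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for a in soup.find_all('a'):
        href = a.get('href')  # None when the tag has no href attribute
        if href is None or not href.startswith('http'):
            continue
        # Image-only links have no text, so fall back to a placeholder.
        title = a.get_text(strip=True) or '(no title)'
        links.append((title, href))
    return links


for title, url in collect_links(sample_html):
    print(title, url)
```

Keeping the results in a list of pairs, instead of printing inside the loop, makes it easy to write them to a CSV file afterwards.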
