[PYTHON] How to scrape horse racing data with BeautifulSoup

Purpose

Predict horse races with machine learning and aim for a recovery rate of 100%.

What to do this time

In the previous article, I scraped all of the 2019 race results from netkeiba.com. (Screenshot of the scraped results.) This time, in addition to that, I would like to scrape data such as the race date and track conditions. (Screenshot of the race information shown on netkeiba.)

Source code

As before, we create a function that takes a list of race_ids and returns the scraped information for each race as a dictionary.

import requests
from bs4 import BeautifulSoup
import time
from tqdm.notebook import tqdm
import re

def scrape_race_info(race_id_list):
    race_infos = {}
    for race_id in tqdm(race_id_list):
        try:
            url = "https://db.netkeiba.com/race/" + race_id
            html = requests.get(url)
            html.encoding = "EUC-JP"
            soup = BeautifulSoup(html.text, "html.parser")

            texts = (
                soup.find("div", attrs={"class": "data_intro"}).find_all("p")[0].text
                + soup.find("div", attrs={"class": "data_intro"}).find_all("p")[1].text
            )
            info = re.findall(r"\w+", texts)  # split the combined text into word tokens
            info_dict = {}
            for text in info:
                if text in ["Turf", "dirt"]:
                    info_dict["race_type"] = text
                if "Obstacle" in text:
                    info_dict["race_type"] = "Obstacle"
                if "m" in text:
                    info_dict["course_len"] = int(re.findall(r"\d+", text)[0]) #This is also capitalized.
                if text in ["Good", "Going", "Heavy", "不Good"]:
                    info_dict["ground_state"] = text
                if text in ["Cloudy", "Fine", "rain", "小rain", "Koyuki", "snow"]:
                    info_dict["weather"] = text
                if "Year" in text:
                    info_dict["date"] = text
            race_infos[race_id] = info_dict
            time.sleep(1)
        except IndexError:
            continue
        except Exception as e:
            print(e)
            break
    return race_infos
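
For reference, each entry in the returned dictionary is a flat dict of race conditions keyed by race_id. A minimal sketch of what a single entry might look like (the race_id and values below are made up for illustration):

race_infos = scrape_race_info(["201901010101"])  # hypothetical race_id
# race_infos might now look like:
# {"201901010101": {"race_type": "芝", "course_len": 1200,
#                   "ground_state": "良", "weather": "晴",
#                   "date": "2019年1月5日"}}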

Create race_id_list from the data scraped last time, convert the scraped info into a DataFrame as before, and merge it with the original results.

import pandas as pd

race_id_list = results.index.unique()
race_infos = scrape_race_info(race_id_list)

# Turn each race's info dict into a one-row DataFrame indexed by race_id,
# stack them, and merge onto the results scraped last time.
race_infos = pd.concat(
    [pd.DataFrame(race_infos[key], index=[key]) for key in race_infos]
)
results = results.merge(race_infos, left_index=True, right_index=True, how='left')
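
Because the merge uses how='left', any race skipped by the IndexError handler simply ends up with NaN in the new columns. A minimal sketch of how you might check for and drop such rows (column names follow the keys used in scrape_race_info):

# Races whose info page could not be parsed have NaN in the merged columns.
missing = results[results["race_type"].isnull()]
print(len(missing), "races without race info")

# Optionally drop them before moving on to feature engineering.
results = results.dropna(subset=["race_type", "course_len", "weather"])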

The completed data looks like this. (Screenshot of the merged DataFrame.)

There is a detailed explanation in the video: Data analysis and machine learning starting with horse racing prediction.
