I was interested in horse racing prediction as a data analysis theme, so I tried it.
The site I referred to is here.
If you want to build a predictive model from scratch, you need to take the following steps:
This time, I will briefly summarize step 1, the scraping-related part.
I scraped the data from netkeiba.com (db.netkeiba.com).
Important point
Retrieving a large amount of data at once puts a load on the server. Inserting time.sleep(1) makes the loop wait one second after each request for an ID in race_id_list. Reducing the server load this way is good etiquette.
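import pandas as pd
from tqdm.notebook import tqdm
import time

def scrape_race_results(race_id_list):
    # Maps each race_id to the first table found on its result page.
    race_results = {}
    for race_id in tqdm(race_id_list):
        try:
            url = 'https://db.netkeiba.com/race/' + race_id
            race_results[race_id] = pd.read_html(url)[0]
            # Wait one second between requests to keep the server load low.
            time.sleep(1)
        except IndexError:
            # No table for this ID: skip it.
            continue
        except:
            # On any other error or a manual interrupt, stop and return what was collected so far.
            break
    return race_results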
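One detail worth noting in the function above: time.sleep(1) sits after the request inside the try block, so it is skipped whenever the request raises an exception. A minimal sketch of a variant that always pauses is shown below. The names scrape_politely, fetch, and delay are hypothetical and not part of the original code; fetch stands in for the pd.read_html call so the loop can be tried without touching the network.

```python
import time

def scrape_politely(race_id_list, fetch, delay=1.0):
    """Call fetch(race_id) for each ID, pausing `delay` seconds after every attempt."""
    results = {}
    for race_id in race_id_list:
        try:
            results[race_id] = fetch(race_id)
        except IndexError:
            # No result table for this ID: skip it.
            continue
        finally:
            # Runs after a success and after a skipped ID alike,
            # so consecutive requests are always spaced out.
            time.sleep(delay)
    return results

# Stand-in fetch function for a dry run (hypothetical, no network access).
def fake_fetch(race_id):
    if race_id.endswith("00"):
        raise IndexError("no table")
    return "table for " + race_id

demo = scrape_politely(["202009020611", "202009020600"], fake_fetch, delay=0)
```

For real use, fetch could be something like lambda race_id: pd.read_html('https://db.netkeiba.com/race/' + race_id)[0], with delay left at one second.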
Pass the IDs of the races you want to retrieve as race_id_list. For example, take the ID 202009020611.
It breaks down as follows:
2020 → year
09 → racecourse (09 is Hanshin, 10 is Kokura, etc.)
02 → meeting number at that racecourse
06 → day of the meeting
11 → race number
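The breakdown of the ID can be sketched in code as follows. This is only an illustration: parse_race_id, build_race_id, and RaceId are hypothetical helper names, and the third and fourth two-digit fields are assumed here to be the meeting number and the day within that meeting. Only the two racecourse codes mentioned in the text are listed.

```python
from typing import NamedTuple

# Racecourse codes mentioned in the text; other codes are left out on purpose.
PLACE_NAMES = {9: "Hanshin", 10: "Kokura"}

class RaceId(NamedTuple):
    year: int
    place: int        # racecourse code
    meeting: int      # assumed: meeting number at that racecourse
    day: int          # assumed: day within the meeting
    race_number: int

def parse_race_id(race_id: str) -> RaceId:
    """Split a 12-digit race ID string into its five numeric fields."""
    if len(race_id) != 12 or not race_id.isdigit():
        raise ValueError("race_id must be a 12-digit string: " + repr(race_id))
    return RaceId(
        year=int(race_id[0:4]),
        place=int(race_id[4:6]),
        meeting=int(race_id[6:8]),
        day=int(race_id[8:10]),
        race_number=int(race_id[10:12]),
    )

def build_race_id(year: int, place: int, meeting: int, day: int, race_number: int) -> str:
    """Assemble the fields back into the zero-padded 12-digit ID string."""
    return f"{year:04d}{place:02d}{meeting:02d}{day:02d}{race_number:02d}"

parsed = parse_race_id("202009020611")
# parsed == RaceId(year=2020, place=9, meeting=2, day=6, race_number=11)
```

Going the other way, build_race_id inside nested loops over the fields is one way to produce a race_id_list for the scraping function.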
As a trial, you can look at the result for a single race in this way.

The analysis itself uses basic pandas. To be safe, save the data as both a pickle file and a CSV.
Assuming that the acquired data is stored in results_new, it looks like this.
results_new.to_pickle('results_new2017-2020')
results_new.to_csv('results_new2017-2020.csv',encoding="SHIFT-JIS")
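The scraping function returns a dictionary of per-race tables, while to_pickle and to_csv above are called on a single object. One possible way to get from one to the other is sketched below, assuming results_new is meant to be a single DataFrame indexed by race ID. The function name combine_race_results, the column names, and the values are made up for illustration and are not the real scraped data.

```python
import os
import tempfile

import pandas as pd

def combine_race_results(race_results):
    """Stack per-race tables into one DataFrame indexed by race ID."""
    frames = []
    for race_id, table in race_results.items():
        table = table.copy()
        # Label every row with its race ID so rows stay traceable after stacking.
        table.index = [race_id] * len(table)
        frames.append(table)
    return pd.concat(frames, sort=False)

# Dummy tables standing in for scraped results (values are made up).
dummy = {
    "202009020611": pd.DataFrame({"rank": [1, 2], "horse_number": [9, 11]}),
    "202009020612": pd.DataFrame({"rank": [1, 2], "horse_number": [3, 5]}),
}
results_new = combine_race_results(dummy)

# Round trip through a pickle file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results_new.pickle")
    results_new.to_pickle(path)
    restored = pd.read_pickle(path)
```

The pickle keeps the dtypes and the index exactly, which makes it convenient for reloading in pandas, while the CSV is easier to open in other tools.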
That is a brief summary of how to acquire the data.