[PYTHON] Collect images for machine learning (Bing Search API)

Introduction

This is a memo of what I did.

What to do

Studying machine learning can require a large number of images. Bing appears to be the most suitable service for collecting images, and I had never used Microsoft Azure, so I tried it as a learning exercise. This is a simple post built on the reference URLs, for anyone who gets stuck when collecting images; I strongly agree with the article below.

[Reference URL] Summary of image collection circumstances on Yahoo, Bing, and Google https://qiita.com/ysdyt/items/565a0bf3228e12a2c503

Prerequisites

Microsoft Azure: obtain a Bing Search API key (see the reference URLs for how to get one) https://azure.microsoft.com/ja-jp/

Free trial period: 30 days
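
Once the key has been issued, it may be preferable to keep it out of the script itself. A minimal sketch, assuming the key is stored in an environment variable named BING_SEARCH_API_KEY (a hypothetical name chosen for this example, as is the helper build_headers); the Ocp-Apim-Subscription-Key header is the one used by the code in this post:

```python
import os


def build_headers(env_var="BING_SEARCH_API_KEY"):
    """Build the request headers from an API key stored in an environment variable.

    The variable name is an assumption made for this sketch, not an Azure convention.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError("Set the {} environment variable first".format(env_var))
    return {"Ocp-Apim-Subscription-Key": api_key}
```

The returned dictionary can then be passed as headers to requests.get in place of one built from a hard-coded key.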

Reference URL

- Create an automatic image collection program with the Bing Web Search API https://blog.wackwack.net/entry/2017/12/27/223755

- Collect a large number of images using Bing's image search API https://qiita.com/ysdyt/items/49e99416079546b65dfc

- Official: Quickstart: Search for images using the Bing Image Search REST API and Python https://docs.microsoft.com/ja-jp/azure/cognitive-services/bing-image-search/quickstarts/python

Code

- I wanted to use multiple search words, so the script reads them from a local file (each line pairs a search word with the name of the folder to store its images in).

- Only the file-reading part is added to the code from the reference URLs.

import csv
import hashlib
import os
import time
import urllib.parse

import requests


# Split the argument f into the file name and the extension (without the leading dot)
def split_filename(f):
    split_name = os.path.splitext(f)
    file_name = split_name[0]
    extension = split_name[-1].replace(".", "")
    return file_name, extension


# Download the image at url into the folder path, if the URL ends in an image extension
def download_img(path, url):
    _, extension = split_filename(url)
    if extension.lower() in ('jpg', 'jpeg', 'gif', 'png', 'bmp'):
        # Name the file after the SHA3-256 hash of the URL
        # (hashlib provides sha3_256 from Python 3.6, so "import sha3" is not needed)
        encode_url = urllib.parse.unquote(url).encode('utf-8')
        hashed_name = hashlib.sha3_256(encode_url).hexdigest()
        full_path = os.path.join(path, hashed_name + '.' + extension.lower())

        r = requests.get(url)
        if r.status_code == requests.codes.ok:
            with open(full_path, 'wb') as f:
                f.write(r.content)
            print('saved image...{}'.format(url))
        else:
            print("HttpError:{0}  at{1}".format(r.status_code, url))


# Endpoint URL
url = "https://api.cognitive.microsoft.com/bing/v7.0/images/search"

# Bing Search API key
APIKey = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Parameters
headers = {'Ocp-Apim-Subscription-Key': APIKey}
count = 10     # Maximum number of results per request (default: 30, max: 150)
mkt = "ja-JP"  # Market (country) code of the search source
num_per = 2    # Number of requests (count * num_per = number of images requested)

# list.txt: one tab-separated pair per line (search word, then folder name)
with open("./list.txt", "r", encoding="utf-8-sig") as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        keyword = row[0]
        pathname = row[1]

        # Specify the save destination
        path = "./" + pathname
        # Create the save destination if it does not exist
        os.makedirs(path, exist_ok=True)

        # Page through the results: each request skips the images already returned
        for request_num in range(num_per):
            params = {'q': keyword, 'count': count, 'offset': request_num * count, 'mkt': mkt}
            r = requests.get(url, headers=headers, params=params)
            data = r.json()
            for values in data['value']:
                image_url = values['contentUrl']
                try:
                    download_img(path, image_url)
                except Exception as e:
                    print("failed to download image at {}".format(image_url))
                    print(e)
            time.sleep(0.5)
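
The paging implied by the comments on count and num_per (count * num_per images in total) can be sketched as follows. page_offsets is a hypothetical helper name, and the sketch assumes that the offset parameter means "number of results to skip", so that consecutive requests should not overlap:

```python
def page_offsets(count, num_per):
    """Return the offset to send with each request.

    count   -- number of results asked for per request
    num_per -- number of requests to make
    """
    return [request_num * count for request_num in range(num_per)]


# With the settings used in this post (count = 10, num_per = 2)
print(page_offsets(10, 2))  # [0, 10]
```

With these offsets the first request covers results 0 to 9 and the second covers results 10 to 19.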

- Input file (list.txt): search words and storage folder names (screenshot: png1.png)

- Downloaded images (fujisan) (screenshot: png2.png)
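
For reference, list.txt holds one tab-separated pair per line: the search word, then the folder name. A minimal sketch of how such a file is parsed; the two sample rows are made up for this example (only the folder name fujisan appears in this post), and read_keywords is a hypothetical helper:

```python
import csv
import io


def read_keywords(f):
    """Return (search word, folder name) pairs from a tab-separated file object."""
    reader = csv.reader(f, delimiter="\t")
    # Skip blank lines and rows that lack the folder-name column
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


# Sample content standing in for list.txt (hypothetical rows)
sample = io.StringIO("富士山\tfujisan\n桜\tsakura\n")
print(read_keywords(sample))  # [('富士山', 'fujisan'), ('桜', 'sakura')]
```

Skipping short rows is a design choice of this sketch: it avoids an IndexError on a trailing blank line in the file.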

Other

- Installation: pip install pysha3 failed on Python 3.7 but completed without error on Python 3.6, so I ran this program with Python 3.6.
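
Since Python 3.6 the standard hashlib module includes sha3_256, so the hash-based file naming can be tried without pysha3. A minimal sketch; name_from_url is a hypothetical helper that mirrors the naming step in download_img, and the URL is a placeholder:

```python
import hashlib
import urllib.parse


def name_from_url(url, extension):
    """Name a file after the SHA3-256 hash of its (percent-decoded) URL."""
    encoded_url = urllib.parse.unquote(url).encode("utf-8")
    return hashlib.sha3_256(encoded_url).hexdigest() + "." + extension.lower()


name = name_from_url("https://example.com/a%20b.JPG", "JPG")
print(len(name), name.endswith(".jpg"))  # 68 True
```

The hex digest is always 64 characters long, so every saved file gets a fixed-length name regardless of how long the image URL is.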

Summary

- I was able to avoid stumbling at the very first step of studying image-based machine learning. (Thanks!)

- The paid tier of MS Azure is not expensive, so I may keep using it, depending on the situation, after the free period ends. Price: https://azure.microsoft.com/ja-jp/pricing/details/cognitive-services/search-api/
