[Python] Convert the Go To Eat Hokkaido campaign participating store list PDFs to CSV

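Some characters disappear

In the text extracted from the PDFs, some kanji do not come through: they appear as "(cid:N)" codes or as look-alike CJK radical / Kangxi radical characters, which the program below replaces.

[Images: cubepdf.png, font.png, shin.png, kuma.png]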
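
A minimal sketch of why some extracted characters look right but are not. The text can contain Kangxi radicals such as ⼭ (U+2F2D), which render like the ordinary kanji 山 (U+5C71) but are different code points, so string comparisons and searches fail. The variable names are illustrative only, and NFKC normalization is used here just to show how the two characters are related.

```python
import unicodedata

radical = "\u2f2d"  # ⼭ KANGXI RADICAL MOUNTAIN, one of the radicals the program replaces
kanji = "\u5c71"    # 山, the ordinary kanji a reader expects

# Print the code point and official Unicode name of each character
for ch in (radical, kanji):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# The two characters look alike but do not compare equal
same_before = radical == kanji

# The radical's compatibility mapping points to the ordinary kanji
same_after = unicodedata.normalize("NFKC", radical) == kanji

print(same_before, same_after)
```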

Program

import camelot

import requests
from bs4 import BeautifulSoup

from urllib.parse import urljoin

import pandas as pd

url = "https://gotoeat-hokkaido.jp/general/particStores/"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
}

r = requests.get(url, headers=headers)
r.raise_for_status()

soup = BeautifulSoup(r.content, "html.parser")

dfs = []

# Each link in the list points to one area's PDF
for a in soup.select("ul.cf > li > a"):

    link = urljoin(url, a.get("href"))

    # The link text is used as the area name
    area = a.get_text(strip=True)

    tables = camelot.read_pdf(link, split_text=True, pages="all", strip_text="\n")

    for table in tables:

        df_tmp = pd.DataFrame(table.data[1:], columns=table.data[0])
        df_tmp.columns = df_tmp.columns.map(lambda s: "".join(s.split()))
        df_tmp["area"] = area

        dfs.append(df_tmp)

df = pd.concat(dfs)

# Characters that come out of the PDF as "(cid:N)" codes are replaced one by one
df = df.fillna("").applymap(
    lambda s: s.replace("(cid:1279)", "Yue")
    .replace("(cid:1535)", "Han")
    .replace("(cid:1791)", "bear")
    .replace("(cid:2303)", "Boiled")
    .replace("(cid:2559)", "new")
    .replace("(cid:2815)", "Noisy")
    .replace("(cid:3071)", "crane")
)

import unicodedata

# Replace CJK radicals / Kangxi radicals with the ordinary kanji they stand for.
# The table is built from Unicode compatibility (NFKC) mappings: every Kangxi
# radical (U+2F00-U+2FD5) has one, but in the CJK Radicals Supplement block
# (U+2E80-U+2EF3) only a few do, so the other supplement radicals stay as they are.
# The old-form characters 戶 and 黑 are replaced with 戸 and 黒.
old_forms = str.maketrans("戶黑", "戸黒")
tbl = dict(old_forms)
for cp in [*range(0x2E80, 0x2EF4), *range(0x2F00, 0x2FD6)]:
    normalized = unicodedata.normalize("NFKC", chr(cp))
    if normalized != chr(cp):
        tbl[cp] = normalized.translate(old_forms)

df = df.applymap(lambda s: s.translate(tbl))

df.reset_index(drop=True, inplace=True)

df.index += 1

df.to_csv("gotoeat_hokkaido.csv", encoding="utf_8_sig")
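
The list of "(cid:N)" replacements in the program is written by hand, so it helps to know which codes actually occur in the extracted text. The following is a minimal sketch of such a check, not part of the original program: `CID_PATTERN`, `count_cid_codes`, and the sample strings are hypothetical names and made-up data used only for illustration.

```python
import re
from collections import Counter

# Matches placeholders such as "(cid:1791)"
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def count_cid_codes(values):
    """Count every "(cid:N)" placeholder found in an iterable of strings."""
    counts = Counter()
    for value in values:
        counts.update(CID_PATTERN.findall(value))
    return counts


# Made-up sample cells, not taken from the real PDFs
sample = ["(cid:1791)本店", "札幌(cid:2559)店", "(cid:1791)の子", "no codes here"]

cid_counts = count_cid_codes(sample)
print(cid_counts.most_common())
```

Running a check like this on the table cells before the replacement step would show which codes need an entry; running it again afterwards would show any that were missed.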
