[PYTHON] Specify the cost limit when querying BigQuery casually from Jupyter


Problem

When you are writing code interactively in Jupyter and querying BigQuery, a typo can cause unexpectedly high charges. For example, suppose you want three days of data and set the dates passed to TABLE_DATE_RANGE:

- (Correct) TABLE_DATE_RANGE(xxx_, TIMESTAMP('2017-05-01'), TIMESTAMP('2017-05-03'))
- (Typo) TABLE_DATE_RANGE(xxx_, TIMESTAMP('2016-05-01'), TIMESTAMP('2017-05-03'))

The typo makes the query scan one year and three days of tables instead of three days. I usually only notice it after submitting the query job, when the results take suspiciously long to come back and I go back and review the query.
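For context, such a query in BigQuery legacy SQL looks roughly like the sketch below; the table prefix mydataset.access_log_ and the column names are hypothetical, used only for illustration.

# Legacy SQL over date-sharded tables; `mydataset.access_log_` and the
# selected columns are hypothetical placeholders.
query = """
SELECT user_id, event_name
FROM TABLE_DATE_RANGE(mydataset.access_log_,
                      TIMESTAMP('2017-05-01'),
                      TIMESTAMP('2017-05-03'))
"""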

Countermeasures

To prevent this without losing the appeal of an interactive environment where you can write code and run it instantly, you can have the cost checked at run time. Fortunately, as of pandas v0.20, this can be specified as an option to pd.read_gbq.

It is convenient to prepare a wrapper around pd.read_gbq like the following, so that you can set an upper cost limit per query.

import pandas as pd

PROJECT_ID = 'your-project-id'  # replace with your GCP project ID

COST_LIMIT_PER_QUERY = 1  # USD
GBYTE = 1000000000
# On-demand query pricing is $5 per TB, i.e. roughly 200 GB per USD.
maximum_bytes_billed = int(COST_LIMIT_PER_QUERY * 200 * GBYTE)

def read_gbq(query):
    return pd.read_gbq(
        query, project_id=PROJECT_ID, verbose=False,
        configuration={'query': {
            'maximumBytesBilled': maximum_bytes_billed,
            # I've never seen it go above 1, so it's fixed at 1.
            'maximumBillingTier': 1,
        }})
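Usage is then the same as plain pd.read_gbq; a minimal sketch, reusing the hypothetical table prefix from above:

df = read_gbq("""
SELECT user_id, event_name
FROM TABLE_DATE_RANGE(mydataset.access_log_,
                      TIMESTAMP('2017-05-01'),
                      TIMESTAMP('2017-05-03'))
""")

If the query would bill more than maximum_bytes_billed (200 GB here), the job fails instead of running up charges.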

If a query hits the maximumBytesBilled limit, it fails, and the error message contains the number of bytes that would be needed to run it.

Example

GenericGBQException: Reason: bytesBilledLimitExceeded, Message: Query exceeded limit for bytes billed: 2000000. 20971520 or higher required.
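For reference, the "required" figure can be converted back into an estimated cost using the same 200 GB-per-USD rate assumed in the wrapper above (a rough sketch; it ignores any rounding in BigQuery's actual billing):

required_bytes = 20971520                        # from the error message above
estimated_cost = required_bytes / (200 * GBYTE)  # ~0.0001 USD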
