[Python] Scraping in AWS Lambda

Prepare an environment for scraping sites that require JavaScript, using the following stack.

Required libraries

Local execution

pip install python-lambda-local

Deploy

pip install lambda-uploader

If you want to try it quickly, please visit the repository below. https://github.com/akichim21/python_scraping_in_lambda

Execution script

The script uses Selenium (driver: PhantomJS) to render the page with JavaScript executed, then extracts the title. Finally, it kills PhantomJS with close() and quit().

lambda_function.py


#!/usr/bin/env python

import os  # for os.path.devnull
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

def lambda_handler(event, context):
  # Spoof a regular desktop browser user agent
  user_agent = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36")

  dcap = dict(DesiredCapabilities.PHANTOMJS)
  dcap["phantomjs.page.settings.userAgent"] = user_agent
  dcap["phantomjs.page.settings.javascriptEnabled"] = True

  # Use the phantomjs binary bundled at the package root;
  # skip images and relax SSL checks to keep page loads fast
  browser = webdriver.PhantomJS(
              service_log_path=os.path.devnull,
              executable_path="./phantomjs",
              service_args=['--ignore-ssl-errors=true', '--load-images=no', '--ssl-protocol=any'],
              desired_capabilities=dcap
            )

  # Render the page (JavaScript executed by PhantomJS) and grab the title
  browser.get('http://google.com')
  title = browser.title

  # Kill PhantomJS so no process lingers in the container
  browser.close()
  browser.quit()
  return title
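
One robustness note, added here as a suggestion rather than part of the original script: if browser.get() raises, the close()/quit() calls above are never reached and a PhantomJS process can linger in the warm Lambda container. A minimal try/finally variant guarantees cleanup:


# (imports and user_agent as in lambda_function.py above)
def lambda_handler(event, context):
  dcap = dict(DesiredCapabilities.PHANTOMJS)  # same capabilities setup as above
  dcap["phantomjs.page.settings.javascriptEnabled"] = True

  browser = webdriver.PhantomJS(
              service_log_path=os.path.devnull,
              executable_path="./phantomjs",
              service_args=['--ignore-ssl-errors=true', '--load-images=no', '--ssl-protocol=any'],
              desired_capabilities=dcap
            )
  try:
    browser.get('http://google.com')
    return browser.title
  finally:
    # Always shut PhantomJS down, even when get() fails
    browser.close()
    browser.quit()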

JSON passed as an argument

Not used this time, but this is how you pass arguments that the handler reads as event["key1"].

event.json


{
  "key3": "value3",
  "key2": "value2",
  "key1": "value1"
}
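
For illustration only (the scraping script above ignores the event), a handler would read these values like this:


def lambda_handler(event, context):
  # Values from event.json arrive as the event dict
  key1 = event["key1"]                 # KeyError if absent
  key2 = event.get("key2", "default")  # or fall back to a default
  return key1 + " " + key2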

Lambda settings JSON

Create the role first and replace the role ARN below with your own. This package is under 50 MB, so S3 is not used here, but adding other libraries quickly pushes the size past 50 MB, and then S3 is required. If you set s3_bucket, the file will be uploaded via S3.

The name is used when invoking the function, so pick an appropriate name for production. Set memory, timeout, and so on to suit your script.

lambda.json


{
  "name": "python_scraping_test",
  "description": "python_scraping_test",
  "region": "ap-northeast-1",
  "runtime": "python2.7",
  "handler": "lambda_function.lambda_handler",
  "role": "arn:aws:iam::00000000:role/lambda_basic_execution",
  "timeout": 60,
  "memory": 128,
  "variables": {
    "production": "True"
  },
  "ignore": [
    "\\.git.*",
    "/.*\\.pyc$",
    "/.*\\.zip$"
  ]
}
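
If the package does grow past the direct-upload limit, set s3_bucket as mentioned above so lambda-uploader routes the zip through S3. A sketch (the bucket name is a placeholder you must replace) of the key to add alongside the others in lambda.json:


  "s3_bucket": "your-deploy-bucket"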

Dependencies

Only selenium is installed with pip. PhantomJS is placed in the project root as a binary (the executable_path="./phantomjs" above).

requirements.txt


selenium
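
To get the binary, download a Linux x86_64 build (Lambda runs on Amazon Linux) and place it in the project root. A sketch assuming the official PhantomJS 2.1.1 release; verify the URL and version yourself:

wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
tar xjf phantomjs-2.1.1-linux-x86_64.tar.bz2
cp phantomjs-2.1.1-linux-x86_64/bin/phantomjs ./phantomjs
chmod +x ./phantomjs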

Local execution command

Execute using python-lambda-local. -f is the handler function name, -l is the library path, and -t is the timeout in seconds.

python-lambda-local -f lambda_handler -l ./ -t 60 lambda_function.py event.json

Result


[root - INFO - 2017-03-19 08:16:05,271] Event: {u'test': u'test'}
[root - INFO - 2017-03-19 08:16:05,271] START RequestId: 4e881a1b-3f7a-4de8-9afb-aee6f6b5dac6
[root - INFO - 2017-03-19 08:16:06,766] END RequestId: 4e881a1b-3f7a-4de8-9afb-aee6f6b5dac6
[root - INFO - 2017-03-19 08:16:06,766] RESULT:
Google
[root - INFO - 2017-03-19 08:16:06,766] REPORT RequestId: 4e881a1b-3f7a-4de8-9afb-aee6f6b5dac6  Duration: 1494.06 ms

Deploy command

~/.aws/credentials is used, so set it up if you have not already done so.

Install aws-cli if you don't have it:

pip install awscli
aws configure

If your virtualenv (such as a conda env) is in a non-standard location, note where its site-packages directory is via the Location field:

pip show virtualenv

Then deploy:

lambda-uploader

# If the virtualenv is in a non-standard location, pass its Location directory as an argument (with vagrant, anaconda3, and a py2 env it looks like this):
lambda-uploader --virtualenv=/home/vagrant/.pyenv/versions/anaconda3-4.1.0/envs/py2/lib/python2.7/site-packages
