
Python Web Scraping with Selenium

1. Selenium automation

Selenium can drive a browser and perform actions on the page such as clicking, typing, and scrolling.

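Reverse engineering works differently from Selenium automation. In essence, you analyze the requests a site sends (the request method, the request parameters, the encryption scheme, and so on) and then reproduce those requests in code to get the same result.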
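As a rough illustration of "reproducing the request in code", the sketch below rebuilds a signed GET request using only the standard library. Everything specific in it is an assumption made for the example: the endpoint `https://example.com/api/search`, the parameter names, the `SECRET` value and the MD5 signing rule are all hypothetical and do not describe any real site.

```python
import hashlib
import urllib.parse
import urllib.request

# Hypothetical values: the endpoint, the parameter names and the signing
# rule are invented for illustration and belong to no real site.
API_URL = "https://example.com/api/search"
SECRET = "demo-secret"


def sign(params: dict) -> str:
    """Sort the parameters, join them as k=v pairs, append the secret, take MD5."""
    joined = "&".join(f"{k}={params[k]}" for k in sorted(params))
    return hashlib.md5((joined + SECRET).encode("utf-8")).hexdigest()


def build_request(keyword: str, page: int) -> urllib.request.Request:
    """Assemble the GET request a browser would send (nothing is sent here)."""
    params = {"keyword": keyword, "page": page}
    params["sign"] = sign(params)
    url = f"{API_URL}?{urllib.parse.urlencode(params)}"
    return urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"}, method="GET")


req = build_request("iphone", 1)
print(req.get_method(), req.full_url)
```

In a real project the hard part is recovering the signing rule from the site's JavaScript; once that is known, the request itself is cheap to repeat in bulk.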

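Reverse engineering vs. Selenium automation

Selenium
    [Pro] Simple. No reverse engineering is needed; you only drive the browser through a preset sequence of actions.
    [Con] Poor performance, and not well suited to large batch jobs.

Reverse engineering
    [Pro] Once the algorithm has been recovered, performance is good and batch jobs are easy.
    [Con] Heavily obfuscated JS encryption algorithms are hard to reverse.

2. Essential operations

2.1 Module & driver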

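Install the module

    pip install selenium

Download the driver

To control Chrome, Firefox, IE, Edge or another browser, Selenium needs the matching driver:

    [Selenium] -> [driver] -> [browser]
    [Selenium] -> [Firefox driver] -> [Firefox]
    [Selenium] -> [Chrome driver] -> [Chrome]

Downloading the Chrome driver:

    Version 114 and earlier: http://chromedriver.storage.googleapis.com/index.html
    Versions 117/118/119: https://googlechromelabs.github.io/chrome-for-testing/

Finding the browser version: open chrome://version/ in Chrome. Example: 119.0.6045.200 (Official Build) (64-bit) (cohort: Stable)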
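The version check can be automated. The sketch below parses a version string copied from chrome://version/ and picks one of the two download pages listed above. The helper names `major_version` and `driver_download_page` are hypothetical, and treating every major version after 114 as a Chrome for Testing release is an assumption (the text only names 117/118/119).

```python
import re

# The two download pages named in the text.
LEGACY_INDEX = "http://chromedriver.storage.googleapis.com/index.html"
CFT_INDEX = "https://googlechromelabs.github.io/chrome-for-testing/"


def major_version(version_text: str) -> int:
    """Extract the major version from a chrome://version string."""
    match = re.match(r"\s*(\d+)\.\d+\.\d+\.\d+", version_text)
    if match is None:
        raise ValueError(f"not a Chrome version string: {version_text!r}")
    return int(match.group(1))


def driver_download_page(version_text: str) -> str:
    """Pick the download page for the given browser version."""
    # Assumption: every major version after 114 is served by Chrome for Testing.
    return LEGACY_INDEX if major_version(version_text) <= 114 else CFT_INDEX


print(driver_download_page("119.0.6045.200 (Official Build) (64-bit)"))
```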

Quick start

    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    service = Service("driver/chromedriver.exe")
    driver = webdriver.Chrome(service=service)

    driver.get('https://passport.bilibili.com/login')

    time.sleep(5)
    driver.close()

2.2 Finding elements

    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By

    service = Service("driver/chromedriver.exe")
    driver = webdriver.Chrome(service=service)

    driver.get('URL_TO_OPEN')  # placeholder: the page to open

    # find_element returns the first match; find_elements returns a list of all matches
    tag = driver.find_element(By.ID, "user")
    tag = driver.find_element(By.CLASS_NAME, "c1")
    tag = driver.find_element(By.TAG_NAME, "div")
    tag = driver.find_element(By.XPATH, "/html/body/div[1]/div/div[2]/div[3]/div[3]/div/div/div/div[1]/span[2]")
    tag = driver.find_element(By.XPATH, '//*[@id="geetest-wrap"]//input[@name="tel"]')

    tag_list = driver.find_elements(By.XPATH, "/html/body/div/div[2]/div/div[2]/div/div[2]/div[2]/div/div/div/div/div[2]/a")
    for tag in tag_list:
        print(tag)

    time.sleep(5)
    driver.close()
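The difference between a single lookup, a multi-lookup and a lookup relative to a parent can be tried offline, without a browser. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and is not what Selenium uses internally; the page fragment is hypothetical. One behavioral difference to keep in mind: ElementTree's `find` returns `None` when nothing matches, whereas Selenium's `find_element` raises `NoSuchElementException` (and `find_elements` returns an empty list).

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment, invented for this illustration.
PAGE = """
<html><body>
  <div id="nav"><ul><li><a href="/a">Home</a></li><li><a href="/b">Courses</a></li></ul></div>
  <div class="course"><a href="/c1">Python</a><a href="/c2">Go</a></div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# First match only, comparable to find_element.
first_link = root.find(".//div[@id='nav']//a")
print(first_link.text)

# Every match, comparable to find_elements.
all_links = root.findall(".//a")
print([a.text for a in all_links])

# Nested lookup: locate a parent first, then search relative to it.
parent = root.find(".//div[@class='course']")
course_links = parent.findall("a")
print([a.text for a in course_links])
```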

Example: 5xclass

    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By

    service = Service("driver/chromedriver.exe")
    driver = webdriver.Chrome(service=service)

    driver.get('https://www.5xclass.cn/')

    # Find by ID
    tag = driver.find_element(By.ID, "bs-example-navbar-collapse-1")
    print(tag.text)
    print(10 * "-")

    # Find by class name
    tags = driver.find_elements(By.CLASS_NAME, "panel-heading")
    for tag in tags:
        print(tag.text)
    print(10 * "-")

    # Find by tag name
    tags = driver.find_elements(By.TAG_NAME, "li")
    for tag in tags:
        print(tag.text)
    print(10 * "-")

    # Find by XPath (absolute path)
    tag = driver.find_element(By.XPATH, "/html/body/div/div[2]/div/div[2]/div/div[2]/div[1]")
    print(tag.text)
    print(10 * "-")

    # Find by XPath (starting from an id)
    tag = driver.find_element(By.XPATH, '//*[@id="bs-example-navbar-collapse-1"]/ul[1]/li[1]/a')
    print(tag.text)
    print(10 * "-")

    # Find several elements by XPath
    tags = driver.find_elements(By.XPATH, '/html/body/div/div[2]/div/div[2]/div/div[2]/div[2]/div/div/div/div/div[2]/a')
    for tag in tags:
        print(tag.text)
    print(10 * "-")

    # Nested lookup: find the parent first, then search relative to it
    parent = driver.find_element(By.XPATH, '/html/body/div/div[2]/div/div[2]/div/div[2]/div[2]/div/div/div/div')
    tags = parent.find_elements(By.XPATH, "div[@class='course']/a")
    for tag in tags:
        print(tag.text)

    time.sleep(5)
    driver.close()

2.3 Performing actions

Common actions: clicking and typing.

    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By

    service = Service("driver/chromedriver.exe")
    driver = webdriver.Chrome(service=service)

    driver.get('https://passport.bilibili.com/login')

    # 1. Click "SMS login"
    time.sleep(3)
    sms_btn = driver.find_element(
        By.XPATH,
        '//*[@id="app"]/div[2]/div[2]/div[3]/div[1]/div[3]'
    )
    sms_btn.click()  # click

    # 2. Type the account (phone number)
    phone_txt = driver.find_element(
        By.XPATH,
        '//*[@id="app"]/div[2]/div[2]/div[3]/div[2]/div[1]/div[1]/input'
    )
    phone_txt.send_keys("18630087660")  # type

    time.sleep(55)
    driver.close()

2.4 Executing JavaScript

If the "find an element, then act on it" approach becomes tedious, you can also run JavaScript directly in the page to get the same effect.

    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By

    service = Service("driver/chromedriver.exe")
    driver = webdriver.Chrome(service=service)

    driver.get('https://passport.bilibili.com/login')

    # ############# 1. Click "SMS login" #############
    time.sleep(3)
    sms_btn = driver.find_element(
        By.XPATH,
        '//*[@id="app"]/div[2]/div[2]/div[3]/div[1]/div[3]'
    )
    sms_btn.click()

    # ############# 2. Type the account #############
    phone_txt = driver.find_element(
        By.XPATH,
        '//*[@id="app"]/div[2]/div[2]/div[3]/div[2]/div[1]/div[1]/input'
    )
    phone_txt.send_keys("18630087660")

    # ############# 3. Choose the country #############
    time.sleep(2)
    driver.execute_script('document.querySelector(".area-code-select").children[18].click()')

    # ############# 4. Read cookies through JavaScript #############
    data_string = driver.execute_script('return document.cookie;')  # or: return document.title;
    print(data_string)

    # ############# 5. Read cookies through the driver #############
    cookie_list = driver.get_cookies()
    print(cookie_list)

    time.sleep(2550)
    driver.close()

2.5 Waiting

If the page loads slowly, you need to wait until a given element has loaded before acting on it.
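The idea behind an explicit wait is a simple polling loop: call a condition at a fixed interval until it returns something truthy or the timeout expires. The sketch below shows that loop in plain Python. The names `wait_until` and `element_ready` are hypothetical, and this is only a model of the behavior, not Selenium's own code; Selenium's `WebDriverWait.until` works on the same principle, additionally ignores `NoSuchElementException` raised by the condition, and raises `TimeoutException` when time runs out.

```python
import time


def wait_until(condition, timeout=30.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:                      # truthy value: stop waiting and hand it back
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)                # falsy value: wait one interval and retry


# Stand-in for "the element has loaded": succeeds on the third call.
attempts = {"count": 0}


def element_ready():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


found = wait_until(element_ready, timeout=2, poll=0.01)
print(found, attempts["count"])
```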

Example 1: using a lambda expression

    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.support.wait import WebDriverWait

    service = Service("driver/chromedriver.exe")
    driver = webdriver.Chrome(service=service)

    driver.get('https://passport.bilibili.com/login')

    # ############# Option 1: fixed sleep, then click "SMS login" #############
    time.sleep(3)
    sms_btn = driver.find_element(
        By.XPATH,
        '//*[@id="app"]/div[2]/div[2]/div[3]/div[1]/div[3]'
    )
    sms_btn.click()

    # ############# Option 2: explicit wait, then click "SMS login" (recommended) #############
    # Polls every 0.5s for up to 30s until the element is found
    sms_btn = WebDriverWait(driver, 30, 0.5).until(lambda dv: dv.find_element(
        By.XPATH,
        '//*[@id="app"]/div[2]/div[2]/div[3]/div[1]/div[3]'
    ))
    sms_btn.click()

Example 2: a custom function

    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.support.wait import WebDriverWait

    service = Service("driver/chromedriver.exe")
    driver = webdriver.Chrome(service=service)

    driver.get('https://passport.bilibili.com/login')


    def func(dv):
        print("No return value: this function is called again after 0.5s; a returned value is assigned to sms_btn")
        # <div xxx="123" id="uuu"></div>
        # <img src="..."/>
        tag = dv.find_element(
            By.XPATH,
            '//*[@id="app"]/div[2]/div[2]/div[3]/div[1]/div[3]'
        )
        img_src = tag.get_attribute("xxx")
        if img_src:
            return tag
        return


    sms_btn = WebDriverWait(driver, 30, 0.5).until(func)
    sms_btn.click()

    time.sleep(250)
    driver.close()

Example 3: global configuration

    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By

    service = Service("driver/chromedriver.exe")
    driver = webdriver.Chrome(service=service)

    # From now on, when an element is not found, keep looking for up to 30 seconds
    # (and continue immediately once it is found)
    driver.implicitly_wait(30)

    driver.get('https://passport.bilibili.com/login')

    # The id below is deliberately wrong to demonstrate the wait:
    # find_element raises NoSuchElementException after 30 seconds
    sms_btn = driver.find_element(
        By.XPATH,
        # '//*[@id="app"]/div[2]/div[2]/div[3]/div[1]/div[3]'
        '//*[@id="xxxxxxxxxapp"]/div[2]/div[2]/div[3]/div[1]/div[3]'
    )
    sms_btn.click()
    print("found")

    time.sleep(250)
    driver.close()

2.6 Getting values

Once an element has been found, you usually want to read the values inside it.

Example 1: text and attributes

For example: <a id='x1' class="info mine" href="https://www.5xclass.cn">武沛齐</a>

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By

    service = Service("driver/chromedriver.exe")
    driver = webdriver.Chrome(service=service)
    driver.implicitly_wait(10)

    driver.get('https://www.5xclass.cn')

    tag = driver.find_element(
        By.XPATH,
        '/html/body/div/div[2]/div/div[2]/div/div[2]/div[2]/div/div/div/div/div[2]/a[1]'
    )
    print(tag.text)                          # text inside the element
    print(tag.get_attribute("target"))       # value of the target attribute
    print(tag.get_attribute("data-toggle"))  # value of the data-toggle attribute

    driver.close()
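When several values are needed from the same element, it can be convenient to gather them in one place. The sketch below is one way to do that; `describe` and `FakeTag` are hypothetical names. `describe` relies only on the `.text` property and the `get_attribute` method used in the listing above, so it can be tried here with a stand-in object instead of a live browser.

```python
def describe(tag, attribute_names):
    """Collect an element's text and selected attributes into one dict."""
    info = {"text": tag.text}
    for name in attribute_names:
        # get_attribute returns None when the attribute is absent
        info[name] = tag.get_attribute(name)
    return info


class FakeTag:
    """Stand-in for a WebElement so the helper can be tried without a browser."""
    text = "武沛齐"
    _attrs = {"id": "x1", "class": "info mine"}

    def get_attribute(self, name):
        return self._attrs.get(name)


summary = describe(FakeTag(), ["id", "class", "target"])
print(summary)
```

With a real driver, the call would look like `describe(driver.find_element(By.ID, "x1"), ["href", "target"])`.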

Example 2: reading input values

For example: <input type='text' value="?" placeholder="?" />

For example: <select><option value='1'>Beijing</option> <option value='2'>Shanghai</option></select>, where you want the value attribute of the select element.

    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By

    service = Service("driver/chromedriver.exe")
    driver = webdriver.Chrome(service=service)
    driver.implicitly_wait(10)

    driver.get('https://www.bilibili.com/')

    time.sleep(10)

    tag = driver.find_element(
        By.XPATH,
        '//*[@id="nav-searchform"]/div[1]/input'
    )
    print(tag)
    print(tag.text)
    print(tag.get_attribute("placeholder"))
    print(tag.get_attribute("value"))

    time.sleep(1000)
    driver.close()

Example 3: selection state

    <input type="radio" name="findcar" value="1" checked="">New cars
    <input type="radio" name="findcar" value="2">Used cars

    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By

    service = Service("driver/chromedriver.exe")
    driver = webdriver.Chrome(service=service)
    driver.implicitly_wait(10)

    driver.get('https://www.autohome.com.cn/beijing/')

    # ############### 1. Find each radio button individually ###############
    tag = driver.find_element(
        By.XPATH,
        '/html/body/div[1]/div[11]/div[2]/div[1]/div[1]/label[1]/span/input'
    )
    print(tag.get_property("checked"))  # True

    tag = driver.find_element(
        By.XPATH,
        '/html/body/div[1]/div[11]/div[2]/div[1]/div[1]/label[2]/span/input'
    )
    print(tag.get_property("checked"))  # False

    # ############### 2. Loop over all radio buttons ###############
    parent = driver.find_element(
        By.XPATH,
        '/html/body/div[1]/div[11]/div[2]/div[1]/div[1]'
    )
    tag_list = parent.find_elements(
        By.XPATH,
        'label/span/input'
    )
    for tag in tag_list:
        print(tag.get_property("checked"), tag.get_attribute("value"))

    driver.close()

2.7 Page source + bs4

After the page is open, if an element is awkward to locate with Selenium alone, you can hand the page source to bs4 and search there instead.

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from bs4 import BeautifulSoup

    service = Service("driver/chromedriver.exe")
    driver = webdriver.Chrome(service=service)
    driver.implicitly_wait(10)

    driver.get('https://car.yiche.com/')

    html_string = driver.page_source
    soup = BeautifulSoup(html_string, features="html.parser")
    tag_list = soup.find_all(name="div", attrs={"class": "item-brand"})
    for tag in tag_list:
        child = tag.find(name='div', attrs={"class": "brand-name"})
        print(child.text)

    driver.close()

2.8 Sending cookies

    driver.add_cookie({'name': 'foo', 'value': 'bar'})

    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    service = Service("driver/chromedriver.exe")
    driver = webdriver.Chrome(service=service)

    # Note: visit the site first, otherwise the cookie will not take effect
    driver.get('https://dig.chouti.com/about')

    # Add the cookie
    driver.add_cookie({
        'name': 'token',
        'value': 'eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqaWQiOiJjZHVfNDU3OTI2NDUxNTUiLCJleHBpcmUiOiIxNzA0MzI5NDY5OTMyIn0.8n_tWcEHXsBSXWIY9rBoGWwaLPF8iWIruryhKTe5_ks'
    })

    # Visit again, this time with the cookie
    driver.get('https://dig.chouti.com/')

    time.sleep(2000)
    driver.close()

2.9 IP checks and proxies

If a site limits access per IP, for example five operations per IP per day, you can buy proxy IPs and attach one to each session. The steps are:

    1. Buy the IPs.
    2. Log in to the provider's dashboard and add your own IP to the whitelist.
    3. Pass the proxy in code.

    import time
    import requests
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    # Replace with your own generated proxy API link
    res = requests.get(url="https://dps.kdlapi.com/api/getdps/?secret_id=o60wwtxvs5ukaqqz18ai&num=1&signature=i6s9shfjfiogat5ijecbyfwwc5grwrzj&pt=1&format=json&sep=1")
    proxy_string = res.json()['data']['proxy_list'][0]
    print(f"Got proxy: {proxy_string}")  # "182.106.136.218:40192"

    service = Service("driver/chromedriver.exe")
    opt = webdriver.ChromeOptions()
    # opt.add_argument('--proxy-server=222.89.70.40:40001')  # fixed proxy
    opt.add_argument(f'--proxy-server={proxy_string}')  # proxy from the API
    driver = webdriver.Chrome(service=service, options=opt)

    driver.get('https://myip.ipip.net/')

    time.sleep(2000)
    driver.close()

2.10 Feature detection

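To keep Selenium out, some sites look for the telltale features of an automated browser and block access when they find them.

To use Selenium on such sites, you need to hide those browser features.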
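The idea is to register a script that the browser runs in every new document before any page script. The sketch below wraps that call in a helper. The one-line JavaScript is an assumption and only a minimal stand-in: it hides `navigator.webdriver` and nothing else, while the hiding script used in this article (`driver/hide.js`) is not shown and a real site may check many more features. `install_hide_script` and `RecordingDriver` are hypothetical names; the stand-in driver only records calls so the helper can be tried without a browser.

```python
# Assumption: a one-line stand-in for a hiding script; real sites may check far more.
HIDE_WEBDRIVER_JS = "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"


def install_hide_script(driver, script=HIDE_WEBDRIVER_JS):
    """Register a script that runs before any page script in every new document."""
    return driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument", {"source": script}
    )


class RecordingDriver:
    """Stand-in for webdriver.Chrome that only records the CDP calls it receives."""

    def __init__(self):
        self.calls = []

    def execute_cdp_cmd(self, cmd, cmd_args):
        self.calls.append((cmd, cmd_args))
        return {}


fake = RecordingDriver()
install_hide_script(fake)
print(fake.calls[0][0])
```

The helper must be called before `driver.get(...)`, since the script only applies to documents loaded afterwards.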

    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    service = Service("driver/chromedriver.exe")
    opt = webdriver.ChromeOptions()
    opt.add_argument('--disable-infobars')
    opt.add_experimental_option("excludeSwitches", ["enable-automation"])
    opt.add_experimental_option('useAutomationExtension', False)

    driver = webdriver.Chrome(service=service, options=opt)

    # Have the browser run this JS file before any page is opened
    with open('driver/hide.js') as f:
        driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source": f.read()})

    driver.get('https://www.5xclass.cn')

    time.sleep(2000)
    driver.close()

2.11 Headless mode and other options

If you do not want the browser window to be shown and would rather run quietly in the background, enable headless mode:

    opt.add_argument('--headless')

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.service import Service

    service = Service("driver/chromedriver.exe")
    opt = webdriver.ChromeOptions()
    opt.add_argument('--headless')

    driver = webdriver.Chrome(service=service, options=opt)

    driver.get('https://www.5xclass.cn')

    tag = driver.find_element(
        By.XPATH,
        '/html/body/div/div[2]/div/div[2]/div/div[2]/div[2]/div/div/div/div/div[2]/a[1]'
    )
    print(tag.text)
    print(tag.get_attribute("target"))
    print(tag.get_attribute("data-toggle"))

    driver.close()
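Because the same switches tend to be repeated across scripts, one option is to build the argument list in a small function and apply it to `ChromeOptions` afterwards. The sketch below does only the list-building part, so it runs without Selenium; `chrome_arguments` is a hypothetical helper, and the switches are the ones that appear in this article.

```python
def chrome_arguments(headless=False, proxy=None, load_images=True):
    """Return the list of Chrome command-line switches for the chosen settings."""
    args = []
    if headless:
        args.append("--headless")
    if proxy:
        args.append(f"--proxy-server={proxy}")
    if not load_images:
        args.append("--blink-settings=imagesEnabled=false")
    return args


args = chrome_arguments(headless=True, load_images=False)
print(args)

# With Selenium installed, the list would be applied like this:
# opt = webdriver.ChromeOptions()
# for arg in args:
#     opt.add_argument(arg)
```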

Other options:

    opt.add_argument('--disable-infobars')  # hide the "controlled by automated test software" infobar
    opt.add_argument('--no-sandbox')  # works around the "DevToolsActivePort file doesn't exist" error
    opt.add_argument('--window-size=1920,3000')  # browser window size
    opt.add_argument('--disable-gpu')  # Google's documentation mentions adding this to avoid a bug
    opt.add_argument('--incognito')  # incognito mode
    opt.add_argument('--disable-javascript')  # disable JavaScript
    opt.add_argument('--start-maximized')  # start maximized; without it, locating some elements can fail
    opt.add_argument('--hide-scrollbars')  # hide scrollbars, useful on some unusual pages
    opt.add_argument('--lang=en_US')  # set the language
    opt.add_argument('--blink-settings=imagesEnabled=false')  # skip images to speed up loading
    opt.add_argument('--user-agent=Mozilla/5.0 (Linux; U; Androi....')  # set the User-Agent
    opt.binary_location = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"  # path to the browser binary to use

2.12 Screenshots

Once an element has been found, you can save it as an image by taking a screenshot of it.
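An element screenshot can be obtained as a file, as raw PNG bytes, or as Base64 text. When only the Base64 form is at hand (for example after passing it through JSON), it can be turned back into PNG bytes with the standard library. The sketch below shows that conversion with a sanity check on the PNG signature; `png_bytes_from_base64` is a hypothetical helper and the sample input is a stand-in, not a real screenshot.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the first eight bytes of every PNG file


def png_bytes_from_base64(b64_text):
    """Decode Base64 screenshot text and check that the result is a PNG."""
    data = base64.b64decode(b64_text)
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data is not a PNG image")
    return data


# Stand-in for an element's Base64 screenshot: a PNG signature plus filler bytes.
sample_b64 = base64.b64encode(PNG_SIGNATURE + b"demo").decode("ascii")
data = png_bytes_from_base64(sample_b64)
print(data[:8] == PNG_SIGNATURE)
```

The returned bytes can then be written to disk with `open("demo.png", "wb")`.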

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.service import Service

    service = Service("driver/chromedriver.exe")
    driver = webdriver.Chrome(service=service)

    driver.get('https://www.5xclass.cn')

    tag = driver.find_element(
        By.XPATH,
        '/html/body/div/div[2]/div/div[2]/div/div[2]'
    )

    # Screenshot saved to a file
    tag.screenshot("demo.png")

    # Screenshot as raw PNG bytes
    body = tag.screenshot_as_png
    print(body)

    # Screenshot as Base64-encoded text
    b64_body = tag.screenshot_as_base64
    print(b64_body)

    driver.close()

3. Case study: x东 (JD) search

    import requests
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.service import Service

    # Replace with your own generated proxy API link
    res = requests.get(url="https://dps.kdlapi.com/api/getdps/?secret_id=o60wwtxvs5ukaqqz18ai&num=1&signature=i6s9shfjfiogat5ijecbyfwwc5grwrzj&pt=1&format=json&sep=1")
    proxy_string = res.json()['data']['proxy_list'][0]
    print(f"Got proxy: {proxy_string}")

    service = Service("driver/chromedriver.exe")
    opt = webdriver.ChromeOptions()
    opt.add_argument(f'--proxy-server={proxy_string}')  # proxy
    opt.add_argument('blink-settings=imagesEnabled=false')  # do not load images
    opt.add_argument('--disable-infobars')
    opt.add_experimental_option("excludeSwitches", ["enable-automation"])
    opt.add_experimental_option('useAutomationExtension', False)

    driver = webdriver.Chrome(service=service, options=opt)
    driver.implicitly_wait(10)
    with open('driver/hide.js') as f:
        driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source": f.read()})

    # 1. Open JD
    driver.get('https://www.jd.com/')

    # 2. Find the search box and type the query
    tag = driver.find_element(
        By.XPATH,
        '//*[@id="key"]'
    )
    tag.send_keys("iphone手机")

    # 3. Click the search button
    tag = driver.find_element(
        By.XPATH,
        '//*[@id="search"]/div/div[2]/button'
    )
    tag.click()

    # 4. Read the result list
    tag_list = driver.find_elements(
        By.XPATH,
        '//*[@id="J_goodsList"]/ul/li'
    )
    for tag in tag_list:
        # title = tag.find_element(By.XPATH, 'div/div[@class="p-name p-name-type-2"]//em').text
        title = tag.find_element(By.XPATH, 'div/div[@class="p-name p-name-type-2"]/a/em').text
        print(title)

    driver.close()

4. Case study: x麦网 (Damai)

    import time
    import requests
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.service import Service

    # Replace with your own generated proxy API link
    res = requests.get(
        url="https://dps.kdlapi.com/api/getdps/?secret_id=o60wwtxvs5ukaqqz18ai&num=1&signature=i6s9shfjfiogat5ijecbyfwwc5grwrzj&pt=1&format=json&sep=1")
    proxy_string = res.json()['data']['proxy_list'][0]
    print(f"Got proxy: {proxy_string}")

    service = Service("driver/chromedriver.exe")
    opt = webdriver.ChromeOptions()
    opt.add_argument(f'--proxy-server={proxy_string}')  # proxy
    opt.add_argument('blink-settings=imagesEnabled=false')  # do not load images
    opt.add_argument('--disable-infobars')
    opt.add_experimental_option("excludeSwitches", ["enable-automation"])
    opt.add_experimental_option('useAutomationExtension', False)

    driver = webdriver.Chrome(service=service, options=opt)
    driver.implicitly_wait(10)
    with open('driver/hide.js') as f:
        driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source": f.read()})

    # 1. Open Damai
    driver.get('https://www.damai.cn/')

    # 2. Find the search box and type the query
    tag = driver.find_element(
        By.XPATH,
        '//input[@class="input-search"]'
    )
    tag.send_keys("周杰伦")

    # 3. Click the search button
    tag = driver.find_element(
        By.XPATH,
        '//div[@class="btn-search"]'
    )
    tag.click()

    # 4. Read the result list
    tag_list = driver.find_elements(
        By.XPATH,
        '//div[@class="search__itemlist"]//div[@class="items"]'
    )
    for tag in tag_list:
        title = tag.find_element(By.XPATH, 'div[@class="items__txt"]/div[1]/a').text
        print(title)

    time.sleep(2000)
    driver.close()

Without a proxy, frequent requests will trigger a CAPTCHA prompt.
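Since both case studies depend on a working proxy, it can help to validate the proxy API response before launching the browser. The sketch below assumes the JSON shape read in this article's code (`data` containing a `proxy_list` of `ip:port` strings) and builds the `--proxy-server` switch from it. `first_proxy` and `proxy_argument` are hypothetical helpers, and the sample body is hand-written using the example address quoted in section 2.9.

```python
import json
import re

# Dotted IPv4 address followed by a port, e.g. 182.106.136.218:40192
PROXY_PATTERN = re.compile(r"^\d{1,3}(\.\d{1,3}){3}:\d{1,5}$")


def first_proxy(response_text):
    """Return the first 'ip:port' entry from a body shaped like
    {"data": {"proxy_list": ["ip:port", ...]}}."""
    proxy_list = json.loads(response_text)["data"]["proxy_list"]
    if not proxy_list:
        raise ValueError("the proxy list is empty")
    proxy = proxy_list[0]
    if not PROXY_PATTERN.match(proxy):
        raise ValueError(f"unexpected proxy format: {proxy!r}")
    return proxy


def proxy_argument(proxy):
    """Format the Chrome switch that routes traffic through the proxy."""
    return f"--proxy-server={proxy}"


# Hand-written sample body for illustration.
body = '{"data": {"proxy_list": ["182.106.136.218:40192"]}}'
argument = proxy_argument(first_proxy(body))
print(argument)
```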
