Python JD.com phone scraper - Python Forum

编程小猪

Level: Newcomer
Posts: 33
Expert points: 4
Registered: 2022-10-17
Thread close rate: 100%

Original poster

Question points: 0  Replies: 6

Python JD.com phone scraper

The code is as follows:
import csv  # built-in module
import parsel
import requests
from lxml import etree

# Browser-like request headers shared by both functions
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36 SLBrowser/8.0.0.9231 SLBChan/30'
}

def get_html(page):
    # NOTE: `page` is accepted but not yet used when building the URL
    url = 'https://search.'
    response = requests.get(url=url, headers=HEADERS)
    # print(response.text)
    html = parsel.Selector(response.text)
    # print(html)
    return html

def parse_data(selector):
    # .getall() returns the data-sku values as strings; without it, css()
    # yields Selector objects and str() would give their repr, not the SKU
    skus = selector.css('li::attr(data-sku)').getall()
    # href = selector.css('.p-img a::attr(href)').getall()
    res = []
    for sku in skus:
        # https://item.
        index_url = 'https://item.' + str(sku) + '.html'
        response_1 = requests.get(url=index_url, headers=HEADERS)
        selector_1 = parsel.Selector(response_1.text)
        print(selector_1)
        # This selects the first entry of the parameter list on the detail page
        price = selector_1.css('ul.parameter2.p-parameter-list li:nth-child(1)::text').get()
        # #detail > div.tab-con > div:nth-child(1) > div.p-parameter > ul.parameter2.p-parameter-list > li:nth-child(1)
        print(price)
        res.append(price)
    return res  # the original returned nothing, so `res` in __main__ was None

if __name__ == '__main__':
    page = 1
    html = get_html(page)
    res = parse_data(html)

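Question: following my approach, how should I extract the value of the title attribute of the <li> tags? Or am I not actually getting this page's HTML at all?
Why can't I scrape JD's prices and reviews?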

[This post was edited by the author on 2022-11-2 22:44]

2022-11-02 16:33

fall_bernana

Level: VIP
Reputation: 17
Posts: 244
Expert points: 2106
Registered: 2019-8-16

Post #3

Score: 0

Reply to original poster 编程小猪

Take a look at response_1.text: it does not contain this content, because what you want is loaded dynamically by JavaScript. You need to use selenium to scrape it.
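
One way to confirm this before switching tools is to look for the target element in the static HTML, and only then render the page in a browser. The following is a minimal sketch under stated assumptions, not JD-specific code: `count_elements_with_class` and `fetch_rendered_html` are hypothetical helper names, the sample HTML and the class name `p-price` are made up for illustration, and the CSS selector to wait for has to be chosen by inspecting the real page. selenium is imported inside the function so that the static check runs even where selenium is not installed.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_elements_with_class(html_text, class_name):
    """Return how many elements in html_text carry class_name (hypothetical helper)."""
    parser = ClassCounter(class_name)
    parser.feed(html_text)
    return parser.count


def fetch_rendered_html(url, wait_css, timeout=10):
    """Load url in a real browser and return the HTML after JavaScript has run.

    Assumes selenium and a Chrome driver are installed. wait_css is a CSS
    selector for an element that only appears once the dynamic content loads.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until the dynamically inserted element exists (or time out)
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()


# Made-up static HTML: the list container is present but empty, as it would be
# when its items are filled in later by JavaScript.
static_html = '<ul class="parameter2 p-parameter-list"></ul><div id="price"></div>'
print(count_elements_with_class(static_html, "p-parameter-list"))
print(count_elements_with_class(static_html, "p-price"))
```

If the count for the element you need is zero in `response_1.text`, the string returned by `fetch_rendered_html` can be passed to `parsel.Selector` in place of the requests response text.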

2022-11-10 09:51

时光流逝

From: Beijing
Level: Professional Swordsman
Reputation: 8
Posts: 102
Expert points: 317
Registered: 2019-11-16

Post #5

Score: 0

Reply to original poster 编程小猪

Code:

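from bs4 import BeautifulSoup

buf = resp.text  # resp is the requests.Response you already fetched
soup = BeautifulSoup(buf, 'html.parser')
for src in soup.find_all('li'):
    title = src.get("title")  # None when the <li> has no title attribute
    # type(title) != "NoneType" compares a type with a string and is always
    # True, so test against None directly
    if title is not None:
        # do whatever you need here
        print(title)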
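
The same extraction can be tried on a small hand-written snippet first, which makes it easier to tell a selector problem apart from a page that simply lacks the data. The sketch below uses only the standard library `html.parser`; `LiTitleCollector`, `li_titles` and the sample HTML are hypothetical names and data introduced for illustration.

```python
from html.parser import HTMLParser


class LiTitleCollector(HTMLParser):
    """Collects the title attribute of every <li> start tag that has one."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag != "li":
            return
        title = dict(attrs).get("title")
        if title is not None:  # skip <li> elements without a title attribute
            self.titles.append(title)


def li_titles(html_text):
    """Return the title attributes of all <li> elements in html_text."""
    parser = LiTitleCollector()
    parser.feed(html_text)
    return parser.titles


# Made-up sample markup, not taken from the real page
sample_html = """
<ul>
  <li title="Brand A">Brand: A</li>
  <li>no title here</li>
  <li title="Model X">Model: X</li>
</ul>
"""
print(li_titles(sample_html))
```

With parsel, as used in the original post, the equivalent query would be `selector.css('li::attr(title)').getall()`. Either way, only titles present in the HTML that was actually downloaded can be found.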