Title: Problems setting up a Python crawler framework
廉价的咖啡
Hello, friends:
I'm a beginner doing my best to learn Python, and there are some things I can't figure out that I'd like to ask you about:
I've been feeling my way through installing and using PyCharm, and while running a Scrapy image-crawling project I hit the problem below. I just can't work out what's going on.
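For reference, main.py is just the usual launcher so the project can be run from inside PyCharm; roughly like this (typed from memory, so the spider name may not be exact):

# C:/first/main.py -- run the spider from PyCharm instead of the command line
from scrapy import cmdline

cmdline.execute('scrapy crawl first'.split())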
Program output:
C:\Python27\python.exe C:/first/main.py
2016-10-09 23:19:48 [scrapy] INFO: Scrapy 1.2.0 started (bot: first)
2016-10-09 23:19:48 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'first.spiders', 'SPIDER_MODULES': ['first.spiders'], 'ROBOTSTXT_OBEY': True, 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0', 'BOT_NAME': 'first'}
2016-10-09 23:19:48 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-10-09 23:19:49 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-10-09 23:19:49 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-10-09 23:19:49 [scrapy] INFO: Enabled item pipelines:
[]
2016-10-09 23:19:49 [scrapy] INFO: Spider opened
2016-10-09 23:19:49 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-09 23:19:49 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-10-09 23:19:49 [scrapy] ERROR: Error downloading <GET https:///robots.txt>: Empty domain
Traceback (most recent call last):
  File "C:\Python27\Lib\site-packages\twisted\internet\defer.py", line 1105, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "C:\Python27\Lib\site-packages\twisted\python\failure.py", line 389, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "C:\Python27\lib\site-packages\scrapy-1.2.0-py2.7.egg\scrapy\core\downloader\middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "C:\Python27\lib\site-packages\scrapy-1.2.0-py2.7.egg\scrapy\utils\defer.py", line 45, in mustbe_deferred
    result = f(*args, **kw)
  File "C:\Python27\lib\site-packages\scrapy-1.2.0-py2.7.egg\scrapy\core\downloader\handlers\__init__.py", line 65, in download_request
    return handler.download_request(request, spider)
  File "C:\Python27\lib\site-packages\scrapy-1.2.0-py2.7.egg\scrapy\core\downloader\handlers\http11.py", line 60, in download_request
    return agent.download_request(request)
  File "C:\Python27\lib\site-packages\scrapy-1.2.0-py2.7.egg\scrapy\core\downloader\handlers\http11.py", line 285, in download_request
    method, to_bytes(url, encoding='ascii'), headers, bodyproducer)
  File "C:\Python27\Lib\site-packages\twisted\web\client.py", line 1470, in request
    parsedURI.port)
  File "C:\Python27\Lib\site-packages\twisted\web\client.py", line 1450, in _getEndpoint
    tlsPolicy = self._policyForHTTPS.creatorForNetloc(host, port)
  File "C:\Python27\lib\site-packages\scrapy-1.2.0-py2.7.egg\scrapy\core\downloader\contextfactory.py", line 57, in creatorForNetloc
    return ScrapyClientTLSOptions(hostname.decode("ascii"), self.getContext())
  File "C:\Python27\Lib\site-packages\twisted\internet\_sslverify.py", line 1059, in __init__
    self._hostnameBytes = _idnaBytes(hostname)
  File "C:\Python27\Lib\site-packages\twisted\internet\_sslverify.py", line 86, in _idnaBytes
    return idna.encode(text).encode("ascii")
  File "C:\Python27\Lib\site-packages\idna\core.py", line 350, in encode
    raise IDNAError('Empty domain')
IDNAError: Empty domain
2016-10-09 23:19:49 [scrapy] ERROR: Error downloading <GET https:///%20//%20www.
Traceback (most recent call last):
  File "C:\Python27\Lib\site-packages\twisted\internet\defer.py", line 1105, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "C:\Python27\Lib\site-packages\twisted\python\failure.py", line 389, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "C:\Python27\lib\site-packages\scrapy-1.2.0-py2.7.egg\scrapy\core\downloader\middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "C:\Python27\lib\site-packages\scrapy-1.2.0-py2.7.egg\scrapy\utils\defer.py", line 45, in mustbe_deferred
    result = f(*args, **kw)
  File "C:\Python27\lib\site-packages\scrapy-1.2.0-py2.7.egg\scrapy\core\downloader\handlers\__init__.py", line 65, in download_request
    return handler.download_request(request, spider)
  File "C:\Python27\lib\site-packages\scrapy-1.2.0-py2.7.egg\scrapy\core\downloader\handlers\http11.py", line 60, in download_request
    return agent.download_request(request)
  File "C:\Python27\lib\site-packages\scrapy-1.2.0-py2.7.egg\scrapy\core\downloader\handlers\http11.py", line 285, in download_request
    method, to_bytes(url, encoding='ascii'), headers, bodyproducer)
  File "C:\Python27\Lib\site-packages\twisted\web\client.py", line 1470, in request
    parsedURI.port)
  File "C:\Python27\Lib\site-packages\twisted\web\client.py", line 1450, in _getEndpoint
    tlsPolicy = self._policyForHTTPS.creatorForNetloc(host, port)
  File "C:\Python27\lib\site-packages\scrapy-1.2.0-py2.7.egg\scrapy\core\downloader\contextfactory.py", line 57, in creatorForNetloc
    return ScrapyClientTLSOptions(hostname.decode("ascii"), self.getContext())
  File "C:\Python27\Lib\site-packages\twisted\internet\_sslverify.py", line 1059, in __init__
    self._hostnameBytes = _idnaBytes(hostname)
  File "C:\Python27\Lib\site-packages\twisted\internet\_sslverify.py", line 86, in _idnaBytes
    return idna.encode(text).encode("ascii")
  File "C:\Python27\Lib\site-packages\idna\core.py", line 350, in encode
    raise IDNAError('Empty domain')
IDNAError: Empty domain
2016-10-09 23:19:49 [scrapy] INFO: Closing spider (finished)
2016-10-09 23:19:49 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 2,
 'downloader/exception_type_count/idna.core.IDNAError': 2,
 'downloader/request_bytes': 539,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 10, 9, 15, 19, 49, 706000),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 2,
 'log_count/INFO': 7,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 10, 9, 15, 19, 49, 256000)}
2016-10-09 23:19:49 [scrapy] INFO: Spider closed (finished)
What does this output in PyCharm actually mean? I've been stuck on it for two days and still haven't figured it out; it's about to drive me crazy. Could you all please help?

The code link is at http://www.
2016-10-09 23:38
cpxuvs
The INFO lines above look normal.
IDNAError: Empty domain
Check the URL you wrote.
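For example, a quick way to check is to parse the URL and see whether it actually has a host part (just a sketch, written for Python 2 to match your setup):

# Parse the URL and confirm the domain part is really there.
from urlparse import urlparse  # Python 2; on Python 3 it lives in urllib.parse

url = 'https:///robots.txt'  # the malformed URL from your log
parsed = urlparse(url)
if not parsed.netloc:
    print('Empty domain in URL: %r' % url)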
2016-10-11 22:02
廉价的咖啡
Reply to #2, cpxuvs
Hi! Thanks so much for your reply; it gave me hope again. I've swapped in several different URLs and it still doesn't work. What I want to ask is this: I created the project folder from cmd with scrapy startproject doubanbook, and then, also from cmd, generated a spider inside that folder with a URL in it, but in my code I'm using a different address. Could that cause this kind of result?
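Roughly, the spider file looks like this (typed from memory, so the file name, class name and URL are only approximate):

# first/spiders/book.py -- rough reconstruction, not the exact file
import scrapy

class BookSpider(scrapy.Spider):
    name = 'doubanbook'
    # this URL was typed in by hand, not the one genspider filled in
    start_urls = ['https:// www. / doulist / 1264675 /']

    def parse(self, response):
        pass  # parsing code omitted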

2016-10-12 20:54
XHunter
Reply to #3, 廉价的咖啡
OP, I've run into this problem too. It's been almost a year; are you still around? Did you ever solve it? Desperately hoping for an answer.
2017-10-01 19:51
ipaomi
Look at this line of the log:
2016-10-09 23:19:49 [scrapy] ERROR: Error downloading <GET https:///%20//%20www.
The crawler is requesting https:///%20//%20www.
If you paste that URL into a browser, you can't open it either.
What you actually want to request is https://www., but for some reason some useless spaces got into the URL (after encoding they become %20),
so you need to strip the spaces out of the URL in your code.
Suppose:
url = 'https:// www. / doulist / 1264675 /'
url = url.replace(' ', '')  # strip the stray spaces out of the URL
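Applied to the spider itself, it would look something like this (only a sketch; the class name, spider name and URL list are placeholders, not your actual code):

import scrapy

class DoulistSpider(scrapy.Spider):
    name = 'doulist'
    # strip stray spaces out of every start URL before Scrapy requests it
    start_urls = [u.replace(' ', '') for u in [
        'https:// www. / doulist / 1264675 /',
    ]]

    def parse(self, response):
        pass  # your parsing logic goes here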

2017-10-18 17:23