How to Scrape Web Page Data with Python - 图灵python

Ways to scrape web page data with Python:

I. Open a website with webbrowser.open():

>>> import webbrowser
>>> webbrowser.open('http://i.firefoxchina.cn/?from=worldindex')
True

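1. Read the command-line arguments from sys.argv: open a new file editor window, enter the code below, and save it as map.py.

2. Read the clipboard contents:

3. Call the webbrowser.open() function to open the external browser: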

#! python3
import webbrowser, sys, pyperclip

if len(sys.argv) > 1:
    # Get the address from the command line.
    mapAddress = ' '.join(sys.argv[1:])
else:
    # Get the address from the clipboard.
    mapAddress = pyperclip.paste()
webbrowser.open('http://map.baidu.com/?newmap=1&ie=utf-8&s=s%39wd%3D' + mapAddress)
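The script above appends the address to the URL unchanged, so spaces and non-ASCII characters reach the browser unescaped. The following is a minimal sketch, assuming percent-encoding is wanted here, that uses the standard library's urllib.parse.quote. The helper names get_address and build_map_url are hypothetical and not part of the original script; the base URL is copied as-is from the article.

```python
from urllib.parse import quote

# Base URL copied unchanged from the article's script.
BASE_URL = 'http://map.baidu.com/?newmap=1&ie=utf-8&s=s%39wd%3D'


def get_address(argv, read_clipboard):
    """Return the address from the command line, else from the clipboard.

    argv is a list shaped like sys.argv; read_clipboard is any zero-argument
    callable, for example pyperclip.paste (hypothetical wiring).
    """
    if len(argv) > 1:
        return ' '.join(argv[1:])
    return read_clipboard()


def build_map_url(address):
    """Percent-encode the address and append it to the base URL."""
    return BASE_URL + quote(address, safe='')
```

In the script, the result of build_map_url(get_address(sys.argv, pyperclip.paste)) could then be passed to webbrowser.open().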

II. Download files from the web with the requests module: requests does not ship with Python; install it by running pip install requests on the command line.

>>> import requests
>>> res = requests.get('http://i.firefoxchina.cn/?from=worldindex')  # pass the site's URL to get()
>>> type(res)  # the Response object
<class 'requests.models.Response'>
>>> print(res.status_code)  # the HTTP status code
200
>>> res.text  # the returned text

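When downloading a file, call raise_for_status(): it raises an exception if the download failed, so the program only carries on with other work once the download has really succeeded.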

import requests

res = requests.get('http://i.firefoxchina.cn/?from=worldindex')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))
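For a failed status code, raise_for_status() raises requests.exceptions.HTTPError, so the handler can catch that class instead of the broad Exception. The sketch below illustrates this with hand-built Response objects so that it runs without network access; the helper name download_ok and the example.invalid URL are assumptions made for illustration only.

```python
import requests


def download_ok(res):
    """Return True if the response indicates success, else print why not."""
    try:
        res.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        print('There was a problem: %s' % exc)
        return False
    return True


# Build responses by hand so the sketch runs without network access.
missing = requests.Response()
missing.status_code = 404
missing.url = 'http://example.invalid/missing'

found = requests.Response()
found.status_code = 200

results = (download_ok(missing), download_ok(found))
```

With a real download, the object returned by requests.get() would be passed to download_ok() in the same way.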

III. Save the downloaded file locally:

>>> import requests
>>> res = requests.get('http://tech.firefox.sina.com/17/0820/10/6DKQALVRW5JHGE.html##0-tsina-1-13074-397232819ff9a7a7e80a40613cfe1')
>>> res.raise_for_status()
>>> file = open('1.txt', 'wb')  # open in write-binary mode to preserve the Unicode encoding of the text
>>> for word in res.iter_content(100000):  # each iteration yields one chunk of bytes; the argument is the chunk size in bytes
...     file.write(word)
...
16997
>>> file.close()
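The same chunked write can be sketched with a with statement, which closes the file even when a write fails. To stay runnable offline, the example feeds a plain list of bytes in place of res.iter_content(100000); the helper name save_chunks and the sample chunks are hypothetical.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of bytes chunks to path; return total bytes written."""
    total = 0
    # The with statement closes the file even if a write fails.
    with open(path, 'wb') as f:
        for chunk in chunks:
            total += f.write(chunk)
    return total


# Stand-in for res.iter_content(100000), so the sketch runs offline.
fake_chunks = [b'<html>', b'<body>hi</body>', b'</html>']
path = os.path.join(tempfile.mkdtemp(), '1.txt')
written = save_chunks(fake_chunks, path)
```

With a real response, the call would be save_chunks(res.iter_content(100000), '1.txt').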

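IV. Parse HTML with the BeautifulSoup module: install it from the command line with pip install beautifulsoup4.

1. The bs4.BeautifulSoup() function can parse the HTML that requests.get() fetches from a website, and it can also parse a locally saved HTML file that you open() directly.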

>>> import requests, bs4
>>> res = requests.get('http://i.firefoxchina.cn/?from=worldindex')
>>> res.raise_for_status()
>>> soup = bs4.BeautifulSoup(res.text)

Warning (from warnings module):
  File "C:\Users\King\AppData\Local\Programs\Python\Python36-32\lib\site-packages\beautifulsoup4-4.6.0-py3.6.egg\bs4\__init__.py", line 181
    markup_type=markup_type))
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 1 of the file <string>. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "html.parser")

>>> soup = bs4.BeautifulSoup(res.text, 'html.parser')
>>> type(soup)
<class 'bs4.BeautifulSoup'>

The first call produced the warning shown above, so I added the second argument, 'html.parser'.

>>> import bs4
>>> html = open('C:\\Users\\King\\Desktop\\1.htm')
>>> exampleSoup = bs4.BeautifulSoup(html, 'html.parser')  # name the parser to avoid the warning
>>> type(exampleSoup)
<class 'bs4.BeautifulSoup'>
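When parsing a local file, it can help to name the file encoding and to let a with block close the file. The sketch below assumes a UTF-8 page and writes a small hypothetical file first, so it does not depend on the 1.htm file from the session above; the file name example.htm and its contents are made up for illustration.

```python
import os
import tempfile

import bs4

# Write a small hypothetical page so the sketch does not depend on 1.htm.
page = '<html><head><title>Example page</title></head><body><p>Hello</p></body></html>'
path = os.path.join(tempfile.mkdtemp(), 'example.htm')
with open(path, 'w', encoding='utf-8') as f:
    f.write(page)

# Name the encoding and the parser explicitly; the with block closes the file.
with open(path, encoding='utf-8') as html:
    exampleSoup = bs4.BeautifulSoup(html, 'html.parser')

title = exampleSoup.title.string
```

Note that a file object can only be read once, so passing the same open file to BeautifulSoup() a second time would give an empty document.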

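2. Find elements with select(): pass a CSS selector as a string to select(), and it returns the matching elements of the web page, for example:

soup.select('p'): all elements named <p>;

soup.select('#author'): the element whose id attribute is author;

soup.select('.notice'): all elements whose CSS class attribute is notice;

soup.select('p span'): all <span> elements inside a <p> element;

soup.select('input[name]'): all elements named <input> that have a name attribute, whatever its value;

soup.select('input[type="button"]'): all elements named <input> that have a type attribute with the value button.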
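The selectors listed above can be tried on a small page held in a string. This is only a sketch: the markup, including its ids, classes and input names, is hypothetical and chosen so that every selector matches something.

```python
import bs4

# Hypothetical markup; the ids, classes and names are made up for illustration.
html = '''
<html><body>
<p>Written by <span id="author">King</span></p>
<p class="notice">Site <span>maintenance</span> tonight</p>
<div class="notice">Second notice</div>
<form>
  <input name="q" type="text"/>
  <input type="button" value="Go"/>
  <input type="submit"/>
</form>
</body></html>
'''
soup = bs4.BeautifulSoup(html, 'html.parser')

# Number of elements each selector matches; select() always returns a list.
counts = {
    'p': len(soup.select('p')),
    '#author': len(soup.select('#author')),
    '.notice': len(soup.select('.notice')),
    'p span': len(soup.select('p span')),
    'input[name]': len(soup.select('input[name]')),
    'input[type="button"]': len(soup.select('input[type="button"]')),
}
author_text = soup.select('#author')[0].get_text()
```

Because select() returns a list even for an id selector, the single match is taken with [0] before reading its text.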

>>> import requests, bs4
>>> res = requests.get('http://i.firefoxchina.cn/?from=worldindex')
>>> res.raise_for_status()
>>> soup = bs4.BeautifulSoup(res.text, 'html.parser')
>>> author = soup.select('#author')
>>> print(author)
[]
>>> type(author)
<class 'list'>
>>> link = soup.select('link')
>>> print(link)
[<link href="css/mozMainStyle-min.css?v=20170705" rel="stylesheet" type="text/css"/>, <link href="" id="moz-skin" rel="stylesheet" type="text/css"/>, <link href="" id="moz-dir" rel="stylesheet" type="text/css"/>, <link href="" id="moz-ver" rel="stylesheet" type="text/css"/>]
>>> type(link)
<class 'list'>
>>> len(link)
4
>>> type(link[0])
<class 'bs4.element.Tag'>
>>> link[0]
<link href="css/mozMainStyle-min.css?v=20170705" rel="stylesheet" type="text/css"/>
>>> link[0].attrs
{'rel': ['stylesheet'], 'type': 'text/css', 'href': 'css/mozMainStyle-min.css?v=20170705'}

3. Get data from an element's attributes: this continues from the code above.

>>> link[0].get('href')
'css/mozMainStyle-min.css?v=20170705'
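The href returned here is relative to the page, so it usually has to be combined with the page URL before it can be downloaded. A minimal sketch, assuming the standard library's urllib.parse.urljoin is acceptable for that step; the single <link> tag is modeled on the output above so that the example runs offline.

```python
from urllib.parse import urljoin

import bs4

page_url = 'http://i.firefoxchina.cn/?from=worldindex'
# One <link> tag modeled on the article's output.
html = '<link href="css/mozMainStyle-min.css?v=20170705" rel="stylesheet" type="text/css"/>'
link = bs4.BeautifulSoup(html, 'html.parser').select('link')

href = link[0].get('href')          # value of the href attribute
missing = link[0].get('id')         # get() returns None for an absent attribute
absolute = urljoin(page_url, href)  # resolve the relative href against the page URL
```

Since get() returns None for a missing attribute, it is a safer choice than indexing with link[0]['id'], which raises KeyError when the attribute is absent.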
