Web Scraping Tutorial for Beginners (Part 1): requests and XPath

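Preface:
This post is an introductory tutorial on web scraping with Python. Before starting, you should have a basic understanding of Python syntax and the HTTP protocol.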

I: Introduction to requests
The two most commonly used requests functions are get and post, which correspond to the HTTP GET and POST methods.

requests.get(url)
requests.post(url, data=data)

To get the HTML text of a page: requests.get(url).text
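To make the difference between get and post more concrete, here is a minimal sketch. The helper name fetch_html, the example.com URLs, the parameter values, and the 10-second timeout are all assumptions for illustration and do not come from any real site. The two requests are only built, not sent, so you can inspect what each one would carry without touching the network.

```python
import requests


def fetch_html(url, timeout=10):
    """Hypothetical helper: GET a page and return its body as text."""
    res = requests.get(url, timeout=timeout)
    res.raise_for_status()  # raise an exception on 4xx/5xx status codes
    return res.text


# Build (but do not send) one GET and one POST request.
# GET puts the parameters into the URL's query string.
get_req = requests.Request(
    "GET", "https://example.com/search", params={"q": "xpath"}
).prepare()

# POST puts form data into the request body instead.
post_req = requests.Request(
    "POST", "https://example.com/login", data={"user": "demo"}
).prepare()

print(get_req.method, get_req.url)
print(post_req.method, post_req.body)
```

In a real scraper you would call fetch_html(url) directly. Passing a timeout is a design choice worth considering, since without one a request can wait indefinitely on a server that never answers.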

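II: XPath
1. Introduction
XPath is used to parse a web page and extract the content you want.
The simplest approach is to select the content in the browser, right-click and choose Inspect, then right-click the highlighted node and choose Copy > Copy XPath. The copied path often has problems, though, and can return empty results, so it is still worth learning XPath syntax yourself.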
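// : searches the whole document for nodes that match
/ : searches among the child nodes of the current node
text() : extracts the text of a node
@ : extracts an attribute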

For example:

<bookstore>

<book>
  <title lang="edc">Harry Potter</title>
  <price>29.99</price>
</book>

<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>

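To extract the prices, write //price/text() or /bookstore/book/price/text()
To extract the lang attribute, write //title/@lang or /bookstore/book/title/@lang
When several title tags could be confused, you can write //title[@lang="edc"]/text(), which extracts only Harry Potter. If you write just //title/text(), both book titles are extracted.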
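The expressions above can be tried out directly on the bookstore sample. This is a small sketch that assumes the lxml package is installed; the variable names are my own.

```python
from lxml import etree

xml = """
<bookstore>
  <book>
    <title lang="edc">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
"""

root = etree.fromstring(xml)

# "//" searches the whole document; the full path starts from the root.
prices = root.xpath("//price/text()")
same_prices = root.xpath("/bookstore/book/price/text()")

# "@lang" reads the attribute of every <title>.
langs = root.xpath("//title/@lang")

# A predicate in square brackets narrows the match to one <title>.
one_title = root.xpath('//title[@lang="edc"]/text()')
all_titles = root.xpath("//title/text()")

print(prices)      # ['29.99', '39.95']
print(langs)       # ['edc', 'eng']
print(one_title)   # ['Harry Potter']
print(all_titles)  # ['Harry Potter', 'Learning XML']
```

Note that xpath() always returns a list, even when only one node matches.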
2. Usage

import requests
from lxml import etree

res = requests.get(url)
html = etree.HTML(res.text)
aim = html.xpath('the XPath of the content you want to scrape')
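To see this pattern work without a network connection, here is a minimal sketch that parses an inline HTML string. The markup, the class name "post", and the variable names are made up for illustration. It also shows two things that tend to trip up beginners: each matched element can be queried again with a relative path, and an XPath that matches nothing returns an empty list rather than raising an error.

```python
from lxml import etree

page = """
<html><body>
  <div class="post"><h1> First post </h1><a href="/p/1">read</a></div>
  <div class="post"><h1>Second post</h1><a href="/p/2">read</a></div>
</body></html>
"""

html = etree.HTML(page)

# Select every post block first, then query inside each one.
posts = html.xpath('//div[@class="post"]')
rows = []
for post in posts:
    # A leading "./" keeps the search among this <div>'s own children.
    title = post.xpath("./h1/text()")[0].strip()
    link = post.xpath("./a/@href")[0]
    rows.append((title, link))

# No <div class="nope"> exists, so the result is an empty list.
missing = html.xpath('//div[@class="nope"]/h1/text()')

print(rows)
print(missing)
```

An empty list like the last one is what you typically see when an XPath copied from the browser does not match the HTML the server actually returned.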

III: Example
Scrape the title of a Mafengwo travelogue and print it

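import requests
from lxml import etree

url = 'http://www.mafengwo.cn/i/11243250.html'  # a Mafengwo travelogue
res = requests.get(url)
html = etree.HTML(res.text)
title = html.xpath('//div[@class="vi_con"]/h1/text()')
print(title)  # xpath() returns a list of matching text nodes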
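The same steps can be wrapped into small reusable functions. This is only a sketch: the names TITLE_XPATH, extract_title and scrape_title are hypothetical, the 10-second timeout is an arbitrary choice, and the sample HTML merely imitates the div/h1 structure that the XPath above expects. Whether the live page still uses that structure is not something this sketch can confirm.

```python
import requests
from lxml import etree

# The XPath used in the example above.
TITLE_XPATH = '//div[@class="vi_con"]/h1/text()'


def extract_title(page_source):
    """Return the first title matched by TITLE_XPATH, or None if nothing matches."""
    html = etree.HTML(page_source)
    matches = html.xpath(TITLE_XPATH)
    return matches[0].strip() if matches else None


def scrape_title(url, timeout=10):
    """Download a page and extract its title (performs a real HTTP request)."""
    res = requests.get(url, timeout=timeout)
    res.raise_for_status()  # stop early on 4xx/5xx responses
    return extract_title(res.text)


# Offline check against hand-written HTML that imitates the expected layout.
sample = '<html><body><div class="vi_con"><h1> A sample travelogue </h1></div></body></html>'
print(extract_title(sample))
print(extract_title("<html><body><p>no title here</p></body></html>"))
```

Separating parsing from downloading means the XPath can be tested on saved HTML before any request is sent to the site.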
