How to get the href value of every <a> tag on a page with Python

Views: 85  Date: 2022-06-20 14:31:38

Let's look at the code.

# -*- coding: utf-8 -*-
# Python 3 (the original header said Python 2.7, but urllib.request exists only in Python 3)
# http://tieba.baidu.com/p/2460150866
# Tag operations

from bs4 import BeautifulSoup
import urllib.request
import re

# If the source is a URL, the page can be read like this:
# html_doc = 'http://tieba.baidu.com/p/2460150866'
# req = urllib.request.Request(html_doc)
# webpage = urllib.request.urlopen(req)
# html = webpage.read()

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
The Dormouse's story
Once upon a time there were three little sisters; and their names were
<a href='http://example.com/elsie' rel='external nofollow' id='xiaodeng'></a>,
<a href='http://example.com/lacie' rel='external nofollow' id='link2'>Lacie</a> and
<a href='http://example.com/tillie' rel='external nofollow' id='link3'>Tillie</a>;
<a href='http://example.com/lacie' rel='external nofollow' id='xiaodeng'>Lacie</a>
and they lived at the bottom of a well.
...
'''

soup = BeautifulSoup(html, 'html.parser')  # document object

# soup.a returns only the first <a> tag:
# print(soup.a)
# <a href='http://example.com/elsie' rel='external nofollow' id='xiaodeng'></a>

for k in soup.find_all('a'):
    print(k)
    # class attribute of the tag; k['class'] would raise KeyError here
    # because none of these tags has a class, so use get() instead
    print(k.get('class'))
    print(k['id'])      # id of the tag
    print(k['href'])    # href of the tag
    print(k.string)     # string inside the tag (None for the empty first tag)

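If an <a> tag contains other tags, for example .., use k.get_text() to extract the text inside it; k.string returns None when the tag has more than one child.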
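To make the difference between k.string and k.get_text() concrete, here is a minimal sketch. The HTML snippet and its ids are made up for illustration; only the BeautifulSoup calls come from the article.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second link wraps its text in <b> and <i> tags.
html = """
<a href="http://example.com/plain" id="link1">Lacie</a>
<a href="http://example.com/nested" id="link2"><b>Til</b><i>lie</i></a>
"""

soup = BeautifulSoup(html, "html.parser")

results = []
for k in soup.find_all("a"):
    # .string is None when the tag has more than one child node,
    # while get_text() joins the text of all descendants
    results.append((k["id"], k.string, k.get_text()))

for row in results:
    print(row)
```

For the nested link, k.string is None while k.get_text() still returns the combined text, so get_text() is the safer choice when the markup inside the link is not known in advance.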

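The following pattern usually works too. It reads the attribute with get(), which returns None instead of raising an error when a tag has no href.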

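from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://baidu.com/'  # placeholder; replace with the page you want to scrape

html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
t1 = soup.find_all('a')
print(t1)

href_list = []
for t2 in t1:
    t3 = t2.get('href')  # None if the tag has no href attribute
    href_list.append(t3)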
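A short sketch of why get() is convenient here, and of one way to skip tags without an href altogether. The HTML snippet is invented for the example; find_all() with href=True is standard BeautifulSoup attribute filtering.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second <a> is a named anchor with no href.
html = """
<a href="http://example.com/elsie">Elsie</a>
<a name="top">Top of page</a>
<a href="http://example.com/tillie">Tillie</a>
"""

soup = BeautifulSoup(html, "html.parser")

# get() returns None for a missing attribute; a["href"] would raise KeyError
with_get = [a.get("href") for a in soup.find_all("a")]

# href=True keeps only the tags that actually have an href attribute
only_links = [a["href"] for a in soup.find_all("a", href=True)]

print(with_get)
print(only_links)
```

The first list keeps a None placeholder for the anchor without an href, while the second list contains only real link targets.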

Addendum: getting any tag and attribute from any page with a Python crawler (including the href attribute of <a> tags)

Let's look at the code.

# coding=utf-8
from bs4 import BeautifulSoup
import requests


# Print the value of attribute `attr` for every `label` tag on the page at `url`
def getHtml(url, label, attr):
    response = requests.get(url)
    response.encoding = 'utf-8'
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    for target in soup.find_all(label):
        # get() returns None when the attribute is missing, so no try/except is needed
        value = target.get(attr)
        if value:
            print(value)


url = 'https://baidu.com/'
label = 'a'
attr = 'href'
getHtml(url, label, attr)
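One possible variation, sketched below, separates parsing from downloading and returns the values instead of printing them, which makes the logic easy to check without network access. The function name extract_attrs and the sample HTML are hypothetical and not part of the original code.

```python
from bs4 import BeautifulSoup


def extract_attrs(html, label, attr):
    """Return the non-empty values of `attr` on every `label` tag in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    values = []
    for target in soup.find_all(label):
        value = target.get(attr)  # None when the attribute is missing
        if value:
            values.append(value)
    return values


# Hypothetical sample page used in place of a real HTTP response
sample = """
<a href="https://example.com/a">A</a>
<a>no href</a>
<img src="/logo.png">
<a href="https://example.com/b">B</a>
"""

links = extract_attrs(sample, "a", "href")
images = extract_attrs(sample, "img", "src")
print(links)
print(images)
```

To use it on a live page, pass requests.get(url).text as the html argument.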

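The above is based on personal experience. I hope it serves as a useful reference, and I hope you will continue to support 好吧啦网. If anything is wrong or not fully thought through, corrections are welcome.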
