Title: "Illegal multibyte sequence" error in a web scraper
往生
bs4_text.py

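from bs4 import BeautifulSoup
file = open("text.txt", 'r')  # no encoding given, so the locale default (GBK here) is used
context = file.read()  # the UnicodeDecodeError shown in the result is raised here
soup = BeautifulSoup(context, "html.parser")

links = soup.find_all("a")
for link in links:
    print(link.name, link["href"], link.get_text())
file.close()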


text.txt
<html>
<head>
    <meta http-equiv=Content-Type content="text/html;charset=utf-8">
</head>
<body>
    <h1>标题1</h1>
    <h2>标题2</h2>
    <h3>标题3</h3>
    <h4>标题4</h4>

<div id="content" class="default">
    <p>段落</p>
    <a href="https://www.baidu.com">百度</a>
    <img src="https://www.,png"/>
</div>

</body>

</html>

Result
Traceback (most recent call last):
  File "C:\Users\86177\Desktop\didi\编程\爬虫\bs4_text.py", line 3, in <module>
    context=file.read()
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa2 in position 109: illegal multibyte sequence
[Finished in 437ms]
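
The traceback shows file.read() decoding with the 'gbk' codec, while text.txt declares charset=utf-8 in its meta tag. Assuming the file really is saved as UTF-8, the likely cause is that open() without an encoding argument falls back to the locale default, which is GBK on a Chinese-language Windows system. Below is a minimal sketch of the failure and the fix using only the standard library. The temporary file and its shortened content are stand-ins made up for the demonstration, not the original text.txt.

```python
import os
import tempfile

# Shortened stand-in for text.txt, saved as UTF-8 (assumed, per the meta tag).
html = '<html><body><h1>标题1</h1><a href="https://www.baidu.com">百度</a></body></html>'

with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(html.encode("utf-8"))
    path = tmp.name

try:
    # Reading UTF-8 bytes with the GBK codec reproduces the kind of error in the post.
    try:
        with open(path, "r", encoding="gbk") as f:
            f.read()
        gbk_failed = False
    except UnicodeDecodeError as exc:
        gbk_failed = True
        print("gbk:", exc.reason)

    # Naming the encoding explicitly makes the read independent of the locale.
    with open(path, "r", encoding="utf-8") as f:
        context = f.read()
    print("utf-8 ok:", "标题1" in context)
finally:
    os.remove(path)
```

Applied to bs4_text.py, the corresponding change would be file = open("text.txt", "r", encoding="utf-8"). If the file was in fact saved in some other encoding, that encoding's name should be passed instead.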
2023-01-21 14:50



To join the discussion, please visit the original thread: https://bbs.bccn.net/thread-511142-1-1.html



