Scraping Historical Power-Industry News with Python and Saving It as Markdown

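I have recently been doing some data analysis for the power industry and needed historical news from an industry portal, so today I wrote a script that scrapes historical news articles from that site.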

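The site's news pages are static: the article content is written directly into the HTML rather than loaded afterwards by JavaScript or a framework such as Vue. In principle, then, it is enough to fetch the HTML with the requests library and extract the content with BeautifulSoup.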
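One quick way to confirm that a page really is static is to check whether a phrase you can see in the browser already appears in the raw HTML returned by the server. The sketch below assumes you have copied such a phrase by hand; the function name and both sample strings are hypothetical and stand in for real responses.

```python
def is_in_raw_html(raw_html: str, known_phrase: str) -> bool:
    """Return True if text seen in the browser is already in the server's HTML."""
    return known_phrase in raw_html


# Hypothetical responses: a server-rendered page and a JavaScript shell.
static_page = '<div class="cc-article"><p>Grid investment rose this year.</p></div>'
js_shell = '<div id="app"></div><script src="/bundle.js"></script>'

print(is_in_raw_html(static_page, "Grid investment"))  # True: plain requests is enough
print(is_in_raw_html(js_shell, "Grid investment"))     # False: content is loaded by script
```

If the check fails for a page, the content is probably injected by script, and fetching the HTML alone will not be enough.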

First, import the requests and BeautifulSoup libraries.

Next, set the parameters: the URL of the target page, a User-Agent header that mimics a browser, and the path where the Markdown file will be saved.

Send a GET request with requests, check the response status, and continue only if the request succeeded.
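A minimal sketch of this step wrapped in a function. Instead of comparing the status code by hand it calls `raise_for_status()`, which raises `requests.HTTPError` for 4xx and 5xx responses. The name `fetch_html`, its optional `session` parameter, and the 10-second timeout are choices made for this example, not part of the original script.

```python
import requests


def fetch_html(url, headers=None, timeout=10, session=None):
    """Fetch a page and return its body as bytes; raise on an HTTP error status."""
    # Use the caller's session if one is given, otherwise the module-level API.
    http = session if session is not None else requests
    response = http.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.content
```

Returning bytes rather than `response.text` leaves encoding detection to the HTML parser, which can help with pages whose headers do not declare a charset.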

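Parse the HTML with BeautifulSoup, change the class name to match the div that actually wraps the article on the target page, and then extract the HTML inside that div.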
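The extraction step can be sketched as a small function. The default class name `cc-article` is the one used in the full script for this site and will differ elsewhere; `extract_inner_html` is a hypothetical helper name, and the sample markup is invented for illustration.

```python
from bs4 import BeautifulSoup


def extract_inner_html(html, class_name="cc-article"):
    """Return the inner HTML of the first matching div, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_=class_name)
    if container is None:
        return None
    # decode_contents() renders the children only, without the wrapping <div>.
    return container.decode_contents()


# Invented sample page with a navigation block and an article block.
sample = (
    '<html><body><div class="nav">menu</div>'
    '<div class="cc-article"><p>Hello</p></div></body></html>'
)
print(extract_inner_html(sample))
```

Use `str(container)` instead if the wrapping div itself should be kept.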

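Process the extracted content so that images and hyperlinks are in a form that can be stored in Markdown. Note that if an image or link in the HTML uses a relative path, it must be converted to an absolute one, that is, a complete URL such as http://example.com/images/abc.jpg. The urljoin function handles this.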
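To make the behaviour of `urljoin` concrete, the following sketch resolves several kinds of reference against one base URL. The base URL and the references are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical page URL, used only to show how references resolve.
base_url = "http://example.com/news/2025/article.html"

references = [
    "/images/abc.jpg",                  # root-relative
    "images/abc.jpg",                   # relative to the page's directory
    "../logo.png",                      # one directory up
    "//cdn.example.com/abc.jpg",        # scheme-relative
    "https://other.example.org/x.png",  # already absolute, returned unchanged
]

resolved = [urljoin(base_url, ref) for ref in references]
for ref, full in zip(references, resolved):
    print(f"{ref} -> {full}")
```

Because absolute URLs pass through unchanged, it is safe to apply `urljoin` to every `href` and `src` without checking first.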

Save the data to a file.
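When many articles are scraped, a fixed file name would be overwritten on every run. One possible approach, sketched here as an assumption rather than something the original script does, is to derive the file name from the article URL. Both helper names and the `articles` directory are hypothetical.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def markdown_path_for(url, out_dir="articles"):
    """Build an output path such as articles/1429144.md from an article URL."""
    # URL paths always use forward slashes, so parse them as POSIX paths.
    stem = PurePosixPath(urlparse(url).path).stem or "index"
    return Path(out_dir) / f"{stem}.md"


def save_markdown(text, path):
    """Write Markdown text as UTF-8, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")


print(markdown_path_for("https://news.bjx.com.cn/html/20250226/1429144.shtml").name)
```

Writing with an explicit `encoding="utf-8"` matters here because the article text is Chinese and the platform default encoding may not be UTF-8.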

Overall result

Complete code

import sys
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
import html2text

# Configuration
# URL of the target page
url = 'https://news.bjx.com.cn/html/20250226/1429144.shtml'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
}
# Name of the local output file
output_md = 'output.md'

# Send the request; the timeout keeps the script from hanging on a stalled connection
response = requests.get(url, headers=headers, timeout=10)
if response.status_code != 200:
    print(f'Request failed, status code: {response.status_code}')
    sys.exit(1)

# Parse the HTML; passing the raw bytes lets BeautifulSoup detect the page encoding
soup = BeautifulSoup(response.content, 'html.parser')
# Change the class name here to match the actual structure of the target page
content_div = soup.find('div', class_='cc-article')
if content_div is None:
    print('Content area not found')
    sys.exit(1)

# Resolve relative URLs
base_url = response.url  # final URL of the response, used as the base for joining
for tag in content_div.find_all(['a', 'img']):
    if tag.name == 'a' and tag.has_attr('href'):
        tag['href'] = urljoin(base_url, tag['href'])
    elif tag.name == 'img' and tag.has_attr('src'):
        tag['src'] = urljoin(base_url, tag['src'])

# Serialize the processed HTML
processed_html = str(content_div)

# Convert to Markdown
converter = html2text.HTML2Text()
converter.ignore_links = False  # keep links
converter.wrap_links = False  # do not wrap links across lines
markdown_content = converter.handle(processed_html)

# Save the result
with open(output_md, 'w', encoding='utf-8') as md_file:
    md_file.write(markdown_content)

print(f'Markdown saved to {output_md}')
