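P01: Financial Data Acquisition
This notebook handles the data acquisition tasks:
- Create the project directory structure
- Download daily quotes for 10 stocks (backward-adjusted)
- Download the CSI 300 and ChiNext indices
- Download the CPI and M2 macro indicators
- Download financial indicator data
1. Install dependencies
!pip install akshare baostock pandas statsmodels matplotlib seaborn pyarrow sqlalchemy -q
Installs akshare and baostock (data acquisition), pandas (data processing), statsmodels (regression analysis), matplotlib/seaborn (visualization), and supporting libraries. Once the packages are installed in the environment, this cell does not need to be run again.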
2. Create the project directory structure
import os

project_root = "dshw-p01"
directories = [
    project_root,
    f"{project_root}/data/stock",
    f"{project_root}/data/index",
    f"{project_root}/data/macro",
    f"{project_root}/data/finance",
    f"{project_root}/data/clean",
    f"{project_root}/data/combined",
    f"{project_root}/output",
]
for dir_path in directories:
    os.makedirs(dir_path, exist_ok=True)
    print(f"✓ {dir_path}")
print(f"\nCreated {len(directories)} directories")

✓ dshw-p01
✓ dshw-p01/data/stock
✓ dshw-p01/data/index
✓ dshw-p01/data/macro
✓ dshw-p01/data/finance
✓ dshw-p01/data/clean
✓ dshw-p01/data/combined
✓ dshw-p01/output

Created 8 directories
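Following the assignment specification, os.makedirs creates the project tree automatically: the root directory dshw-p01, six data subdirectories (stock, index, macro, finance, clean, combined), and the output directory output. Passing exist_ok=True lets the cell be re-run without raising an error when the directories already exist.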
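As a minimal sketch of the same idea, the tree can also be built with pathlib and checked for idempotence. The helper name create_tree is hypothetical, and the example runs inside a temporary directory so it does not touch the real project folder.

```python
import tempfile
from pathlib import Path


def create_tree(root: Path, data_subdirs: list[str]) -> list[Path]:
    """Create root, root/data/<sub> for each sub, and root/output; return every path."""
    paths = [root] + [root / "data" / sub for sub in data_subdirs] + [root / "output"]
    for path in paths:
        # parents=True creates the intermediate "data" directory;
        # exist_ok=True makes repeated calls harmless.
        path.mkdir(parents=True, exist_ok=True)
    return paths


DATA_SUBDIRS = ["stock", "index", "macro", "finance", "clean", "combined"]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "dshw-p01"
    created = create_tree(root, DATA_SUBDIRS)
    create_tree(root, DATA_SUBDIRS)  # second call changes nothing
    all_exist = all(p.is_dir() for p in created)
    print(len(created), all_exist)
```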
3. Define the stock list
import pandas as pd

# Names and industry labels are kept in Chinese as data values; English glosses are in the comments.
stock_list = [
    {"code": "000001", "name": "平安银行", "industry": "银行"},    # Ping An Bank, banking
    {"code": "600036", "name": "招商银行", "industry": "银行"},    # China Merchants Bank, banking
    {"code": "600519", "name": "贵州茅台", "industry": "白酒"},    # Kweichow Moutai, liquor (baijiu)
    {"code": "000858", "name": "五粮液", "industry": "白酒"},      # Wuliangye, liquor (baijiu)
    {"code": "600048", "name": "保利发展", "industry": "房地产"},  # Poly Developments, real estate
    {"code": "000002", "name": "万科A", "industry": "房地产"},     # China Vanke A, real estate
    {"code": "601857", "name": "中国石油", "industry": "能源"},    # PetroChina, energy
    {"code": "600900", "name": "长江电力", "industry": "能源"},    # China Yangtze Power, energy
    {"code": "002594", "name": "比亚迪", "industry": "汽车"},      # BYD, automotive
    {"code": "600050", "name": "中国联通", "industry": "通讯"},    # China Unicom, telecom
]
stocks_df = pd.DataFrame(stock_list)
print("Stock list (10 stocks)")
print("=" * 50)
print(stocks_df.to_string(index=False))
print("\nIndustry distribution:")
print(stocks_df["industry"].value_counts().to_string())

Stock list (10 stocks)
==================================================
code name industry
000001 平安银行 银行
600036 招商银行 银行
600519 贵州茅台 白酒
000858 五粮液 白酒
600048 保利发展 房地产
000002 万科A 房地产
601857 中国石油 能源
600900 长江电力 能源
002594 比亚迪 汽车
600050 中国联通 通讯

Industry distribution:
industry
银行 2
白酒 2
房地产 2
能源 2
汽车 1
通讯 1
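This cell defines the stock selection: 10 stocks across 6 industries. Banking, liquor, real estate, and energy have 2 stocks each, and automotive and telecom have 1 each, which satisfies the requirement of at most 2 stocks per industry. The list is stored as dictionaries holding each stock's code, name, and industry, so the download loops in later sections can iterate over it.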
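The "at most 2 per industry" rule can be checked mechanically rather than by eye. The sketch below copies the code and industry pairs from the list above. The function name industries_over_limit and the constant MAX_PER_INDUSTRY are hypothetical.

```python
from collections import Counter

MAX_PER_INDUSTRY = 2  # limit stated in the assignment requirement

# (code, industry) pairs copied from stock_list
stocks = [
    ("000001", "银行"), ("600036", "银行"),
    ("600519", "白酒"), ("000858", "白酒"),
    ("600048", "房地产"), ("000002", "房地产"),
    ("601857", "能源"), ("600900", "能源"),
    ("002594", "汽车"), ("600050", "通讯"),
]


def industries_over_limit(pairs, limit):
    """Return {industry: count} for every industry with more than `limit` stocks."""
    counts = Counter(industry for _, industry in pairs)
    return {industry: n for industry, n in counts.items() if n > limit}


violations = industries_over_limit(stocks, MAX_PER_INDUSTRY)
codes = [code for code, _ in stocks]
has_duplicate_codes = len(codes) != len(set(codes))
print("violations:", violations, "| duplicate codes:", has_duplicate_codes)
```

An empty violations dictionary means the selection respects the limit; adding a third stock to any industry would make that industry appear in the result.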
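4. Define the download logging function
from datetime import datetime
import time

def log_download(log_file, status, data_name, info=""):
    """Append one timestamped entry to log_file and echo it to the console."""
    ts = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    # Status is padded to 7 characters and the data name to 20 so entries line up
    line = f"[{ts}] {status:7s} {data_name:20s} {info}\n"
    with open(log_file, "a", encoding="utf-8") as f:
        f.write(line)
    print(line.strip())

This cell defines log_download(), which is called once per download. It records a timestamp (to the second), a status (SUCCESS or FAILED), the data name, and additional information such as the data shape or the error message. The log file lives in the project root, which makes it easy to check later which downloads failed. Because the file is opened in append mode, re-running a download cell adds new entries instead of overwriting earlier ones.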
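To make the entry layout concrete, the sketch below formats one entry with the same f-string padding and parses it back with a regular expression. The names format_entry, parse_entry, and LOG_LINE are hypothetical helpers, not part of the notebook.

```python
import re

# Pattern for "[timestamp] STATUS data_name info", tolerant of the padding spaces
LOG_LINE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\s+"
    r"(?P<status>SUCCESS|FAILED)\s+"
    r"(?P<name>\S+)\s*(?P<info>.*)$"
)


def format_entry(ts, status, data_name, info=""):
    """Build one log entry using the same field widths as log_download()."""
    return f"[{ts}] {status:7s} {data_name:20s} {info}"


def parse_entry(line):
    """Return the fields of one log entry as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


entry = format_entry("2026-04-04 21:23:33", "SUCCESS", "stock_000001", "shape=(1514, 7)")
parsed = parse_entry(entry)
print(repr(entry))
print(parsed)
```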
5. Download daily stock quotes (backward-adjusted)
import baostock as bs

# Log in to baostock
bs.login()

start_date = "20200101"
end_date = "20260403"
log_file = f"{project_root}/download_log.txt"

print("Downloading daily quotes for 10 stocks (backward-adjusted)...")
print("=" * 50)
for stock in stock_list:
    code = stock["code"]
    name = stock["name"]
    # baostock code format: sh.600519 or sz.000001
    # In stock_list, every code starting with "0" is Shenzhen-listed; the rest are Shanghai-listed
    if code.startswith("0"):
        bs_code = f"sz.{code}"
    else:
        bs_code = f"sh.{code}"
    try:
        # Request backward-adjusted data from baostock (dates must be YYYY-MM-DD)
        rs = bs.query_history_k_data_plus(
            bs_code,
            "date,open,high,low,close,volume,amount",
            start_date=f"{start_date[:4]}-{start_date[4:6]}-{start_date[6:]}",
            end_date=f"{end_date[:4]}-{end_date[4:6]}-{end_date[6:]}",
            frequency="d",
            adjustflag="1",  # baostock: 1 = backward-adjusted, 2 = forward-adjusted, 3 = unadjusted
        )
        data_list = []
        while rs.next():
            data_list.append(rs.get_row_data())
        df = pd.DataFrame(data_list, columns=rs.fields)
        if len(df) == 0:
            raise Exception("No data returned")
        # Save as CSV
        out_path = f"{project_root}/data/stock/stock_{code}.csv"
        df.to_csv(out_path, index=False, encoding="utf-8-sig")
        log_download(log_file, "SUCCESS", f"stock_{code}", f"shape={df.shape}")
    except Exception as e:
        log_download(log_file, "FAILED", f"stock_{code}", f"Error: {e}")
    time.sleep(1)  # pause between requests

# Log out of baostock
bs.logout()
print("Stock quote download complete")

login success!
Downloading daily quotes for 10 stocks (backward-adjusted)...
==================================================
[2026-04-04 21:23:33] SUCCESS stock_000001 shape=(1514, 7)
[2026-04-04 21:23:44] SUCCESS stock_600036 shape=(1514, 7)
[2026-04-04 21:23:59] SUCCESS stock_600519 shape=(1514, 7)
[2026-04-04 21:24:18] SUCCESS stock_000858 shape=(1514, 7)
[2026-04-04 21:24:28] SUCCESS stock_600048 shape=(1514, 7)
[2026-04-04 21:24:35] SUCCESS stock_000002 shape=(1514, 7)
[2026-04-04 21:24:52] SUCCESS stock_601857 shape=(1514, 7)
[2026-04-04 21:25:05] SUCCESS stock_600900 shape=(1514, 7)
[2026-04-04 21:25:13] SUCCESS stock_002594 shape=(1514, 7)
[2026-04-04 21:26:04] SUCCESS stock_600050 shape=(1514, 7)
logout success!
Stock quote download complete
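bs.query_history_k_data_plus() downloads backward-adjusted daily quotes for each stock from 2020-01-01 to 2026-04-03. Backward-adjusted prices account for dividends and share distributions, so the price series stays continuous and comparable over time. Each stock is saved to data/stock/stock_XXXXXX.csv, and the log records the download status and data shape for each one: 1514 rows × 7 columns in this run, covering a little over six years of trading days. The loop pauses 1 second between requests to avoid triggering rate limits.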
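The code and date conversions in the loop can be factored into small helpers. This is a sketch under a stated assumption: codes beginning with 6 are treated as Shanghai-listed and codes beginning with 0 or 3 as Shenzhen-listed, and any other prefix is rejected rather than guessed. The names to_bs_code and to_bs_date are hypothetical.

```python
from datetime import datetime


def to_bs_code(code: str) -> str:
    """Convert a 6-digit stock code to baostock's 'sh.XXXXXX' / 'sz.XXXXXX' form.

    Assumption: '6' prefix -> Shanghai, '0' or '3' prefix -> Shenzhen.
    """
    if len(code) != 6 or not code.isdigit():
        raise ValueError(f"expected a 6-digit code, got {code!r}")
    if code[0] == "6":
        return f"sh.{code}"
    if code[0] in ("0", "3"):
        return f"sz.{code}"
    raise ValueError(f"no exchange rule for code {code!r}")


def to_bs_date(compact: str) -> str:
    """Convert 'YYYYMMDD' to the 'YYYY-MM-DD' form used in the query."""
    return datetime.strptime(compact, "%Y%m%d").strftime("%Y-%m-%d")


for sample in ["600519", "000001", "002594"]:
    print(sample, "->", to_bs_code(sample))
print(to_bs_date("20200101"), to_bs_date("20260403"))
```

The prefix rule applies to stock codes only. Index codes follow a different convention (the CSI 300 is sh.000300 even though it starts with 0), so they are better written out explicitly.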
6. Download market index data
print("Downloading market indices...")
print("=" * 50)
bs.login()

# CSI 300
try:
    rs = bs.query_history_k_data_plus(
        "sh.000300",
        "date,open,high,low,close,volume,amount",
        start_date="2020-01-01",
        end_date="2026-04-03",
        frequency="d"
    )
    data_list = []
    while rs.next():
        data_list.append(rs.get_row_data())
    df = pd.DataFrame(data_list, columns=rs.fields)
    df.to_csv(f"{project_root}/data/index/index_000300.csv",
              index=False, encoding="utf-8-sig")
    log_download(log_file, "SUCCESS", "index_000300", f"shape={df.shape}")
    print(f" CSI 300: {df.shape}")
except Exception as e:
    log_download(log_file, "FAILED", "index_000300", f"Error: {e}")

# ChiNext Index
try:
    rs = bs.query_history_k_data_plus(
        "sz.399006",
        "date,open,high,low,close,volume,amount",
        start_date="2020-01-01",
        end_date="2026-04-03",
        frequency="d"
    )
    data_list = []
    while rs.next():
        data_list.append(rs.get_row_data())
    df = pd.DataFrame(data_list, columns=rs.fields)
    df.to_csv(f"{project_root}/data/index/index_399006.csv",
              index=False, encoding="utf-8-sig")
    log_download(log_file, "SUCCESS", "index_399006", f"shape={df.shape}")
    print(f" ChiNext Index: {df.shape}")
except Exception as e:
    log_download(log_file, "FAILED", "index_399006", f"Error: {e}")

bs.logout()
print("Index data download complete")

Downloading market indices...
==================================================
login success!
[2026-04-04 21:36:14] SUCCESS index_000300 shape=(1514, 7)
 CSI 300: (1514, 7)
[2026-04-04 21:36:18] SUCCESS index_399006 shape=(1514, 7)
 ChiNext Index: (1514, 7)
logout success!
Index data download complete
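Downloads daily quotes for the CSI 300 (baostock code sh.000300) and the ChiNext Index (baostock code sz.399006). The CSI 300 is the required market benchmark for the CAPM regression and represents large-cap blue chips in the A-share market. The ChiNext Index represents growth stocks, so the two indices can be used to compare how different market segments perform. The data covers the same 2020-01-01 to 2026-04-03 window and is saved under data/index/.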
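To illustrate why a market benchmark is needed, here is a minimal sketch of the CAPM slope estimate: beta is the covariance of stock and market returns divided by the variance of market returns. The return series are made-up numbers chosen so that the answer is known in advance, the risk-free rate is assumed to be zero, and the function names are hypothetical. The later analysis would presumably use statsmodels on real returns instead.

```python
def simple_returns(prices):
    """Convert a price series into period-over-period simple returns."""
    return [curr / prev - 1.0 for prev, curr in zip(prices, prices[1:])]


def capm_fit(stock_returns, market_returns):
    """Ordinary least squares fit of stock = alpha + beta * market; returns (alpha, beta)."""
    n = len(market_returns)
    if n != len(stock_returns) or n < 2:
        raise ValueError("need two return series of equal length >= 2")
    mean_m = sum(market_returns) / n
    mean_s = sum(stock_returns) / n
    cov = sum((m - mean_m) * (s - mean_s) for m, s in zip(market_returns, stock_returns))
    var = sum((m - mean_m) ** 2 for m in market_returns)
    beta = cov / var
    alpha = mean_s - beta * mean_m
    return alpha, beta


# Illustrative data only: the stock is constructed as 0.001 + 1.5 * market
market = [0.01, -0.02, 0.03, 0.00]
stock = [0.001 + 1.5 * m for m in market]
alpha, beta = capm_fit(stock, market)
print(f"alpha={alpha:.4f}, beta={beta:.4f}")
```

With real data, market would come from simple_returns() applied to the CSI 300 closing prices and stock from the same function applied to one stock's closing prices over matching dates.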
7. Download macroeconomic indicators
import akshare as ak

print("Downloading macroeconomic indicators...")
print("=" * 50)
try:
    cpi = ak.macro_china_cpi_yearly()
    print(f" CPI columns: {cpi.columns.tolist()}")
    print(cpi.head(3))
    cpi.to_csv(f"{project_root}/data/macro/macro_cpi.csv",
               index=False, encoding="utf-8-sig")
    log_download(log_file, "SUCCESS", "macro_cpi", f"shape={cpi.shape}")
except Exception as e:
    log_download(log_file, "FAILED", "macro_cpi", f"Error: {e}")
try:
    m2 = ak.macro_china_money_supply()
    print(f" M2 columns: {m2.columns.tolist()}")
    print(m2.head(3))
    m2.to_csv(f"{project_root}/data/macro/macro_m2.csv",
              index=False, encoding="utf-8-sig")
    log_download(log_file, "SUCCESS", "macro_m2", f"shape={m2.shape}")
except Exception as e:
    log_download(log_file, "FAILED", "macro_m2", f"Error: {e}")
print("Macro data download complete")

Downloading macroeconomic indicators...
==================================================
 CPI columns: ['商品', '日期', '今值', '预测值', '前值']
商品 日期 今值 预测值 前值
0 中国CPI年率报告 1986-02-01 7.1 NaN NaN
1 中国CPI年率报告 1986-03-01 7.1 NaN 7.1
2 中国CPI年率报告 1986-04-01 7.1 NaN 7.1
[2026-04-04 21:47:45] SUCCESS macro_cpi shape=(477, 5)
 M2 columns: ['月份', '货币和准货币(M2)-数量(亿元)', '货币和准货币(M2)-同比增长', '货币和准货币(M2)-环比增长', '货币(M1)-数量(亿元)', '货币(M1)-同比增长', '货币(M1)-环比增长', '流通中的现金(M0)-数量(亿元)', '流通中的现金(M0)-同比增长', '流通中的现金(M0)-环比增长']
月份 货币和准货币(M2)-数量(亿元) 货币和准货币(M2)-同比增长 货币和准货币(M2)-环比增长 \
0 2026年02月份 3492159.91 9.0 0.584687
1 2026年01月份 3471860.39 9.0 2.025077
2 2025年12月份 3402948.06 8.5 0.980968
货币(M1)-数量(亿元) 货币(M1)-同比增长 货币(M1)-环比增长 流通中的现金(M0)-数量(亿元) \
0 1159258.82 5.9 -1.731121 151436.41
1 1179680.52 4.9 2.123888 146138.60
2 1155146.50 3.8 2.327986 141261.37
流通中的现金(M0)-同比增长 流通中的现金(M0)-环比增长
0 14.1 3.625196
1 2.7 3.452628
2 10.2 2.833230
[2026-04-04 21:47:45] SUCCESS macro_m2 shape=(218, 10)
Macro data download complete
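Downloads two monthly macro indicators: the year-over-year CPI growth rate (required; reflects the inflation level) and the year-over-year M2 growth rate (elective; reflects how loose or tight monetary policy is). CPI is closely tied to monetary policy expectations: rising inflation can trigger expectations of tightening and lead to a stock market pullback. M2 growth reflects how abundant market liquidity is, and abundant liquidity tends to favor equities. The saved M2 file keeps all 10 columns returned by akshare, including the M1 and M0 series; the M2 year-over-year rate is the column 货币和准货币(M2)-同比增长. Both series are monthly, so they must be frequency-aligned before being merged with the daily stock data.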
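One simple way to do the frequency alignment is to give every trading day the value of its own calendar month. The sketch below first normalizes the month labels seen in the M2 output (for example 2026年02月份) to YYYY-MM keys. The three M2 year-over-year values are taken from the output above; the trading dates are illustrative, and the function names are hypothetical. This same-month mapping ignores publication lag, so an analysis that must avoid look-ahead might shift the monthly series by one month first.

```python
import re
from datetime import date


def month_key_from_label(label: str) -> str:
    """Convert a label such as '2026年02月份' into the key '2026-02'."""
    match = re.fullmatch(r"(\d{4})年(\d{1,2})月份?", label.strip())
    if not match:
        raise ValueError(f"unrecognized month label: {label!r}")
    return f"{int(match.group(1)):04d}-{int(match.group(2)):02d}"


def align_monthly_to_daily(trading_days, monthly_values):
    """Pair each trading day with the value of its calendar month (None if missing)."""
    return [(day, monthly_values.get(day.strftime("%Y-%m"))) for day in trading_days]


# M2 year-over-year growth (%), from the first three rows of the M2 output
m2_rows = [("2026年02月份", 9.0), ("2026年01月份", 9.0), ("2025年12月份", 8.5)]
m2_yoy = {month_key_from_label(label): value for label, value in m2_rows}

# Illustrative dates; the real list would come from the stock CSV "date" column
days = [date(2025, 12, 31), date(2026, 1, 5), date(2026, 2, 27), date(2026, 3, 2)]
aligned = align_monthly_to_daily(days, m2_yoy)
for day, value in aligned:
    print(day.isoformat(), value)
```

Days whose month has no published value yet come back as None, which makes the gap explicit instead of silently filling it.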
8. Download financial indicator data (long format)
print("Downloading financial indicators...")
print("=" * 50)
bs.login()
finance_records = []
for stock in stock_list:
    code = stock["code"]
    name = stock["name"]
    # baostock code format
    if code.startswith("0"):
        bs_code = f"sz.{code}"
    else:
        bs_code = f"sh.{code}"
    for year in range(2020, 2026):  # 2020 through 2025
        for quarter in [1, 2, 3, 4]:
            try:
                rs = bs.query_profit_data(
                    code=bs_code,
                    year=str(year),
                    quarter=str(quarter)
                )
                while rs.next():
                    row = rs.get_row_data()
                    if row and len(row) > 2:
                        # Fields 0-2 are skipped; the remaining fields are stored as indicators
                        for j, val in enumerate(row):
                            if val and val != "--" and j > 2:
                                col_name = rs.fields[j]
                                try:
                                    finance_records.append({
                                        "code": code,
                                        "name": name,
                                        "year": year,
                                        "quarter": quarter,
                                        "indicator": col_name,
                                        "value": float(val)
                                    })
                                except ValueError:
                                    pass  # skip values that are not numeric
                # Indentation was lost in the export; this placement matches the recorded
                # counts: stop at the first quarter whose query completes (one per year)
                break
            except Exception:
                pass  # a failed query falls through to the next quarter
    stock_records = [r for r in finance_records if r["code"] == code]
    print(f" {name} ({code}): {len(stock_records)} financial records")
    # Log FAILED when nothing was collected instead of always reporting SUCCESS
    status = "SUCCESS" if stock_records else "FAILED"
    log_download(log_file, status, f"finance_{code}", f"records={len(stock_records)}")
    time.sleep(0.5)
bs.logout()
if finance_records:
    fin_df = pd.DataFrame(finance_records)
    fin_df.to_csv(f"{project_root}/data/finance/finance_ratios.csv",
                  index=False, encoding="utf-8-sig")
    print(f"Financial data saved: {len(fin_df)} records, {fin_df['indicator'].nunique()} indicators")
else:
    print("No valid financial data retrieved")

Downloading financial indicators...
==================================================
login success!
 平安银行 (000001): 36 financial records
[2026-04-04 21:47:59] SUCCESS finance_000001 records=36
 招商银行 (600036): 36 financial records
[2026-04-04 21:48:03] SUCCESS finance_600036 records=36
 贵州茅台 (600519): 42 financial records
[2026-04-04 21:48:05] SUCCESS finance_600519 records=42
 五粮液 (000858): 42 financial records
[2026-04-04 21:48:16] SUCCESS finance_000858 records=42
 保利发展 (600048): 42 financial records
[2026-04-04 21:48:25] SUCCESS finance_600048 records=42
 万科A (000002): 42 financial records
[2026-04-04 21:48:28] SUCCESS finance_000002 records=42
 中国石油 (601857): 42 financial records
[2026-04-04 21:48:39] SUCCESS finance_601857 records=42
 长江电力 (600900): 42 financial records
[2026-04-04 21:48:56] SUCCESS finance_600900 records=42
 比亚迪 (002594): 42 financial records
[2026-04-04 21:49:02] SUCCESS finance_002594 records=42
 中国联通 (600050): 42 financial records
[2026-04-04 21:49:25] SUCCESS finance_600050 records=42
logout success!
Financial data saved: 408 records, 7 indicators
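Calls baostock's bs.query_profit_data() for each of the 10 stocks over the years 2020 through 2025 and reshapes the results into long format (code, name, year, quarter, indicator, value). In long format each row holds one indicator for one stock in one reporting period, which makes it easy to group and aggregate by period or by indicator. Missing and non-numeric values are skipped, and all records are written to a single file, data/finance/finance_ratios.csv. The recorded counts (42 records per stock, 36 for the two banks, 7 distinct indicators overall) are consistent with one reporting period per year across the six years rather than all 24 quarters, so coverage should be verified during cleaning if full quarterly data is needed.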
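The grouping advantage of long format can be sketched without any library: the same records pivot into one row per stock and period, or aggregate per indicator, with a few lines each. The indicator names and values below are made-up placeholders, and both function names are hypothetical. In pandas the equivalent operations would be a pivot and a groupby on the saved CSV.

```python
from collections import defaultdict


def long_to_wide(records):
    """Pivot long records into {(code, year, quarter): {indicator: value}}."""
    wide = defaultdict(dict)
    for rec in records:
        key = (rec["code"], rec["year"], rec["quarter"])
        wide[key][rec["indicator"]] = rec["value"]
    return dict(wide)


def mean_by_indicator(records):
    """Average each indicator across all stocks and periods."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        sums[rec["indicator"]] += rec["value"]
        counts[rec["indicator"]] += 1
    return {name: sums[name] / counts[name] for name in sums}


# Placeholder records in the same shape as finance_records (values are illustrative)
records = [
    {"code": "600519", "year": 2024, "quarter": 1, "indicator": "roeAvg", "value": 0.10},
    {"code": "600519", "year": 2024, "quarter": 1, "indicator": "npMargin", "value": 0.50},
    {"code": "000001", "year": 2024, "quarter": 1, "indicator": "roeAvg", "value": 0.03},
]
wide = long_to_wide(records)
means = mean_by_indicator(records)
print(wide)
print(means)
```

A stock that lacks an indicator simply has no entry for it in the wide view, which is why long format tolerates uneven indicator coverage across stocks.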
9. View the download log
print("=" * 50)
print("Download log contents")
print("=" * 50)
with open(log_file, "r", encoding="utf-8") as f:
    content = f.read()
print(content)
# Substring counts: every entry contains exactly one status keyword
success_count = content.count("SUCCESS")
failed_count = content.count("FAILED")
print(f"\nSummary: SUCCESS {success_count} | FAILED {failed_count}")

==================================================
Download log contents
==================================================
[2026-04-04 21:23:33] SUCCESS stock_000001 shape=(1514, 7)
[2026-04-04 21:23:44] SUCCESS stock_600036 shape=(1514, 7)
[2026-04-04 21:23:59] SUCCESS stock_600519 shape=(1514, 7)
[2026-04-04 21:24:18] SUCCESS stock_000858 shape=(1514, 7)
[2026-04-04 21:24:28] SUCCESS stock_600048 shape=(1514, 7)
[2026-04-04 21:24:35] SUCCESS stock_000002 shape=(1514, 7)
[2026-04-04 21:24:52] SUCCESS stock_601857 shape=(1514, 7)
[2026-04-04 21:25:05] SUCCESS stock_600900 shape=(1514, 7)
[2026-04-04 21:25:13] SUCCESS stock_002594 shape=(1514, 7)
[2026-04-04 21:26:04] SUCCESS stock_600050 shape=(1514, 7)
[2026-04-04 21:36:14] SUCCESS index_000300 shape=(1514, 7)
[2026-04-04 21:36:18] SUCCESS index_399006 shape=(1514, 7)
[2026-04-04 21:36:35] SUCCESS macro_cpi shape=(477, 5)
[2026-04-04 21:36:36] SUCCESS macro_m2 shape=(218, 10)
[2026-04-04 21:44:53] SUCCESS finance_000001 records=36
[2026-04-04 21:45:07] SUCCESS finance_600036 records=36
[2026-04-04 21:45:17] SUCCESS finance_600519 records=42
[2026-04-04 21:45:26] SUCCESS finance_000858 records=42
[2026-04-04 21:45:32] SUCCESS finance_600048 records=42
[2026-04-04 21:45:45] SUCCESS finance_000002 records=42
[2026-04-04 21:45:56] SUCCESS finance_601857 records=42
[2026-04-04 21:46:04] SUCCESS finance_600900 records=42
[2026-04-04 21:46:09] SUCCESS finance_002594 records=42
[2026-04-04 21:46:20] SUCCESS finance_600050 records=42

Summary: SUCCESS 24 | FAILED 0
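Reads the full contents of download_log.txt. Each entry has the form [timestamp] SUCCESS/FAILED data_name, followed by the data shape or the error message. Counting the SUCCESS and FAILED entries gives a quick check of data completeness: here all 24 downloads (10 stocks, 2 indices, 2 macro series, 10 financial sets) succeeded. Any failed item must be handled in the cleaning stage, either by switching to another data source or by dropping it.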
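Counting substrings works while every entry contains its status keyword exactly once, but an error message that itself contains the word FAILED would be counted twice. A sketch of a stricter summary reads the status from its own field and also collects the names of failed items. The function name summarize_log is hypothetical, and the FAILED entry in the sample is invented to show the difference.

```python
from collections import Counter


def summarize_log(text):
    """Count entries by status field and list the data names that failed."""
    counts = Counter()
    failed_names = []
    for line in text.splitlines():
        if not line.startswith("[") or "] " not in line:
            continue  # not a log entry
        rest = line.split("] ", 1)[1]
        fields = rest.split(None, 2)  # status, data name, remaining info
        if len(fields) < 2:
            continue
        status, name = fields[0], fields[1]
        counts[status] += 1
        if status == "FAILED":
            failed_names.append(name)
    return counts, failed_names


# The second entry is hypothetical; the other two are copied from the log above
sample_log = (
    "[2026-04-04 21:23:33] SUCCESS stock_000001 shape=(1514, 7)\n"
    "[2026-04-04 21:23:44] FAILED  stock_600036 Error: request FAILED after retry\n"
    "[2026-04-04 21:36:14] SUCCESS index_000300 shape=(1514, 7)\n"
)
counts, failed_names = summarize_log(sample_log)
naive_failed = sample_log.count("FAILED")
print(dict(counts), failed_names, naive_failed)
```

The list of failed names is the natural input for a retry loop or for deciding which series to drop during cleaning.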
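10. Notebook completion status
| Task | Status |
|---|---|
| Create directory structure | Done |
| Download daily quotes for 10 stocks (backward-adjusted) | Done |
| Download CSI 300 and ChiNext Index | Done |
| Download CPI and M2 macro indicators | Done |
| Download financial indicators (long format) | Done |
| Record download log | Done |
Next step: run 02_clean.ipynb to clean the data.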