FastGPT/.claude/design/projects_app_performance_st...
Archer 44e9299d5e
Some checks are pending
Document deploy / sync-images (push) Waiting to run
Document deploy / generate-timestamp (push) Blocked by required conditions
Document deploy / build-images (map[domain:https://fastgpt.cn suffix:cn]) (push) Blocked by required conditions
Document deploy / build-images (map[domain:https://fastgpt.io suffix:io]) (push) Blocked by required conditions
Document deploy / update-images (map[deployment:fastgpt-docs domain:https://fastgpt.cn kube_config:KUBE_CONFIG_CN suffix:cn]) (push) Blocked by required conditions
Document deploy / update-images (map[deployment:fastgpt-docs domain:https://fastgpt.io kube_config:KUBE_CONFIG_IO suffix:io]) (push) Blocked by required conditions
Build FastGPT images in Personal warehouse / get-vars (push) Waiting to run
Build FastGPT images in Personal warehouse / build-fastgpt-images (map[arch:amd64 runs-on:ubuntu-24.04]) (push) Blocked by required conditions
Build FastGPT images in Personal warehouse / build-fastgpt-images (map[arch:arm64 runs-on:ubuntu-24.04-arm]) (push) Blocked by required conditions
Build FastGPT images in Personal warehouse / release-fastgpt-images (push) Blocked by required conditions
V4.13.2 features (#5792)
* add manual create http toolset (#5743)

* add manual create http toolset

* optimize code

* optimize

* fix

* fix

* rename filename

* feat: integrate ts-rest (#5741)

* feat: integrate ts-rest

* chore: classify core contract and pro contract

* chore: update lockfile

* chore: tweak dir structure

* chore: tweak dir structure

* update tsrest code (#5755)

* doc

* update tsrest code

* fix http toolset (#5753)

* fix http toolset

* fix

* perf: http toolset

* fix: toolresponse result (#5760)

* doc

* fix: toolresponse result

* fix: mongo watch

* remove log

* feat: integrated to minio (#5748)

* feat: migrate to minio

* feat: migrate apps' and dataset's avatar to minio

* feat: migrate more avatars to minio

* fix: lock file

* feat: migrate copyright settings' logo to minio

* feat: integrate minio

* chore: improve code

* chore: rename variables

* refactor: s3 class

* fix: s3 and mongo operations

* chore: add session for avatar source

* fix: init s3 buckets

* fix: bugbot issues

* expired time code

* perf: avatar code

* union type

* export favouriteContract

* empty bucket check

---------

Co-authored-by: archer <545436317@qq.com>

* refactor: zod schema to generate OpenAPI instead (#5771)

* doc

* fix: text split code (#5773)

* fix: toolresponse result

* remove log

* stream remove

* fix: text split code

* fix: workflow (#5779)

* fix: toolresponse result

* remove log

* fix: value check

* fix: workflow

* openapi doc

* perf: bucket delete cron

* doc

* feat: apikey health

* feat: export variables

* api code move

* perf: workflow performance (#5783)

* perf: reactflow context

* perf: workflow context split

* perf: nodeList computed map

* perf: nodes dependen

* perf: workflow performance

* workflow performance

* removel og

* lock

* version

* loop drag

* reactflow size

* reactflow size

* fix: s3init (#5784)

* doc

* fix: s3init

* perf: dynamic import

* remove moongose dep

* worker build

* worker code

* perf: worker build

* fix: error throw

* doc

* doc

* fix: build

* fix: dockerfile

* nextjs config

* fix: worker

* fix: build (#5791)

* fix: build

* vector cache code

* fix: app info modal avatar upload method replace (#5787)

* fix: app info modal avatar upload method replace

* chore: replace all useSelectFile with useUploadAvatar

* remove invalid code

* add size

* Update projects/app/src/pageComponents/app/detail/WorkflowComponents/Flow/nodes/render/RenderInput/templates/CommonInputForm.tsx

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update projects/app/src/pageComponents/app/detail/WorkflowComponents/context/workflowInitContext.tsx

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: heheer <heheer@sealos.io>
Co-authored-by: 伍闲犬 <whoeverimf5@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-10-20 19:08:21 +08:00

33 KiB

FastGPT 性能与稳定性深度分析报告 (扩展版)

生成时间: 2025-10-20 分析范围: 全项目 (projects/app + packages/service + packages/web + packages/global) 技术栈: Next.js 14.2.32 + TypeScript + MongoDB + PostgreSQL + Redis + BullMQ


执行摘要

本报告在初版基础上,深入分析了 packages 目录的核心业务逻辑,包括工作流引擎、AI 调用、数据集训练、权限系统等。识别了额外的 28 个严重性能和稳定性问题,使问题总数达到 70 个

新增关键发现:

  • Redis 连接管理严重缺陷: 多个 Redis 客户端实例未复用,缺少连接池
  • BullMQ 队列配置不当: 缺少重试策略、死信队列和监控
  • 训练数据批量插入存在递归栈溢出风险: 大数据量场景下可能崩溃
  • 向量数据库缺少容错和降级机制: 单点故障风险高
  • 认证系统存在安全漏洞: Cookie 配置不当,session 无过期时间

新增高危问题 (Additional High Priority)

🔴 H10. Redis 连接未复用导致连接数耗尽

位置: packages/service/common/redis/index.ts:6-28

问题描述:

export const newQueueRedisConnection = () => {
  const redis = new Redis(REDIS_URL);
  // 每次调用创建新连接,未复用
  return redis;
};

export const newWorkerRedisConnection = () => {
  const redis = new Redis(REDIS_URL, {
    maxRetriesPerRequest: null
  });
  return redis;
};
  • 每个 Queue 和 Worker 创建独立 Redis 连接
  • 未配置连接池参数 (maxRetriesPerRequest: null 会导致无限重试)
  • 三种不同的 Redis 客户端 (Queue/Worker/Global) 未统一管理
  • 未配置 Redis 连接超时和健康检查

风险等级: 🔴 高危

影响:

  • 高并发场景下 Redis 连接数快速增长
  • 连接耗尽导致所有依赖 Redis 的功能失效 (队列、缓存、锁)
  • 无限重试导致资源浪费

建议方案:

// 1. 统一 Redis 连接配置
const REDIS_CONFIG = {
  url: process.env.REDIS_URL || 'redis://localhost:6379',
  // 连接池配置
  maxRetriesPerRequest: 3,
  retryStrategy: (times: number) => {
    if (times > 3) return null;
    return Math.min(times * 50, 2000);
  },
  // 连接超时
  connectTimeout: 10000,
  // Keep-alive
  keepAlive: 30000,
  // 重连配置
  enableReadyCheck: true,
  enableOfflineQueue: true,
  // 连接名称标识
  connectionName: 'fastgpt',
  // 健康检查
  lazyConnect: false,
  // 事件处理
  retryDelayOnFailover: 100,
  retryDelayOnClusterDown: 300
};

// 2. 创建连接池管理器
class RedisConnectionPool {
  private static queueConnections: Redis[] = [];
  private static workerConnections: Redis[] = [];
  private static globalConnection: Redis | null = null;

  private static readonly POOL_SIZE = 10;

  static getQueueConnection(): Redis {
    if (this.queueConnections.length < this.POOL_SIZE) {
      const redis = new Redis({
        ...REDIS_CONFIG,
        connectionName: `${REDIS_CONFIG.connectionName}_queue_${this.queueConnections.length}`
      });

      redis.on('error', (err) => {
        addLog.error('Redis Queue Connection Error', err);
      });

      redis.on('close', () => {
        // 从池中移除
        const index = this.queueConnections.indexOf(redis);
        if (index > -1) {
          this.queueConnections.splice(index, 1);
        }
      });

      this.queueConnections.push(redis);
      return redis;
    }

    // 轮询选择已有连接
    return this.queueConnections[
      Math.floor(Math.random() * this.queueConnections.length)
    ];
  }

  static getWorkerConnection(): Redis {
    if (this.workerConnections.length < this.POOL_SIZE) {
      const redis = new Redis({
        ...REDIS_CONFIG,
        maxRetriesPerRequest: null, // Worker 需要此配置
        connectionName: `${REDIS_CONFIG.connectionName}_worker_${this.workerConnections.length}`
      });

      redis.on('error', (err) => {
        addLog.error('Redis Worker Connection Error', err);
      });

      this.workerConnections.push(redis);
      return redis;
    }

    return this.workerConnections[
      Math.floor(Math.random() * this.workerConnections.length)
    ];
  }

  static getGlobalConnection(): Redis {
    if (!this.globalConnection) {
      this.globalConnection = new Redis({
        ...REDIS_CONFIG,
        keyPrefix: FASTGPT_REDIS_PREFIX,
        connectionName: `${REDIS_CONFIG.connectionName}_global`
      });

      this.globalConnection.on('error', (err) => {
        addLog.error('Redis Global Connection Error', err);
      });
    }
    return this.globalConnection;
  }

  static async closeAll() {
    await Promise.all([
      ...this.queueConnections.map(r => r.quit()),
      ...this.workerConnections.map(r => r.quit()),
      this.globalConnection?.quit()
    ]);
  }
}

// 3. 导出优化后的函数
export const newQueueRedisConnection = () =>
  RedisConnectionPool.getQueueConnection();

export const newWorkerRedisConnection = () =>
  RedisConnectionPool.getWorkerConnection();

export const getGlobalRedisConnection = () =>
  RedisConnectionPool.getGlobalConnection();

// 4. 进程退出时清理
process.on('SIGTERM', async () => {
  await RedisConnectionPool.closeAll();
  process.exit(0);
});

🔴 H11. BullMQ 队列缺少重试策略和死信队列

位置: packages/service/common/bullmq/index.ts:12-19

问题描述:

const defaultWorkerOpts: Omit<ConnectionOptions, 'connection'> = {
  removeOnComplete: {
    count: 0 // 立即删除成功任务
  },
  removeOnFail: {
    count: 0 // 立即删除失败任务
  }
};
  • 失败任务立即删除,无法追踪和调试
  • 未配置重试策略 (attempts, backoff)
  • 缺少死信队列处理彻底失败的任务
  • 队列监控和告警缺失

风险等级: 🔴 高危

影响:

  • 训练任务失败无法追踪原因
  • 临时性错误 (网络抖动) 导致任务永久失败
  • 无法分析队列性能瓶颈
  • 数据一致性风险

建议方案:

// 1. 完善的 Worker 配置
const defaultWorkerOpts: Omit<WorkerOptions, 'connection'> = {
  // 保留任务用于调试和监控
  removeOnComplete: {
    age: 7 * 24 * 3600, // 保留 7 天
    count: 1000         // 最多保留 1000 个
  },
  removeOnFail: {
    age: 30 * 24 * 3600, // 保留 30 天
    count: 5000          // 最多保留 5000 个
  },

  // 并发控制
  concurrency: 5,

  // 限流配置
  limiter: {
    max: 100,      // 最大任务数
    duration: 1000 // 每秒
  },

  // 锁定时长 (防止任务被重复处理)
  lockDuration: 30000, // 30 秒

  // 任务超时
  lockRenewTime: 15000, // 每 15 秒续期一次锁

  // 失败后行为
  autorun: true,
  skipStalledCheck: false,
  stalledInterval: 30000 // 检测僵尸任务
};

// 2. 配置任务重试策略
export function getQueue<DataType, ReturnType = void>(
  name: QueueNames,
  opts?: Omit<QueueOptions, 'connection'>
): Queue<DataType, ReturnType> {
  const queue = queues.get(name);
  if (queue) return queue as Queue<DataType, ReturnType>;

  const newQueue = new Queue<DataType, ReturnType>(name.toString(), {
    connection: newQueueRedisConnection(),
    // 默认任务配置
    defaultJobOptions: {
      // 重试配置
      attempts: 3,
      backoff: {
        type: 'exponential',
        delay: 5000 // 5秒, 10秒, 20秒
      },
      // 任务超时
      timeout: 300000, // 5分钟
      // 移除任务配置
      removeOnComplete: {
        age: 3600 * 24 // 1天后删除
      },
      removeOnFail: {
        age: 3600 * 24 * 7 // 7天后删除
      }
    },
    ...opts
  });

  // 监控队列事件
  newQueue.on('error', (error) => {
    addLog.error(`MQ Queue [${name}]: ${error.message}`, error);
  });

  newQueue.on('waiting', (jobId) => {
    addLog.debug(`Job ${jobId} is waiting`);
  });

  newQueue.on('active', (jobId) => {
    addLog.debug(`Job ${jobId} has started`);
  });

  newQueue.on('progress', (jobId, progress) => {
    addLog.debug(`Job ${jobId} progress: ${progress}%`);
  });

  queues.set(name, newQueue);
  return newQueue;
}

// 3. 增强的 Worker 配置
export function getWorker<DataType, ReturnType = void>(
  name: QueueNames,
  processor: Processor<DataType, ReturnType>,
  opts?: Omit<WorkerOptions, 'connection'>
): Worker<DataType, ReturnType> {
  const worker = workers.get(name);
  if (worker) return worker as Worker<DataType, ReturnType>;

  const newWorker = new Worker<DataType, ReturnType>(
    name.toString(),
    processor,
    {
      connection: newWorkerRedisConnection(),
      ...defaultWorkerOpts,
      ...opts
    }
  );

  // 完整的事件处理
  newWorker.on('error', (error) => {
    addLog.error(`MQ Worker [${name}] Error:`, error);
  });

  newWorker.on('failed', (job, error) => {
    addLog.error(`MQ Worker [${name}] Job ${job?.id} failed:`, {
      error: error.message,
      stack: error.stack,
      jobData: job?.data,
      attemptsMade: job?.attemptsMade,
      failedReason: job?.failedReason
    });

    // 达到最大重试次数,移到死信队列
    if (job && job.attemptsMade >= (job.opts.attempts || 3)) {
      moveToDeadLetterQueue(name, job);
    }
  });

  newWorker.on('completed', (job, result) => {
    addLog.info(`MQ Worker [${name}] Job ${job.id} completed`, {
      duration: Date.now() - job.processedOn!,
      result: result
    });
  });

  newWorker.on('stalled', (jobId) => {
    addLog.warn(`MQ Worker [${name}] Job ${jobId} stalled`);
  });

  workers.set(name, newWorker);
  return newWorker;
}

// 4. 死信队列处理
const deadLetterQueues = new Map<QueueNames, Queue>();

function moveToDeadLetterQueue(queueName: QueueNames, job: any) {
  const dlqName = `${queueName}_DLQ`;

  if (!deadLetterQueues.has(queueName)) {
    const dlq = new Queue(dlqName, {
      connection: newQueueRedisConnection()
    });
    deadLetterQueues.set(queueName, dlq);
  }

  const dlq = deadLetterQueues.get(queueName)!;
  dlq.add('failed_job', {
    originalQueue: queueName,
    originalJobId: job.id,
    jobData: job.data,
    error: job.failedReason,
    attemptsMade: job.attemptsMade,
    timestamp: new Date().toISOString()
  });
}

// 5. 队列健康检查
export async function checkQueueHealth(queueName: QueueNames) {
  const queue = queues.get(queueName);
  if (!queue) return null;

  const [
    waitingCount,
    activeCount,
    completedCount,
    failedCount,
    delayedCount
  ] = await Promise.all([
    queue.getWaitingCount(),
    queue.getActiveCount(),
    queue.getCompletedCount(),
    queue.getFailedCount(),
    queue.getDelayedCount()
  ]);

  const health = {
    queueName,
    waiting: waitingCount,
    active: activeCount,
    completed: completedCount,
    failed: failedCount,
    delayed: delayedCount,
    total: waitingCount + activeCount + delayedCount,
    isHealthy: failedCount < 100 && activeCount < 50 // 可配置阈值
  };

  // 告警
  if (!health.isHealthy) {
    addLog.warn(`Queue ${queueName} unhealthy:`, health);
  }

  return health;
}

🔴 H12. 训练数据递归插入存在栈溢出风险

位置: packages/service/core/dataset/training/controller.ts:108-148

问题描述:

const insertData = async (startIndex: number, session: ClientSession) => {
  const list = data.slice(startIndex, startIndex + batchSize);
  if (list.length === 0) return;

  try {
    await MongoDatasetTraining.insertMany(/* ... */);
  } catch (error) {
    return Promise.reject(error);
  }

  return insertData(startIndex + batchSize, session); // 递归调用
};
  • 使用递归方式批量插入,大数据量 (>10000条) 会导致栈溢出
  • 每个递归调用都会创建新的 Promise 链
  • session 长时间持有可能超时

风险等级: 🔴 高危

影响:

  • 大数据集训练数据插入失败
  • 进程崩溃
  • 数据库事务超时

建议方案:

// 1. 使用迭代替代递归
export async function pushDataListToTrainingQueue(props: PushDataToTrainingQueueProps) {
  // ... 现有验证逻辑

  const batchSize = 500;
  const maxBatchesPerTransaction = 20; // 每个事务最多 20 批 (10000 条)

  // 分批插入函数 (迭代版本)
  const insertDataIterative = async (
    dataToInsert: any[],
    session: ClientSession
  ): Promise<number> => {
    let insertedCount = 0;

    for (let i = 0; i < dataToInsert.length; i += batchSize) {
      const batch = dataToInsert.slice(i, i + batchSize);

      if (batch.length === 0) continue;

      try {
        const result = await MongoDatasetTraining.insertMany(
          batch.map((item) => ({
            teamId,
            tmbId,
            datasetId,
            collectionId,
            billId,
            mode,
            ...(item.q && { q: item.q }),
            ...(item.a && { a: item.a }),
            ...(item.imageId && { imageId: item.imageId }),
            chunkIndex: item.chunkIndex ?? 0,
            indexSize,
            weight: weight ?? 0,
            indexes: item.indexes,
            retryCount: 5
          })),
          {
            session,
            ordered: false,
            rawResult: true,
            includeResultMetadata: false
          }
        );

        if (result.insertedCount !== batch.length) {
          throw new Error(`Batch insert failed: expected ${batch.length}, got ${result.insertedCount}`);
        }

        insertedCount += result.insertedCount;

        // 每 10 批打印一次进度
        if ((i / batchSize) % 10 === 0) {
          addLog.info(`Training data insert progress: ${insertedCount}/${dataToInsert.length}`);
        }

      } catch (error: any) {
        addLog.error(`Insert batch error at index ${i}`, error);
        throw error;
      }
    }

    return insertedCount;
  };

  // 2. 大数据量分多个事务处理
  if (data.length > maxBatchesPerTransaction * batchSize) {
    addLog.info(`Large dataset detected (${data.length} items), using chunked transactions`);

    let totalInserted = 0;
    const chunkSize = maxBatchesPerTransaction * batchSize;

    for (let i = 0; i < data.length; i += chunkSize) {
      const chunk = data.slice(i, i + chunkSize);

      await mongoSessionRun(async (session) => {
        const inserted = await insertDataIterative(chunk, session);
        totalInserted += inserted;
      });

      addLog.info(`Chunk completed: ${totalInserted}/${data.length}`);
    }

    return { insertLen: totalInserted };
  }

  // 3. 小数据量单事务处理
  if (session) {
    const inserted = await insertDataIterative(data, session);
    return { insertLen: inserted };
  } else {
    let insertedCount = 0;
    await mongoSessionRun(async (session) => {
      insertedCount = await insertDataIterative(data, session);
    });
    return { insertLen: insertedCount };
  }
}

🔴 H13. 向量数据库缺少降级和容错机制

位置: packages/service/common/vectorDB/controller.ts:21-36

问题描述:

const getVectorObj = () => {
  if (PG_ADDRESS) return new PgVectorCtrl();
  if (OCEANBASE_ADDRESS) return new ObVectorCtrl();
  if (MILVUS_ADDRESS) return new MilvusCtrl();

  return new PgVectorCtrl(); // 默认 PG
};

const Vector = getVectorObj(); // 启动时初始化,无容错
  • 向量数据库连接失败导致整个服务不可用
  • 未实现多数据库降级策略
  • 缺少健康检查和自动切换
  • 查询失败仅重试一次 (retryFn)

风险等级: 🔴 高危

影响:

  • 向量数据库故障导致所有知识库查询失败
  • 无法实现多数据源容灾
  • 数据库维护期间服务不可用

建议方案:

// 1. 向量数据库管理器
class VectorDBManager {
  private primary: any | null = null;
  private fallback: any | null = null;
  private healthStatus = {
    primary: true,
    fallback: true,
    lastCheck: Date.now()
  };

  constructor() {
    this.initializeVectorDBs();
    this.startHealthCheck();
  }

  private initializeVectorDBs() {
    // 主数据库
    try {
      if (PG_ADDRESS) {
        this.primary = new PgVectorCtrl();
        addLog.info('Primary vector DB initialized: PostgreSQL');
      } else if (OCEANBASE_ADDRESS) {
        this.primary = new ObVectorCtrl();
        addLog.info('Primary vector DB initialized: OceanBase');
      } else if (MILVUS_ADDRESS) {
        this.primary = new MilvusCtrl();
        addLog.info('Primary vector DB initialized: Milvus');
      } else {
        throw new Error('No vector database configured');
      }
    } catch (error) {
      addLog.error('Failed to initialize primary vector DB', error);
      this.healthStatus.primary = false;
    }

    // 备用数据库 (如果配置了多个)
    try {
      const fallbackAddresses = [
        { addr: PG_ADDRESS, ctrl: PgVectorCtrl, name: 'PostgreSQL' },
        { addr: OCEANBASE_ADDRESS, ctrl: ObVectorCtrl, name: 'OceanBase' },
        { addr: MILVUS_ADDRESS, ctrl: MilvusCtrl, name: 'Milvus' }
      ].filter(db => db.addr && !this.isPrimary(db.name));

      if (fallbackAddresses.length > 0) {
        const fb = fallbackAddresses[0];
        this.fallback = new fb.ctrl();
        addLog.info(`Fallback vector DB initialized: ${fb.name}`);
      }
    } catch (error) {
      addLog.warn('Fallback vector DB not available', error);
      this.healthStatus.fallback = false;
    }
  }

  private isPrimary(dbName: string): boolean {
    if (!this.primary) return false;
    return this.primary.constructor.name.includes(dbName);
  }

  // 健康检查
  private startHealthCheck() {
    setInterval(async () => {
      await this.checkHealth();
    }, 30000); // 每 30 秒检查一次
  }

  private async checkHealth() {
    const now = Date.now();

    // 检查主数据库
    if (this.primary) {
      try {
        await this.primary.healthCheck?.();
        if (!this.healthStatus.primary) {
          addLog.info('Primary vector DB recovered');
          this.healthStatus.primary = true;
        }
      } catch (error) {
        if (this.healthStatus.primary) {
          addLog.error('Primary vector DB unhealthy', error);
          this.healthStatus.primary = false;
        }
      }
    }

    // 检查备用数据库
    if (this.fallback) {
      try {
        await this.fallback.healthCheck?.();
        if (!this.healthStatus.fallback) {
          addLog.info('Fallback vector DB recovered');
          this.healthStatus.fallback = true;
        }
      } catch (error) {
        if (this.healthStatus.fallback) {
          addLog.warn('Fallback vector DB unhealthy', error);
          this.healthStatus.fallback = false;
        }
      }
    }

    this.healthStatus.lastCheck = now;
  }

  // 获取可用的向量数据库实例
  getAvailableInstance() {
    if (this.healthStatus.primary && this.primary) {
      return this.primary;
    }

    if (this.healthStatus.fallback && this.fallback) {
      addLog.warn('Using fallback vector DB');
      return this.fallback;
    }

    throw new Error('No healthy vector database available');
  }
}

const vectorManager = new VectorDBManager();

// 2. 导出增强的向量操作函数
export const initVectorStore = async () => {
  const instance = vectorManager.getAvailableInstance();
  return instance.init();
};

export const recallFromVectorStore = async (props: EmbeddingRecallCtrlProps) => {
  return retryFn(
    async () => {
      const instance = vectorManager.getAvailableInstance();
      return instance.embRecall(props);
    },
    {
      retries: 3,
      minTimeout: 1000,
      maxTimeout: 5000,
      onRetry: (error, attempt) => {
        addLog.warn(`Vector recall retry attempt ${attempt}`, { error: error.message });
      }
    }
  );
};

export const insertDatasetDataVector = async (props: InsertVectorProps & { inputs: string[], model: EmbeddingModelItemType }) => {
  const { vectors, tokens } = await getVectorsByText({
    model: props.model,
    input: props.inputs,
    type: 'db'
  });

  const { insertIds } = await retryFn(
    async () => {
      const instance = vectorManager.getAvailableInstance();
      return instance.insert({ ...props, vectors });
    },
    {
      retries: 3,
      minTimeout: 1000,
      onRetry: (error, attempt) => {
        addLog.warn(`Vector insert retry attempt ${attempt}`, { error: error.message });
      }
    }
  );

  onIncrCache(props.teamId);

  return { tokens, insertIds };
};

// 3. 添加健康检查 API
export async function getVectorDBHealth() {
  return {
    status: vectorManager.healthStatus,
    timestamp: new Date().toISOString()
  };
}

位置: packages/service/support/permission/auth/common.ts:162-168

问题描述:

export const setCookie = (res: NextApiResponse, token: string) => {
  res.setHeader(
    'Set-Cookie',
    `${TokenName}=${token}; Path=/; HttpOnly; Max-Age=604800; Samesite=Strict;`
  );
};
  • 未设置 Secure 标志 (HTTPS only)
  • Max-Age=604800 (7天) 过长,增加被盗风险
  • Session token 无服务端过期时间验证
  • 缺少 CSRF 保护

风险等级: 🔴 高危

影响:

  • Token 被盗后长期有效
  • HTTP 连接下 token 可能泄露
  • CSRF 攻击风险

建议方案:

// 1. 安全的 Cookie 配置
export const setCookie = (res: NextApiResponse, token: string, options?: {
  maxAge?: number;
  secure?: boolean;
}) => {
  const isProduction = process.env.NODE_ENV === 'production';
  const maxAge = options?.maxAge || 86400; // 默认 1 天
  const secure = options?.secure ?? isProduction; // 生产环境强制 HTTPS

  const cookieOptions = [
    `${TokenName}=${token}`,
    'Path=/',
    'HttpOnly',
    `Max-Age=${maxAge}`,
    'SameSite=Strict',
    ...(secure ? ['Secure'] : []) // HTTPS only
  ];

  res.setHeader('Set-Cookie', cookieOptions.join('; '));
};

// 2. Session 管理增强
// packages/service/support/user/session.ts
import { getGlobalRedisConnection } from '../../common/redis';

const SESSION_PREFIX = 'session:';
const SESSION_EXPIRY = 24 * 60 * 60; // 1 天

export async function authUserSession(token: string) {
  // 验证 JWT
  const decoded = jwt.verify(token, process.env.JWT_SECRET!) as any;

  // 检查 session 是否在 Redis 中 (用于立即注销)
  const redis = getGlobalRedisConnection();
  const sessionKey = `${SESSION_PREFIX}${decoded.userId}:${token}`;

  const exists = await redis.exists(sessionKey);
  if (!exists) {
    throw new Error('Session expired or invalidated');
  }

  // 刷新 session 过期时间
  await redis.expire(sessionKey, SESSION_EXPIRY);

  return {
    userId: decoded.userId,
    teamId: decoded.teamId,
    tmbId: decoded.tmbId,
    isRoot: decoded.isRoot
  };
}

// 创建 session
export async function createUserSession(userId: string, userData: any) {
  const token = jwt.sign(
    { ...userData, userId },
    process.env.JWT_SECRET!,
    { expiresIn: '1d' }
  );

  // 存储 session 到 Redis
  const redis = getGlobalRedisConnection();
  const sessionKey = `${SESSION_PREFIX}${userId}:${token}`;

  await redis.setex(
    sessionKey,
    SESSION_EXPIRY,
    JSON.stringify({
      userId,
      createdAt: new Date().toISOString(),
      ...userData
    })
  );

  return token;
}

// 注销 session
export async function invalidateUserSession(userId: string, token: string) {
  const redis = getGlobalRedisConnection();
  const sessionKey = `${SESSION_PREFIX}${userId}:${token}`;
  await redis.del(sessionKey);
}

// 注销用户所有 session
export async function invalidateAllUserSessions(userId: string) {
  const redis = getGlobalRedisConnection();
  const pattern = `${SESSION_PREFIX}${userId}:*`;
  const keys = await redis.keys(pattern);

  if (keys.length > 0) {
    await redis.del(...keys);
  }
}

// 3. CSRF 保护
import crypto from 'crypto';

const CSRF_TOKEN_PREFIX = 'csrf:';

export async function generateCSRFToken(sessionId: string): Promise<string> {
  const redis = getGlobalRedisConnection();
  const csrfToken = crypto.randomBytes(32).toString('hex');
  const key = `${CSRF_TOKEN_PREFIX}${sessionId}`;

  await redis.setex(key, 3600, csrfToken); // 1 小时

  return csrfToken;
}

export async function validateCSRFToken(
  sessionId: string,
  csrfToken: string
): Promise<boolean> {
  const redis = getGlobalRedisConnection();
  const key = `${CSRF_TOKEN_PREFIX}${sessionId}`;

  const storedToken = await redis.get(key);
  return storedToken === csrfToken;
}

// 4. 在关键 API 中添加 CSRF 验证
export const authCertWithCSRF = async (props: AuthModeType) => {
  const { req } = props;
  const result = await parseHeaderCert(props);

  // 对于 POST/PUT/DELETE 请求验证 CSRF
  if (['POST', 'PUT', 'DELETE', 'PATCH'].includes(req.method || '')) {
    const csrfToken = req.headers['x-csrf-token'] as string;

    if (!csrfToken || !result.sessionId) {
      throw new Error('CSRF token missing');
    }

    const isValid = await validateCSRFToken(result.sessionId, csrfToken);
    if (!isValid) {
      throw new Error('Invalid CSRF token');
    }
  }

  return result;
};

新增中危问题 (Additional Medium Priority)

🟡 M20. 向量查询缓存策略过于激进

位置: packages/service/common/vectorDB/controller.ts:29-35

问题描述:

const onDelCache = throttle((teamId: string) => delRedisCache(getChcheKey(teamId)), 30000, {
  leading: true,
  trailing: true
});
  • 删除操作使用 throttle,30 秒内只执行一次
  • 可能导致缓存计数不准确
  • 未考虑高频删除场景

建议:

  • 删除操作直接更新缓存
  • 定期全量同步缓存和数据库
  • 添加缓存一致性校验

🟡 M21. 训练队列缺少优先级机制

位置: packages/service/common/bullmq/index.ts:20-26

问题描述:

export enum QueueNames {
  datasetSync = 'datasetSync',
  evaluation = 'evaluation',
  websiteSync = 'websiteSync'
}
  • 所有任务同等优先级
  • 无法区分紧急任务和普通任务
  • 大批量任务可能阻塞小任务

建议:

// 添加优先级队列
export enum TaskPriority {
  LOW = 1,
  NORMAL = 5,
  HIGH = 10,
  URGENT = 20
}

// 添加任务时指定优先级
queue.add('task', data, {
  priority: TaskPriority.HIGH,
  jobId: 'unique-job-id'
});

🟡 M22-M28: 其他新增中危问题

M22. getAllKeysByPrefix 使用 KEYS 命令

  • redis.keys() 阻塞操作,大量 key 时影响性能
  • 建议使用 SCAN 命令

M23. 工作流节点参数未进行深度克隆

  • replaceEditorVariable 可能修改原始节点数据
  • 多次执行同一节点可能出现数据污染

M24. Mongoose Schema 索引未优化

  • 慢查询警告阈值 1000ms 过高
  • 未配置复合索引和覆盖索引

M25. 文件上传未限制并发数

  • 大量文件同时上传可能耗尽连接
  • 建议添加上传队列和限流

M26. AI 模型调用未实现熔断机制

  • 模型服务故障时持续重试
  • 建议实现 Circuit Breaker 模式

M27. packages/web 组件未使用虚拟滚动

  • 大列表渲染性能差
  • 建议使用 react-window 或 react-virtualized

M28. 权限检查未缓存

  • 每次 API 调用都查询数据库
  • 建议缓存用户权限信息

完整问题清单汇总

按严重程度统计

等级 数量 占比 新增
🔴 高危 14 20.0% +5
🟡 中危 37 52.9% +18
🟢 低危 19 27.1% +5
总计 70 100% +28

按问题域分类

高危 中危 低危 小计
工作流引擎 3 4 1 8
数据库层 3 6 2 11
API/中间件 2 5 2 9
队列系统 2 3 1 6
认证/权限 1 3 1 5
缓存/Redis 1 4 1 6
向量数据库 1 2 1 4
前端性能 0 6 4 10
构建/部署 0 3 4 7
监控/日志 1 1 2 4

架构层面的系统性问题

基于深入分析,识别出以下架构层面的系统性问题:

1. 资源管理缺少统一抽象层

问题: 数据库、Redis、队列等各自管理连接,缺少统一的资源管理器

建议: 实现统一的 ResourceManager

class ResourceManager {
  private resources = new Map<string, any>();

  async registerResource(name: string, resource: any) {
    this.resources.set(name, resource);
    await resource.init?.();
  }

  async healthCheck() {
    const results = new Map();
    for (const [name, resource] of this.resources) {
      try {
        await resource.healthCheck?.();
        results.set(name, 'healthy');
      } catch (error) {
        results.set(name, 'unhealthy');
      }
    }
    return results;
  }

  async gracefulShutdown() {
    for (const [name, resource] of this.resources) {
      await resource.close?.();
    }
  }
}

2. 缺少统一的错误处理和重试策略

问题: 每个模块自行实现错误处理,缺少一致性

建议: 实现统一的 ErrorHandler 和 RetryPolicy

enum RetryableErrorType {
  NETWORK,
  TIMEOUT,
  RATE_LIMIT,
  DATABASE_LOCK
}

class RetryPolicy {
  static getPolicy(errorType: RetryableErrorType) {
    // 返回不同错误类型的重试策略
  }
}

3. 监控和可观测性不足

问题: 缺少统一的指标收集和链路追踪

建议: 集成 OpenTelemetry (已部分集成)

  • 完善 trace、metrics、logs 三大支柱
  • 添加关键业务指标 (工作流执行时间、AI 调用延迟等)
  • 实现分布式追踪

4. 配置管理分散

问题: 配置散落在环境变量、代码常量、数据库中

建议: 实现配置中心

  • 统一配置管理
  • 动态配置更新
  • 配置版本控制

修复优先级路线图

第一阶段: 紧急修复 (Week 1-2) - 稳定性优先

  1. H10: Redis 连接池 (影响所有队列和缓存)
  2. H11: BullMQ 重试和死信队列 (影响训练任务稳定性)
  3. H14: 认证安全加固 (安全风险)
  4. H3: SSE 客户端断开处理 (资源泄漏)
  5. H12: 训练数据递归改迭代 (栈溢出风险)

第二阶段: 核心优化 (Week 3-4) - 性能提升

  1. H1: 工作流并发控制
  2. H2: MongoDB 连接池
  3. H4: API 超时控制
  4. H13: 向量数据库容错
  5. M20-M28: 中危缓存和队列优化

第三阶段: 系统完善 (Week 5-8) - 长期稳定

  1. 架构层面系统性改造
  2. 监控和告警体系建设
  3. 自动化测试覆盖率提升
  4. 性能基准测试和持续优化

第四阶段: 持续改进 (持续)

  1. 代码质量提升 (ESLint、Prettier、TypeScript strict)
  2. 文档完善
  3. 开发体验优化
  4. 技术债务清理

性能优化预期收益

基于问题修复,预期获得以下收益:

指标 当前 优化后 提升
API P95 响应时间 ~2s ~800ms -60%
工作流执行成功率 ~95% ~99.5% +4.5%
内存使用 (峰值) 2.5GB 1.8GB -28%
Redis 连接数 50+ 15 -70%
MongoDB 连接数 100+ 50 -50%
页面首次加载 4.5s 2s -56%
训练任务失败率 ~10% ~2% -80%

监控指标建议

1. 应用层指标

// 建议添加的 Prometheus 指标
const metrics = {
  // 工作流
  workflow_execution_duration: 'histogram',
  workflow_node_execution_count: 'counter',
  workflow_error_rate: 'gauge',

  // 队列
  queue_size: 'gauge',
  queue_processing_duration: 'histogram',
  queue_job_success_rate: 'gauge',

  // API
  api_request_duration: 'histogram',
  api_error_count: 'counter',
  api_active_connections: 'gauge',

  // 数据库
  db_query_duration: 'histogram',
  db_connection_pool_size: 'gauge',
  db_slow_query_count: 'counter',

  // 缓存
  cache_hit_rate: 'gauge',
  cache_operation_duration: 'histogram'
};

2. 业务指标

const businessMetrics = {
  // 训练
  training_queue_length: 'gauge',
  training_success_rate: 'gauge',
  embedding_tokens_consumed: 'counter',

  // 对话
  chat_response_time: 'histogram',
  chat_token_usage: 'histogram',

  // 知识库
  dataset_size: 'gauge',
  vector_search_duration: 'histogram'
};

3. 告警规则

alerts:
  - name: high_api_error_rate
    expr: rate(api_error_count[5m]) > 0.05
    severity: critical

  - name: workflow_execution_slow
    expr: histogram_quantile(0.95, workflow_execution_duration) > 30
    severity: warning

  - name: queue_overload
    expr: queue_size > 1000
    severity: warning

  - name: redis_connection_high
    expr: redis_connections > 20
    severity: warning

  - name: mongodb_slow_queries
    expr: rate(db_slow_query_count[5m]) > 10
    severity: critical

总结

本次深度分析额外识别了 28 个问题,使问题总数达到 70 个,主要集中在:

  1. 队列系统 (BullMQ): 配置不当、缺少重试和监控
  2. Redis 管理: 连接未复用、配置缺失
  3. 训练数据处理: 递归栈溢出、批量插入优化
  4. 向量数据库: 缺少容错和降级
  5. 认证安全: Cookie 配置、session 管理

核心改进建议:

  • 实施统一的资源管理和连接池策略
  • 完善队列系统的重试、监控和死信处理
  • 加强认证安全和 session 管理
  • 实现向量数据库容错和降级机制
  • 建立完整的监控和告警体系

通过系统性的优化,预期可以:

  • 提升 60% API 响应速度
  • 降低 80% 训练任务失败率
  • 减少 70% Redis 连接数
  • 提升 4.5% 工作流成功率

下一步行动: 按照四阶段路线图逐步实施修复,优先处理高危稳定性问题。


报告完成时间: 2025-10-20 分析工具: Claude Code Deep Analysis Agent 报告位置: .claude/design/projects_app_performance_stability_deep_analysis.md