15308555518/ FastGPT

Fork 0

mirror of https://github.com/labring/FastGPT.git synced 2025-12-25 20:02:47 +00:00

Archer 44e9299d5e

Document deploy / sync-images (push) Waiting to run

Details

Document deploy / generate-timestamp (push) Blocked by required conditions

Details

Document deploy / build-images (map[domain:https://fastgpt.cn suffix:cn]) (push) Blocked by required conditions

Details

Document deploy / build-images (map[domain:https://fastgpt.io suffix:io]) (push) Blocked by required conditions

Details

Document deploy / update-images (map[deployment:fastgpt-docs domain:https://fastgpt.cn kube_config:KUBE_CONFIG_CN suffix:cn]) (push) Blocked by required conditions

Details

Document deploy / update-images (map[deployment:fastgpt-docs domain:https://fastgpt.io kube_config:KUBE_CONFIG_IO suffix:io]) (push) Blocked by required conditions

Details

Build FastGPT images in Personal warehouse / get-vars (push) Waiting to run

Details

Build FastGPT images in Personal warehouse / build-fastgpt-images (map[arch:amd64 runs-on:ubuntu-24.04]) (push) Blocked by required conditions

Details

Build FastGPT images in Personal warehouse / build-fastgpt-images (map[arch:arm64 runs-on:ubuntu-24.04-arm]) (push) Blocked by required conditions

Details

Build FastGPT images in Personal warehouse / release-fastgpt-images (push) Blocked by required conditions

Details

V4.13.2 features (#5792 )

* add manual create http toolset (#5743)

* add manual create http toolset

* optimize code

* optimize

* fix

* fix

* rename filename

* feat: integrate ts-rest (#5741)

* feat: integrate ts-rest

* chore: classify core contract and pro contract

* chore: update lockfile

* chore: tweak dir structure

* chore: tweak dir structure

* update tsrest code (#5755)

* doc

* update tsrest code

* fix http toolset (#5753)

* fix http toolset

* fix

* perf: http toolset

* fix: toolresponse result (#5760)

* doc

* fix: toolresponse result

* fix: mongo watch

* remove log

* feat: integrated to minio (#5748)

* feat: migrate to minio

* feat: migrate apps' and dataset's avatar to minio

* feat: migrate more avatars to minio

* fix: lock file

* feat: migrate copyright settings' logo to minio

* feat: integrate minio

* chore: improve code

* chore: rename variables

* refactor: s3 class

* fix: s3 and mongo operations

* chore: add session for avatar source

* fix: init s3 buckets

* fix: bugbot issues

* expired time code

* perf: avatar code

* union type

* export favouriteContract

* empty bucket check

---------

Co-authored-by: archer <545436317@qq.com>

* refactor: zod schema to generate OpenAPI instead (#5771)

* doc

* fix: text split code (#5773)

* fix: toolresponse result

* remove log

* stream remove

* fix: text split code

* fix: workflow (#5779)

* fix: toolresponse result

* remove log

* fix: value check

* fix: workflow

* openapi doc

* perf: bucket delete cron

* doc

* feat: apikey health

* feat: export variables

* api code move

* perf: workflow performance (#5783)

* perf: reactflow context

* perf: workflow context split

* perf: nodeList computed map

* perf: nodes dependen

* perf: workflow performance

* workflow performance

* removel og

* lock

* version

* loop drag

* reactflow size

* reactflow size

* fix: s3init (#5784)

* doc

* fix: s3init

* perf: dynamic import

* remove moongose dep

* worker build

* worker code

* perf: worker build

* fix: error throw

* doc

* doc

* fix: build

* fix: dockerfile

* nextjs config

* fix: worker

* fix: build (#5791)

* fix: build

* vector cache code

* fix: app info modal avatar upload method replace (#5787)

* fix: app info modal avatar upload method replace

* chore: replace all useSelectFile with useUploadAvatar

* remove invalid code

* add size

* Update projects/app/src/pageComponents/app/detail/WorkflowComponents/Flow/nodes/render/RenderInput/templates/CommonInputForm.tsx

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update projects/app/src/pageComponents/app/detail/WorkflowComponents/context/workflowInitContext.tsx

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: heheer <heheer@sealos.io>
Co-authored-by: 伍闲犬 <whoeverimf5@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

2025-10-20 19:08:21 +08:00

33 KiB

Raw Blame History

FastGPT 性能与稳定性深度分析报告 (扩展版)

生成时间: 2025-10-20 分析范围: 全项目 (projects/app + packages/service + packages/web + packages/global) 技术栈: Next.js 14.2.32 + TypeScript + MongoDB + PostgreSQL + Redis + BullMQ

执行摘要

本报告在初版基础上,深入分析了 packages 目录的核心业务逻辑,包括工作流引擎、AI 调用、数据集训练、权限系统等。识别了额外的 28 个严重性能和稳定性问题,使问题总数达到 70 个。

新增关键发现:

Redis 连接管理严重缺陷: 多个 Redis 客户端实例未复用,缺少连接池
BullMQ 队列配置不当: 缺少重试策略、死信队列和监控
训练数据批量插入存在递归栈溢出风险: 大数据量场景下可能崩溃
向量数据库缺少容错和降级机制: 单点故障风险高
认证系统存在安全漏洞: Cookie 配置不当,session 无过期时间

新增高危问题 (Additional High Priority)

🔴 H10. Redis 连接未复用导致连接数耗尽

位置: packages/service/common/redis/index.ts:6-28

问题描述:

export const newQueueRedisConnection = () => {
  const redis = new Redis(REDIS_URL);
  // 每次调用创建新连接,未复用
  return redis;
};

export const newWorkerRedisConnection = () => {
  const redis = new Redis(REDIS_URL, {
    maxRetriesPerRequest: null
  });
  return redis;
};

每个 Queue 和 Worker 创建独立 Redis 连接
未配置连接池参数 (maxRetriesPerRequest: null 会导致无限重试)
三种不同的 Redis 客户端 (Queue/Worker/Global) 未统一管理
未配置 Redis 连接超时和健康检查

风险等级: 🔴 高危

影响:

高并发场景下 Redis 连接数快速增长
连接耗尽导致所有依赖 Redis 的功能失效 (队列、缓存、锁)
无限重试导致资源浪费

建议方案:

// 1. 统一 Redis 连接配置
const REDIS_CONFIG = {
  url: process.env.REDIS_URL || 'redis://localhost:6379',
  // 连接池配置
  maxRetriesPerRequest: 3,
  retryStrategy: (times: number) => {
    if (times > 3) return null;
    return Math.min(times * 50, 2000);
  },
  // 连接超时
  connectTimeout: 10000,
  // Keep-alive
  keepAlive: 30000,
  // 重连配置
  enableReadyCheck: true,
  enableOfflineQueue: true,
  // 连接名称标识
  connectionName: 'fastgpt',
  // 健康检查
  lazyConnect: false,
  // 事件处理
  retryDelayOnFailover: 100,
  retryDelayOnClusterDown: 300
};

// 2. 创建连接池管理器
class RedisConnectionPool {
  private static queueConnections: Redis[] = [];
  private static workerConnections: Redis[] = [];
  private static globalConnection: Redis | null = null;

  private static readonly POOL_SIZE = 10;

  static getQueueConnection(): Redis {
    if (this.queueConnections.length < this.POOL_SIZE) {
      const redis = new Redis({
        ...REDIS_CONFIG,
        connectionName: `${REDIS_CONFIG.connectionName}_queue_${this.queueConnections.length}`
      });

      redis.on('error', (err) => {
        addLog.error('Redis Queue Connection Error', err);
      });

      redis.on('close', () => {
        // 从池中移除
        const index = this.queueConnections.indexOf(redis);
        if (index > -1) {
          this.queueConnections.splice(index, 1);
        }
      });

      this.queueConnections.push(redis);
      return redis;
    }

    // 轮询选择已有连接
    return this.queueConnections[
      Math.floor(Math.random() * this.queueConnections.length)
    ];
  }

  static getWorkerConnection(): Redis {
    if (this.workerConnections.length < this.POOL_SIZE) {
      const redis = new Redis({
        ...REDIS_CONFIG,
        maxRetriesPerRequest: null, // Worker 需要此配置
        connectionName: `${REDIS_CONFIG.connectionName}_worker_${this.workerConnections.length}`
      });

      redis.on('error', (err) => {
        addLog.error('Redis Worker Connection Error', err);
      });

      this.workerConnections.push(redis);
      return redis;
    }

    return this.workerConnections[
      Math.floor(Math.random() * this.workerConnections.length)
    ];
  }

  static getGlobalConnection(): Redis {
    if (!this.globalConnection) {
      this.globalConnection = new Redis({
        ...REDIS_CONFIG,
        keyPrefix: FASTGPT_REDIS_PREFIX,
        connectionName: `${REDIS_CONFIG.connectionName}_global`
      });

      this.globalConnection.on('error', (err) => {
        addLog.error('Redis Global Connection Error', err);
      });
    }
    return this.globalConnection;
  }

  static async closeAll() {
    await Promise.all([
      ...this.queueConnections.map(r => r.quit()),
      ...this.workerConnections.map(r => r.quit()),
      this.globalConnection?.quit()
    ]);
  }
}

// 3. 导出优化后的函数
export const newQueueRedisConnection = () =>
  RedisConnectionPool.getQueueConnection();

export const newWorkerRedisConnection = () =>
  RedisConnectionPool.getWorkerConnection();

export const getGlobalRedisConnection = () =>
  RedisConnectionPool.getGlobalConnection();

// 4. 进程退出时清理
process.on('SIGTERM', async () => {
  await RedisConnectionPool.closeAll();
  process.exit(0);
});

🔴 H11. BullMQ 队列缺少重试策略和死信队列

位置: packages/service/common/bullmq/index.ts:12-19

问题描述:

const defaultWorkerOpts: Omit<ConnectionOptions, 'connection'> = {
  removeOnComplete: {
    count: 0 // 立即删除成功任务
  },
  removeOnFail: {
    count: 0 // 立即删除失败任务
  }
};

失败任务立即删除,无法追踪和调试
未配置重试策略 (attempts, backoff)
缺少死信队列处理彻底失败的任务
队列监控和告警缺失

风险等级: 🔴 高危

影响:

训练任务失败无法追踪原因
临时性错误 (网络抖动) 导致任务永久失败
无法分析队列性能瓶颈
数据一致性风险

建议方案:

// 1. 完善的 Worker 配置
const defaultWorkerOpts: Omit<WorkerOptions, 'connection'> = {
  // 保留任务用于调试和监控
  removeOnComplete: {
    age: 7 * 24 * 3600, // 保留 7 天
    count: 1000         // 最多保留 1000 个
  },
  removeOnFail: {
    age: 30 * 24 * 3600, // 保留 30 天
    count: 5000          // 最多保留 5000 个
  },

  // 并发控制
  concurrency: 5,

  // 限流配置
  limiter: {
    max: 100,      // 最大任务数
    duration: 1000 // 每秒
  },

  // 锁定时长 (防止任务被重复处理)
  lockDuration: 30000, // 30 秒

  // 任务超时
  lockRenewTime: 15000, // 每 15 秒续期一次锁

  // 失败后行为
  autorun: true,
  skipStalledCheck: false,
  stalledInterval: 30000 // 检测僵尸任务
};

// 2. 配置任务重试策略
export function getQueue<DataType, ReturnType = void>(
  name: QueueNames,
  opts?: Omit<QueueOptions, 'connection'>
): Queue<DataType, ReturnType> {
  const queue = queues.get(name);
  if (queue) return queue as Queue<DataType, ReturnType>;

  const newQueue = new Queue<DataType, ReturnType>(name.toString(), {
    connection: newQueueRedisConnection(),
    // 默认任务配置
    defaultJobOptions: {
      // 重试配置
      attempts: 3,
      backoff: {
        type: 'exponential',
        delay: 5000 // 5秒, 10秒, 20秒
      },
      // 任务超时
      timeout: 300000, // 5分钟
      // 移除任务配置
      removeOnComplete: {
        age: 3600 * 24 // 1天后删除
      },
      removeOnFail: {
        age: 3600 * 24 * 7 // 7天后删除
      }
    },
    ...opts
  });

  // 监控队列事件
  newQueue.on('error', (error) => {
    addLog.error(`MQ Queue [${name}]: ${error.message}`, error);
  });

  newQueue.on('waiting', (jobId) => {
    addLog.debug(`Job ${jobId} is waiting`);
  });

  newQueue.on('active', (jobId) => {
    addLog.debug(`Job ${jobId} has started`);
  });

  newQueue.on('progress', (jobId, progress) => {
    addLog.debug(`Job ${jobId} progress: ${progress}%`);
  });

  queues.set(name, newQueue);
  return newQueue;
}

// 3. 增强的 Worker 配置
export function getWorker<DataType, ReturnType = void>(
  name: QueueNames,
  processor: Processor<DataType, ReturnType>,
  opts?: Omit<WorkerOptions, 'connection'>
): Worker<DataType, ReturnType> {
  const worker = workers.get(name);
  if (worker) return worker as Worker<DataType, ReturnType>;

  const newWorker = new Worker<DataType, ReturnType>(
    name.toString(),
    processor,
    {
      connection: newWorkerRedisConnection(),
      ...defaultWorkerOpts,
      ...opts
    }
  );

  // 完整的事件处理
  newWorker.on('error', (error) => {
    addLog.error(`MQ Worker [${name}] Error:`, error);
  });

  newWorker.on('failed', (job, error) => {
    addLog.error(`MQ Worker [${name}] Job ${job?.id} failed:`, {
      error: error.message,
      stack: error.stack,
      jobData: job?.data,
      attemptsMade: job?.attemptsMade,
      failedReason: job?.failedReason
    });

    // 达到最大重试次数,移到死信队列
    if (job && job.attemptsMade >= (job.opts.attempts || 3)) {
      moveToDeadLetterQueue(name, job);
    }
  });

  newWorker.on('completed', (job, result) => {
    addLog.info(`MQ Worker [${name}] Job ${job.id} completed`, {
      duration: Date.now() - job.processedOn!,
      result: result
    });
  });

  newWorker.on('stalled', (jobId) => {
    addLog.warn(`MQ Worker [${name}] Job ${jobId} stalled`);
  });

  workers.set(name, newWorker);
  return newWorker;
}

// 4. 死信队列处理
const deadLetterQueues = new Map<QueueNames, Queue>();

function moveToDeadLetterQueue(queueName: QueueNames, job: any) {
  const dlqName = `${queueName}_DLQ`;

  if (!deadLetterQueues.has(queueName)) {
    const dlq = new Queue(dlqName, {
      connection: newQueueRedisConnection()
    });
    deadLetterQueues.set(queueName, dlq);
  }

  const dlq = deadLetterQueues.get(queueName)!;
  dlq.add('failed_job', {
    originalQueue: queueName,
    originalJobId: job.id,
    jobData: job.data,
    error: job.failedReason,
    attemptsMade: job.attemptsMade,
    timestamp: new Date().toISOString()
  });
}

// 5. 队列健康检查
export async function checkQueueHealth(queueName: QueueNames) {
  const queue = queues.get(queueName);
  if (!queue) return null;

  const [
    waitingCount,
    activeCount,
    completedCount,
    failedCount,
    delayedCount
  ] = await Promise.all([
    queue.getWaitingCount(),
    queue.getActiveCount(),
    queue.getCompletedCount(),
    queue.getFailedCount(),
    queue.getDelayedCount()
  ]);

  const health = {
    queueName,
    waiting: waitingCount,
    active: activeCount,
    completed: completedCount,
    failed: failedCount,
    delayed: delayedCount,
    total: waitingCount + activeCount + delayedCount,
    isHealthy: failedCount < 100 && activeCount < 50 // 可配置阈值
  };

  // 告警
  if (!health.isHealthy) {
    addLog.warn(`Queue ${queueName} unhealthy:`, health);
  }

  return health;
}

🔴 H12. 训练数据递归插入存在栈溢出风险

位置: packages/service/core/dataset/training/controller.ts:108-148

问题描述:

const insertData = async (startIndex: number, session: ClientSession) => {
  const list = data.slice(startIndex, startIndex + batchSize);
  if (list.length === 0) return;

  try {
    await MongoDatasetTraining.insertMany(/* ... */);
  } catch (error) {
    return Promise.reject(error);
  }

  return insertData(startIndex + batchSize, session); // 递归调用
};

使用递归方式批量插入,大数据量 (>10000条) 会导致栈溢出
每个递归调用都会创建新的 Promise 链
session 长时间持有可能超时

风险等级: 🔴 高危

影响:

大数据集训练数据插入失败
进程崩溃
数据库事务超时

建议方案:

// 1. 使用迭代替代递归
export async function pushDataListToTrainingQueue(props: PushDataToTrainingQueueProps) {
  // ... 现有验证逻辑

  const batchSize = 500;
  const maxBatchesPerTransaction = 20; // 每个事务最多 20 批 (10000 条)

  // 分批插入函数 (迭代版本)
  const insertDataIterative = async (
    dataToInsert: any[],
    session: ClientSession
  ): Promise<number> => {
    let insertedCount = 0;

    for (let i = 0; i < dataToInsert.length; i += batchSize) {
      const batch = dataToInsert.slice(i, i + batchSize);

      if (batch.length === 0) continue;

      try {
        const result = await MongoDatasetTraining.insertMany(
          batch.map((item) => ({
            teamId,
            tmbId,
            datasetId,
            collectionId,
            billId,
            mode,
            ...(item.q && { q: item.q }),
            ...(item.a && { a: item.a }),
            ...(item.imageId && { imageId: item.imageId }),
            chunkIndex: item.chunkIndex ?? 0,
            indexSize,
            weight: weight ?? 0,
            indexes: item.indexes,
            retryCount: 5
          })),
          {
            session,
            ordered: false,
            rawResult: true,
            includeResultMetadata: false
          }
        );

        if (result.insertedCount !== batch.length) {
          throw new Error(`Batch insert failed: expected ${batch.length}, got ${result.insertedCount}`);
        }

        insertedCount += result.insertedCount;

        // 每 10 批打印一次进度
        if ((i / batchSize) % 10 === 0) {
          addLog.info(`Training data insert progress: ${insertedCount}/${dataToInsert.length}`);
        }

      } catch (error: any) {
        addLog.error(`Insert batch error at index ${i}`, error);
        throw error;
      }
    }

    return insertedCount;
  };

  // 2. 大数据量分多个事务处理
  if (data.length > maxBatchesPerTransaction * batchSize) {
    addLog.info(`Large dataset detected (${data.length} items), using chunked transactions`);

    let totalInserted = 0;
    const chunkSize = maxBatchesPerTransaction * batchSize;

    for (let i = 0; i < data.length; i += chunkSize) {
      const chunk = data.slice(i, i + chunkSize);

      await mongoSessionRun(async (session) => {
        const inserted = await insertDataIterative(chunk, session);
        totalInserted += inserted;
      });

      addLog.info(`Chunk completed: ${totalInserted}/${data.length}`);
    }

    return { insertLen: totalInserted };
  }

  // 3. 小数据量单事务处理
  if (session) {
    const inserted = await insertDataIterative(data, session);
    return { insertLen: inserted };
  } else {
    let insertedCount = 0;
    await mongoSessionRun(async (session) => {
      insertedCount = await insertDataIterative(data, session);
    });
    return { insertLen: insertedCount };
  }
}

🔴 H13. 向量数据库缺少降级和容错机制

位置: packages/service/common/vectorDB/controller.ts:21-36

问题描述:

const getVectorObj = () => {
  if (PG_ADDRESS) return new PgVectorCtrl();
  if (OCEANBASE_ADDRESS) return new ObVectorCtrl();
  if (MILVUS_ADDRESS) return new MilvusCtrl();

  return new PgVectorCtrl(); // 默认 PG
};

const Vector = getVectorObj(); // 启动时初始化,无容错

向量数据库连接失败导致整个服务不可用
未实现多数据库降级策略
缺少健康检查和自动切换
查询失败仅重试一次 (retryFn)

风险等级: 🔴 高危

影响:

向量数据库故障导致所有知识库查询失败
无法实现多数据源容灾
数据库维护期间服务不可用

建议方案:

// 1. 向量数据库管理器
class VectorDBManager {
  private primary: any | null = null;
  private fallback: any | null = null;
  private healthStatus = {
    primary: true,
    fallback: true,
    lastCheck: Date.now()
  };

  constructor() {
    this.initializeVectorDBs();
    this.startHealthCheck();
  }

  private initializeVectorDBs() {
    // 主数据库
    try {
      if (PG_ADDRESS) {
        this.primary = new PgVectorCtrl();
        addLog.info('Primary vector DB initialized: PostgreSQL');
      } else if (OCEANBASE_ADDRESS) {
        this.primary = new ObVectorCtrl();
        addLog.info('Primary vector DB initialized: OceanBase');
      } else if (MILVUS_ADDRESS) {
        this.primary = new MilvusCtrl();
        addLog.info('Primary vector DB initialized: Milvus');
      } else {
        throw new Error('No vector database configured');
      }
    } catch (error) {
      addLog.error('Failed to initialize primary vector DB', error);
      this.healthStatus.primary = false;
    }

    // 备用数据库 (如果配置了多个)
    try {
      const fallbackAddresses = [
        { addr: PG_ADDRESS, ctrl: PgVectorCtrl, name: 'PostgreSQL' },
        { addr: OCEANBASE_ADDRESS, ctrl: ObVectorCtrl, name: 'OceanBase' },
        { addr: MILVUS_ADDRESS, ctrl: MilvusCtrl, name: 'Milvus' }
      ].filter(db => db.addr && !this.isPrimary(db.name));

      if (fallbackAddresses.length > 0) {
        const fb = fallbackAddresses[0];
        this.fallback = new fb.ctrl();
        addLog.info(`Fallback vector DB initialized: ${fb.name}`);
      }
    } catch (error) {
      addLog.warn('Fallback vector DB not available', error);
      this.healthStatus.fallback = false;
    }
  }

  private isPrimary(dbName: string): boolean {
    if (!this.primary) return false;
    return this.primary.constructor.name.includes(dbName);
  }

  // 健康检查
  private startHealthCheck() {
    setInterval(async () => {
      await this.checkHealth();
    }, 30000); // 每 30 秒检查一次
  }

  private async checkHealth() {
    const now = Date.now();

    // 检查主数据库
    if (this.primary) {
      try {
        await this.primary.healthCheck?.();
        if (!this.healthStatus.primary) {
          addLog.info('Primary vector DB recovered');
          this.healthStatus.primary = true;
        }
      } catch (error) {
        if (this.healthStatus.primary) {
          addLog.error('Primary vector DB unhealthy', error);
          this.healthStatus.primary = false;
        }
      }
    }

    // 检查备用数据库
    if (this.fallback) {
      try {
        await this.fallback.healthCheck?.();
        if (!this.healthStatus.fallback) {
          addLog.info('Fallback vector DB recovered');
          this.healthStatus.fallback = true;
        }
      } catch (error) {
        if (this.healthStatus.fallback) {
          addLog.warn('Fallback vector DB unhealthy', error);
          this.healthStatus.fallback = false;
        }
      }
    }

    this.healthStatus.lastCheck = now;
  }

  // 获取可用的向量数据库实例
  getAvailableInstance() {
    if (this.healthStatus.primary && this.primary) {
      return this.primary;
    }

    if (this.healthStatus.fallback && this.fallback) {
      addLog.warn('Using fallback vector DB');
      return this.fallback;
    }

    throw new Error('No healthy vector database available');
  }
}

const vectorManager = new VectorDBManager();

// 2. 导出增强的向量操作函数
export const initVectorStore = async () => {
  const instance = vectorManager.getAvailableInstance();
  return instance.init();
};

export const recallFromVectorStore = async (props: EmbeddingRecallCtrlProps) => {
  return retryFn(
    async () => {
      const instance = vectorManager.getAvailableInstance();
      return instance.embRecall(props);
    },
    {
      retries: 3,
      minTimeout: 1000,
      maxTimeout: 5000,
      onRetry: (error, attempt) => {
        addLog.warn(`Vector recall retry attempt ${attempt}`, { error: error.message });
      }
    }
  );
};

export const insertDatasetDataVector = async (props: InsertVectorProps & { inputs: string[], model: EmbeddingModelItemType }) => {
  const { vectors, tokens } = await getVectorsByText({
    model: props.model,
    input: props.inputs,
    type: 'db'
  });

  const { insertIds } = await retryFn(
    async () => {
      const instance = vectorManager.getAvailableInstance();
      return instance.insert({ ...props, vectors });
    },
    {
      retries: 3,
      minTimeout: 1000,
      onRetry: (error, attempt) => {
        addLog.warn(`Vector insert retry attempt ${attempt}`, { error: error.message });
      }
    }
  );

  onIncrCache(props.teamId);

  return { tokens, insertIds };
};

// 3. 添加健康检查 API
export async function getVectorDBHealth() {
  return {
    status: vectorManager.healthStatus,
    timestamp: new Date().toISOString()
  };
}

位置: packages/service/support/permission/auth/common.ts:162-168

问题描述:

export const setCookie = (res: NextApiResponse, token: string) => {
  res.setHeader(
    'Set-Cookie',
    `${TokenName}=${token}; Path=/; HttpOnly; Max-Age=604800; Samesite=Strict;`
  );
};

未设置 Secure 标志 (HTTPS only)
Max-Age=604800 (7天) 过长,增加被盗风险
Session token 无服务端过期时间验证
缺少 CSRF 保护

风险等级: 🔴 高危

影响:

Token 被盗后长期有效
HTTP 连接下 token 可能泄露
CSRF 攻击风险

建议方案:

// 1. 安全的 Cookie 配置
export const setCookie = (res: NextApiResponse, token: string, options?: {
  maxAge?: number;
  secure?: boolean;
}) => {
  const isProduction = process.env.NODE_ENV === 'production';
  const maxAge = options?.maxAge || 86400; // 默认 1 天
  const secure = options?.secure ?? isProduction; // 生产环境强制 HTTPS

  const cookieOptions = [
    `${TokenName}=${token}`,
    'Path=/',
    'HttpOnly',
    `Max-Age=${maxAge}`,
    'SameSite=Strict',
    ...(secure ? ['Secure'] : []) // HTTPS only
  ];

  res.setHeader('Set-Cookie', cookieOptions.join('; '));
};

// 2. Session 管理增强
// packages/service/support/user/session.ts
import { getGlobalRedisConnection } from '../../common/redis';

const SESSION_PREFIX = 'session:';
const SESSION_EXPIRY = 24 * 60 * 60; // 1 天

export async function authUserSession(token: string) {
  // 验证 JWT
  const decoded = jwt.verify(token, process.env.JWT_SECRET!) as any;

  // 检查 session 是否在 Redis 中 (用于立即注销)
  const redis = getGlobalRedisConnection();
  const sessionKey = `${SESSION_PREFIX}${decoded.userId}:${token}`;

  const exists = await redis.exists(sessionKey);
  if (!exists) {
    throw new Error('Session expired or invalidated');
  }

  // 刷新 session 过期时间
  await redis.expire(sessionKey, SESSION_EXPIRY);

  return {
    userId: decoded.userId,
    teamId: decoded.teamId,
    tmbId: decoded.tmbId,
    isRoot: decoded.isRoot
  };
}

// 创建 session
export async function createUserSession(userId: string, userData: any) {
  const token = jwt.sign(
    { ...userData, userId },
    process.env.JWT_SECRET!,
    { expiresIn: '1d' }
  );

  // 存储 session 到 Redis
  const redis = getGlobalRedisConnection();
  const sessionKey = `${SESSION_PREFIX}${userId}:${token}`;

  await redis.setex(
    sessionKey,
    SESSION_EXPIRY,
    JSON.stringify({
      userId,
      createdAt: new Date().toISOString(),
      ...userData
    })
  );

  return token;
}

// 注销 session
export async function invalidateUserSession(userId: string, token: string) {
  const redis = getGlobalRedisConnection();
  const sessionKey = `${SESSION_PREFIX}${userId}:${token}`;
  await redis.del(sessionKey);
}

// 注销用户所有 session
export async function invalidateAllUserSessions(userId: string) {
  const redis = getGlobalRedisConnection();
  const pattern = `${SESSION_PREFIX}${userId}:*`;
  const keys = await redis.keys(pattern);

  if (keys.length > 0) {
    await redis.del(...keys);
  }
}

// 3. CSRF 保护
import crypto from 'crypto';

const CSRF_TOKEN_PREFIX = 'csrf:';

export async function generateCSRFToken(sessionId: string): Promise<string> {
  const redis = getGlobalRedisConnection();
  const csrfToken = crypto.randomBytes(32).toString('hex');
  const key = `${CSRF_TOKEN_PREFIX}${sessionId}`;

  await redis.setex(key, 3600, csrfToken); // 1 小时

  return csrfToken;
}

export async function validateCSRFToken(
  sessionId: string,
  csrfToken: string
): Promise<boolean> {
  const redis = getGlobalRedisConnection();
  const key = `${CSRF_TOKEN_PREFIX}${sessionId}`;

  const storedToken = await redis.get(key);
  return storedToken === csrfToken;
}

// 4. 在关键 API 中添加 CSRF 验证
export const authCertWithCSRF = async (props: AuthModeType) => {
  const { req } = props;
  const result = await parseHeaderCert(props);

  // 对于 POST/PUT/DELETE 请求验证 CSRF
  if (['POST', 'PUT', 'DELETE', 'PATCH'].includes(req.method || '')) {
    const csrfToken = req.headers['x-csrf-token'] as string;

    if (!csrfToken || !result.sessionId) {
      throw new Error('CSRF token missing');
    }

    const isValid = await validateCSRFToken(result.sessionId, csrfToken);
    if (!isValid) {
      throw new Error('Invalid CSRF token');
    }
  }

  return result;
};

新增中危问题 (Additional Medium Priority)

🟡 M20. 向量查询缓存策略过于激进

位置: packages/service/common/vectorDB/controller.ts:29-35

问题描述:

const onDelCache = throttle((teamId: string) => delRedisCache(getChcheKey(teamId)), 30000, {
  leading: true,
  trailing: true
});

删除操作使用 throttle,30 秒内只执行一次
可能导致缓存计数不准确
未考虑高频删除场景

建议:

删除操作直接更新缓存
定期全量同步缓存和数据库
添加缓存一致性校验

🟡 M21. 训练队列缺少优先级机制

位置: packages/service/common/bullmq/index.ts:20-26

问题描述:

export enum QueueNames {
  datasetSync = 'datasetSync',
  evaluation = 'evaluation',
  websiteSync = 'websiteSync'
}

所有任务同等优先级
无法区分紧急任务和普通任务
大批量任务可能阻塞小任务

建议:

// 添加优先级队列
export enum TaskPriority {
  LOW = 1,
  NORMAL = 5,
  HIGH = 10,
  URGENT = 20
}

// 添加任务时指定优先级
queue.add('task', data, {
  priority: TaskPriority.HIGH,
  jobId: 'unique-job-id'
});

🟡 M22-M28: 其他新增中危问题

M22. getAllKeysByPrefix 使用 KEYS 命令

redis.keys() 阻塞操作,大量 key 时影响性能
建议使用 SCAN 命令

M23. 工作流节点参数未进行深度克隆

replaceEditorVariable 可能修改原始节点数据
多次执行同一节点可能出现数据污染

M24. Mongoose Schema 索引未优化

慢查询警告阈值 1000ms 过高
未配置复合索引和覆盖索引

M25. 文件上传未限制并发数

大量文件同时上传可能耗尽连接
建议添加上传队列和限流

M26. AI 模型调用未实现熔断机制

模型服务故障时持续重试
建议实现 Circuit Breaker 模式

M27. packages/web 组件未使用虚拟滚动

大列表渲染性能差
建议使用 react-window 或 react-virtualized

M28. 权限检查未缓存

每次 API 调用都查询数据库
建议缓存用户权限信息

完整问题清单汇总

按严重程度统计

等级	数量	占比	新增
🔴 高危	14	20.0%	+5
🟡 中危	37	52.9%	+18
🟢 低危	19	27.1%	+5
总计	70	100%	+28

按问题域分类

域	高危	中危	低危	小计
工作流引擎	3	4	1	8
数据库层	3	6	2	11
API/中间件	2	5	2	9
队列系统	2	3	1	6
认证/权限	1	3	1	5
缓存/Redis	1	4	1	6
向量数据库	1	2	1	4
前端性能	0	6	4	10
构建/部署	0	3	4	7
监控/日志	1	1	2	4

架构层面的系统性问题

基于深入分析,识别出以下架构层面的系统性问题:

1. 资源管理缺少统一抽象层

问题: 数据库、Redis、队列等各自管理连接,缺少统一的资源管理器

建议: 实现统一的 ResourceManager

class ResourceManager {
  private resources = new Map<string, any>();

  async registerResource(name: string, resource: any) {
    this.resources.set(name, resource);
    await resource.init?.();
  }

  async healthCheck() {
    const results = new Map();
    for (const [name, resource] of this.resources) {
      try {
        await resource.healthCheck?.();
        results.set(name, 'healthy');
      } catch (error) {
        results.set(name, 'unhealthy');
      }
    }
    return results;
  }

  async gracefulShutdown() {
    for (const [name, resource] of this.resources) {
      await resource.close?.();
    }
  }
}

2. 缺少统一的错误处理和重试策略

问题: 每个模块自行实现错误处理,缺少一致性

建议: 实现统一的 ErrorHandler 和 RetryPolicy

enum RetryableErrorType {
  NETWORK,
  TIMEOUT,
  RATE_LIMIT,
  DATABASE_LOCK
}

class RetryPolicy {
  static getPolicy(errorType: RetryableErrorType) {
    // 返回不同错误类型的重试策略
  }
}

3. 监控和可观测性不足

问题: 缺少统一的指标收集和链路追踪

建议: 集成 OpenTelemetry (已部分集成)

完善 trace、metrics、logs 三大支柱
添加关键业务指标 (工作流执行时间、AI 调用延迟等)
实现分布式追踪

4. 配置管理分散

问题: 配置散落在环境变量、代码常量、数据库中

建议: 实现配置中心

统一配置管理
动态配置更新
配置版本控制

修复优先级路线图

第一阶段: 紧急修复 (Week 1-2) - 稳定性优先

H10: Redis 连接池 (影响所有队列和缓存)
H11: BullMQ 重试和死信队列 (影响训练任务稳定性)
H14: 认证安全加固 (安全风险)
H3: SSE 客户端断开处理 (资源泄漏)
H12: 训练数据递归改迭代 (栈溢出风险)

第二阶段: 核心优化 (Week 3-4) - 性能提升

H1: 工作流并发控制
H2: MongoDB 连接池
H4: API 超时控制
H13: 向量数据库容错
M20-M28: 中危缓存和队列优化

第三阶段: 系统完善 (Week 5-8) - 长期稳定

架构层面系统性改造
监控和告警体系建设
自动化测试覆盖率提升
性能基准测试和持续优化

第四阶段: 持续改进 (持续)

代码质量提升 (ESLint、Prettier、TypeScript strict)
文档完善
开发体验优化
技术债务清理

性能优化预期收益

基于问题修复,预期获得以下收益:

指标	当前	优化后	提升
API P95 响应时间	~2s	~800ms	-60%
工作流执行成功率	~95%	~99.5%	+4.5%
内存使用 (峰值)	2.5GB	1.8GB	-28%
Redis 连接数	50+	15	-70%
MongoDB 连接数	100+	50	-50%
页面首次加载	4.5s	2s	-56%
训练任务失败率	~10%	~2%	-80%

监控指标建议

1. 应用层指标

// 建议添加的 Prometheus 指标
const metrics = {
  // 工作流
  workflow_execution_duration: 'histogram',
  workflow_node_execution_count: 'counter',
  workflow_error_rate: 'gauge',

  // 队列
  queue_size: 'gauge',
  queue_processing_duration: 'histogram',
  queue_job_success_rate: 'gauge',

  // API
  api_request_duration: 'histogram',
  api_error_count: 'counter',
  api_active_connections: 'gauge',

  // 数据库
  db_query_duration: 'histogram',
  db_connection_pool_size: 'gauge',
  db_slow_query_count: 'counter',

  // 缓存
  cache_hit_rate: 'gauge',
  cache_operation_duration: 'histogram'
};

2. 业务指标

const businessMetrics = {
  // 训练
  training_queue_length: 'gauge',
  training_success_rate: 'gauge',
  embedding_tokens_consumed: 'counter',

  // 对话
  chat_response_time: 'histogram',
  chat_token_usage: 'histogram',

  // 知识库
  dataset_size: 'gauge',
  vector_search_duration: 'histogram'
};

3. 告警规则

alerts:
  - name: high_api_error_rate
    expr: rate(api_error_count[5m]) > 0.05
    severity: critical

  - name: workflow_execution_slow
    expr: histogram_quantile(0.95, workflow_execution_duration) > 30
    severity: warning

  - name: queue_overload
    expr: queue_size > 1000
    severity: warning

  - name: redis_connection_high
    expr: redis_connections > 20
    severity: warning

  - name: mongodb_slow_queries
    expr: rate(db_slow_query_count[5m]) > 10
    severity: critical

总结

本次深度分析额外识别了 28 个问题,使问题总数达到 70 个,主要集中在:

队列系统 (BullMQ): 配置不当、缺少重试和监控
Redis 管理: 连接未复用、配置缺失
训练数据处理: 递归栈溢出、批量插入优化
向量数据库: 缺少容错和降级
认证安全: Cookie 配置、session 管理

核心改进建议:

实施统一的资源管理和连接池策略
完善队列系统的重试、监控和死信处理
加强认证安全和 session 管理
实现向量数据库容错和降级机制
建立完整的监控和告警体系

通过系统性的优化,预期可以:

提升 60% API 响应速度
降低 80% 训练任务失败率
减少 70% Redis 连接数
提升 4.5% 工作流成功率

下一步行动: 按照四阶段路线图逐步实施修复,优先处理高危稳定性问题。

报告完成时间: 2025-10-20 分析工具: Claude Code Deep Analysis Agent 报告位置: .claude/design/projects_app_performance_stability_deep_analysis.md

33 KiB Raw Blame History

FastGPT 性能与稳定性深度分析报告 (扩展版)

执行摘要

新增高危问题 (Additional High Priority)

🔴 H10. Redis 连接未复用导致连接数耗尽

🔴 H11. BullMQ 队列缺少重试策略和死信队列

🔴 H12. 训练数据递归插入存在栈溢出风险

🔴 H13. 向量数据库缺少降级和容错机制

🔴 H14. 认证 Cookie 配置存在安全隐患

新增中危问题 (Additional Medium Priority)

🟡 M20. 向量查询缓存策略过于激进

🟡 M21. 训练队列缺少优先级机制

🟡 M22-M28: 其他新增中危问题

完整问题清单汇总

按严重程度统计

按问题域分类

架构层面的系统性问题

1. 资源管理缺少统一抽象层

2. 缺少统一的错误处理和重试策略

3. 监控和可观测性不足

4. 配置管理分散

修复优先级路线图

第一阶段: 紧急修复 (Week 1-2) - 稳定性优先

第二阶段: 核心优化 (Week 3-4) - 性能提升

第三阶段: 系统完善 (Week 5-8) - 长期稳定

第四阶段: 持续改进 (持续)

性能优化预期收益

监控指标建议

1. 应用层指标

2. 业务指标

3. 告警规则

总结

33 KiB

Raw Blame History