diff --git a/content/en/docs/cluster-administration/cluster-status-monitoring.md b/content/en/docs/cluster-administration/cluster-status-monitoring.md
index 0a396b45e..7c7eccf4b 100644
--- a/content/en/docs/cluster-administration/cluster-status-monitoring.md
+++ b/content/en/docs/cluster-administration/cluster-status-monitoring.md
@@ -9,7 +9,7 @@ weight: 300

 KubeSphere provides monitoring of related metrics such as CPU, memory, network, and disk of the cluster. You can also review historical monitoring data and sort nodes by different indicators based on their usage in **Cluster Status Monitoring**.

-## Prerequisites
+## Prerequisites

 You need an account granted a role including the authorization of **Clusters Management**. For example, you can log in the console as `admin` directly or create a new role with the authorization and assign it to an account.

@@ -17,29 +17,29 @@ You need an account granted a role including the authorization of **Clusters Man

 1. Click **Platform** in the top left corner and select **Clusters Management**.

-![Platform](/images/docs/cluster-administration/cluster-status-monitoring/platform.png)
+![Platform](../../../../static/images/docs/cluster-administration/cluster-status-monitoring/platform.png)

-2. If you have enabled the [multi-cluster feature](../../multicluster-management) with member clusters imported, you can select a specific cluster to view its application resources. If you have not enabled the feature, refer to the next step directly.
+1. If you have enabled the [multi-cluster feature](../../multicluster-management) with member clusters imported, you can select a specific cluster to view its application resources. If you have not enabled the feature, refer to the next step directly.

-![Clusters Management](/images/docs/cluster-administration/cluster-status-monitoring/clusters-management.png)
+![Clusters Management](../../../../static/images/docs/cluster-administration/cluster-status-monitoring/clusters-management.png)

 3. Choose **Cluster Status** under **Monitoring & Alerting** to see the overview of cluster status monitoring, including **Cluster Node Status**, **Components Status**, **Cluster Resources Usage**, **ETCD Monitoring**, and **Service Component Monitoring**, as shown in the following figure.

-![Cluster Status Monitoring](/images/docs/cluster-administration/cluster-status-monitoring/cluster-status-monitoring.png)
+![Cluster Status Monitoring](../../../../static/images/docs/cluster-administration/cluster-status-monitoring/cluster-status-monitoring.png)

 ### Cluster Node Status

 1. **Cluster Node Status** displays the status of all nodes, separately marking the active ones. You can go to the **Cluster Nodes** page shown below to view the real-time resource usage of all nodes by clicking **Node Online Status**.

-![Cluster Nodes](/images/docs/cluster-administration/cluster-status-monitoring/cluster-nodes.png)
+![Cluster Nodes](../../../../static/images/docs/cluster-administration/cluster-status-monitoring/cluster-nodes.png)

 2. In **Cluster Nodes**, click the node name to view usage details in **Running Status**, including the information of CPU, Memory, Pod, Local Storage in the current node, and its health status.

-![Running Status](/images/docs/cluster-administration/cluster-status-monitoring/running-status.png)
+![Running Status](../../../../static/images/docs/cluster-administration/cluster-status-monitoring/running-status.png)

 3. Click the tab **Monitoring** to view how the node is functioning during a certain period based on different metrics, including **CPU Utilization, CPU Load Average, Memory Utilization, Disk Utilization, inode Utilization, IOPS, DISK Throughput, and Network Bandwidth**, as shown in the following figure.

-![Monitoring](/images/docs/cluster-administration/cluster-status-monitoring/monitoring.png)
+![Monitoring](../../../../static/images/docs/cluster-administration/cluster-status-monitoring/monitoring.png)

 {{< notice tip >}}

@@ -53,11 +53,11 @@ KubeSphere monitors the health status of various service components in the clust

 1. On the **Cluster Status Monitoring** page, click components (the part in the green box below) under **Components Status** to view the status of service components.

-![component-monitoring](/images/docs/cluster-administration/cluster-status-monitoring/component-monitoring.jpg)
+![component-monitoring](../../../../static/images/docs/cluster-administration/cluster-status-monitoring/component-monitoring.jpg)

 2. You can see all the components are listed in this part. Components marked in green are those functioning normally while those marked in orange require special attention as it signals potential issues.

-![Service Components Status](/images/docs/cluster-administration/cluster-status-monitoring/service-components-status.png)
+![Service Components Status](../../../../static/images/docs/cluster-administration/cluster-status-monitoring/service-components-status.png)

 {{< notice tip >}}

@@ -69,25 +69,25 @@ Components marked in orange may turn to green after a period of time, the reason

 **Cluster Resources Usage** displays the information including **CPU Utilization, Memory Utilization, Disk Utilization, and Pod Quantity Trend** of all nodes in the cluster. Click the pie chart on the left to switch indicators, which shows the trend during a period in a line chart on the right.

-![Cluster Resources Usage](/images/docs/cluster-administration/cluster-status-monitoring/cluster-resources-usage.png)
+![Cluster Resources Usage](../../../../static/images/docs/cluster-administration/cluster-status-monitoring/cluster-resources-usage.png)

 ## Physical Resources Monitoring

 Monitoring data in **Physical Resources Monitoring** help users better observe their physical resources and establish normal standards for resource and cluster performance. KubeSphere allows users to view cluster monitoring data within the last 7 days, including **CPU Utilization**, **Memory Utilization**, **CPU Load Average** **(1 minute/5 minutes/15 minutes)**, **inode Utilization**, **Disk Throughput (read/write)**, **IOPS (read/write)**, **Network Bandwidth**, and **Pod Status**. You can customize the time range and time interval to view historical monitoring data of physical resources in KubeSphere. The following sections briefly introduce each monitoring indicator.

-![Physical Resources Monitoring](/images/docs/cluster-administration/cluster-status-monitoring/physical-resources-monitoring.png)
+![Physical Resources Monitoring](../../../../static/images/docs/cluster-administration/cluster-status-monitoring/physical-resources-monitoring.png)

 ### CPU Utilization

 CPU utilization shows how CPU resources are used in a period. If you notice that the CPU usage of the platform during a certain period soars, you must first locate the process that is occupying CPU resources the most. For example, for Java applications, you may expect a CPU usage spike in the case of memory leaks or infinite loops in the code.

-![CPU Utilization](/images/docs/cluster-administration/cluster-status-monitoring/cpu-utilization.png)
+![CPU Utilization](../../../../static/images/docs/cluster-administration/cluster-status-monitoring/cpu-utilization.png)

 ### Memory Utilization

 Memory is one of the important components on a machine, serving as a bridge for communications with the CPU. Therefore, the performance of memory has a great impact on the machine. Data loading, thread concurrency and I/O buffering are all dependent on memory when a program is running. The size of available memory determines whether the program can run normally and how it is functioning. Memory utilization reflects how memory resources are used within a cluster as a whole, displayed as a percentage of available memory in use at a given moment.

-![Memory Utilization](/images/docs/cluster-administration/cluster-status-monitoring/memory-utilization.png)
+![Memory Utilization](../../../../static/images/docs/cluster-administration/cluster-status-monitoring/memory-utilization.png)

 ### CPU Load Average

@@ -99,7 +99,7 @@ KubeSphere provides users with three different time periods to view the load ave
 - If the value of 1 minute in a certain period, or at a specific time point is much greater than that of 15 minutes, it means that the load in the last 1 minute is increasing, and you need to keep observing. Once the value of 1 minute exceeds the number of CPUs, it may mean that the system is overloaded. You need to further analyze the source of the problem.
 - Conversely, if the value of 1 minute in a certain period, or at a specific time point is much less than that of 15 minutes, it means that the load of the system is decreasing in the last 1 minute, and a high load has been generated in the previous 15 minutes.

-![CPU Load Average](/images/docs/cluster-administration/cluster-status-monitoring/cpu-load-average.png)
+![CPU Load Average](../../../../static/images/docs/cluster-administration/cluster-status-monitoring/cpu-load-average.png)

 ### Disk Usage

@@ -107,7 +107,7 @@ KubeSphere workloads such as `StatefulSets` and `DaemonSets` all rely on persist

 In the daily management of the Linux system, platform administrators may encounter data loss or even system crashes due to insufficient disk space. As an essential part of cluster management, they need to pay close attention to the disk usage of the system and ensure that the file system is not filling up or abused. By monitoring the historical data of disk usage, you can evaluate how disks are used during a given period of time. In the case of high disk usage, you can free up disk space by cleaning up unnecessary images or containers.

-![Disk Usage](/images/docs/cluster-administration/cluster-status-monitoring/disk-usage.png)
+![Disk Usage](../../../../static/images/docs/cluster-administration/cluster-status-monitoring/disk-usage.png)

 ### inode Utilization

@@ -115,30 +115,30 @@ Each file must have an inode, which is used to store the file's meta-information

 In KubeSphere, the monitoring of inode utilization can help you detect such situations in advance, as you can have a clear view of cluster inode usage. The mechanism prompts users to clean up temporary files in time, preventing the cluster from being unable to work due to inode exhaustion.

-![inode Utilization](/images/docs/cluster-administration/cluster-status-monitoring/inode-utilization.png)
+![inode Utilization](../../../../static/images/docs/cluster-administration/cluster-status-monitoring/inode-utilization.png)

 ### Disk Throughput

 The monitoring of disk throughput and IOPS is an indispensable part of disk monitoring, which is convenient for cluster administrators to adjust data layout and other management activities to optimize the overall performance of the cluster. Disk throughput refers to the speed of the disk transmission data stream (shown in MB/s), and the transmission data are the sum of data reading and writing. When large blocks of discontinuous data are being transmitted, this indicator is of great importance for reference.

-![Disk Throughput](/images/docs/cluster-administration/cluster-status-monitoring/disk-throughput.png)
+![Disk Throughput](../../../../static/images/docs/cluster-administration/cluster-status-monitoring/disk-throughput.png)

 ### IOPS

 **IOPS (Input/Output Operations Per Second)** represents a performance measurement of the number of read and write operations per second. Specifically, the IOPS of a disk is the sum of the number of continuous reads and writes per second. This indicator is of great significance for reference when small blocks of discontinuous data are being transmitted.

-![IOPS](/images/docs/cluster-administration/cluster-status-monitoring/iops.png)
+![IOPS](../../../../static/images/docs/cluster-administration/cluster-status-monitoring/iops.png)

 ### Network Bandwidth

 The network bandwidth is the ability of the network card to receive or send data per second, shown in Mbps (megabits per second).

-![Network Bandwidth](/images/docs/cluster-administration/cluster-status-monitoring/netework-bandwidth.png)
+![Network Bandwidth](../../../../static/images/docs/cluster-administration/cluster-status-monitoring/netework-bandwidth.png)

 ### Pod Status

 Pod status displays the total number of pods in different states, including **Running**, **Completed** and **Warning**. The pod tagged **Completed** usually refers to a Job or a CronJob. The number of pods marked **Warning**, which means an abnormal state, requires special attention.

-![Pod Status](/images/docs/cluster-administration/cluster-status-monitoring/pod-status.png)
+![Pod Status](../../../../static/images/docs/cluster-administration/cluster-status-monitoring/pod-status.png)

 ## ETCD Monitoring

@@ -154,7 +154,7 @@ ETCD monitoring helps you to make better use of ETCD, especially to locate perfo
 |DB Fsync|The submission delay distribution of the backend calls. When ETCD submits its most recent incremental snapshot to disk, a `backend_commit` will be called. Note that high latency of disk operations (long WAL log synchronization time or library synchronization time) usually indicates disk problems, which may cause high request latency or make the cluster unstable. For more information about the indicator, see [etcd Disk](https://etcd.io/docs/v3.3.12/metrics/#grpc-requests). |
 |Raft Proposals|- **Proposal Commit Rate** records the rate of consensus proposals committed. If the cluster is healthy, this indicator should increase over time. Several healthy members of an ETCD cluster may have different general proposals at the same time. A continuous large lag between a single member and its leader indicates that the member is slow or unhealthy.
- **Proposal Apply Rate** records the total rate of consensus proposals applied. The ETCD server applies each committed proposal asynchronously. The difference between the **Proposal Commit Rate** and the **Proposal Apply Rate** should usually be small (only a few thousands even under high loads). If the difference between them continues to rise, it indicates that the ETCD server is overloaded. This can happen when using large-scale queries such as heavy range queries or large txn operations.
- **Proposal Failure Rate** records the total rate of failed proposals, usually related to two issues: temporary failures related to leader election or longer downtime due to a loss of quorum in the cluster.
- **Proposal Pending Total** records the current number of pending proposals. An increase in pending proposals indicates high client loads or members unable to submit proposals.
Currently, the data displayed on the dashboard is the average size of ETCD members. For more information about these indicators, see [etcd Server](https://etcd.io/docs/v3.3.12/metrics/#server). |

-![ETCD Monitoring](/images/docs/cluster-administration/cluster-status-monitoring/etcd-monitoring.png)
+![ETCD Monitoring](../../../../static/images/docs/cluster-administration/cluster-status-monitoring/etcd-monitoring.png)

 ## APIServer Monitoring
 [API Server](https://kubernetes.io/docs/concepts/overview/kubernetes-api/) is the hub for the interaction of all components in a Kubernetes cluster. The following table lists the main indicators monitored for the APIServer.
@@ -164,7 +164,7 @@ ETCD monitoring helps you to make better use of ETCD, especially to locate perfo
 |Request Latency|Classified by HTTP request methods, the latency of resource request response in milliseconds.|
 |Request Per Second|The number of requests accepted by kube-apiserver per second.|

-![APIServer Monitoring](/images/docs/cluster-administration/cluster-status-monitoring/apiserver-monitoring.png)
+![APIServer Monitoring](../../../../static/images/docs/cluster-administration/cluster-status-monitoring/apiserver-monitoring.png)

 ## Scheduler Monitoring

@@ -176,10 +176,10 @@ ETCD monitoring helps you to make better use of ETCD, especially to locate perfo
 |Attempt Rate|Include the scheduling rate of successes, errors, and failures.|
 |Scheduling latency|End-to-end scheduling delay, which is the sum of scheduling algorithm delay and binding delay|

-![Scheduler Monitoring](/images/docs/cluster-administration/cluster-status-monitoring/scheduler-monitoring.png)
+![Scheduler Monitoring](../../../../static/images/docs/cluster-administration/cluster-status-monitoring/scheduler-monitoring.png)

 ## Node Usage Ranking

 You can sort nodes in ascending and descending order by indicators such as CPU, Load Average, Memory, Local Storage, inode Utilization, and Pod Utilization. This enables administrators to quickly find potential problems or identify a node's insufficient resources.

-![Node Usage Ranking](/images/docs/cluster-administration/cluster-status-monitoring/node-usage-ranking.png)
+![Node Usage Ranking](../../../../static/images/docs/cluster-administration/cluster-status-monitoring/node-usage-ranking.png)
diff --git a/content/zh/docs/cluster-administration/cluster-status-monitoring.md b/content/zh/docs/cluster-administration/cluster-status-monitoring.md
new file mode 100644
index 000000000..8092f8261
--- /dev/null
+++ b/content/zh/docs/cluster-administration/cluster-status-monitoring.md
@@ -0,0 +1,5 @@
+# 集群状态监控
+
+KubeSphere 提供了对集群的 CPU、内存、网络和磁盘等相关指标的监控。在**集群状态监控**中,您还可以查看历史监控数据并根据节点的使用情况按不同的指标对节点进行排序。
+
+## 先决条件
\ No newline at end of file
diff --git a/static/images/docs/cluster-administration/cluster-status-monitoring-zh/platform.png b/static/images/docs/cluster-administration/cluster-status-monitoring-zh/platform.png
new file mode 100644
index 000000000..a9c183314
Binary files /dev/null and b/static/images/docs/cluster-administration/cluster-status-monitoring-zh/platform.png differ