Zhu Ye's Internet Architecture Practice Experience S1E5: Constantly Cultivating Basic Middleware

Generally speaking, the difference between middleware and a framework is this: middleware is a standalone client-server program that handles a specific concern, with matching client and server components; a framework, although it also handles a specific concern, is not a standalone program, but a set of class libraries hosted inside the host process.


The green parts of the diagram represent frameworks, the red parts management systems, and the purple parts middleware. This article focuses on the management systems and the middleware components.

Configuration Management

  • Technical configuration internal to the system: sizes of various pools, queue sizes, log levels, paths, batch sizes, processing intervals, retry counts, timeouts, and so on.
  • Business configuration: periodic event rewards, black and white lists, pop-ups, ad slots, and so on.
  • Operations and release configuration: grayscale lists, registry addresses, database addresses, cache addresses, MQ addresses, and so on.

Because basic components such as the SOA framework and the publishing system also consume configuration, there is a potential chicken-and-egg problem here. My recommendation is to make the configuration system the lowest-level system, so that every other service can depend on it. Beyond the most basic Key-Value reads and writes, configuration management generally offers the following features:

  • High performance. The load on the configuration service can be frightening: a single service call may trigger dozens of configuration lookups. If a service runs at 500 QPS overall, the configuration service may see 10,000 QPS, so serving it without a cache is basically impossible. Fortunately, even 10,000 or 50,000 QPS is not an exaggerated, unsolvable level of pressure.
  • High availability. Today's open-source configuration services are all so-called distributed configuration services, where a scalable cluster provides load balancing and high availability. Once the configuration service goes down, the whole system may be paralyzed. You may point out that configuration clients generally have a local cache, a local configuration file as backup, and default values; but because configuration is modified in real time, a service running on a stale configuration, or falling back to a wrong default, can leave the system in complete disarray. The stability of the configuration service is therefore critically important.
  • Tree-structured configuration organization. If you simply put all configuration in a flat list, even with project and category fields, things get unwieldy once you reach several thousand items. The system should support hierarchical, tree-shaped configuration, not limited to the two dimensions of project and category: a project can contain modules, a module can contain categories, a category can contain subcategories, and the configuration tree can be built dynamically to fit your own needs.
  • An easy-to-use client. For example, one that integrates with Spring Boot and the @Value annotation, so the configuration system can be adopted non-intrusively, without any code changes.
  • Millisecond-granularity real-time effect of modifications. This can be implemented by long-connection push, or by cache invalidation.
  • Layered isolation of configuration. This includes providing independent sets of configuration per environment, cluster, and project, as well as hierarchical configuration inheritance.
  • Configuration permission control. Different types, environments, clusters, and projects carry different administrative rights, such as desensitized read-only, read-only, read-write, and export.
  • Configuration version management. Every modification creates a version, allowing direct rollback of an individual configuration item or a whole project.
  • Rich Value forms. If you want to store a list, saving it as a JSON string is inconvenient to read and modify; the system can provide a List-typed Value directly, with individual items added or removed in the admin console. This is far more efficient for things like blacklists, where otherwise the entire list must be rewritten on every update; the feature can be implemented in conjunction with Redis. Besides strings, Value can be JSON or XML, with the system formatting and validating it, or a non-string numeric type, which the system also validates by type.
  • Rich release-effectiveness modes. For example, a change can take effect naturally, immediately, or at a scheduled time. Scheduled effectiveness suits configurations that must switch on at a specific point in time, such as user-facing push and campaign services. There should also be automatic grayscale publishing, which rolls the change out to cluster instances at intervals, avoiding the hassle of publishing to instances one by one.
  • Auditing. Configuration modifications can be subject to administrator review (that is, separating the rights to modify and to release), avoiding erroneous changes. All modification records can be queried afterwards to see who changed which configuration, and why, for audit review.
  • Effectiveness tracking and usage tracking. For each configuration item you can see which clients are currently using it and which version of the value is in effect. This also lets you identify configurations that are never used anywhere in the current system and remove them.
  • Dynamic configuration. In the API design we introduce the concept of a context: a Map dictionary passed along with the lookup. For example, when a configuration needs different values for different user types or cities, we need not hand-write that logic in code; we configure context-matching policies in the admin console and read the appropriate value dynamically.
  • Local snapshots. The configuration is snapshotted to local storage, and the local copy is used when the client fails to connect to the server.
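To make the client-side behavior concrete, here is a minimal Python sketch of a configuration client combining three of the features above: an in-memory cache, a local snapshot used when the server is unreachable, and context-based dynamic values. The class and payload shape are illustrative assumptions, not any real configuration center's API.

```python
import json
import os
import tempfile

SNAPSHOT = os.path.join(tempfile.gettempdir(), "demo_config_snapshot.json")

class ConfigClient:
    """Illustrative config-center client: cache + snapshot + dynamic values."""

    def __init__(self, snapshot_path):
        self.snapshot_path = snapshot_path
        self.cache = {}  # key -> list of {"filter": {...}, "value": ...} rules

    def load_from_server(self, payload):
        # In a real client this would arrive over a long-polling/push channel.
        self.cache = payload
        with open(self.snapshot_path, "w") as f:
            json.dump(payload, f)  # persist a snapshot for server outages

    def load_snapshot(self):
        # Fall back to the last known-good snapshot when the server is down.
        if os.path.exists(self.snapshot_path):
            with open(self.snapshot_path) as f:
                self.cache = json.load(f)

    def get(self, key, context=None, default=None):
        rules = self.cache.get(key)
        if rules is None:
            return default
        context = context or {}
        for rule in rules:
            # A rule matches when every filter entry is satisfied by the context;
            # an empty filter acts as the catch-all default rule.
            if all(context.get(k) == v for k, v in rule["filter"].items()):
                return rule["value"]
        return default

client = ConfigClient(SNAPSHOT)
client.load_from_server({
    "popup.enabled": [
        {"filter": {"city": "Shanghai"}, "value": True},
        {"filter": {}, "value": False},
    ]
})
print(client.get("popup.enabled", {"city": "Shanghai"}))  # True
print(client.get("popup.enabled", {"city": "Beijing"}))   # False
```

The same `get` call with different contexts returns different values, which is exactly the "dynamic configuration" idea: the matching policy lives in the configuration backend, not in business code.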

As you can see, the workload of a fully featured configuration system is considerable. An excellent, powerful configuration system saves a great deal of development work, because most configurable functionality can be implemented directly through it, with no need for piles of XXConfig tables in the database (no exaggeration: in many business systems 40% of the workload goes into such tables, plus the admin screens to maintain them).


Service Governance

Implementing remote calls covers only 20% of the work of building microservices (though it does satisfy 80% of the requirements); there is a great deal more to do in service governance. This is the benefit of implementing your own RPC framework: it is the first step, and once data flows through our own framework, we can do much more with it, for example:

  • Call-chain tracking. Record the entire call and inspect the call chain; I will cover this in the next section.
  • Registry management. View service registration status, take services online or offline manually, switch clusters, and redistribute load.
  • Configuration management. Configure server and client thread pools and queues, timeouts, and more. Of course, this can also live in the configuration system.
  • Operations-level management. View and manage circuit breaking of methods, concurrency limiting, service-permission blacklists, and security settings (payload encryption, log desensitization, and so on).
  • A Service Store concept. Publishing a service must meet certain requirements: documentation (which can be supplied via annotations in code comments) and metadata (development owner, operations owner, service type, capabilities provided). Services are then published the way apps are published to Apple's App Store, so that maintenance information and documentation for every service can be viewed on one unified platform.
  • Version control and call statistics. Grayscale upgrades of services, routing by version, per-version analysis, and so on, similar to what mobile analytics platforms (Umeng, TalkingData) provide for apps.
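As a small illustration of the version-based grayscale routing mentioned above, here is a sketch that stably assigns a percentage of callers to a new service version. The hashing-and-bucketing scheme is an illustrative assumption, not any real framework's implementation.

```python
import hashlib

def pick_version(service, user_id, gray_percent, old="1.0", new="2.0"):
    """Stable grayscale routing: hash (service, user) into a bucket in
    [0, 100) and send that slice of users to the new version."""
    digest = hashlib.md5(f"{service}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return new if bucket < gray_percent else old

# The same user always lands on the same version for a given percentage,
# so a session never flip-flops between old and new code.
v1 = pick_version("order-service", "user-42", gray_percent=20)
v2 = pick_version("order-service", "user-42", gray_percent=20)
print(v1 == v2)  # True
```

Raising `gray_percent` from 0 to 100 then migrates users monotonically: a user who has seen the new version never falls back as the rollout widens.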

The point I want to make here is that making services callable is only the first step. As the number of services grows, deployment becomes more complex, dependencies tangle, versions iterate, and APIs change; developers and architects urgently need a map to understand the overall picture of service capabilities, and operations needs systems to observe and deploy services. Service governance can then operate the way iOS does: development must follow standards, and release must follow a process.


Full-Link Monitoring

Open-source implementations include https://github.com/dianping/cat and https://github.com/naver/pinpoint (shown above), among others. For systems with many microservices (a main flow touching 8+ of them), troubleshooting faults and performance problems without full-link call tracing is extremely difficult. A complete full-link monitoring system covers more than just microservices and typically implements the following:

  • Data collection via logs, an agent, a proxy, or integration into the framework, with as little intrusion as possible, while guaranteeing that collection never impacts the main business and that a collection-server outage does not affect the business.
  • Call tracing. Covering service calls, cache calls, database calls, and MQ calls, it can not only present each call's type, duration, and result as a tree, but also render the complete root: the full request information for a web request, or the job information for a Job task.
  • JVM information (taking Java as the example). Present each process's JVM-level GC, thread, memory, and CPU usage; view remote stacks and heap snapshots (without in-process memory information, coarse-grained server-level resource monitoring often makes root-cause analysis nearly impossible); and set policies for periodic snapshots. VM information can even be correlated with call traces through snapshots. Knowing the state of the VM at the moment of a problem is a great help in troubleshooting.
  • A dependency graph. When we produce an architecture plan, the first step is usually to sort out the dependencies between modules and services; only then can we determine the blast radius of a refactoring. In complex microservice projects, each team may know only the immediate upstream and downstream of its own service, be completely unclear about anything further out, and nobody in the company can describe the whole picture. With a full-link tracing system, we can draw the dependency architecture diagram by analyzing past calls; overlaying QPS hotspots on that diagram also helps with capacity planning at the operations level.
  • Advanced analysis and suggestions. For example: locating bottlenecks after full-link load testing; periodically analyzing the performance of all components to detect degradation trends and raise early warnings; analyzing JVM threads and GC to help locate high CPU usage and memory leaks. Even without such automated advanced analysis, with call-trace charts and component dependency graphs we can at least analyze problems ourselves when they arise.
  • A dashboard is optional: as long as data collection is comprehensive enough, as shown in the previous article, we can build all kinds of customized charts with Grafana.
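To make call tracing concrete, here is a toy in-process tracer that records spans (type, name, cost) as a tree, roughly the way a full-link agent would before shipping them to a collector. The API is hypothetical and single-threaded for brevity; a real agent propagates trace and span IDs across process boundaries.

```python
import time
from contextlib import contextmanager

class Tracer:
    """Toy call-chain tracer: nested spans recorded as an indented tree."""

    def __init__(self):
        self.spans = []   # dicts appended in call order: depth, kind, name, cost
        self._depth = 0

    @contextmanager
    def span(self, kind, name):
        rec = {"depth": self._depth, "kind": kind, "name": name, "cost_ms": None}
        self.spans.append(rec)        # record at entry to preserve call order
        self._depth += 1
        start = time.time()
        try:
            yield
        finally:
            self._depth -= 1
            rec["cost_ms"] = (time.time() - start) * 1000  # fill cost at exit

    def render(self):
        return "\n".join(
            f"{'  ' * s['depth']}[{s['kind']}] {s['name']} {s['cost_ms']:.1f}ms"
            for s in self.spans)

tracer = Tracer()
with tracer.span("RPC", "order.create"):
    with tracer.span("SQL", "INSERT orders"):
        pass
    with tracer.span("Redis", "SET order:1"):
        pass
print(tracer.render())
```

The rendered tree shows the RPC call at the root with its database and cache calls indented beneath it, which is the "tree of type, duration, and result" view described above.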

Data Access Middleware

  • The most commonly used function is read-write separation, which also includes load balancing and failover. Reads are automatically load-balanced across multiple replicas, and availability detection allows switching away from a failed primary, in coordination with the database's own high-availability and replication mechanisms.
  • Sharding becomes necessary as data volume grows: data is spread evenly across tables along some dimension, and those tables are distributed across multiple physical databases, dispersing the pressure. The write path of sharding is broadly similar across implementations, but the read path is complicated because it involves merging and aggregating results. And since there may be more than one sharding dimension, this can be handled either by double-writing underlying tables for each dimension or by maintaining a dimension index table.
  • Various operations features, such as client access control, black and white lists, rate limiting, timeout-based circuit breaking, call tracing integrated with the call chain, full audit search, data-migration assistance, and so on.
  • Others. A few proxies implement distributed transactions (XA). Distributed pessimistic locking can also be implemented at the proxy level. In fact, once SQL is no longer thrown directly at the database for execution, there are a great many possibilities for what you can do in between.
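The read-write separation and failover behavior described in the first bullet can be sketched as follows. The routing policy and the node names are illustrative assumptions, not any specific proxy's logic.

```python
import itertools

class ReadWriteRouter:
    """Sketch of read/write splitting inside a database proxy: writes go to
    the primary, reads round-robin over healthy replicas, and reads degrade
    to the primary when every replica is marked down."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)
        self.down = set()
        self._rr = itertools.cycle(range(len(self.replicas)))

    def mark_down(self, replica):
        # In a real proxy this is driven by availability detection.
        self.down.add(replica)

    def route(self, sql):
        verb = sql.lstrip().split()[0].upper()
        if verb != "SELECT":
            return self.primary              # all writes hit the primary
        for _ in range(len(self.replicas)):
            r = self.replicas[next(self._rr)]
            if r not in self.down:
                return r                     # round-robin over healthy replicas
        return self.primary                  # degrade: read from the primary

router = ReadWriteRouter("db-primary", ["db-replica-1", "db-replica-2"])
print(router.route("INSERT INTO t VALUES (1)"))  # db-primary
print(router.route("SELECT * FROM t"))           # db-replica-1
print(router.route("SELECT * FROM t"))           # db-replica-2
```

A real proxy makes the same decision from the parsed statement rather than the leading keyword, and must also pin reads-after-writes inside a transaction to the primary.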

The implementation generally needs the following:

  • A high-performance network model, generally built on a high-performance network framework; after all, the proxy's network performance must not become the bottleneck.
  • A MySQL protocol parser; there are many open-source implementations that can be used directly.
  • A SQL parser. Sharding and read-write separation inevitably require parsing SQL. The general pipeline is SQL parsing, query optimization, SQL routing, SQL rewriting, then submitting the SQL to multiple databases for execution and merging the results.
  • The proxy itself should preferably be a stateless node, made highly available as a cluster.
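Here is a minimal sketch of the routing-and-rewriting step of that pipeline, under an assumed hash-modulo sharding scheme and naming convention. A real proxy would of course route from the parsed SQL tree rather than by string replacement.

```python
import zlib

def route_shard(logical_table, shard_key, db_count=2, tables_per_db=4):
    """After parsing the SQL and extracting the shard key, pick a physical
    database and table. The modulo scheme and db_/table suffix naming are
    illustrative assumptions."""
    total = db_count * tables_per_db
    bucket = zlib.crc32(str(shard_key).encode()) % total  # stable hash -> bucket
    db_index, table_index = divmod(bucket, tables_per_db)
    return f"db_{db_index}", f"{logical_table}_{table_index}"

def rewrite(sql, logical_table, shard_key):
    """SQL rewriting step: swap the logical table name for the physical one."""
    db, physical = route_shard(logical_table, shard_key)
    return db, sql.replace(logical_table, physical)

db, sql = rewrite("SELECT * FROM orders WHERE user_id = 42", "orders", 42)
print(db, "->", sql)
```

The point of the stable hash is that the same shard key always routes to the same database and table; queries that do not carry the shard key must instead be fanned out to every shard and merged, which is exactly the complicated read path described above.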

Besides the proxy approach, these functions can also be implemented at the data-access-standard level, for example by extending a JDBC-based framework. The two approaches each have pros and cons: the framework approach is not limited to one database type and performs slightly better, while the proxy approach supports any language, is more transparent, and can offer more powerful features. Recently there has also been a Sidecar-style idea, similar in spirit to Service Mesh; there is some material about it online, but no mature implementation has appeared so far.

Distributed Cache Middleware

This is similar to the database proxy, but with a cache service as the backend, adding clustering features on top. Open-source implementations with Redis as the backend include https://github.com/CodisLabs/codis and Eleme's https://github.com/eleme/corvus, among others. A proxy is not the only option: a cache client can also be developed at the framework level, but as noted earlier the two approaches each have pros and cons. The proxy approach is more transparent: if Java, Python, and Go all need to talk to Redis, we don't need to develop a client for each. Such a proxy generally implements the following:

  • Distribution. This is the most basic function: keys are distributed across nodes by some algorithm, with a degree of capacity planning and capacity alerting provided.
  • High availability. A certain degree of high availability, achieved in combination with some of Redis's own high-availability solutions.
  • Operations features, such as client access control, black and white lists, rate limiting, timeout-based circuit breaking, full audit search, data-migration assistance, and so on.
  • Tracking and problem analysis. Cache access tracing integrated with full-link monitoring, plus smarter usage analysis: combining hit rate, Value sizes, and load balance to produce optimization suggestions and alerts, catching problems early. Cache crashes often have precursors.
  • A complete management console for viewing cluster usage and performance and for doing capacity planning and migration.
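The key-distribution function can be sketched with a consistent-hash ring, one common choice (though not the only one: Codis, for instance, uses pre-sharded slots). Virtual nodes smooth the load, and removing a dead node remaps only the keys that lived on it.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring for distributing cache keys to nodes."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self.ring = []   # sorted list of (hash, node); vnodes entries per node
        for n in nodes:
            self.add(n)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.vnodes):
            self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    def remove(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get(self, key):
        # A key belongs to the first virtual node clockwise from its hash.
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h,)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["redis-1", "redis-2", "redis-3"])
owner = ring.get("user:42")
print(owner in ("redis-1", "redis-2", "redis-3"))  # True
```

When a node is removed, keys owned by the other nodes keep their placement, which is what limits the blast radius of a single cache-node failure.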

If the Redis cluster is especially large, having your own proxy layer is quite convenient; small projects generally don't need one.


Task Scheduling

As mentioned before, Jobs are one of the three carriages of the Internet architecture system as I see it, and they play an important role. An open-source implementation is http://elasticjob.io/. There are two ways to implement Job management. One is framework-like: the Job process stays running, and the framework invokes its methods at the appropriate times. The other is more like an external service: Job processes are started on suitable machines on demand. In the diagram at the beginning of this article I drew a task-scheduling middleware; for the latter approach we need such middleware, or an independent service, to pull up the Job processes. The whole flow is as follows:

  • Find some machines to join the cluster as our underlying server resources.
  • Jobs are packaged and deployed to a unified location. Jobs can be implemented in any language; they can be bare programs, or packaged with Docker.
  • Before running a Job we need to allocate resources: estimate what resources the Job needs, then compute a suitable allocation based on its execution frequency.
  • According to each Job's schedule, the middleware launches the process (or Docker container) at the appropriate time, assigning a suitable machine based on current conditions just before execution and releasing the resources afterwards; the next run need not land on the same machine.

Such middleware is a fairly low-level service. In general, a task framework provides the following functions:

  • Distribution. Jobs are not limited to a single machine; a cluster provides the capacity to run them and can be expanded as pressure grows, and the loss of any single machine is not a problem. If we use the middleware approach, this capability is provided by the underlying middleware.
  • A rich set of Job execution styles at the API level: for example, simple task-style Jobs, or Jobs that separate pulling data from processing it. Once data is separated from processing, we can shard the processing step and achieve a Map-Reduce-like effect.
  • Execution dependencies. We can configure dependencies between Jobs and let the framework orchestrate the execution flow automatically; the business implements only the decomposed business Jobs, while the framework arranges them according to the rules.
  • Integration with the full-link monitoring system for monitoring and tracing.
  • A rich management console providing unified configuration of execution times and data-retrieval parameters, a list of Job execution statuses with dependency analysis, execution history, and management operations such as run, pause, and stop.
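The "pull data, then process it in shards" pattern from the API bullet can be sketched as a small Map-Reduce. The function names are illustrative, not a real job framework's API.

```python
from concurrent.futures import ThreadPoolExecutor

def run_sharded_job(items, process_shard, merge, shard_count=4):
    """Split the pulled data set into shards, fan processing out across
    workers, and merge the partial results."""
    shards = [items[i::shard_count] for i in range(shard_count)]  # round-robin split
    with ThreadPoolExecutor(max_workers=shard_count) as pool:
        partials = list(pool.map(process_shard, shards))          # map step
    return merge(partials)                                        # reduce step

# Example: settle yesterday's orders by summing amounts per shard.
orders = [{"id": i, "amount": i * 10} for i in range(1, 11)]
total = run_sharded_job(
    orders,
    process_shard=lambda shard: sum(o["amount"] for o in shard),
    merge=sum,
)
print(total)  # 550
```

In the middleware setup described above, each shard would run as its own process on whatever machine the scheduler assigns, rather than as a thread; the split/merge structure is the same.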


Release Management

Release management is not closely related to development, but I consider it a link in the closed loop of the whole system. Release management can use open-source implementations such as Jenkins, though eventually you may need your own publishing system: you can build a layer on top of Jenkins, or implement the underlying deployment directly on the general task-scheduling middleware shown in the initial diagram. In general, release management has the following features:

  • Rich task types and plugins supporting the build and release of programs in various languages, with the most basic release, rollback, restart, and stop functions.
  • Support for project dependency settings, enabling automatic publication of the programs along a dependency path.
  • Some operations-level controls, for example permission control in combination with the CMDB, and release-window control.
  • Release processes suited to clusters: for example, viewing cluster groupings and setting up automatic grayscale release schemes.
  • Release processes suited to your company. For example, our flow is Dev to QA to Stage to Live: a build moves from the QA environment to the Stage environment after QA confirms it and the development supervisor approves, and enters the Live environment for release after the product manager confirms. On the publishing system we can integrate with OA to drive this process.
  • Integration of unit tests and coding-standard checks into the build, so that for every release you can easily see the code changes, test results, and standard violations in the console.
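The gated promotion flow described above can be sketched as follows. The role required at each gate is an illustrative reading of the process, not a prescription.

```python
class ReleasePipeline:
    """Sketch of a gated Dev -> QA -> Stage -> Live promotion flow:
    each promotion requires sign-off from a specific role."""

    STAGES = ["Dev", "QA", "Stage", "Live"]
    # Role whose approval is required to enter each environment (assumed).
    GATES = {"QA": "dev_supervisor", "Stage": "qa", "Live": "product_manager"}

    def __init__(self, build):
        self.build = build
        self.stage = "Dev"

    def promote(self, approved_by):
        nxt = self.STAGES[self.STAGES.index(self.stage) + 1]
        required = self.GATES[nxt]
        if approved_by != required:
            raise PermissionError(
                f"{self.stage} -> {nxt} needs sign-off from '{required}'")
        self.stage = nxt
        return self.stage

p = ReleasePipeline("order-service#1024")
print(p.promote("dev_supervisor"))   # QA
print(p.promote("qa"))               # Stage
print(p.promote("product_manager"))  # Live
```

Encoding the gates in the publishing system, rather than in people's heads, is what makes it possible to wire the same flow into OA approvals.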

Jenkins-like systems handle the first two points well, but are powerless when it comes to integrating with other company-level systems; this is often why we wrap our own publishing system around Jenkins.

To summarize: the title speaks of constantly cultivating basic middleware because middleware, like a good framework, usually needs a small dedicated team to maintain it independently and to keep iterating on its features. Merely implementing the functionality is only the baseline; beyond that:

  • Control, automation, and AI are very important aspects. We have control because we command the data flow: data passes through our middleware on its way to the underlying services, databases, and caches. With control comes the possibility of automation, of intelligent monitoring, and of integrated alerting.
  • Also because the data flows through us, analyzing it lets us give development many suggestions and establish many standards on top of it. The framework and architecture team can do these things quietly, without involving business development.
  • Because the underlying data sources are shielded, and together with the service framework, the business system ends up surrounded by the framework rather than merely using frameworks and middleware. Large company-level architecture transformations, such as a multi-active architecture, can then be achieved with minimal changes to business systems: data, services, and processes are already enclosed and perceived by the middleware, and the business system only implements business functions. We can dynamically route data, dynamically dispatch service calls, and dynamically adjust process control, all without the business system noticing. As shown below, doesn't it look a bit like a Mesh?


Much of this article is based on thinking and imagination; realizing these ideas with open-source components requires substantial modification and integration. Many large companies have done these things to some extent, but the various dependencies glued into their frameworks prevent them from open-sourcing completely. Doing this work well takes a great deal of time and energy; it truly requires constant cultivation and accumulation to develop the middleware and management systems suited to your company's technology stack.