From The Programmer’s Guide to Apache Thrift by Randy Abernethy

Distributed applications vary in the extreme when it comes to load profiles, communications patterns, message payload types and sizes, among other performance considerations. Performance isn’t a one-size-fits-all proposition. That said, we’ll use this article to develop some basic intuition around Apache Thrift networked service performance and how Apache Thrift fits into the broad MSA landscape.
Apache Thrift in the Distributed Landscape
Apache Thrift is billed as a cross-language framework for high performance RPC applications, the key phrases being “cross-language” and “high performance”. Apache Thrift has impressive cross-language support, ranging from Ruby to C++ and from JVM languages to .NET languages. That leaves us to consider the nature of the “high performance” claim.
Services
A service is a self-contained collection of invokable operations. The SOA and microservices approaches to system development decompose systems into encapsulated, individually testable services. This makes it possible to test the applications at the service level. Services requiring other services to run may depend on mock services for testing. Test clients and mocks built in scripting languages typically require language agnostic APIs to collaborate with production services.
Given that services are frequently composed from other services, inter-service communications performance is often an important consideration. Distributed systems can be decomposed into services in varying ways, including objects, functions and resources.
Systems organized around objects may be suitable for distributed object interfaces supported by technologies such as COM and CORBA, but the stateful nature of such services presents scaling challenges and has caused them to fall out of favor.
RPC-style services, perhaps the oldest service scheme, organize services around related sets of functions. Such systems are suitable for the implementation of discrete operations and resonate with traditional software approaches, as well as with modern functionally oriented systems.
Resource oriented services organize sets of related resources into services, typically in concert with the REST constraints. Often referred to as resource oriented architectures, or ROA, such systems identify resources using hierarchical internationalized resource identifiers (IRIs are the internationalized offspring of URLs/URIs).
RPC systems, like SOAP, gRPC and Apache Thrift, and ROA style approaches like REST are all in widespread use. To develop a basic understanding of the design differences between RPC and ROA style systems we can begin by looking at some representative interface specifications.
Interface Definition
RPC style services traditionally have interfaces defined in a formal interface definition language (IDL). IDLs define only the features of a service necessary to form a contract with the calling client. IDLs, therefore, define all of the mechanical aspects of an interface but can only hint at the semantics, which are equally important. Well-crafted public interfaces include the semantic explanations clients require to use the interface correctly, typically in the form of documentation integrated in code comments.
RPC Interfaces
Though RPC services are embodied by sets of operations, those operations often accept and return complex data types, typically known as entities or types at the interface level of abstraction. For example, an interface designed to return stock trade reports might traffic in the TradeReport type. Listing 1 presents a simple stock trade report service and TradeReport type definition in Apache Thrift IDL.
Listing 1 Simple Apache Thrift interface in IDL
struct TradeReport {
    1: string symbol,
    2: double price,
    3: i32 size,
    4: i32 seq_num
}

service TradeHistory {
    TradeReport get_last_sale(1: string symbol)
}
The Apache Thrift interface definition language is remarkably compact and simple. It’s also extremely expressive, allowing rich types and services to be described easily. Apache Thrift includes direct support for lists and maps and has a full complement of features enabling interface evolution.
The interface defined in Listing 1 exposes an RPC service called TradeHistory with a single self-describing method called get_last_sale() which returns a TradeReport. The TradeReport type is defined independently of the service allowing multiple methods and services to reuse the type without defining it repeatedly.
Large-scale systems may require large collections of much longer IDL files to fully describe their interfaces. Such a collection of interface definitions represents a mere fraction of the code necessary to implement the interfaces. The ability to summarize service functionality at a high level of abstraction is one of the key features of IDL. For many architects the ability to automatically generate code from IDL is secondary to the interface specification and rigor provided by the process of crafting the IDL itself.
Benefits frequently attributed to IDL based systems and processes include:
- Documentation – Interface definition languages are designed to simplify service descriptions by eliminating implementation details, making IDL a perfect base for interface documentation.
- Abstraction – The absence of implementation affords engineers and architects the ability to scrutinize the pure interface and consider the interactions in the abstract, simplifying efforts to identify and reduce server roundtrips, making it easier to identify and remove nonessential data and methods, etc.
- Domain Centric – As a simple abstract language, IDL sources in whole or in part, can be used to confer with users and Domain experts about the veracity of a design.
- Cross Language/Platform Support – IDLs aren’t implementation languages, describing interfaces in a language and platform agnostic way.
- Specification – IDL and associated standards can directly define an ABI.
- Code Generation – A robust IDL enables tooling to generate client and server stubs, simplifying and improving the reliability of client and server construction.
- Rigor – IDL can add type safety and other forms of rigor in critical interfaces, even when implementation languages (such as scripting languages) provide no such support.
Although less interesting in smaller projects, features such as those listed above pay large dividends when architecting larger systems. IDLs aren’t exclusive to RPC systems; Object and Resource oriented systems also offer interface specification mechanisms.
REST Interfaces
REpresentational State Transfer, REST, is an architectural style rather than a standard (see Fielding, 2000). Services conforming to the six REST constraints are said to be RESTful. In practice many interfaces branded as RESTful fall well short of fully embracing the six constraints. Given that REST is a tool to serve the architect, and not the other way around, many interface designers diverge from a pure REST approach intentionally. This makes the range of REST style services quite broad in practice.
One thing all practical REST services have in common is the use of HTTP as a transport. Properly designed restful services use the infrastructure of the Web, which requires the use of HTTP but places on tap a veritable universe of tools and systems ready to combine: proxies, load balancers, firewalls, reverse proxy servers, etc.
Although there’s no unifying standard, restful service interfaces can be described using various technologies such as RAML, API Blueprint, WADL and the more recent Swagger based OpenAPI Initiative. Listing 2 provides a sample listing for an HTTP based service which is functionally equivalent to the RPC service found in Listing 1.
Listing 2 Simple REST Interface in RAML
#%RAML 0.8
---
title: Trade Report API
baseUri: http://api.example.com/{version}
version: v1
schemas:
  - trade_report: |
      {
        "$schema": "http://json-schema.org/draft-03/schema",
        "type": "object",
        "properties": {
          "symbol": {"type": "string", "required": true},
          "price": {"type": "number", "required": true},
          "size": {"type": "number", "required": true},
          "seq_num": {"type": "number", "required": true}
        }
      }
/trades:
  /{symbol}:
    /last_sale:
      get:
        responses:
          200:
            body:
              application/json:
                schema: trade_report
Though it includes a bit more punctuation and boilerplate than the Apache Thrift example, the RAML service definition clearly identifies the trade_report type and the trades IRI with its last_sale subresource. Rather than calling the get_last_sale() RPC function with a stock ticker parameter to retrieve the last trade, a user of the restful version of the interface would invoke the HTTP GET method on the IRI:
http://api.example.com/v1/trades/CSCO/last_sale
The preceding Apache Thrift and RAML IDL examples highlight some of the more important differences between RPC services and ROA services. Foundationally, RPC services are decomposed into functions/operations, and ROA services are decomposed into resources/entities. In the RPC service the symbol is a parameter, but in the ROA service the symbol is a resource and it’s represented by an IRI directly.
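The contrast between the two addressing styles can be sketched in a few lines of Java. This is only an illustration: the class and method names (LastSaleIri, resourceStyle, rpcStyle) are hypothetical, the host and paths are the example values used in this article, and the symbol check is a simplifying assumption that ticker symbols contain only letters, digits and dots.

```java
import java.net.URI;

final class LastSaleIri {
    // Base IRI from the RAML baseUri in Listing 2, with {version} set to v1.
    static final String BASE = "http://api.example.com/v1";

    // ROA style: the symbol is a path segment that names a resource.
    static URI resourceStyle(String symbol) {
        return URI.create(BASE + "/trades/" + checked(symbol) + "/last_sale");
    }

    // RPC-over-IRI style: the symbol is a parameter handed to a function-like endpoint.
    static URI rpcStyle(String symbol) {
        return URI.create("http://api.example.com/get_last_sale?symbol=" + checked(symbol));
    }

    // Assumption: symbols are plain letters, digits and dots, so no escaping is needed.
    private static String checked(String symbol) {
        if (symbol == null || !symbol.matches("[A-Za-z0-9.]+")) {
            throw new IllegalArgumentException("unexpected symbol: " + symbol);
        }
        return symbol;
    }

    public static void main(String[] args) {
        System.out.println(resourceStyle("CSCO"));
        System.out.println(rpcStyle("CSCO"));
    }
}
```

In the first form the IRI itself identifies the thing being requested; in the second the IRI identifies an operation and the symbol is merely an argument to it.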
Although it’s possible to model RPC functions directly with IRIs (e.g. http://api.example.com/get_last_sale?symbol=CSCO), doing so largely misses the point of REST. Given that HTTP supports IRI methods such as GET, POST, PUT and DELETE, one might ask what it means to PUT http://api.example.com/get_last_sale?symbol=CSCO. Though often ignored by the unindoctrinated, each of the HTTP methods has distinct semantics defining safety, idempotence and support or lack thereof for the upload of a document body, along with HTTP header implications.
Because REST is an architectural style with no associated standard, implementations and approaches to documentation and interface specification vary widely. A given REST IRI may be invoked with any one of several HTTP methods (POST, OPTIONS, GET, HEAD,…), may receive and/or return a document body, may accept path parameters, query parameters and/or matrix parameters, and may define interactions with any number of HTTP headers. Restful services are, in essence, integrated into the HTTP protocol.
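As a rough aid to intuition, the safety and idempotence properties that the HTTP specification assigns to the common methods can be tabulated in code. The enum below is a hypothetical sketch (the names HttpMethodSemantics and suitsReadOnlyLookup are invented for illustration); it covers only those two properties and says nothing about request bodies or headers.

```java
final class HttpMethodSemantics {
    enum Method {
        GET(true, true),
        HEAD(true, true),
        OPTIONS(true, true),
        PUT(false, true),
        DELETE(false, true),
        POST(false, false);

        final boolean safe;        // the client requests no change of state on the server
        final boolean idempotent;  // repeating the request has the same intended effect as sending it once

        Method(boolean safe, boolean idempotent) {
            this.safe = safe;
            this.idempotent = idempotent;
        }
    }

    // A read-only lookup such as get_last_sale maps naturally only onto a safe method.
    static boolean suitsReadOnlyLookup(Method m) {
        return m.safe;
    }

    public static void main(String[] args) {
        for (Method m : Method.values()) {
            System.out.printf("%-7s safe=%-5b idempotent=%b%n", m, m.safe, m.idempotent);
        }
    }
}
```

Seen this way, a PUT to a get_last_sale IRI is awkward because PUT signals a state change, while the operation it names only reads data.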
In contrast, RPC services generally define their own protocol and typically run directly on TCP or UDP, the protocols below HTTP. When RPC systems like Apache Thrift and SOAP use HTTP, they effectively tunnel within HTTP POST methods, making no practical use of HTTP methods, headers and the like.
HTTP integration brings many advantages to ROA systems. First and foremost, HTTP is the protocol of the Web and the Web is the largest distributed system ever created by mankind. Consequently, properly designed restful services integrate naturally with most pieces of networking software and hardware. For example, the HTTP GET method has “safe” semantics, meaning that the user expects no side effects (changes in state) on the server as a consequence of a GET request. A GET request using a given IRI may be safely returned from cache in many cases. Every browser, proxy server and reverse proxy server in the world understands this. It’s hard to imagine a client/server interaction faster than one that returns the response from an in-process cache, as is possible in browser based REST clients.
REST and RPC
The ubiquity of HTTP effectively makes REST services language agnostic and universally accessible. Several frameworks supporting restful service creation can be found for any language. Given that restful services integrate with and use the infrastructure of the Web, it’s no wonder that the term API is almost synonymous with restful services these days.
Why do we need Apache Thrift? A good question, to which some might answer:
- Apache Thrift offers clear standards in contrast to the diverse but interoperability challenged tools of the REST world
- Apache Thrift IDL is cleaner and easier to work with than the many REST equivalents
- The Apache Thrift IDL compiler generates consistent code in a wider range of languages than most REST code generators
- Some systems have interfaces more naturally suited to functional decomposition rather than resource decomposition
Although all of these are fair points, they may not be enough to sway one from the utter ubiquity of REST. Many applaud the tool-less nature and flexibility of the REST style. For public interfaces deployed over the Web, REST is often the natural choice.
Apache Thrift provides one killer feature that REST can’t compete with: performance in a non-web environment. If you need responsiveness, support for extreme request rates in backend systems or the ability to run services in resource restricted embedded systems, Apache Thrift may be the right tool.
In this article we’ll build a simple REST style service and a corresponding Apache Thrift service in order to better understand the strengths and weaknesses of each. Every programming language worth its salt provides at least one, likely several, frameworks for implementing REST style services. We’ll build our services in Java and use Jersey, the reference implementation of JAX-RS, the Java REST API standard. Listing 3 provides the REST service source code.
Listing 3 ~/ThriftBook/part3/ws/rest/rest-servlet/src/main/java/RestServer.java
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;

@Path("tradehistory")
public class RestServer {

    public static class TradeReport {
        public String symbol;
        public double price;
        public int size;
        public int seq_num;

        public TradeReport(){}

        public TradeReport(String symbol, double price, int size, int seq_num) {
            this.symbol = symbol;
            this.price = price;
            this.size = size;
            this.seq_num = seq_num;
        }
    }

    @GET
    @Path("/get_last_sale")
    @Produces(MediaType.APPLICATION_JSON)
    public TradeReport get_last_sale(@QueryParam("symbol") String symbol) {
        return new TradeReport(symbol, 25.50, 100, 1);
    }
}
The service in listing 3 is designed to provide a direct comparison with our upcoming Apache Thrift example, so it is not resource-oriented in design. That said, it demonstrates the JAX-RS approach to REST services fairly well. JAX-RS uses annotations to identify mappings between code and interface elements; other popular Java frameworks such as Spring use a similar approach. For example, the @Path annotation on the RestServer class causes it to handle all IRIs under the tradehistory path. Similarly, the get_last_sale() method handles all of the GET requests to the tradehistory/get_last_sale subresource IRI.
Figure 1 – TradeHistory in WADL
The get_last_sale() method returns a TradeReport instance. The @Produces annotation causes the returned object to be converted into a JSON string and placed in the response body.
Embedding the interface into the server code represents a significant difference in approach to that of IDL style systems. Rather than generating service code from IDL, the Jersey framework generates IDL from the code. The Jersey generated IDL is known as WADL, Web Application Description Language, and is a distant relative of the SOAP WSDL interface definition language. Unlike WSDL, WADL is not a W3C standard and not all REST adherents use it. Diverse opinions on how best to build restful services complicate the task of defining an overarching REST service description language. The WADL generated from our JAX-RS server is available through the application.wadl IRI and appears in Figure 1. You may note that there’s no mention of our TradeReport type. In this case the WADL identifies only that the get_last_sale() method returns a JSON response.
As it stands the WADL in Figure 1 isn’t complete enough to generate full client or server stubs. For this and other reasons, tools such as Swagger, API Blueprint and RAML, have been developed as WADL alternatives, making it easier to use an IDL first or round-trip approach to interface development.
The JAX-RS REST service in Listing 3 can be built using the Apache Maven project included with the book’s source code. The Maven project includes support for running the service within an Apache Tomcat 7 application server, one of the most common hosts for restful web services. Here’s a sample run of the REST server:
thrift@ubuntu:~/ThriftBook/part3/ws/rest/rest-servlet$ mvn clean package
...
thrift@ubuntu:~/ThriftBook/part3/ws/rest/rest-servlet$ mvn tomcat7:run
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------
[INFO] Building Rest Server
[INFO] task-segment: [tomcat7:run]
[INFO] ------------------------------------------------------------------
[INFO] Preparing tomcat7:run
[INFO] [resources:resources {execution: default-resources}]
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory src/main/resources
[INFO] [compiler:compile {execution: default-compile}]
[INFO] Nothing to compile - all classes are up to date
[INFO] [tomcat7:run {execution: default-cli}]
[INFO] Running war on http://localhost:8080/rest-server
[INFO] Creating Tomcat server configuration at target/tomcat
[INFO] create webapp with contextPath: /rest-server
Figure 2 – Calling a REST service from a browser
The first command, mvn clean package, removes intermediate files and targets, then builds the servlet into a war file (target/rest-server-1.0-SNAPSHOT.war). The second command runs a Tomcat 7 server to host the servlet.
One of the great features of REST services is the ease with which you can invoke them using a plain vanilla browser. Figure 2 shows a browser invoking our new service.
The browser uses the GET verb to invoke the tradehistory/get_last_sale IRI, with the symbol set to AAPL. To measure throughput we’ll use a test client that GETs the tradehistory/get_last_sale IRI one million times in a tight loop. Here’s a sample timed run of the client:
thrift@ubuntu:~/ThriftBook/part3/ws/rest/rest-client$ time mvn exec:java
[INFO] Scanning for projects...
[INFO] -------------------------------------------------------------------
[INFO] Building Rest Client
[INFO] task-segment: [exec:java]
[INFO] -------------------------------------------------------------------
[INFO] Preparing exec:java
[INFO] No goals needed for project - skipping
[INFO] [exec:java {execution: default-cli}]
[INFO] -------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] -------------------------------------------------------------------
[INFO] Total time: 5 minutes 2 seconds
[INFO] Finished at: Fri Jul 31 13:55:59 PDT 2015
[INFO] Final Memory: 18M/90M
[INFO] -------------------------------------------------------------------

real 5m3.331s
user 1m15.539s
sys 0m54.374s
In this session we clean and build the client jar, then run it under the Linux time command. The run takes a little over five minutes to complete on the test system, producing one server round trip per 303 microseconds. As you can see from the timing data, most of the elapsed time on the client is spent waiting for the server to respond. Out of the elapsed five minutes the client consumed 75 seconds of CPU in user mode and 54 seconds of CPU in kernel mode.
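The per-call figure follows directly from the timing output. The sketch below reproduces the arithmetic using the real, user and sys values reported by the time command above; the class and method names (RoundTripMath, microsPerCall, cpuFraction) are hypothetical helpers introduced only for illustration.

```java
final class RoundTripMath {
    // Average wall-clock time per call, in microseconds.
    static double microsPerCall(double elapsedSeconds, long calls) {
        return elapsedSeconds * 1_000_000.0 / calls;
    }

    // Share of the elapsed time during which the client process was using the CPU.
    static double cpuFraction(double userSeconds, double sysSeconds, double elapsedSeconds) {
        return (userSeconds + sysSeconds) / elapsedSeconds;
    }

    public static void main(String[] args) {
        double real = 5 * 60 + 3.331;   // real 5m3.331s
        double user = 1 * 60 + 15.539;  // user 1m15.539s
        double sys = 54.374;            // sys 0m54.374s
        long calls = 1_000_000L;

        // About 303 microseconds per round trip.
        System.out.printf("microseconds per call: %.1f%n", microsPerCall(real, calls));
        // Roughly 0.43, so the client spent more than half of the run waiting on the server.
        System.out.printf("client CPU fraction: %.2f%n", cpuFraction(user, sys, real));
    }
}
```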
Whether you deem this REST service fast or slow, service responsiveness is an important factor. Google, Microsoft, IBM and others have produced numerous studies bearing out the criticality of application responsiveness, particularly in the context of mobile devices. TechCrunch reported on April 21, 2015 that 44% of the Fortune 500 failed a standardized mobile responsiveness test, an unfortunate result given that the number of mobile internet users surpassed the number of desktop users in early 2014. Although a large part of responsive design is centered on making pages look good in a given viewport, you need to have the data to render in the first place. A 2014 Google study documented a 20% drop in traffic associated with a 500 millisecond increase in search latency. Load time is also one of Google’s search ranking criteria.
If we needed things to run faster we could consider Apache Thrift as an alternative to REST. Below we’ll create an Apache Thrift example as close to the REST technology platform as possible. As in the REST example, we’ll begin by creating an Apache Thrift RPC server using a Java servlet running under Tomcat 7 with the JSON serialization protocol. The only practical difference between our new Apache Thrift service and the prior REST service is that the REST service is IRI based and uses Jersey to parse and serialize and the Apache Thrift service is RPC based and uses generated Apache Thrift code to parse and serialize. The code for the simple Apache Thrift service appears in listing 4.
Listing 4 ~/ThriftBook/part3/ws/thrift/thrift-servlet/src/main/java/ThriftServer.java
import org.apache.thrift.protocol.TJSONProtocol;
import org.apache.thrift.server.TServlet;

public class ThriftServer extends TServlet {

    public static class TradeHistoryHandler implements TradeHistory.Iface {
        @Override
        public TradeReport get_last_sale(String symbol) {
            return new TradeReport(symbol, 25.50, 100, 1);
        }
    }

    public ThriftServer() {
        super(new TradeHistory.Processor(new TradeHistoryHandler()),
              new TJSONProtocol.Factory());
    }
}
The Apache Thrift service implements the interface defined in the IDL from listing 1, functionally equivalent to the WADL for our REST service. As illustrated in listing 4, using the Java servlet API makes implementing an Apache Thrift server a trivial affair, but how does it perform?
Here’s a session building and running the Apache Thrift servlet server:
thrift@ubuntu:~/ThriftBook/part3/ws/thrift-servlet$ mvn clean package
...
thrift@ubuntu:~/ThriftBook/part3/ws/thrift-servlet$ mvn tomcat7:run
[INFO] Scanning for projects...
[INFO] -------------------------------------------------------------------
[INFO] Building Thrift Servlet
[INFO] task-segment: [tomcat7:run]
[INFO] -------------------------------------------------------------------
[INFO] Preparing tomcat7:run
[INFO] [thrift:compile {execution: thrift-sources}]
[INFO] [resources:resources {execution: default-resources}]
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory src/main/resources
[INFO] Copying 1 resource
[INFO] [compiler:compile {execution: default-compile}]
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 4 source files to target/classes
[INFO] [tomcat7:run {execution: default-cli}]
[INFO] Running war on http://localhost:8080/thrift-servlet
[INFO] Creating Tomcat server configuration at target/tomcat
[INFO] create webapp with contextPath: /thrift-servlet
With the server up and running in Tomcat we can perform a throughput test by running a sample RPC test client in another shell. The thrift-servlet project has an exec:java goal to run the client, which makes the exact same 1,000,000 get_last_sale() calls used to test the REST service. Here’s the output from the client session:
thrift@ubuntu:~/ThriftBook/part3/ws/thrift-servlet$ time mvn exec:java
[INFO] Scanning for projects...
[INFO] -------------------------------------------------------------------
[INFO] Building Thrift Servlet
[INFO] task-segment: [exec:java]
[INFO] -------------------------------------------------------------------
[INFO] Preparing exec:java
[INFO] No goals needed for project - skipping
[INFO] [exec:java {execution: default-cli}]
[INFO] -------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] -------------------------------------------------------------------
[INFO] Total time: 3 minutes 2 seconds
[INFO] Finished at: Fri Jul 31 16:58:51 PDT 2015
[INFO] Final Memory: 11M/61M
[INFO] -------------------------------------------------------------------

real 3m3.130s
user 0m21.790s
sys 1m22.852s
With the same system and conditions, the Apache Thrift server completes the run in about three-fifths of the time taken by the REST service. The Apache Thrift test transfers more bytes than the REST example: the REST GET operation only has a payload on the response side, whereas the Apache Thrift client POSTs its request to the server in the request body. The key differentiator is serialization efficiency. Even though both example programs use HTTP and JSON, the REST example uses a general-purpose framework (Jersey) and JSON serializer (MOXy), whereas the Apache Thrift program uses a compiled, purpose-built JSON serializer created for the TradeReport type by the Apache Thrift IDL compiler.
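The difference between a general-purpose serializer and a purpose-built one can be illustrated with a toy example. The sketch below is not MOXy’s implementation and does not produce the TJSONProtocol wire format; it only contrasts discovering an object’s fields at run time through reflection with straight-line code written for one known type. All names (SerializerSketch, reflective, purposeBuilt) are hypothetical, and the code assumes string values need no JSON escaping.

```java
import java.lang.reflect.Field;
import java.util.Arrays;
import java.util.Comparator;

final class SerializerSketch {
    static final class TradeReport {
        public String symbol;
        public double price;
        public int size;
        public int seq_num;

        TradeReport(String symbol, double price, int size, int seq_num) {
            this.symbol = symbol;
            this.price = price;
            this.size = size;
            this.seq_num = seq_num;
        }
    }

    // General-purpose approach: inspect the object on every call to find its fields.
    static String reflective(Object o) {
        Field[] fields = o.getClass().getDeclaredFields();
        // Sort by name so the output order does not depend on the JVM.
        Arrays.sort(fields, Comparator.comparing(Field::getName));
        StringBuilder sb = new StringBuilder("{");
        try {
            for (Field f : fields) {
                if (f.isSynthetic()) continue;
                if (sb.length() > 1) sb.append(',');
                sb.append('"').append(f.getName()).append("\":");
                Object v = f.get(o);
                if (v instanceof String) sb.append('"').append(v).append('"');
                else sb.append(v);
            }
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
        return sb.append('}').toString();
    }

    // Purpose-built approach: the field names and types are fixed at compile time.
    static String purposeBuilt(TradeReport t) {
        return new StringBuilder(64)
            .append("{\"price\":").append(t.price)
            .append(",\"seq_num\":").append(t.seq_num)
            .append(",\"size\":").append(t.size)
            .append(",\"symbol\":\"").append(t.symbol).append("\"}")
            .toString();
    }

    public static void main(String[] args) {
        TradeReport t = new TradeReport("AAPL", 25.50, 100, 1);
        System.out.println(reflective(t));
        System.out.println(purposeBuilt(t));
    }
}
```

Both methods emit the same text, but the second does no type inspection, sorting or boxing per call, which is the kind of work generated serialization code avoids.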
Although this is a significant performance improvement we can do better.
Transports
If we need more performance we can leave the Tomcat server and the HTTP protocol behind and implement the exact same service with one of the Apache Thrift servers and the TSocket TCP transport. Listing 5 provides the source for such a server.
Listing 5 ~/ThriftBook/part3/ws/thrift/thrift-json/ThriftServer.java
import java.io.IOException;
import org.apache.thrift.transport.TServerSocket;
import org.apache.thrift.transport.TTransportException;
import org.apache.thrift.protocol.TJSONProtocol;
import org.apache.thrift.server.TThreadPoolServer;

public class ThriftServer {

    public static class TradeHistoryHandler implements TradeHistory.Iface {
        @Override
        public TradeReport get_last_sale(String symbol) {
            return new TradeReport(symbol, 25.50, 100, 1);
        }
    }

    public static void main(String[] args)
            throws TTransportException, IOException {
        TradeHistory.Processor proc =
            new TradeHistory.Processor(new TradeHistoryHandler());
        TServerSocket trans_svr = new TServerSocket(9090);
        TThreadPoolServer server =
            new TThreadPoolServer(new TThreadPoolServer.Args(trans_svr)
                .protocolFactory(new TJSONProtocol.Factory())
                .processor(proc));
        System.out.println("[Server] listening of port 9090");
        server.serve();
    }
}
Not only does this server eliminate the Tomcat overhead, it also eliminates the reliance on HTTP (methods, headers and the like). Here’s a sample run of a TCP based Apache Thrift server (still using the JSON protocol):
thrift@ubuntu:~/ThriftBook/part3/ws/thrift-json$ ant runServer
Buildfile: /home/thrift/ThriftBook/part3/ws/thrift-json/build.xml

runServer:
     [java] [Server] listening of port 9090
Now we can run the TCP client, which uses the same 1,000,000 calls to get_last_sale():
thrift@ubuntu:~/ThriftBook/part3/ws/thrift-json$ time ant runClient
Buildfile: /home/thrift/ThriftBook/part3/ws/thrift-json/build.xml

runClient:

BUILD SUCCESSFUL
Total time: 29 seconds

real 0m30.055s
user 0m10.641s
sys 0m7.038s
In this test the TCP based server is an order of magnitude faster than the HTTP REST server. Also, by eliminating the HTTP overhead, this example reduces request and response size by a factor of two.
Serialization
For even more performance we can switch from the slow-to-parse JSON protocol to the Apache Thrift binary or compact protocol. Here’s an example run of the service from listing 5 using the TBinaryProtocol instead of TJSONProtocol. The server first:
thrift@ubuntu:~/ThriftBook/part3/ws/thrift$ ant runServer
Buildfile: /home/thrift/ThriftBook/part3/ws/thrift/build.xml

runServer:
     [java] [Server] listening of port 9090
Now the client’s 1,000,000 call timing output:
thrift@ubuntu:~/ThriftBook/part3/ws/thrift$ time ant runClient
Buildfile: /home/thrift/ThriftBook/part3/ws/thrift/build.xml

runClient:

BUILD SUCCESSFUL
Total time: 15 seconds

real 0m15.370s
user 0m4.196s
sys 0m3.896s
Figure 3 – Number of seconds required to complete 1 million API calls
The TBinaryProtocol version of our client/server solution is 20 times faster than the REST solution. The number of bytes exchanged in this case is comparable to the preceding JSON example due to the trivial nature of the interface, but the elimination of JSON parsing improves performance by a factor of almost two. Another important service consideration in many environments is memory consumption. The Tomcat 7 Jersey based REST servlet initially reserves about 500K of private memory in the above tests while the final Apache Thrift server example reserves about 75K of private memory. You can use tools like Wireshark, nethogs, iptraf, top, htop, ps and pmap to examine the footprint and performance features of your own targeted services.
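The speedup figures quoted in this article can be checked against the wall-clock times reported by the four timed runs. The following sketch simply divides those times; SpeedupMath and speedup are hypothetical names, and the constants are the "real" values from the sessions above.

```java
final class SpeedupMath {
    // Wall-clock seconds for 1,000,000 calls, taken from the timed runs in this article.
    static final double REST_HTTP_JSON = 5 * 60 + 3.331;    // Jersey servlet on Tomcat
    static final double THRIFT_HTTP_JSON = 3 * 60 + 3.130;  // Thrift servlet on Tomcat
    static final double THRIFT_TCP_JSON = 30.055;           // TSocket with TJSONProtocol
    static final double THRIFT_TCP_BINARY = 15.370;         // TSocket with TBinaryProtocol

    // How many times faster the candidate run is than the baseline run.
    static double speedup(double baselineSeconds, double candidateSeconds) {
        return baselineSeconds / candidateSeconds;
    }

    public static void main(String[] args) {
        System.out.printf("Thrift servlet vs REST: %.2fx%n",
            speedup(REST_HTTP_JSON, THRIFT_HTTP_JSON));
        System.out.printf("Thrift TCP/JSON vs REST: %.2fx%n",
            speedup(REST_HTTP_JSON, THRIFT_TCP_JSON));
        System.out.printf("Thrift TCP/binary vs REST: %.2fx%n",
            speedup(REST_HTTP_JSON, THRIFT_TCP_BINARY));
        System.out.printf("Binary vs JSON over TCP: %.2fx%n",
            speedup(THRIFT_TCP_JSON, THRIFT_TCP_BINARY));
    }
}
```

The ratios come out at roughly 1.7, 10, 20 and just under 2, which matches the "three-fifths of the time", "order of magnitude", "20 times" and "almost two" descriptions in the text.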
Figure 3 shows the relative performance of the various examples above and adds a comparable SOAP example. The SOAP web service is the slowest of the bunch largely because it incurs all of the overhead of the REST service plus the additional size and processing burden of XML encoding in both directions.
Performance caveats
Although the preceding examples are useful for developing basic performance intuition, they should be taken with a grain of salt and are no substitute for practical testing of real interfaces using your languages and your production loads. Many important factors have been overlooked in this simple comparison. That said, even the limited nature of this comparison demonstrates why companies like Google (with Protocol Buffers), Facebook and Twitter (both users of Apache Thrift) have adopted non-REST solutions for high performance backend services.