Sunday, 1 April 2012

java.lang.Thread considered harmful

As Moore's law seems to be moving horizontally instead of vertically these days, most CPU intensive programs have split up their logic in threads so it can be run on multiple cores simultaneously. This seems appealing, but for the programmer it is not. It is no wonder that thinking in parallel operations results in more bugs, as there is only so much that a programmer can mentally keep in his head.

Consider this engineering comparison: No matter how fast you make a train, it has to stop at every station where passengers have to be picked up. This is annoying for the passengers who have no business at the current station. This is the serial way of doing things. Recently a parallel alternative was proposed that would allow trains to never stop moving. It is a simple idea that no doubt has popped into many an engineer's brain in the last 150 years: passengers start their journey on board of a small train that will match up to the speed of the “main train” and passengers will be able to transfer while both trains are riding side by side at high speeds. From the moment they get on board till the moment they arrive at their destination, passengers will travel without interruption!

While this side-by-side-trains idea is great in theory, we should ask ourselves: why has no one built it? Why is no one using it? You do not have to be an engineer to figure this out. It is extremely complex, difficult and dangerous.

In the programming-industry we have the habit of asking the five whys. If you want to change something, you should be able to give five convincing answers to 'Why?'. When five tests have passed, you can be sure that there is a solid reason for implementing the change. When we started to support parallel programming we had not implemented this safety procedure yet. Looking back, we think that our CTO should have had this conversation with the guy who wanted parallelism before making the move:
  1. CTO: Ok, so why exactly do you want parallel execution?
    Guy: CPU's are getting more cores and are not becoming faster.
  2. Why do CPUs get more cores instead of becoming faster?
    Because of some technical issues, ask Intel?
  3. Why should we ask Intel?
    Because I don't know the answer myself.
  4. Why don't you know the answer yourself?
    ...
The over-enthusiastic programmer can not even get to the fifth why. This conversation shows us that there is no real problem with serial execution.

Unfortunately, we did make the move, and we are paying dearly for it. Instead we should have taken a lesson from the railroads, who have been running trains in a serial fashion for about 200 years: slowing down every now and then is not so bad. Why hurry? Is there a real business need for lightning fast operations? No.

Although parallel execution might be OK as an academic exercise, for real businesses the advantages are marginal. Futhermore, we have discovered that parallelism is the cause of at least 90% of all software crashes. (In fact, we found that in our projects it was the cause of every single bug, but we should keep some room for statistics so we put 90% instead of 100%.) We came to this conclusion by analyzing all our logged stacktraces (which are the computer's way of asking the five whys) and found that java.lang.Thread was the root cause of virtually all problems.

Take for example this problem, where Peter tried to call a webservice. Upon execution, the JVM throws out this stacktrace.


Advanced stacktrace: at mA.a(SourceFile:188) Caused by: com.sun.xml.ws.client.ClientTransportException: HTTP transport error: java.net.ConnectException: Connection timed out: connect at com.sun.xml.ws.transport.http.client.HttpClientTransport.getOutput(HttpClientTransport.java:132) at com.sun.xml.ws.transport.http.client.HttpTransportPipe.process(HttpTransportPipe.java:256) at com.sun.xml.ws.transport.http.client.HttpTransportPipe.processRequest(HttpTransportPipe.java:184) at com.sun.xml.ws.transport.DeferredTransportPipe.processRequest(DeferredTransportPipe.java:137) at com.sun.xml.ws.api.pipe.Fiber.doRun(Fiber.java:641) at com.sun.xml.ws.api.pipe.Fiber.doRun(Fiber.java:600) at com.sun.xml.ws.api.pipe.Fiber.doRun(Fiber.java:585) at com.sun.xml.ws.api.pipe.Fiber.runSync(Fiber.java:482) at com.sun.xml.ws.client.Stub.process(Stub.java:323) at com.sun.xml.ws.client.dispatch.DispatchImpl.doInvoke(DispatchImpl.java:192) at com.sun.xml.ws.client.dispatch.DispatchImpl.invoke(DispatchImpl.java:218) at nA$2.a(SourceFile:318) at nA$2.run(SourceFile:296) at java.security.AccessController.doPrivileged(Native Method) at nA.a(SourceFile:285) at nA.a(SourceFile:136) at com.mendix.modules.webservices.WebserviceModule.callWebservice(SourceFile:227) at lw.a(SourceFile:104) at mA.a(SourceFile:73) at mz.executeAction(SourceFile:101) at com.mendix.systemwideinterfaces.core.UserAction.execute(SourceFile:48) at com.mendix.core.actionmanagement.CoreAction.call(SourceFile:435) at hk.b(SourceFile:156) at com.mendix.core.Core.execute(SourceFile:212) at gd.execute(SourceFile:184) at iI.a(SourceFile:311) at iI.a(SourceFile:240) at iI.processRequest(SourceFile:179) at iL.a(SourceFile:71) at com.mendix.core.MxRuntime.processRequest(SourceFile:856) at com.mendix.m2ee.server.handler.RuntimeHandler.handle(RuntimeHandler.java:43) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:113) at org.eclipse.jetty.server.Server.handle(Server.java:334) at org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:559) at org.eclipse.jetty.server.HttpConnection$RequestHandler.content(HttpConnection.java:1007) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:747) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:209) at org.eclipse.jetty.server.HttpConnection.handle(HttpConnection.java:406) at org.eclipse.jetty.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:462) at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:436) at java.lang.Thread.run(Thread.java:662) Caused by: java.net.ConnectException: Connection timed out: connect at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351) at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213) at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366) at java.net.Socket.connect(Socket.java:529) at java.net.Socket.connect(Socket.java:478) at sun.net.NetworkClient.doConnect(NetworkClient.java:163) at sun.net.www.http.HttpClient.openServer(HttpClient.java:395) at sun.net.www.http.HttpClient.openServer(HttpClient.java:530) at sun.net.www.http.HttpClient.(HttpClient.java:234) at sun.net.www.http.HttpClient.New(HttpClient.java:307) at sun.net.www.http.HttpClient.New(HttpClient.java:324) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:970) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:949) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:836) at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1014) at com.sun.xml.ws.transport.http.client.HttpClientTransport.getOutput(HttpClientTransport.java:120) at com.sun.xml.ws.transport.http.client.HttpTransportPipe.process(HttpTransportPipe.java:256) at com.sun.xml.ws.transport.http.client.HttpTransportPipe.processRequest(HttpTransportPipe.java:184) at com.sun.xml.ws.transport.DeferredTransportPipe.processRequest(DeferredTransportPipe.java:137) at com.sun.xml.ws.api.pipe.Fiber.doRun(Fiber.java:641) at com.sun.xml.ws.api.pipe.Fiber.doRun(Fiber.java:600) at com.sun.xml.ws.api.pipe.Fiber.doRun(Fiber.java:585) at com.sun.xml.ws.api.pipe.Fiber.runSync(Fiber.java:482) at com.sun.xml.ws.client.Stub.process(Stub.java:323) at com.sun.xml.ws.client.dispatch.DispatchImpl.doInvoke(DispatchImpl.java:192) at com.sun.xml.ws.client.dispatch.DispatchImpl.invoke(DispatchImpl.java:218) at nA$2.a(SourceFile:318) at nA$2.run(SourceFile:296) at java.security.AccessController.doPrivileged(Native Method) at nA.a(SourceFile:285) at nA.a(SourceFile:136) at com.mendix.modules.webservices.WebserviceModule.callWebservice(SourceFile:227) at lw.a(SourceFile:104) at mA.a(SourceFile:73) at mz.executeAction(SourceFile:101) at com.mendix.systemwideinterfaces.core.UserAction.execute(SourceFile:48) at com.mendix.core.actionmanagement.CoreAction.call(SourceFile:435) at hk.b(SourceFile:156) at com.mendix.core.Core.execute(SourceFile:212) at gd.execute(SourceFile:184) at iI.a(SourceFile:311) at iI.a(SourceFile:240) at iI.processRequest(SourceFile:179) at iL.a(SourceFile:71) at com.mendix.core.MxRuntime.processRequest(SourceFile:856) at com.mendix.m2ee.server.handler.RuntimeHandler.handle(RuntimeHandler.java:43) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:113) at org.eclipse.jetty.server.Server.handle(Server.java:334) at org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:559) at org.eclipse.jetty.server.HttpConnection$RequestHandler.content(HttpConnection.java:1007) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:747) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:209) at org.eclipse.jetty.server.HttpConnection.handle(HttpConnection.java:406) at org.eclipse.jetty.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:462) at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:436) at java.lang.Thread.run(Thread.java:662)

This stacktrace shows you the path that the execution traveled. It goes from ClientTransportException, and com.mendix.modules.webservices.WebserviceModule.callWebservice, and eventually collides head on with java.lang.Thread. We have found similar patterns in all of our stacktraces. The intermediate steps are different every time, but the root cause of the problem is always with Thread.

You might wonder, with all these collisions and Thread problems: why does some Java software work at all? Well, it is the programmer's job to make sure that the execution is rerouted dynamically, to steer away whenever an execution path is headed towards Thread. Just as a compass shows you the North Pole, there is a pointer in your program that points towards Thread. The program needs to check this pointer every once in a while to steer away and avoid a collision. This is difficult enough with one Thread, but with ever increasing numbers of cores, this becomes impractical if not impossible. Imagine having to check 4, 16, 64 or 128 compasses before every single step you take! Naturally the program does not do this, it takes a number of steps before checking. Consequently, the chance of collision with a Thread for a single program becomes roughly 1 / (n - t) where n is the number of unchecked steps, and t is the number of threads. Let me be frank: this is unacceptable.



We'd like to send this message out to the rest of the industry: Stop this parallel madness, slow down a bit! Programmers are more expensive than computers. If you want the best of both worlds, just wait until single CPUs will get faster. Vertical scaling is the way forward!