hi,
Hello,
I’m running spark application with spark 2.3 docker image and getting below error in spark driver pod logs and executor pods are getting killed midway through while the job is running and even driver pod Terminated with below intermittent error
Not able to see executor logs as executor pods are killed
How can i view the executor logs to get more info ?
Im using spark.executor.memory as 7g
spark.driver.memory 1g
And im using AWS Ec2 instance type r4.4xlarge which has 16 vcPus and 122 GB on each Ec2 instance and im running on 20 instances.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 23 in stage 36.0 failed 4 times, most recent failure: Lost task 23.3 in stage 36.0 (TID 1006, 10.10.125.119, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Executor lost for unknown reasons.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2027)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
... 42 mor
More details from driver pod logs
2018-07-30 19:58:41 ERROR TaskSchedulerImpl:70 - Lost executor 3 on 10.*.*.*.*: Executor lost for unknown reasons.
2018-07-30 19:58:41 WARN TaskSetManager:66 - Lost task 32.0 in stage 9.0 (TID 133, 10.10.147.6, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Executor lost for unknown reasons.
2018-07-30 19:58:41 WARN KubernetesClusterSchedulerBackend:66 - Received delete event of executor pod accelerate-snowflake-test-5b6ba9d5495b3ae9a1358ae9c3f9a8c3-exec-3. Reason: null
2018-07-30 19:58:41 WARN KubernetesClusterSchedulerBackend:347 - Executor with id 3 was not marked as disconnected, but the watch received an event of type DELETED for this executor. The executor may have failed to start in the first place and never registered with the driver.
2018-07-30 19:58:41 INFO TaskSetManager:54 - Starting task 32.1 in stage 9.0 (TID 134, 10.*.*.*.*, executor 7, partition 32, ANY, 9262 bytes)
2018-07-30 19:58:42 INFO ContextCleaner:54 - Cleaned accumulator 246
2018-07-30 19:58:42 INFO ContextCleaner:54 - Cleaned accumulator 252
2018-07-30 19:58:42 INFO ContextCleaner:54 - Cleaned accumulator 254
2018-07-30 19:58:42 INFO BlockManagerInfo:54 - Removed broadcast_11_piece0 on spark-1532979165550-driver-svc.spark.svc:7079 in memory (size: 6.9 KB, free: 997.6 MB)
2018-07-30 19:58:42 INFO BlockManagerInfo:54 - Removed broadcast_11_piece0 on 10.*.*.*.*:43815 on disk (size: 6.9 KB)
2018-07-30 19:58:42 WARN TransportChannelHandler:78 - Exception in connection from /10.*.*.*.*:37578
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
2018-07-30 19:58:42 ERROR TransportResponseHandler:154 - Still have 1 requests outstanding when connection from /10.*.*.*.*:37578 is closed
2018-07-30 19:58:42 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:54 - Disabling executor 7.
2018-07-30 19:58:42 INFO DAGScheduler:54 - Executor lost: 7 (epoch 1)
2018-07-30 19:58:42 WARN BlockManagerMaster:87 - Failed to remove broadcast 11 with removeFromMaster = true - Connection reset by peer
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
2018-07-30 19:58:42 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 7 from BlockManagerMaster.
2018-07-30 19:58:42 WARN BlockManagerMasterEndpoint:66 - No more replicas available for rdd_9_37 !
MasterEndpoint:54 - Removing block manager BlockManagerId(7, 10.*.*.*.*, 43888, None)
2018-07-30 19:58:42 INFO BlockManagerMaster:54 - Removed 7 successfully in removeExecutor
2018-07-30 19:58:42 INFO DAGScheduler:54 - Shuffle files lost for executor: 7 (epoch 1)
2018-07-30 19:58:42 ERROR ContextCleaner:91 - Error cleaning broadcast 11
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:155)
at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:321)
at org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
at org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:66)
at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:238)
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:194)
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:185)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:185)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:178)
at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:73)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java: