Problem with PySpark on Kubernetes via the Bitnami Helm chart

Hi everyone,
I have a problem using PySpark on a Spark cluster deployed to Kubernetes with the Bitnami Helm chart:
every job written in Java or Scala runs correctly, but some code run with PySpark does not produce correct results.

The process I used to create the Spark cluster is written up at this link: medium

Example Python code (ctp.py):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WritingParquet").getOrCreate()

df = spark.read.csv("/path/to/parquet/file.csv")

df.show()

df.write.parquet("a.parquet")

running bash command :

kubectl exec -it kayvan-release-spark-master-0 -- ./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077 \
  ctp.py

result & problem :thinking: :
df.show() -> shows the data correctly, and the a.parquet folder is created with _SUCCESS and _SUCCESS.crc files, but no part-*.parquet files exist, while the same code from the Scala shell works correctly!
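One thing worth checking (this is an assumption about standalone mode, not a confirmed diagnosis): each executor writes its own part-*.parquet files to its local filesystem, while the driver writes the _SUCCESS marker on commit, so the data files may exist on the worker pods rather than on the master where you are looking. A minimal, Spark-free sketch of that check to run on each node (`a.parquet` comes from the code above; the helper name is made up here):

```python
import glob
import os
import tempfile

def has_part_files(path):
    """Return True if a Spark output directory contains data files,
    not just the _SUCCESS / _SUCCESS.crc marker files."""
    return len(glob.glob(os.path.join(path, "part-*"))) > 0

# Simulate the failure mode described above: a directory holding only
# the marker file that the driver writes on job commit.
out = tempfile.mkdtemp()
open(os.path.join(out, "_SUCCESS"), "w").close()
print(has_part_files(out))   # -> False: no data files on this node

# Once a part file exists in the directory (e.g. after locating it on
# a worker and copying it over), the same check passes.
open(os.path.join(out, "part-00000-example.snappy.parquet"), "w").close()
print(has_part_files(out))   # -> True
```

On Kubernetes that would mean running the same listing inside each worker pod, e.g. via kubectl exec into the workers, to see whether the part files landed there.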


problem 2 :
The bin/pyspark shell also will not start inside the pod (container)!


I tested the above code with Docker Compose as well, using the Bitnami image, and got the same failure: no *.parquet file is created:

csv read success :

(screenshot: csvread)

parquet file creation failure :

docker-compose.yml :

version: '3.6'

services:

  spark:
    container_name: spark
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark   
    ports:
      - 127.0.0.1:8081:8080
    

  spark-worker:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=2
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
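
If the part files do turn out to land on the worker containers, one common workaround (a sketch under that assumption, not verified against this setup) is to bind-mount the same host directory into master and workers at an identical container path, so the driver and all executors read and write one shared location. For example, extending the services above:

```yaml
# Hypothetical addition: mount one host folder into both services at
# the same path, so output directories like a.parquet end up in one place.
  spark:
    volumes:
      - ./shared:/opt/bitnami/spark/shared
  spark-worker:
    volumes:
      - ./shared:/opt/bitnami/spark/shared
```

The job would then write to shared/a.parquet, and every container resolves that to the same host directory.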

docker-compose run :

docker-compose up --scale spark-worker=2

ctp.py :

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WritingParquet").getOrCreate()

df = spark.read.option("header", True).csv("csv/file.csv")

df.show()

df.write.mode('overwrite').parquet("a.parquet")

spark submit :

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://35368355157f:7077 csv/ctp.py
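A side note on the submit commands (both here and in the Kubernetes one above): --class names the JVM entry point and only applies to Java/Scala jars; for a .py application it has no effect, so it can be dropped. A sketch with the same master URL and path as above:

```shell
./bin/spark-submit \
  --master spark://35368355157f:7077 \
  csv/ctp.py
```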

I have opened an issue in the Bitnami GitHub repo too : link

please help me :joy:

I also tested the Python code saving a DataFrame to JSON format, but the result was the same problem I mentioned before:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WritingJson").getOrCreate()

df2 = spark.createDataFrame([(1, "Alice", 10),
                            (2, "Bob", 20),
                            (3, "Charlie", 30)], 
                            ["id", "name", "age"])


df2.show()

df2.write.mode('overwrite').json('file_name.json')

Please say something helpful.

With the Scala shell (spark-shell), everything is OK:

val df = spark.read.csv("csv/file.csv")

df.write.mode("overwrite").format("json").save("file_name.json")

(screenshot: jsonScala)

But with PySpark and spark-submit of the Python code, the output file is not found!

I tested the Java code for saving a DataFrame to JSON format as well, but the result was the same problem I mentioned before:

(screenshot: JavacsvreadSchema)

package arka;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ctjson {

	public static void main(String[] args) {

		SparkSession SPARK_SESSION = SparkSession.builder().appName("Mahla ctjson")
				.master("spark://6fe9e36ddaa9:7077")
				.getOrCreate();

		Dataset<Row> df = SPARK_SESSION.read().option("inferSchema", "true")
				.option("header", "true")
				.csv("csv/file.csv");

		df.show();

		df.printSchema();
		
		df.write().mode("overwrite").format("json").save("file_name.json");
		
	}
}

pom.xml :

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<groupId>com.mahla</groupId>
	<artifactId>arka</artifactId>
	<version>0.0.1-SNAPSHOT</version>
	<name>csvtojson</name>

	<dependencies>

		<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
		<dependency>
			<groupId>org.apache.spark</groupId>
			<artifactId>spark-core_2.12</artifactId>
			<version>3.5.1</version>
		</dependency>

		<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
		<dependency>
			<groupId>org.apache.spark</groupId>
			<artifactId>spark-sql_2.12</artifactId>
			<version>3.5.1</version>
			<scope>provided</scope>
		</dependency>
		
	</dependencies>

</project>

jar file :
ctj.zip

submit command :

./bin/spark-submit --class arka.ctjson --master spark://6fe9e36ddaa9:7077 csv/ctj.jar

Could you please look into this issue?