Problem with PySpark on Kubernetes via the Bitnami Helm chart

Hi everyone,
I have a problem using PySpark on a Spark cluster deployed to Kubernetes with the Bitnami Helm chart:
every job written in Java or Scala runs correctly, but some code run with PySpark does not produce correct results.

The process I used to create the Spark cluster is written up at this link: medium

Example Python code (ctp.py):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WritingParquet").getOrCreate()

df = spark.read.csv("/path/to/parquet/file.csv")

df.show()

df.write.parquet("a.parquet")

running bash command :

kubectl exec -it kayvan-release-spark-master-0 -- ./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077 \
  ctp.py

result & problem :thinking: :
df.show() -> shows the data correctly, and the a.parquet folder is created with _SUCCESS and _SUCCESS.crc files, but no part-*.parquet files exist, while the same code from the Scala shell works correctly!
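One thing worth checking (this is an assumption about standalone mode, not a confirmed diagnosis): each executor writes its own part-*.parquet files to its local filesystem, while the driver writes the _SUCCESS marker on commit, so the data files may exist on the worker pods rather than on the master where you are looking. A minimal, Spark-free sketch of that check to run on each node (`a.parquet` comes from the code above; the helper name is made up here):

```python
import glob
import os
import tempfile

def has_part_files(path):
    """Return True if a Spark output directory contains data files,
    not just the _SUCCESS / _SUCCESS.crc marker files."""
    return len(glob.glob(os.path.join(path, "part-*"))) > 0

# Simulate the failure mode described above: a directory holding only
# the marker file that the driver writes on job commit.
out = tempfile.mkdtemp()
open(os.path.join(out, "_SUCCESS"), "w").close()
print(has_part_files(out))   # -> False: no data files on this node

# Once a part file exists in the directory (e.g. after locating it on
# a worker and copying it over), the same check passes.
open(os.path.join(out, "part-00000-example.snappy.parquet"), "w").close()
print(has_part_files(out))   # -> True
```

On Kubernetes that would mean running the same listing inside each worker pod, e.g. via kubectl exec into the workers, to see whether the part files landed there.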


problem 2 :
The bin/pyspark shell also will not start inside the pod (container)!


I tested the above code with Docker Compose as well, using the Bitnami image, and got the same failure: no *.parquet file is created:

csv read success :

(screenshot: csvread)

parquet file creation failure :

docker-compose.yml :

version: '3.6'

services:

  spark:
    container_name: spark
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark   
    ports:
      - 127.0.0.1:8081:8080
    

  spark-worker:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=2
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
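
If the part files do turn out to land on the worker containers, one common workaround (a sketch under that assumption, not verified against this setup) is to bind-mount the same host directory into master and workers at an identical container path, so the driver and all executors read and write one shared location. For example, extending the services above:

```yaml
# Hypothetical addition: mount one host folder into both services at
# the same path, so output directories like a.parquet end up in one place.
  spark:
    volumes:
      - ./shared:/opt/bitnami/spark/shared
  spark-worker:
    volumes:
      - ./shared:/opt/bitnami/spark/shared
```

The job would then write to shared/a.parquet, and every container resolves that to the same host directory.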

docker-compose run :

docker-compose up --scale spark-worker=2

ctp.py :

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WritingParquet").getOrCreate()

df = spark.read.option("header", True).csv("csv/file.csv")

df.show()

df.write.mode('overwrite').parquet("a.parquet")

spark submit :

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://35368355157f:7077 csv/ctp.py
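A side note on the submit commands (both here and in the Kubernetes one above): --class names the JVM entry point and only applies to Java/Scala jars; for a .py application it has no effect, so it can be dropped. A sketch with the same master URL and path as above:

```shell
./bin/spark-submit \
  --master spark://35368355157f:7077 \
  csv/ctp.py
```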

I have opened an issue in the Bitnami GitHub repo too : link

please help me :joy:

I also tested the Python code saving a DataFrame to JSON format, but the result was the same problem I mentioned before:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WritingJson").getOrCreate()

df2 = spark.createDataFrame([(1, "Alice", 10),
                            (2, "Bob", 20),
                            (3, "Charlie", 30)], 
                            ["id", "name", "age"])


df2.show()

df2.write.mode('overwrite').json('file_name.json')

Please say something helpful.

With the Scala shell (spark-shell), everything is OK:

val df = spark.read.csv("csv/file.csv")

df.write.mode("overwrite").format("json").save("file_name.json")

(screenshot: jsonScala)

But with PySpark and spark-submit of the Python code, the output file is not found!

I tested the Java code for saving a DataFrame to JSON format as well, but the result was the same problem I mentioned before:

(screenshot: JavacsvreadSchema)

package arka;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ctjson {

	public static void main(String[] args) {

		SparkSession SPARK_SESSION = SparkSession.builder().appName("Mahla ctjson")
				.master("spark://6fe9e36ddaa9:7077")
				.getOrCreate();

		Dataset<Row> df = SPARK_SESSION.read().option("inferSchema", "true")
				.option("header", "true")
				.csv("csv/file.csv");

		df.show();

		df.printSchema();
		
		df.write().mode("overwrite").format("json").save("file_name.json");
		
	}
}

pom.xml :

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<groupId>com.mahla</groupId>
	<artifactId>arka</artifactId>
	<version>0.0.1-SNAPSHOT</version>
	<name>csvtojson</name>

	<dependencies>

		<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
		<dependency>
			<groupId>org.apache.spark</groupId>
			<artifactId>spark-core_2.12</artifactId>
			<version>3.5.1</version>
		</dependency>

		<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
		<dependency>
			<groupId>org.apache.spark</groupId>
			<artifactId>spark-sql_2.12</artifactId>
			<version>3.5.1</version>
			<scope>provided</scope>
		</dependency>
		
	</dependencies>

</project>

jar file :
ctj.zip

submit command :

./bin/spark-submit --class arka.ctjson --master spark://6fe9e36ddaa9:7077 csv/ctj.jar

Could you please look into this issue?