I’m trying to read a txt file from S3 with Spark, but I’m getting this error:
No FileSystem for scheme: s3
This is my code:
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("first")
sc = SparkContext(conf=conf)
data = sc.textFile("s3://"+AWS_ACCESS_KEY+":" + AWS_SECRET_KEY + "@/aaa/aaa/aaa.txt")

header = data.first()
This is the full traceback:
An error occurred while calling o25.partitions.
: java.io.IOException: No FileSystem for scheme: s3
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:61)
    at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:45)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
How can I fix this?
Answer
If you are using a local machine, you can use boto3:
import boto3

s3 = boto3.resource('s3')
# get a handle on the bucket that holds your file
bucket = s3.Bucket('yourBucket')
# get a handle on the object you want (i.e. your file)
obj = bucket.Object(key='yourFile.extension')
# get the object
response = obj.get()
# read the contents of the file and split it into a list of lines
lines = response['Body'].read().decode('utf-8').split('\n')
(do not forget to set up your AWS S3 credentials).
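For example, a minimal sketch of passing credentials to boto3 explicitly, reusing the AWS_ACCESS_KEY / AWS_SECRET_KEY placeholders from the question (boto3 will also pick credentials up from ~/.aws/credentials or the standard AWS environment variables if you omit these arguments):

import boto3

# AWS_ACCESS_KEY and AWS_SECRET_KEY are placeholders for your own values
s3 = boto3.resource(
    's3',
    aws_access_key_id=AWS_ACCESS_KEY,
    aws_secret_access_key=AWS_SECRET_KEY,
)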
Another clean solution, if you are using an AWS virtual machine (EC2), is to grant S3 permissions to your EC2 instance (for example through an IAM role) and launch pyspark with this command:
pyspark --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2
If you add other packages, make sure they follow the format groupId:artifactId:version and are separated by commas.
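Once the hadoop-aws package is on the classpath and the instance role grants S3 access, the code from the question only needs the s3a:// scheme instead of keys embedded in the URL. A minimal sketch, assuming the file from the question lives in a bucket called yourBucket:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("first")
sc = SparkContext(conf=conf)

# With hadoop-aws loaded, use the s3a:// scheme; credentials come from
# the EC2 instance role, so they are not embedded in the URL.
data = sc.textFile("s3a://yourBucket/aaa/aaa/aaa.txt")
header = data.first()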
If you are using pyspark from a Jupyter notebook, this will work:
import os
import pyspark

# PYSPARK_SUBMIT_ARGS must be set before the SparkContext is created,
# so that the extra packages are loaded when the JVM starts.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 pyspark-shell'

from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext()
sqlContext = SQLContext(sc)

filePath = "s3a://yourBucket/yourFile.parquet"
df = sqlContext.read.parquet(filePath)  # Parquet file read example
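If the notebook is not running on an instance with an attached IAM role, one common pattern is to hand the credentials to the s3a connector through the Hadoop configuration. A sketch continuing from the snippet above; the fs.s3a.* keys are the standard hadoop-aws 2.7.x properties, and AWS_ACCESS_KEY / AWS_SECRET_KEY / yourBucket are placeholders for your own values:

# Supply credentials to the s3a connector explicitly when no instance
# role is available (AWS_ACCESS_KEY / AWS_SECRET_KEY are placeholders).
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", AWS_ACCESS_KEY)
hadoop_conf.set("fs.s3a.secret.key", AWS_SECRET_KEY)

# The same setup also works for a plain text file, as in the question:
data = sc.textFile("s3a://yourBucket/aaa/aaa/aaa.txt")
header = data.first()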