The `parse_url` function always works fine when we work with Spark SQL through a SQL client (via the Thrift server), IPython, or the pyspark shell, but it does not work in spark-submit mode:
```shell
/opt/spark/bin/spark-submit --driver-memory 4G --executor-memory 8G main.py
```
The error is:
```
Traceback (most recent call last):
  File "/home/spark/***/main.py", line 167, in <module>
    )v on registrations.ga = v.ga and reg_path = oldtrack_page and registration_day = day_cl_log and date_cl_log <= registration_date""")
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 552, in sql
  File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 40, in deco
pyspark.sql.utils.AnalysisException: undefined function parse_url;
Build step 'Execute shell' marked build as failure
Finished: FAILURE
```
So for now we are using this workaround:
```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs  # Python 2


def python_parse_url(url, que, key):
    # Extract the requested part of the URL, similar to Hive's parse_url
    ians = None
    if que == "QUERY":
        ians = parse_qs(urlparse(url).query)[key][0]
    elif que == "HOST":
        ians = urlparse(url).hostname
    elif que == "PATH":
        ians = urlparse(url).path
    return ians


def dc_python_parse_url(url, que, key):
    # Return None instead of failing on malformed URLs or missing keys
    ians = None
    try:
        ians = python_parse_url(url, que, key)
    except Exception:
        pass
    return ians


# sqlCtx is the SQLContext already created in the script
sqlCtx.registerFunction('my_parse_url', dc_python_parse_url)
```
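The parsing logic of this workaround can be checked locally without Spark. The following is a minimal sketch under a few assumptions: it targets Python 3 (`urllib.parse` instead of the Python 2 `urlparse` module), and the name `parse_url_part` and the optional `key` argument are hypothetical and not part of the code above. Unlike the wrapper above, a missing query key yields `None` directly instead of through a swallowed exception.

```python
from urllib.parse import urlparse, parse_qs


def parse_url_part(url, part, key=None):
    """Return the HOST, PATH or QUERY part of a URL, or None if unavailable.

    Hypothetical helper for illustration; it only covers the three parts
    used in the workaround.
    """
    parsed = urlparse(url)
    if part == "HOST":
        return parsed.hostname
    if part == "PATH":
        return parsed.path
    if part == "QUERY":
        if key is None:
            # No key requested: return the whole query string
            return parsed.query or None
        values = parse_qs(parsed.query).get(key)
        return values[0] if values else None
    return None


url = "http://example.com/foo/bar?foo=bar"
print(parse_url_part(url, "HOST"))          # example.com
print(parse_url_part(url, "PATH"))          # /foo/bar
print(parse_url_part(url, "QUERY", "foo"))  # bar
print(parse_url_part(url, "QUERY", "baz"))  # None
```

A function like this could be registered with `registerFunction` in the same way as `dc_python_parse_url`, though a Python UDF will likely be slower than the built-in Hive function.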
Can anyone help with this issue?
Answer
Spark >= 2.0
Same as below, but use `SparkSession` with Hive support enabled:
```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
```
Spark < 2.0
`parse_url` is not a classic SQL function. It is a Hive UDF and as such requires a `HiveContext` to work:
```python
from pyspark import SparkContext
from pyspark.sql import HiveContext, SQLContext

sc = SparkContext()

sqlContext = SQLContext(sc)
hiveContext = HiveContext(sc)

query = """SELECT parse_url('http://example.com/foo/bar?foo=bar', 'HOST')"""

# A plain SQLContext does not know Hive UDFs
sqlContext.sql(query)
## Py4JJavaError Traceback (most recent call last)
## ...
## AnalysisException: 'undefined function parse_url;'

# A HiveContext resolves parse_url
hiveContext.sql(query)
## DataFrame[_c0: string]
```