I’m trying to run a transformation function in a PySpark script:
```python
from pyspark.sql.functions import array, col, explode, lit, struct

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "dev", table_name = "test_csv", transformation_ctx = "datasource0")

dataframe = datasource0.toDF()

def to_long(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"

    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")

    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])

#to_long(df, ["A"])

df = to_long(dataframe, ["Name","Type"])
```
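For reference, the reshape that `to_long` performs (a wide-to-long melt: every non-id column becomes its own key/value row) can be sketched in plain Python, independent of Spark. The function and variable names below are illustrative, not part of the original script:

```python
def melt(rows, id_cols):
    """Turn each non-id column of every row into its own (key, val) row."""
    out = []
    for row in rows:
        for key, val in row.items():
            if key not in id_cols:
                new_row = {c: row[c] for c in id_cols}  # keep the id columns
                new_row["key"] = key                    # former column name
                new_row["val"] = val                    # former cell value
                out.append(new_row)
    return out

# Toy version of the sample dataset from the question
rows = [
    {"Name": "ALZA CZ", "01/01(FRI)": 0, "01/02(SAT)": 0},
    {"Name": "CLPA CZ", "01/01(FRI)": 1, "01/02(SAT)": 5},
]
long_rows = melt(rows, ["Name"])
# Each input row yields one output row per date column,
# e.g. {"Name": "ALZA CZ", "key": "01/01(FRI)", "val": 0}
```

This is the same shape of result the Spark version produces via `explode(array(struct(...)))`, just computed row by row in memory.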
My dataset looks like this:
```text
Name   |01/01(FRI)|01/02(SAT)|
ALZA CZ|    0     |    0
CLPA CZ|    1     |    5
```
My desired output is something like this:
```text
Name   |Type | Date       |Value |
ALZA CZ|New  | 01/01(FRI) | 0
CLPA CZ|New  | 01/01(FRI) | 1
ALZA CZ|Old  | 01/02(SAT) | 1
CLPA CZ|Old  | 01/02(SAT) | 5
```
However, the last code line gives me an error similar to this:
```text
AnalysisException: Cannot resolve 'Name' given input columns 'col10'
```
When I check:
```python
df.show()
```
I see `col1`, `col2`, etc. in the first row instead of the actual labels (`["Name", "Type"]`). Should I separately remove and then re-add the original column titles?
Answer
It seems that your metadata table was configured using the built-in CSV classifier. If this classifier isn't able to detect a header, it names the columns col1, col2, etc.
Your problem lies one stage before your ETL job, so in my opinion you shouldn't remove and re-add the original column titles; instead, fix your data import / schema detection by using a custom classifier.
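If you go the custom-classifier route, one way to register it is via the boto3 Glue API's `create_classifier` call. A minimal sketch, assuming the classifier name and the header list (taken from the sample data) are appropriate for your table:

```python
# Sketch: define a custom CSV classifier so the Glue crawler treats the
# first row as a header instead of generating col1, col2, ...
# The classifier name and the Header list below are assumptions.
classifier_params = {
    "CsvClassifier": {
        "Name": "test-csv-with-header",   # hypothetical classifier name
        "Delimiter": ",",
        "ContainsHeader": "PRESENT",      # tell Glue a header row exists
        "Header": ["Name", "01/01(FRI)", "01/02(SAT)"],  # from the sample data
    }
}

# With AWS credentials configured, you would then call:
# import boto3
# boto3.client("glue").create_classifier(**classifier_params)
```

After attaching the classifier to the crawler and re-running it, the catalog table should expose the real column names, and `to_long(dataframe, ["Name", "Type"])` should resolve `Name` instead of failing.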