
As far as I know, a Spark DataFrame can have multiple columns with the same name, as in the snapshot below:

[
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=125231, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0047, 3: 0.0, 4: 0.0043})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=145831, f=SparseVector(5, {0: 0.0, 1: 0.2356, 2: 0.0036, 3: 0.0, 4: 0.4132})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=147031, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=149231, f=SparseVector(5, {0: 0.0, 1: 0.0032, 2: 0.2451, 3: 0.0, 4: 0.0042}))
]


The result above was created by joining a DataFrame with itself; you can see there are four columns: two named a and two named f.

The problem is that when I try to do further calculations with the a column, I can't find a way to select it. I have tried df[0] and df.select('a'), and both returned the error message below:

AnalysisException: Reference 'a' is ambiguous, could be: a#1333L, a#1335L.


Is there any way in the Spark API to tell the duplicated column names apart again? Or maybe some way to let me change the column names?

1 Answer


Let me start with some sample data:

from pyspark.mllib.linalg import SparseVector
from pyspark.sql import Row

df1 = sqlContext.createDataFrame([
    Row(a=107831, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
    Row(a=125231, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0047, 3: 0.0, 4: 0.0043})),
])

df2 = sqlContext.createDataFrame([
    Row(a=107831, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
    Row(a=107831, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
])
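
For reference, joining these two DataFrames directly reproduces the duplicated column names from the question (a small sketch; the column list in the comment is what this schema produces):

joined = df1.join(df2, df1['a'] == df2['a'])
joined.columns
## ['a', 'f', 'a', 'f']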

There are a few ways you can solve this problem.

You can unambiguously reference a column through its parent DataFrame:

df1.join(df2, df1['a'] == df2['a']).select(df1['f']).show(2)

##  +--------------------+
##  |                   f|
##  +--------------------+
##  |(5,[0,1,2,3,4],[0...|
##  |(5,[0,1,2,3,4],[0...|
##  +--------------------+

You can also use table aliases:

from pyspark.sql.functions import col

df1_a = df1.alias("df1_a")
df2_a = df2.alias("df2_a")

df1_a.join(df2_a, col('df1_a.a') == col('df2_a.a')).select('df1_a.f').show(2)

##  +--------------------+
##  |                   f|
##  +--------------------+
##  |(5,[0,1,2,3,4],[0...|
##  |(5,[0,1,2,3,4],[0...|
##  +--------------------+
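
The alias route is also the one that works for a literal self-join, which is the question's scenario: parent references such as df1['a'] == df1['a'] compare a column with itself and cannot disambiguate anything. A minimal sketch, with 'l' and 'r' as arbitrary alias names:

left = df1.alias('l')
right = df1.alias('r')
left.join(right, col('l.a') == col('r.a')).select(col('l.f'), col('r.f')).show(2)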

Finally, you can programmatically rename the columns:

df1_r = df1.select(*(col(x).alias(x + '_df1') for x in df1.columns))
df2_r = df2.select(*(col(x).alias(x + '_df2') for x in df2.columns))

df1_r.join(df2_r, col('a_df1') == col('a_df2')).select(col('f_df1')).show(2)

## +--------------------+
## |               f_df1|
## +--------------------+
## |(5,[0,1,2,3,4],[0...|
## |(5,[0,1,2,3,4],[0...|
## +--------------------+
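
One more option worth mentioning, beyond the three above: for an equi-join you can pass the join column's name directly, and Spark keeps a single copy of that key in the result; renaming the remaining clashes up front covers the rest. A sketch (f2 is just an illustrative name):

## Joining on the column name deduplicates the join key only
df1.join(df2, 'a').columns
## ['a', 'f', 'f']

## Rename the clashing column before joining
df1.join(df2.withColumnRenamed('f', 'f2'), 'a').columns
## ['a', 'f', 'f2']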
