Helpful Functions

General Functions

The following are some general functions which are quite similar to methods of pandas DataFrames (a short usage sketch follows the list):

  • select(): returns a new DataFrame with the selected columns

  • filter(): filters rows using the given condition

  • where(): is just an alias for filter()

  • groupBy(): groups the DataFrame using the specified columns, so we can run aggregation on them

  • sort(): returns a new DataFrame sorted by the specified column(s). By default the second parameter 'ascending' is True.

  • dropDuplicates(): returns a new DataFrame with unique rows based on all or just a subset of columns

  • withColumn(): returns a new DataFrame by adding a column or replacing the existing column that has the same name. The first parameter is the name of the new column, the second is an expression of how to compute it.
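
As a rough sketch, suppose we have a hypothetical DataFrame df of user log data with columns "userId", "page", and "length" (all names here are illustrative, not from a real dataset). The functions above could then be used like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('GeneralFunctions').getOrCreate()

# Hypothetical log data
df = spark.createDataFrame(
    [("1", "Home", 10.5), ("2", "NextSong", 250.0), ("1", "NextSong", 200.0)],
    ["userId", "page", "length"])

df.select("userId", "page").show()                       # keep only two columns
df.filter(df.page == "NextSong").show()                  # rows matching a condition
df.where(col("length") > 100).show()                     # where() is an alias for filter()
df.groupBy("userId").count().show()                      # group, then aggregate
df.sort("length", ascending=False).show()                # sort descending
df.dropDuplicates(["userId"]).show()                     # unique rows by a subset of columns
df.withColumn("length_min", col("length") / 60).show()   # add a computed column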

Aggregate Functions

Spark SQL provides built-in methods for the most common aggregations via the pyspark.sql.functions module. Some common aggregation functions are:

  • count()

  • countDistinct()

  • avg()

  • max()

  • min()

In many cases, there are multiple ways to express the same aggregation. For example, to compute one type of aggregate over one or more columns of the DataFrame, we can simply chain the aggregate method after a groupBy().

Also, if one would like to use different functions on different columns, agg() comes in handy. For example, agg({"salary": "avg", "age": "max"}) computes the average salary and the maximum age. Both patterns are sketched below.
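
As a minimal sketch, assuming a hypothetical DataFrame people with columns "dept", "salary", and "age" (illustrative names only), both patterns look like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, countDistinct

spark = SparkSession.builder.appName('AggregateFunctions').getOrCreate()

# Hypothetical employee data
people = spark.createDataFrame(
    [("Sales", 3000, 34), ("Sales", 4600, 45), ("HR", 4100, 29)],
    ["dept", "salary", "age"])

# One aggregate chained directly after groupBy()
people.groupBy("dept").avg("salary").show()

# Different functions on different columns via agg()
people.groupBy("dept").agg({"salary": "avg", "age": "max"}).show()

# agg() also accepts column expressions from pyspark.sql.functions
people.agg(countDistinct("dept"), avg("salary")).show()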

User Defined Functions (UDF)

In Spark SQL we can define our own functions with the udf method from the pyspark.sql.functions module.

The default return type of a UDF is string. If we would like to return another type, we need to declare it explicitly using the types from the pyspark.sql.types module.

The following example illustrates the usage of udf with Spark DataFrames.

First, we create a PySpark DataFrame as shown below:

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName('SparkByExamples').getOrCreate()

# A small example DataFrame with a sequence number and a name column
columns = ["Seqno", "Name"]
data = [("1", "john jones"), ("2", "tracey smith"), ("3", "amy sanders")]

df = spark.createDataFrame(data=data, schema=columns)
df.show(truncate=False)

Output:

+-----+------------+
|Seqno|Name        |
+-----+------------+
|1    |john jones  |
|2    |tracey smith|
|3    |amy sanders |
+-----+------------+

Create a Python Function

Next, we create a Python function convertCase which takes a string parameter and capitalizes the first letter of every word.
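
A minimal sketch of such a function, registered as a UDF with an explicit StringType return type and applied to the Name column (reusing the imports and the DataFrame df from the example above), could look like this:

# Sketch: capitalize the first letter of every word in a string
def convertCase(text):
    if text is None:
        return None
    return " ".join(word[:1].upper() + word[1:] for word in text.split(" "))

# Wrap the Python function as a Spark UDF; the return type is declared explicitly
convertCaseUDF = udf(convertCase, StringType())

df.select(col("Seqno"), convertCaseUDF(col("Name")).alias("Name")).show(truncate=False)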
