Helpful Functions
General Functions
The following general functions are quite similar to methods of pandas DataFrames:
select()
: returns a new DataFrame with the selected columns
filter()
: filters rows using the given condition
where()
: is just an alias for filter()
groupBy()
: groups the DataFrame using the specified columns, so we can run aggregations on them
sort()
: returns a new DataFrame sorted by the specified column(s). By default, the keyword argument ascending is True.
dropDuplicates()
: returns a new DataFrame with unique rows based on all or just a subset of columns
withColumn()
: returns a new DataFrame by adding a column or replacing the existing column that has the same name. The first parameter is the name of the new column; the second is a Column expression that computes it.
Aggregate Functions
Spark SQL provides built-in methods for the most common aggregations via the pyspark.sql.functions module. Some common aggregation functions are:
count()
countDistinct()
avg()
max()
min()
In many cases, there are multiple ways to express the same aggregation. For example, to compute one type of aggregate for one or more columns of the DataFrame, we can simply chain the aggregate method after groupBy().
To apply different functions to different columns, agg() comes in handy. For example, agg({"salary": "avg", "age": "max"}) computes the average salary and the maximum age.
User Defined Functions (UDF)
In Spark SQL we can define our own functions with the udf method from the pyspark.sql.functions module.
The default return type of a UDF is string (StringType). To return another type, we need to declare it explicitly using one of the types from the pyspark.sql.types module.
The following example illustrates the usage of udf with Spark DataFrames.
First, we create a PySpark DataFrame like the one below:
Output:
Create a Python Function
Next, we create a Python function convertCase, which takes a string parameter and capitalizes the first letter of every word.
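One possible way to write such a function is sketched below. Splitting on single spaces and passing None through unchanged are assumptions of this sketch; the description above does not say how null values or repeated spaces should be handled.

```python
def convertCase(text):
    """Capitalize the first letter of every space-separated word."""
    # Assumption: null values are passed through unchanged.
    if text is None:
        return None
    # Uppercase only the first character and leave the rest of each word as is.
    return " ".join(word[:1].upper() + word[1:] for word in text.split(" "))


print(convertCase("john jones"))
```

Because this is an ordinary Python function, it can be tested on plain strings before it is wrapped with udf and applied to a DataFrame column.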