Working with Semi-Structured JSON Data in Databricks

Tips, Tricks & Methods

Krishna yogi
6 min read · Apr 19, 2024

Semi-structured JSON data presents a unique challenge for data analysis. Its flexibility can be a boon, but its lack of a rigid schema can make it hard to work with. Databricks, a powerful Apache Spark-based platform, offers tools to conquer these challenges and unlock the insights hidden within your JSON files.

Understanding Semi-Structured JSON

JSON (JavaScript Object Notation) is a popular data interchange format known for its human-readable nature and flexibility. Unlike structured data with a predefined schema, semi-structured JSON allows for variations in field names, data types, and nesting levels. This adaptability comes at the cost of complexity when working with it programmatically.

Databricks to the Rescue!

Databricks provides several mechanisms to tackle semi-structured JSON:

Auto Loader: This feature incrementally ingests JSON files into Delta Lake, Databricks’ lakehouse storage layer. Auto Loader can infer the schema of incoming files and handle schema evolution as new fields appear, keeping your tables up to date.
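For example, a minimal Auto Loader stream might look like the following sketch; the input path, schema/checkpoint locations, and target table name are placeholders, not values from this article:

# Minimal Auto Loader sketch; all paths and the table name below are placeholders
raw_json_stream = (
    spark.readStream
        .format("cloudFiles")                                 # Auto Loader source
        .option("cloudFiles.format", "json")                  # ingest JSON files
        .option("cloudFiles.schemaLocation", "/tmp/schemas")  # where the inferred schema and its evolution are tracked
        .load("/path/to/json/landing/zone")
)

query = (
    raw_json_stream.writeStream
        .option("checkpointLocation", "/tmp/checkpoints")     # progress tracking so ingestion can resume
        .trigger(availableNow=True)                           # process the files currently available, then stop
        .toTable("bronze_json_events")                        # write into a Delta table
)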

SQL Functions for JSON Processing: Databricks SQL offers a rich set of functions to query and manipulate JSON data directly. Let’s delve into some key functions (a short PySpark sketch follows the list):

  • get_json_object(json_string, path): Extracts a specific value from a nested JSON structure based on the provided path.
  • explode(array_column): Transforms an array column into separate rows, enabling analysis of individual elements.
  • struct(...): Creates a structured column from multiple columns.
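To make these concrete, here is a short PySpark sketch exercising the same functions; the sample rows and column names are invented for illustration (these functions are also available directly in SQL):

from pyspark.sql.functions import get_json_object, explode, struct

# Toy rows: a JSON string column plus an ordinary array column (invented data)
df = spark.createDataFrame(
    [("1", '{"name": "John", "city": "New York"}', ["vip", "beta"])],
    ["id", "raw_json", "tags"],
)

# get_json_object: extract one value from the JSON string by JSONPath
df.select(get_json_object("raw_json", "$.name").alias("name")).show()

# explode: turn each element of the array column into its own row
df.select("id", explode("tags").alias("tag")).show()

# struct: pack several columns into a single struct column
df.select(struct("id", "tags").alias("packed")).show(truncate=False)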

PySpark for Programmatic Workflows: For more granular control and complex transformations, PySpark shines. You can leverage PySpark’s DataFrame API and JSON-specific functions like spark.read.json() and to_json() to process and analyze your data.
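As a quick illustration of to_json(), the sketch below serializes a DataFrame’s columns back into a JSON string column; the tiny DataFrame here is invented, and in practice it would come from spark.read.json():

from pyspark.sql.functions import to_json, struct

# Invented two-column DataFrame standing in for data loaded with spark.read.json()
df = spark.createDataFrame([(1, "Alice")], ["id", "name"])

# Pack all columns into a struct, then serialize each row to a JSON string
df.select(to_json(struct(*df.columns)).alias("json_payload")).show(truncate=False)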

Getting Started with Databricks:

To begin working with JSON data in Databricks, you’ll need access to a Databricks workspace. If you don’t have an account, sign up for Databricks and create a new workspace.

Loading JSON Data into Databricks:

Databricks provides seamless integration with various data sources, including JSON files. You can easily load JSON data into Databricks using Spark’s DataFrame API. Here’s an example of how to load JSON data from a file stored in DBFS (Databricks File System):

# Load JSON data into a DataFrame
json_df = spark.read.json("/path/to/json/file")

Exploring and Querying JSON Data:

Once you’ve loaded JSON data into a DataFrame, you can explore its structure and query its contents using Spark SQL. Here are some common operations:

Displaying the schema of the DataFrame:

json_df.printSchema()

Displaying the first few rows of the DataFrame:

json_df.show()

Running SQL queries on the DataFrame:

json_df.createOrReplaceTempView("json_data")
spark.sql("SELECT * FROM json_data WHERE ...")

Working with Nested JSON Structures:

JSON data often contains nested structures, such as arrays of objects or nested objects within objects. Databricks provides powerful functions for working with nested JSON data. Here’s an example of how to flatten a nested JSON structure into a tabular format:

from pyspark.sql.functions import explode
# Flatten a nested array column into one row per element
# (col1, col2, and nested_array_col are placeholder column names)
flattened_df = json_df.select("col1", "col2", explode("nested_array_col").alias("nested_col"))

Writing JSON Data from Databricks:

After processing and analyzing JSON data in Databricks, you may want to write the results back to JSON files or other formats. Databricks makes it easy to write DataFrame contents to various output formats, including JSON. Here’s an example:

# Write DataFrame to JSON file
json_df.write.json("/path/to/output/json_file")

Code Examples

Example 1:

Let’s create a table customer_info with a JSON column raw_data containing customer information:

Extracting Top-Level Columns:

To extract top-level fields from a JSON string column, we’ll use the : operator, which is Databricks SQL’s syntax for querying semi-structured data stored as strings.

CREATE TABLE customer_info AS
SELECT
'{"name": "John", "age": 30, "city": "New York"}' AS raw_data

Now, let’s extract the name and city columns from the JSON data:

SELECT
  raw_data:name AS customer_name,
  raw_data:city AS customer_city
FROM
  customer_info

Explanation:

  • In the CREATE TABLE statement, we define a table customer_info with a single column raw_data containing JSON strings.
  • We then use the SELECT statement to extract the name and city fields from the raw_data column using the : operator.
  • The AS keyword is used to alias the extracted fields as customer_name and customer_city for clarity in the output.

Extracting Nested Fields:

Let’s consider a JSON structure representing employee data. We’ll create a table employee_data with nested JSON data:

CREATE TABLE employee_data AS
SELECT
'{
  "employee": {
    "name": "Alice",
    "department": {
      "name": "Engineering",
      "location": "San Francisco"
    }
  }
}' AS raw_data

Now, let’s extract the department name and location from the nested JSON:

SELECT
  raw_data:employee.department.name AS department_name,
  raw_data:employee.department.location AS department_location
FROM
  employee_data

Explanation:

  • We define a table employee_data with a single column raw_data containing nested JSON strings representing employee information.
  • Using the SELECT statement, we extract the nested fields name and location of the department object, using the : operator to enter the JSON string and dot notation to navigate the nested structure.
  • We alias the extracted fields as department_name and department_location for clarity in the output.

Extracting Values from Arrays:

Consider JSON data representing a list of products. We create a table product_data with JSON arrays:

CREATE TABLE product_data AS
SELECT
'{
  "products": [
    {"id": 1, "name": "Laptop"},
    {"id": 2, "name": "Smartphone"}
  ]
}' AS raw_data

Now, let’s extract the name of the first product from the array:

SELECT
  raw_data:products[0].name AS first_product_name
FROM
  product_data

Explanation:

  • We define a table product_data with a single column raw_data containing JSON strings representing product information.
  • Using the SELECT statement, we extract the name field of the first product in the products array using the [0] index.
  • We alias the extracted field as first_product_name for clarity in the output.

Example 2:

from pyspark.sql import Row
from pyspark.sql.functions import explode, sum as spark_sum

# Sample data (replace with your actual data); nested Rows become struct columns
json_data = [
    Row(id=1, name='Alice', orders=[Row(product='Laptop', price=1200)]),
    Row(id=2, name='Bob', orders=[Row(product='Headphones', price=100), Row(product='Keyboard', price=50)])
]

# Create a Spark DataFrame
df = spark.createDataFrame(json_data)

# Extract top-level fields
df.select("id", "name").show()

# Access nested data using dot notation (one array of values per customer)
df.select("id", "orders.product", "orders.price").show()

# Explode the orders array into one row per order
df_exploded = df.withColumn("order", explode("orders"))
df_exploded.select("id", "order.product", "order.price").show()

# Filter and aggregate data based on nested values
df_filtered = df_exploded.filter(df_exploded["order.product"] == "Laptop")
df_filtered.groupBy("id").agg(spark_sum("order.price").alias("total_spent")).show()

Explanation:

  • Sample Data: This section defines a list of Row objects representing customer records with id, name, and orders (each order holding product and price information); nested Rows become struct columns in the DataFrame.
  • Creating a Spark DataFrame: It converts the list into a Spark DataFrame, a distributed tabular data structure for efficient analysis.
  • Extracting Top-Level Fields: This selects and displays only the id and name columns from the DataFrame.
  • Accessing Nested Data (Dot Notation): It demonstrates how to access data within nested structures using dot notation (e.g., orders.product) to retrieve product names and prices.
  • Exploding the Orders Array: This step expands the orders array (which can have multiple orders per customer) into separate rows using the explode function. This creates a new column named order with each order as a distinct row.
  • Selecting Order Details: It selects specific columns (id, product name, and price) from the exploded DataFrame, providing one row of detail per individual order.
  • Filtering and Aggregating: This performs more complex analysis. It filters for “Laptop” orders, groups by customer id, calculates the total spent on laptops for each customer, and displays the results.

Optimizations and Best Practices

  • Leverage Schema on Read: If your JSON data has a relatively consistent structure, consider supplying an explicit schema when reading, e.g. spark.read.json(data_path, schema). This skips schema inference and improves performance and type safety (see the sketch after this list).
  • Flatten for Complex Queries: For complex queries involving nested data, flattening the JSON first is often more efficient. Use functions like explode together with dot notation in select, as shown earlier, to turn nested structures into plain columns.
  • Partitioning: Partition your data based on frequently accessed fields to optimize query performance when working with large datasets.
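Combining the first and last points, here is a hedged sketch of reading with an explicit schema and writing partitioned output; the field names, partition column, and output path are assumptions made for the example:

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Explicit schema; these field names are assumptions, adjust them to your JSON layout
event_schema = StructType([
    StructField("event_id", LongType()),
    StructField("event_type", StringType()),
    StructField("event_date", StringType()),
])

# Schema on read: skips inference and enforces the expected types
events = spark.read.json("/path/to/json/file", schema=event_schema)

# Partition the output by a frequently filtered field to prune files at query time
events.write.partitionBy("event_date").mode("overwrite").json("/path/to/output/partitioned_json")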

10 Powerful Features for Wrangling Semi-structured JSON in Databricks Lakehouse

While semi-structured JSON offers flexibility, it can be challenging to manage. Databricks Lakehouse provides a powerful toolkit to simplify this process:

  1. Automatic Schema Inference: Databricks intelligently guesses data types (integers, strings, etc.) during schema creation, saving you manual effort.
  2. Schema on Read: Want more control? Specify desired data types during reading to fine-tune your schema for better analysis.
  3. Schema Evolution Made Easy: No worries if your JSON data structure changes over time. Databricks adapts the schema automatically to keep your data organized.
  4. No Data Left Behind: Encounter malformed data? Databricks captures it in a separate column, ensuring nothing gets lost and allowing for later correction.
  5. Streamlined Ingestion: Need to bring in massive amounts of JSON data? Databricks offers options for both one-time and scheduled data ingestion, keeping your lakehouse up-to-date.
  6. Checkpoint Control: Track the progress of your data ingestion process. Databricks checkpoints allow you to resume from specific points if needed, ensuring efficient data loading.
  7. Effortless Top-Level Access: Directly access top-level fields of JSON strings with the : operator, simplifying data retrieval.
  8. Dot Notation: Once inside the JSON, dot notation lets you walk nested objects intuitively (for example, raw_data:employee.department.name).
  9. Data Type Transformation: Need to convert values to specific data types for analysis? Databricks’ casting functionality allows for seamless data type manipulation.
  10. Unlocking Arrays and Structs: Access elements within JSON arrays and nested structs efficiently using Databricks’ built-in features, empowering you to analyze intricate data structures (a short sketch of these last two points follows this list).
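To make the last two points concrete, here is a hedged PySpark sketch of casting and of reaching into arrays and structs; the column names and rows are invented for the example:

from pyspark.sql import Row
from pyspark.sql.functions import col

# Invented example: a price stored as a string plus an array of item structs
df = spark.createDataFrame([
    Row(id=1, price_text="19.99", items=[Row(product="Laptop", price=1200), Row(product="Mouse", price=25)]),
])

# Casting: convert the string price into a numeric column for analysis
df = df.withColumn("price", col("price_text").cast("double"))

# Arrays and structs: index into the array, then read a struct field
df.select(
    col("items")[0]["product"].alias("first_product"),
    col("items")[0]["price"].alias("first_price"),
).show()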

Conclusion

Databricks empowers you to harness the power of semi-structured JSON data. By understanding the core concepts, utilizing Databricks’ built-in tools (Auto Loader, SQL functions), and following best practices, you can streamline your data-wrangling process and unlock valuable insights from your JSON files.
