Building a JSON Parser from Scratch

A Deep Dive into Parser Development


Have you ever used JSON.parse() and wondered about the magic happening behind the scenes? I recently embarked on a journey to build a JSON parser from scratch, and what I discovered was both challenging and fascinating. In this post, I'll walk you through the entire process of building a parser, from tokenization to parsing, and share the valuable lessons learned along the way.

Why Build Your Own Parser?

"Why reinvent the wheel?" you might ask. While it's true that we have excellent JSON parsers available, building one from scratch offers invaluable insights into:

  • How parsers and compilers work

  • The intricacies of handling text processing

  • Error handling in complex systems

  • The JSON specification itself

Plus, it's just plain fun to build something from the ground up!

Understanding the Architecture

Before diving into the code, let's understand what we're building. Our JSON parser consists of three main components:

  1. Tokenizer: Breaks raw JSON text into tokens

  2. Parser: Converts tokens into data structures

  3. Error Handler: Provides meaningful error messages

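Data flows through these components in order: raw JSON text goes into the tokenizer, the resulting tokens go into the parser, and the error handler reports problems found at either stage.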

The Tokenizer: The First Step

The tokenizer is like a skilled reader who breaks down text into meaningful chunks. Let's see how it handles a simple JSON string:

{
  "name": "John",
  "age": 25,
  "hobbies": ["coding", "reading"]
}

How Tokenization Works

The tokenizer reads this character by character, producing a stream of tokens. Here's the process:

  1. Character Recognition:
def scan_token(self):
    char = self.advance()

    if char == '{': 
        self.add_token(TokenType.LEFT_BRACE)
    elif char == '"':
        self.string()  # Handle string separately
    elif char.isdigit() or char == '-':
        self.number()  # Handle number separately
    # ... handle other characters
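  2. String Handling:
def string(self):
    string_content = []
    while True:
        # NOTE: an end-of-input check is also needed here to reject unterminated strings
        char = self.peek()
        if char == '"':  # End of string
            break
        if char == '\\':  # Handle escape sequences
            self.advance()  # Skip the backslash
            escape_char = self.advance()
            # Simple escapes only ('/' is also valid JSON); \uXXXX is not shown in this excerpt
            if escape_char and escape_char in '"\\/bfnrt':
                string_content.append('\\' + escape_char)
                continue  # Escape fully consumed; do not append the backslash again
            raise TokenizerError("Invalid escape sequence")
        string_content.append(char)
        self.advance()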
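  3. Number Processing:
def number(self):
    # scan_token() has already consumed the first character (a digit or '-')
    # Handle integers
    while self.peek().isdigit():
        self.advance()

    # Handle decimals: JSON requires at least one digit after the '.'
    if self.peek() == '.':
        self.advance()
        if not self.peek().isdigit():
            raise TokenizerError("Invalid number format")
        while self.peek().isdigit():
            self.advance()

    # Handle scientific notation: the exponent also needs at least one digit
    if self.peek() in ('e', 'E'):
        self.advance()
        if self.peek() in ('+', '-'):
            self.advance()
        if not self.peek().isdigit():
            raise TokenizerError("Invalid number format")
        while self.peek().isdigit():
            self.advance()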
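
To see the three steps above working together, here is a minimal, self-contained sketch of a tokenizer. It is not the project's actual code: the function names `tokenize`, `scan_number`, and `skip_digits`, the `Token` dataclass, and the exact `TokenType` members are assumptions made for illustration, and it works on string offsets rather than `advance()`/`peek()` to stay short. Strings are kept raw, without decoding escapes.

```python
from dataclasses import dataclass
from enum import Enum, auto


class TokenType(Enum):
    LEFT_BRACE = auto()
    RIGHT_BRACE = auto()
    LEFT_BRACKET = auto()
    RIGHT_BRACKET = auto()
    COLON = auto()
    COMMA = auto()
    STRING = auto()
    NUMBER = auto()
    TRUE = auto()
    FALSE = auto()
    NULL = auto()


@dataclass
class Token:
    type: TokenType
    value: object = None


class TokenizerError(Exception):
    pass


PUNCTUATION = {
    '{': TokenType.LEFT_BRACE, '}': TokenType.RIGHT_BRACE,
    '[': TokenType.LEFT_BRACKET, ']': TokenType.RIGHT_BRACKET,
    ':': TokenType.COLON, ',': TokenType.COMMA,
}
KEYWORDS = {
    'true': (TokenType.TRUE, True),
    'false': (TokenType.FALSE, False),
    'null': (TokenType.NULL, None),
}
DIGITS = '0123456789'


def skip_digits(text, i):
    """Return the index just past a run of one or more digits starting at i."""
    start = i
    while i < len(text) and text[i] in DIGITS:
        i += 1
    if i == start:
        raise TokenizerError(f"Invalid number format at offset {start}")
    return i


def scan_number(text, i):
    """Return (value, end_index) for the JSON number starting at text[i]."""
    start = i
    if text[i] == '-':
        i += 1
    int_start = i
    i = skip_digits(text, i)
    if text[int_start] == '0' and i - int_start > 1:
        raise TokenizerError(f"Leading zero at offset {int_start}")
    is_float = False
    if i < len(text) and text[i] == '.':      # fraction needs at least one digit
        i = skip_digits(text, i + 1)
        is_float = True
    if i < len(text) and text[i] in 'eE':     # exponent needs at least one digit
        i += 1
        if i < len(text) and text[i] in '+-':
            i += 1
        i = skip_digits(text, i)
        is_float = True
    literal = text[start:i]
    return (float(literal) if is_float else int(literal)), i


def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        char = text[i]
        if char in ' \t\r\n':                 # insignificant whitespace
            i += 1
        elif char in PUNCTUATION:
            tokens.append(Token(PUNCTUATION[char]))
            i += 1
        elif char == '"':
            end = i + 1
            while end < len(text) and text[end] != '"':
                end += 2 if text[end] == '\\' else 1   # step over escaped characters
            if end >= len(text):
                raise TokenizerError(f"Unterminated string at offset {i}")
            tokens.append(Token(TokenType.STRING, text[i + 1:end]))  # raw, escapes not decoded
            i = end + 1
        elif char == '-' or char in DIGITS:
            value, i = scan_number(text, i)
            tokens.append(Token(TokenType.NUMBER, value))
        else:
            for word, (kind, value) in KEYWORDS.items():
                if text.startswith(word, i):
                    tokens.append(Token(kind, value))
                    i += len(word)
                    break
            else:
                raise TokenizerError(f"Unexpected character {char!r} at offset {i}")
    return tokens
```

Calling `tokenize` on the sample object from earlier should yield 17 tokens, starting with `LEFT_BRACE`, then a `STRING` for `name`, then `COLON`. An input such as `{"age": 12.34.56}` is rejected because the second `.` cannot start any token.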

The Parser: Building the Structure

The parser takes our tokens and builds actual Python objects. It uses a technique called recursive descent parsing, which is perfect for JSON's hierarchical structure.

Parsing in Action

Let's look at how the parser handles different JSON elements:

  1. Objects:
def parse_object(self):
    obj = {}

    if not self.check(TokenType.RIGHT_BRACE):
        while True:
            # Get key (must be string)
            if not self.check(TokenType.STRING):
                raise ParserError("Expected string key")

            key = self.advance().value
            self.consume(TokenType.COLON, "Expected ':'")
            value = self.parse_value()
            obj[key] = value

            if not self.match(TokenType.COMMA):
                break

    self.consume(TokenType.RIGHT_BRACE, "Expected '}'")
    return obj
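  2. Arrays:
def parse_array(self):
    array = []

    # Mirror parse_object: after a comma another value is required,
    # so a trailing comma such as [1,] is rejected
    if not self.check(TokenType.RIGHT_BRACKET):
        while True:
            array.append(self.parse_value())
            if not self.match(TokenType.COMMA):
                break

    self.consume(TokenType.RIGHT_BRACKET, "Expected ']'")
    return array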
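
The excerpts above rely on helpers (`check`, `match`, `consume`, `advance`, `parse_value`) that are not shown. As a rough, self-contained sketch of how they could fit together, the class below parses a list of plain `(kind, value)` tuples. The tuple representation, the kind names, and the `VALUE_KINDS` constant are assumptions for this example, not the project's actual token format.

```python
class ParserError(Exception):
    pass


# Token kinds that stand for a complete value on their own (assumed names)
VALUE_KINDS = ('string', 'number', 'literal')


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens   # list of (kind, value) tuples
        self.pos = 0

    def check(self, kind):
        """True if the next token has the given kind (does not consume it)."""
        return self.pos < len(self.tokens) and self.tokens[self.pos][0] == kind

    def advance(self):
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def match(self, kind):
        """Consume the next token only if it has the given kind."""
        if self.check(kind):
            self.pos += 1
            return True
        return False

    def consume(self, kind, message):
        if not self.check(kind):
            raise ParserError(message)
        return self.advance()

    def parse(self):
        value = self.parse_value()
        if self.pos != len(self.tokens):
            raise ParserError("Unexpected tokens after the top-level value")
        return value

    def parse_value(self):
        if self.match('{'):
            return self.parse_object()   # recursion handles nesting
        if self.match('['):
            return self.parse_array()
        if any(self.check(kind) for kind in VALUE_KINDS):
            return self.advance()[1]
        raise ParserError("Expected a value")

    def parse_object(self):
        obj = {}
        if not self.check('}'):
            while True:
                key = self.consume('string', "Expected string key")[1]
                self.consume(':', "Expected ':'")
                obj[key] = self.parse_value()
                if not self.match(','):
                    break
        self.consume('}', "Expected '}'")
        return obj

    def parse_array(self):
        array = []
        if not self.check(']'):
            while True:
                array.append(self.parse_value())
                if not self.match(','):
                    break
        self.consume(']', "Expected ']'")
        return array


tokens = [
    ('{', None), ('string', 'age'), (':', None), ('number', 25), (',', None),
    ('string', 'hobbies'), (':', None), ('[', None), ('string', 'coding'),
    (',', None), ('string', 'reading'), (']', None), ('}', None),
]
print(Parser(tokens).parse())  # {'age': 25, 'hobbies': ['coding', 'reading']}
```

Each `parse_*` method consumes exactly the tokens that belong to its construct, which is what lets `parse_value` call back into `parse_object` and `parse_array` for arbitrarily deep nesting.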

Error Handling: The Secret Sauce

Good error handling can make or break a parser. We implemented a robust error system that provides precise error locations and helpful messages.

Example Error Scenarios

  1. Missing Quotes:
# Input
{"name": John}

# Error Output
Error: Expected string at line 1, column 9
{"name": John}
         ^
  2. Invalid Number:
# Input
{"age": 12.34.56}

# Error Output
Error: Invalid number format at line 1, column 13
{"age": 12.34.56}
             ^
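
Reports like the two above can be produced from nothing more than the source text and the character offset where the problem was detected. The helper below is a sketch of one way to do that; `format_error` and its parameters are hypothetical names, and the column is zero-based because that is what the sample output above appears to use.

```python
def format_error(source, offset, message):
    """Build a three-line report: the message, the offending line, and a caret."""
    line_start = source.rfind('\n', 0, offset) + 1   # 0 when the error is on the first line
    line_end = source.find('\n', offset)
    if line_end == -1:                               # error is on the last line
        line_end = len(source)
    line_number = source.count('\n', 0, offset) + 1  # 1-based line number
    column = offset - line_start                     # 0-based column
    return (
        f"Error: {message} at line {line_number}, column {column}\n"
        f"{source[line_start:line_end]}\n"
        f"{' ' * column}^"
    )


print(format_error('{"name": John}', 9, "Expected string"))
```

Storing the offset on each token (or on the exception) is enough for this to work, since the line and column are recomputed from the source only when an error is actually reported.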

Real-World Testing and Edge Cases

We tested our parser with various challenging inputs:

  1. Nested Structures:
# Deep nesting
parse_json('''
{
    "user": {
        "profile": {
            "address": {
                "city": "New York",
                "coordinates": [40.7128, -74.0060]
            }
        }
    }
}
''')
  2. Special Numbers:
# Scientific notation
parse_json('{"small": 1.23e-10, "large": 1.23E+10}')

# Negative zero
parse_json('{"zero": -0}')
  3. Unicode and Escapes:
# Unicode characters
parse_json('{"greeting": "t", "emoji": "👋"}')

# Escape sequences
parse_json('{"text": "Line 1\\nLine 2\\t(tabbed)"}')
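
The escape cases are the ones most likely to hide bugs, so here is a sketch of how the raw text between the quotes might be decoded, including `\uXXXX` and surrogate pairs. The names `decode_string`, `read_hex4`, and `SIMPLE_ESCAPES` are made up for this example, and errors are raised as plain `ValueError` to keep the block self-contained.

```python
SIMPLE_ESCAPES = {
    '"': '"', '\\': '\\', '/': '/',
    'b': '\b', 'f': '\f', 'n': '\n', 'r': '\r', 't': '\t',
}
HEX_DIGITS = '0123456789abcdefABCDEF'


def read_hex4(raw, i):
    """Parse the four hex digits starting at raw[i] into an integer."""
    digits = raw[i:i + 4]
    if len(digits) != 4 or any(c not in HEX_DIGITS for c in digits):
        raise ValueError(f"Invalid \\u escape at offset {i}")
    return int(digits, 16)


def decode_string(raw):
    """Decode the text between the quotes of a JSON string literal."""
    out = []
    i = 0
    while i < len(raw):
        char = raw[i]
        if char != '\\':
            out.append(char)
            i += 1
            continue
        if i + 1 >= len(raw):
            raise ValueError("Dangling backslash at end of string")
        esc = raw[i + 1]
        if esc in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[esc])
            i += 2
        elif esc == 'u':
            code = read_hex4(raw, i + 2)
            i += 6
            # A high surrogate followed by a low surrogate encodes a single
            # character outside the Basic Multilingual Plane (emoji, for example)
            if 0xD800 <= code <= 0xDBFF and raw[i:i + 2] == '\\u':
                low = read_hex4(raw, i + 2)
                if 0xDC00 <= low <= 0xDFFF:
                    code = 0x10000 + ((code - 0xD800) << 10) + (low - 0xDC00)
                    i += 6
            out.append(chr(code))
        else:
            raise ValueError(f"Invalid escape sequence \\{esc}")
    return ''.join(out)


print(decode_string('\\u0041'))  # A
```

Decoding after tokenizing keeps the scanner simple: it only has to find the closing quote, and all escape rules live in one function that is easy to test in isolation.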

Performance Considerations

While building the parser, we made several performance-related decisions:

  1. Memory Usage:
  • Store tokens in a list instead of generating them one by one

  • Keep track of string positions instead of copying substrings

  2. Speed Optimizations:
  • Use string concatenation for small strings

  • Implement look-ahead for more efficient parsing

  • Cache repeated token checks
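
For comparison with the list-based approach, here is a sketch of how one-token look-ahead could be layered over a lazily generated token stream. The `Lookahead` class is hypothetical and not part of the project; it only illustrates the trade-off between holding every token in memory and buffering a single one.

```python
class Lookahead:
    """Wrap any iterator and expose a one-item peek() without consuming it."""

    _END = object()   # sentinel meaning "nothing buffered" or "stream exhausted"

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._buffer = self._END

    def peek(self, default=None):
        """Return the next item without consuming it, or default at the end."""
        if self._buffer is self._END:
            self._buffer = next(self._it, self._END)
        return default if self._buffer is self._END else self._buffer

    def advance(self, default=None):
        """Consume and return the next item, or default at the end."""
        item = self.peek(default)
        self._buffer = self._END
        return item


stream = Lookahead(iter(['{', '"age"', ':', '25', '}']))
print(stream.peek())     # {
print(stream.advance())  # {
print(stream.peek())     # "age"
```

A parser written against `peek()` and `advance()` does not care whether the tokens come from a list or a generator, so the choice can be changed later without touching the parsing logic.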

The Web Interface

We also built a simple web interface using Flask to make the parser more accessible:

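from flask import Flask, jsonify, request

# Tokenizer, Parser, TokenizerError, and ParserError come from the parser code described above

app = Flask(__name__)

@app.route('/parse', methods=['POST'])
def parse_json():
    try:
        # silent=True returns None instead of raising on a missing or malformed body
        data = request.get_json(silent=True) or {}
        json_str = data.get('json', '')

        # Create tokenizer and get tokens
        tokenizer = Tokenizer(json_str)
        tokens = tokenizer.tokenize()

        # Create parser and parse tokens
        parser = Parser(tokens)
        parsed_result = parser.parse()

        return jsonify({
            'tokens': [str(token) for token in tokens],
            'parsed': parsed_result
        })
    except (TokenizerError, ParserError) as e:
        # Report invalid input as a client error rather than a 200 response
        return jsonify({'error': str(e)}), 400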

Lessons Learned

Building this parser taught me several valuable lessons:

  1. Start Small: Begin with the basics (simple objects and arrays) and gradually add support for more complex features.

  2. Test Early, Test Often: Write tests for each new feature and edge case you discover. Our test suite caught many subtle bugs:

def test_edge_cases():
    assert parse_json('{"empty": ""}') == {"empty": ""}
    assert parse_json('{"space": " "}') == {"space": " "}
    assert parse_json('{"unicode": "\\u0041"}') == {"unicode": "A"}
  3. Error Messages Matter: Clear error messages can save hours of debugging time. Always strive to make them as helpful as possible.

  4. Documentation is Crucial: Good documentation helps others understand and use your code effectively.

Future Improvements

While our parser is functional, there's always room for improvement:

  1. Performance Optimizations:
  • Implement streaming for large files

  • Add parallel processing for large arrays

  • Optimize string handling

  2. Feature Additions:
  • Support for comments (non-standard JSON)

  • Pretty printing with customizable indentation

  • JSON Schema validation

  3. Developer Experience:
  • Better IDE integration

  • More detailed error messages

  • Interactive debugging tools
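
Of the ideas above, pretty printing is the quickest to prototype. The sketch below is one possible shape for it, not something the project implements yet: `pretty` is a hypothetical name, and it leans on the standard library's `json.dumps` for quoting strings and formatting scalars so that only the layout logic is written by hand.

```python
import json


def pretty(value, indent=2, level=0):
    """Render a parsed JSON value as indented text."""
    pad = ' ' * (indent * level)           # indentation of the closing bracket
    inner = ' ' * (indent * (level + 1))   # indentation of nested items
    if isinstance(value, dict):
        if not value:
            return '{}'
        items = [
            f"{inner}{json.dumps(key)}: {pretty(item, indent, level + 1)}"
            for key, item in value.items()
        ]
        return '{\n' + ',\n'.join(items) + '\n' + pad + '}'
    if isinstance(value, list):
        if not value:
            return '[]'
        items = [inner + pretty(item, indent, level + 1) for item in value]
        return '[\n' + ',\n'.join(items) + '\n' + pad + ']'
    return json.dumps(value)               # strings, numbers, true, false, null


print(pretty({"city": "New York", "coordinates": [40.7128, -74.006]}, indent=4))
```

The recursion follows the same pattern as the parser itself: one branch per container type, with scalars as the base case.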

Conclusion

Building a JSON parser from scratch was an enlightening experience that improved my understanding of:

  • Text processing and parsing

  • Error handling and user experience

  • The importance of thorough testing

  • Performance optimization techniques

The next time you use JSON.parse(), you'll have a deeper appreciation for what's happening under the hood!


Want to explore the code or contribute? Check out the full project on GitHub. Follow me on Twitter.

Have questions or suggestions? Feel free to open an issue or ask me on Twitter!

T

This is a great write up! I commend you for building something from scratch for the learning experience! Have you tried publishing it to PyPI?

A

Hey Jones, thank you. No I haven't tried PyPI yet. Will explore this now for sure.

T

good deal! it will likely involve adding a setup.py file and perhaps modifying your directory structure but I think you're nearly there already. happy to chat if you'd like any pointers!

