From Cipher to Code: Crafting a PyTorch Solution for Pigpen

This post explores the Pigpen cipher, a hallmark of Freemasonry that uses symbols instead of letters. Its allure extends from historical novels to escape rooms, and it presents a distinctive challenge for machine learning. Even in an age of advanced encryption, understanding foundational ciphers like Pigpen is a useful stepping stone toward more sophisticated techniques. My fascination with Pigpen began in grade school, when I explored its mysteries with friends. In this project, I developed a PyTorch model that decodes Pigpen with over 95% accuracy. For more details and access to the model, please visit my GitHub or the Kaggle page dedicated to this project.

Understanding the Pigpen Cipher

The Pigpen cipher, also known as the Freemason’s cipher, substitutes symbols for letters based on a grid system, earning it the moniker ‘Tic Tac Toe cipher’ due to its distinctive layout. This cipher is unique for its use of geometric shapes as a means of encoding, offering simplicity coupled with secrecy. Initially used by Freemasons for confidential records, it has since found use in military contexts and in puzzles in popular culture. Although it might seem like a curiosity in the current digital age, the principles underlying the Pigpen cipher are essential for anyone learning the basics of cryptography. Adapting this cipher for machine learning introduces specific challenges, particularly in transforming geometric symbols into a format a model can understand. Creating a targeted dataset for image recognition and classification in PyTorch underscores the convergence of age-old encryption methods with cutting-edge technology.

Setting Up the Project

In the realm of machine learning, PyTorch is renowned for its adaptability and user-friendliness, making it a go-to framework for research and development initiatives. Its dynamic computational graph and optimized memory management make it particularly suited for projects that necessitate bespoke architectures or frequent modifications.

The initial phase of the project involves setting up the PyTorch environment: installing the most recent version of PyTorch that is compatible with our Python setup, along with supporting libraries such as NumPy for numerical operations and Matplotlib for data visualization.
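As a minimal sketch of that setup, assuming a pip-based environment (the exact install command depends on your platform and CUDA version; pytorch.org lists the right one for your system):

```python
# Assumed install command; pick the build that matches your platform/CUDA setup:
#   pip install torch torchvision numpy matplotlib

import torch
import numpy as np
import matplotlib.pyplot as plt

# Quick sanity check that PyTorch is installed and whether a GPU is available
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```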

Creating The Pigpen Dataset

Creating a dataset for the Pigpen cipher required an innovative approach, as no predefined dataset was available for this cipher. To begin, I developed a script to generate images for every symbol in the Pigpen cipher’s alphabet (link to GitHub), ensuring comprehensive coverage. The initial dataset comprised images rendered with the Python Pillow library. Ultimately, I produced four distinct ‘imagesets’ covering all possible symbols, drawn with different brush types in Paint to simulate a variety of handwriting styles.

One of the datasets, which I termed ‘brush’, involved hand-drawing symbols to add variation.

To introduce realistic variations in the symbols’ appearance, such as different rotations and slight distortions, I utilized a ‘randomizer/respin’ script (link to GitHub). This approach was crucial for simulating real-world conditions.
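The actual randomizer script lives in the repo, but as an illustrative sketch of the idea, a single ‘respin’ pass over a Pillow image might apply a small random rotation and shift; the function name and parameter ranges below are my own placeholders, not the script’s real values:

```python
import random
from PIL import Image

def respin(img: Image.Image, max_angle: float = 10.0, max_shift: int = 2) -> Image.Image:
    """Return a randomly rotated and shifted copy of a symbol image.

    Illustrative only; the real randomizer/respin script may use different
    ranges and additional distortions.
    """
    angle = random.uniform(-max_angle, max_angle)
    shift = (random.randint(-max_shift, max_shift),
             random.randint(-max_shift, max_shift))
    # Rotate about the centre, translate by a few pixels, and pad the corners with white
    return img.rotate(angle, translate=shift, fillcolor="white")
```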

Significantly, for the validation and test sets, I increased the level of randomization to challenge the model further, ensuring it learned to recognize symbols beyond simple patterns. This strategy aimed to demonstrate the model’s capability to generalize from the training data to more complex, unseen examples.

The dataset was divided into training, validation, and testing sets, totaling 75,600 images (approximately 47 MB): 54,000 for training, 10,800 for validation, and 10,800 for testing. The final step involved organizing the data for efficient processing by PyTorch, utilizing its DataLoader for seamless batch handling and shuffling.
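As a minimal sketch of that organization, assuming the images are stored one folder per symbol class (the directory names below are placeholders, not the repo’s actual layout), torchvision’s ImageFolder combined with DataLoader handles the batching and shuffling:

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# ImageFolder expects <split>/<class_name>/<image>.png; the paths here are assumptions.
to_tensor = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),  # the symbols are single-channel
    transforms.ToTensor(),                        # scales pixel values to [0, 1]
])

train_data = datasets.ImageFolder("pigpen/train", transform=to_tensor)
val_data   = datasets.ImageFolder("pigpen/val",   transform=to_tensor)
test_data  = datasets.ImageFolder("pigpen/test",  transform=to_tensor)

train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
val_loader   = DataLoader(val_data,   batch_size=32)
test_loader  = DataLoader(test_data,  batch_size=32)
```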

Upon finalizing the dataset, I made it available on Kaggle for broader access (link to Kaggle Dataset), facilitating further research and experimentation.

Model Design and Development

For our Pigpen cipher project, we selected a Convolutional Neural Network (CNN) architecture, renowned for its efficacy in image recognition tasks and therefore well suited to interpreting the visual symbols of the Pigpen cipher. The task closely parallels classic datasets such as MNIST and FashionMNIST, and prior experiments with those datasets showed a CNN approach to be the most effective.

Layers and Structure: Rationale Behind Selections

  • Input Layer: Initiates the processing of symbol images.
  • Convolutional Layers: Essential for feature extraction, capturing varying levels of detail across multiple layers.
  • Pooling Layers: Serve to reduce dimensionality, enhancing computational efficiency.
  • Fully Connected Layers: Crucial for interpreting extracted features, leading to accurate symbol classification.
  • Output Layer: Finalizes the classification, linking symbols to their respective letters.

Informed by coursework from DeepLearning.AI, the inclusion of Dropout and Batch Normalization components within the CNN layers was deemed beneficial.

PyTorch Implementation Details

  • Data Preparation: Given the uniformity of image sizes (28×28 pixels) in our custom dataset, preprocessing steps like normalization or resizing were not required.
  • Model Definition: Utilizing PyTorch’s nn.Module, we defined the CNN architecture, integrating layers in a sequence that includes Convolutional, ReLU activation, Batch Normalization, and Dropout layers, followed by a sequence of Fully Connected layers.
  • Loss Function and Optimizer: Selection of cross-entropy loss and the SGD optimizer was based on their effectiveness during the model training phase.
  • Training Loop: Implementation focused on iterating over image batches, loss calculation, and weight optimization.
  • Evaluation: Utilization of separate validation and test sets aids in fine-tuning the model for peak performance.

The ‘EnhancedCNN’ model, featuring a structured layering of Conv-ReLU-BN-Dropout followed by FC-ReLU-FC, represents our final architecture. Dropout proved highly effective, and Batch Normalization added enough value to justify its inclusion.
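As an illustrative sketch of that structure (the filter counts, dropout rate, hidden size, and the 27-class output below are my own assumptions; the actual EnhancedCNN definition is in the repo):

```python
import torch
import torch.nn as nn

class EnhancedCNN(nn.Module):
    """Sketch of the Conv-ReLU-BN-Dropout / FC-ReLU-FC layering described above."""

    def __init__(self, num_classes: int = 27):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # 1x28x28 -> 32x28x28
            nn.ReLU(),
            nn.BatchNorm2d(32),
            nn.Dropout(0.25),
            nn.MaxPool2d(2),                              # -> 32x14x14
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # -> 64x14x14
            nn.ReLU(),
            nn.BatchNorm2d(64),
            nn.Dropout(0.25),
            nn.MaxPool2d(2),                              # -> 64x7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))
```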

Training the Model

Data Preprocessing and Augmentation

  • Normalization: As mentioned above, the dataset was built to spec: all images are 28×28 pixels, so no scaling is necessary.
  • Augmentation: Rotation, scaling, and shifting of symbols were already built in during dataset generation to improve model robustness against varied representations.
  • Batching: We are going to use PyTorch’s DataLoader to create manageable batches for training.

Training Process and Hyperparameter Tuning

  • Epochs: First, we determine the number of training cycles. To get a ballpark, I ran some longer training sessions, including one very long 50-epoch run. (A minimal training-loop sketch follows this list.)
    • In that 50-epoch run, both loss and accuracy stopped improving past roughly 11 epochs of training, so 3-11 epochs looks like the best range.
  • Learning Rate: General guidance here is to start with a standard rate (0.01) and adjust based on performance. I left it at 0.01; any tinkering only hurt the accuracy I was seeing.
  • Batch Size: Here the goal is to optimize based on memory constraints and model complexity.
    • Originally I had erroneously entered 27 as the batch size, mixing it up with the number of classes in the dataset. Once I adjusted the batch size to 32, training actually ran a bit faster.
  • Optimizer Selection: We could experiment with different optimizers such as Adam, but SGD seems to be working very well for us.
  • Regular Evaluation: We will monitor performance on the validation set to guide adjustments.
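Putting those pieces together, a minimal training loop might look like the following, reusing the EnhancedCNN and DataLoader sketches from earlier and the hyperparameters discussed above (cross-entropy loss, SGD at 0.01, batches of 32, roughly 10 epochs); the real training script is in the repo:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = EnhancedCNN().to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # learning rate left at 0.01

num_epochs = 10  # within the 3-11 epoch range identified above

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:                # batches of 32 from the DataLoader sketch
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

    # Regular evaluation: a quick validation pass after every epoch
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    print(f"epoch {epoch + 1}: train loss {running_loss / len(train_loader):.4f}, "
          f"val accuracy {correct / total:.3f}")
```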

Handling Overfitting and Underfitting

  • Overfitting: If the model performs well on training data but poorly on validation data:
    • Increase data augmentation.
    • Introduce dropout layers or regularization techniques.
    • Reduce model complexity.
  • Underfitting: If the model performs poorly on both training and validation data:
    • Increase model complexity.
    • Experiment with longer training (more epochs).
    • Explore alternative network architectures or learning rates.

The issue I ran into was overfitting: the model seemed to perform too well early on without actually getting much better from training. That’s when I came up with the idea of ‘hiking up the randomness’ for the test and validation samples, while also creating a larger dataset so the model could generalize well enough to handle these ‘more random’ sets of images.

Testing and Validation

Evaluating the Model’s Performance

  • Accuracy Measurement: We’ll be using accuracy as the primary metric to evaluate how often the model correctly decodes the Pigpen cipher symbols.
  • Identifying Problem Areas: One technique for locating classes that may be trickier than others for the model is to create a confusion matrix, which makes it easy to understand the model’s performance across different classes (symbols); a minimal sketch follows this list.
  • Loss Evaluation: Another key data point is the loss; we’ll be monitoring the loss on the validation set to gauge the model’s prediction errors.
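As a minimal sketch of that evaluation, reusing the model and test_loader from the earlier sketches (the 27-class count is again an assumption), accuracy and a confusion matrix can be accumulated in a single pass over the test set:

```python
import torch

num_classes = 27  # assumed class count, matching the model sketch above
confusion = torch.zeros(num_classes, num_classes, dtype=torch.long)

model.eval()
correct = total = 0
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
        for t, p in zip(labels, preds):
            confusion[t.item(), p.item()] += 1  # rows: true class, columns: prediction

print(f"test accuracy: {correct / total:.3f}")
# Large off-diagonal entries in `confusion` point at symbol pairs the model mixes up.
```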

Analyzing Test Results

  • Performance Overview: Review accuracy and loss metrics to assess the overall performance.
  • Error Patterns: Look for patterns in misclassifications to understand potential flaws in the model or data.
    • To be fair, this analysis could benefit from something like a confusion matrix to help identify systemic issues, i.e. patterns of inaccuracy. I did not do that as part of my initial investigation here.

Over just 5 epochs of training, the loss fell consistently each epoch and accuracy rose before plateauing at around 98%.

Strategies for Improvement

  • Hyperparameter Optimization: Refine hyperparameters like learning rate or batch size to improve performance.
  • Model Adjustments: Based on test results, tweak the model architecture. Consider adding or modifying layers if certain symbols are consistently misclassified.
  • Additional Training Data: Since I created the dataset, we can also improve things on the dataset side: more image sets, more handwriting styles, or heavier randomization would all help generalization.

Applications and Implications

Potential Uses in Real-World Scenarios

  • Educational Tools: Use the dataset/model to create interactive learning platforms for teaching cryptography and history.
  • Puzzle and Game Development: Implement in puzzle games or digital escape rooms for decoding challenges. (NOTE: During this project I became aware that a variant of the Pigpen Cipher was made somewhat famous recently from being featured in an Assassin’s Creed game. Very cool! With some additions to the dataset, this model could potentially read that variant as well)
  • Historical Research: Aid in deciphering historical documents encoded with similar ciphers.

Future Scope and Enhancements

  • Model Improvement: We could continuously refine the model for higher accuracy and efficiency; the learning rate, momentum, number of epochs, and similar hyperparameters could all be further optimized.
  • Broader Cipher Applications: This was just one variant of the Pigpen cipher, the model could be trained on multiple variants – or even different ciphers entirely.
  • Integration with Other Technologies: Explore integration with other AI and ML technologies for more complex cryptographic applications.
  • Creating a ‘Picture to Cipher translation’ workflow: Use image processing together with the model so a user can take a picture of a page of Pigpen cipher; the pipeline would parse it character by character, feed each symbol to the model, and output the translation. (A very rough sketch of the idea follows this list.)
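As a very rough sketch of that idea, assuming an already-cleaned scan where the symbols sit on a regular 28×28 grid (a real photograph would need proper detection and segmentation first), and reusing the model, device, and dataset objects from the earlier sketches:

```python
import numpy as np
import torch
from PIL import Image

def decode_page(path: str, cell: int = 28) -> str:
    """Slice a grid-aligned page into 28x28 tiles and decode each one.

    Hypothetical helper; it assumes perfectly aligned symbols and reuses the
    model/device/train_data objects from the earlier sketches.
    """
    page = Image.open(path).convert("L")
    cols, rows = page.width // cell, page.height // cell
    tiles = []
    for r in range(rows):
        for c in range(cols):
            tile = page.crop((c * cell, r * cell, (c + 1) * cell, (r + 1) * cell))
            tiles.append(torch.from_numpy(np.array(tile, dtype="float32")) / 255.0)
    batch = torch.stack(tiles).unsqueeze(1).to(device)   # shape (N, 1, 28, 28)
    with torch.no_grad():
        preds = model(batch).argmax(dim=1)
    # Map predicted class indices back to letters via the dataset's folder names
    return "".join(train_data.classes[i] for i in preds.tolist())
```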

Conclusion

Key Findings and Achievements

  • Developed a PyTorch model that decodes Pigpen cipher with high accuracy.
  • Showcased CNN’s effectiveness in image classification, applicable to historical ciphers.
  • Built a specialized dataset from scratch, underscoring the importance of dataset preparation in machine learning.

Personal Reflections

  • This project merged history, cryptography, and machine learning, offering insights into the integration of ancient techniques with current technologies.
  • Overcoming the challenge of adapting a historical cipher for neural network analysis was enlightening and underscored the complexity of working with, and creating, this kind of dataset.

Resources: