Abstract
Deep learning has revolutionized computer vision, making it possible to build highly accurate image classification systems with relatively little effort. But how much difference does it really make to use pre-trained models versus building your own from scratch? In this project, I set out to answer that question by building a vegetable image classifier using three different approaches: a custom Convolutional Neural Network (CNN) built from the ground up, and two proven ImageNet-pretrained architectures—VGG19 and ResNet50.
The goal was simple: classify images of broccoli, cabbage, and cauliflower with the highest possible accuracy, while understanding the trade-offs between building custom models and leveraging transfer learning.
The Dataset: Vegetable Image Classification
I used the Vegetable Image Dataset by misrakahmed, available on Kaggle. This dataset contains 21,000 images spanning 15 different vegetable categories. For this project, I focused on just three classes: broccoli, cabbage, and cauliflower.
Dataset Breakdown
The dataset comes pre-split into three segments:
- Training set: 1,000 images per class × 3 classes = 3,000 images
- Validation set: 200 images per class × 3 classes = 600 images
- Test set: 200 images per class × 3 classes = 600 images
One major advantage of this dataset is that all images are already standardized to 224 × 224 pixels. This meant I could skip complex resizing operations and focus directly on building and training models.
Data Exploration & Pre-Processing
Before diving into model building, I performed some basic data exploration:
Understanding the Structure: The dataset is organized into folders, where each folder name represents a vegetable class. This made it easy to iterate through the directory structure and automatically label images based on their parent folder.
Minimal Pre-Processing Required: Since images were already uniform in size, my pre-processing pipeline was straightforward:
- Loading Images: I used the Pillow library to open image files and convert them into NumPy arrays
- Normalization: Pixel values were scaled from [0, 255] to [0, 1] by dividing by 255.0
- Label Encoding: I applied one-hot encoding to convert class labels into a format suitable for neural network training (e.g., broccoli = [1,0,0], cabbage = [0,1,0], cauliflower = [0,0,1])
- Tensor Conversion: Arrays were converted to TensorFlow tensors for efficient computation
The folder-based structure made automation possible. I wrote a Python script that navigated through the train, validation, and test directories, filtered only the three vegetables I needed, and assembled the complete dataset ready for training.
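As a rough sketch, that loading-and-labeling script looked something like the following. Folder names, file extensions, and the exact directory layout are assumptions about the dataset; conversion to TensorFlow tensors (one `tf.convert_to_tensor` call) is omitted to keep the example dependency-light:

```python
import numpy as np
from pathlib import Path
from PIL import Image  # Pillow, as mentioned above

# Folder names are assumed to match the Kaggle dataset's class folders.
CLASSES = ["Broccoli", "Cabbage", "Cauliflower"]

def one_hot(class_index, num_classes=3):
    """broccoli -> [1,0,0], cabbage -> [0,1,0], cauliflower -> [0,0,1]."""
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[class_index] = 1.0
    return vec

def load_split(split_dir):
    """Walk one split (train/validation/test), keeping only our 3 classes."""
    images, labels = [], []
    for idx, name in enumerate(CLASSES):
        for path in sorted(Path(split_dir, name).glob("*.jpg")):
            arr = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)
            images.append(arr / 255.0)   # scale [0, 255] -> [0, 1]
            labels.append(one_hot(idx))  # label from the parent folder name
    return np.stack(images), np.stack(labels)
```

Because the labels come from folder names, adding or removing a class is just a one-line change to `CLASSES`.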
Building the Models
With data preparation complete, I moved on to the exciting part: building and training models. I took three different approaches to see which would yield the best results with the least effort.
Approach 1: Vanilla CNN (Custom Architecture)
The first challenge was to build a CNN from scratch—no pre-trained weights, no transfer learning, just raw neural network design. This “Vanilla CNN” would serve as my baseline.
Why Start From Scratch?
Building a custom model helps you understand the fundamental building blocks of CNNs: convolution layers, pooling operations, activation functions, and how they work together to extract features and make predictions.
The Design Process
I went through four iterations before arriving at a satisfactory architecture. Each iteration taught me something valuable:
Iteration 1: I started simple with a basic convolution-pooling-dense pattern. This achieved 93.3% accuracy but showed signs of overfitting after epoch 16. I used L2 regularization (1e-4) to combat this, and implemented early stopping to prevent the model from degrading further.
Iteration 2: I replaced L2 regularization with dropout (50%) on the dense layers. This improved accuracy slightly to 93.6%, but overfitting was still present in the validation loss curve.
Iteration 3: I combined both techniques—adding back L2 regularization alongside dropout—and increased the dense layer neurons from 64 to 128. I also expanded training to 100 epochs with early stopping patience of 15. This configuration achieved 95.8% accuracy and showed much better stability.
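The regularization pieces from these iterations map to a handful of Keras primitives. A minimal sketch (the activation choice and the exact placement inside the model are assumptions; the 128-unit width, L2 strength, dropout rate, and patience follow the text):

```python
from tensorflow.keras import layers, regularizers
from tensorflow.keras.callbacks import EarlyStopping

# Iteration-3-style dense head: L2 weight decay (1e-4) plus 50% dropout.
dense_head = [
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.5),
]

# Early stopping: halt when validation loss stops improving for 15 epochs
# and roll back to the best weights rather than the last (overfit) ones.
early_stop = EarlyStopping(monitor="val_loss", patience=15,
                           restore_best_weights=True)

# model.fit(train_x, train_y, validation_data=(val_x, val_y),
#           epochs=100, callbacks=[early_stop])
```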
Iteration 4 (Final): Inspired by research showing that deeper networks can capture more complex features, I built a true deep learning architecture:
- 4 convolutional layers with varying filter sizes (64, 128, 64, 32)
- Strategic use of different kernel sizes (3×3, 5×5, 1×1) to capture features at multiple scales
- Leaky ReLU activation instead of standard ReLU to avoid dead units and keep gradients flowing through negative inputs
- Removed regularization from dense layers to allow the model to train more freely
- Trained for 300 epochs
This final architecture achieved 99% accuracy on the test set! The loss curves showed smooth, consistent learning without the exploding gradients I’d seen in earlier iterations.
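A minimal Keras sketch of this final architecture, following the filter counts (64, 128, 64, 32) and kernel sizes (3×3, 5×5, 3×3, 1×1 here) described above. The pooling placement, dense-layer width, and optimizer are assumptions, not the exact original configuration:

```python
from tensorflow.keras import layers, models

def build_vanilla_cnn(num_classes=3):
    """Iteration-4 'Vanilla CNN': four conv blocks with mixed kernel sizes,
    Leaky ReLU throughout, and an unregularized dense head."""
    model = models.Sequential([
        layers.Input(shape=(224, 224, 3)),
        layers.Conv2D(64, (3, 3)),  layers.LeakyReLU(), layers.MaxPooling2D(),
        layers.Conv2D(128, (5, 5)), layers.LeakyReLU(), layers.MaxPooling2D(),
        layers.Conv2D(64, (3, 3)),  layers.LeakyReLU(), layers.MaxPooling2D(),
        layers.Conv2D(32, (1, 1)),  layers.LeakyReLU(), layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128), layers.LeakyReLU(),  # no L2, no dropout
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```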
Key Takeaway: Building from scratch requires patience and experimentation. It took 300 epochs and multiple architectural revisions to reach 99% accuracy. But the learning experience was invaluable.
Approach 2: Transfer Learning with VGG19
After spending considerable effort on the custom model, I wanted to see how much easier it would be to use a pre-trained architecture. Enter VGG19.
What is VGG19?
VGG19 is a convolutional neural network developed by the Visual Geometry Group at Oxford. It consists of 16 convolutional layers and 3 fully connected layers—19 weight layers in total, hence the name. The model was originally trained on ImageNet, a massive dataset containing millions of images across thousands of categories.
Why Use Transfer Learning?
Transfer learning leverages knowledge from models trained on large datasets. Instead of learning features from scratch, you’re starting with a model that already understands edges, textures, shapes, and complex patterns. You simply adapt the final layers to your specific classification task.
My VGG19 Experiments
I ran four experiments with different configurations:
Experiment 1: I added a hidden dense layer (64 neurons) after the VGG19 base and applied L2 regularization. This achieved 99.8% accuracy but experienced exploding gradients around epoch 14, triggering early stopping at epoch 18.
Experiment 2: I tried replacing L2 regularization with dropout (50%) to avoid gradient explosion. Unfortunately, this caused vanishing gradients—the model stopped learning entirely and got stuck at 33% accuracy, chance level for a three-class problem.
Experiment 3: Suspecting ReLU might be causing vanishing gradients, I switched to Leaky ReLU while keeping dropout. The problem persisted, which strongly suggested that dropout itself was the culprit when combined with transfer learning.
Experiment 4 (Final): I removed both regularization and dropout, trusting that VGG19’s pre-trained weights would naturally resist overfitting. I trained for 100 epochs with early stopping patience of 20. The results were spectacular: 100% accuracy with a loss of just 0.0121. All 100 epochs were used productively without any gradient issues.
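The winning configuration boils down to a frozen VGG19 base with a single softmax head on top. A hedged sketch (whether the earlier 64-unit hidden layer survived into the final run, plus the flatten-vs-pool choice and optimizer, are assumptions):

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG19

def build_vgg19_classifier(num_classes=3, weights="imagenet"):
    """Experiment-4 setup: frozen pre-trained base, no dropout, no L2,
    just a softmax classification head."""
    base = VGG19(weights=weights, include_top=False,
                 input_shape=(224, 224, 3))
    base.trainable = False  # keep the ImageNet features fixed
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Freezing the base (`base.trainable = False`) means only the tiny head is trained, which is exactly why each epoch is so productive.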
Key Insight: Transfer learning dramatically reduced the trial-and-error process. VGG19 reached 100% accuracy in just 100 epochs, compared to the 300 epochs needed for my custom model.
Approach 3: Transfer Learning with ResNet50
For my final experiment, I used ResNet50, another powerful architecture but with a fundamentally different design philosophy.
What Makes ResNet Different?
ResNet (Residual Network) introduced the concept of “skip connections” or “residual connections.” These connections allow the network to learn residual functions rather than direct mappings, making it possible to train very deep networks (50+ layers) without vanishing gradients. ResNet50 has—you guessed it—50 layers.
My ResNet50 Experiments
Experiment 1: I started with a hidden dense layer (64 neurons), L2 regularization (1e-2), and dropout (30%). This achieved 99.83% accuracy with a loss of 0.1922 after 100 epochs—solid results but not perfect.
Experiment 2 (Final): I increased the hidden layer to 128 neurons, removed L2 regularization, and increased dropout to 50%. Training was set for 100 epochs with early stopping patience of 10.
The model hit early stopping at epoch 61, meaning the best weights were found at epoch 51. The final results: 100% accuracy with an incredibly low loss of 0.000875—the best performance across all models!
Key Observation: ResNet50 not only achieved perfect accuracy but did so with the lowest loss value, suggesting the most confident predictions. The skip connections likely helped the model converge faster and more reliably.
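For comparison, the final ResNet50 head can be sketched the same way, following the text's 128-unit hidden layer, 50% dropout, and no L2. The global-average-pooling step and optimizer are assumptions:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

def build_resnet50_classifier(num_classes=3, weights="imagenet"):
    """Experiment-2 setup: frozen ResNet50 base, 128-unit hidden layer,
    50% dropout, no L2 regularization."""
    base = ResNet50(weights=weights, include_top=False,
                    input_shape=(224, 224, 3))
    base.trainable = False  # skip connections stay exactly as pre-trained
    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```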
Results Comparison: The Moment of Truth
Let me lay out the final results side by side:
| Model | Training Epochs Used | Test Accuracy | Final Loss | Epochs Until Loss < 0.1 |
|---|---|---|---|---|
| Vanilla CNN | 300 | 99.0% | 0.0325 | ~100 |
| VGG19 | 100 | 100% | 0.0121 | ~18 |
| ResNet50 | 51 | 100% | 0.000875 | ~21 |
What Do These Numbers Tell Us?
Efficiency: Transfer learning models (VGG19 and ResNet50) reached peak performance much faster than the custom CNN. VGG19 needed roughly 18 epochs to get loss below 0.1, while my Vanilla CNN needed about 100 epochs to reach the same point.
Effort vs. Results: I had to make 12 different layer modifications while building the Vanilla CNN to reach 99%. In contrast, VGG19 only required 1 additional layer on top of the pre-trained base, and ResNet50 needed 3 custom layers. This dramatically reduced development time.
Final Performance: While the Vanilla CNN achieved impressive 99% accuracy, both transfer learning models hit perfect 100% accuracy. More importantly, ResNet50’s extremely low loss (0.000875) indicates highly confident predictions, which is crucial for real-world deployment.
Resource Investment: If I had standardized all models to train for just 25 epochs, the Vanilla CNN would have significantly underperformed compared to the transfer learning approaches, likely stuck somewhere around 94-95% accuracy.
Lessons Learned & Insights
This project taught me several valuable lessons:
1. Transfer Learning Is Powerful, But Not Magic
Pre-trained models gave me a massive head start, but I still needed to understand how to properly adapt them. My failed experiments with dropout and regularization on VGG19 showed that you can’t just blindly add layers—you need to understand how they interact with pre-trained weights.
2. Architecture Matters
ResNet50’s skip connections proved superior for this task, achieving both perfect accuracy and the lowest loss. The architectural innovation of residual connections isn’t just theoretical—it translates to real performance gains.
3. Building From Scratch Has Value
While more time-consuming, building the Vanilla CNN taught me fundamentals that made working with VGG19 and ResNet50 much easier. Understanding why certain layer combinations cause vanishing or exploding gradients helped me debug issues faster.
4. Regularization Requires Careful Tuning
One of the most surprising findings was that removing regularization entirely from the transfer learning models actually improved performance. This suggests that ImageNet pre-training already provided strong regularization through learned feature representations.
Conclusion & Future Directions
This project demonstrated that for image classification tasks, transfer learning offers a compelling advantage: faster training, higher accuracy, and less manual tuning. However, understanding how to build CNNs from scratch remains valuable for developing intuition about deep learning.
Potential Improvements
If I were to extend this project, I would:
- Expand to all 15 vegetable classes to test model robustness
- Implement data augmentation (rotations, flips, color jittering) to improve generalization
- Try newer architectures like EfficientNet or Vision Transformers
- Deploy the best model as a web application for real-time classification
- Analyze misclassified images to understand model limitations
- Experiment with ensemble methods combining multiple models
Final Thoughts
Whether you’re building a custom model or using transfer learning, the key is understanding your data and iterating based on results. This project reinforced that machine learning is as much art as science—requiring experimentation, patience, and continuous learning.
If you’re starting with image classification, my recommendation is clear: start with transfer learning to get quick wins and build momentum, but don’t skip learning the fundamentals of CNNs. The combination of both approaches will make you a more effective practitioner.
All experiments were conducted using TensorFlow Keras on Google Colab with L4 GPU acceleration. The complete code and detailed results are available in my research report.
