Merge pull request #12 from unicef/dedup-engine
add ! dedup engine
domdinicola authored Oct 4, 2024
2 parents f5de211 + 2a7355d commit e8eee8f
Showing 16 changed files with 414 additions and 345 deletions.
9 changes: 6 additions & 3 deletions docs/components/hde/.pages
Original file line number Diff line number Diff line change
@@ -1,5 +1,8 @@
nav:
- index.md
- setup.md
- tmp.md
- tmp2.md
- Setup: setup.md
- REST API: API.md
- Demo Application: demo.md
- Duplicated Image Detection: did
- Troubleshooting: troubleshooting.md
- Development: development.md
17 changes: 17 additions & 0 deletions docs/components/hde/API.md
@@ -0,0 +1,17 @@
The application provides comprehensive API documentation to facilitate ease of use and integration. API documentation is available via two main interfaces:

#### Swagger UI
An interactive interface that allows users to explore and test the API endpoints. It provides detailed information about the available endpoints, their parameters, and response formats. Users can input data and execute requests directly from the interface.

URL: `http://localhost:8000/api/rest/swagger/`

#### Redoc
A static, beautifully rendered documentation interface that offers a more structured and user-friendly presentation of the API. It includes comprehensive details about each endpoint, including descriptions, parameters, and example requests and responses.

URL: `http://localhost:8000/api/rest/redoc/`


These interfaces ensure that developers have all the necessary information to effectively utilize the API, enabling seamless integration and interaction with the application’s features.

!!! warning "Environment-Specific URLs"
    The URLs vary depending on the server where the application is hosted. If the server is hosted anywhere other than the local machine, replace **http://localhost:8000** with the server's domain URL.
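As a small illustration of the warning above, the documentation URLs can be derived from whatever base URL a deployment uses. The helper below is an illustrative sketch, not part of the project:

```python
from urllib.parse import urljoin

def docs_urls(base: str) -> dict:
    """Build the Swagger and Redoc documentation URLs for a deployment host."""
    root = base.rstrip("/") + "/"
    return {
        "swagger": urljoin(root, "api/rest/swagger/"),
        "redoc": urljoin(root, "api/rest/redoc/"),
    }

# For a local run this yields the URLs listed above.
print(docs_urls("http://localhost:8000"))
```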
76 changes: 76 additions & 0 deletions docs/components/hde/demo.md
@@ -0,0 +1,76 @@
To help you explore the functionality of this project, a demo server can be run locally using the provided sample data. This demo server includes pre-configured settings and sample records to allow for a comprehensive overview of the application's features without needing to configure everything from scratch.


## Running the Demo Server Locally

To set up and start the demo server locally, use the following command:

```shell
docker compose -f tests/extras/demoapp/compose.yml up --build
```

This command will build and launch all necessary containers for the demo environment, allowing you to see how different components of the system interact. Once everything is running, you can access the demo server's admin panel to manage and configure various settings within the application.

## Accessing the Admin Panel

The admin panel is accessible via the following URL in your browser, using the credentials below:

- URL: **http://localhost:8000/admin**
- Username: **[email protected]**
- Password: **123**


## API Interaction

To further understand how the API works and how different endpoints can be used, there are scripts available for API interaction. These scripts are located in the `tests/extras/demoapp/scripts` directory.

### Prerequisites

To use these scripts, ensure that the following tools are installed:

- [httpie](https://httpie.io/): A command-line HTTP client, used for making API requests in a more readable format compared to traditional curl.
- [jq](https://jqlang.github.io/jq/) : A lightweight and flexible command-line JSON processor that allows you to parse and manipulate JSON responses from API endpoints.

### Scripts Overview

#### Configuration Scripts

Configuration scripts are used to set up the environment for the API interactions. These scripts hold internal settings and functions that are shared across multiple API interaction scripts, making it easier to reuse common functionality and standardize configuration.

| Name | Arguments | Description |
|-----------------------|-----------|-------------------------------------------------|
| .vars | - | Contains configuration variables |
| .common | - | Contains common functions used by other scripts |


#### Public Scripts

These scripts help manage specific parameters for API interactions, allowing for easy setup and modification of variables that will be used in other commands.

| Name | Arguments | Description |
|-----------------------|----------------------|---------------------------|
| use_base_url | base url | Sets base url |
| use_auth_token | auth token | Sets authentication token |
| use_deduplication_set | deduplication set id | Sets deduplication set id |


#### API Interaction Scripts

These scripts are used to interact directly with the API endpoints, performing various operations like creating deduplication sets, uploading images, starting the deduplication process, and retrieving results.

| Name | Arguments | Description |
|---------------------------|-----------------------------------------|---------------------------------------------|
| create_deduplication_set | reference_pk | Creates new deduplication set |
| create_image | filename | Creates image in deduplication set |
| ignore | first reference pk, second reference pk | Makes API ignore specific reference pk pair |
| process_deduplication_set | - | Starts deduplication process |
| show_deduplication_set | - | Shows deduplication set data |
| show_duplicates | - | Shows duplicates found in deduplication set |


#### Test Case Scripts

Test case scripts are designed to automate end-to-end testing scenarios, making it easy to validate the deduplication functionality.

| Name | Arguments | Description |
|------------------|--------------|--------------------------------------------------------------------------------------------------------------------------------|
| base_case | reference pk | Creates deduplication set, adds images to it and runs deduplication process |
| all_ignored_case | reference pk | Creates deduplication set, adds images to it, adds all possible reference pk pairs to ignored pairs and shows duplicates found |
17 changes: 17 additions & 0 deletions docs/components/hde/development.md
@@ -0,0 +1,17 @@
## Local Development

To develop the service locally, you can utilize the provided `compose.yml` file. This configuration file defines all the necessary services, including the primary application and its dependencies, to create a consistent development environment. By using **Docker Compose**, you can effortlessly spin up the entire application stack, ensuring that all components work seamlessly together.

To build and start the service, along with its dependencies, run the following command:

```shell
docker compose up --build
```


## Running Tests
To ensure that the service is working correctly, a comprehensive suite of tests is available. To run these tests, execute the following command:

```shell
docker compose run --rm backend pytest tests -v --create-db
```


## Viewing Coverage Report

After running the tests, a coverage report will be generated. This report helps in assessing how much of the code is covered by the tests, highlighting any areas that may need additional testing. You can find the coverage report in the `~build/coverage` directory.
5 changes: 5 additions & 0 deletions docs/components/hde/did/.pages
@@ -0,0 +1,5 @@
nav:
- Image Processing and Duplicate Detection: index.md
- Configuration: config.md
- workflow.md

68 changes: 68 additions & 0 deletions docs/components/hde/did/config.md
@@ -0,0 +1,68 @@
The configuration can be managed directly through the **admin panel**, which provides a simple way to modify settings without changing the codebase. Navigate to:

Home › Constance › Config

Here, you will find all the configurable settings that affect the behavior of the system, allowing for quick adjustments and better control over application behavior.

## Deep neural networks (DNN)

The deep learning component of the system is crucial for performing advanced inference tasks, including **face detection**, **face recognition**, and **finding duplicate images** using a pre-trained model. These tasks are fundamental to ensuring the accuracy and efficiency of the system in identifying and managing images.

This component relies on **Convolutional Neural Networks (CNNs)**, a type of deep learning model particularly well-suited for processing visual data. CNNs are used to automatically extract relevant features from images, such as facial landmarks and distinctive patterns, without the need for manual feature engineering.

### DNN_BACKEND

Specifies the computation backend to be used by [OpenCV](https://github.com/opencv/opencv) library for deep learning inference.

### DNN_TARGET

Specifies the target device on which [OpenCV](https://github.com/opencv/opencv) library will perform the deep learning computations.


## Face Detection

This component is responsible for locating and identifying faces in images. It uses advanced deep learning algorithms to scan images and detect the regions that contain human faces. This section outlines the key configuration parameters that influence how the face detection model processes input images and optimizes detection results.

### BLOB_FROM_IMAGE_SCALE_FACTOR

Specifies the scaling factor applied to all pixel values when converting an image to a blob. Typically it is 1.0 (no scaling) or 1.0/255.0 (normalizing pixel values to the [0, 1] range).

Note that the scaling factor is also applied to the mean values. Both the scaling factor and the mean values must match between training and inference to get correct results.

### BLOB_FROM_IMAGE_MEAN_VALUES

Specifies the mean BGR values used in image preprocessing to normalize pixel values by subtracting the mean values of the training dataset. This helps in reducing model bias and improving accuracy.

The specified mean values are subtracted from each channel (Blue, Green, Red) of the input image.

Note that the scaling factor is also applied to these mean values. Both the scaling factor and the mean values must match between training and inference to get correct results.
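A minimal sketch of the preprocessing arithmetic described by these two settings. The concrete scale factor and BGR means below are common choices for this kind of model, assumed for illustration; OpenCV's `blobFromImage` subtracts the means and then multiplies by the scale factor:

```python
# Hypothetical values for illustration, not read from this repository's config.
SCALE_FACTOR = 1.0 / 255.0
MEAN_BGR = (104.0, 177.0, 123.0)

def preprocess_pixel(bgr, mean=MEAN_BGR, scale=SCALE_FACTOR):
    """Subtract the per-channel mean, then apply the scale factor."""
    return tuple((channel - m) * scale for channel, m in zip(bgr, mean))

# Because the factor multiplies the whole difference, it effectively rescales
# the means as well: (p - m) * s == p*s - m*s. This is why both values must
# match between training and inference.
print(preprocess_pixel((204.0, 177.0, 23.0)))
```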

### FACE_DETECTION_CONFIDENCE

Specifies the minimum confidence score required for a detected face to be considered valid. Detections with confidence scores below this threshold are discarded as likely false positives.

### NMS_THRESHOLD

Specifies the Intersection over Union (IoU) threshold used in Non-Maximum Suppression (NMS) to filter out overlapping bounding boxes. If the IoU between two boxes exceeds this threshold, the box with the lower confidence score is suppressed. Lower values result in fewer, more distinct boxes; higher values allow more overlapping boxes to remain.
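A toy implementation of IoU and greedy NMS may make the threshold's effect concrete (pure Python, illustrative only):

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    if inter == 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, threshold):
    """Greedy NMS: visit boxes by descending score; drop any box whose IoU
    with an already-kept box exceeds the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores, 0.4))
```

With a threshold of 0.4, the heavily overlapping pair collapses to the higher-scoring box; raising the threshold above their IoU (about 0.68 here) keeps both.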

## Face Recognition

This component builds on face detection to identify and differentiate between individual faces. This involves generating face encodings, which are numerical representations of the unique facial features used for recognition. These encodings can then be compared to determine if two images contain the same person or to find matches in a database of known faces.

### FACE_ENCODINGS_NUM_JITTERS

Specifies the number of times to re-sample the face when calculating the encoding. Higher values increase accuracy but are computationally more expensive: for example, setting 'num_jitters' to 100 makes the process 100 times slower.
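To illustrate the trade-off, here is a sketch with a hypothetical 3-dimensional encoding and simulated noise; real encodings come from the face recognition model, not from a random number generator:

```python
import random

def sample_encoding(rng):
    """One noisy encoding pass over a hypothetical 3-dim encoding."""
    true_encoding = [0.10, 0.50, -0.30]
    return [v + rng.gauss(0.0, 0.05) for v in true_encoding]

def encode_with_jitters(num_jitters, seed=42):
    """Average num_jitters re-sampled encodings. The cost is linear in
    num_jitters, which is why 100 jitters is roughly 100 times slower."""
    rng = random.Random(seed)
    samples = [sample_encoding(rng) for _ in range(num_jitters)]
    return [sum(dim) / num_jitters for dim in zip(*samples)]
```

Averaging more re-samples smooths out per-pass noise, which is where the accuracy gain comes from.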

### FACE_ENCODINGS_MODEL

Specifies the model type used for encoding face landmarks. It can be either 'small', which is faster and identifies only 5 key facial landmarks, or 'large', which is more precise, identifying 68 key facial landmarks, but requires more computational resources.


## Duplicate Finder

This component is responsible for identifying duplicate images in the system by comparing face embeddings. These embeddings are numerical representations of facial features generated during the face recognition process. By calculating the distance between the embeddings of different images, the system can determine whether two images contain the same person, helping in the identification and removal of duplicates or grouping similar faces together.

### FACE_DISTANCE_THRESHOLD

Specifies the maximum allowable distance between two face embeddings for them to be considered a match. It helps determine if two faces belong to the same person by setting a threshold for similarity. Lower values result in stricter matching, while higher values allow for more lenient matches.
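The distance comparison itself is straightforward. A sketch with toy 3-dimensional embeddings and an assumed threshold of 0.6 (the real value is whatever FACE_DISTANCE_THRESHOLD is configured to):

```python
import math

def face_distance(enc_a, enc_b):
    """Euclidean distance between two face embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(enc_a, enc_b)))

def is_match(enc_a, enc_b, threshold=0.6):
    """True when the embeddings are close enough to be the same person."""
    return face_distance(enc_a, enc_b) <= threshold

person = [0.10, 0.50, -0.30]
same_person = [0.12, 0.48, -0.31]   # distance ~0.03
someone_else = [0.90, -0.50, 0.40]  # distance ~1.46
print(is_match(person, same_person), is_match(person, someone_else))
```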

1 change: 1 addition & 0 deletions docs/components/hde/did/index.md
@@ -0,0 +1 @@
This feature consists of several interconnected components that work together to process images, detect and recognize faces, and find duplicate images using deep learning techniques.
63 changes: 63 additions & 0 deletions docs/components/hde/did/workflow.md
@@ -0,0 +1,63 @@
The Image Processing and Duplicate Detection workflow is designed to provide reliable face detection, recognition, and duplicate detection by leveraging a pre-trained deep learning model.

## Inference Mode Operation

This application operates strictly in inference mode, which means that it does not perform training but instead relies on a pre-trained model for face recognition tasks. This mode ensures that the application can rapidly deploy face recognition capabilities without the computational cost or time required for training models from scratch.

### Pre-Trained Model Usage

The pre-trained model is stored in Azure Blob Storage and is automatically downloaded by the application when it starts. This process ensures that the latest version of the model is always available for inference.

### Manual Model Update

In addition to automatic loading, administrators have the option to manually update the model through the admin panel. This feature provides flexibility for applying updates or new models when improvements or changes are required without modifying the underlying code.

## Model Details

The face recognition capabilities are powered by the [OpenCV](https://github.com/opencv/opencv) library. Currently, the application utilizes an open-source, pre-trained model specifically designed for face detection.

### Model Components

- **deploy.prototxt**: This file defines the model architecture, including the network layers and the specific parameters used for each layer. It serves as a blueprint that guides how the model processes input data.
- **res10_300x300_ssd_iter_140000.caffemodel**: This file contains the trained weights of the model. It was trained using the **Caffe** deep learning framework, with a total of 140,000 iterations, ensuring robustness in face detection tasks.

### Model Architecture

- The model follows the **Res10** architecture, which is known for its efficiency in detecting faces. Res10 is a lightweight model that balances speed and accuracy, making it suitable for real-time applications.
- The model operates with a fixed input resolution of **300x300**, optimizing detection for faces within that scale. This resolution offers a compromise between detail and processing efficiency, allowing the model to quickly identify facial features without excessive computational load.
- SSD Methodology. The model utilizes the **Single Shot MultiBox Detector (SSD)** methodology, which is a popular approach for object detection. SSD is designed to predict both the bounding boxes and the confidence scores for each object in a single forward pass through the network. By leveraging the SSD approach, the model can efficiently detect multiple faces in a single image, making it suitable for batch processing and applications where rapid detection is required.
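For one image, the SSD head emits a tensor of shape (1, 1, N, 7): N candidate detections of 7 values each. A pure-Python sketch of filtering that output by confidence; the row layout [image_id, class_id, confidence, x1, y1, x2, y2] with coordinates normalized to [0, 1] is the usual convention for this model family and should be treated as an assumption here, not something verified against this repository:

```python
# Mocked SSD output of shape (1, 1, N, 7) as nested lists. Assumed row layout:
# [image_id, class_id, confidence, x1, y1, x2, y2], coordinates in [0, 1].
detections = [[[
    [0.0, 1.0, 0.97, 0.25, 0.25, 0.50, 0.75],
    [0.0, 1.0, 0.12, 0.00, 0.00, 0.05, 0.05],  # likely a false positive
]]]

def filter_detections(output, confidence=0.5, width=300, height=300):
    """Keep rows above the confidence threshold and scale boxes to pixels."""
    faces = []
    for row in output[0][0]:
        score = row[2]
        if score >= confidence:
            x1, y1, x2, y2 = row[3:7]
            faces.append((score, (x1 * width, y1 * height,
                                  x2 * width, y2 * height)))
    return faces

print(filter_detections(detections))
```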


## Workflow Diagram

The workflow diagram illustrates the overall process of Image Processing and Duplicate Detection within the system, showcasing how different components interact to achieve **face detection**, **recognition**, and **duplicate identification**.

```mermaid
flowchart LR
subgraph DNNManager[DNN Manager]
direction TB
load_model[Load Model] -- computation <a href="../config/#dnn_backend">backend</a>\ntarget <a href="../config/#dnn_target">device</a> --> set_preferences[Set Preferences]
end
subgraph ImageProcessing[Image Processing]
direction LR
subgraph FaceDetection[Face Detection]
direction TB
load_image[Load Image] -- decoded image as 3D numpy array\n(height, width, channels of BlueGreenRed color space) --> prepare_image[Prepare Image] -- blob 4D tensor\n(normalized size, use <a href="../config/#blob_from_image_scale_factor">scale factor</a> and <a href="../config/#blob_from_image_mean_values">means</a>) --> run_model[Run Model] -- shape (1, 1, N, 7),\n1 image\nN is the number of detected faces\neach face is described by the 7 detection values --> filter_results[Filter Results] -- <a href="../config/#face_detection_confidence">confidence</a> is above the minimum threshold,\n<a href="../config/#nms_threshold">NMS</a> to suppress overlapping bounding boxes --> return_detections[Return Detections]
end
subgraph FaceRecognition[Face Recognition]
direction TB
load_image_[Load Image] --> detect_faces[Detect Faces] -- detected face regions\n<a href="../config/#face_encodings_num_jitters">number of times</a> to re-sample the face\n<a href="../config/#face_encodings_model">key facial landmarks</a> --> generate_encodings[Generate Encodings] -- numerical representations of the facial features\n(face's geometry and appearance) --> save_encodings[Save Encodings]
end
end
subgraph DuplicateFinder[Duplicate Finder]
direction TB
load_encodings[Load Encodings] --> compare_encodings[Compare Encodings] -- face distance less than <a href="../config/#face_distance_threshold">threshold</a> --> return_duplicates[Return Duplicates]
end
DNNManager --> ImageProcessing --> DuplicateFinder
FaceDetection --> FaceRecognition
```
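The Duplicate Finder stage of the diagram reduces to a pairwise comparison. A self-contained sketch, where the filenames, 3-dimensional embeddings, and 0.6 threshold are hypothetical, and the ignored-pairs parameter mirrors the demo's `ignore` script:

```python
import math
from itertools import combinations

# Hypothetical embeddings keyed by image name; real encodings are produced
# by the face recognition step above.
encodings = {
    "a.jpg": [0.10, 0.50, -0.30],
    "b.jpg": [0.12, 0.48, -0.31],  # very close to a.jpg: same person
    "c.jpg": [0.90, -0.50, 0.40],
}

def find_duplicates(encodings, threshold=0.6, ignored=frozenset()):
    """Compare every pair of embeddings; report pairs whose face distance is
    below the threshold, skipping pairs explicitly marked as ignored."""
    pairs = []
    for (name_a, enc_a), (name_b, enc_b) in combinations(encodings.items(), 2):
        if frozenset((name_a, name_b)) in ignored:
            continue
        dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(enc_a, enc_b)))
        if dist < threshold:
            pairs.append((name_a, name_b))
    return pairs

print(find_duplicates(encodings))  # only the a/b pair is close enough
```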
17 changes: 16 additions & 1 deletion docs/components/hde/index.md
@@ -1,5 +1,20 @@
# Deduplication Engine
# Deduplication

The Deduplication Engine is a component of the HOPE ecosystem. It provides users with powerful capabilities to identify and remove duplicate records within the system, ensuring that data remains clean, consistent, and reliable.


## Repository

<https://github.com/unicef/hope-dedup-engine>


## Features

- [Duplicated Image Detection](did/index.md)


## Help

**Got a question?** We've got answers.

File a GitHub [issue](https://github.com/unicef/hope-dedup-engine/issues).