- Add function to get an environment variable with a default value, incl. tests
- Allow setting 'mmproj_use_gpu' through OCR_NO_GPU with default true
- Allow setting 'gpu_layers' through OCR_GPU_LAYERS with default 99 => reducing helps with insufficient memory
- Terminate if initialization (of the model) fails

Reviewed-on: https://codeberg.org/gitdode/ocr-cpp/pulls/14
ocr-cpp
About
OCR Service in C++.
Web service to perform optical character recognition on images using Tesseract API with Leptonica.
There is experimental support for optionally using a local LLM such as GLM OCR with llama.cpp to recognize, e.g., handwritten text, which Tesseract is not designed for.
The web service uses cpp-httplib, which provides multithreading depending on the number of available logical CPUs, and json to structure recognized text in a JSON array. Plain text, hOCR and TSV output are supported as well.
Since a Tesseract API instance cannot be used concurrently, one instance is created per thread and reused for all requests handled by that thread.
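The per-thread reuse described above can be sketched with a thread_local object. This is only an illustration under assumptions: Recognizer and thread_recognizer are hypothetical names standing in for a Tesseract API wrapper, not the classes used in this project.

```cpp
#include <thread>

// Hypothetical stand-in for a Tesseract API wrapper (not the project's class).
class Recognizer {
public:
    Recognizer() : owner_(std::this_thread::get_id()) {}

    // The thread that created this instance.
    std::thread::id owner() const { return owner_; }

    // Counts how often this instance has been used.
    int use() { return ++uses_; }

private:
    std::thread::id owner_;
    int uses_ = 0;
};

// Lazily creates one Recognizer per calling thread on first use and
// returns the same instance on every later call from that thread.
Recognizer& thread_recognizer() {
    thread_local Recognizer instance;
    return instance;
}
```

Each worker thread of the HTTP server would call thread_recognizer() in its request handler, so an expensive initialization happens at most once per thread.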
Only one LLM recognition can run at a time; concurrent LLM requests block until the ongoing one completes. Recognitions using Tesseract are handled concurrently, even while the LLM is busy.
Images are converted to JPEG, so all image formats supported by libvips can be used for LLM recognitions, including multipage TIFF and PDF. If necessary, images are scaled down to a reasonable size to improve performance.
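The downscaling step amounts to fitting the image into a bounding size while keeping the aspect ratio. The sketch below shows only that arithmetic; Size, fit_within and the limit passed in are assumptions for illustration, and the service itself relies on libvips for the actual resizing.

```cpp
#include <algorithm>
#include <cmath>

struct Size {
    int width;
    int height;
};

// Shrinks `in` so that its longer side is at most `max_side`,
// keeping the aspect ratio. Images that already fit are returned
// unchanged, so nothing is ever scaled up.
Size fit_within(Size in, int max_side) {
    const int longer = std::max(in.width, in.height);
    if (longer <= max_side) {
        return in;
    }
    const double factor = static_cast<double>(max_side) / longer;
    return Size{
        std::max(1, static_cast<int>(std::lround(in.width * factor))),
        std::max(1, static_cast<int>(std::lround(in.height * factor)))};
}
```

For example, a 4000x3000 image with a limit of 2000 is reduced by a factor of 0.5, while an 800x600 image is left as it is.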
As a little by-product, the service can also generate thumbnails very efficiently.
Limitations
- LLM recognition currently only returns plain text/Markdown
Building
A C++ compiler supporting modules, Make and the dependent libraries are required to build and test the project. Building works fine on Debian 13 with the following toolchain and libraries installed from the Debian repository with 'apt':
- GNU Make
- g++ 15.2.0
- doxygen
- catch2
- libtesseract5
- libleptonica6
- libcpp-httplib0.41
- nlohmann-json3
- libicu78
- libvips
The following command should install all that is needed to build the project:
sudo apt install build-essential libtesseract-dev libleptonica-dev \
libcpp-httplib-dev nlohmann-json3-dev libicu-dev libvips-dev \
doxygen catch2
Since this project currently uses the internal API of llama.cpp, it is probably
easiest to download the llama.cpp source, build it, and copy the needed
headers to e.g. /usr/local/include/llama
and the shared libraries to /usr/local/lib/llama.
Once all dependencies are satisfied the project can be built by running:
export LD_LIBRARY_PATH=/usr/local/lib/llama
make
This compiles the executable ocr.
Testing
To build and run the tests:
export LD_LIBRARY_PATH=/usr/local/lib/llama
make test
This compiles the executable ocr-test and runs it, which executes the actual
tests.
Running
It might be necessary to install libtesseract5 and language files, and other libraries:
sudo apt install libtesseract5 tesseract-ocr-deu tesseract-ocr-eng \
tesseract-ocr-fra libleptonica6 libcpp-httplib0.18 libicu78 \
libvips42t64 libopenblas0 libvulkan1
The service can be started with, for example:
export LD_LIBRARY_PATH=/usr/local/lib/llama
./ocr 0.0.0.0 8080 /path/to/GLM-OCR-Q8_0.gguf /path/to/mmproj-GLM-OCR-Q8_0.gguf
Available parameters are:
- bind address: e.g. 0.0.0.0 or a hostname/FQDN
- http port: e.g. 8080
- model path: the path to the model, e.g. /path/to/GLM-OCR-Q8_0.gguf
- mmproj path: the path to the multimodal projection, e.g. /path/to/mmproj-GLM-OCR-Q8_0.gguf
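Besides these positional parameters, the change notes at the top mention that 'gpu_layers' can be set through the environment variable OCR_GPU_LAYERS with default 99. Reading an environment variable with a fallback could look roughly like the sketch below; env_or and env_int_or are hypothetical names, and falling back to the default on unparsable input is an assumption, not necessarily what the project's function does.

```cpp
#include <charconv>
#include <cstdlib>
#include <cstring>
#include <string>
#include <system_error>

// Returns the value of the environment variable `name`, or `fallback`
// if it is unset or empty. (Hypothetical helper.)
std::string env_or(const char* name, const std::string& fallback) {
    const char* value = std::getenv(name);
    if (value == nullptr || *value == '\0') {
        return fallback;
    }
    return std::string(value);
}

// Same for integers; falls back if the value is unset, empty or not a
// complete decimal number. (Assumed behaviour for illustration.)
int env_int_or(const char* name, int fallback) {
    const char* value = std::getenv(name);
    if (value == nullptr || *value == '\0') {
        return fallback;
    }
    int parsed = 0;
    const char* last = value + std::strlen(value);
    const auto [ptr, ec] = std::from_chars(value, last, parsed);
    return (ec == std::errc{} && ptr == last) ? parsed : fallback;
}
```

With such a helper, env_int_or("OCR_GPU_LAYERS", 99) yields 99 unless the variable is set to a valid number, for example a lower value on a machine with insufficient memory.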
An effort is made to clean up before exiting on CTRL-C.
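A common pattern for this kind of cleanup is a SIGINT handler that only sets a flag, with the actual teardown done outside the handler. The sketch below shows that pattern under assumptions; the names are hypothetical and the project may handle the signal differently, for instance by stopping the HTTP server directly.

```cpp
#include <csignal>

// Set from the signal handler, read from normal code.
volatile std::sig_atomic_t stop_requested = 0;

// Signal handlers should do as little as possible: just record the request.
extern "C" void on_sigint(int) {
    stop_requested = 1;
}

// Registers the handler for CTRL-C (SIGINT).
void install_sigint_handler() {
    std::signal(SIGINT, on_sigint);
}

// Polled by the main thread, which then stops the server and
// releases models and other resources before exiting.
bool should_stop() {
    return stop_requested != 0;
}
```

The main thread would call install_sigint_handler() at startup and perform its cleanup once should_stop() returns true.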
Using
Images can be sent with PUT to the REST endpoint /ocr with the following query parameters:
- llm: 'true' for using the LLM, 'false' or absent to use Tesseract.
- lang: e.g. 'lang=en', applies only to Tesseract. The matching language file must be available.
- format: can be one of: 'text', 'json', 'hocr', 'tsv'. Ignored by the LLM option.
Just for fun, the header X-Prompt can be used to override the default prompt
with for example "Could you describe the image for me?"
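How a handler might interpret the query parameters can be sketched as follows. This is an assumption-laden illustration: OcrOptions, parse_format and parse_options are hypothetical names, and rejecting an unknown format is a design choice made here, not documented behaviour of the service. The default of plain text matches the examples, which get plain text without passing 'format'.

```cpp
#include <map>
#include <optional>
#include <string>

enum class Format { Text, Json, Hocr, Tsv };

struct OcrOptions {
    bool use_llm = false;          // 'llm=true' selects the LLM
    std::string lang;              // empty if 'lang' was not given
    Format format = Format::Text;  // plain text unless requested otherwise
};

// Maps the 'format' value to the enum; empty optional if unknown.
std::optional<Format> parse_format(const std::string& value) {
    if (value == "text") return Format::Text;
    if (value == "json") return Format::Json;
    if (value == "hocr") return Format::Hocr;
    if (value == "tsv") return Format::Tsv;
    return std::nullopt;
}

// Builds the options from the query parameters. Returns an empty optional
// for an unknown format (assumed here to be treated as a bad request).
std::optional<OcrOptions> parse_options(
    const std::map<std::string, std::string>& query) {
    OcrOptions options;
    if (auto it = query.find("llm"); it != query.end()) {
        options.use_llm = (it->second == "true");
    }
    if (auto it = query.find("lang"); it != query.end()) {
        options.lang = it->second;
    }
    // 'format' is ignored by the LLM option, so it is only checked otherwise.
    if (auto it = query.find("format");
        it != query.end() && !options.use_llm) {
        const auto format = parse_format(it->second);
        if (!format) {
            return std::nullopt;
        }
        options.format = *format;
    }
    return options;
}
```

A request to /ocr?lang=en would thus yield Tesseract with plain text output, and /ocr?llm=true&format=tsv would use the LLM and ignore the format.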
Examples
Recognize text in 'res/eng.png' with Tesseract and return it as plain text:
curl --request PUT --url 'http://localhost:8080/ocr?lang=en' \
--data-binary @res/eng.png
Recognize script in 'res/scribble.png' with llama and return it as plain text:
curl --request PUT --url 'http://localhost:8080/ocr?llm=true' \
--data-binary @res/scribble.png
Generate a thumbnail (JPEG):
curl --request PUT --url 'http://localhost:8080/thu' \
--data-binary @res/eng.png --output thumb.jpg
Container
Building a container image currently requires some manual work:
- Copy the llama shared libraries into the build context, e.g.:
  mkdir -p llama && cp -r /usr/local/lib/llama/* llama/
- Copy the model and multimodal projection to models/model.gguf and models/mmproj.gguf
- Run make image
To run the container:
docker run --privileged --rm -p 8080:8080 gitdode/ocr-cpp
Running the container in privileged mode is necessary to give access to the graphics device.
Documentation
To update the documentation in the directory doc, run:
make doc
Disassembly
For insights on what the compiler produces, like how many and which instructions are executed for a function, run
make disasm
and have a look at ocr.lst.
Clang/Eclipse CDT LSP Editor
The BMI for libstdc++ needs to be precompiled once initially, for example:
make bmi
The module interfaces *.cppm are precompiled the same way after they have
been modified. For the changes to be reflected in the LSP editor, it is
enough to edit an affected file.
TODO
- Write more tests (always)
- See issues