EXTRACTING INFORMATION FROM MEDICINE USING DEEP LEARNING AND COMPUTER VISION

Vasugupta
Jul 5, 2020

In a world where technology brings everything to our fingertips, deep learning plays a vital role in solving many of today's major problems.

So What is Deep Learning?

Deep learning is a subset of machine learning and artificial intelligence that uses neural networks capable of learning from unstructured or unlabeled data, often without supervision.

Introduction to the Problem

Visually impaired people face many hardships in society. Whether the impairment is congenital or caused by some external factor, it brings a host of day-to-day problems with it. One of them is being cheated by others, whether at the supermarket, by an auto-rickshaw driver, or even by a shrewd chemist. Hardly anyone focuses on this problem, yet it exists in our society. To stop people from getting cheated, our vision is to design a tool that can read out the details printed on a medicine package by simply scanning an image of it.

A visually impaired person using a stick to walk.

The proposed solution extracts basic but important details from the medicine package, such as the name of the medicine, the manufacturing date, the expiry date, and the MRP (maximum retail price), and converts them into speech.

We propose a model that detects and recognizes text in images and converts it into speech using a deep learning framework. Extracting text in a machine-readable format from real-world images is a complex and challenging task in the computer vision community.

The ultimate goal is to design an end-to-end framework, which internally has two steps: Text Detection and Recognition.

The problem breaks down into three main functionalities:

  1. Text Detection
  2. Text Recognition
  3. Converting text into speech

Dataset Used

  • The dataset used to train the model was the one provided for the ICDAR 2019 Robust Reading Challenge on Multi-Lingual Scene Text Detection and Recognition (MLT 2019).
  • The training set is composed of 10,000 images, each paired with a text file containing the word-level bounding-box coordinates and the transcription of the text in the image (a parsing sketch follows).
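As a minimal sketch, here is how one such ground-truth file could be read, assuming the MLT-style format of four corner points followed by a script label and the transcription; the exact field layout of the downloaded files should be verified:

```python
# Sketch of parsing one MLT-style ground-truth file. Assumes each line looks like
# "x1,y1,x2,y2,x3,y3,x4,y4,Latin,PARACETAMOL"; adapt to the actual file format.
def parse_gt_file(path):
    annotations = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split(",")
            if len(parts) < 10:
                continue
            coords = list(map(int, parts[:8]))             # four (x, y) corner points
            script, text = parts[8], ",".join(parts[9:])   # transcription may contain commas
            annotations.append({"box": coords, "script": script, "text": text})
    return annotations
```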

EAST (Efficient and Accurate Scene Text Detector)

EAST is a very effective method for text detection. It is a detection-only method: it localizes text but does not recognize it, so it can be combined with any text recognition method. It can find both horizontal and rotated bounding boxes and works on images as well as video. We re-implemented the EAST text detector using TensorFlow to build a model suited to our needs.

Methodology

ResNet-50 is a pre-existing convolutional neural network that is 50 layers deep. We retrained this network on our dataset for 1,000 epochs and then extracted a frozen model, which reached an accuracy of 81%.

A training sample is shown below.

Training sample

The trained model was then used to detect and localize the text on the image.

The image was then cropped according to the rectangular boxes drawn on it, and each crop was passed to the pytesseract library to extract the text. This text was then converted into speech using pyttsx3 and Google Translator.
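A minimal sketch of the speech step with pyttsx3 follows; the speaking rate and the sample sentence are illustrative assumptions, not the project's actual settings:

```python
import pyttsx3

def speak(text):
    # Initialize the offline text-to-speech engine and read the text aloud
    engine = pyttsx3.init()
    engine.setProperty("rate", 150)  # assumed speaking rate; tune to taste
    engine.say(text)
    engine.runAndWait()

speak("Paracetamol 500 mg. Expiry date: July 2022. MRP: 25 rupees.")
```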

Text reading involves detecting and recognizing text in the input images, which typically falls under optical character recognition (OCR), a technique already used widely in many computer vision (CV) applications. So instead of focusing only on information extraction, we bridge the text reading and information extraction tasks with shared features.

Glimpse of Dataset
Graphs

Text Recognition

For text recognition we used the pytesseract library. Each region detected by running the model inference is cropped out along its rectangular box and passed to pytesseract for recognition, as sketched below. The result is then converted into speech and written to a temporary file.
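A hedged sketch of that loop, assuming the detected boxes arrive as (start_x, start_y, end_x, end_y) tuples on an OpenCV image array and using an assumed Tesseract configuration:

```python
import tempfile
import pytesseract

def recognize_boxes(image, boxes, config="--oem 1 --psm 7"):
    """Crop each detected box, run Tesseract on it, and save the results to a temp file."""
    lines = []
    for (start_x, start_y, end_x, end_y) in boxes:
        crop = image[start_y:end_y, start_x:end_x]  # NumPy slicing of the image array
        text = pytesseract.image_to_string(crop, config=config).strip()
        if text:
            lines.append(text)

    # Write the recognized text to a temporary file, as described above
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as f:
        f.write("\n".join(lines))
        temp_path = f.name
    return lines, temp_path
```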

So what is Pytesseract?

Tesseract OCR is an open-source engine that can be used directly and has wrappers for many programming languages; pytesseract is its Python wrapper.

To recognize an image containing text of arbitrary length, Tesseract (and hence pytesseract) uses recurrent neural networks; LSTM (Long Short-Term Memory) is one of the most popular and effective RNN variants because it is good at learning sequences.

Before using pytesseract, you need to install the Tesseract OCR engine itself (pytesseract is only a wrapper around it) and then update the path in the Image Recognition.py file to match wherever you installed Tesseract.
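For example, on Windows the wrapper usually has to be pointed at the Tesseract executable; the install path below is only an assumption, so substitute your own:

```python
import pytesseract

# Assumed default Windows install location; use the actual path to the
# tesseract binary on your machine (on Linux/macOS it is usually on PATH).
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
```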

So follow these steps to extract text from an image (a minimal sketch of the detection steps follows the list):

  1. Preprocess the image: resize it and compute the width/height ratios.
  2. Load the pre-trained EAST model and define its output layers.
  3. Forward-pass the image through the EAST model.
  4. Decode the bounding boxes from the EAST model's predictions.
  5. Get the final bounding boxes after non-maximum suppression.
  6. Generate a list of bounding-box coordinates and recognize the text in each box using pytesseract.
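Here is a minimal sketch of steps 2 to 5, running a frozen EAST graph through OpenCV's dnn module. The file name and output-layer names follow OpenCV's EAST sample and are assumptions about the exact export used in this project:

```python
import cv2
import numpy as np

# Assumed frozen graph; output-layer names follow OpenCV's EAST sample.
net = cv2.dnn.readNet("frozen_east_text_detection.pb")
output_layers = ["feature_fusion/Conv_7/Sigmoid",  # text/no-text confidence
                 "feature_fusion/concat_3"]        # box geometry + angle

def detect_text(image, size=320, conf_threshold=0.5, nms_threshold=0.4):
    h, w = image.shape[:2]
    ratio_w, ratio_h = w / float(size), h / float(size)

    # Step 3: forward pass through the EAST network
    blob = cv2.dnn.blobFromImage(image, 1.0, (size, size),
                                 (123.68, 116.78, 103.94), swapRB=True, crop=False)
    net.setInput(blob)
    scores, geometry = net.forward(output_layers)

    # Step 4: decode bounding boxes from the score and geometry maps
    boxes, confidences = [], []
    rows, cols = scores.shape[2:4]
    for y in range(rows):
        for x in range(cols):
            score = float(scores[0, 0, y, x])
            if score < conf_threshold:
                continue
            top, right, bottom, left, angle = geometry[0, :, y, x]
            cos, sin = np.cos(angle), np.sin(angle)
            offset_x, offset_y = x * 4.0, y * 4.0   # output map is 4x smaller than the input
            end_x = int(offset_x + cos * right + sin * bottom)
            end_y = int(offset_y - sin * right + cos * bottom)
            box_w, box_h = int(left + right), int(top + bottom)
            boxes.append([end_x - box_w, end_y - box_h, box_w, box_h])
            confidences.append(score)

    if not boxes:
        return []

    # Step 5: non-maximum suppression, then rescale boxes to the original image size
    keep = cv2.dnn.NMSBoxes(boxes, confidences, conf_threshold, nms_threshold)
    results = []
    for i in np.array(keep).flatten():
        bx, by, bw, bh = boxes[i]
        results.append((int(bx * ratio_w), int(by * ratio_h),
                        int((bx + bw) * ratio_w), int((by + bh) * ratio_h)))
    return results
```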

In our case, we used a specific Tesseract configuration. The available configuration options are listed below, followed by a short example of how to pass them.

oem (OCR Engine modes):

  • 0: Legacy engine only.
  • 1: Neural nets LSTM engine only.
  • 2: Legacy + LSTM engines.
  • 3: Default, based on what is available.

psm (Page segmentation modes):

  • 0: Orientation and script detection (OSD) only.
  • 1: Automatic page segmentation with OSD.
  • 2: Automatic page segmentation, but no OSD or OCR. (not implemented)
  • 3: Fully automatic page segmentation, but no OSD. (default)
  • 4: Assume a single column of text of variable sizes.
  • 5: Assume a single uniform block of vertically aligned text.
  • 6: Assume a single uniform block of text.
  • 7: Treat the image as a single text line.
  • 8: Treat the image as a single word.
  • 9: Treat the image as a single word in a circle.
  • 10: Treat the image as a single character.
  • 11: Sparse text. Find as much text as possible in no particular order.
  • 12: Sparse text with OSD.
  • 13: Raw line. Treat the image as a single text line, bypassing Tesseract-specific hacks.
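For instance, a configuration string combining an engine mode and a page segmentation mode could look like this (the project's exact modes are not stated, so the values here are an assumption):

```python
import pytesseract
from PIL import Image

# --oem 1 selects the LSTM engine; --psm 7 treats the image as a single text line.
# "crop.png" is a placeholder for a cropped text region.
custom_config = "--oem 1 --psm 7"
text = pytesseract.image_to_string(Image.open("crop.png"), config=custom_config)
print(text)
```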

Results

  • The text is extracted from the image and converted into speech.
  • The image should be clear enough for the text to be recognized successfully.
  • After training the model for about 1,000 epochs, we obtained an accuracy of 81% for text detection. After detection, the boxes are cropped and passed to pytesseract.
  • We used Tesseract for OCR because it proved more efficient than the other OCR tools we compared.
  • The total loss was around 10%.
Sample of Text Detection
Sample of Text Recognition

Conclusion

We devised a solution to the text detection and text recognition problem and then converted the output into speech, which the user can listen to through a hearing device or a Bluetooth earpiece.

We used two different approaches for text detection and recognition because the test data may contain curved text or curved images, on which a plain OCR engine alone does not work well. We therefore applied a dedicated text detection technique first and then passed the detected regions of the image to a conventional OCR engine such as pytesseract, which improves the overall OCR accuracy.

We would like to extend this project by deploying it as an app and connecting it to the walking stick, so that the person can use it with the help of Bluetooth.

Thank you so much for taking the time to read this blog. If you liked it, please give it some applause!

Thank you, Roohein Malhotra and Chahat Bindra; without you two, this project would not have been a success.
