Data processing: implementation of a Big Data environment


Today, effective processing of acquired data is a fundamental area of Big Data. Proficiency in this area significantly reduces the time needed to make business decisions and enables conclusions to be drawn faster from large volumes of historical data.

After analysing a recently completed project for a customer from the pharmaceutical sector, in which data processing time was significantly reduced by means of the Hadoop ecosystem, Apollogic experts concluded that the currently available range of Big Data methods and tools allows data management and processing in a company to be improved even further.

The first objective of the new project was to create an application in Scala that consumes the received data and delivers results faster than before. An additional step was to test Apache Spark, a technology that uses RAM for computation. In this way, Apollogic's Big Data specialists wanted to show that a correct implementation of this tool can speed up data processing several times over compared with the Apache Hive solution used originally. Because of limited hardware resources, analogous data samples of 15 GB each were prepared for testing purposes.
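As a rough sketch of the kind of Spark application described above (the paths, column names, and aggregation are illustrative assumptions, not details from the project), a Scala job that reads input data and computes business indicators with the DataFrame API might look like this. The same aggregation in Hive would typically run as disk-backed MapReduce stages, whereas Spark keeps intermediate data in executor memory.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SampleAggregation {
  def main(args: Array[String]): Unit = {
    // Configuration (master, resources) is normally supplied via spark-submit
    val spark = SparkSession.builder()
      .appName("SampleAggregation")
      .getOrCreate()

    // Hypothetical input path and schema; the project's real data layout is not public
    val events = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/events.csv")

    // An aggregation Hive would execute as MapReduce jobs,
    // which Spark instead evaluates in memory across the cluster
    val perCategory = events
      .groupBy("category")
      .agg(count("*").as("events"), avg("value").as("avg_value"))

    perCategory.write.mode("overwrite").parquet("hdfs:///output/per_category")
    spark.stop()
  }
}
```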


  • The Apache Hadoop platform was restored on a private server where the replicated data were stored
  • The data-processing platform was built on a 4-node Apache Hadoop cluster (Cloudera distribution)
  • Apache Spark's capabilities were used to full advantage, chiefly "broadcast" variables and the appropriate allocation of resources at application start-up
  • The main parameters of the Hadoop cluster: 4 nodes, 128 GB of RAM, 32 processors, 2 TB of disk capacity
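The "broadcast" technique mentioned above can be sketched as follows (the lookup table and order data are hypothetical): a small reference dataset is shipped once to every executor instead of being shuffled with every task, which makes map-side lookups and joins cheap.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("BroadcastSketch").getOrCreate()
    val sc = spark.sparkContext

    // Small reference data: fits comfortably in each executor's memory
    val countryNames = Map("PL" -> "Poland", "DE" -> "Germany", "FR" -> "France")

    // Broadcast: the map is sent once per executor, not once per task
    val namesBc = sc.broadcast(countryNames)

    // Hypothetical "large" dataset of (countryCode, amount) pairs
    val orders = sc.parallelize(Seq(("PL", 120.0), ("DE", 80.0), ("PL", 40.0)))

    // Map-side lookup against the broadcast value: no shuffle of the small table
    val withNames = orders.map { case (code, amount) =>
      (namesBc.value.getOrElse(code, "unknown"), amount)
    }

    withNames.collect().foreach(println)
    spark.stop()
  }
}
```

Allocating resources "at the start of the application" is typically done through `spark-submit` options such as `--num-executors`, `--executor-memory` and `--executor-cores`, sized to the cluster parameters listed above.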


  • By means of Apache Spark, the time needed to obtain business indicators was shortened: individual data-processing steps completed 2 to 20 times faster than in the project based on Apache Hive. The size of the speed-up depended on the operations performed while computing the results.
  • Intermediate results were kept in memory, which reduced data-access time while avoiding unnecessary writes and loss of disk space on the Hadoop platform.
  • Apache Spark proved to be one of the fastest tools currently available on the market, with the additional advantage of being open source.
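Keeping intermediate results in memory, as described above, corresponds to Spark's caching mechanism. A minimal sketch (paths and column names are assumptions for illustration): an expensive intermediate DataFrame is persisted in executor memory and reused by several downstream computations instead of being recomputed or written back to HDFS.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CachingSketch").getOrCreate()

    // Hypothetical input location
    val raw = spark.read.parquet("hdfs:///data/measurements")

    // Intermediate step shared by several reports: persist it in memory
    // rather than saving it to disk on the Hadoop platform
    val cleaned = raw
      .filter("value IS NOT NULL")
      .persist(StorageLevel.MEMORY_ONLY)

    // Both aggregations reuse the cached data; without persist(),
    // each action would recompute the filter from the raw input
    cleaned.groupBy("day").count()
      .write.mode("overwrite").parquet("hdfs:///output/daily")
    cleaned.groupBy("site").count()
      .write.mode("overwrite").parquet("hdfs:///output/by_site")

    cleaned.unpersist()
    spark.stop()
  }
}
```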


Apache Spark, Apache Hadoop, Apache Hive, Scala