Data processing
Implementation of a Big Data environment


The quantity of data obtained from various sources is growing rapidly. Many companies cannot keep up with the solutions required to process it, do not know how to use the information they obtain effectively, and are therefore unable to present it in a form their managers can understand.

A similar problem occurred in the analysed pharmaceutical company. The growing volume of data, and processing that took more and more time, convinced the managers to transfer information from many dispersed databases into a single Data Lake built on the Apache Hadoop environment. This solution was expected to speed up data processing considerably, and the data were used to establish a global sales strategy.


  • Creation of a system responsible for the automatic integration of all data in one location (Data Lake)
  • The developed system served the entire company, providing access to both granular data and integrated, aggregated data from many commercial sources
  • A platform responsible for processing terabytes of data was created on a multi-node Apache Hadoop cluster (Cloudera distribution)
  • The main parameters of the Hadoop cluster are as follows: 18 nodes, 1.32 TB of RAM, 468 processor cores, 919 TB of disk capacity
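As a rough sanity check, the per-node resources implied by the cluster figures above can be derived with simple arithmetic. This sketch assumes resources are spread evenly across the 18 nodes, which the case study does not state:

```python
# Cluster totals quoted in the case study
NODES = 18
RAM_TB = 1.32       # total RAM across the cluster
CORES = 468         # total processor cores
DISK_TB = 919       # total disk capacity

# Per-node averages (even spread is an assumption, not stated in the source)
ram_per_node_gb = RAM_TB * 1024 / NODES
cores_per_node = CORES / NODES
disk_per_node_tb = DISK_TB / NODES

print(f"RAM per node:  ~{ram_per_node_gb:.0f} GB")   # ~75 GB
print(f"Cores per node: {cores_per_node:.0f}")       # 26
print(f"Disk per node: ~{disk_per_node_tb:.0f} TB")  # ~51 TB
```

Roughly 75 GB of RAM, 26 cores, and 51 TB of disk per node is a plausible profile for a Cloudera worker node of that era, which suggests the quoted totals are internally consistent.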


  • Projects using Big Data technology considerably speed up obtaining results, and the ease of connecting data from various sources enables analyst-friendly visualisation
  • The Hadoop platform gives the company faster and more detailed insight into its data, so employees no longer spend time on constant data integration
  • The entire project uses Big Data technology, the Hadoop ecosystem, to collect data from various sources into one Data Lake
  • Thanks to the Big Data implementation, the project delivers analyses over 27 days earlier than before
  • Predictive analyses supporting the global sales strategy can be obtained nearly one month earlier


Apache Hadoop, Apache Hive, BASH scripts triggered by the ETL tool Informatica Big Data Edition, Qlik Sense
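The stack listed above suggests a daily flow along these lines: the ETL tool triggers a BASH script that lands files in HDFS, after which the landed directory is registered as a Hive partition. The following is a minimal sketch of that orchestration step using only the Python standard library; all paths, table names, and script names are hypothetical, not taken from the case study:

```python
from datetime import date

def hdfs_landing_path(source: str, load_date: date) -> str:
    """Build a hypothetical Data Lake landing path for one source and day."""
    return f"/datalake/raw/{source}/dt={load_date:%Y-%m-%d}"

def ingest_command(script: str, source: str, load_date: date) -> list:
    """Argument vector for the BASH ingest step an ETL tool would launch."""
    return ["bash", script, source, f"{load_date:%Y-%m-%d}"]

def hive_add_partition(table: str, source: str, load_date: date) -> str:
    """Hive DDL that registers a landed HDFS directory as a new partition."""
    path = hdfs_landing_path(source, load_date)
    return (f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (dt='{load_date:%Y-%m-%d}') LOCATION '{path}'")

if __name__ == "__main__":
    d = date(2020, 1, 31)
    print(ingest_command("ingest.sh", "sales_eu", d))
    print(hive_add_partition("raw.sales", "sales_eu", d))
```

In a real deployment the ingest command would be launched by an Informatica workflow (for example via `subprocess.run`), and the DDL statement executed against Hive; partitioning by load date keeps daily loads idempotent and cheap to query.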