Big Data Analytics in Healthcare – Highlighting Challenges

By James Tredwell on November 7, 2019

Big Data analytics is the process of mining large data sets for understanding, information, insight or knowledge. It extracts useful information through a computation pipeline that transfers, stores and analyzes data for an entire application.

In the section below, we highlight the challenges of Big Data analytics in healthcare.

Data aggregation into large volumes:

The most frequently used technique for aggregating and transferring massive amounts of data is to copy it to a storage device, but this becomes less effective as volume rises. Big Data typically has to be aggregated across multiple organisations, geographic locations and machines. Creating big data sets by replication from production should therefore minimize the ongoing use of network and database resources, so that the production system can keep running.

Furthermore, it is very difficult to manage the transfer of data between organizations and databases, so a secondary database is generally set up outside the processing systems. Another aggregation strategy is to move data across a network. However, transferring, aggregating and indexing large quantities of data takes a significant amount of time. A third alternative is to replicate data from the sources and process it iteratively across instances and nodes, as Hadoop does when it replicates and stores file blocks for distributed batch processing.
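To make the block-replication idea concrete, here is a minimal sketch in Python. It is not Hadoop's actual placement logic (HDFS placement is rack-aware); the function name, node names and the round-robin assignment are assumptions made purely for illustration. The block size of 128 MB and replication factor of 3 mirror commonly used HDFS defaults.

```python
def place_blocks(file_size_mb, nodes, block_size_mb=128, replication=3):
    """Split a file into fixed-size blocks and assign each block to
    `replication` distinct nodes using simple round-robin placement."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    # Ceiling division: a partially filled last block still needs storing.
    num_blocks = -(-file_size_mb // block_size_mb)
    placement = {}
    for block_id in range(num_blocks):
        # Consecutive offsets modulo the node count never repeat a node
        # as long as replication <= len(nodes).
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return placement


# Hypothetical cluster of four nodes storing a 300 MB file.
layout = place_blocks(300, ["node1", "node2", "node3", "node4"])
for block_id, replicas in layout.items():
    print(block_id, replicas)
```

Because every block lives on several nodes, a batch job can read a copy close to where it runs, and losing one node does not lose the data.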

Data is segmented and siloed:

There are numerous challenges in implementing big data analytics to its full extent, and the foremost is the data itself: the data used in healthcare is usually segmented and siloed.

For instance, financial data such as claims, reimbursement and cost information is available only to the administrative team. This is the business-side data of healthcare and is not used for patient care or treatment.

An EHR (Electronic Health Record) consists of patient history, vital signs, progress notes and the results of diagnostic tests. All of this can be summed up as clinical data, which is accessed and maintained by healthcare practitioners, nurses and doctors and serves the purpose of treatment.

Data on quality and outcomes, such as surgical site infections, return-to-surgery rates, patient falls, and the value-based purchasing measures of the Centers for Medicare & Medicaid Services (CMS), sits in the quality or risk management departments. This data is collected and typically used to measure the provider’s performance retrospectively. According to a study conducted in the USA, 43% of healthcare data remains in silos.

For effective and efficient use of data analytics, there is a pressing need to combine all such data. Experts are working hard towards this and are coming up with solutions such as data warehouses and decision support databases, which act as enablers for combining such data sets.
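As a rough illustration of what combining siloed data sets means in practice, the sketch below merges financial, clinical and quality records into one view per patient. It assumes that every silo shares a common `patient_id` key, which is often the hard part in real systems; the function name, field names and sample values are all hypothetical.

```python
from collections import defaultdict


def combine_silos(**silos):
    """Group records from several departmental data sets by patient.

    Each silo is a list of dicts that carry a shared `patient_id` key
    (an assumption; real systems often need record linkage first).
    """
    combined = defaultdict(dict)
    for silo_name, records in silos.items():
        for record in records:
            fields = {k: v for k, v in record.items() if k != "patient_id"}
            # Keep a list per silo so repeated records are not overwritten.
            combined[record["patient_id"]].setdefault(silo_name, []).append(fields)
    return dict(combined)


# Hypothetical sample data for three silos.
finance = [{"patient_id": "P1", "claim_amount": 1200.0}]
clinical = [
    {"patient_id": "P1", "diagnosis": "hypertension"},
    {"patient_id": "P2", "diagnosis": "asthma"},
]
quality = [{"patient_id": "P2", "readmitted_30d": False}]

view = combine_silos(finance=finance, clinical=clinical, quality=quality)
print(view["P1"])
```

A data warehouse does the same thing at scale, with the added work of reconciling identifiers, formats and update schedules across departments.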

Data maintenance:

Since Big Data, by definition, comprises massive amounts of information, it is very difficult to store and maintain for ongoing queries, particularly with continuous batch processing of data. For smaller organizations, time and cost are constraints in dealing with large amounts of data. Another healthcare industry concern is the need to constantly update actual patient data, metadata and data profiles; otherwise the analytics will become ineffective.

Many solutions are available to address this concern, including NoSQL / NewSQL and other storage systems (MongoDB, HBase, Voldemort, Cassandra, the Hadoop Distributed File System and Google’s Bigtable). With data maintenance, legal and ethical issues are significant problems. Data sets are usually governed by security, confidentiality and privacy rules, which hold those responsible for retaining the information accountable.

Under HIPAA requirements, 18 types of identifiers must be removed from patient information. Privacy concerns can be addressed by applying appropriate software and database technologies, such as key-value storage services. Most hospitals house their data in server racks in a highly secure building, and providers are generally not permitted to use cloud facilities.

Unstructured Data Sets:

The volume of unstructured data may be the biggest challenge in aggregating and analyzing large healthcare data sets. Structured, or discrete, data is data that can be stored in and retrieved from a relational database.

In patients’ EHRs, unstructured healthcare data includes test results, scanned documents, images and progress notes. While standards such as the Clinical Document Architecture allow EHR data to be interoperable and shared, the contents of the defined fields are often free text and therefore unstructured data.

As free-text search technology matures and natural language processing is incorporated into it, unstructured data is likely to become one of the most valuable parts of the big data picture in healthcare.
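To show why free text is harder to work with than a discrete field, the sketch below pulls a blood pressure reading out of a progress note with a regular expression. The pattern, function name and sample notes are assumptions for illustration only; production systems rely on far more robust natural language processing than a single pattern.

```python
import re

# Matches "BP 142/91" or "blood pressure was 120 / 80" (illustrative pattern).
BP_PATTERN = re.compile(
    r"\b(?:BP|blood pressure)\D{0,10}(\d{2,3})\s*/\s*(\d{2,3})\b",
    re.IGNORECASE,
)


def extract_blood_pressure(note):
    """Return (systolic, diastolic) from a free-text note, or None."""
    match = BP_PATTERN.search(note)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Hypothetical progress notes.
print(extract_blood_pressure("Patient reports mild headache. BP 142/91 mmHg, pulse 78."))
print(extract_blood_pressure("Patient resting comfortably, no complaints."))
```

A structured field would return the value directly. Here the same fact has to be recovered from wording that varies from clinician to clinician.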

Patient’s Privacy:

A second major challenge in taking full advantage of big data in healthcare is protecting patient privacy. The exchange of healthcare data between organisations is often stated as a goal, and organisations such as national health information organisations have been created specifically to bring together healthcare data from stakeholders including providers, payers and government health organisations.

There are also regulatory requirements, such as the Health Insurance Portability and Accountability Act (HIPAA). After de-identification, patient information may be shared, but it is difficult to protect the patient from direct or indirect identification while preserving the data’s usefulness.

Covered organizations, including healthcare providers and health insurance firms among others, often err on the conservative side and publish only aggregate information or information with all potential identifiers removed. Removing these data elements to meet the “safe harbor” de-identification requirements of HIPAA makes it almost impossible to use the information for trend or longitudinal care research.
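The trade-off can be seen in a small sketch. The code below drops a handful of direct identifiers, reduces dates to the year and groups ages of 90 and above, which are among the steps the safe harbor method calls for. It covers only a subset of the 18 identifier categories, and the field names and sample record are hypothetical, so treat it as an illustration rather than a compliant implementation.

```python
from datetime import date

# Hypothetical field names covering only some of the 18 identifier categories.
DIRECT_IDENTIFIERS = {"name", "ssn", "phone", "email", "medical_record_number"}
DATE_FIELDS = {"admission_date", "discharge_date"}


def deidentify(record):
    """Return a copy of `record` with identifiers removed or generalized."""
    cleaned = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue  # drop the field entirely
        if field in DATE_FIELDS:
            # Keep only the year of the date.
            cleaned[field.replace("_date", "_year")] = value.year
        elif field == "age":
            # Ages of 90 and above are reported as a single group.
            cleaned[field] = "90+" if value >= 90 else value
        else:
            cleaned[field] = value
    return cleaned


# Hypothetical patient record.
record = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "age": 93,
    "admission_date": date(2019, 3, 2),
    "discharge_date": date(2019, 3, 9),
    "diagnosis": "pneumonia",
}
print(deidentify(record))
```

After this step the admission and discharge fields both read 2019, so the seven-day length of stay can no longer be computed from the record. That loss of detail is what makes trend and longitudinal research so hard on de-identified data.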

When studies contain a time element, such as those examining readmission rates or mortality rates, removing date elements is especially problematic. Even if the patient’s privacy can be guaranteed, many healthcare providers are unwilling to disclose information because of industry competition. Many doctors don’t want their rivals to know precisely how many procedures they have carried out and where. The insurance mix or demographics of a clinic’s patients may give it an economic advantage over another clinic.

Although most clinics are run as non-profit organizations, they are still businesses and follow all the rules that govern running a sustainable business. A range of openly accessible data sets may enable rivals to gather comparable information, but these sources are typically historical or restricted to public payers.

Patients themselves are increasingly a source of data. In formulating a sound information management plan, how this data is collected and how its incorporation affects the healthcare record are critical considerations. The data can be gathered from monitoring devices that are linked to an offsite computer via mobile computing, or downloaded from the device periodically during an office visit.

In either scenario, the data must be validated to ensure that the patient used the monitoring device and did not hand it to another member of the household. With patient-generated data, the risk of compromised data integrity is much greater than with sources under the clinician’s direct control.


A significant challenge to recognize in healthcare analytics is that the analysis is often a secondary use of the data. For example, administrative data is gathered mainly for billing rendered services and collecting payment. The primary purpose of EHR data is to track patient progress, treatment and clinical status. When this data is then used to evaluate quality and outcomes, its original purpose must be recognized as a potential limitation that may compromise the accuracy and validity of any resulting models.

Comprehensive data and information management programs can be used within and across providers to tackle many of these challenges. A data management program includes guidelines on data format and on the appropriate use of data sources and data fields. Rigorous information management strategies ensure consistent data content and format and support the technical side of mapping and combining data from different sources. An information management program also covers data handling, evaluation and security.

Information management strategies will guide data consumers in determining whether a secondary use of the data is appropriate, as well as how much information can be published while protecting the identity of the patient. To be most effective, data and information management must be cross-departmental for data sets internal to one organization and cross-organizational for data sets drawn from several organisations. This sort of framework will help break down both internal and external information silos.
