Big data refers to datasets whose size, volume, and structure are beyond the ability of traditional software tools and database systems to store, process, and analyze within reasonable timeframes.
Big data security is the term for the tools and techniques used to protect data and backend processes from outside attacks and theft.
The influx of big data and the need to move this information throughout an organization have created a massive new target for hackers and other cybercriminals. This data, which was previously unusable by many organizations, is now highly valuable, is subject to privacy laws and compliance regulations, and must always be protected.
Hadoop and similar NoSQL data stores are used by organizations large and small to collect, manage, and analyze large data sets. Even though these tools are popular, they were not designed with comprehensive security in mind.
Data security in Hadoop Framework
In a Hadoop-based ecosystem, there are many new ways to ingest and process data, whether through a push-based or a pull-based architecture. The Hadoop framework can handle this data for different use cases. But managing petabytes of data in a single centralized cluster can be dangerous, as data is the most valuable asset of a company.
Hadoop security is not only about securing the source data that is moved from enterprise systems to the Hadoop ecosystem, but also about securing the business insights and intelligence derived from that data. Such insights in the hands of a competitor, an individual hacker, or any unauthorized person could be disastrous, as they could steal personal or corporate data and use it for unlawful purposes. That’s why all of this data must be fully secured.
Sensitive data stored in Hadoop or any other big data framework is subject to privacy standards such as HIPAA and HITECH, and to security regulations and audits. In addition to bringing benefits to the enterprise, the Hadoop framework also introduces new dimensions to the cyber-attack landscape. At a time when attackers are constantly looking for systems to target, Hadoop has become a starting point because all data is stored on top of HDFS.
Data security strategy is one of the most widely discussed topics among executives, business stakeholders, data scientists, and developers working with data-based solutions at the enterprise level.
Reasons for Securing Big Data Cluster
Among the many reasons for securing a big data cluster, the following are some of the most important.
- Contains Sensitive Data
Sensitive data such as credit card numbers, Social Security numbers (SSNs), and other corporate data needs to be protected at all times.
- Data is Subject to Regulatory Compliance
Countries and regions such as the USA and the EU have different data protection regulations, such as HIPAA, FISMA, and GDPR, to protect sensitive data. Compliance requirements differ based on the data types and the region in which a company conducts business. Companies are legally required to follow them.
- Secured Data can Enable one’s Business
By securing sensitive data, companies can allow different workloads on the sensitive datasets.
Key Security Considerations in Hadoop
A complex and holistic approach is needed for data security across the entire Hadoop big data ecosystem. Below are some of the key considerations when designing security features for a Hadoop-based big data ecosystem.
Authentication: A single point of authentication, integrated with the enterprise identity and access management system, is needed. Authentication is about verifying the identity of a user or service so that only legitimate users get access to the data and services of the Hadoop cluster. In large organizations, Hadoop is integrated with existing authentication systems such as the following.
Active Directory (AD): Using Active Directory has many advantages for both the organization and its users. From the organizational perspective, reusing existing services reduces maintenance effort and cost. From the user perspective, a Single Sign-On service simplifies access and increases security in the cluster, because password hashes are not repeatedly transmitted over the wire.
Kerberos and LDAP: Kerberos provides Single Sign-On via a ticket-based authentication mechanism. The SPNEGO protocol, which is supported by all major browsers, extends Kerberos authentication to web applications and portals.
SAML (Security Assertion Markup Language)
OAuth (Open Authorization)
HTTP authentication: REST API based authentication, mainly used for JDBC connections
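The ticket-based idea mentioned for Kerberos can be illustrated with a small sketch. This is not the Kerberos protocol: it only shows how a service can accept a signed, time-limited ticket instead of a password. The shared key `SERVICE_KEY` and the functions `issue_ticket` and `verify_ticket` are hypothetical names chosen for the illustration.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

# Hypothetical shared secret between the ticket issuer and the service.
# Real Kerberos uses per-principal keys managed by a KDC, not one constant.
SERVICE_KEY = b"demo-only-secret"


def issue_ticket(user: str, lifetime_s: float, now: Optional[float] = None) -> str:
    """Return a signed, time-limited ticket for `user`."""
    now = time.time() if now is None else now
    payload = json.dumps({"user": user, "expires": now + lifetime_s}).encode()
    signature = hmac.new(SERVICE_KEY, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(signature).decode()
    )


def verify_ticket(ticket: str, now: Optional[float] = None) -> Optional[str]:
    """Return the user name if the ticket is authentic and unexpired, else None."""
    now = time.time() if now is None else now
    try:
        payload_b64, signature_b64 = ticket.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:  # wrong number of parts or invalid base64
        return None
    expected = hmac.new(SERVICE_KEY, payload, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(signature, expected):
        return None
    claims = json.loads(payload)
    return claims["user"] if claims["expires"] > now else None
```

The password is checked once, when the ticket is issued. After that, services only see the ticket, which expires on its own.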
Authorization: A role-based authorization with fine-grained access control needs to be set up for providing access to sensitive data.
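A minimal sketch of a role-based check is shown here, assuming permissions are granted per path prefix and action. The users, roles, and paths are hypothetical. Production clusters would typically delegate this to a policy engine rather than an in-process table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Permission:
    resource: str  # path prefix the permission applies to (hypothetical)
    action: str    # "read" or "write"


# Hypothetical role and user tables for the illustration.
ROLE_PERMISSIONS = {
    "analyst": {Permission("/data/sales", "read")},
    "etl": {Permission("/data/sales", "read"), Permission("/data/sales", "write")},
}
USER_ROLES = {"alice": {"analyst"}, "bob": {"etl"}}


def is_allowed(user: str, path: str, action: str) -> bool:
    """Return True if any role of `user` grants `action` on `path`."""
    for role in USER_ROLES.get(user, ()):
        for perm in ROLE_PERMISSIONS.get(role, ()):
            covers_path = path == perm.resource or path.startswith(perm.resource + "/")
            if perm.action == action and covers_path:
                return True
    return False  # deny by default
```

Access is denied unless a role explicitly grants it, so an unknown user or an unlisted path gets no access.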
Access control: Access to the data needs to be controlled based upon the availability of the processing capacity in the cluster.
Data masking and encryption: Enterprises must deploy proper encryption and masking techniques so that sensitive data is accessible to authorized personnel only.
Internal or external leakage of data is a key business concern in any organization. Securing sensitive and critical business data and personally identifiable information (PII) in a big data cluster is challenging, as data is stored in various formats after passing through different data pipelines.
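As a rough illustration of masking, the sketch below hides all but the last four digits of a card number and of a US-style SSN. The field names `card` and `ssn` and the helper names are assumptions made for the example. Real pipelines usually need broader pattern coverage.

```python
import re

# Matches SSNs written as 123-45-6789 (an assumed input format).
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_card(number: str) -> str:
    """Replace every digit except the last four with '*', keeping separators."""
    total_digits = sum(ch.isdigit() for ch in number)
    seen = 0
    out = []
    for ch in number:
        if ch.isdigit():
            seen += 1
            out.append(ch if seen > total_digits - 4 else "*")
        else:
            out.append(ch)
    return "".join(out)


def mask_ssn(text: str) -> str:
    """Mask the first five digits of every SSN found in `text`."""
    return SSN_RE.sub(lambda m: "***-**-" + m.group(0)[-4:], text)


def mask_record(record: dict) -> dict:
    """Return a copy of `record` with the hypothetical sensitive fields masked."""
    masked = dict(record)
    if "card" in masked:
        masked["card"] = mask_card(masked["card"])
    if "ssn" in masked:
        masked["ssn"] = mask_ssn(masked["ssn"])
    return masked
```

Masked values keep their original shape, so downstream jobs that only need the format or the last four digits can still run.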
There are two types of encryption techniques that can be applied to the data.
Data-at-rest encryption
Data-in-transit encryption
Implementing these techniques can be challenging because much information is not file-based in nature but is instead handled through complex chains of message queues and message brokers. Applications in Hadoop may also use local temporary files that can contain sensitive information, which must be secured. Plain Hadoop provides encryption for data stored in HDFS, but it does not include a comprehensive cryptographic key management solution or Hardware Security Module (HSM) integration.
To support data-at-rest encryption, the Hadoop distribution from Cloudera provides Cloudera Navigator Encrypt and Key Trustee Server, whereas Hortonworks provides Ranger Key Management Service. MapR uses format-preserving encryption and masking techniques, which maintain the data format rather than replacing it with cryptic text and so support faster analytical processing between applications.
Cryptographic protection of data at rest can be done in three ways.
Application level: Encryption is integrated into the application, securing the data during ingestion by using an external key manager, with cryptographic keys held in an HSM, to encrypt and decrypt the data.
HDFS level: This is transparent encryption in which content is encrypted on write and decrypted on read. It protects against file-system and OS-level attacks.
Disk level: This is transparent encryption at a layer between the application and the file system. It provides process-based access control, which can secure metadata, logs, and config files.
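The "encrypted on write, decrypted on read" behavior can be sketched with a toy in-memory store. This is only an illustration of the idea, not how HDFS implements it. The cipher is a simple HMAC-SHA256 keystream built from the standard library to keep the example self-contained. It provides no integrity protection and should not be used in production, where a vetted cryptographic library and a key management service would be used instead. The class `EncryptedZone` is a hypothetical name.

```python
import hashlib
import hmac
import os


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from HMAC-SHA256 in counter mode."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
    return bytes(out[:length])


def encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Return a fresh 16-byte nonce followed by the XOR-encrypted data."""
    nonce = os.urandom(16)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))


def decrypt(key: bytes, blob: bytes) -> bytes:
    """Split off the nonce and reverse the XOR."""
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


class EncryptedZone:
    """Toy stand-in for an encrypted storage zone (illustration only)."""

    def __init__(self, key: bytes):
        self._key = key
        self._blocks = {}

    def write(self, path: str, data: bytes) -> None:
        self._blocks[path] = encrypt(self._key, data)  # encrypted on write

    def read(self, path: str) -> bytes:
        return decrypt(self._key, self._blocks[path])  # decrypted on read

    def raw(self, path: str) -> bytes:
        """What someone reading the underlying storage directly would see."""
        return self._blocks[path]
```

Callers of `write` and `read` never handle ciphertext, which is what makes the encryption transparent. Anyone bypassing the store and reading the raw blocks sees only the nonce and encrypted bytes.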
Network perimeter security:
Core Hadoop does not provide any native safeguard against network-based attacks such as denial-of-service attacks. These can include Denial of Service (DoS) and Distributed Denial of Service (DDoS) attacks, flooding a cluster with extra jobs, or running jobs that consume a large amount of resources.
To protect a big data cluster from network-based attacks, an organization needs to:
- Perform packet-level encryption and protect client-to-cluster data with TLS (Transport Layer Security).
- Protect communication traffic within the cluster by enabling encrypted shuffle and TLS/HTTPS for the HDFS, MapReduce, YARN, and HBase UIs, etc.
- Protect traffic within the cluster between mapper and reducer jobs.
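On the client side, protecting client-to-cluster traffic with TLS might look like the following sketch, which uses Python's standard `ssl` module. The helper names and the choice of TLS 1.2 as the minimum version are assumptions for the example. Any host name or port passed in would be specific to the cluster.

```python
import socket
import ssl
from typing import Optional


def make_client_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a TLS client context that verifies the server certificate.

    `ca_file` may point to the cluster's CA bundle. When it is None,
    the system default trust store is used.
    """
    ctx = ssl.create_default_context(cafile=ca_file)
    # Refuse protocol versions older than TLS 1.2 (an assumed policy).
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def open_tls_connection(host: str, port: int, ctx: ssl.SSLContext) -> ssl.SSLSocket:
    """Connect to host:port and wrap the socket in TLS with hostname checking."""
    raw = socket.create_connection((host, port), timeout=10)
    return ctx.wrap_socket(raw, server_hostname=host)
```

The default context already requires a valid certificate and checks the host name, so the sketch only tightens the minimum protocol version.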
System Security: System-level security is achieved by hardening the OS and the applications that are installed as part of the ecosystem.
Infrastructure Security - SELinux: Data centers should have strict infrastructure and physical access security.
Security-Enhanced Linux (SELinux) is a Linux kernel security module that provides a mechanism for supporting access control policies such as Mandatory Access Control (MAC). It was developed by the NSA and adopted into the upstream Linux kernel. It helps prevent command injection attacks, for example by giving library files execute permission (x) but not write permission (w). The policy also prevents another user or process from accessing a user’s home directory, even if that user changes the settings on the home directory. SELinux policy labels files, grants permissions on them, and enforces MAC.
Audits/Event Monitoring and Data Governance:
Enterprises should keep a proper audit trail of any changes to the data ecosystem and provide audit reports for any data access and data processing that occurs within the ecosystem.
To comply with government regulations, companies are often required to keep an audit trail of logs related to cluster access and cluster configuration changes. Most Hadoop distributions, such as Cloudera, MapR, and Hortonworks, offer audit capabilities to ensure that the activities of platform administrators and users can be logged.
Logging for audits should include at least the following items.
- Changes to files and folders in the filesystem
- Modifications of database structures
- Reconfigurations of the cluster
- Application exceptions
- Login attempts to services
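The items above could be captured as structured audit records, one JSON object per event. The following is a minimal sketch under that assumption. The category names, field names, and the `audit` logger name are hypothetical and not taken from any Hadoop distribution.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Optional

# Hypothetical category names mirroring the audit items listed above.
AUDIT_CATEGORIES = {
    "file_change",
    "schema_change",
    "cluster_reconfig",
    "app_exception",
    "login_attempt",
}

audit_logger = logging.getLogger("audit")


def format_audit_event(category: str, user: str, target: str, success: bool,
                       when: Optional[datetime] = None) -> str:
    """Serialize one audit event as a single JSON line."""
    if category not in AUDIT_CATEGORIES:
        raise ValueError(f"unknown audit category: {category}")
    when = when or datetime.now(timezone.utc)
    event = {
        "timestamp": when.isoformat(),
        "category": category,
        "user": user,
        "target": target,
        "success": success,
    }
    return json.dumps(event, sort_keys=True)


def record_audit_event(category: str, user: str, target: str, success: bool) -> str:
    """Format the event and send it to the audit logger."""
    line = format_audit_event(category, user, target, success)
    audit_logger.info(line)
    return line
```

Keeping every event in the same machine-readable shape makes it easier to collect logs from many distributed components into one place.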
Good auditing practice allows an organization to identify the sources of data, application and data errors, and security events. Most big data platform components allow for some form of logging, either to the local file system or to HDFS. The main auditing challenge in the big data world is the distributed nature of big data components and the tight integration of distinct components with each other.
Good auditing practice on the Hadoop framework lets organizations capture metadata for data lineage, database changes, and security events. Some of the common tools for auditing are Cloudera Navigator Audit Server and Apache Atlas (Hortonworks). Using these tools, organizations can automatically capture events from the filesystem, database, and authorization components and display this data through a user interface.
Disaster Recovery and Backup
Disaster Recovery (DR) enables business continuity after significant data center failures beyond what the high-availability features of Hadoop components can cover.
Computer systems support disaster recovery in three ways.
- Backup: A backup is cold storage of data that won’t be used all the time.
- Replication: Replication aims to provide a close resemblance of the production system by replicating data at a scheduled interval. Replication can also be used within the cluster to increase availability and reduce single points of failure.
- Mirrors: A mirror is usually an exact copy of the production system with virtually no delay and is set up as a failover instance of the production system.
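As a small illustration of the backup case, the sketch below copies a file to a backup directory and verifies the copy with a SHA-256 checksum. It works on local files only and uses hypothetical helper names. Backing up a real cluster would rely on the distribution's own tooling.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source, backup_dir) -> Path:
    """Copy `source` into `backup_dir` and verify the copy by checksum."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / Path(source).name
    shutil.copy2(source, target)  # copies contents and file metadata
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"backup verification failed for {source}")
    return target
```

Verifying the checksum right after the copy catches silent corruption before the backup is needed for recovery.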