Nitendra Gautam

Metadata for Big Data Applications

Metadata is information that describes other data; put simply, it is data about data. It is the descriptive, administrative, and structural data that defines a firm’s data assets. It specifically identifies the attributes, properties, and tags that describe and classify information. It captures details about each information asset, such as the type of asset, author, date originated, workflow state, and usage within the enterprise, among numerous others. Once metadata is properly defined, it adds value to the data content and provides a tool for quickly locating information. It can streamline and enhance the process of collecting, integrating, and analyzing big data sources.

Metadata provides the map and lineage between source and target systems. It is the semantic fabric that ties all our systems and users together. It provides the framework to enable effective policy development, clear access channels, consistent data definitions, usage and security protocols, and lineage pathways. It allows data users in all capacities and positions to work within a shared linguistic system.

Importance of Metadata

Metadata summarizes the basic information about data, which can make finding and working with particular instances of data easier. Metadata includes the descriptive, structural, and administrative attributes of business data.

According to Forrester, metadata is important because it is the foundation for many efforts including:

  • Easier interaction with the data. It allows users to interact with the data through the higher-level logical abstraction of a table rather than as a mere collection of files on HDFS or a table in HBase. To discover or interact with data visually, users don’t need to be concerned about where or how the data is stored.

  • Supplies information about data. It supplies information about data stored in the cluster (such as partitioning and sorting properties) that various tools within the company can leverage while querying and populating the data.

  • Information Discovery and Tracing Lineage. We can connect metadata to different data management tools. Once it is connected, we can discover what data is available and how it can be used. We can also trace the lineage of the data (find out where and when the data set originated). As strategic business decisions become more data-driven, it becomes more critical to find and effectively employ the data that a business has about a customer, market, or product. Metadata assists this through consistent definitions of data, such as customer address; the association of “related” information, such as all interactions with that customer; and the configuration of views into this data for BI applications.

  • Data Governance. Data-driven business decisions require that a business executive can trust the information: that it is accurate, timely, and means what they think it means. Trust requires a consistent level of quality in an environment of change. Metadata describes the lineage of information, where it comes from, and the quality attributes of information. For example, a shipping address must contain a postcode correctly formatted for the country of the address. As part of the data governance process, information such as record counts, source-to-target mapping record counts, null value counts, and duplicate counts can be stored in metadata.

  • Data Management. Metadata is the backbone for administering and enforcing business policies, such as privacy and security. Similarly, metadata assists with managing the costs of information storage by allowing you to isolate, retain, and delete information according to policy.
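To make the governance point above concrete, here is a minimal sketch of how record counts, null counts, and duplicate counts might be computed for a batch and kept as metadata. The function name `quality_metadata`, the field names, and the sample records are assumptions made for illustration; they do not come from any particular tool.

```python
from collections import Counter


def quality_metadata(records, key_fields):
    """Summarize a batch of dict records into governance metadata."""
    # Count missing values (None or empty string) per field.
    null_counts = Counter()
    for record in records:
        for field_name, value in record.items():
            if value is None or value == "":
                null_counts[field_name] += 1

    # A record counts as a duplicate if its key tuple was already seen.
    keys = Counter(tuple(r.get(f) for f in key_fields) for r in records)
    duplicate_count = sum(n - 1 for n in keys.values() if n > 1)

    return {
        "record_count": len(records),
        "null_counts": dict(null_counts),
        "duplicate_count": duplicate_count,
    }


# Hypothetical sample data: the target load dropped the duplicate row.
source = [
    {"customer_id": 1, "postcode": "75001"},
    {"customer_id": 2, "postcode": None},
    {"customer_id": 2, "postcode": None},
]
target = source[:2]

source_meta = quality_metadata(source, key_fields=["customer_id"])
target_meta = quality_metadata(target, key_fields=["customer_id"])

# Source-to-target reconciliation: records lost or removed during the load.
record_count_delta = source_meta["record_count"] - target_meta["record_count"]
```

Storing both summaries next to the data set lets a later audit compare source and target counts without rescanning the data.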

Types of Metadata

Metadata in general can be divided into two categories.

Technical Metadata

Technical metadata mainly includes system information which defines the data structures such as:

  • Tables
  • Fields
  • Data Types
  • Indexes
  • Partition Column in Database
  • Data Dimensions
  • Data Measures
  • Data Mining Models

It defines the objects and processes of information assets as seen from a technological point of view. It also defines the data model and other data related to access policy, such as the required permissions, rights, and other protocols that enable access to and use of both metadata and data.

Business Metadata

Business metadata links the technical metadata back to the business needs. Some of the items that business metadata includes are given below.

  • Data about the functionality
  • Data element definitions
  • Definitions of how business units use the data
  • Business requirements
  • Project timelines
  • Business metrics
  • Business process flows
  • Business terminology

Users of the data want to understand what the data means and to be confident in it. When data is accurate and consistent, it can be trusted throughout the organization when making business decisions. Metadata can be accessed at the same time as the data it represents, so that both the metadata and the data can be used for data science inquiries.

Metadata in Hadoop

When we discuss metadata from the perspective of the Hadoop ecosystem, it can fall into one of the categories below.

Metadata about Logical Data Sets

This metadata is stored in a separate metadata repository and can include the following information:

  • Location of a data set (e.g., directory in HDFS or the HBase table name)
  • Schema associated with the data set (column names and column data types such as String, Long, Float)
  • Information related to the partitioning and sorting properties of the data set
  • Format of the data set (CSV, TSV, SequenceFile, Parquet, Avro, etc.)
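The items listed above can be pictured as a single record per data set. The sketch below is one possible shape for such a record, not the schema of any specific metadata repository; the class names, the `validate` helper, and the sample path are hypothetical.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class Column:
    name: str
    data_type: str  # e.g. "string", "long", "float"


@dataclass
class DatasetMetadata:
    name: str
    location: str      # HDFS directory or HBase table name
    file_format: str   # e.g. "parquet", "avro", "csv"
    columns: list = field(default_factory=list)
    partition_columns: list = field(default_factory=list)
    sort_columns: list = field(default_factory=list)

    def validate(self):
        """Check that partition and sort columns exist in the schema."""
        known = {c.name for c in self.columns}
        missing = [c for c in self.partition_columns + self.sort_columns
                   if c not in known]
        if missing:
            raise ValueError(f"columns not in schema: {missing}")


orders = DatasetMetadata(
    name="orders",
    location="/data/warehouse/orders",  # hypothetical HDFS path
    file_format="parquet",
    columns=[Column("order_id", "long"), Column("country", "string"),
             Column("amount", "float")],
    partition_columns=["country"],
    sort_columns=["order_id"],
)
orders.validate()

# Serialize the record so it can be stored in a separate repository.
serialized = json.dumps(asdict(orders), indent=2)
```

Keeping the record serializable (here as JSON) is a design choice that makes it easy to store the metadata apart from the data itself.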

Metadata about files on HDFS

This includes information such as the permissions and ownership of the files and the location of each file's blocks on the data nodes. Such information is usually stored and managed by the Hadoop NameNode.

Metadata about tables in HBase and Hive

This includes the following information.

  • Table names and associated namespace/database
  • Associated attributes (e.g. MAX_FILESIZE, READONLY, WRITEONLY, etc.)
  • Names of columns

Metadata about data ingest and transformations

This includes information such as which user generated a given data set, where the data set came from, the time and date of creation, how long it took to generate, where the data is going to be stored (for example HDFS, S3, or Azure), and how many records there are or the size of the data loaded.
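As an illustration of the ingest metadata described above, the following sketch wraps an ingest step and records who ran it, where the data came from and went, when it ran, how long it took, and how many records it produced. The wrapper `run_with_ingest_metadata`, the stand-in `fake_ingest` function, and the source and destination strings are all hypothetical.

```python
import time
from datetime import datetime, timezone


def run_with_ingest_metadata(ingest_fn, user, source, destination):
    """Run ingest_fn() and return (records, metadata about the run)."""
    started_at = datetime.now(timezone.utc)
    start = time.perf_counter()
    records = ingest_fn()
    duration = time.perf_counter() - start

    metadata = {
        "user": user,
        "source": source,
        "destination": destination,
        "created_at": started_at.isoformat(),
        "duration_seconds": duration,
        "record_count": len(records),
    }
    return records, metadata


def fake_ingest():
    # Stand-in for a real extraction step.
    return [{"id": i} for i in range(3)]


records, ingest_meta = run_with_ingest_metadata(
    fake_ingest,
    user="etl_service",                          # hypothetical user name
    source="sftp://example.invalid/orders.csv",  # hypothetical source
    destination="hdfs:///data/raw/orders",       # hypothetical destination
)
```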

Metadata about data set/Table statistics

This includes information like:

  • Row count of the table or data set for each partition (date, country, or unique column)
  • Number of unique values in each column of the data set
  • Histogram of the distribution of the data set
  • Maximum and minimum values in the data set

Such metadata is useful not only for tools that can leverage it to optimize their execution plans but also for data analysts, who can do quick analysis based on it.
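The statistics listed above can be sketched in a few lines for a small in-memory table. This is only an illustration under simple assumptions (a non-empty list of dict rows, one numeric column, equal-width histogram bins); the function name `table_statistics` and the sample rows are hypothetical, and real engines compute such statistics in their own ways.

```python
from collections import Counter, defaultdict


def table_statistics(rows, partition_column, numeric_column, bins=4):
    """Compute simple statistics for a non-empty list of dict rows."""
    # Row count per partition value.
    rows_per_partition = Counter(row[partition_column] for row in rows)

    # Number of unique values in each column.
    distinct = defaultdict(set)
    for row in rows:
        for column, value in row.items():
            distinct[column].add(value)

    # Minimum, maximum, and an equal-width histogram of one numeric column.
    values = [row[numeric_column] for row in rows]
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # avoid zero width when all values are equal
    histogram = [0] * bins
    for v in values:
        index = min(int((v - lo) / width), bins - 1)  # clamp the maximum
        histogram[index] += 1

    return {
        "rows_per_partition": dict(rows_per_partition),
        "distinct_counts": {c: len(s) for c, s in distinct.items()},
        "min": lo,
        "max": hi,
        "histogram": histogram,
    }


rows = [
    {"country": "US", "amount": 10.0},
    {"country": "US", "amount": 20.0},
    {"country": "FR", "amount": 20.0},
    {"country": "FR", "amount": 50.0},
]
stats = table_statistics(rows, partition_column="country", numeric_column="amount")
```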

Metadata Governance

Managing metadata is an integral part of an overall data governance standard. An efficient way to do this is to establish data stewardship for metadata. Stewarding metadata ensures that data remains consistent throughout the enterprise and supports accurate big data analytics decisions. It also gives the users of this data value and a context for understanding the data and its components.

Below are some of the major responsibilities of metadata stewardship.

  • Documenting the heritage and lineage of the data content
  • Defining and documenting the data definitions for data store entities and attributes
  • Identifying the relationships between data
  • Validating data timeliness, accuracy, and completeness
  • Assisting in the development of data compliance, audits, and legal and regulatory controls for data governance

Companies need to adhere to several data privacy regulations, which differ by industry. The table below shows some data privacy regulations and the industry or data type each applies to.

Compliance | Applicable data types
HIPAA (Health Insurance Portability and Accountability Act) | It sets standardized mechanisms to ensure that healthcare organizations (called “Covered Entities”) protect the integrity, privacy, and confidentiality of individuals’ health-related data.
FDA (Food & Drug Administration) Part 11 | It requires drug makers, medical device manufacturers, biotech companies, and other FDA-regulated industries to implement controls including audits, system validations, audit trails, electronic signatures, and documentation for software and systems involved in processing electronic data.
GLBA (The Gramm-Leach-Bliley Act) | It includes provisions to protect consumers’ personal financial information held by companies broadly defined as “financial institutions.”
EUDPD (EU Data Protection Directive) | It declares that data protection is a fundamental human right. It standardizes protection of data privacy for EU citizens.
HITECH (Health Information Technology for Economic and Clinical Health Act) | It broadens the scope and increases the rigor of HIPAA compliance.
FINRA (Financial Industry Regulatory Authority) | Its member companies must create, test, update, and maintain business continuity and contingency plans to satisfy obligations to clients in the event of an emergency or outage.
SEC (Securities and Exchange Commission) Rules 17a-3 and 17a-4 | These rules require broker-dealers to create, and preserve in an easily accessible manner, a comprehensive record of the securities transactions they effect and of their business in general. Rule 17a-4 requires electronic storage to preserve records in a non-rewriteable and non-erasable format. Retention is required for a specific period of time.
FERPA (The Family Educational Rights and Privacy Act) | This law is designed to protect the privacy of student education records and applies to all schools that receive funds under programs of the US Department of Education.

Metadata Management Functions

Key metadata management functions include:

  • Inventory. A complete inventory of the data ecosystem. This includes both physical and logical representations of data assets, business or semantic information, services, APIs, etc.
  • Information Model. An information model representing the business vocabulary relevant to the business. The model provides the dictionary that data assets are mapped to, providing a common understanding and clear translation between business and technical representations of data.
  • Classification. The process of associating data assets to the information model. Today, this is done manually, but this process must be automated in order to achieve enterprise scale. This capability provides tremendous value by enabling data quality, compliance and improving time to market.
  • Data Quality / Data Usage Information. Using common rules enabled by classification, data assets are validated for completeness, correctness, and compliance. Automated data cleansing is enabled as a result of data quality processes. This function also provides information on what data is used for each operation (CRUD).
  • Governance. The oversight of data assets. Data governance provides the oversight, standards, and policies for the information model, classification, and data quality processes, and it covers stewardship and workflow.

Just as a library’s card catalog provides a directory of what books are on the shelves, in the data engineering space metadata provides the equivalent directory of what business information is created, inventoried, and available for application and business use.
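The classification function described above, which maps data assets to the information model, is one place where automation can start simply. The sketch below uses a small rule-based matcher on column names; the vocabulary in `INFORMATION_MODEL`, the patterns, and the function `classify_columns` are hypothetical examples, and a real deployment would likely need richer rules or profiling of the data values.

```python
import re

# Hypothetical business vocabulary: business term -> pattern for column names.
INFORMATION_MODEL = {
    "Customer Identifier": re.compile(r"^(cust|customer)_?id$"),
    "Postal Code": re.compile(r"^(zip|zip_?code|post_?code|postal_?code)$"),
    "Email Address": re.compile(r"e_?mail"),
}


def classify_columns(columns):
    """Map each column name to the first matching business term, or None."""
    classified = {}
    for column in columns:
        normalized = column.strip().lower()
        classified[column] = next(
            (term for term, pattern in INFORMATION_MODEL.items()
             if pattern.search(normalized)),
            None,
        )
    return classified


result = classify_columns(["CUST_ID", "zipcode", "contact_email", "amount"])
```

Columns that map to no term (here `amount`) come back as `None`, which gives a data steward a short list of assets that still need manual classification.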









Semi Structured or Unstructured MetaData