Introduction to Big Data Analytics for Cyber Security
Big data analytics is the practice of studying large amounts of data of a variety of types, from a variety of sources, to uncover interesting patterns, previously unknown facts, and other useful information. Big data analytics can play a crucial role in cyber security. Many in the industry are changing the tone of the conversation: the question is no longer if or when your network will be compromised; instead, the assumption is that your network has already been hacked or compromised, and the focus shifts to minimizing the damage and increasing visibility to help identify the next hack or compromise.
Advanced analytics can be run against very large, diverse data sets to find indicators of compromise (IOCs). These data sets can include different types of structured and unstructured data processed in a “streaming” fashion or in batches. NetFlow plays an important role in big data analytics for cyber security, and you will learn why as you read through this chapter.
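To make the idea concrete, the following minimal Python sketch shows the streaming pattern: each NetFlow-like record is checked against a set of known-bad IP addresses as it arrives, rather than waiting for a full batch. The record fields and the indicator list are hypothetical examples, not tied to any specific product or feed.

    # Minimal sketch: streaming IOC matching against NetFlow-like records.
    # The record fields and the indicator list are hypothetical examples.

    KNOWN_BAD_IPS = {"203.0.113.50", "198.51.100.7"}  # example IOC feed

    def matches_ioc(flow):
        """Return True if either endpoint of the flow is a known-bad IP."""
        return flow["src_ip"] in KNOWN_BAD_IPS or flow["dst_ip"] in KNOWN_BAD_IPS

    def process_stream(flows):
        """Process flow records one at a time, as they arrive."""
        for flow in flows:
            if matches_ioc(flow):
                print("Possible compromise:", flow)

    if __name__ == "__main__":
        sample_flows = [
            {"src_ip": "10.1.1.5", "dst_ip": "203.0.113.50", "bytes": 4200},
            {"src_ip": "10.1.1.9", "dst_ip": "192.0.2.80", "bytes": 1500},
        ]
        process_stream(sample_flows)

A production system would consume records from a message bus or collector and match against far richer indicators, but the one-record-at-a-time pattern is the same.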
What Is Big Data?
There are a lot of very interesting definitions for the phenomenon called big data, and different people seem to have different views of what big data is. Let’s cut through the marketing hype and get down to the basics of the subject. A formal definition of big data can be found in the Merriam-Webster dictionary: http://www.merriam-webster.com/dictionary/big%20data.
- An accumulation of data that is too large and complex for processing by traditional database management tools.
- Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time.
The size of data that can be classified as big data is a moving target. It can range from a few terabytes to yottabytes of data in a single data set. To put these prefixes in perspective (a short arithmetic sketch follows this list):
- A petabyte is 1000 terabytes.
- An exabyte is 1000 petabytes.
- A zettabyte is 1000 exabytes.
- A yottabyte is 1000 zettabytes.
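The arithmetic behind these prefixes can be expressed in a few lines of Python. Decimal, power-of-1000 prefixes are assumed here rather than the binary, power-of-1024 variants:

    # Decimal (power-of-1000) storage prefixes, expressed in bytes.
    TB = 1000 ** 4
    PB = 1000 * TB   # petabyte  = 1000 terabytes
    EB = 1000 * PB   # exabyte   = 1000 petabytes
    ZB = 1000 * EB   # zettabyte = 1000 exabytes
    YB = 1000 * ZB   # yottabyte = 1000 zettabytes

    print(PB // TB)  # 1000 terabytes in a petabyte
    print(YB // TB)  # 1,000,000,000,000 terabytes in a yottabyte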
Unstructured Versus Structured Data
The term unstructured data refers to data that does not have a predefined data model or is not organized in a predetermined way. Typically, unstructured data is defined as data that is not tracked in a “structured,” traditional row-and-column database. The prime examples of unstructured data are as follows:
- Multimedia content such as videos, photos, and audio files
- E-mail messages
- Social media (Facebook, Twitter, LinkedIn) status updates
- Presentations
- Word processing documents
- Blog posts
- Executable files
In the world of cyber security, a lot of network data can also be categorized as unstructured; a short sketch that parses one such record into structured fields follows this list:
- Syslog
- Simple Network Management Protocol (SNMP) logs
- NetFlow
- Server and host logs
- Packet captures
- Executables
- Malware
- Exploits
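The difference between unstructured and structured data becomes clear when you try to query it. The following sketch turns a free-form syslog line into the kind of row-and-column record a traditional database expects. The syslog message and the regular expression are simplified, made-up examples; real messages vary widely in format.

    import re

    # A made-up syslog message in the traditional free-form (unstructured) format.
    raw = "Oct  3 14:22:01 fw01 %ASA-6-302013: Built outbound TCP connection " \
          "for outside:203.0.113.50/443 to inside:10.1.1.5/52144"

    # Simplified pattern that extracts a few named fields.
    pattern = re.compile(
        r"(?P<timestamp>\w+\s+\d+ \d+:\d+:\d+) (?P<host>\S+) (?P<message>.*)"
    )

    match = pattern.match(raw)
    if match:
        structured = match.groupdict()   # now a row-like dict of named columns
        print(structured["host"], "|", structured["timestamp"])

Multiply this by thousands of message formats and billions of records per day, and the scale of the unstructured-data problem becomes apparent.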
Industry experts estimate that the majority of the data in any organization is unstructured, and the amount of unstructured data is growing significantly. The data sources are numerous and disparate. NetFlow is one of the largest single sources: in large organizations it can grow to tens of terabytes of data per day, and it is expected to reach petabytes over the coming years. What differentiates a useful big data solution is the ability to merge these numerous data sources and sizes into the same infrastructure and to query across all of the different data sets using the same language and tools.
There is an industry concept called Not-Only SQL (NoSQL), the name given to a class of databases that do not require SQL to process data. However, some of these databases support both SQL and non-SQL forms of data processing.
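As an illustration of the difference, the sketch below stores the same flow record twice: once as a row in a relational table using Python’s built-in sqlite3 module, and once as a JSON document of the kind a document-oriented NoSQL store would hold. The table and field names are hypothetical.

    import json
    import sqlite3

    flow = {"src_ip": "10.1.1.5", "dst_ip": "203.0.113.50", "bytes": 4200}

    # Relational (SQL) representation: a fixed schema, one row per flow.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE flows (src_ip TEXT, dst_ip TEXT, bytes INTEGER)")
    conn.execute("INSERT INTO flows VALUES (?, ?, ?)",
                 (flow["src_ip"], flow["dst_ip"], flow["bytes"]))
    print(conn.execute("SELECT * FROM flows").fetchall())

    # Document (NoSQL-style) representation: a schemaless JSON document.
    document = json.dumps(flow)
    print(document)

The relational row requires a schema defined up front; the document does not, which is one reason NoSQL stores are popular for fast-changing, unstructured security data.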
Big data analytics can be performed in combination with advanced analytics disciplines such as predictive analytics and data mining.
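As a toy example of such a discipline applied to network data, the sketch below flags flows whose size is far from the historical mean, a very simplified form of statistical anomaly detection. The byte counts and the three-standard-deviation threshold are made-up illustrations.

    import statistics

    # Made-up historical flow sizes (bytes) and a few new observations.
    history = [1200, 1500, 1100, 1350, 1400, 1250, 1300]
    new_flows = [1280, 950000, 1320]

    mean = statistics.mean(history)
    stdev = statistics.stdev(history)

    for size in new_flows:
        z = (size - mean) / stdev
        if abs(z) > 3:          # more than 3 standard deviations from the mean
            print(f"Anomalous flow size: {size} bytes (z-score {z:.1f})")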
Extracting Value from Big Data
Any organization can collect data just for the sake of collecting data; however, the usefulness of that data depends on how actionable it is for making decisions (in addition to whether the data is regularly monitored and analyzed).
There are three high-level key items for big data analytics:
- Information management: Ongoing management and process control for big data analytics.
- High-performance analytics: The ability to gain fast, actionable information from big data and to solve complex problems using more data.
- Flexible deployment options: Options for on-premises or cloud-based, software-as-a-service (SaaS) approaches to big data analytics.
There are a few high-level approaches for accelerating the analysis of very large data sets. The following are the most common:
- Grid computing: A centrally managed grid infrastructure for dynamic analysis with high availability and parallel processing.
- In-database processing: Performing data management, analytics, and reporting tasks within the database using scalable architectures.
- In-memory analytics: Quickly solves complex problems using in-memory, multiuse access to data and rapidly runs new scenarios or complex analytical computations.
- Support for Hadoop: Stores and processes large volumes of data on commodity hardware. Hadoop is covered later in this chapter, in the section “Hadoop”; a minimal MapReduce-style sketch follows this list.
- Visualizations: Quickly visualize correlations and patterns in big data to identify opportunities for further analysis and to improve decision making.
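To give a feel for the Hadoop processing model before the detailed coverage later in the chapter, the following sketch imitates a MapReduce job in plain Python: a map step emits (key, value) pairs from each record, and a reduce step sums the values per key. The input records are hypothetical, and a real job would be distributed across a cluster rather than run in a single process.

    from collections import defaultdict

    # Hypothetical flow records; a real job would read these from distributed storage.
    flows = [
        {"dst_ip": "203.0.113.50", "bytes": 4200},
        {"dst_ip": "192.0.2.80", "bytes": 1500},
        {"dst_ip": "203.0.113.50", "bytes": 800},
    ]

    # Map step: emit (destination IP, byte count) pairs.
    mapped = [(flow["dst_ip"], flow["bytes"]) for flow in flows]

    # Shuffle/reduce step: sum the byte counts per destination IP.
    totals = defaultdict(int)
    for dst_ip, nbytes in mapped:
        totals[dst_ip] += nbytes

    print(dict(totals))   # e.g. {'203.0.113.50': 5000, '192.0.2.80': 1500}

The appeal of this model is that the map and reduce steps can run in parallel on many commodity machines, each working on its own slice of the data.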
Examples of technologies used in big data analytics are covered in detail later in this chapter.