

Let us begin by looking at the jargon for data. A bit (binary digit) is the smallest unit of data in computer science; think of it as the atom of data. A bit can be 0 or 1. Bits are typically used to measure the amount of data transmitted from one place to another, such as data transferred across the Internet. A byte (eight bits) is the typical unit for storage, and byte-based measures can get very large.
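To make the arithmetic concrete, here is a minimal Python sketch; the function name and the 100 Mbps figure are just illustrative:

```python
# A byte is 8 bits. Network speeds are usually quoted in bits per second,
# while storage sizes are quoted in bytes, a common source of confusion.

BITS_PER_BYTE = 8

def megabits_to_megabytes(megabits: float) -> float:
    """Convert a quantity in megabits (Mb) to megabytes (MB)."""
    return megabits / BITS_PER_BYTE

# A 100 Mbps connection moves at most 12.5 MB of data per second.
print(megabits_to_megabytes(100))  # 12.5
```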
Where does data come from?
All AI models and algorithms make use of data, and a lot of it. This data can come from many sources, including the Internet of Things (IoT: ID tags and smart devices), the web and social media, biometric data (such as genetic tests and fitness trackers), point-of-sale systems in both e-commerce and brick-and-mortar stores, cloud systems (such as business applications like CRM), and corporate databases (Taulli, 2019).
Data can be organized in four ways. The first is structured data, which is mostly stored in relational databases and spreadsheets. Examples of structured data include addresses, phone numbers, Social Security numbers, financial information, product specifications, and point-of-sale data. Structured data accounts for about 20% of the data in an AI project (Taulli, 2019).
Structured data tends to be easier to work with: its volumes are comparatively small, and it often comes from systems like CRM and ERP. It is also simpler to analyze, and a range of business intelligence programs can derive insights from it.
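As a rough illustration, the following sketch uses Python's built-in sqlite3 module with a hypothetical customers table; the schema and records are invented for the example:

```python
import sqlite3

# An in-memory database stands in for a CRM or ERP system's relational store.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE customers (
           id INTEGER PRIMARY KEY,
           name TEXT,
           phone TEXT,
           city TEXT
       )"""
)
conn.execute(
    "INSERT INTO customers (name, phone, city) VALUES (?, ?, ?)",
    ("Ada Lovelace", "555-0100", "London"),
)

# Because every row follows the same schema, analysis is a simple query.
for row in conn.execute("SELECT name, city FROM customers"):
    print(row)  # ('Ada Lovelace', 'London')
```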
Unstructured data, on the other hand, consists of information with no predefined format. Examples include audio files, images, videos, text files, and social network content such as posts and tweets. It constitutes over 70% of the data in an AI project. Unstructured data requires a lot of formatting, which can be tedious and time-consuming. It can be processed with tools such as next-generation databases like those founded on NoSQL. AI algorithms can also recognize patterns in unstructured data, making its management and structuring more effective.
Semi-structured data sits between structured and unstructured data. Examples include JSON (JavaScript Object Notation), which is used to transfer information on the web through application programming interfaces (APIs), and XML (Extensible Markup Language), which uses rules to identify the elements in a document. Semi-structured data makes up about 5 to 10% of the data that AI systems use.
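To see what semi-structured data looks like in practice, consider the short sketch below; the JSON payload is hypothetical, and Python's standard json module parses it into ordinary objects:

```python
import json

# A typical API response: the payload is plain text, but its key-value
# structure gives it a self-describing, semi-structured shape.
payload = '{"order_id": 1001, "items": ["laptop", "mouse"], "total": 1299.99}'

order = json.loads(payload)  # parse the JSON text into Python objects
print(order["order_id"])     # 1001
print(order["items"][1])     # mouse

# Serializing back to JSON is just as easy, e.g. for sending over an API.
print(json.dumps(order, indent=2))
```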
Time-series data can be structured, unstructured, or semi-structured; it is mostly used to record interactions over time. For example, time series can track the behavior of a stock or a customer journey. If a customer visits an e-commerce site, browses different products, clicks on a link, buys some items, and checks out, the information about the customer's time on the site is collected as time-series data. This type of data can be hard to interpret. If a customer simply visits, browses, and leaves, the session generates messy, complicated data from which it is difficult to infer the customer's intent. It can also be challenging to design metrics that capture success and explain why the customer did what they did on the site. However, AI can help analyze time-series data and derive useful business intelligence from it.
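As a minimal sketch using only Python's standard library, here is what such a clickstream might look like; the events and timestamps are invented for the example:

```python
from datetime import datetime

# Hypothetical clickstream for one visitor: each event carries a timestamp,
# which is what makes this time-series data.
events = [
    (datetime(2024, 5, 1, 10, 0, 0), "visit_homepage"),
    (datetime(2024, 5, 1, 10, 2, 30), "browse_product"),
    (datetime(2024, 5, 1, 10, 5, 10), "add_to_cart"),
    (datetime(2024, 5, 1, 10, 9, 45), "checkout"),
]

# One simple metric: total session length from first to last event.
session_length = events[-1][0] - events[0][0]
print(f"Session lasted {session_length}")  # Session lasted 0:09:45

# Ordering by time also lets us inspect the path the customer took.
print(" -> ".join(name for _, name in events))
```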
Then there is Big Data. Big Data has become a form of capital: most of the world's biggest companies derive much of their value from their data, which they continually analyze to produce more efficient services and products. Companies generate enormous amounts of data minute by minute, and the volume keeps growing as more and more sensors are installed to harvest data from different sources. In its report "Data Age 2025," International Data Corporation (IDC) estimates that the amount of data generated will reach 163 zettabytes by 2025 (gartner.com).
There is no agreed-upon definition of Big Data; instead, Big Data is described by its characteristics. Big Data is said to have the "3Vs," coined by Doug Laney, a Gartner analyst, back in 2001: volume, variety, and velocity. In brief, volume is the scale of the data, usually unstructured; there is no specific threshold, but it is typically in the tens of terabytes. Variety describes the diversity of the data: it may combine any of the three types (structured, unstructured, or semi-structured), and it also reflects the different data sources and uses. While humans might find variety challenging to manage, machine learning has been of tremendous help in simplifying the process. Velocity is the speed of data creation, that is, the amount of data created per unit of time. Social media companies have very high velocities (their feeds are referred to as "firehoses" of data), and the massive amount of data generated demands vast investments in next-generation data centers and technologies; these companies also process data in memory rather than in disk-based systems. Velocity is often considered the most critical of the three because people want their data immediately: an "I want my data, and I want it now" attitude prevails today.
It is essential to mention that more Vs have been added over the years and continue to be added. Today there are also veracity, value, variability, and visualization. Many companies find it challenging to manage all their data, and most can handle only a tiny fraction of it.
What tools can help with data?
Databases are at the core of the tools that help with data, and the database market has evolved significantly. The modern era began in 1970 with Edgar Codd's famous paper "A Relational Model of Data for Large Shared Data Banks," which laid the foundation for relational databases. Decades later, Doug Cutting, who had built Lucene for text searching, created a new platform that enabled sophisticated storage and used MapReduce for processing across multiple servers. Cutting's system evolved into Hadoop, which is widely used to manage Big Data and to create data warehouses. Then came NoSQL systems, which are built to handle cloud, on-premises, and hybrid environments.
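As a toy illustration of the MapReduce idea underlying Hadoop (not Hadoop's actual API), here is a word count over two invented documents in plain Python:

```python
from collections import defaultdict
from itertools import chain

documents = ["big data needs big storage", "data lakes store raw data"]

# Map phase: emit (word, 1) pairs from each document. On a real cluster,
# this step runs in parallel across many servers.
mapped = chain.from_iterable(
    ((word, 1) for word in doc.split()) for doc in documents
)

# Shuffle/Reduce phase: group the pairs by key and sum the counts.
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))  # {'big': 2, 'data': 3, 'needs': 1, ...}
```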
An example is MongoDB, which can manage petabytes of both structured and unstructured data. There are also data lakes, which can store large amounts of structured and unstructured data in its raw form. Data lakes matter for AI systems because they make data available without heavy up-front formatting. Companies that use data lakes see an average of 9% higher organic growth (Aberdeen survey).
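As a hedged sketch of how such a document store is used, the snippet below relies on the pymongo driver and assumes a MongoDB server is running locally; the database, collection, and documents are invented for the example:

```python
from pymongo import MongoClient

# Connect to a local MongoDB server (assumes one is running on the default port).
client = MongoClient("mongodb://localhost:27017/")
collection = client["shop"]["products"]

# Documents in the same collection need not share a schema: the second record
# has an extra field, which a relational table would not allow so easily.
collection.insert_one({"name": "laptop", "price": 1199})
collection.insert_one({"name": "t-shirt", "price": 15, "sizes": ["S", "M", "L"]})

# Query by field, just as flexibly as the data was stored.
print(collection.find_one({"name": "t-shirt"}))
```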
Data Process
Companies are shelling out enormous amounts of money on data processing; IDC forecast that spending on data and analytics would reach $260 billion by 2022. Despite the big spending, a Gartner study reports that about 85% of Big Data projects are abandoned before reaching the pilot stage. Some of the reasons, according to Taulli (2019), include:
· Problems with data collection
· Lack of critical stakeholders’ buy-in
· Investing in the wrong IT tools
· Lack of clear focus
· Dirty data
The problems mentioned above mean that there has to be a data process. According to Taulli (2019), in the late 1990s a data process called CRISP-DM (Cross-Industry Standard Process for Data Mining) was created by software developers, consultants, experts, and academics. CRISP-DM consists of six steps:
· Step 1- Business Understanding
· Step 2- Data Understanding
· Step 3- Data Preparation
· Step 4- Modelling
· Step 5- Evaluation
· Step 6- Deployment
The CRISP-DM process is not necessarily linear. Steps 1, 2, and 3 are the most time-consuming and account for about 80% of the data process time (Atif Kureishy, Global VP of Emerging Practices at Teradata).
Business understanding, data understanding, and data preparation are time-consuming because the data is not well organized, it comes from different sources, initial planning often falls short of the project scope, and there is insufficient focus on automation tools. A small data-preparation sketch follows below. Let us summarize some key tasks in steps 1, 2, and 3 of the CRISP-DM data process.
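As a concrete, hypothetical illustration of step 3 (data preparation), the sketch below uses pandas to clean a small invented extract; the column names and cleaning rules are assumptions for the example, not part of CRISP-DM itself:

```python
import pandas as pd

# Hypothetical "dirty" extract: duplicates, missing values, inconsistent text.
raw = pd.DataFrame(
    {
        "customer": ["Alice", "alice", "Bob", None],
        "spend": [120.0, 120.0, None, 50.0],
    }
)

cleaned = (
    raw.dropna(subset=["customer"])                           # drop rows with no customer
       .assign(customer=lambda d: d["customer"].str.title())  # normalize name casing
       .drop_duplicates()                                     # remove exact duplicate rows
       .fillna({"spend": 0.0})                                # fill missing spend with a default
)

print(cleaned)
```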
References
Report: Data warehouse failures commonplace. IT World Canada. https://www.itworldcanada.com/article/report-data-warehouse-failures-commonplace/18589
Taulli, T. (2019). Artificial Intelligence Basics: A Non-Technical Introduction. Apress.