The Devil is in the Data

Greg Knieriemen
Published in Enterprise Te.ch
Mar 6, 2019

Data, and how it’s managed, may be the biggest barrier enterprises face when trying to drive business value from AI technologies.

According to a survey published this week by PwC, poor data reliability, with data “acquired haphazardly,” was ranked highest among the six major obstacles to monetizing data.

The survey of 300 executives at US companies with revenues of $500 million or more also revealed that new data protection and privacy regulations, data security, siloed data, a lack of talent and poor IT systems are leading roadblocks to AI adoption.

Source: PwC, Trusted Data Optimization Pulse Survey, February 2019

The challenge of poor data quality has consequences: bad data leads to bad analysis and potential bias. This gets even more complicated as diverse, siloed and aging data types are housed in systems and architectures that weren’t built for ingestion and deep analysis. Even with the best talent, data science can only be as effective as the data and its supporting architecture.

And it will get even more complicated. IDC predicts that by 2025, “more than a quarter of the global data set will be real time in nature, and real-time IoT data will make up more than 95% of it”. Data sources will also become more distributed across on-premises systems, public cloud environments and the edge.

But doing it right can also reap big benefits. PwC notes that research from the University of Texas demonstrates that by increasing data’s usability by just 10%, Fortune 1000 companies could, on average, increase revenue by two billion dollars annually.

To get there, the challenge is to build a foundation that can accommodate and support diverse data types and locations, whether the data sits in a public cloud, on-premises or broadly distributed at the edge. The architecture has to support a data pipeline purpose-built to move data from ingest through classification, transformation and analytics to machine learning and deep learning model training and retraining, and finally through inference to produce meaningful insights.
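To make those stages concrete, here is a minimal, purely illustrative Python sketch of such a pipeline. The stage functions and the toy “sensor” records are hypothetical placeholders, not any particular vendor’s implementation.

```python
# Minimal, illustrative sketch of a data pipeline: ingest -> classify -> transform
# -> train/retrain -> inference. All stage functions and records are hypothetical.
from typing import Any, Dict, List


def ingest(source: str) -> List[Dict[str, Any]]:
    # In practice this would read from an edge device, object store or NFS share.
    return [{"source": source, "value": 41.0}, {"source": source, "value": 43.0}]


def classify(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    # Tag records so downstream stages know how to treat them.
    return [{**r, "category": "sensor"} for r in records]


def transform(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    # Normalize/cleanse before analytics and training.
    return [{**r, "value": float(r["value"])} for r in records]


def train(records: List[Dict[str, Any]]) -> Dict[str, float]:
    # Stand-in for ML/DL model training and retraining.
    return {"mean": sum(r["value"] for r in records) / len(records)}


def infer(model: Dict[str, float], record: Dict[str, Any]) -> float:
    # Produce an "insight": deviation from the learned mean.
    return record["value"] - model["mean"]


if __name__ == "__main__":
    data = transform(classify(ingest("edge-sensor-01")))
    model = train(data)
    print(infer(model, data[0]))  # -1.0
```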

While AI/ML/DL projects can be heavily burdened by poor data, being able to inventory and classify that data is itself part of the data pipeline model. The right ecosystem of infrastructure, software and validated blueprints can simplify and automate the data preparation process. Data cleansing is part of the data prep stage, and RAPIDS, the suite of open source machine learning libraries championed by Nvidia, is one of the popular toolsets used to speed it up through automation and GPU acceleration. Besides improving data scientist productivity, these toolsets deliver better model accuracy while driving down costs.
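As a hedged illustration of what GPU-accelerated data cleansing with RAPIDS can look like, the sketch below uses cuDF’s pandas-like API. The file name and column names are invented for the example, and exact behavior can vary by RAPIDS release.

```python
# Sketch of GPU-accelerated data cleansing with RAPIDS cuDF (pandas-like API).
# Requires a CUDA-capable GPU and the cudf package; file/columns are hypothetical.
import cudf

# Load raw records directly into GPU memory.
df = cudf.read_csv("sensor_readings.csv")

# Basic cleansing: drop exact duplicates and rows missing a key field,
# then fill remaining gaps with a column median.
df = df.drop_duplicates()
df = df.dropna(subset=["device_id"])
df["temperature"] = df["temperature"].fillna(df["temperature"].median())

# Simple normalization ahead of model training.
df["temperature_z"] = (df["temperature"] - df["temperature"].mean()) / df["temperature"].std()

print(df.head())
```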

But some of the most critical decisions for effective data management come down to the compute, storage, file system and networking architecture.

Compute: Match purpose-built hardware to the right stage. While CPUs can be leveraged in the data preparation stage (where they are suitable for data normalization and transformation), GPUs are better suited for the parallel computation needed for DL model training.
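A minimal sketch of that division of labor, assuming PyTorch and a toy model and dataset, might look like this: normalization happens on the CPU, while training runs on a GPU when one is available.

```python
# Sketch: data preparation on the CPU, model training on the GPU (PyTorch, toy data).
import torch
import torch.nn as nn

# Data preparation (normalization) runs on the CPU.
features = torch.randn(1024, 16)
features = (features - features.mean(dim=0)) / features.std(dim=0)
labels = torch.randint(0, 2, (1024,))

# Model training runs on the GPU when one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for _ in range(5):  # a few toy training steps
    optimizer.zero_grad()
    loss = loss_fn(model(features.to(device)), labels.to(device))
    loss.backward()
    optimizer.step()

print(f"final loss: {loss.item():.4f}")
```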

Storage: To keep the GPUs fed with fresh data, you need a storage system that can deliver the high bandwidth that model training requires.
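On the software side of the same problem, parallel data loading helps keep GPUs supplied between storage reads. The hedged PyTorch sketch below uses worker processes and pinned memory; the dataset here is synthetic, and the real limit remains the storage system’s bandwidth.

```python
# Sketch: parallel data loading to keep the GPU fed (PyTorch, synthetic data).
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(10_000, 16), torch.randint(0, 2, (10_000,)))

    # Worker processes prepare batches in parallel; pinned memory plus
    # non_blocking=True overlaps host-to-GPU copies with compute.
    loader = DataLoader(dataset, batch_size=256, num_workers=4, pin_memory=True)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    for batch_features, batch_labels in loader:
        batch_features = batch_features.to(device, non_blocking=True)
        batch_labels = batch_labels.to(device, non_blocking=True)
        # ... forward/backward pass would go here ...
```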

File Systems: Depending on the DL workflow, the characteristics of the data streams can vary. In many cases, the data traffic in DL consists of millions of files (images, video, audio, text files). NFS is well suited to delivering high performance across this diverse range of workloads; it handles both random and sequential I/O well and can scale both up and out.
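To illustrate that “millions of files” access pattern, here is a small PyTorch Dataset sketch that indexes individual files under a hypothetical NFS mount point; each item fetch is one small read against the file system.

```python
# Sketch: a dataset of many individual files under a hypothetical NFS mount.
from pathlib import Path

import torch
from torch.utils.data import Dataset


class SmallFileDataset(Dataset):
    """Each __getitem__ is one small, effectively random read against the file system."""

    def __init__(self, root: str = "/mnt/nfs/training_data"):
        self.paths = sorted(Path(root).rglob("*.bin"))

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int) -> torch.Tensor:
        raw = self.paths[idx].read_bytes()
        return torch.frombuffer(bytearray(raw), dtype=torch.uint8)
```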

Networking: Support for high-speed transport links is essential to prevent networking bottlenecks in the infrastructure. Scaling DL workloads across multiple systems requires a network transport that provides extremely low-latency, high-bandwidth communication.
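A common way to do that multisystem scaling in practice is distributed data parallel training. The sketch below, assuming PyTorch and a launcher such as torchrun, shows the initialization step using the NCCL backend, which rides on exactly the kind of high-bandwidth, low-latency fabric described above; the model is a toy placeholder.

```python
# Sketch: multi-node training setup with PyTorch DistributedDataParallel over NCCL.
# Typically launched with torchrun, which sets RANK, WORLD_SIZE and LOCAL_RANK.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # NCCL exchanges gradients over the fastest available fabric
    # (e.g. InfiniBand or RoCE), so network latency and bandwidth directly
    # bound how well training scales across systems.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(16, 2).cuda(local_rank)  # toy placeholder model
    model = DDP(model, device_ids=[local_rank])

    # ... training loop with a DistributedSampler would go here ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```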

Poor data doesn’t have to be a barrier to implementing a successful AI project that drives real business value, but getting there requires a focus on the core components:

· A unified data pipeline — from edge to core to cloud

· Validated and proven architecture for data management regardless of location

· Simple deployment that eliminates complexity and guesswork

· Performance and scalability to start small and grow non-disruptively

Additional info:

IDC: “Infrastructure Considerations for AI Data Pipelines”

NetApp: Make Your Data Pipeline Super-Efficient by Unifying Machine Learning and Deep Learning

Technical white paper: Edge to Core to Cloud Architecture for AI

Use case: Cambridge Consultants Breaks Artificial Intelligence Limits

Webcast: Driving Innovation with AI: A Cross-Industry View

Events: NVIDIA GTC San Jose, March 18–21


NetApp Chief Technologist. Live in The Land, work in The Valley. Opinions here are simply mine.