
Hadoop and Big Data

7/22/2014

During the last ten years, the volume and diversity of digital information have grown at unprecedented rates. The amount of information is doubling every 18 months, and unstructured information volumes grow six times faster than structured ones.

Big data is today's trend. It has been defined as data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time.

Hadoop was created in 2005 by Doug Cutting and Mike Cafarella to address the big data issue. Doug Cutting named it after his son's toy elephant. It was originally developed for the Nutch search engine project. Nutch is an effort to build an open source web search engine based on Lucene and Java for the search and index component.

As of 2013, Hadoop adoption is widespread, and a number of companies offer commercial implementations of or support for Hadoop; more than half of the Fortune 50, for example, use it. Hadoop is designed to process terabytes and even petabytes of unstructured and structured data. It breaks large workloads into smaller data blocks that are distributed across a cluster of commodity hardware for faster processing.
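To make that map-and-reduce split concrete, here is a minimal word-count sketch using Hadoop Streaming, which lets plain scripts act as the mapper and reducer. The Python wording and script names are my own assumptions for illustration, not part of any particular product.

    # mapper.py -- Hadoop Streaming mapper: emit "word<TAB>1" for every word
    # in every line of the input split assigned to this mapper.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(word + "\t1")

    # reducer.py -- Hadoop Streaming reducer: input arrives grouped and sorted
    # by key, so counts can be accumulated per word and flushed on key change.
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            count += int(value)
            continue
        if current_word is not None:
            print(current_word + "\t" + str(count))
        current_word, count = word, int(value)
    if current_word is not None:
        print(current_word + "\t" + str(count))

Submitted through the hadoop-streaming jar with its -input, -output, -mapper, and -reducer options, Hadoop runs the mapper in parallel on each data block, sorts the intermediate pairs by key, and feeds them to the reducers, which is exactly the break-apart-and-distribute pattern described above.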

Ventana Research, a benchmark research and advisory services firm, published the results of its groundbreaking survey on enterprise adoption of Hadoop to manage big data. According to this survey:
  • More than one-half (54%) of organizations surveyed are using or considering Hadoop for large-scale data processing needs.
  • More than twice as many Hadoop users as users of other platforms report being able to create new products and services and enjoy cost savings; over 82% benefit from faster analysis and better utilization of computing resources.
  • 87% of Hadoop users are performing or planning new types of analysis with large scale data.
  • 94% of Hadoop users perform analytics on large volumes of data that were not possible before; 88% analyze data in greater detail; and 82% can now retain more of their data.
  • 63% of organizations use Hadoop in particular to work with unstructured data such as logs and event data.
  • More than two-thirds of Hadoop users perform advanced analysis such as data mining or algorithm development and testing.
Today, Hadoop is being used as a:
  • Staging layer: the most common use of Hadoop in enterprise environments is as “Hadoop ETL” — pre-processing, filtering, and transforming vast quantities of semi-structured and unstructured data for loading into a data warehouse (see the mapper sketch after this list).
  • Event analytics layer: large-scale log processing of event data: call records, behavioral analysis, social network analysis, clickstream data, etc.
  • Content analytics layer: next-best action, customer experience optimization, social media analytics. MapReduce provides the abstraction layer for integrating content analytics with more traditional forms of advanced analysis.
In addition, most existing vendors in the data warehousing space have announced integrations between their products and Hadoop/MapReduce.
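To make the “Hadoop ETL” staging use above a bit more concrete, here is a minimal Hadoop Streaming mapper sketch that filters and reshapes raw clickstream lines into clean, tab-separated records ready for warehouse loading. The four-field log layout (timestamp, user id, URL, HTTP status) is assumed purely for illustration.

    # etl_mapper.py -- staging-layer ("Hadoop ETL") mapper sketch:
    # drop malformed lines, keep only successful requests, and emit
    # clean tab-separated records for loading into the warehouse.
    import sys

    for line in sys.stdin:
        parts = line.strip().split()
        if len(parts) != 4:
            continue                 # drop malformed lines
        timestamp, user_id, url, status = parts
        if status != "200":
            continue                 # keep only successful requests
        print("\t".join((timestamp, user_id, url)))

A transform like this needs no reducer at all: the mapper output can be written straight back to HDFS and bulk-loaded into the data warehouse.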
Hadoop is particularly useful when:
  • Complex information processing is needed.
  • Unstructured data needs to be turned into structured data.
  • Queries can’t be reasonably expressed using SQL.
  • Algorithms are heavily recursive.
  • Complex but parallelizable algorithms are needed, such as geo-spatial analysis or genome sequencing.
  • Machine learning is involved.
  • Data sets are too large to fit into database RAM or disks, or require too many cores (tens of TB up to PB).
  • Data value does not justify the expense of constant real-time availability; archives or special-interest information can be moved to Hadoop and remain available at lower cost.
  • Results are not needed in real time.
  • Fault tolerance is critical.
  • Significant custom coding would otherwise be required to handle job scheduling.

Do Hadoop and Big Data Solve All Our Data Problems?

Hadoop provides a new, complementary approach to traditional data warehousing that helps deliver on some of the most difficult challenges of enterprise data warehouses. Of course, it’s not a panacea, but by making it easier to gather and analyze data, it may help move the spotlight away from the technology towards the more important limitations on today’s business intelligence efforts: information culture and the limited ability of many people to actually use information to make the right decisions.


Benefits of Automatic Content Classification

7/14/2014

I received a few questions about my posts on automatic content classification. I would like to thank my blog readers for these questions; this post follows up on them.

Organizations receive countless paper documents every day. These documents can be mail, invoices, faxes, or email. Even after organizations scan these documents, it is still difficult to manage and organize them.

To overcome the inefficiencies associated with paper and captured documents, companies should implement an intelligent classification system to organize captured documents.

With today’s document processing technology, organizations do not need to rely on manual classification or processing of documents. Organizations that replace manual sorting and classification with an automated document classification and processing system can realize a significant reduction in manual entry costs and improve the speed and turnaround time of document processing.

Recent research has shown that two-thirds of organizations cannot access their information assets or find vital enterprise documents because of poor information classification or tagging. The survey suggests that much of the problem may be due to manual tagging of documents with metadata, which can be inconsistent and riddled with errors, if it has been done at all.

There are a few solutions for automated document classification and recognition, among them SmartLogic's Semaphore, OpenText, Interwoven Metatagger, Documentum, CVISION Trapeze, and others. These solutions enable organizations to organize, access, and control their enterprise information.

They are cost effective and eliminate inconsistency, mistakes, and the huge manpower costs associated with manual classification. Putting in place an effective and consistent automatic content classification system that ensures quick and easy retrieval of the right documents means better access to corporate knowledge, improved risk management and compliance, superior customer relationship management, enhanced findability for key audiences, and an improved ability to monetize information.

Specific benefits of automatic content classification are:

More consistency. It produces the same unbiased results over and over. It might not always be 100% accurate or relevant, but if something goes wrong, it is at least easy to understand why.

Larger context. It enforces classification from the whole organization's perspective, not the individual's. For example, a person interested in sports might tag an article that mentions a specific player but forget or not consider the team and country topics.

Persistent. A person can only handle a certain number of incoming documents per day, while an automatic classification system works around the clock.

Cost effective. An automatic system can handle thousands of documents much faster than a person can.

Automatic document classification can be divided into three types:
  • Supervised document classification, where some external mechanism (such as human feedback) provides information on the correct classification for documents.
  • Unsupervised document classification (also known as document clustering), where the classification must be done entirely without reference to external information.
  • Semi-supervised document classification, where only part of the documents are labeled by the external mechanism.
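As a rough illustration of the difference between the supervised and unsupervised types, here is a small sketch using scikit-learn. The toy documents, labels, and library choice are assumptions made for illustration and are not tied to any of the products mentioned above.

    # Supervised vs. unsupervised document classification, in miniature.
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    docs = [
        "invoice payment due net 30 days",
        "please remit payment for the attached invoice",
        "agenda for the quarterly planning meeting",
        "minutes from the quarterly planning meeting",
    ]
    labels = ["invoice", "invoice", "meeting", "meeting"]   # human-provided

    vectors = TfidfVectorizer().fit_transform(docs)

    # Supervised: the classifier learns from the labeled examples and can
    # then assign the same labels to new, unseen documents.
    classifier = MultinomialNB().fit(vectors, labels)
    print(classifier.predict(vectors[:1]))                  # ['invoice']

    # Unsupervised (document clustering): no labels at all; documents are
    # grouped purely by textual similarity.
    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
    print(clusters)                                         # e.g. [1 1 0 0]

Semi-supervised approaches sit between the two: a small hand-labeled set seeds the model, and the bulk of the documents are then labeled automatically.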

Automate your content classification and consider using manual labor mainly for quality checking and approval of content.

In my next post on this topic, I will describe the role of automatic classification in records management and information governance.
