Many organizations are adopting Hadoop in their IT
infrastructure. For veteran big data adopters with a strong engineering team,
it is usually not a big problem to design the target system, choose a
technology stack, and begin implementation. Even those with a lot of experience
can still occasionally run into obstacles with all of the complexity, but
Hadoop beginners face a myriad of challenges just to get started. Below are
the most commonly seen Hadoop challenges which Grid Dynamics resolves for
its clients.
Diversity of Vendors: which to pick?
The common first reaction is to use the original
Hadoop binaries from the Apache website, but this quickly leads to the
realization of why only a few companies use them "as-is" in production
environments. There are plenty of good arguments not to do this.
Then panic comes with the realization of just how many Hadoop
distributions are available, from the freely available Hortonworks, Cloudera,
and MapR, up to the big commercial IBM InfoSphere BigInsights and Oracle Big
Data Appliance. Oracle even includes hardware! Things become even more tangled
after a few introductory calls with the vendors. Choosing the right distribution
is not an easy task, even for experienced staff, because each of them
embeds different Hadoop components (like Cloudera Impala in CDH), different
configuration managers (Ambari, Cloudera Manager, and so on), and a different
overall vision of a Hadoop mission.
MapReduce programming is not a good match for all problems.
It’s good for
simple information requests and problems that can be divided into independent
units, but it's not efficient for iterative and interactive analytic tasks.
MapReduce is file-intensive. Because the nodes don’t intercommunicate except
through sorts and shuffles, iterative algorithms require multiple
map-shuffle/sort-reduce phases to complete. This creates multiple files between
MapReduce phases and is inefficient for advanced analytic computing.
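To make the trade-off concrete, below is a minimal sketch of the classic word count job written against the standard Hadoop MapReduce Java API; the class name and input/output paths are illustrative. A single pass like this suits MapReduce well, while an iterative algorithm would have to run a chain of such jobs, materializing intermediate files in HDFS between rounds.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token; each input split is an
  // independent unit of work, which is exactly what MapReduce is good at.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reducer: sums the counts for each word after the shuffle/sort phase,
  // the only point at which nodes exchange data.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // An iterative algorithm would wrap this in a loop, writing intermediate
    // results to HDFS between rounds -- the file-intensive pattern described above.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}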
Talent gap:
There’s a widely acknowledged talent gap. It can be
difficult to find entry-level programmers who have sufficient Java skills to be
productive with MapReduce. That's one reason distribution providers are racing
to put relational (SQL) technology on top of Hadoop. It is much easier to find
programmers with SQL skills than MapReduce skills. And, Hadoop administration
seems part art and part science, requiring low-level knowledge of operating
systems, hardware and Hadoop kernel settings.
SQL on Hadoop: very popular, but not clear-cut
Hadoop stores a lot of data. Apart from processing
according to predefined pipelines, businesses want to get more value by giving
interactive access to data scientists and business analysts. Marketing buzz
on the Internet even pushes them to do this, implying, but not clearly stating,
competitiveness with enterprise data warehouses. The situation here is similar
to the diversity of vendors, since there are too many frameworks that provide
"interactive SQL on top of Hadoop," but the challenge is not in selecting the
best one. Understand that currently none of them is an equal replacement
for traditional OLAP databases. Alongside many obvious strategic
advantages, there are debatable shortcomings in performance, SQL compliance,
and ease of support. This is a different world, and you should either play by
its rules or not consider it a replacement for traditional approaches.
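To illustrate what interactive SQL access looks like in practice, here is a minimal sketch that queries a Hive table through HiveServer2 over JDBC; the host, port, user, and table name are assumptions made for this example. The same aggregation that takes a full MapReduce program in Java is a one-line query for anyone with SQL skills.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Requires the hive-jdbc driver on the classpath; it registers itself
    // with DriverManager. HiveServer2 typically listens on port 10000.
    String url = "jdbc:hive2://hive-server.example.com:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "analyst", "");
         Statement stmt = conn.createStatement();
         // The aggregation that needed a full MapReduce program above is a
         // single SQL statement here.
         ResultSet rs = stmt.executeQuery(
             "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word")) {
      while (rs.next()) {
        System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
      }
    }
  }
}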
Full-fledged data
management and governance:
Hadoop does not
have easy-to-use, full-featured tools for data management, data cleansing,
governance, and metadata. Especially lacking are tools for data quality and
standardization.
Data security:
Another challenge centers on the fragmented state of data security,
though new tools and technologies are surfacing. The Kerberos
authentication protocol is a great step forward in making Hadoop environments
secure.
Secured Hadoop environment: a source of headaches.
More and more companies are storing sensitive data in
Hadoop. Hopefully not credit card numbers, but at least data which falls under
security regulations with their respective requirements. So this challenge is
purely technical, but it often causes issues. Things are simple if only HDFS
and MapReduce are used: encryption is available for data in motion and at rest,
file system permissions are enough for authorization, and Kerberos is used for
authentication. Just add perimeter- and host-level security with explicit edge
nodes and be calm. But once you decide to use other frameworks, especially ones
that execute requests under their own system user, you are diving into trouble.
First, not all of them support a Kerberized environment. Second, they might not
have their own authorization features. Third, data-in-motion encryption is
frequently absent. And, finally, there is a lot of trouble if requests are
supposed to be submitted from outside the cluster.
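For the simple HDFS-plus-Kerberos case described above, a client can authenticate from a keytab before touching the file system. Below is a minimal sketch using Hadoop's UserGroupInformation API; the principal, keytab path, and data path are assumptions made for this example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberizedHdfsClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Tell the Hadoop client libraries to use Kerberos instead of simple auth.
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);

    // Log in from a keytab so the client holds a valid Kerberos credential;
    // principal and keytab path are illustrative.
    UserGroupInformation.loginUserFromKeytab(
        "etl-user@EXAMPLE.COM", "/etc/security/keytabs/etl-user.keytab");

    // Once authenticated, normal HDFS calls work; authorization still relies
    // on file system permissions, as discussed above.
    try (FileSystem fs = FileSystem.get(conf)) {
      for (FileStatus status : fs.listStatus(new Path("/data/sensitive"))) {
        System.out.println(status.getPath());
      }
    }
  }
}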
Conclusion
We pointed out a few topical challenges as we see them.
Of course, the list above is far from complete, and one could be scared
off by it, deciding not to use Hadoop at all or to postpone its adoption
until some later time. That would not be wise. There is a whole list of
advantages Hadoop brings to organizations with skillful hands. In
cooperation with other Big Data frameworks and techniques, it can move the
capabilities of a data-oriented business to an entirely new level of performance.