HDFS Schema Design

6 Mar

Hadoop’s Schema-on-Read model does not impose any requirements when loading data into Hadoop.

Data can be loaded into HDFS without associating a schema with it or preprocessing it. Creating a carefully structured and organized repository of your data nevertheless provides many benefits, such as the ability to enforce access and quota controls to prevent accidental deletion or corruption.

The data model will be highly dependent on the specific use case. For example, data warehouse implementations and other event stores are likely to use a schema similar to the traditional star schema, with structured fact and dimension tables. Unstructured and semi-structured data, on the other hand, are likely to rely more on directory placement and metadata management.

Develop standard practices and enforce them, especially when multiple teams are sharing the data.

Make sure your design will work well with the tools you are planning to use. The schema design is highly dependent on the way the data will be queried.

Keep usage patterns in mind when designing a schema. Different data processing and querying patterns work better with different schema designs. Understanding the main use cases and data retrieval requirements will result in a schema that will be easier to maintain and support in the long term as well as improve data processing performance.

Optimize the organisation of data with partitioning, bucketing, and denormalization strategies. Keep in mind that storing a large number of small files in Hadoop can lead to excessive memory use on the NameNode.
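Partitioning usually shows up on disk as a directory hierarchy, one directory per partition value, so that queries can skip whole directories. A minimal sketch of building such paths (the base path, table, and column names here are illustrative assumptions, not a Hadoop standard):

```python
from datetime import date

def partition_path(base, table, dt):
    """Build a Hive-style partitioned directory path for a daily table.

    The /data/warehouse layout and the year/month/day partition columns
    are made up for illustration.
    """
    return f"{base}/{table}/year={dt.year}/month={dt.month:02d}/day={dt.day:02d}"

print(partition_path("/data/warehouse", "clicks", date(2014, 3, 6)))
# /data/warehouse/clicks/year=2014/month=03/day=06
```

A query restricted to a single day then only has to read the files under that one leaf directory.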

A good average bucket size is a few multiples of the HDFS block size. An even distribution of data when hashed on the bucketing column is important because it produces consistently sized buckets. It is also common to choose a power of two for the number of buckets.
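The bucketing idea can be sketched as hashing the bucketing column modulo the bucket count; with a power-of-two bucket count, the modulo reduces to a cheap bitwise AND. This is a rough illustration of the concept, not Hive's actual bucketing hash function:

```python
from collections import Counter

def bucket_for(key, num_buckets=16):
    """Assign a record to a bucket by hashing the bucketing column.

    With num_buckets a power of two, `% num_buckets` is equivalent to
    a bitwise AND with (num_buckets - 1).
    """
    return hash(key) & (num_buckets - 1)

# A well-distributed key (here: synthetic user ids) spreads records
# evenly across buckets, giving consistently sized buckets.
counts = Counter(bucket_for(f"user-{i}") for i in range(100_000))
```

If the bucketing column is skewed (say, mostly one customer id), the buckets come out uneven and the benefits of bucketing are lost.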

A Hadoop schema typically consolidates many of the small dimension tables into a few larger dimension tables by joining them during the ETL process.

Hadoop File Types – Best practices

5 Mar

Hadoop-specific file formats include columnar formats such as Parquet and RCFile, serialization formats like Avro, and file-based data structures such as SequenceFiles.

Splittability and compression are the key considerations when storing data in Hadoop. Splittability allows large files to be split for input to MapReduce and other types of jobs, and is a fundamental part of parallel processing.


SequenceFiles store data as binary key-value pairs and can be uncompressed or compressed. SequenceFiles are well supported within the Hadoop ecosystem; however, their support outside the ecosystem is limited, and they are only supported in Java.

Storing a large number of small files in Hadoop can cause a couple of issues, so a common use case for SequenceFiles is as a container for smaller files.
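To get a rough feel for why small files hurt, the NameNode keeps an in-memory object for every file and block; roughly 150 bytes per object is the commonly cited rule of thumb (an approximation, not an exact figure):

```python
BYTES_PER_OBJECT = 150  # rough rule of thumb for NameNode heap per file/block object

def namenode_heap_mb(num_files, blocks_per_file=1):
    """Estimate NameNode heap (MB) consumed by metadata for num_files files."""
    objects = num_files * (1 + blocks_per_file)  # one file object plus its blocks
    return objects * BYTES_PER_OBJECT / 1024**2

# Ten million one-block files tie up gigabytes of NameNode heap,
# regardless of how tiny the files themselves are.
print(f"{namenode_heap_mb(10_000_000):.0f} MB")
```

Packing those small files into a handful of SequenceFiles collapses millions of namespace objects into a few.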

Serialization Formats

Serialization is the process of turning data structures into byte streams, primarily for data storage and transmission. The main serialization format utilized by Hadoop is Writables. Writables are compact and fast, but limited to Java.
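Writables themselves are Java-only, but the core idea of serialization can be illustrated in any language: a structured record becomes a compact byte stream and back. This generic sketch uses fixed-width binary packing, not the actual Writable wire format:

```python
import struct

# A (user_id, score) record as a fixed-width binary layout:
# big-endian 8-byte signed int followed by an 8-byte double.
RECORD = struct.Struct(">qd")

def serialize(user_id, score):
    return RECORD.pack(user_id, score)

def deserialize(buf):
    return RECORD.unpack(buf)

raw = serialize(42, 3.5)
print(len(raw))          # 16 bytes, more compact than the same record as text
print(deserialize(raw))  # (42, 3.5)
```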

However, other serialization frameworks are gaining traction within the Hadoop ecosystem, including Thrift, Protocol Buffers, and Avro. Of these, Avro is the most efficient, and it was created specifically to address the limitations of Hadoop Writables.


Thrift was designed at Facebook as a framework for developing cross-language interfaces to services. Using Thrift allowed Facebook to implement a single interface that can be used from different languages to access different underlying systems.

Thrift does not support internal compression of records, is not splittable, and has no native MapReduce support.

Protocol Buffers

The Protocol Buffer (protobuf) was developed at Google to facilitate data exchange between services written in different languages. Protocol Buffers are not splittable, do not support internal compression of records, and have no native MapReduce support.


Avro is a language-neutral data serialization system designed to address the main downside of Hadoop Writables: lack of language portability. Since Avro stores the schema in the header of each file, it is self-describing, and Avro files can easily be read in a different language than the one used to write them. Avro is also splittable.

Avro stores the data definition in JSON format, making it easy to read and interpret, while the data itself is stored in a binary format, making it compact and efficient.

Avro supports native MapReduce and schema evolution. The schema used to read a file does not need to match the schema used to write it, which provides great flexibility when requirements change.

Avro supports a number of data types such as Boolean, int, float, and string. It also supports complex types such as array, map, and enum.
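Since the data definition is plain JSON, a schema is easy to show. A hypothetical event record using several of the types above (all names here are illustrative):

```json
{
  "type": "record",
  "name": "Event",
  "namespace": "com.example",
  "fields": [
    {"name": "id",        "type": "long"},
    {"name": "timestamp", "type": "long"},
    {"name": "source",    "type": "string"},
    {"name": "tags",      "type": {"type": "array", "items": "string"}},
    {"name": "level",
     "type": {"type": "enum", "name": "Level", "symbols": ["DEBUG", "INFO", "ERROR"]},
     "default": "INFO"}
  ]
}
```

Adding a new field with a "default" is what makes schema evolution work in practice: a reader using the new schema can still consume files written before the field existed.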

Columnar Formats

Most RDBMSs store data in a row-oriented format. This is efficient when many columns of a record need to be fetched. Row-oriented storage can also be more efficient when writing data, particularly if all columns of the record are available at write time, because the record can be written with a single disk seek.

More recently, a number of databases have introduced columnar data storage, which is well suited to data warehousing and to queries that access only a small subset of columns. Columnar data sets also provide more efficient compression.
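The difference can be sketched with a toy layout of the same three records both ways; real formats add encoding and compression on top, but the access pattern is the point:

```python
# Row-oriented: each record stored contiguously. Fetching one column
# still walks every record.
rows = [
    (1, "alice", 9.5),
    (2, "bob",   7.2),
    (3, "carol", 8.8),
]
scores_from_rows = [r[2] for r in rows]

# Column-oriented: each column stored contiguously. A query needing one
# column reads only that column, and similar adjacent values compress well.
columns = {
    "id":    [1, 2, 3],
    "name":  ["alice", "bob", "carol"],
    "score": [9.5, 7.2, 8.8],
}
scores_from_columns = columns["score"]

assert scores_from_rows == scores_from_columns
```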

Columnar file formats supported on Hadoop include the RCFile format, Optimized Row Columnar (ORC), and Parquet.


The RCFile format was developed to provide fast data loading, fast query processing, and efficient processing for MapReduce applications, although it has mostly seen use as a Hive storage format.

RCFile breaks files into row splits, then uses column-oriented storage within each split. It has some deficiencies that prevent optimal query performance and compression, but it remains a fairly common Hive storage format.


The ORC format was created to address some of the weaknesses of the RCFile format, specifically around storage and query efficiency. ORC provides lightweight, always-on compression via type-specific readers and writers, supports the Hive type model (including new primitives such as decimal and complex types), and is a splittable storage format.


The Parquet documentation says: “Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.”

Parquet shares many of the same design goals as ORC, but is intended to be a general-purpose storage format for Hadoop. The goal is to create a format that’s suitable for different MapReduce interfaces such as Java, Hive, and Pig, and also suitable for other processing engines such as Impala and Spark. Parquet provides the following benefits, many of which it shares with ORC:

• Similar to ORC files, Parquet allows for returning only required data fields, thereby reducing I/O and increasing performance.
• Is designed to support complex nested data structures.
• Compression can be specified on a per-column level.
• Can be read from and written to with the Avro and Thrift APIs.
• Stores full metadata at the end of files, so Parquet files are self-documenting.
• Uses efficient and extensible encoding schemes, for example bit-packing and run-length encoding (RLE).
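The run-length encoding the last bullet mentions is easy to sketch; this shows the idea only, not Parquet's actual RLE/bit-packing hybrid format:

```python
def rle_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]

# Low-cardinality columns (flags, enums, sorted keys) shrink dramatically.
col = ["INFO"] * 5 + ["ERROR"] * 2 + ["INFO"] * 3
print(rle_encode(col))  # [('INFO', 5), ('ERROR', 2), ('INFO', 3)]
```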


Having a single interface to all the files in your Hadoop cluster is valuable. When picking a file format, you will want one with a schema because, in the end, most data in Hadoop will be structured or semi-structured.

So if you need a schema, Avro and Parquet are great options. However, we don’t want to have to maintain an Avro version of the schema and a Parquet version.

Thankfully, this isn’t an issue because Parquet can be read and written to with Avro APIs and Avro schemas.

We can meet our goal of having one interface for interacting with our Avro and Parquet files, and we have both a block-based (Avro) and a columnar (Parquet) option for storing our data.

Building an Enterprise Data Management Strategy

24 Jan

The “IT Transformation” of an organisation from its legacy environment to the next generation of technology is one of the most complex and expensive changes an organisation can undergo. Key challenges include:

  • How to improve and optimise business processes
  • How to manage information across the enterprise
  • How to safely migrate from the legacy to the contemporary environment
  • How to deliver on a transition strategy that provides incremental functionality while mitigating risk and staying within budget
  • How to define an improvement strategy for your people, processes, and organisation as well as the technology

Of all these factors, how information is managed is often the biggest limiter to success.

In the 21st century, flexibility in accessing and using information will be king. To solve the transformation challenge, use a “Balanced View” model of the enterprise.

Taking an Information Development approach means that we re-balance the work we do to focus on information as much as we focus on function, processes and infrastructure.

In organisations undergoing significant technology change, the problem isn’t usually whether the new applications can provide the required functionality; it’s often the data.

Data quality is, and has been, a primary problem in project failures – and the issue isn’t going away.

Business Blueprint, Technology Blueprint, and Roadmap are crucial phases in a successful IT transformation. The Strategic Vision Leads to Continuous Implementation.

The MIKE2.0 Methodology (MIKE stands for Method for an Integrated Knowledge Environment): An Open Source Methodology for Information Development designed by BearingPoint, Inc.

Extraordinary Leadership in Australia and New Zealand

15 Nov

Leadership is for everyone. Leadership is taking responsibility and making a difference.

Leadership engages people and brings out the best in them.

5 Practices of Exemplary Leadership:

Model the Way:

Clarify and share values. What do you stand for? What do you stand against?

Clarify your personal values and formulate a leadership philosophy.

Credibility is the foundation of leadership.

Inspire a Shared Vision:

Imagine exciting and ennobling possibilities.

People crave being part of something exciting and inspiring. Articulate a common purpose.

Challenge the Process:

Look for innovative ways to improve, experiment and take risks.

Generate small wins and learn from experience.

Be willing to take risks. Never waste a failure by not learning from it.

Praise people for taking initiative.

Enable Others to Act:

Foster collaboration by building trust and facilitating relationships.

Strengthen others by increasing self-determination and developing competence. Empower others.

Value, respect, and understand talents. Create high-trust climate.

Leadership is about relationships. Communicating with clear expectations and guidance.

Take quality time to coach, mentor and uplift people.

Encourage the Heart:

Recognize contributions by showing appreciation for individual excellence.

Celebrate the values and victories by creating spirit of community. Celebrate along the journey.

Refuel and energize. Create a sense of community.

Working With Emotional Intelligence

29 Jun

Working With Emotional Intelligence takes the concepts from Daniel Goleman’s bestseller, Emotional Intelligence, into the workplace. Business leaders and outstanding performers are not defined by their IQs or even their job skills, but by their “emotional intelligence”: a set of competencies that distinguishes how people manage feelings, interact, and communicate.

Analyses done by dozens of experts in 500 corporations, government agencies, and nonprofit organizations worldwide conclude that emotional intelligence is the barometer of excellence on virtually any job. This book explains what emotional intelligence is and why it counts more than IQ or expertise for excelling on the job. It details 12 personal competencies based on self-mastery (such as accurate self-assessment, self-control, initiative, and optimism) and 13 key relationship skills (such as service orientation, developing others, conflict management, and building bonds). Goleman includes many examples and anecdotes–from Fortune 500 companies to a nonprofit preschool–that show how these competencies lead to or thwart success.

So Good They Can’t Ignore You

21 Sep

In this eye-opening account, Cal Newport debunks the long-held belief that “follow your passion” is good advice. Not only is the cliché flawed (preexisting passions are rare and have little to do with how most people end up loving their work) but it can also be dangerous, leading to anxiety and chronic job hopping.

After making his case against passion, Newport sets out on a quest to discover the reality of how people end up loving what they do. Spending time with organic farmers, venture capitalists, screenwriters, freelance computer programmers, and others who admitted to deriving great satisfaction from their work, Newport uncovers the strategies they used and the pitfalls they avoided in developing their compelling careers.

Matching your job to a preexisting passion does not matter, he reveals. Passion comes after you put in the hard work to become excellent at something valuable, not before.
In other words, what you do for a living is much less important than how you do it.

With a title taken from the comedian Steve Martin, who once said his advice for aspiring entertainers was to “be so good they can’t ignore you,” Cal Newport’s clearly written manifesto is mandatory reading for anyone fretting about what to do with their life, or frustrated by their current job situation and eager to find a fresh new way to take control of their livelihood. He provides an evidence-based blueprint for creating work you love.

SO GOOD THEY CAN’T IGNORE YOU will change the way we think about our careers, happiness, and the crafting of a remarkable life.

The 5 Elements of Effective Thinking

4 Nov

The 5 Elements of Effective Thinking presents practical, lively, and inspiring ways for you to become more successful through better thinking. The idea is simple: You can learn how to think far better by adopting specific strategies. Brilliant people aren’t a special breed–they just use their minds differently. By using the straightforward and thought-provoking techniques in The 5 Elements of Effective Thinking, you will regularly find imaginative solutions to difficult challenges, and you will discover new ways of looking at your world and yourself–revealing previously hidden opportunities.

Surprisingly inspiring.

Understand deeply. Understand simple things first. See what’s there and what’s missing. Master the basics. See the invisible.

Fail to succeed. Fail better. Let your errors be your guide. Have a bad day. Learn from those missteps.

Create questions out of thin air. What’s the real question? Improve the question. Ask meta-questions. Teach to learn.

See the flow of ideas. Create new ideas from old ones. Think back. Extend ideas.

Transform yourself. Change.



The Icarus Deception

17 May

Everyone knows that Icarus’s father made him wings and told him not to fly too close to the sun; he ignored the warning and plunged to his doom. The lesson: Play it safe. Listen to the experts. It was the perfect propaganda for the industrial economy. What boss wouldn’t want employees to believe that obedience and conformity are the keys to success?
But we tend to forget that Icarus was also warned not to fly too low, because seawater would ruin the lift in his wings. Flying too low is even more dangerous than flying too high, because it feels deceptively safe.
The safety zone has moved. Conformity no longer leads to comfort. But the good news is that creativity is scarce and more valuable than ever. So is choosing to do something unpredictable and brave: Make art. Being an artist isn’t a genetic disposition or a specific talent. It’s an attitude we can all adopt. It’s a hunger to seize new ground, make connections, and work without a map. If you do those things you’re an artist, no matter what it says on your business card.
Godin shows us how it’s possible and convinces us why it’s essential.

Data Analysis Using SQL and Excel

1 Jul

Useful business analysis requires you to effectively transform data into actionable information. This book helps you use SQL and Excel to extract business information from relational databases and use that data to define business dimensions, store transactions about customers, produce results, and more. Each chapter explains when and why to perform a particular type of business analysis in order to obtain useful results, how to design and perform the analysis using SQL and Excel, and what the results should look like.

MCTS Self-Paced Training Kit (Exam 70-433): Microsoft® SQL Server® 2008 Database Development

10 Jun

Ace your preparation for the skills measured by MCTS Exam 70-433—and on the job. Work at your own pace through a series of lessons and reviews that fully cover each exam objective. Then, reinforce and apply what you’ve learned through real-world case scenarios and practice exercises. This official Microsoft study guide is designed to help you make the most of your study time.

Maximize your performance on the exam by learning to:

  • Create and manage database objects
  • Query and modify data; implement subqueries and CTEs
  • Optimize table structures and data integrity
  • Create stored procedures, functions, and triggers
  • Manage transactions, error handling, and change tracking
  • Tune query performance
  • Implement database mail, full-text search, Service Broker, scripts
  • Work with XML and SQLCLR

Assess your skills with the practice tests on CD. You can work through hundreds of questions using multiple testing modes to meet your specific learning needs. You get detailed explanations for right and wrong answers—including a customized learning path that describes how and where to focus your studies.