Ware House Solutions

  • Subscribe to our RSS feed.
  • Twitter
  • StumbleUpon
  • Reddit
  • Facebook
  • Digg

Wednesday, 1 May 2013

Big Data in the Real World - Part 1 of 2

Posted on 12:46 by Unknown
There is so much talk of Big Data nowadays, mostly from sales people but also from technologists. Karen Lopez aka DataChick took issue at the Wikipedia entry on Big Data which describes it as "data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications". I think that definition is pretty good.

Over the last few months I've tackled a problem which within my environment counts as a Big Data problem. The solution is outlined in these next 2 articles.

Clickstream Conversion Data Collection Process

The daily process

Clickstream data is collected on a schedule, currently daily, via an SSIS package which is initiated by a step in a SQL Agent job. The underlying package uses the timestamp of the last successful load as a starting point for the next process which allows flexibility in the collection schedule.

The data source

All clickstream data is collected in a Key-Value-Pair (KVP) database. There are 3 main tables used to store the KVP data. Those tables are;

  • LogFlow - information about users web journey 
  • LogQuote - all data related to a quote 
  • LogSales - all data related to a sale 

Only the last 5 months of data stored. This is a hangover from legacy environment where storage was limited. We wish to retain all the data so we collect it each night and store it in our Data Warehouse.

Staging data

As we perform a number of actions on the data once collected and we don’t want to run those actions over highly transactional replicated tables a set of staging tables exists which mirror exactly the schema on both out production and replication servers. Diagram 1 shows the SSIS steps for this section of the process. The staging tables are truncated to remove data from the last process before data flow tasks collect the data since the last successful load. This achieved using a view of the source table filtered to return only the rows created after the last datetime in the process control metadata.

The KVP data is large due to the volume of transactions and because the values for each row are contained in NVARCHAR(1000) format. For each action on our website over 150 rows are currently written to database. Often the data in the NVARCHAR(1000) key value column only uses a fraction of that space. It’s a drawback of the KVP implementation and as a result the database is approximately 90% smaller when converted to a relational schema.

Schema processing

Following on from a constraint for the KVP implementation we have an advantage. The KVP key names can be added to at any time which allows us to collect new data without extensive reconfiguration work. However, this means when collecting the data we cannot transform it into a rigid schema because the schema may change.

In addition, we can’t constrain the data within the key values to definite data types as this may also change therefore we continue to store the data – at least during collection – as NVARCHAR columns.

Finally the length of the data within the NVARCHAR columns can vary. We know the maximum lengths in the current data for each key name but that could change in future consequently we have to allow for potential changes.



To account for these data constraints we must scan the incoming data for new key names and key values which are larger than previously seen before each data load. If we detect new key names we must create a column to store that data from now on. If we detect a key value has been collected which is larger than the current NVARCHAR length in the storage schema then we must expand that column. These two actions require prior knowledge of the key names and the length of the key values which we store in a single metadata table. This table continues to be used throughout the remainder of the collection process.

This post is continued in part 2.
Email ThisBlogThis!Share to XShare to Facebook
Posted in Big Data, Data Cleansing, Data Warehouse, ETL, SSIS | No comments
Newer Post Older Post Home

0 comments:

Post a Comment

Subscribe to: Post Comments (Atom)

Popular Posts

  • SQL Bad Practice - Hardcoding
    Legacy systems present design challenges different to problems faced during greenfield development. One mistake I've seen time and again...
  • Data Warehousing Resources List
    Listed here are the best resources I've found for learning about data warehousing, useful techniques and methodologies. In no particular...
  • Big Data in the Real World - Part 2 of 2
    Part 1 of this series described the first 2 stages in the processing of clickstream data to allow sales conversion analysis. This post desc...
  • A 7-Zip Archiving Task for my SSIS Package Template
    I use an SSIS package template to speed up project delivery. 7-zip is my archiving tool of choice because it provides encryption and passwor...
  • Book Review - The Checklist Manifesto
    Periodically you read books which change your habits and make you more effective. The Checklist Manifesto is one such book. Authors who can ...
  • Pragmatic Principals - Agile
    The Data Warehouse exists to provide Business Intelligence. Its users will not always start with the end in mind. They will browse, discover...
  • Big Data in the Real World - Part 1 of 2
    There is so much talk of Big Data nowadays, mostly from sales people but also from technologists. Karen Lopez aka DataChick took issue at ...
  • Refactoring Legacy Databases
    Know your enemy - It might not always seem so, but legacy code was written with a process in mind. Determining what the code was originally ...
  • Building an Auditing Process for SSIS - Part 1
    An audit schema published in the excellent book SSIS Problem– Design – Solution allows the efficient capture of SSIS progress and errors in...
  • Pragmatic Principals - Focus
    The DW is a business system not a techie playground. Use Business Analyst's to focus all development on revenue generating (or saving) w...

Categories

  • 7-Zip
  • Architecture
  • Auditing
  • Bad Practice
  • Big Data
  • Book Review
  • Collations
  • Data Cleansing
  • Data Profiling
  • Data Warehouse
  • DBA
  • Design
  • Disaster Recovery
  • Error Handling
  • ETL
  • OLTP
  • Package Template
  • Principals
  • Professional Skills
  • Project Management
  • Quote
  • Refactoring
  • Replication
  • Resources
  • sFTP
  • SQL Agent
  • SSIS
  • T-SQL
  • WinSCP

Blog Archive

  • ▼  2013 (8)
    • ►  September (1)
    • ►  July (1)
    • ▼  May (2)
      • Big Data in the Real World - Part 2 of 2
      • Big Data in the Real World - Part 1 of 2
    • ►  March (2)
    • ►  February (1)
    • ►  January (1)
  • ►  2012 (15)
    • ►  December (1)
    • ►  September (1)
    • ►  August (1)
    • ►  July (3)
    • ►  June (2)
    • ►  April (1)
    • ►  March (2)
    • ►  February (3)
    • ►  January (1)
Powered by Blogger.

About Me

Unknown
View my complete profile