Introduction
Cloud services are a strong trend today and compete for our attention almost everywhere on the Internet. But what exactly is the Cloud? Is it something new that changes our understanding of web services as we know them? The answer is probably no. The Cloud has been around for a while. Not so long ago, the term ASP (Application Service Provider) was identified as the future of online services. That was almost 12 years ago. The ASP model is based on the idea of providing access to a particular program, such as a CRM, over a standard protocol such as HTTP. It is also sometimes called on-demand software or, more recently, software as a service (SaaS).
The main motivation for ASP was to reduce the cost of application maintenance and to offer a lower price for leasing than for buying an expensive product that brings additional deployment requirements, such as a server and technical support for solving issues on the user side.
The Cloud model goes beyond the original ASP model. It covers more fields, such as computing and storage. The name comes from the cloud symbol used in system diagrams as an abstraction for the complex infrastructure it contains. Cloud computing entrusts remote services with a user’s data, software and computation over a network.
Companies and organizations are moving more and more of their infrastructure into the Cloud environment. There are three main groups in the Cloud model: Software as a Service (SaaS), Infrastructure as a Service (IaaS) and Platform as a Service (PaaS). The first is closest to the original ASP model. The second is the leasing of IT infrastructure from third-party providers. A representative example of IaaS is a hosting provider that offers virtualization technology to its customers. The last model combines the previous two.
Cloud computing
The challenge for the Cloud environment is integrating data from cloud services and analyzing the collected information. Suppose we use several cloud providers, each offering some services. From the user’s point of view, we would like to access data from all our cloud services, generate reports, and analyze service usage.
A good example of data collection is the 2012 US presidential election, where data was collected from various social platforms such as Twitter, Facebook and Google+. The data from these systems is integrated in one place, where analytical tasks are run and reports are generated showing the progress of the individual candidates.
Another scenario is one where we want to know the status of particular cloud services that are critical for us, as part of an overview of the IT infrastructure. For instance, we have PaaS hosted at two providers, such as Amazon and Rackspace, and we need an overview of our PaaS infrastructure on a weekly basis. What we can do is use an integration platform that collects data into a database and allows us to analyze the data and generate reports with the results. The ideal solution would be a traditional data warehouse capable of doing all of this. The problem is that no data warehouse has this capability.
Another big question is what to do with really large data collections, which can reach petabytes, exabytes or even more. The Big Data field is oriented toward solving this challenge, which is becoming an increasingly important topic for companies and organizations such as hospitals, governments, telecommunication operators and banks.
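The weekly overview described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table name service_usage, the cpu_hours metric, the function names and the sample readings are all hypothetical, and an in-memory SQLite database stands in for whatever store a real integration platform would use.

```python
import sqlite3


def create_store():
    """Create an in-memory database holding usage samples (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE service_usage (provider TEXT, day TEXT, cpu_hours REAL)"
    )
    return conn


def collect(conn, provider, samples):
    """Store (day, cpu_hours) samples reported by one provider."""
    conn.executemany(
        "INSERT INTO service_usage (provider, day, cpu_hours) VALUES (?, ?, ?)",
        [(provider, day, hours) for day, hours in samples],
    )


def weekly_report(conn):
    """Return (provider, week, total cpu_hours) rows, one per provider and week."""
    rows = conn.execute(
        "SELECT provider, strftime('%Y-%W', day) AS week, SUM(cpu_hours) "
        "FROM service_usage GROUP BY provider, week ORDER BY provider, week"
    )
    return rows.fetchall()


if __name__ == "__main__":
    store = create_store()
    # Made-up readings; a real collector would query each provider's API.
    collect(store, "amazon", [("2012-11-05", 10.0), ("2012-11-06", 20.0)])
    collect(store, "rackspace", [("2012-11-05", 5.0)])
    for row in weekly_report(store):
        print(row)
```

The point of the sketch is the separation of concerns: collection writes rows in a common shape regardless of provider, and reporting is a plain aggregate query over that shape.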
Data Warehouse
A data warehouse is a place where data is stored for archival, analysis, and security purposes. Usually a data warehouse is either a single computer or many computers (such as servers) tied together to create one giant computer system.
The data may be raw or formatted. It can cover various topics, including the organization’s sales, operational data, summaries of data such as reports, copies of data, inventory data, and external data used for simulations and analysis.
Besides being a storehouse for a large amount of data, a data warehouse must have systems in place that make it easy to access the data and use it in day-to-day operations. A data warehouse is sometimes described as a major player in a decision support system (DSS). A DSS is a technique that organizations use to come up with facts, trends, or relationships that can help them make effective decisions or create effective strategies to accomplish organizational goals.
There are many different data warehouse models. One of them is Online Transaction Processing (OLTP), which is built for simple use. Another is Online Analytical Processing (OLAP), which is more complex to use and adds an extra step of analysis over the data. These additional steps slow processing down, and answering certain queries requires much more data.
Data Warehouse Architecture
One of the more common data warehouse models describes a data warehouse that is subject oriented, time variant, non-volatile, and integrated. Subject oriented means that data is organized around the subjects of the business, so that related data is linked together. Time variant means that any data that is changed in the data warehouse can be tracked. Usually every change is stamped with a date and time together with the before and after values, so that the changes over a period of time can be shown. Non-volatile means that data is never deleted, which protects the most crucial data. Finally, the data is integrated, which means that a data warehouse uses organization-wide data instead of data from just one entity such as a department or business unit.
Besides the term data warehouse, a frequently used term is data mart. Data marts are smaller and less integrated than data warehouses. With innovations in data warehousing techniques and improvements in technology, data warehouses have evolved from Offline Operational Databases to Online Integrated Data Warehouses.
Offline Operational Data Warehouses are data warehouses where data is usually copied from real-time data networks into an offline system where it can be used. This is usually the simplest and least technical type of data warehouse.
Offline Data Warehouses are data warehouses that are updated at regular intervals (on a daily, weekly, or monthly basis). Data is stored in an integrated structure where others can access it and perform reporting.
Real Time Data Warehouses are data warehouses that are updated each moment with the influx of new data. For instance, a Real Time Data Warehouse might incorporate data from a Point of Sale system and be updated with each sale that is made.
Integrated Data Warehouses are data warehouses that other systems can access for operational purposes. Some Integrated Data Warehouses are used by other data warehouses, which can access them to process reports as well as look up current data.
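The time variant and non-volatile properties can be illustrated with a small append-only store. This is a minimal sketch, not a description of any particular product: the class names VersionedStore and Change and their methods are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class Change:
    """One tracked change: the key, its before and after values, and a timestamp."""
    key: str
    before: Optional[Any]
    after: Any
    changed_at: datetime


class VersionedStore:
    """Hypothetical store that records every change and never deletes one."""

    def __init__(self):
        self._current = {}
        self._log = []  # append-only; assumed to be in chronological order

    def put(self, key, value, now=None):
        now = now or datetime.now(timezone.utc)
        self._log.append(Change(key, self._current.get(key), value, now))
        self._current[key] = value

    def history(self, key):
        """All recorded changes for a key, oldest first."""
        return [c for c in self._log if c.key == key]

    def value_at(self, key, moment):
        """The value the key had at a given moment, or None if not yet set."""
        value = None
        for change in self._log:
            if change.key == key and change.changed_at <= moment:
                value = change.after
        return value


if __name__ == "__main__":
    store = VersionedStore()
    store.put("price", 100, datetime(2012, 1, 1, tzinfo=timezone.utc))
    store.put("price", 120, datetime(2012, 6, 1, tzinfo=timezone.utc))
    for change in store.history("price"):
        print(change)
```

Because the log is only ever appended to, earlier states remain reconstructible, which is what allows changes over a period of time to be shown.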
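The difference between a warehouse that is refreshed on a schedule and one that is updated with each sale can be sketched as follows. All class names here (SalesWarehouse, BatchLoader, RealTimeLoader) are hypothetical and exist only for illustration; a sale is assumed to be a simple (product, amount) pair.

```python
from collections import defaultdict


class SalesWarehouse:
    """Toy warehouse that keeps a running total per product."""

    def __init__(self):
        self.total_by_product = defaultdict(float)
        self.loaded_sales = 0

    def load(self, sale):
        product, amount = sale
        self.total_by_product[product] += amount
        self.loaded_sales += 1


class BatchLoader:
    """Offline style: sales are buffered and loaded only when flush() runs."""

    def __init__(self, warehouse):
        self.warehouse = warehouse
        self.pending = []

    def on_sale(self, sale):
        self.pending.append(sale)

    def flush(self):
        for sale in self.pending:
            self.warehouse.load(sale)
        self.pending.clear()


class RealTimeLoader:
    """Real-time style: each sale reaches the warehouse immediately."""

    def __init__(self, warehouse):
        self.warehouse = warehouse

    def on_sale(self, sale):
        self.warehouse.load(sale)


if __name__ == "__main__":
    offline, realtime = SalesWarehouse(), SalesWarehouse()
    batch, live = BatchLoader(offline), RealTimeLoader(realtime)
    for sale in [("coffee", 2.5), ("tea", 1.5)]:
        batch.on_sale(sale)
        live.on_sale(sale)
    print(offline.loaded_sales, realtime.loaded_sales)
    batch.flush()  # e.g. the nightly or weekly load
    print(offline.loaded_sales, realtime.loaded_sales)
```

Both loaders end up with the same data; the only difference is when the warehouse sees it, which is what separates the offline and real-time types.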
An operational data store (ODS) is a database designed to integrate data from multiple sources for additional operations on the data. The data is then passed back to operational systems for further operations and to the data warehouse for reporting.
Operational data store
One special characteristic of an ODS is the requirement for it to handle mixed workloads. It must be able to respond to complex queries from data-mining facilities, knowledge users, and rules engines. The ODS must also be capable of processing an extremely high transaction rate, as it is fed transactions in real time from many enterprise systems. This processing is the realm of OLTP (OnLine Transaction Processing). The database structures suitable for OLTP are characterized by skinny keys that require a minimum of updating as data is added to the database. Another special characteristic of an ODS is that it is bi-directional: it receives data from operational systems and passes data back to them.
The number one reason for implementing a data warehouse is so that employees and end users can access it and use the data for reports, analysis, and decision making. Using the data in a warehouse can help locate trends, focus on relationships, and help users understand more about the environment that a business operates in.
Data warehouses also increase the data’s consistency and allow it to be checked repeatedly to determine how relevant it is. Because most data warehouses are integrated, users can pull data from many different areas of their business, for instance accounting, IT, human resources, and so on.
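The bi-directional flow of an ODS can be sketched as a small in-memory class. This is an assumption-laden illustration only: the OperationalDataStore class, its method names and the record layout are hypothetical, and a real ODS would sit on a transactional database rather than Python dictionaries.

```python
from collections import defaultdict


class OperationalDataStore:
    """Hypothetical ODS: merges records from several sources and fans them out."""

    def __init__(self):
        self.records = {}                # entity id -> merged fields
        self.sources = defaultdict(set)  # entity id -> contributing systems
        self.subscribers = []            # callbacks of operational systems
        self.warehouse_queue = []        # snapshots waiting for the warehouse

    def subscribe(self, callback):
        """Register an operational system that wants merged records back."""
        self.subscribers.append(callback)

    def ingest(self, source, entity_id, fields):
        """Merge fields from one source, then pass the result in both directions."""
        record = self.records.setdefault(entity_id, {})
        record.update(fields)
        self.sources[entity_id].add(source)
        snapshot = dict(record)
        # Back to the operational systems ...
        for notify in self.subscribers:
            notify(entity_id, snapshot)
        # ... and onward to the data warehouse for reporting.
        self.warehouse_queue.append((entity_id, snapshot))


if __name__ == "__main__":
    ods = OperationalDataStore()
    ods.subscribe(lambda entity_id, rec: print("operational system got", entity_id, rec))
    ods.ingest("crm", "customer-1", {"email": "a@example.com"})
    ods.ingest("billing", "customer-1", {"balance": 10})
    print(len(ods.warehouse_queue), "snapshots queued for the warehouse")
```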
Cloud Warehouse
Cloud services are operated by different providers, which usually specialize in a specific field such as storage or computing. A common requirement from users of cloud services is the ability to get reports and run analyses that give them an overview of cloud service usage. The challenge in reporting and analysis lies in collecting data from the various systems, storing it somewhere, and generating reports with the results. The ideal scenario would be to use a traditional data warehouse that can collect data from the operational systems and pass it through an operational data store for additional operations before it is used in the warehouse for reporting. Unfortunately, traditional data warehouse implementations are too complex to be extended in a flexible and easy way, and they are mostly oriented toward particular third-party solutions. A simple way to integrate them with other systems is missing.
The Nsys Cloud Warehouse is a proof-of-concept prototype that tries to implement the traditional data warehouse model in the Cloud environment, where data is collected from cloud services acting as the source operational systems.
The prototype implementation of the cloud warehouse requires an infrastructure that provides three things: an agent running on multiple systems that can run plugins for data collection; a subsystem for data processing, where additional work on the data can be done, such as implementing the functionality of the operational data store; and a web portal that can be used to access the reports.
The prototype of the cloud warehouse is implemented as a new extension of the Nsys Framework Infrastructure. The Nsys Platform was primarily designed as a multiplatform system that helps in developing information systems and can perform management tasks on nodes (such as servers and workstations) through an agent called Nsys Daemon and its plugins.
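The three parts named above (an agent with collection plugins, a processing subsystem, and a reporting front end) can be sketched as follows. This is not the Nsys API, which is not shown here: every class, function and field name in the example (CollectorPlugin, Agent, process, render_report, storage_gb) is hypothetical, and the web portal is reduced to a function that renders plain text.

```python
from abc import ABC, abstractmethod


class CollectorPlugin(ABC):
    """Hypothetical plugin interface: each plugin returns one reading as a dict."""

    name: str

    @abstractmethod
    def collect(self) -> dict:
        ...


class StaticUsagePlugin(CollectorPlugin):
    """Stand-in plugin returning a fixed reading instead of calling a cloud API."""

    def __init__(self, name, reading):
        self.name = name
        self._reading = reading

    def collect(self):
        return dict(self._reading)


class Agent:
    """Runs on one node and executes every registered plugin."""

    def __init__(self, node):
        self.node = node
        self.plugins = []

    def register(self, plugin):
        self.plugins.append(plugin)

    def run_once(self):
        return [
            {"node": self.node, "plugin": p.name, **p.collect()}
            for p in self.plugins
        ]


def process(records):
    """Processing subsystem: total storage per node (assumed field storage_gb)."""
    totals = {}
    for record in records:
        node = record["node"]
        totals[node] = totals.get(node, 0) + record.get("storage_gb", 0)
    return totals


def render_report(totals):
    """Reporting front end, reduced here to one text line per node."""
    return "\n".join(f"{node}: {gb} GB" for node, gb in sorted(totals.items()))


if __name__ == "__main__":
    agent = Agent("node-1")
    agent.register(StaticUsagePlugin("objects", {"storage_gb": 5}))
    agent.register(StaticUsagePlugin("volumes", {"storage_gb": 7}))
    print(render_report(process(agent.run_once())))
```

Keeping the plugin interface this narrow is one possible design choice: a new cloud service then only needs a new plugin, while the processing and reporting code stay unchanged.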