The word might be confusing, as the idea behind data lakes is still misunderstood. Take a second to imagine a wide blue lake. The difference between this lake and a normal one is its contents. While a normal lake is filled with identical H2O molecules, this one is a data cluster full of random information. "Random" here means that any kind of information, such as a number, a letter, a character, a name or a distance, may appear in it, following no particular rule. This is the starting point of the main challenge for data engineers: how can they avoid drowning in such a lake while still delivering good analyses or predictions for top management?
Current relational databases were built on one principle: all data is first structured, then stored. A user can then access the data through a query language such as SQL. This scheme is effective in an environment where the relations between data types are known. In predictive analysis, as in artificial intelligence or social media analysis, the main issue is that data mining is done in an unsupervised way: we actually want to discover the relations between the variables along the way. Here is the thing: this is the opposite of how a relational database works. It is a real challenge for engineers and will profoundly transform the conventional IT jobs we know. Technology has matured to the point where environments such as the Hadoop Distributed File System (HDFS) match the idea behind data lakes. Merging.
Engineers working on data analysis should first know what they need from the data before going in one direction or another. Sometimes, conventional data warehouses or relational databases remain the easiest way to perform fast analysis. One of the main drawbacks of data lakes is that each particular analysis requires a new data structure to be built, which is expensive in terms of infrastructure use. Exploring a known data architecture would be much simpler. Fortunately, from a management point of view, the benefits of this work outweigh its time and cost.
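To make the trade-off concrete, here is a minimal sketch in Python, not a reference implementation. It contrasts the relational approach (structure first, then store) with the lake approach (store raw, then build a structure for each analysis). The records, the field names and the two function names are hypothetical, and an in-memory SQLite table stands in for a real relational database.

```python
import json
import sqlite3

# Hypothetical raw records, as they might land in a lake: no shared structure.
RAW_RECORDS = [
    '{"source": "weblog", "url": "/home", "status": 200}',
    '{"source": "sensor", "device": "t-01", "temperature_c": 21.5}',
    '{"source": "social", "user": "alice", "text": "great product"}',
    '{"source": "sensor", "device": "t-02", "temperature_c": 19.0}',
]


def schema_on_write(records):
    """Relational approach: the table structure is fixed before any data is stored."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sensor (device TEXT NOT NULL, temperature_c REAL NOT NULL)"
    )
    rejected = 0
    for line in records:
        rec = json.loads(line)
        try:
            conn.execute(
                "INSERT INTO sensor (device, temperature_c) VALUES (?, ?)",
                (rec.get("device"), rec.get("temperature_c")),
            )
        except sqlite3.IntegrityError:
            # The record does not fit the predefined structure, so it is lost.
            rejected += 1
    rows = conn.execute(
        "SELECT device, temperature_c FROM sensor ORDER BY device"
    ).fetchall()
    conn.close()
    return rows, rejected


def schema_on_read(records, analysis_fields):
    """Lake approach: keep everything raw, build a structure for one analysis only."""
    structured = []
    for line in records:
        rec = json.loads(line)
        if all(field in rec for field in analysis_fields):
            structured.append({field: rec[field] for field in analysis_fields})
    return structured


if __name__ == "__main__":
    print(schema_on_write(RAW_RECORDS))
    print(schema_on_read(RAW_RECORDS, ["device", "temperature_c"]))
    print(schema_on_read(RAW_RECORDS, ["user", "text"]))
```

In this sketch the relational table answers one question quickly but rejects everything that does not fit it, while the raw records can serve a second, unplanned analysis at the price of rebuilding a structure on every read.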
It is worth understanding that existing tools can be connected on top of data lakes through a software bridge, as can be seen in this technical schematic:
From website logs to sensor signals and social media conversations, data is everywhere and of every kind. Making data understandable to top management is an engineer’s job. ETL (extract, transform, load) is performed before the data is put in the lake. Data analyses are then run in the lake, using mathematical formulas and numerical algorithms. Those analyses must be converted into high-level graphs to be understood by business people. Each layer of this "stack" contains software blocks with specific functionalities. The beauty lies in the power of such analysis from a business point of view. Until recently, the data collected by digital systems was not exploited. At all. Thanks to collaboration between researchers and engineers, a new vision of the world is now being drawn. A data vision.
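The layers described above (ETL, storage in the lake, analysis, then a summary fit for a graph) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raw line format, the function names and the in-memory dictionary standing in for the lake are all hypothetical, and a real stack would use dedicated software blocks at each layer.

```python
from statistics import mean

# Hypothetical raw input: "timestamp kind key value", one event per line.
RAW_LINES = [
    "2024-01-01T10:00:00 sensor t-01 21.5",
    "2024-01-01T10:00:05 weblog /home 200",
    "2024-01-01T10:01:00 sensor t-01 22.5",
    "malformed line",
]


def extract(lines):
    """Extract: split each raw line into fields, skipping lines without four fields."""
    for line in lines:
        parts = line.split()
        if len(parts) == 4:
            yield parts


def transform(parts):
    """Transform: name the fields and convert the value to a number."""
    timestamp, kind, key, value = parts
    return {"timestamp": timestamp, "kind": kind, "key": key, "value": float(value)}


def load(lake, record):
    """Load: append the record to the lake, grouped by kind of source."""
    lake.setdefault(record["kind"], []).append(record)


def run_etl(lines):
    """Run the three ETL steps and return the filled lake (a plain dict here)."""
    lake = {}
    for parts in extract(lines):
        load(lake, transform(parts))
    return lake


def summarize(lake):
    """Analysis layer: reduce stored records to numbers a graph could display."""
    sensor_values = [record["value"] for record in lake.get("sensor", [])]
    return {
        "records_per_kind": {kind: len(records) for kind, records in lake.items()},
        "mean_sensor_value": mean(sensor_values) if sensor_values else None,
    }


if __name__ == "__main__":
    print(summarize(run_etl(RAW_LINES)))
```

The output of `summarize` is the kind of small, aggregated result that a charting tool would turn into a graph for business people. The raw records stay available in the lake for other analyses.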