Data Lake

Data lakes are large repositories that hold data in its natural (raw) format or in transformed formats. They are often used as the data source for analytics, reporting, data visualisations and machine learning models.

Data can be in just about any format:

  • Structured
  • Semi-structured
  • Unstructured

Generations

Gen1 was built independently of Azure Storage. Gen2 is built on Blob Storage and introduces "hierarchical namespaces" (HNS), which replace the folder emulation in blob containers with a true directory structure. This allows folder-level operations such as rename and delete to be carried out as a single atomic operation, and allows POSIX-style permissions to be enforced on directories and files.
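The difference between folder emulation and a true directory structure can be illustrated with a toy model. This is only a sketch: plain Python dicts stand in for storage, no Azure SDK is involved, and the function names are hypothetical. It assumes a flat namespace treats a "folder" as nothing more than a shared name prefix.

```python
# Toy model only: dicts stand in for storage; nothing here calls Azure.

def rename_prefix_flat(blobs: dict[str, bytes], old: str, new: str) -> int:
    """Flat namespace: a 'folder' is just a shared name prefix, so renaming
    it means re-keying every matching blob. Returns the number touched."""
    matches = [name for name in blobs if name.startswith(old + "/")]
    for name in matches:
        blobs[new + name[len(old):]] = blobs.pop(name)
    return len(matches)


def rename_dir_hierarchical(tree: dict, old: str, new: str) -> int:
    """Hierarchical namespace: the directory is a real node, so renaming it
    is one re-link however many files it holds (top-level names only here)."""
    tree[new] = tree.pop(old)
    return 1


flat = {"raw/a.csv": b"1", "raw/b.csv": b"2", "curated/c.csv": b"3"}
tree = {"raw": {"a.csv": b"1", "b.csv": b"2"}, "curated": {"c.csv": b"3"}}

print(rename_prefix_flat(flat, "raw", "staging"))       # one step per blob
print(rename_dir_hierarchical(tree, "raw", "staging"))  # a single step
```

In the flat model the cost grows with the number of blobs under the prefix and the rename can be observed half-finished, which is why a folder-level operation is hard to make atomic there.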

The long-term intention appears to be that all Blob Storage features will be supported with hierarchical namespaces enabled, though that is not yet the case. The data is accessible through multiple APIs: storage accounts with HNS enabled expose both the blob.core.windows.net and dfs.core.windows.net endpoints.
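As a rough illustration of the two endpoints, the sketch below builds both base URLs for an account and rewrites a Blob URL to its dfs equivalent. It uses only the Python standard library; the helper names and the account name "examplelake" are hypothetical, and it assumes the usual https://<account>.<suffix> host layout with the container (file system) as the first path segment.

```python
from urllib.parse import urlsplit, urlunsplit

BLOB_SUFFIX = "blob.core.windows.net"
DFS_SUFFIX = "dfs.core.windows.net"


def endpoints(account: str) -> dict[str, str]:
    """Return the base URL of each endpoint for a storage account name."""
    return {
        "blob": f"https://{account}.{BLOB_SUFFIX}",
        "dfs": f"https://{account}.{DFS_SUFFIX}",
    }


def to_dfs_url(blob_url: str) -> str:
    """Swap the blob host for the dfs host, keeping the path unchanged."""
    parts = urlsplit(blob_url)
    if not parts.netloc.endswith("." + BLOB_SUFFIX):
        raise ValueError(f"not a blob endpoint URL: {blob_url}")
    account = parts.netloc[: -len("." + BLOB_SUFFIX)]
    return urlunsplit(parts._replace(netloc=f"{account}.{DFS_SUFFIX}"))


# "examplelake" is a made-up account name used purely for illustration.
print(endpoints("examplelake"))
print(to_dfs_url("https://examplelake.blob.core.windows.net/raw/2024/a.csv"))
```

Only the host changes between the two forms; which endpoint to use depends on whether the client speaks the Blob API or the Data Lake (dfs) API.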