opensource.google.com

Showing posts with label google cloud.

Get ready for Google I/O: Program lineup revealed

Wednesday, April 23, 2025

The Google I/O agenda is live. We're excited to share Google’s biggest announcements across AI, Android, Web, and Cloud on May 20-21. Tune in to learn how we’re making development easier so you can build faster.

We'll kick things off with the Google Keynote at 10:00 AM PT on May 20th, followed by the Developer Keynote at 1:30 PM PT. This year, we're livestreaming two days of sessions directly from Mountain View, bringing more of the I/O experience to you, wherever you are.

Here’s a sneak peek of what we’ll cover:

    • AI advancements: Learn how Gemini models enable you to build new applications and unlock new levels of productivity. Explore the flexibility offered by options like our Gemma open models and on-device capabilities.
    • Build excellent apps across devices with Android: Crafting exceptional app experiences across devices is now even easier with Android. Dive into sessions focused on building intelligent apps with Google AI and boosting your productivity, alongside creating adaptive user experiences and leveraging the power of Google Play.
    • Powerful web, made easier: Exciting new features continue to accelerate web development, helping you build richer, more reliable web experiences. We’ll share the latest innovations in web UI, Baseline progress, new multimodal built-in AI APIs using Gemini Nano, and how AI in DevTools streamlines building innovative web experiences.

Plan your I/O

Join us online for livestreams May 20-21, followed by on-demand sessions and codelabs on May 22. Register today and explore the full program.

We're excited to share what's next and see what you build!

By the Google I/O team

Producer Java library for Data Lineage is now open source

Tuesday, January 28, 2025

Integrating OpenLineage producers with GCP Lineage just got a lot easier


What is Data Lineage?

Data Lineage is a GCP feature for tracking data movement. It helps data owners and analysts detect anomalies in data flows, find connections between data sources, and assess the potential consequences of planned changes to data pipelines.

Lineage is injected automatically for some Google Cloud products (BigQuery, Cloud Data Fusion, Cloud Composer, Dataproc, Vertex AI). That means that if Lineage integration with any of those products is enabled in a project, the data movements produced by jobs those products execute are reported to GCP Lineage.

For custom integrations, the API can be used to report and fetch lineage.

Once injected, lineage can be viewed in the Google Cloud console (available from the Data Catalog, BigQuery, and Vertex AI UIs). There are two representations: a graph view, with data sources as nodes and data movements as edges, and a list view, a tabular representation. Lineage information can also be fetched from the API.
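As an illustration of fetching lineage from the API, here is a hedged sketch using the Data Lineage client library; the class names (LineageClient, SearchLinksRequest, EntityReference), the parent format, and the fully qualified name format below are assumptions on our part, so check the client library reference for the exact surface:

import com.google.cloud.datacatalog.lineage.v1.EntityReference;
import com.google.cloud.datacatalog.lineage.v1.LineageClient;
import com.google.cloud.datacatalog.lineage.v1.Link;
import com.google.cloud.datacatalog.lineage.v1.SearchLinksRequest;

// Sketch: list the links in which a given BigQuery table appears as the source.
try (LineageClient client = LineageClient.create()) {
  SearchLinksRequest request =
      SearchLinksRequest.newBuilder()
          .setParent("projects/my-project/locations/us")  // assumed parent format
          .setSource(
              EntityReference.newBuilder()
                  .setFullyQualifiedName("bigquery:my-project.my_dataset.input_table"))
          .build();
  for (Link link : client.searchLinks(request).iterateAll()) {
    System.out.println(link.getSource().getFullyQualifiedName()
        + " -> " + link.getTarget().getFullyQualifiedName());
  }
}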

More information is available in the documentation.


GCP Lineage information model

We describe data flows using the following concepts:

  • Process is a definition of some data transformation. For example, a SQL or Spark script.
  • Run is an execution of a Process.
  • Lineage Event is a data transformation event. It is reported in the context of a Run.
  • A Link represents a connection between two data sources, where the data in the link’s Target depends on its Source. A Lineage Event contains a list of Links.

OpenLineage support

OpenLineage is an open standard for reporting lineage information. It unifies lineage reporting between systems, which means the events generated in this format can be consumed by any product supporting it. This leads to more flexibility: adding or replacing a lineage producer does not imply changing the consumer, and vice versa.

The OpenLineage format has been adopted by a number of lineage producers and consumers, meaning there is already tooling available to report lineage from and to those systems. GCP Lineage is one of those consumers: users can report events in OpenLineage format, see the resulting lineage in the UI, and query it via the API.

OpenLineage is the preferred method for reporting lineage in GCP Lineage. It is used by the Dataproc lineage integration. To find out more about sending OpenLineage events to GCP Lineage, refer to the documentation.

After lineage is injected in OpenLineage format, it can be accessed in the same way as lineage injected via other API methods or automatically: from the Google Cloud console or the API.


Why a producer library?

The GCP Lineage producer library is an extension of the client library. Client libraries are recommended for calling Cloud APIs programmatically. They handle low-level API call details, leaving the necessary user code simpler and shorter.

The producer library further simplifies integration by providing ready-to-use code for calling the API from Java. It adds functionality such as synchronous and asynchronous clients, translation of OpenLineage JSON messages to the API-friendly format, error handling, etc.

Using the producer library, all the code needed to send a request to the GCP Lineage API is:

// Create a client that sends requests synchronously.
SyncLineageProducerClient client = SyncLineageProducerClient.create();
// parent identifies the project and location that own the lineage data,
// e.g. "projects/{project}/locations/{location}".
ProcessOpenLineageRunEventRequest request =
        ProcessOpenLineageRunEventRequest.newBuilder()
            .setParent(parent)
            .setOpenLineage(openLineageMessage)
            .build();
client.processOpenLineageRunEvent(request);

The openLineageMessage field here is a protobuf Struct that carries information about the job execution, its inputs and outputs, and other metadata. The object model is described in the documentation. An example message is:

{
  "eventType": "START",
  "eventTime": "2023-04-04T13:21:16.098Z",
  "run": {
    "runId": "502483d6-3e3d-474f-9380-da565eaa7516",
    "facets": {
      "spark_properties": {
        "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.22.0/integration/spark",
        "_schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunFacet",
        "properties": {
          "spark.master": "yarn",
          "spark.app.name": "sparkJobTest.py"
        }
      }
    }
  },
  "job": {
    "namespace": "project-name",
    "name": "cluster-name",
    "facets": {
      "jobType": {
        "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.22.0/integration/spark",
        "_schemaURL": "https://openlineage.io/spec/facets/2-0-3/JobTypeJobFacet.json#/$defs/JobTypeJobFacet",
        "processingType": "BATCH",
        "integration": "SPARK",
        "jobType": "SQL_JOB"
      }
    }
  },
  "inputs": [
    {
      "namespace": "bigquery",
      "name": "project.dataset.input_table"
    }
  ],
  "outputs": [
    {
      "namespace": "bigquery",
      "name": "project.dataset.output_table"
    }
  ],
  "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.18.0/integration/spark",
  "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/$defs/RunEvent"
}

Learn more about building an OpenLineage message.
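If your producer emits OpenLineage events as JSON strings, protobuf’s JsonFormat utility can convert them into the Struct that setOpenLineage expects. A minimal sketch (the helper name and surrounding plumbing are ours):

import com.google.protobuf.Struct;
import com.google.protobuf.util.JsonFormat;

// Convert an OpenLineage run event, serialized as JSON, into the protobuf Struct
// expected by ProcessOpenLineageRunEventRequest.setOpenLineage().
static Struct toOpenLineageStruct(String openLineageJson) throws Exception {
  Struct.Builder builder = Struct.newBuilder();
  JsonFormat.parser().ignoringUnknownFields().merge(openLineageJson, builder);
  return builder.build();
}

The resulting Struct can then be passed to the producer client exactly as in the snippet above.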


Best Practices for Constructing OpenLineage Messages

The openLineageMessage should follow the OpenLineage format. The fields that are required for correct parsing by the GCP Lineage API are:

  • job – mapped to Process
  • job.namespace – used to construct the Process name
  • job.name – used to construct the Process name
  • run – mapped to Run
  • run.runId – used to construct the Run name
  • producer – URI identifying the producer of this metadata
  • eventTime – time of the data movement
  • schemaURL – URL pointing to the schema definition for this message

In addition to those, the fields used to create lineage are:

  • eventType – corresponds to the status of the Run
  • inputs – mapped to sources of links. Must be specified according to the naming conventions
  • outputs – mapped to targets of links. Must be specified according to the naming conventions

The GCP Lineage API supports OpenLineage major versions 1 and 2. For more information, please refer to the documentation.


How to access GCP Lineage?

The code is now publicly available on GitHub. The library is also published to Maven.


GcpLineageTransport

To simplify integration with GCP Lineage, we offer GcpLineageTransport. It is available in the OpenLineage GitHub repository and is built as a separate Maven artifact. It is built on top of the producer library mentioned above.

Using the transport minimizes the code needed to send events to GCP Lineage. The GcpLineageTransport can be configured as the event sink for any existing OpenLineage producer such as Airflow, Spark, and Flink. Find more information and examples on GCP Lineage.

By Mary Idamkina – Data Lineage

Introducing the Open Source Insights Project

Thursday, June 3, 2021

Open Source Insights

Google has been working on software supply-chain security for many years, and transitive dependencies remain one of the most complex and least understood aspects. While we will be integrating this data into our Cloud and internal products in a variety of ways, we believe there is immediate value in helping developers understand and visualize dependencies. Today, we are excited to share an exploratory visualization site: Open Source Insights, which provides an interactive view of the dependencies of open source projects.

Software development practices have evolved significantly over the last few years. Collaborative development with distributed feature development, consumption of open source and third-party packages, and publicly maintained software libraries have become commonplace, partly as a result of the widespread use of open source software. The advantages of open source are so clear that people and companies that would once have rejected OSS are now adopting it as a critical element of their environment.

But there are challenges brought by OSS too. The pace of change is electric, and it can be hard to keep up. The software packages that a large project depends on might update too frequently to keep a clear picture of what is happening. And those packages, in turn, can change their dependencies to provide new features or fix bugs. Security problems and other issues can arise unexpectedly in your project as a result, and the scale of the problem can make it all difficult to manage. Even a modest OSS project might depend on hundreds of packages.

There are tools to help, of course: vulnerability scanners and dependency audits that can help identify when a package is exposed to a vulnerability. But it can still be difficult to visualize the big picture, to understand what you depend on, and what that implies.

Open Source Insights provides a visualization of a project’s dependencies and their properties. Our exploratory website can be used to get an overview of how a particular software package is put together. Among other features, it provides interactive tools to visualize and analyze full, transitive dependency graphs. It also has a comparison tool to highlight how different versions of a package might affect your dependencies, perhaps by changing their own dependencies, adding licensing requirements, or fixing security problems.

Dependency graph for express 4.17.1

Open Source Insights shows you all this information about a package without asking you to install the package first. You can see instantly what installing a package—or an updated version—might mean for your project and how popular it is, find links to source code and other information, and then decide whether it should be installed. Insights also helps you see the importance of your project by showing the projects that depend on it: its dependents. Even a small project is important if a large number of other projects depend on it, either directly or through transitive dependencies.

Open Source Insights continuously scans millions of projects in the open source software ecosystem, gathering information about packages, including licensing, ownership, security issues, and other metadata such as download counts, popularity signals, and OpenSSF Scorecards. It then constructs a full dependency graph—transitively tracking dependencies, dependencies' dependencies, and so on—and incorporates the metadata, then publishes it so you can see how it all might affect your software. And the information it provides is continually updated.

Filtered dependency graph showing how eslint 7.27.0 depends on chalk 2.4.2 and 4.1.1

This information can help you visualize how software is put together, decide whether an update is worth doing, and understand how to fix a problem.
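To make “transitively tracking dependencies” concrete, here is a toy sketch (ours, not the Insights implementation) that computes a package’s full dependency set from a map of direct dependencies:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Walk direct dependencies, their dependencies, and so on, visiting each package once.
static Set<String> transitiveDependencies(String root, Map<String, List<String>> directDeps) {
  Set<String> seen = new LinkedHashSet<>();
  Deque<String> toVisit = new ArrayDeque<>(List.of(root));
  while (!toVisit.isEmpty()) {
    for (String dep : directDeps.getOrDefault(toVisit.pop(), List.of())) {
      if (seen.add(dep)) {
        toVisit.push(dep);
      }
    }
  }
  return seen;
}

// Example (hypothetical data): transitiveDependencies("my-app", Map.of(
//     "my-app", List.of("express"),
//     "express", List.of("accepts", "body-parser"),
//     "accepts", List.of("mime-types")))
// returns [express, accepts, body-parser, mime-types].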

Today, Open Source Insights supports npm, Maven, Go modules, and Cargo. While we work on adding additional packaging systems, we want to hear from you: How could this data fit into your development workflow? What would make it more useful? You can reach the team at deps.dev to share your thoughts; we’ll be collecting feedback over the coming months and look forward to hearing your ideas on how best to improve supply-chain security.

Visit our website at deps.dev to try it out.

From the Google Cloud team: What this means for GCP’s open cloud

For users of open source software, this may be the first time you’re seeing dependency and vulnerability information in an organized and accessible way. If you’re using a managed service based on open source, it’s important to remember that you may not be affected by all vulnerabilities listed. Your provider may have taken steps to harden the products you use, and when a new vulnerability is disclosed, your provider may take responsibility for patching this on your behalf.

Google Cloud follows both these steps to help users get the benefits of open cloud while prioritizing security. Multiple layers of hardening create defense-in-depth, which helps protect services like Google Kubernetes Engine (GKE), Cloud Run and Cloud Functions from a container escape vulnerability. For components that are the user’s responsibility, we’re constantly rolling out new services—like GKE Autopilot—that automate these responsibilities.

We’re committed to protecting our customers, both through our patch rewards program and the recently launched cyber insurance partnership, the Risk Protection Program, which moves from shared responsibility to shared fate. We look forward to bringing our customers new information on their open source dependencies.

By Andrew Gerrand, Michael Goddard, Rob Pike and Nicky Ringland of the Open Source Insights Team

The API Registry API

Friday, January 8, 2021

We’ve found that many organizations are challenged by the increasing number of APIs that they make and use. APIs become harder to track, which can lead to duplication rather than reuse. Also, as APIs expand to cover an ever-broadening set of topics, different design styles can proliferate, at times creating frustrating inefficiencies.

To address this, we’ve designed the Registry API, an experimental approach to organizing information about APIs. The Registry API allows teams to upload and share machine-readable descriptions of APIs that are in use and in development. These descriptions include API specifications in standard formats like OpenAPI, the Google API Discovery Service Format, and the Protocol Buffers Language.

An organized collection of API descriptions can be the foundation for a wide range of tools and services that make APIs better and easier to use.
  • Linters verify that APIs follow standard patterns
  • Documentation generators provide documentation in consistent, easy-to-read, accessible formats
  • Code generators produce API clients and server scaffolding
  • Searchable online catalogs make everything easier to find
But perhaps most importantly, bringing everything about APIs together into one place can improve the consistency of an API portfolio. With organization-wide visibility, many find they need less explicit governance even as their APIs become more standardized and easy to use.

The Registry API is a gRPC service that is formally described by Protocol Buffers and that closely follows the Google API Design Guidelines at aip.dev. The Registry API description is annotated to support gRPC HTTP/JSON transcoding, which allows it to be automatically published as a JSON REST API using a proxy. Proxies also enable gRPC-Web, which allows gRPC calls to be made directly from browser-based applications, and the project includes an experimental GraphQL interface.

We’ve released a reference implementation that can be run locally or deployed in a container with Google Cloud Run or other container-based services. It stores data using the Google Cloud Datastore API or a configurable relational interface layer that currently supports PostgreSQL and SQLite.

Following AIP-181, we’ve set the Registry API’s stability level as "alpha," but our aim is to make it a stable base for API lifecycle applications. We’ve open-sourced our implementation to share progress and gather feedback. Please tell us about your experience if you use it.

By Tim Burks, Tech Lead – Apigee API Lifecycle and Governance

Using MicroK8s with Anthos Config Management in the world of IoT

Friday, December 11, 2020

When dealing with large-scale Kubernetes deployments, managing configuration and policy is often very complicated. A few weeks ago, we discussed why Kubernetes’ declarative approach to configuration as data has become the most popular choice for most users. Today, we will discuss bringing this approach to your MicroK8s deployments using Anthos Config Management.
Image of Anthos Config Management + Cloud Source Repositories + MicroK8s
Anthos Config Management helps you easily create declarative security and operational policies and implement them at scale for your Kubernetes deployments across hybrid and multi-cloud environments. At a high level, you represent the desired state of your deployment as code committed to a central Git repository. Anthos Config Management will ensure the desired state is achieved and also maintained across all your registered clusters.

You can use Anthos Config Management for both your Kubernetes Engine (GKE) clusters and Anthos attached clusters. Anthos attached clusters is a deployment option that extends Anthos’ reach into Kubernetes clusters running in other clouds as well as edge devices and the world of IoT, the Internet of Things. In this blog, you will experiment with attached clusters using MicroK8s, a conformant Kubernetes platform popular in IoT and edge environments.

Consider an organization with a large number of distributed manufacturing facilities or laboratories that use MicroK8s to provide services to IoT devices. In such a deployment, Anthos can help you manage remote clusters directly from the Anthos Console rather than investing engineering resources to build out a multitude of custom tools.

Consider the diagram below.

Diagram of Anthos Config Management with MicroK8s on the Factory Floor with IoT
This diagram shows a set of “N” factory locations each with a MicroK8s cluster supporting IoT devices such as lights, sensors, or even machines. You register each of the MicroK8s clusters in an Anthos environ: a logical collection of Kubernetes clusters. When you want to deploy the application code to the MicroK8s clusters, you commit the code to the repository and Anthos Config Management takes care of the deployment across all locations. In this blog we will show you how you can quickly try this out using a MicroK8s test deployment.

We will use the following Google Cloud services:
  • Compute Engine provides an Ubuntu instance for a single-node MicroK8s cluster. Ubuntu will use cloud-init to install MicroK8s and generate shell scripts and other files to save time.
  • Cloud Source Repositories will provide the Git-based repository to which we will commit our workload.
  • Anthos Config Management will perform the deployment from the repository to the MicroK8s cluster.

Let’s start with a picture

Here’s a diagram of how these components fit together.

Diagram of how Anthos Config Management works together with MicroK8s
  • A workstation instance is created from which Terraform is used to deploy four components: (1) an IAM service account, (2) a Google Compute Engine Instance with MicroK8s using permissions provided by the service account, (3) a Kubernetes configuration repo provided by Cloud Source Repositories, and (4) a public/private key pair.
  • The GCE instance will use the service account key to register the MicroK8s cluster with an Anthos environ.
  • The public key from the public/private key pair will be registered to the repository while the private key will be registered with the MicroK8s cluster.
  • Anthos Config Management will be configured to point to the repository and branch to poll for updates.
  • When a Kubernetes YAML document is pushed to the appropriate branch of the repository, Anthos Config Management will use the private key to connect to the repository, detect that a commit has been made against the branch, fetch the files and apply the document to the MicroK8s cluster.
Anthos Config Management enables you to deploy code from a Git repository to Kubernetes clusters that have been registered with Anthos. Google Cloud officially supports GKE, AKS, and EKS clusters, but you can use other conformant clusters such as MicroK8s in accordance with your needs. The repository below shows you how to register a single MicroK8s cluster to receive deployments. You can also scale this to larger numbers of clusters, all of which can receive updates from commits to the repository.

If your organization has large numbers of IoT devices supported by Kubernetes clusters, you can update all of them from the Anthos console to provide consistent deployments across the organization regardless of the locations of the clusters, including the IoT edge. If you would like to learn more, you can build this project yourself. Please check out this Git repository and learn firsthand how Anthos can help you manage Kubernetes deployments in the world of IoT.

By Jeff Levine, Customer Engineer – Google Cloud