The Universal Solvent for REST APIs – O’Reilly



    Data scientists working in Python or R typically acquire data by way of REST APIs. Both environments provide libraries that help you make HTTP calls to REST endpoints, then transform JSON responses into dataframes. But that’s never as simple as we’d like. When you’re reading a lot of data from a REST API, you need to do it a page at a time, but pagination works differently from one API to the next. So does unpacking the resulting JSON structures. HTTP and JSON are low-level standards, and REST is a loosely-defined framework, but nothing guarantees absolute simplicity, never mind consistency across APIs.
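To make the pagination grunt work concrete, here is a minimal Python sketch of the loop you end up writing by hand. The page shape (`data` plus `meta.next_token`) mimics Twitter’s v2 API; other APIs paginate differently, which is exactly the inconsistency described above. The `fake_page` function is a made-up stand-in for a real HTTP call.

```python
def fetch_all(fetch_page):
    """Call fetch_page(token) until no next_token remains; return all rows."""
    rows, token = [], None
    while True:
        page = fetch_page(token)
        rows.extend(page.get("data", []))
        token = page.get("meta", {}).get("next_token")
        if token is None:
            return rows

# Stand-in for an HTTP call, returning two fake pages of results.
def fake_page(token):
    if token is None:
        return {"data": [{"id": 1}, {"id": 2}], "meta": {"next_token": "p2"}}
    return {"data": [{"id": 3}], "meta": {}}

print(fetch_all(fake_page))  # [{'id': 1}, {'id': 2}, {'id': 3}]
```

Multiply this loop by every API you consume, each with its own token names and page shapes, and the appeal of a uniform abstraction becomes clear.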

    What if there were a way of reading from APIs that abstracted all the low-level grunt work and worked the same way everywhere? Good news! That’s exactly what Steampipe does. It’s a tool that translates REST API calls directly into SQL tables. Here are three examples of questions that you can ask and answer using Steampipe.



    1. Twitter: What are recent tweets that mention PySpark?

    Here’s a SQL query to ask that question:

    select
      id,
      text
    from
      twitter_search_recent
    where
      query = 'pyspark'
    order by
      created_at desc
    limit 5;

    Here’s the answer:

    +---------------------+------------------------------------------------------------------------------------------------>
    | id                  | text                                                                                           >
    +---------------------+------------------------------------------------------------------------------------------------>
    | 1526351943249154050 | @dump Tenho trabalhando bastante com Spark, mas especificamente o PySpark. Vale a pena usar um >
    | 1526336147856687105 | RT @MitchellvRijkom: PySpark Tip ⚡                                                            >
    |                     |                                                                                                >
    |                     | When to use what StorageLevel for Cache / Persist?                                             >
    |                     |                                                                                                >
    |                     | StorageLevel decides how and where data should be s…                                           >
    | 1526322757880848385 | Solve challenges and exceed expectations with a career as a AWS Pyspark Engineer. https://t.co/>
    | 1526318637485010944 | RT @JosMiguelMoya1: #pyspark #spark #BigData curso completo de Python y Spark con PySpark      >
    |                     |                                                                                                >
    |                     | https://t.co/qf0gIvNmyx                                                                        >
    | 1526318107228524545 | RT @money_personal: PySpark & AWS: Master Big Data With PySpark and AWS                        >
    |                     | #ApacheSpark #AWSDatabases #BigData #PySpark #100DaysofCode                                    >
    |                     | -> http…                                                                                      >
    +---------------------+------------------------------------------------------------------------------------------------>

    The table that’s being queried here, twitter_search_recent, receives the output from Twitter’s /2/tweets/search/recent endpoint and formulates it as a table with these columns. You don’t have to make an HTTP call to that API endpoint or unpack the results, you just write a SQL query that refers to the documented columns. One of those columns, query, is special: it encapsulates Twitter’s query syntax. Here, we’re just looking for tweets that match PySpark, but we could as easily refine the query by pinning it to specific users, URLs, types (is:retweet, is:reply), properties (has:mentions, has:media), etc. That query syntax is the same no matter how you’re accessing the API: from Python, from R, or from Steampipe. It’s a lot to think about, and all you should really need to know when crafting queries to mine Twitter data.
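As an illustration of that pass-through query syntax, here is a hedged variation on the first example: the same Steampipe table, but with Twitter search operators (the ones named above) embedded in the query column to exclude retweets and require media. The operator string is a hypothetical refinement, not from the original article.

```sql
-- original tweets only, with attached media
select
  id,
  text
from
  twitter_search_recent
where
  query = 'pyspark -is:retweet has:media'
order by
  created_at desc
limit 5;
```

The SQL around the query never changes; only the operator string inside it does.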

    2. GitHub: What are repositories that mention PySpark?

    Here’s a SQL query to ask that question:

    select
      name,
      owner_login,
      stargazers_count
    from
      github_search_repository
    where
      query = 'pyspark'
    order by stargazers_count desc
    limit 10;

    Here’s the answer:

    +----------------------+-------------------+------------------+
    | name                 | owner_login       | stargazers_count |
    +----------------------+-------------------+------------------+
    | SynapseML            | microsoft         | 3297             |
    | spark-nlp            | JohnSnowLabs      | 2725             |
    | incubator-linkis     | apache            | 2524             |
    | ibis                 | ibis-project      | 1805             |
    | spark-py-notebooks   | jadianes          | 1455             |
    | petastorm            | uber              | 1423             |
    | awesome-spark        | awesome-spark     | 1314             |
    | sparkit-learn        | lensacom          | 1124             |
    | sparkmagic           | jupyter-incubator | 1121             |
    | data-algorithms-book | mahmoudparsian    | 1001             |
    +----------------------+-------------------+------------------+

    This looks a lot like the first example! In this case, the table that’s being queried, github_search_repository, receives the output from GitHub’s /search/repositories endpoint and formulates it as a table with these columns.

    In both cases the Steampipe documentation not only shows you the schemas that govern the mapped tables, it also gives examples (Twitter, GitHub) of SQL queries that use the tables in various ways.

    Note that these are just two of many available tables. The Twitter API is mapped to 7 tables, and the GitHub API is mapped to 41 tables.

    3. Twitter + GitHub: What have owners of PySpark-related repositories tweeted lately?

    To answer this question we need to consult two different APIs, then join their results. That’s even harder to do, in a consistent way, when you’re reasoning over REST payloads in Python or R. But this is the kind of thing SQL was born to do. Here’s one way to ask the question in SQL.

    -- find pyspark repos
    with github_repos as (
      select
        name,
        owner_login,
        stargazers_count
      from
        github_search_repository
      where
        query = 'pyspark' and name ~ 'pyspark'
      order by stargazers_count desc
      limit 50
    ),
    
    -- find twitter handles of repo owners
    github_users as (
      select
        u.login,
        u.twitter_username
      from
        github_user u
      join
        github_repos r
      on
        r.owner_login = u.login
      where
        u.twitter_username is not null
    ),
    
    -- find corresponding twitter users
    twitter_userids as (
      select
        id
      from
        twitter_user t
      join
        github_users g
      on
        t.username = g.twitter_username
    )
    
    -- find tweets from those users
    select
      t.author->>'username' as twitter_user,
      'https://twitter.com/' || (t.author->>'username') || '/status/' || t.id as url,
      t.text
    from
      twitter_user_tweet t
    join
      twitter_userids u
    on
      t.user_id = u.id
    where
      t.created_at > now()::date - interval '1 week'
    order by
      t.author
    limit 5

    Here is the answer:

    +----------------+---------------------------------------------------------------+------------------------------------->
    | twitter_user   | url                                                           | text                                >
    +----------------+---------------------------------------------------------------+------------------------------------->
    | idealoTech     | https://twitter.com/idealoTech/status/1524688985649516544     | Can you find creative soluti>
    |                |                                                               |                                     >
    |                |                                                               | Join our @codility Order #API Challe>
    |                |                                                               |                                     >
    |                |                                                               | #idealolife #codility #php          >
    | idealoTech     | https://twitter.com/idealoTech/status/1526127469706854403     | Our #ProductDiscovery team at idealo>
    |                |                                                               |                                     >
    |                |                                                               | Think you can solve it? 😎          >
    |                |                                                               | ➡️  https://t.co/ELfUfp94vB https://t>
    | ioannides_alex | https://twitter.com/ioannides_alex/status/1525049398811574272 | RT @scikit_learn: scikit-learn 1.1 i>
    |                |                                                               | What's new? You can check the releas>
    |                |                                                               |                                     >
    |                |                                                               | pip install -U…                     >
    | andfanilo      | https://twitter.com/andfanilo/status/1524999923665711104      | @edelynn_belle Thanks! Sometimes it >
    | andfanilo      | https://twitter.com/andfanilo/status/1523676489081712640      | @juliafmorgado Good luck on the reco>
    |                |                                                               |                                     >
    |                |                                                               | My advice: power through it + a dead>
    |                |                                                               |                                     >
    |                |                                                               | I hated my first few short videos bu>
    |                |                                                               |                                     >
    |                |                                                               | Looking forward to the video 🙂

    When APIs frictionlessly become tables, you can devote your full attention to reasoning over the abstractions represented by those APIs. Larry Wall, the creator of Perl, famously said: “Easy things should be easy, hard things should be possible.” The first two examples are things that should be, and are, easy: each is just 10 lines of simple, straight-ahead SQL that requires no wizardry at all.

    The third example is a harder thing. It would be hard in any programming language. But SQL makes it possible in several nice ways. The solution is made of concise stanzas (CTEs, Common Table Expressions) that form a pipeline. Each phase of the pipeline handles one clearly-defined piece of the problem. You can validate the output of each phase before proceeding to the next. And you can do all this with the most mature and widely-used grammar for selection, filtering, and recombination of data.
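The stage-by-stage validation idea can be sketched with throwaway data. The tables and rows below are made up for illustration and live in an in-memory SQLite database; in Steampipe the corresponding tables (github_search_repository, github_user, and so on) would be API-backed. The point is the workflow: run one CTE in isolation, inspect its output, then chain the next stanza.

```python
import sqlite3

# Fabricated miniature versions of two of the tables from the pipeline.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table github_repos (name text, owner_login text, stargazers_count int);
create table github_user (login text, twitter_username text);
insert into github_repos values
  ('spark-nlp', 'JohnSnowLabs', 2725),
  ('petastorm', 'uber', 1423);
insert into github_user values
  ('JohnSnowLabs', 'johnsnowlabs'),
  ('uber', null);
""")

# Validate one stanza in isolation before wiring it into the larger query:
stage = conn.execute("""
  with github_users as (
    select u.login, u.twitter_username
    from github_user u
    join github_repos r on r.owner_login = u.login
    where u.twitter_username is not null
  )
  select * from github_users
""").fetchall()
print(stage)  # [('JohnSnowLabs', 'johnsnowlabs')]
```

Once a stanza’s output looks right, you append the next one; the final query is just the accumulation of stanzas you have already checked.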

    Do I have to use SQL?

    No! If you like the idea of mapping APIs to tables, but you’d rather reason over those tables in Python or R dataframes, then Steampipe can oblige. Under the covers it’s Postgres, enhanced with foreign data wrappers that handle the API-to-table transformation. Anything that can connect to Postgres can connect to Steampipe, including SQL drivers like Python’s psycopg2 and R’s RPostgres as well as business-intelligence tools like Metabase, Tableau, and PowerBI. So you can use Steampipe to frictionlessly consume APIs into dataframes, then reason over the data in Python or R.
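A minimal sketch of that workflow, assuming a locally running `steampipe service start`, which (per Steampipe’s defaults) exposes Postgres on port 9193 with database and user both named "steampipe". The helper names here are our own, and the actual query only runs when you call `load_table` against a live service.

```python
def steampipe_dsn(password: str,
                  host: str = "localhost",
                  port: int = 9193,
                  dbname: str = "steampipe",
                  user: str = "steampipe") -> str:
    """Build a libpq connection string for the local Steampipe service."""
    return f"host={host} port={port} dbname={dbname} user={user} password={password}"

def load_table(sql: str, password: str):
    """Run a query against Steampipe, return a DataFrame (needs a live service)."""
    import pandas as pd
    import psycopg2
    with psycopg2.connect(steampipe_dsn(password)) as conn:
        return pd.read_sql(sql, conn)

# usage, with the password reported by the steampipe service:
# df = load_table(
#     "select name, stargazers_count from github_search_repository "
#     "where query = 'pyspark'",
#     password="<service password>")
```

From there the result is an ordinary dataframe, and everything downstream is plain Pandas.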

    But if you haven’t used SQL in this way before, it’s worth a look. Consider this comparison of SQL to Pandas from How to rewrite your SQL queries in Pandas.

    SQL                                                                                                   Pandas
    select * from airports                                                                                airports
    select * from airports limit 3                                                                        airports.head(3)
    select id from airports where ident = 'KLAX'                                                          airports[airports.ident == 'KLAX'].id
    select distinct type from airports                                                                    airports.type.unique()
    select * from airports where iso_region = 'US-CA' and type = 'seaplane_base'                          airports[(airports.iso_region == 'US-CA') & (airports.type == 'seaplane_base')]
    select ident, name, municipality from airports where iso_region = 'US-CA' and type = 'large_airport'  airports[(airports.iso_region == 'US-CA') & (airports.type == 'large_airport')][['ident', 'name', 'municipality']]
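The SQL side of that comparison can be tried against a tiny made-up airports table using Python’s stdlib sqlite3 (the real dataset behind the comparison is larger; these three rows are fabricated for illustration, with the Pandas equivalents shown as comments):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table airports (id int, ident text, type text, name text,
                       municipality text, iso_region text);
insert into airports values
  (1, 'KLAX', 'large_airport', 'Los Angeles Intl', 'Los Angeles', 'US-CA'),
  (2, 'KSFO', 'large_airport', 'San Francisco Intl', 'San Francisco', 'US-CA'),
  (3, 'C07',  'seaplane_base', 'Example Base', 'Somewhere', 'US-CA');
""")

# select id from airports where ident = 'KLAX'
# Pandas: airports[airports.ident == 'KLAX'].id
klax = conn.execute("select id from airports where ident = 'KLAX'").fetchall()

# select distinct type from airports
# Pandas: airports.type.unique()
types = sorted(t for (t,) in conn.execute("select distinct type from airports"))

print(klax, types)  # [(1,)] ['large_airport', 'seaplane_base']
```

Either notation gets you to the same answer; the difference is which one reads more naturally to you.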

    We can argue the merits of one style versus the other, but there’s no question that SQL is the most universal and widely-implemented way to express these operations on data. So no, you don’t have to use SQL to its fullest potential in order to benefit from Steampipe. But you might find that you want to.


