项目作者: coleifer

项目描述 :
Implement SQLite table-valued functions with Python
高级语言: Python
项目地址: git://github.com/coleifer/sqlite-vtfunc.git
创建时间: 2015-12-04T17:31:54Z
项目社区:https://github.com/coleifer/sqlite-vtfunc

开源协议:MIT License

下载


vtfunc

NOTICE: This project is no longer necessary if you are using Peewee 3.0 or
newer, as the relevant code has been included in Peewee’s sqlite extension
module. For more information, see:

If you intend to use this project with an older version of Peewee, or as a
standalone project with the standard library SQLite module, feel free to
continue using this repository.

Requires sqlite >= 3.9.0


Python bindings for the creation of table-valued functions
in SQLite.

A table-valued function:

  • Accepts any number of parameters
  • Can be used in places you would put a normal table or subquery, such as the
    FROM clause or on the right-hand-side of an IN expression.
  • may return an arbitrary number of rows consisting of one or more columns.

Here are some examples of what you can do with Python and sqlite-vtfunc:

  • Write a SELECT query that, when run, will scrape a website and return a table
    of all the outbound links on the page (rows are (href, description)
    tuples. See example below).
  • Accept a file path and return a table of the files in that directory and
    their associated metadata.
  • Use table-valued functions to handle recurring events in a calendaring
    application (by generating the series of recurrances dynamically).
  • Apply a regular expression search to some text and return a row for each
    matching substring.

Scraping pages with SQL

To get an idea of how sqlite-vtfunc works, let’s build the scraper table
function described in the previous section. The function will accept a URL as
the only parameter, and will return a table of the link destinations and text
descriptions.

The Scraper class contains the entire implementation for the scraper:

  1. import re, urllib2
  2. from pysqlite2 import dbapi2 as sqlite3 # Use forked pysqlite.
  3. from vtfunc import TableFunction
  4. class Scraper(TableFunction):
  5. params = ['url'] # Function argument names.
  6. columns = ['href', 'description'] # Result rows have these columns.
  7. name = 'scraper' # Name we use to invoke the function from SQL.
  8. def initialize(self, url):
  9. # When the function is called, download the HTML and create an
  10. # iterator that successively yields `href`/`description` pairs.
  11. fh = urllib2.urlopen(url)
  12. self.html = fh.read()
  13. self._iter = re.finditer(
  14. '<a[^\>]+?href="([^\"]+?)"[^\>]*?>([^\<]+?)</a>',
  15. self.html)
  16. def iterate(self, idx):
  17. # Since row ids would not be meaningful for this particular table-
  18. # function, we can ignore "idx" and just advance the regex iterator.
  19. # Ordinarily, to signal that there are no more rows, the `iterate()`
  20. # method must raise a `StopIteration` exception. This is not necessary
  21. # here because `next()` will raise the exception when the regex
  22. # iterator is finished.
  23. return next(self._iter).groups()

To start using the table function, create a connection and register the table
function with the connection. Note: for SQLite version <= 3.13, the table
function will not remain loaded across connections, so it is necessary to
register it each time you connect to the database.

  1. # Creating a connection and registering our scraper function.
  2. conn = sqlite3.connect(':memory:')
  3. Scraper.register(conn) # Register the function with the new connection.

To test the scraper, start up a python interpreter and enter the above code.
Once that is done, let’s try a query. The following query will fetch the HTML
for the hackernews front-page and extract the three links with the longest
descriptions:

  1. >>> curs = conn.execute('SELECT * FROM scraper(?) '
  2. ... 'ORDER BY length(description) DESC '
  3. ... 'LIMIT 3', ('https://news.ycombinator.com/',))
  4. >>> for (href, description) in curs.fetchall():
  5. ... print description, ':', href
  6. The Diolkos: an ancient Greek paved trackway enabling boats to be moved overland : https://...
  7. The NumPy array: a structure for efficient numerical computation (2011) [pdf] : https://hal...
  8. Restoring Y Combinator's Xerox Alto, day 4: What's running on the system : http://www.right...

Now, suppose you have another table which contains a huge list of URLs that you
need to scrape. Since this is a relational database, it’s incredibly easy to
connect the URLs in one table with another.

The following query will scrape all the URLs in the unvisited_urls table:

  1. SELECT uu.url, href, description
  2. FROM unvisited_urls AS uu, scraper(uu.url)
  3. ORDER BY uu.url, href, description;

Example two: implementing Python’s range()

This function generates a series of integers between given boundaries and at
given intervals.

  1. from vtfunc import TableFunction
  2. class GenerateSeries(TableFunction):
  3. params = ['start', 'stop', 'step']
  4. columns = ['output']
  5. name = 'generate_series'
  6. def initialize(self, start=0, stop=None, step=1):
  7. # Note that when a parameter is optional, the only thing
  8. # you need to do is provide a default value in `initialize()`.
  9. self.start = start
  10. self.stop = stop or float('inf')
  11. self.step = step
  12. self.curr = self.start
  13. def iterate(self, idx):
  14. if self.curr > self.stop:
  15. raise StopIteration
  16. ret = self.curr
  17. self.curr += self.step
  18. return (ret,)

Dependencies

This project is designed to work with the standard library sqlite3 driver, or
alternatively, the latest version of pysqlite2.

Implementation Notes

To create functions that return multiple values, it is necessary to create a
virtual table. SQLite has the concept of
“eponymous” virtual tables, which are virtual tables that can be called like a
function and do not require explicit creation using DDL statements.

The vtfunc module abstracts away the complexity of creating an eponymous
virtual table, allowing you to write your own multi-value SQLite functions in
Python.

TODO: was removing stuff and stopped here.

  1. import re
  2. from vtfunc import TableFunction
  3. class RegexSearch(TableFunction):
  4. params = ['regex', 'search_string']
  5. columns = ['match']
  6. name = 'regex_search'
  7. def initialize(self, regex=None, search_string=None):
  8. self._iter = re.finditer(regex, search_string)
  9. def iterate(self, idx):
  10. # We do not need `idx`, so just ignore it.
  11. return (next(self._iter).group(0),)

To use our function, we need to register the module with a SQLite connection,
then call it using a SELECT query:

  1. import sqlite3
  2. conn = sqlite3.connect(':memory:') # Create an in-memory database.
  3. RegexSearch.register(conn) # Register our module.
  4. query_params = ('[0-9]+', '123 xxx 456 yyy 789 zzz 0')
  5. cursor = conn.execute('SELECT * FROM regex_search(?, ?);', query_params)
  6. print cursor.fetchall()

Let’s say we have a table that contains a list of arbitrary messages and we
want to capture all the e-mail addresses from that table. This is also easy
using our table-valued function. We will query the messages table and pass
the message body into our table-valued function. Then, for each email address
we find, we’ll return a row containing the message ID and the matching email
address:

  1. email_regex = '[\w]+@[\w]+\.[\w]{2,3}' # Stupid simple email regex.
  2. query = ('SELECT messages.id, regex_search.match '
  3. 'FROM messages, regex_search(?, messages.body)')
  4. cursor = conn.execute(query, (email_regex,))

The resulting rows will look something like:

  1. message id | email
  2. -----------+-----------------------
  3. 1 | charlie@example.com
  4. 1 | huey@kitty.cat
  5. 1 | zaizee@morekitties.cat
  6. 3 | mickey@puppies.dog
  7. 3 | huey@throwaway.cat
  8. ... | ...

Important note

In the above example you will note that the parameters for our query actually
change (because each row in the messages table has a different search string).
This means that for this particular query, the RegexSearch.initialize()
function will be called once for each row in the messages table.

How it works

Behind-the-scenes, vtfunc is creating a Virtual Table
and filling in the various callbacks with wrappers around your user-defined
function. There are two important methods that the wrapped virtual table
implements:

  • xBestIndex
  • xFilter

When SQLite attempts to execute a query, it will call the xBestIndex method of
the virtual table (possibly multiple times) trying to come up with the best
query plan. The vtfunc module optimizes for those query plans which include
values for all the parameters of the user-defined function. Since some
user-defined functions may have optional parameters, query plans with only a
subset of param values will be slightly penalized.

Since we have no visibility into what parameters the user actually passed in,
and we don’t know ahead of time which query plan SQLite suggests will be
best, vtfunc just does its best to optimize for plans with the highest
number of usable parameter values.

If you encounter a situation where you pass your function multiple parameters,
but it doesn’t receive all of them, it’s the case that a less-than-optimal
plan was used.

After the plan is chosen by calling xBestIndex, the query will execute by
calling xFilter (possibly multiple times). xFilter has access to the actual
query parameters, and it’s responsibility is to initialize the cursor and call
the user’s initialize() callback with the parameters passed in.