Buckaroo - The Data Table for Jupyter

I will be giving a webinar about Buckaroo this Thursday, October 19 1:00 – 2:00pm EST.

Learn about Buckaroo and how it can be customized to automate your own data analysis workflow. Register Here

We all know how awkward it is to clean data in Jupyter notebooks: multiple cells of exploratory work, trying different transforms, looking up different transforms, and ad hoc functions that work in one notebook and must either be copy-pasted into the next notebook or rewritten from scratch. Buckaroo makes all of that better by providing a visual UI for common cleaning operations and emitting Python code that performs the transformation. Specifically, Buckaroo is a tool built to interactively explore, clean, and transform pandas dataframes.

Buckaroo Screenshot

Installation

If using JupyterLab, buckaroo requires JupyterLab version 3 or higher.

You can install buckaroo using pip:

pip install buckaroo

Documentation

To get started with using Buckaroo, check out the full documentation:

https://buckaroo-data.readthedocs.io/en/latest/

Using Buckaroo

In a JupyterLab notebook, add the following to a cell:

from buckaroo.buckaroo_widget import BuckarooWidget
BuckarooWidget(df=df)  # df is the dataframe you want to explore

You will then see the Buckaroo UI.

Using commands

At their core, Buckaroo commands operate on columns. You must first click on a cell (not a header) in the top pane to select a column.

Next, click on a command like dropcol, fillna, or groupby to create a new command.

After creating a new command, you will see it in the commands list. Now you must edit the details of the command: select it by clicking on the bottom cell.

At this point you can either delete the command by clicking the X button or change command parameters.

Writing your own commands

Builtin commands are found in all_transforms.py

Simple example

Here is a simple example command

class DropCol(Command):
    command_default = [s('dropcol'), s('df'), "col"]
    command_pattern = [None]

    @staticmethod 
    def transform(df, col):
        df.drop(col, axis=1, inplace=True)
        return df

    @staticmethod 
    def transform_to_py(df, col):
        return "    df.drop('%s', axis=1, inplace=True)" % col

command_default is the base configuration of the command when it is first added. s('dropcol') is a special notation for the function name, and s('df') is a symbol notation for the dataframe argument (see LISP section for details). "col" is a placeholder for the selected column.

Since dropcol does not take any extra arguments, command_pattern is [None].
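
To make the notation concrete, here is a minimal sketch of how a command list might be filled in and dispatched. It assumes that s(name) returns a small marker such as {'symbol': name}. The fill_command and run_command helpers, the REGISTRY dict, and the dict-of-lists stand-in for a dataframe are all hypothetical; this is not Buckaroo's actual interpreter.

```python
# Hypothetical sketch: NOT Buckaroo's real interpreter.
# Assumption: s(name) wraps a name in a marker so it can be told
# apart from ordinary string arguments such as a column name.
def s(name):
    return {"symbol": name}


def is_symbol(value):
    return isinstance(value, dict) and set(value) == {"symbol"}


def dropcol(df, col):
    # Stand-in transform: the "dataframe" here is a plain dict of columns.
    return {name: values for name, values in df.items() if name != col}


REGISTRY = {"dropcol": dropcol}  # maps function symbols to transforms


def fill_command(command_default, selected_col):
    # Replace the "col" placeholder with the column picked in the UI.
    return [selected_col if part == "col" else part for part in command_default]


def run_command(command, df):
    func_symbol, *args = command
    func = REGISTRY[func_symbol["symbol"]]
    # Resolve s('df') to the actual dataframe; pass other arguments through.
    resolved = [df if is_symbol(a) and a["symbol"] == "df" else a for a in args]
    return func(*resolved)


command_default = [s("dropcol"), s("df"), "col"]
command = fill_command(command_default, "price")
table = {"city": ["Boston", "Austin"], "price": [10, 12]}
result = run_command(command, table)
print(result)  # {'city': ['Boston', 'Austin']}
```

The point of the symbol wrapper in this sketch is that a plain string like "price" is always data, while s('df') and s('dropcol') are names to be looked up.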

    def transform(df, col):
        df.drop(col, axis=1, inplace=True)
        return df

transform is the function that manipulates the dataframe. For dropcol it takes two arguments: the dataframe and the column name.

    def transform_to_py(df, col):
        return "    df.drop('%s', axis=1, inplace=True)" % col

transform_to_py emits equivalent python code for this transform. Code is indented 4 spaces for use in a function.
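
Because each transform_to_py call returns one pre-indented line, the generated lines can be joined under a function header to form a reusable cleaning function. The sketch below illustrates that idea only. build_function_source and the function name clean are hypothetical and not part of Buckaroo's API.

```python
def dropcol_to_py(df, col):
    # Same string as DropCol.transform_to_py above; df is unused here.
    return "    df.drop('%s', axis=1, inplace=True)" % col


def build_function_source(body_lines, func_name="clean"):
    # Hypothetical helper: wrap pre-indented lines in a function definition.
    header = "def %s(df):" % func_name
    footer = "    return df"
    return "\n".join([header, *body_lines, footer])


lines = [dropcol_to_py(None, "price"), dropcol_to_py(None, "notes")]
source = build_function_source(lines)
print(source)

# compile() checks that the assembled text is syntactically valid Python.
compile(source, "<generated>", "exec")
```

Pasting the printed text into a notebook cell would give a clean(df) function that drops both columns from a pandas dataframe.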

Complex example

class GroupBy(Transform):
    command_default = [s("groupby"), s('df'), 'col', {}]
    command_pattern = [[3, 'colMap', 'colEnum', ['null', 'sum', 'mean', 'median', 'count']]]
    @staticmethod 
    def transform(df, col, col_spec):
        grps = df.groupby(col)
        df_contents = {}
        for k, v in col_spec.items():
            if v == "sum":
                df_contents[k] = grps[k].apply(lambda x: x.sum())
            elif v == "mean":
                df_contents[k] = grps[k].apply(lambda x: x.mean())
            elif v == "median":
                df_contents[k] = grps[k].apply(lambda x: x.median())
            elif v == "count":
                df_contents[k] = grps[k].apply(lambda x: x.count())
        return pd.DataFrame(df_contents)

The GroupBy command is more complex. Its transform takes a third argument, col_spec, which is an argument of type colEnum. A colEnum argument tells the UI to display a table with all column names and a drop-down box of enum options for each.

In this case each column can have an operation of either sum, mean, median, or count applied to it.

Note also the leading 3 in the command_pattern. It tells the UI that these are the specs for the element at index 3 of the command (counting from zero), which is the {} slot in command_default. Eventually commands will be able to have multiple configured arguments.
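
As an illustration of how the pattern and the command line up, the sketch below checks a filled-in groupby command against its command_pattern. The validate_command helper is hypothetical. The sketch also assumes that s(name) returns {'symbol': name} and that the UI stores the user's choices as a dict mapping column name to operation.

```python
def s(name):
    # Assumption: symbols are represented as small marker dicts.
    return {"symbol": name}


command_pattern = [[3, "colMap", "colEnum", ["null", "sum", "mean", "median", "count"]]]


def validate_command(command, pattern):
    # Hypothetical check: each spec names an index into the command and
    # the enum values allowed for the argument at that index.
    for spec in pattern:
        if spec is None:  # e.g. dropcol, which takes no extra arguments
            continue
        index, _name, arg_type, allowed = spec
        if arg_type != "colEnum":
            raise ValueError("unsupported argument type: %r" % arg_type)
        col_spec = command[index]
        bad = {c: op for c, op in col_spec.items() if op not in allowed}
        if bad:
            raise ValueError("invalid operations: %r" % bad)
    return True


# Group by "city"; take the mean of "price" and the sum of "qty".
command = [s("groupby"), s("df"), "city", {"price": "mean", "qty": "sum"}]
print(validate_command(command, command_pattern))  # True
print(command[3])  # the col_spec dict that GroupBy.transform would receive
```

With this command, GroupBy.transform would be called with col set to "city" and col_spec set to {"price": "mean", "qty": "sum"}.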

Argument types

Arguments can currently be configured as

Order of Operations for data cleaning

The ideal order of operations is as follows

Buckaroo can only work on a single input dataframe shape at a time. Any newly created columns are visible on output, but not available for manipulation in the same Buckaroo Cell.

Components

Related projects

There are several projects like Buckaroo that aim to provide a better table widget and pandas editing experience.

  1. Mito: Open source table/code editing widget for Jupyter. Aimed more at easing the transition from Excel to pandas.
  2. Bamboolib: An originally open source tool aimed at building a similar experience, positioned more as a low-code tool for beginners. The parent company 8080labs was acquired by Databricks.
  3. Microsoft DataWrangler: Open source. Provides a very similar experience inside VSCode's notebook interface. Only works inside VSCode.
  4. QGrid: Open source, unmaintained. A slick table widget built by Quantopian, with no code gen or data manipulation features.
  5. IpyDatagrid: Open source. Bloomberg's Jupyter table widget. I used the ipydatagrid repo structure as the basis for buckaroo (JS build setup only).
  6. IPyAgGrid: Open source. Wraps AG Grid in a Jupyter widget. Buckaroo also uses AG Grid.

What works now, what’s coming

Exists now

Next major features

Development installation

For a development installation:

git clone https://github.com/paddymul/buckaroo.git
cd buckaroo
# We need to build against JupyterLab 3.6.5; JupyterLab 4.0 has different JS typing that conflicts.
# The installable still works in JupyterLab 4.
pip install build twine pytest sphinx-build jupyterlab==3.6.5
pip install -ve .

Enabling development install for Jupyter notebook:

Enabling development install for JupyterLab:

jupyter labextension develop . --overwrite

Note for developers: the --symlink argument on Linux or OS X allows one to modify the JavaScript code in-place. This feature is not available with Windows.

Developing the JS side

There are a series of examples of the components in examples/ex.

Instructions

npm install
npm run dev

Contributions

We :heart: contributions.

Have you had a good experience with this project? Why not share some love and contribute code, or just let us know about any issues you had with it?

We welcome issue reports here; be sure to choose the proper issue template for your issue, so that we can be sure you’re providing the necessary information.

Before sending a Pull Request, please make sure you read our