datascience Tables

class prob140.Table(labels=None, _deprecated=None, *, formatter=<datascience.formats.Formatter object>)[source]

A sequence of string-labeled columns.

class Row[source]
item(index_or_label)[source]

Return the item at an index or label.

class Rows(table)[source]

An iterable view over the rows in a table.

append(row_or_table)[source]

Append a row or all rows of a table. An appended table must have all columns of self.

append_column(label, values)[source]

Appends a column to the table or replaces a column.

__setitem__ is aliased to this method:
table.append_column('new_col', make_array(1, 2, 3)) is equivalent to table['new_col'] = make_array(1, 2, 3).
Args:

label (str): The label of the new column.

values (single value or list/array): If a single value, every
value in the new column is values. If a list or array, the new column contains the values in values, which must be the same length as the table.

Returns:
Original table with new or replaced column
Raises:
ValueError: If
  • label is not a string.
  • values is a list/array and does not have the same length as the number of rows in the table.
>>> table = Table().with_columns(
...     'letter', make_array('a', 'b', 'c', 'z'),
...     'count',  make_array(9, 3, 3, 1),
...     'points', make_array(1, 2, 2, 10))
>>> table
letter | count | points
a      | 9     | 1
b      | 3     | 2
c      | 3     | 2
z      | 1     | 10
>>> table.append_column('new_col1', make_array(10, 20, 30, 40))
>>> table
letter | count | points | new_col1
a      | 9     | 1      | 10
b      | 3     | 2      | 20
c      | 3     | 2      | 30
z      | 1     | 10     | 40
>>> table.append_column('new_col2', 'hello')
>>> table
letter | count | points | new_col1 | new_col2
a      | 9     | 1      | 10       | hello
b      | 3     | 2      | 20       | hello
c      | 3     | 2      | 30       | hello
z      | 1     | 10     | 40       | hello
>>> table.append_column(123, make_array(1, 2, 3, 4))
Traceback (most recent call last):
    ...
ValueError: The column label must be a string, but a int was given
>>> table.append_column('bad_col', [1, 2])
Traceback (most recent call last):
    ...
ValueError: Column length mismatch. New column does not have the same number of rows as table.
apply(fn, *column_or_columns)[source]

Apply fn to each element or elements of column_or_columns. If no column_or_columns are provided, fn is applied to each row.

Args:

fn (function): The function to apply.

column_or_columns: Columns containing the arguments to fn
as either column labels (str) or column indices (int). The number of columns must match the number of arguments that fn expects.
Raises:
ValueError – if column_label is not an existing
column in the table.
TypeError – if an insufficient number of column_label arguments are passed
to fn.
Returns:
An array consisting of results of applying fn to elements specified by column_label in each row.
>>> t = Table().with_columns(
...     'letter', make_array('a', 'b', 'c', 'z'),
...     'count',  make_array(9, 3, 3, 1),
...     'points', make_array(1, 2, 2, 10))
>>> t
letter | count | points
a      | 9     | 1
b      | 3     | 2
c      | 3     | 2
z      | 1     | 10
>>> t.apply(lambda x: x - 1, 'points')
array([0, 1, 1, 9])
>>> t.apply(lambda x, y: x * y, 'count', 'points')
array([ 9,  6,  6, 10])
>>> t.apply(lambda x: x - 1, 'count', 'points')
Traceback (most recent call last):
    ...
TypeError: <lambda>() takes 1 positional argument but 2 were given
>>> t.apply(lambda x: x - 1, 'counts')
Traceback (most recent call last):
    ...
ValueError: The column "counts" is not in the table. The table contains these columns: letter, count, points

Whole rows are passed to the function if no columns are specified.

>>> t.apply(lambda row: row[1] * 2)
array([18,  6,  6,  2])
as_html(max_rows=0)[source]

Format table as HTML.

as_text(max_rows=0, sep=' | ')[source]

Format table as text.
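
A minimal sketch of both formatting methods; the exact markup and padding come from the table's formatter, so only the general shape of the results is indicated here:

>>> tiles = Table().with_columns('letter', make_array('c', 'd'),
...     'count', make_array(2, 4))
>>> print(tiles.as_text()) 
<the table rendered as plain text, with columns separated by ' | '>
>>> tiles.as_html() 
<a string of HTML markup for the same table>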

bar(column_for_categories=None, select=None, overlay=True, width=6, height=4, **vargs)[source]

Plot bar charts for the table.

Each plot is labeled using the values in column_for_categories and one plot is produced for every other column (or for the columns designated by select).

Every selected column except column_for_categories must be numerical.

Args:
column_for_categories (str): A column containing x-axis categories
Kwargs:
overlay (bool): create a chart with one color per data column;
if False, each will be displayed separately.
vargs: Additional arguments that get passed into plt.bar.
See http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.bar for additional arguments that can be passed into vargs.
barh(column_for_categories=None, select=None, overlay=True, width=6, **vargs)[source]

Plot horizontal bar charts for the table.

Args:
column_for_categories (str): A column containing y-axis categories
used to create buckets for bar chart.
Kwargs:
overlay (bool): create a chart with one color per data column;
if False, each will be displayed separately.
vargs: Additional arguments that get passed into plt.barh.
See http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.barh for additional arguments that can be passed into vargs.
Raises:
ValueError – Every selected column except column_for_categories
must be numerical.
Returns:
Horizontal bar graph with buckets specified by column_for_categories. Each plot is labeled using the values in column_for_categories and one plot is produced for every other column (or for the columns designated by select).
>>> t = Table().with_columns(
...     'Furniture', make_array('chairs', 'tables', 'desks'),
...     'Count', make_array(6, 1, 2),
...     'Price', make_array(10, 20, 30)
...     )
>>> t
Furniture | Count | Price
chairs    | 6     | 10
tables    | 1     | 20
desks     | 2     | 30
>>> t.barh('Furniture') 
<bar graph with furniture as categories and bars for count and price>
>>> t.barh('Furniture', 'Price') 
<bar graph with furniture as categories and bars for price>
>>> t.barh('Furniture', make_array(1, 2)) 
<bar graph with furniture as categories and bars for count and price>
bin(*columns, **vargs)[source]

Group values by bin and compute counts per bin by column.

By default, bins are chosen to contain all values in all columns. The following named arguments from numpy.histogram can be applied to specialize bin widths:

If the original table has n columns, the resulting binned table has n+1 columns, where column 0 contains the lower bound of each bin.

Args:
columns (str or int): Labels or indices of columns to be
binned. If empty, all columns are binned.
bins (int or sequence of scalars): If bins is an int,
it defines the number of equal-width bins in the given range (10, by default). If bins is a sequence, it defines the bin edges, including the rightmost edge, allowing for non-uniform bin widths.
range ((float, float)): The lower and upper range of
the bins. If not provided, range contains all values in the table. Values outside the range are ignored.
density (bool): If False, the result will contain the number of
samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function.
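
A minimal illustrative sketch; the output shape is described rather than reproduced, since the chosen bin edges depend on the data and the bins argument:

>>> t = Table().with_columns(
...     'count',  make_array(9, 3, 3, 1),
...     'points', make_array(1, 2, 2, 10))
>>> t.bin('points', bins=4) 
<table with a column of bin lower bounds and, for the binned column, a column of counts per bin>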
boxplot(**vargs)[source]

Plots a boxplot for the table.

Every column must be numerical.

Kwargs:
vargs: Additional arguments that get passed into plt.boxplot.
See http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.boxplot for additional arguments that can be passed into vargs. These include vert and showmeans.
Returns:
None
Raises:
ValueError: The Table contains columns with non-numerical values.
>>> table = Table().with_columns(
...     'test1', make_array(92.5, 88, 72, 71, 99, 100, 95, 83, 94, 93),
...     'test2', make_array(89, 84, 74, 66, 92, 99, 88, 81, 95, 94))
>>> table
test1 | test2
92.5  | 89
88    | 84
72    | 74
71    | 66
99    | 92
100   | 99
95    | 88
83    | 81
94    | 95
93    | 94
>>> table.boxplot() 
<boxplot of test1 and boxplot of test2 side-by-side on the same figure>
cdf(x)

Finds the cdf of the distribution

Parameters:
x : float

Value in distribution

Returns:
float

Finds P(X<=x)

Examples

>>> dist = Table().with_columns(
...     'Value', make_array(2, 3, 4),
...     'Probability', make_array(0.25, 0.5, 0.25))
>>> dist.cdf(0)
0
>>> dist.cdf(2)
0.25
>>> dist.cdf(3.5)
0.75
>>> dist.cdf(1000)
1.0
column(index_or_label)[source]

Return the values of a column as an array.

table.column(label) is equivalent to table[label].

>>> tiles = Table().with_columns(
...     'letter', make_array('c', 'd'),
...     'count',  make_array(2, 4),
... )
>>> list(tiles.column('letter'))
['c', 'd']
>>> tiles.column(1)
array([2, 4])
Args:
label (int or str): The index or label of a column
Returns:
An instance of numpy.array.
Raises:
ValueError: When the index_or_label is not in the table.
column_index(label)[source]

Return the index of a column by looking up its label.
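
A short example (the tiles table here is illustrative):

>>> tiles = Table().with_columns('letter', make_array('c', 'd'),
...     'count', make_array(2, 4))
>>> tiles.column_index('count')
1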

column_labels

Return a tuple of column labels. [Deprecated]

copy(*, shallow=False)[source]

Return a copy of a table.
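
A short illustrative sketch; copies are deep by default (shallow=False), so modifying the copy leaves the original untouched:

>>> original = Table().with_columns('letter', make_array('c', 'd'))
>>> duplicate = original.copy()
>>> duplicate.append_column('count', make_array(2, 4))
>>> original
letter
c
d
>>> duplicate
letter | count
c      | 2
d      | 4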

drop(*column_or_columns)[source]

Return a Table with only columns other than selected label or labels.

Args:

column_or_columns (string or list of strings): The header names or indices of the columns to be dropped.

column_or_columns must be an existing header name, or a valid column index.

Returns:
An instance of Table with given columns removed.
>>> t = Table().with_columns(
...     'burgers',  make_array('cheeseburger', 'hamburger', 'veggie burger'),
...     'prices',   make_array(6, 5, 5),
...     'calories', make_array(743, 651, 582))
>>> t
burgers       | prices | calories
cheeseburger  | 6      | 743
hamburger     | 5      | 651
veggie burger | 5      | 582
>>> t.drop('prices')
burgers       | calories
cheeseburger  | 743
hamburger     | 651
veggie burger | 582
>>> t.drop(['burgers', 'calories'])
prices
6
5
5
>>> t.drop('burgers', 'calories')
prices
6
5
5
>>> t.drop([0, 2])
prices
6
5
5
>>> t.drop(0, 2)
prices
6
5
5
>>> t.drop(1)
burgers       | calories
cheeseburger  | 743
hamburger     | 651
veggie burger | 582
classmethod empty(labels=None)[source]

Creates an empty table. Column labels are optional. [Deprecated]

Args:
labels (None or list): If None, a table with 0
columns is created. If a list, each element is a column label in a table with 0 rows.
Returns:
A new instance of Table.
ev()

Finds expected value of distribution

Returns:
float

Expected value

Examples

>>> dist = Table().values([1, 2, 4]).probability([0.5, 0.4, 0.1])
>>> dist.ev()
1.7
>>> 1 * 0.5 + 2 * 0.4 + 4 * 0.1
1.7
event(x)

Shows the probability that distribution takes on value x or list of values x.

Parameters:
x : float or Iterable

An event represented either as a specific value in the domain or a subset of the domain

Returns:
Table

Shows the probabilities of each value in the event

Examples

>>> dist = Table().values([1, 2, 3, 4]).probability([1/4, 1/4, 1/4, 1/4])
>>> dist.event(2)
Domain | Probability
2      | 0.25
>>> dist.event([2,3])
Domain | Probability
2      | 0.25
3      | 0.25
exclude()[source]

Return a new Table without a sequence of rows excluded by number.

Args:
row_indices_or_slice (integer or list of integers or slice):
The row index, list of row indices or a slice of row indices to be excluded.
Returns:
A new instance of Table.
>>> t = Table().with_columns(
...     'letter grade', make_array('A+', 'A', 'A-', 'B+', 'B', 'B-'),
...     'gpa', make_array(4, 4, 3.7, 3.3, 3, 2.7))
>>> t
letter grade | gpa
A+           | 4
A            | 4
A-           | 3.7
B+           | 3.3
B            | 3
B-           | 2.7
>>> t.exclude(4)
letter grade | gpa
A+           | 4
A            | 4
A-           | 3.7
B+           | 3.3
B-           | 2.7
>>> t.exclude(-1)
letter grade | gpa
A+           | 4
A            | 4
A-           | 3.7
B+           | 3.3
B            | 3
>>> t.exclude(make_array(1, 3, 4))
letter grade | gpa
A+           | 4
A-           | 3.7
B-           | 2.7
>>> t.exclude(range(3))
letter grade | gpa
B+           | 3.3
B            | 3
B-           | 2.7

Note that exclude also supports NumPy-like indexing and slicing:

>>> t.exclude[:3]
letter grade | gpa
B+           | 3.3
B            | 3
B-           | 2.7
>>> t.exclude[1, 3, 4]
letter grade | gpa
A+           | 4
A-           | 3.7
B-           | 2.7
classmethod from_array(arr)[source]

Convert a structured NumPy array into a Table.

classmethod from_columns_dict(columns)[source]

Create a table from a mapping of column labels to column values. [Deprecated]

classmethod from_df(df)[source]

Convert a Pandas DataFrame into a Table.
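
A minimal sketch, assuming pandas is installed; the column order follows the DataFrame:

>>> import pandas as pd
>>> df = pd.DataFrame({'letter': ['a', 'b'], 'count': [9, 3]})
>>> Table.from_df(df)
letter | count
a      | 9
b      | 3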

classmethod from_records(records)[source]

Create a table from a sequence of records (dicts with fixed keys).

classmethod from_rows(rows, labels)[source]

Create a table from a sequence of rows (fixed-length sequences). [Deprecated]

group(column_or_label, collect=None)[source]

Group rows by unique values in a column; count or aggregate others.

Args:

column_or_label: values to group (column label or index, or array)

collect: a function applied to values in other columns for each group

Returns:
A Table with each row corresponding to a unique value in column_or_label, where the first column contains the unique values from column_or_label, and the second contains counts for each of the unique values. If collect is provided, a Table is returned with all original columns, each containing values calculated by first grouping rows according to column_or_label, then applying collect to each set of grouped values in the other columns.
Note:
The grouped column will appear first in the result table. If collect does not accept arguments with one of the column types, that column will be empty in the resulting table.
>>> marbles = Table().with_columns(
...    "Color", make_array("Red", "Green", "Blue", "Red", "Green", "Green"),
...    "Shape", make_array("Round", "Rectangular", "Rectangular", "Round", "Rectangular", "Round"),
...    "Amount", make_array(4, 6, 12, 7, 9, 2),
...    "Price", make_array(1.30, 1.30, 2.00, 1.75, 1.40, 1.00))
>>> marbles
Color | Shape       | Amount | Price
Red   | Round       | 4      | 1.3
Green | Rectangular | 6      | 1.3
Blue  | Rectangular | 12     | 2
Red   | Round       | 7      | 1.75
Green | Rectangular | 9      | 1.4
Green | Round       | 2      | 1
>>> marbles.group("Color") # just gives counts
Color | count
Blue  | 1
Green | 3
Red   | 2
>>> marbles.group("Color", max) # takes the max of each grouping, in each column
Color | Shape max   | Amount max | Price max
Blue  | Rectangular | 12         | 2
Green | Round       | 9          | 1.4
Red   | Round       | 7          | 1.75
>>> marbles.group("Shape", sum) # sum doesn't make sense for strings
Shape       | Color sum | Amount sum | Price sum
Rectangular |           | 27         | 4.7
Round       |           | 13         | 4.05
group_bar(column_label, **vargs)[source]

Plot a bar chart for the table.

The values of the specified column are grouped and counted, and one bar is produced for each group.

Note: This differs from bar in that there is no need to specify bar heights; the height of a category’s bar is the number of copies of that category in the given column. This method behaves more like hist in that regard, while bar behaves more like plot or scatter (which require the height of each point to be specified).

Args:
column_label (str or int): The name or index of a column
Kwargs:
overlay (bool): create a chart with one color per data column;
if False, each will be displayed separately.

width (float): The width of the plot, in inches.

height (float): The height of the plot, in inches.

vargs: Additional arguments that get passed into plt.bar.
See http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.bar for additional arguments that can be passed into vargs.
group_barh(column_label, **vargs)[source]

Plot a horizontal bar chart for the table.

The values of the specified column are grouped and counted, and one bar is produced for each group.

Note: This differs from barh in that there is no need to specify bar heights; the size of a category’s bar is the number of copies of that category in the given column. This method behaves more like hist in that regard, while barh behaves more like plot or scatter (which require the second coordinate of each point to be specified in another column).

Args:
column_label (str or int): The name or index of a column
Kwargs:
overlay (bool): create a chart with one color per data column;
if False, each will be displayed separately.

width (float): The width of the plot, in inches.

height (float): The height of the plot, in inches.

vargs: Additional arguments that get passed into plt.bar.
See http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.bar for additional arguments that can be passed into vargs.
groups(labels, collect=None)[source]

Group rows by multiple columns, count or aggregate others.

Args:

labels: list of column names (or indices) to group on

collect: a function applied to values in other columns for each group

Returns: A Table with each row corresponding to a unique combination of values in
the columns specified in labels, where the first columns are those specified in labels, followed by a column of counts for each of the unique values. If collect is provided, a Table is returned with all original columns, each containing values calculated by first grouping rows according to the values in the labels columns, then applying collect to each set of grouped values in the other columns.
Note:
The grouped columns will appear first in the result table. If collect does not accept arguments with one of the column types, that column will be empty in the resulting table.
>>> marbles = Table().with_columns(
...    "Color", make_array("Red", "Green", "Blue", "Red", "Green", "Green"),
...    "Shape", make_array("Round", "Rectangular", "Rectangular", "Round", "Rectangular", "Round"),
...    "Amount", make_array(4, 6, 12, 7, 9, 2),
...    "Price", make_array(1.30, 1.30, 2.00, 1.75, 1.40, 1.00))
>>> marbles
Color | Shape       | Amount | Price
Red   | Round       | 4      | 1.3
Green | Rectangular | 6      | 1.3
Blue  | Rectangular | 12     | 2
Red   | Round       | 7      | 1.75
Green | Rectangular | 9      | 1.4
Green | Round       | 2      | 1
>>> marbles.groups(["Color", "Shape"])
Color | Shape       | count
Blue  | Rectangular | 1
Green | Rectangular | 2
Green | Round       | 1
Red   | Round       | 2
>>> marbles.groups(["Color", "Shape"], sum)
Color | Shape       | Amount sum | Price sum
Blue  | Rectangular | 12         | 2
Green | Rectangular | 15         | 2.7
Green | Round       | 2          | 1
Red   | Round       | 11         | 3.05
hist(*columns, overlay=True, bins=None, bin_column=None, unit=None, counts=None, group=None, side_by_side=False, width=6, height=4, **vargs)[source]

Plots one histogram for each column in columns. If no column is specified, plot all columns.

Kwargs:
overlay (bool): If True, plots 1 chart with all the histograms
overlaid on top of each other (instead of the default behavior of one histogram for each column in the table). Also adds a legend that matches each bar color to its column. Note that if the histograms are not overlaid, they are not forced to the same scale.
bins (list or int): Lower bound for each bin in the
histogram or number of bins. If None, bins will be chosen automatically.
bin_column (column name or index): A column of bin lower bounds.
All other columns are treated as counts of these bins. If None, each value in each row is assigned a count of 1.

counts (column name or index): Deprecated name for bin_column.

unit (string): A name for the units of the plotted column (e.g.
‘kg’), to be used in the plot.
group (column name or index): A column of categories. The rows are
grouped by the values in this column, and a separate histogram is generated for each group. The histograms are overlaid or plotted separately depending on the overlay argument. If None, no such grouping is done.
side_by_side (bool): Whether histogram bins should be plotted side by
side (instead of directly overlaid). Makes sense only when plotting multiple histograms, either by passing several columns or by using the group option.
vargs: Additional arguments that get passed into :func:plt.hist.
See http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.hist for additional arguments that can be passed into vargs. These include: range, normed, cumulative, and orientation, to name a few.
>>> t = Table().with_columns(
...     'count',  make_array(9, 3, 3, 1),
...     'points', make_array(1, 2, 2, 10))
>>> t
count | points
9     | 1
3     | 2
3     | 2
1     | 10
>>> t.hist() 
<histogram of values in count>
<histogram of values in points>
>>> t = Table().with_columns(
...     'value',      make_array(101, 102, 103),
...     'proportion', make_array(0.25, 0.5, 0.25))
>>> t.hist(bin_column='value') 
<histogram of values weighted by corresponding proportions>
>>> t = Table().with_columns(
...     'value',    make_array(1,   2,   3,   2,   5  ),
...     'category', make_array('a', 'a', 'a', 'b', 'b'))
>>> t.hist('value', group='category') 
<two overlaid histograms of the data [1, 2, 3] and [2, 5]>
index_by(column_or_label)[source]

Return a dict keyed by values in a column that contains lists of rows corresponding to each value.
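
A short illustrative sketch; the dict is keyed by the distinct values of the chosen column, and each key maps to the list of matching rows:

>>> tiles = Table().with_columns(
...     'letter', make_array('a', 'b', 'a'),
...     'count',  make_array(9, 3, 3))
>>> rows_by_letter = tiles.index_by('letter')
>>> len(rows_by_letter['a'])
2
>>> len(rows_by_letter['b'])
1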

join(column_label, other, other_label=None)[source]

Creates a new table with the columns of self and other, containing rows for all values of a column that appear in both tables.

Args:
column_label (str): label of column in self that is used to
join rows of other.
other: Table object to join with self on matching values of
column_label.
Kwargs:
other_label (str): default None, assumes column_label.
Otherwise, the label of the column in other used to join rows.
Returns:
New table self joined with other by matching values in column_label and other_label. If the resulting join is empty, returns None.
>>> table = Table().with_columns('a', make_array(9, 3, 3, 1),
...     'b', make_array(1, 2, 2, 10),
...     'c', make_array(3, 4, 5, 6))
>>> table
a    | b    | c
9    | 1    | 3
3    | 2    | 4
3    | 2    | 5
1    | 10   | 6
>>> table2 = Table().with_columns( 'a', make_array(9, 1, 1, 1),
... 'd', make_array(1, 2, 2, 10),
... 'e', make_array(3, 4, 5, 6))
>>> table2
a    | d    | e
9    | 1    | 3
1    | 2    | 4
1    | 2    | 5
1    | 10   | 6
>>> table.join('a', table2)
a    | b    | c    | d    | e
1    | 10   | 6    | 2    | 4
1    | 10   | 6    | 2    | 5
1    | 10   | 6    | 10   | 6
9    | 1    | 3    | 1    | 3
>>> table.join('a', table2, 'a') # Equivalent to previous join
a    | b    | c    | d    | e
1    | 10   | 6    | 2    | 4
1    | 10   | 6    | 2    | 5
1    | 10   | 6    | 10   | 6
9    | 1    | 3    | 1    | 3
>>> table.join('a', table2, 'd') # Repeat column labels relabeled
a    | b    | c    | a_2  | e
1    | 10   | 6    | 9    | 3
>>> table2 #table2 has three rows with a = 1
a    | d    | e
9    | 1    | 3
1    | 2    | 4
1    | 2    | 5
1    | 10   | 6
>>> table #table has only one row with a = 1
a    | b    | c
9    | 1    | 3
3    | 2    | 4
3    | 2    | 5
1    | 10   | 6
labels

Return a tuple of column labels.

move_to_end(column_label)[source]

Move a column to the last in order.

move_to_start(column_label)[source]

Move a column to the first in order.
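
An illustrative sketch of both reordering methods; the column order of the table itself changes, so only the resulting order is indicated:

>>> t = Table().with_columns(
...     'letter', make_array('c', 'd'),
...     'count',  make_array(2, 4),
...     'points', make_array(3, 2))
>>> t.move_to_start('points') 
<t with columns reordered to: points, letter, count>
>>> t.move_to_end('letter') 
<t with columns reordered to: points, count, letter>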

normalized()

Returns the distribution with the probabilities normalized so that they sum to 1

Returns:
Table

A distribution with the probabilities normalized

Examples

>>> Table().values([1, 2, 3]).probability([1, 1, 1])
Value | Probability
1     | 1
2     | 1
3     | 1
>>> Table().values([1, 2, 3]).probability([1, 1, 1]).normalized()
Value | Probability
1     | 0.333333
2     | 0.333333
3     | 0.333333
num_columns

Number of columns.

num_rows

Number of rows.

percentile(p)[source]

Return a new table with one row containing the pth percentile for each column.

Assumes that each column only contains one type of value.

Returns a new table with one row and the same column labels. The row contains the pth percentile of the original column, where the pth percentile of a column is the smallest value that is at least as large as p% of the numbers in the column.

>>> table = Table().with_columns(
...     'count',  make_array(9, 3, 3, 1),
...     'points', make_array(1, 2, 2, 10))
>>> table
count | points
9     | 1
3     | 2
3     | 2
1     | 10
>>> table.percentile(80)
count | points
9     | 10
pivot(columns, rows, values=None, collect=None, zero=None)[source]

Generate a table with a column for each unique value in columns, with rows for each unique value in rows. Each row counts/aggregates the values that match both row and column based on collect.

Args:
columns – a single column label or index, (str or int),
used to create new columns, based on its unique values.
rows – row labels or indices, (str or int or list),
used to create new rows based on its unique values.
values – column label in table for use in aggregation.
Default None.
collect – aggregation function, used to group values
over row-column combinations. Default None.

zero – zero value for non-existent row-column combinations.

Raises:
TypeError – if collect is passed in and values is not,
vice versa.
Returns:
New pivot table, with row-column combinations, as specified, with aggregated values by collect across the intersection of columns and rows. Simple counts provided if values and collect are None, as default.
>>> titanic = Table().with_columns('age', make_array(21, 44, 56, 89, 95
...    , 40, 80, 45), 'survival', make_array(0,0,0,1, 1, 1, 0, 1),
...    'gender',  make_array('M', 'M', 'M', 'M', 'F', 'F', 'F', 'F'),
...    'prediction', make_array(0, 0, 1, 1, 0, 1, 0, 1))
>>> titanic
age  | survival | gender | prediction
21   | 0        | M      | 0
44   | 0        | M      | 0
56   | 0        | M      | 1
89   | 1        | M      | 1
95   | 1        | F      | 0
40   | 1        | F      | 1
80   | 0        | F      | 0
45   | 1        | F      | 1
>>> titanic.pivot('survival', 'gender')
gender | 0    | 1
F      | 1    | 3
M      | 3    | 1
>>> titanic.pivot('prediction', 'gender')
gender | 0    | 1
F      | 2    | 2
M      | 2    | 2
>>> titanic.pivot('survival', 'gender', values='age', collect = np.mean)
gender | 0       | 1
F      | 80      | 60
M      | 40.3333 | 89
>>> titanic.pivot('survival', make_array('prediction', 'gender'))
prediction | gender | 0    | 1
0          | F      | 1    | 1
0          | M      | 2    | 0
1          | F      | 0    | 2
1          | M      | 1    | 1
>>> titanic.pivot('survival', 'gender', values = 'age')
Traceback (most recent call last):
   ...
TypeError: values requires collect to be specified
>>> titanic.pivot('survival', 'gender', collect = np.mean)
Traceback (most recent call last):
   ...
TypeError: collect requires values to be specified
pivot_bin(pivot_columns, value_column, bins=None, **vargs)[source]

Form a table with columns formed by the unique tuples in pivot_columns containing counts per bin of the values associated with each tuple in the value_column.

By default, bins are chosen to contain all values in the value_column. The following named arguments from numpy.histogram can be applied to specialize bin widths:

Args:
bins (int or sequence of scalars): If bins is an int,
it defines the number of equal-width bins in the given range (10, by default). If bins is a sequence, it defines the bin edges, including the rightmost edge, allowing for non-uniform bin widths.
range ((float, float)): The lower and upper range of
the bins. If not provided, range contains all values in the table. Values outside the range are ignored.
normed (bool): If False, the result will contain the number of
samples in each bin. If True, the result is normalized such that the integral over the range is 1.
pivot_hist(pivot_column_label, value_column_label, overlay=True, width=6, height=4, **vargs)[source]

Draw histograms of each category in a column.
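
A minimal sketch using a small illustrative table:

>>> marbles = Table().with_columns(
...    "Color", make_array("Red", "Green", "Blue", "Red", "Green", "Green"),
...    "Amount", make_array(4, 6, 12, 7, 9, 2))
>>> marbles.pivot_hist('Color', 'Amount') 
<one histogram of Amount per Color category, overlaid on the same axes>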

plot(column_for_xticks=None, select=None, overlay=True, width=6, height=4, **vargs)[source]

Plot line charts for the table.

Args:
column_for_xticks (str/array): A column containing x-axis labels
Kwargs:
overlay (bool): create a chart with one color per data column;
if False, each plot will be displayed separately.
vargs: Additional arguments that get passed into plt.plot.
See http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot for additional arguments that can be passed into vargs.
Raises:
ValueError – Every selected column must be numerical.
Returns:
Returns a line plot (connected scatter). Each plot is labeled using the values in column_for_xticks and one plot is produced for all other columns in self (or for the columns designated by select).
>>> table = Table().with_columns(
...     'days',  make_array(0, 1, 2, 3, 4, 5),
...     'price', make_array(90.5, 90.00, 83.00, 95.50, 82.00, 82.00),
...     'projection', make_array(90.75, 82.00, 82.50, 82.50, 83.00, 82.50))
>>> table
days | price | projection
0    | 90.5  | 90.75
1    | 90    | 82
2    | 83    | 82.5
3    | 95.5  | 82.5
4    | 82    | 83
5    | 82    | 82.5
>>> table.plot('days') 
<line graph with days as x-axis and lines for price and projection>
>>> table.plot('days', overlay=False) 
<line graph with days as x-axis and line for price>
<line graph with days as x-axis and line for projection>
>>> table.plot('days', 'price') 
<line graph with days as x-axis and line for price>
prob_event(x)

Finds the probability of an event x.

Parameters:
x : float or Iterable

An event represented either as a specific value in the domain or a subset of the domain

Returns:
float

Probability of the event

Examples

>>> dist = Table().values([1, 2, 3, 4]).probability([1/4, 1/4, 1/4, 1/4])
>>> dist.prob_event(2)
0.25
>>> dist.prob_event([2, 3])
0.5
>>> dist.prob_event(np.arange(1, 5))
1.0
probability(values)

Assigns probabilities to domain values.

Parameters:
values : List or Array

Values that must correspond to the domain in the same order.

Returns:
Table

A probability distribution with those probabilities
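
Examples

A minimal sketch; the values and probabilities here are illustrative:

>>> dist = Table().values([2, 3, 4]).probability([0.25, 0.5, 0.25])
>>> dist
Value | Probability
2     | 0.25
3     | 0.5
4     | 0.25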

probability_function(pfunc)

Assigns probabilities to a Distribution via a probability function. The probability function is applied to each value of the domain. Must have domain values in the first columns.

Parameters:
pfunc : func

Probability function of the distribution.

Returns:
Table

Table with probabilities in its last column.
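
Examples

A minimal sketch; the function here is illustrative and should produce probabilities that sum to 1 over the domain:

>>> dist = Table().values([1, 2, 3, 4]).probability_function(lambda x: x / 10)
>>> dist
Value | Probability
1     | 0.1
2     | 0.2
3     | 0.3
4     | 0.4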

classmethod read_table(filepath_or_buffer, *args, **vargs)[source]

Read a table from a file or web address.

filepath_or_buffer – string or file handle / StringIO; The string
could be a URL. Valid URL schemes include http, ftp, s3, and file.
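
A minimal sketch; the file name here is hypothetical (see to_csv below for the complementary writer):

>>> Table.read_table('my_table.csv') 
<table with the rows and columns of my_table.csv>
>>> Table.read_table('http://example.com/my_table.csv') 
<table read from a CSV file at the given URL>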
relabel(column_label, new_label)[source]

Changes the label(s) of column(s) specified by column_label to labels in new_label.

Args:
column_label – (single str or array of str) The label(s) of
columns to be changed to new_label.
new_label – (single str or array of str): The label name(s)
of columns to replace column_label.
Raises:
ValueError – if column_label is not in table, or if
column_label and new_label are not of equal length.
TypeError – if column_label and/or new_label is not
str.
Returns:
Original table with new_label in place of column_label.
>>> table = Table().with_columns(
...     'points', make_array(1, 2, 3),
...     'id',     make_array(12345, 123, 5123))
>>> table.relabel('id', 'yolo')
points | yolo
1      | 12345
2      | 123
3      | 5123
>>> table.relabel(make_array('points', 'yolo'),
...   make_array('red', 'blue'))
red  | blue
1    | 12345
2    | 123
3    | 5123
>>> table.relabel(make_array('red', 'green', 'blue'),
...   make_array('cyan', 'magenta', 'yellow', 'key'))
Traceback (most recent call last):
    ...
ValueError: Invalid arguments. column_label and new_label must be of equal length.
relabeled(label, new_label)[source]

Return a new table with label specifying column label(s) replaced by corresponding new_label.

Args:
label – (str or array of str) The label(s) of
columns to be changed.
new_label – (str or array of str): The new label(s) of
columns to be changed. Same number of elements as label.
Raises:
ValueError – if label does not exist in
table, or if the label and new_label are not of equal length. Also, raised if label and/or new_label are not str.
Returns:
New table with new_label in place of label.
>>> tiles = Table().with_columns('letter', make_array('c', 'd'),
...    'count', make_array(2, 4))
>>> tiles
letter | count
c      | 2
d      | 4
>>> tiles.relabeled('count', 'number')
letter | number
c      | 2
d      | 4
>>> tiles  # original table unmodified
letter | count
c      | 2
d      | 4
>>> tiles.relabeled(make_array('letter', 'count'),
...   make_array('column1', 'column2'))
column1 | column2
c       | 2
d       | 4
>>> tiles.relabeled(make_array('letter', 'number'),
...  make_array('column1', 'column2'))
Traceback (most recent call last):
    ...
ValueError: Invalid labels. Column labels must already exist in table in order to be replaced.
remove(row_or_row_indices)[source]

Removes a row or multiple rows of a table in place.

remove_zeros()

Removes all values with zero probability from the Distribution.

Returns:
Distribution

Examples

>>> dist = Table().values([2, 3, 4, 5]).probability([0.5, 0, 0.5, 0])
>>> dist
Value | Probability
2     | 0.5
3     | 0
4     | 0.5
5     | 0
>>> dist.remove_zeros()
Value | Probability
2     | 0.5
4     | 0.5
row(index)[source]

Return a row.
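
A short illustrative sketch:

>>> tiles = Table().with_columns('letter', make_array('c', 'd'),
...     'count', make_array(2, 4))
>>> tiles.row(1) 
<Row with letter 'd' and count 4>
>>> tiles.row(1).item('letter') == 'd'
True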

rows

Return a view of all rows.

sample(k=None, with_replacement=True, weights=None)[source]

Return a new table where k rows are randomly sampled from the original table.

Args:
k – specifies the number of rows (int) to be sampled from
the table. Default is k equal to number of rows in the table.
with_replacement – (bool) By default True;
Samples k rows with replacement from table, else samples k rows without replacement.
weights – Array specifying probability the ith row of the
table is sampled. Defaults to None, which samples each row with equal probability. weights must be a valid probability distribution – i.e. an array the length of the number of rows, summing to 1.
Raises:
ValueError – if weights is not length equal to number of rows
in the table; or, if weights does not sum to 1.
Returns:
A new instance of Table with k rows resampled.
>>> jobs = Table().with_columns(
...     'job',  make_array('a', 'b', 'c', 'd'),
...     'wage', make_array(10, 20, 15, 8))
>>> jobs
job  | wage
a    | 10
b    | 20
c    | 15
d    | 8
>>> jobs.sample() 
job  | wage
b    | 20
b    | 20
a    | 10
d    | 8
>>> jobs.sample(with_replacement=True) 
job  | wage
d    | 8
b    | 20
c    | 15
a    | 10
>>> jobs.sample(k = 2) 
job  | wage
b    | 20
c    | 15
>>> ws =  make_array(0.5, 0.5, 0, 0)
>>> jobs.sample(k=2, with_replacement=True, weights=ws) 
job  | wage
a    | 10
a    | 10
>>> jobs.sample(k=2, weights=make_array(1, 0, 1, 0))
Traceback (most recent call last):
    ...
ValueError: probabilities do not sum to 1

Weights must be the length of the table:

>>> jobs.sample(k=2, weights=make_array(1, 0, 0))
Traceback (most recent call last):
    ...
ValueError: a and p must have same size

sample_from_dist(n=1)

Randomly samples from the distribution.

Note that this function was previously named sample but was renamed due to naming conflicts with the datascience library.

Parameters:
n : int

Number of times to sample from the distribution (default: 1)

Returns:
float or array

Samples from the distribution

Examples

>>> dist = Table().with_columns(
...    'Value', make_array(2, 3, 4),
...    'Probability', make_array(0.25, 0.5, 0.25))
>>> dist.sample_from_dist()
3
>>> dist.sample_from_dist()
2
>>> dist.sample_from_dist(10)
array([3, 2, 2, 4, 3, 4, 3, 4, 3, 3])
sample_from_distribution(distribution, k, proportions=False)[source]

Return a new table with the same number of rows and a new column. The values in the distribution column define a multinomial. They are replaced by sample counts/proportions in the output.

>>> sizes = Table(['size', 'count']).with_rows([
...     ['small', 50],
...     ['medium', 100],
...     ['big', 50],
... ])
>>> sizes.sample_from_distribution('count', 1000) 
size   | count | count sample
small  | 50    | 239
medium | 100   | 496
big    | 50    | 265
>>> sizes.sample_from_distribution('count', 1000, True) 
size   | count | count sample
small  | 50    | 0.24
medium | 100   | 0.51
big    | 50    | 0.25
scatter(column_for_x, select=None, overlay=True, fit_line=False, colors=None, labels=None, sizes=None, width=5, height=5, s=20, **vargs)[source]

Creates scatterplots, optionally adding a line of best fit.

Args:
column_for_x (str): The column to use for the x-axis values
and label of the scatter plots.
Kwargs:
overlay (bool): If true, creates a chart with one color
per data column; if False, each plot will be displayed separately.

fit_line (bool): draw a line of best fit for each set of points.

vargs: Additional arguments that get passed into plt.scatter.
See http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.scatter for additional arguments that can be passed into vargs. These include: marker and norm, to name a couple.

colors: A column of categories to be used for coloring dots.

labels: A column of text labels to annotate dots.

sizes: A column of values to set the relative areas of dots.

s: Size of dots. If sizes is also provided, then dots will be
in the range 0 to 2 * s.
Raises:
ValueError – Every column, column_for_x or select, must be numerical
Returns:
Scatter plot of values of column_for_x plotted against values for all other columns in self. Each plot uses the values in column_for_x for horizontal positions. One plot is produced for all other columns in self as y (or for the columns designated by select).
>>> table = Table().with_columns(
...     'x', make_array(9, 3, 3, 1),
...     'y', make_array(1, 2, 2, 10),
...     'z', make_array(3, 4, 5, 6))
>>> table
x    | y    | z
9    | 1    | 3
3    | 2    | 4
3    | 2    | 5
1    | 10   | 6
>>> table.scatter('x') 
<scatterplot of values in y and z on x>
>>> table.scatter('x', overlay=False) 
<scatterplot of values in y on x>
<scatterplot of values in z on x>
>>> table.scatter('x', fit_line=True) 
<scatterplot of values in y and z on x with lines of best fit>
sd()

Finds standard deviation of Distribution.

Returns:
float

Standard Deviation

Examples

>>> dist = Table().values([1, 2, 4]).probability([0.5, 0.4, 0.1])
>>> dist.sd()
0.9
select(*column_or_columns)[source]

Return a table with only the columns in column_or_columns.

Args:
column_or_columns: Columns to select from the Table as either column labels (str) or column indices (int).
Returns:
A new instance of Table containing only selected columns. The columns of the new Table are in the order given in column_or_columns.
Raises:
KeyError if any of column_or_columns are not in the table.
>>> flowers = Table().with_columns(
...     'Number of petals', make_array(8, 34, 5),
...     'Name', make_array('lotus', 'sunflower', 'rose'),
...     'Weight', make_array(10, 5, 6)
... )
>>> flowers
Number of petals | Name      | Weight
8                | lotus     | 10
34               | sunflower | 5
5                | rose      | 6
>>> flowers.select('Number of petals', 'Weight')
Number of petals | Weight
8                | 10
34               | 5
5                | 6
>>> flowers  # original table unchanged
Number of petals | Name      | Weight
8                | lotus     | 10
34               | sunflower | 5
5                | rose      | 6
>>> flowers.select(0, 2)
Number of petals | Weight
8                | 10
34               | 5
5                | 6
set_format(column_or_columns, formatter)[source]

Set the format of a column.
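
A minimal sketch, assuming the PercentFormatter exported by the datascience package (other formatters such as NumberFormatter work the same way):

>>> from datascience import PercentFormatter
>>> t = Table().with_columns('rate', make_array(0.25, 0.5))
>>> t.set_format('rate', PercentFormatter) 
<t with the rate column displayed as percentages>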

show(max_rows=0)[source]

Display the table.

sort(column_or_label, descending=False, distinct=False)[source]

Return a Table of rows sorted according to the values in a column.

Args:

column_or_label: the column whose values are used for sorting.

descending: if True, sorting will be in descending, rather than
ascending order.
distinct: if True, repeated values in column_or_label will
be omitted.
Returns:
An instance of Table containing rows sorted based on the values in column_or_label.
>>> marbles = Table().with_columns(
...    "Color", make_array("Red", "Green", "Blue", "Red", "Green", "Green"),
...    "Shape", make_array("Round", "Rectangular", "Rectangular", "Round", "Rectangular", "Round"),
...    "Amount", make_array(4, 6, 12, 7, 9, 2),
...    "Price", make_array(1.30, 1.30, 2.00, 1.75, 1.40, 1.00))
>>> marbles
Color | Shape       | Amount | Price
Red   | Round       | 4      | 1.3
Green | Rectangular | 6      | 1.3
Blue  | Rectangular | 12     | 2
Red   | Round       | 7      | 1.75
Green | Rectangular | 9      | 1.4
Green | Round       | 2      | 1
>>> marbles.sort("Amount")
Color | Shape       | Amount | Price
Green | Round       | 2      | 1
Red   | Round       | 4      | 1.3
Green | Rectangular | 6      | 1.3
Red   | Round       | 7      | 1.75
Green | Rectangular | 9      | 1.4
Blue  | Rectangular | 12     | 2
>>> marbles.sort("Amount", descending = True)
Color | Shape       | Amount | Price
Blue  | Rectangular | 12     | 2
Green | Rectangular | 9      | 1.4
Red   | Round       | 7      | 1.75
Green | Rectangular | 6      | 1.3
Red   | Round       | 4      | 1.3
Green | Round       | 2      | 1
>>> marbles.sort(3) # the Price column
Color | Shape       | Amount | Price
Green | Round       | 2      | 1
Red   | Round       | 4      | 1.3
Green | Rectangular | 6      | 1.3
Green | Rectangular | 9      | 1.4
Red   | Round       | 7      | 1.75
Blue  | Rectangular | 12     | 2
>>> marbles.sort(3, distinct = True)
Color | Shape       | Amount | Price
Green | Round       | 2      | 1
Red   | Round       | 4      | 1.3
Green | Rectangular | 9      | 1.4
Red   | Round       | 7      | 1.75
Blue  | Rectangular | 12     | 2
split(k)[source]

Return a tuple of two tables where the first table contains k rows randomly sampled and the second contains the remaining rows.

Args:
k (int): The number of rows randomly sampled into the first
table. k must be between 1 and num_rows - 1.
Raises:
ValueError: k is not between 1 and num_rows - 1.
Returns:
A tuple containing two instances of Table.
>>> jobs = Table().with_columns(
...     'job',  make_array('a', 'b', 'c', 'd'),
...     'wage', make_array(10, 20, 15, 8))
>>> jobs
job  | wage
a    | 10
b    | 20
c    | 15
d    | 8
>>> sample, rest = jobs.split(3)
>>> sample 
job  | wage
c    | 15
a    | 10
b    | 20
>>> rest 
job  | wage
d    | 8
stack(key, labels=None)[source]

Takes k original columns and returns two columns: the first contains the column names and the second contains the associated data.

stats(ops=(<built-in function min>, <built-in function max>, <function median>, <built-in function sum>))[source]

Compute statistics for each column and place them in a table.
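
A minimal sketch; the default ops are min, max, median, and sum, and only the output shape is indicated:

>>> t = Table().with_columns(
...     'count',  make_array(9, 3, 3, 1),
...     'points', make_array(1, 2, 2, 10))
>>> t.stats() 
<table with one row per statistic (min, max, median, sum) computed for each column>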

take()[source]

Return a new Table with selected rows taken by index.

Args:
row_indices_or_slice (integer or array of integers): The row index, list of row indices or a slice of row indices to be selected.
Returns:
A new instance of Table with selected rows in order corresponding to row_indices_or_slice.
Raises:
IndexError, if any row_indices_or_slice is out of bounds with respect to column length.
>>> grades = Table().with_columns('letter grade',
...     make_array('A+', 'A', 'A-', 'B+', 'B', 'B-'),
...     'gpa', make_array(4, 4, 3.7, 3.3, 3, 2.7))
>>> grades
letter grade | gpa
A+           | 4
A            | 4
A-           | 3.7
B+           | 3.3
B            | 3
B-           | 2.7
>>> grades.take(0)
letter grade | gpa
A+           | 4
>>> grades.take(-1)
letter grade | gpa
B-           | 2.7
>>> grades.take(make_array(2, 1, 0))
letter grade | gpa
A-           | 3.7
A            | 4
A+           | 4
>>> grades.take[:3]
letter grade | gpa
A+           | 4
A            | 4
A-           | 3.7
>>> grades.take(np.arange(0,3))
letter grade | gpa
A+           | 4
A            | 4
A-           | 3.7
>>> grades.take(10)
Traceback (most recent call last):
    ...
IndexError: index 10 is out of bounds for axis 0 with size 6
to_array()[source]

Convert the table to a structured NumPy array.

to_csv(filename)[source]

Creates a CSV file with the provided filename.

The CSV is created in such a way that if we run table.to_csv('my_table.csv') we can recreate the same table with Table.read_table('my_table.csv').

Args:
filename (str): The filename of the output CSV file.
Returns:
None, outputs a file with name filename.
>>> jobs = Table().with_columns(
...     'job',  make_array('a', 'b', 'c', 'd'),
...     'wage', make_array(10, 20, 15, 8))
>>> jobs
job  | wage
a    | 10
b    | 20
c    | 15
d    | 8
>>> jobs.to_csv('my_table.csv') 
<outputs a file called my_table.csv in the current directory>
to_df()[source]

Convert the table to a Pandas DataFrame.
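
A short sketch, assuming pandas is installed:

>>> t = Table().with_columns('letter', make_array('c', 'd'),
...     'count', make_array(2, 4))
>>> df = t.to_df()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
>>> list(df.columns)
['letter', 'count']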

to_joint(X_column_label=None, Y_column_label=None, probability_column_label=None, reverse=True)

Converts a table of probabilities associated with two variables into a JointDistribution object

Parameters:
table : Table

You can either pass in a Table directly or call the to_joint() method of that Table. See examples.

X_column_label (optional) : str

Label for the first variable. Defaults to the label of the first variable of the Table.

Y_column_label (optional) : str

Label for the second variable. Defaults to the label of the second variable of the Table.

probability_column_label (optional) : str

Label for probabilities.

reverse (optional) : bool

If True, the vertical values will be reversed.

Returns:
JointDistribution

A JointDistribution object.

Examples

>>> dist1 = Table().values([0,1],[2,3])
>>> dist1['Probability'] = make_array(0.1, 0.2, 0.3, 0.4)
>>> dist1.to_joint()
     X=0  X=1
Y=3  0.2  0.4
Y=2  0.1  0.3
>>> dist2 = Table().values('Coin1',['H','T'], 'Coin2', ['H','T'])
>>> dist2['Probability'] = np.array([0.4*0.6, 0.6*0.6, 0.4*0.4, 0.6*0.4])
>>> dist2.toJoint()
         Coin1=H  Coin1=T
Coin2=T     0.36     0.24
Coin2=H     0.24     0.16
to_markov_chain()

Constructs a Markov Chain from the Table.

Returns:
MarkovChain
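
Examples

A minimal sketch, mirroring the table construction used in the MarkovChain.from_table example below:

>>> table = Table().states(make_array('A', 'B'))\
...     .transition_probability(make_array(0.5, 0.5, 0.3, 0.7))
>>> table.to_markov_chain()
     A    B
A  0.5  0.5
B  0.3  0.7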
transition_function(pfunc)

Assigns transition probabilities to a Distribution via a probability function. The probability function is applied to each value of the domain. Must have domain values in the first columns.

Parameters:
pfunc : variate function

Conditional probability function of the distribution ( P(Y | X))

Returns:
Table

Table with those probabilities in its final column
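
Examples

A minimal sketch with an illustrative (uniform) conditional probability function:

>>> Table().states(make_array('A', 'B')).transition_function(lambda s1, s2: 0.5)
Source | Target | Probability
A      | A      | 0.5
A      | B      | 0.5
B      | A      | 0.5
B      | B      | 0.5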

transition_probability(values)

For a multivariate probability distribution, assigns transition probabilities, i.e. P(Y | X).

Parameters:
values : List or Array

Values that must correspond to the domain in the same order

Returns:
Table

A probability distribution with those probabilities

var()

Finds variance of distribution

Returns:
float

Variance

Examples

>>> dist = Table().values([1, 2, 4]).probability([0.5, 0.4, 0.1])
>>> dist.var()
0.81
>>> (1 * 0.5 + 4 * 0.4 + 16 * 0.1) - (1.7) ** 2
0.81
where(column_or_label, value_or_predicate=None, other=None)[source]

Return a new Table containing rows where value_or_predicate returns True for values in column_or_label.

Args:

column_or_label: A column of the Table either as a label (str) or an index (int). Can also be an array of booleans; only the rows where the array value is True are kept.

value_or_predicate: If a function, it is applied to every value in column_or_label. Only the rows where value_or_predicate returns True are kept. If a single value, only the rows where the values in column_or_label are equal to value_or_predicate are kept.

other: Optional additional column label for value_or_predicate to make pairwise comparisons. See the examples below for usage. When other is supplied, value_or_predicate must be a callable function.

Returns:

If value_or_predicate is a function, returns a new Table containing only the rows where value_or_predicate(val) is True for the values val in column_or_label.

If value_or_predicate is a value, returns a new Table containing only the rows where the values in column_or_label are equal to value_or_predicate.

If column_or_label is an array of booleans, returns a new Table containing only the rows where column_or_label is True.

>>> marbles = Table().with_columns(
...    "Color", make_array("Red", "Green", "Blue",
...                        "Red", "Green", "Green"),
...    "Shape", make_array("Round", "Rectangular", "Rectangular",
...                        "Round", "Rectangular", "Round"),
...    "Amount", make_array(4, 6, 12, 7, 9, 2),
...    "Price", make_array(1.30, 1.20, 2.00, 1.75, 0, 3.00))
>>> marbles
Color | Shape       | Amount | Price
Red   | Round       | 4      | 1.3
Green | Rectangular | 6      | 1.2
Blue  | Rectangular | 12     | 2
Red   | Round       | 7      | 1.75
Green | Rectangular | 9      | 0
Green | Round       | 2      | 3

Use a value to select matching rows

>>> marbles.where("Price", 1.3)
Color | Shape | Amount | Price
Red   | Round | 4      | 1.3

In general, a higher order predicate function such as the functions in datascience.predicates.are can be used.

>>> from datascience.predicates import are
>>> # equivalent to previous example
>>> marbles.where("Price", are.equal_to(1.3))
Color | Shape | Amount | Price
Red   | Round | 4      | 1.3
>>> marbles.where("Price", are.above(1.5))
Color | Shape       | Amount | Price
Blue  | Rectangular | 12     | 2
Red   | Round       | 7      | 1.75
Green | Round       | 2      | 3

Use the optional argument other to apply predicates to compare columns.

>>> marbles.where("Price", are.above, "Amount")
Color | Shape | Amount | Price
Green | Round | 2      | 3
>>> marbles.where("Price", are.equal_to, "Amount") # empty table
Color | Shape | Amount | Price
with_column(label, values, *rest)[source]

Return a new table with an additional or replaced column.

Args:
label (str): The column label. If an existing label is used,
the existing column will be replaced in the new table.
values (single value or sequence): If a single value, every
value in the new column is values. If sequence of values, new column takes on values in values.
rest: An alternating list of labels and values describing
additional columns. See with_columns for a full description.
Raises:
ValueError: If
  • label is not a valid column name
  • label is not of type (str)
  • values is a list/array that does not have the same
    length as the number of rows in the table.
Returns:
copy of original table with new or replaced column
>>> alphabet = Table().with_column('letter', make_array('c','d'))
>>> alphabet = alphabet.with_column('count', make_array(2, 4))
>>> alphabet
letter | count
c      | 2
d      | 4
>>> alphabet.with_column('permutes', make_array('a', 'g'))
letter | count | permutes
c      | 2     | a
d      | 4     | g
>>> alphabet
letter | count
c      | 2
d      | 4
>>> alphabet.with_column('count', 1)
letter | count
c      | 1
d      | 1
>>> alphabet.with_column(1, make_array(1, 2))
Traceback (most recent call last):
    ...
ValueError: The column label must be a string, but a int was given
>>> alphabet.with_column('bad_col', make_array(1))
Traceback (most recent call last):
    ...
ValueError: Column length mismatch. New column does not have the same number of rows as table.
with_columns(*labels_and_values)[source]

Return a table with additional or replaced columns.

Args:
labels_and_values: An alternating list of labels and values or
a list of label-value pairs. If one of the labels is in existing table, then every value in the corresponding column is set to that value. If label has only a single value (int), every row of corresponding column takes on that value.
Raises:
ValueError: If
  • any label in labels_and_values is not a valid column
    name, i.e. if label is not of type (str).
  • if any value in labels_and_values is a list/array and
    does not have the same length as the number of rows in the table.
AssertionError:
  • ‘incorrect columns format’, if passed more than one sequence
    (iterables) for labels_and_values.
  • ‘even length sequence required’ if missing a pair in
    label-value pairs.
Returns:
Copy of original table with new or replaced columns. Columns added in order of labels. Equivalent to with_column(label, value) when passed only one label-value pair.
>>> players = Table().with_columns('player_id',
...     make_array(110234, 110235), 'wOBA', make_array(.354, .236))
>>> players
player_id | wOBA
110234    | 0.354
110235    | 0.236
>>> players = players.with_columns('salaries', 'N/A', 'season', 2016)
>>> players
player_id | wOBA  | salaries | season
110234    | 0.354 | N/A      | 2016
110235    | 0.236 | N/A      | 2016
>>> salaries = Table().with_column('salary',
...     make_array('$500,000', '$15,500,000'))
>>> players.with_columns('salaries', salaries.column('salary'),
...     'years', make_array(6, 1))
player_id | wOBA  | salaries    | season | years
110234    | 0.354 | $500,000    | 2016   | 6
110235    | 0.236 | $15,500,000 | 2016   | 1
>>> players.with_columns(2, make_array('$600,000', '$20,000,000'))
Traceback (most recent call last):
    ...
ValueError: The column label must be a string, but a int was given
>>> players.with_columns('salaries', make_array('$600,000'))
Traceback (most recent call last):
    ...
ValueError: Column length mismatch. New column does not have the same number of rows as table.
with_row(row)[source]

Return a table with an additional row.

Args:
row (sequence): A value for each column.
Raises:
ValueError: If the row length differs from the column count.
>>> tiles = Table(make_array('letter', 'count', 'points'))
>>> tiles.with_row(['c', 2, 3]).with_row(['d', 4, 2])
letter | count | points
c      | 2     | 3
d      | 4     | 2
with_rows(rows)[source]

Return a table with additional rows.

Args:

rows (sequence of sequences): Each row has a value per column.

If rows is a 2-d array, its shape must be (_, n) for n columns.

Raises:
ValueError: If a row length differs from the column count.
>>> tiles = Table(make_array('letter', 'count', 'points'))
>>> tiles.with_rows(make_array(make_array('c', 2, 3),
...     make_array('d', 4, 2)))
letter | count | points
c      | 2     | 3
d      | 4     | 2

prob140 JointDistribution

class prob140.JointDistribution(data=None, index=None, columns=None, dtype=None, copy=False)[source]
both_marginals()[source]

Finds the marginal distribution of both variables.

Returns:
JointDistribution Table.

Examples

>>> dist1 = Table().values([0, 1], [2, 3]).probability([0.1, 0.2, 0.3, 0.4]).to_joint()
>>> dist1.both_marginals()
                    X=0  X=1  Sum: Marginal of Y
Y=3                 0.2  0.4                 0.6
Y=2                 0.1  0.3                 0.4
Sum: Marginal of X  0.3  0.7                 1.0
conditional_dist(label, given='', show_ev=False)[source]

Finds the conditional distribution of the variable label, given the values of the variable given.

Parameters:
label : String

The variable whose conditional distribution is found.

given : String

The variable to condition on.

Returns:
JointDistribution Table

Examples

>>> coins = Table().values('Coin1', ['H', 'T'], 'Coin2', ['H','T']).probability(np.array([0.24, 0.36, 0.16,0.24])).to_joint()
>>> coins.conditional_dist('Coin1', 'Coin2')
                          Coin1=H  Coin1=T  Sum
Dist. of Coin1 | Coin2=H      0.6      0.4  1.0
Dist. of Coin1 | Coin2=T      0.6      0.4  1.0
Marginal of Coin1             0.6      0.4  1.0
>>> coins.conditional_dist('Coin2', 'Coin1')
         Dist. of Coin2 | Coin1=H  Dist. of Coin2 | Coin1=T  Marginal of Coin2
Coin2=H                       0.4                       0.4                0.4
Coin2=T                       0.6                       0.6                0.6
Sum                           1.0                       1.0                1.0
classmethod from_table(table, reverse=True)[source]

Constructs a JointDistribution from a Table.

Parameters:
table : Table

3-column table with RV1, RV2, and joint probability

reverse : bool (optional)

If True, vertical random variables are reversed. (Default: True)

Returns:
JointDistribution
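
Examples

A minimal sketch, mirroring the to_joint example for Tables above:

>>> table = Table().values([0, 1], [2, 3]).probability([0.1, 0.2, 0.3, 0.4])
>>> JointDistribution.from_table(table)
     X=0  X=1
Y=3  0.2  0.4
Y=2  0.1  0.3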
get_possible_values(label='')[source]

Returns the possible values. If a label is given, returns the values for that random variable. Automatically converts to float/int if relevant.

Parameters:
label : str

Name of random variable.

Returns:
List of values.
marginal(label)[source]

Returns the marginal distribution of label.

Parameters:
label : String

The label of the variable of which we want to find the marginal distribution.

Returns:
JointDistribution Table

Examples

>>> dist2 = Table().values('Coin1', ['H', 'T'], 'Coin2', ['H', 'T']).probability(np.array([0.24, 0.36, 0.16, 0.24])).to_joint()
>>> dist2.marginal('Coin1')
                        Coin1=H  Coin1=T
Coin2=T                    0.36     0.24
Coin2=H                    0.24     0.16
Sum: Marginal of Coin1     0.60     0.40
>>> dist2.marginal('Coin2')
         Coin1=H  Coin1=T  Sum: Marginal of Coin2
Coin2=T     0.36     0.24                     0.6
Coin2=H     0.24     0.16                     0.4
marginal_dist(label)[source]

Finds the marginal distribution of label and returns it as a single-variable distribution.

Parameters:
label

The label of the variable of which we want to find the marginal distribution.

Returns:
Table

Single variable distribution of label.
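
The docstring has no example; as a rough illustration (not from the original documentation), reusing dist2 from the marginal() example above:

>>> dist2 = Table().values('Coin1', ['H', 'T'], 'Coin2', ['H', 'T']).probability(np.array([0.24, 0.36, 0.16, 0.24])).to_joint()
>>> dist2.marginal_dist('Coin1')
>>> # expected: a single-variable distribution over 'H' and 'T' with probabilities
>>> # 0.6 and 0.4, matching the "Sum: Marginal of Coin1" row shown above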

prob140 MarkovChain

class prob140.MarkovChain(states, transition_matrix)[source]

A class for representing, simulating, and computing with Markov Chains.

distribution(starting_condition, steps=1)[source]

Finds the distribution of states after n steps given a starting condition.

Parameters:
starting_condition : state or Table

The initial distribution or the original state.

steps : integer (optional)

Number of transition steps. (default: 1)

Returns:
Table

Shows the distribution after n steps

Examples

>>> states = make_array('A', 'B')
>>> transition_matrix = np.array([[0.1, 0.9],
...                               [0.8, 0.2]])
>>> mc = MarkovChain.from_matrix(states, transition_matrix)
>>> start = Table().states(make_array('A', 'B')).probability(make_array(0.8, 0.2))
>>> mc.distribution(start)
State | Probability
A     | 0.24
B     | 0.76
>>> mc.distribution(start, 0)
State | Probability
A     | 0.8
B     | 0.2
>>> mc.distribution(start, 3)
State | Probability
A     | 0.3576
B     | 0.6424
expected_return_time()[source]

Finds the expected return time of each state of the Markov Chain (the reciprocal of its steady-state probability).

Returns:
Table

Expected Return Time

Examples

>>> states = ['A', 'B']
>>> transition_matrix = np.array([[0.1, 0.9],
...                               [0.8, 0.2]])
>>> mc = MarkovChain.from_matrix(states, transition_matrix)
>>> mc.expected_return_time()
Value | Expected Return Time
A     | 2.125
B     | 1.88889
classmethod from_matrix(states, transition_matrix)[source]

Constructs a MarkovChain from a transition matrix.

Parameters:
states : iterable

List of states.

transition_matrix : ndarray

Square transition matrix.

Returns:
MarkovChain

Examples

>>> states = [1, 2]
>>> transition_matrix = np.array([[0.1, 0.9],
...                               [0.8, 0.2]])
>>> MarkovChain.from_matrix(states, transition_matrix)
     1    2
1  0.1  0.9
2  0.8  0.2
classmethod from_table(table)[source]

Constructs a MarkovChain from a Table.

Parameters:
table : Table

A table with three columns for source state, target state, and probability.

Returns:
MarkovChain

Examples

>>> table = Table().states(make_array('A', 'B')) \
...     .transition_probability(make_array(0.5, 0.5, 0.3, 0.7))
>>> table
Source | Target | Probability
A      | A      | 0.5
A      | B      | 0.5
B      | A      | 0.3
B      | B      | 0.7
>>> MarkovChain.from_table(table)
     A    B
A  0.5  0.5
B  0.3  0.7
classmethod from_transition_function(states, transition_function)[source]

Constructs a MarkovChain from a transition function.

Parameters:
states : iterable

List of states.

transition_function : function

Bivariate transition function that maps two states to a probability.

Returns:
MarkovChain

Examples

>>> states = make_array(1, 2)
>>> def transition(s1, s2):
...    if s1 == s2:
...        return 0.7
...    else:
...        return 0.3
>>> MarkovChain.from_transition_function(states, transition)
     1    2
1  0.7  0.3
2  0.3  0.7
get_transition_matrix(steps=1)[source]

Returns the transition matrix after the given number of steps as a NumPy matrix.

Parameters:
steps : int (optional)

Number of steps. (default: 1)

Returns:
Transition matrix
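
No example is given above; the following sketch (not from the original documentation) reuses the chain from the other examples, and the two-step entries are hand-computed as the matrix square of the one-step matrix:

>>> states = make_array('A', 'B')
>>> transition_matrix = np.array([[0.1, 0.9],
...                               [0.8, 0.2]])
>>> mc = MarkovChain.from_matrix(states, transition_matrix)
>>> mc.get_transition_matrix(steps=2)
>>> # expected entries (transition_matrix @ transition_matrix):
>>> # [[0.73, 0.27],
>>> #  [0.24, 0.76]]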
log_prob_of_path(starting_condition, path)[source]

Finds the log-probability of a path given a starting condition.

May have better precision than prob_of_path.

Parameters:
starting_condition : state or Distribution

If a state, finds the log-probability of the path starting at that state. If a Distribution, finds the log-probability of the path with the first element sampled from the Distribution.

path : ndarray

Array of states

Returns:
float

log of probability

Examples

>>> states = make_array('A', 'B')
>>> transition_matrix = np.array([[0.1, 0.9],
...                               [0.8, 0.2]])
>>> mc = MarkovChain.from_matrix(states, transition_matrix)
>>> mc.log_prob_of_path('A', ['A', 'B', 'A'])
-2.6310891599660815
>>> start = Table().states(['A', 'B']).probability([0.8, 0.2])
>>> mc.log_prob_of_path(start, ['A', 'B', 'A'])
-0.55164761828624576
plot_path(starting_condition, path)[source]

Plots a Markov Chain’s path.

Parameters:
starting_condition : state

State to start at.

path : iterable

List of valid states.

Examples

>>> states = ['A', 'B']  # Works with all state data types!
>>> transition_matrix = np.array([[0.1, 0.9],
...                               [0.8, 0.2]])
>>> mc = MarkovChain.from_matrix(states, transition_matrix)
>>> mc.plot_path('B', mc.simulate_path('B', 20))
<Plot of a Markov Chain that starts at 'B' and takes 20 steps>
prob_of_path(starting_condition, path)[source]

Finds the probability of a path given a starting condition.

Parameters:
starting_condition : state or Distribution

If a state, finds the probability of the path starting at that state. If a Distribution, finds the probability of the path with the first element sampled from the Distribution.

path : ndarray

Array of states

Returns:
float

probability

Examples

>>> states = ['A', 'B']
>>> transition_matrix = np.array([[0.1, 0.9],
...                               [0.8, 0.2]])
>>> mc = MarkovChain.from_matrix(states, transition_matrix)
>>> mc.prob_of_path('A', ['A', 'B', 'A'])
0.072
>>> 0.1 * 0.9 * 0.8
0.072
>>> start = Table().states(['A', 'B']).probability([0.8, 0.2])
>>> mc.prob_of_path(start, ['A', 'B', 'A'])
0.576
>>> 0.8 * 0.9 * 0.8
0.576
simulate_path(starting_condition, steps, plot_path=False)[source]

Simulates a path of n steps with a specific starting condition.

Parameters:
starting_condition : state or Distribution

If a state, simulates n steps starting at that state. If a Distribution, samples from that distribution to find the starting state.

steps : int

Number of steps to take.

plot_path : bool

If True, plots the simulated path.

Returns:
ndarray

Array of sampled states.

Examples

>>> states = ['A', 'B']
>>> transition_matrix = np.array([[0.1, 0.9],
...                               [0.8, 0.2]])
>>> mc = MarkovChain.from_matrix(states, transition_matrix)
>>> mc.simulate_path('A', 10)
array(['A', 'A', 'B', 'A', 'B', 'A', 'B', 'B', 'A', 'B', 'B'])
steady_state()[source]

Finds the stationary distribution of the Markov Chain.

Returns:
Table

Distribution.

Examples

>>> states = ['A', 'B']
>>> transition_matrix = np.array([[0.1, 0.9],
...                               [0.8, 0.2]])
>>> mc = MarkovChain.from_matrix(states, transition_matrix)
>>> mc.steady_state()
Value | Probability
A     | 0.470588
B     | 0.529412
to_pandas()[source]

Returns the Pandas DataFrame representation of the MarkovChain.
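
A minimal usage sketch (not from the original documentation):

>>> mc = MarkovChain.from_matrix(['A', 'B'], np.array([[0.1, 0.9], [0.8, 0.2]]))
>>> df = mc.to_pandas()
>>> # df should be a pandas DataFrame whose rows and columns are labeled by the
>>> # states 'A' and 'B' and whose entries are the one-step transition probabilities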

transition_matrix(steps=1)[source]

Returns the transition matrix after the given number of steps, displayed as a Pandas DataFrame.

Parameters:
steps : int (optional)

Number of steps. (default: 1)

Returns:
Pandas DataFrame
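
No example appears above; as an illustrative sketch (not from the original documentation), using the same hand-computed two-step matrix as in get_transition_matrix():

>>> mc = MarkovChain.from_matrix(['A', 'B'], np.array([[0.1, 0.9], [0.8, 0.2]]))
>>> mc.transition_matrix(steps=2)
>>> # expected to display, as a DataFrame:
>>> #      A     B
>>> # A  0.73  0.27
>>> # B  0.24  0.76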