Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update develop from master for v3.7 #13011

Conversation

adrianeboyd
Copy link
Contributor

Description

Update develop from master for v3.7.

Types of change

Chore.

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

adrianeboyd and others added 30 commits July 24, 2023 10:41
* Update most recent release
* Switch from azure to GHA CI tests badge
* Remove link to survey
* Format
* initial commit

* update for v0.4.0

* Apply suggestions from code review

* Fix formatting

* Apply suggestions from code review

* Update website/docs/api/large-language-models.mdx

* Update website/docs/api/large-language-models.mdx

* update usage page

* Apply suggestions from review

* Apply suggestions from review

* fix links

* fix relative links

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <[email protected]>

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <[email protected]>

* Apply suggestions from review

* Add section on Llama 2. Format.

---------

Co-authored-by: Raphael Mitsch <[email protected]>
Co-authored-by: Sofie Van Landeghem <[email protected]>
Additionally remove outdated `is_new_osx` check and settings.
* Added OdyCy to spaCy Universe

* Replaced template tags

Co-authored-by: Adriane Boyd <[email protected]>

---------

Co-authored-by: Adriane Boyd <[email protected]>
* Add cli for finding locations of registered func

* fixes: naming and typing

* isort

* update naming

* remove to find-function

* remove file:// bit

* use registry name if given and exit gracefully if a registry was not found

* clean up failure msg

* specify registry_name options

* mypy fixes

* return location for internal usage

* add documentation

* more mypy fixes

* clean up example

* add section to menu

* add tests

---------

Co-authored-by: svlandeg <[email protected]>
* Add data structures to docs

* Adjusted descriptions for more consistency

* Add _optional_ flag to parameters

* Add tests and adjust optional title key in doc

* Add title to dep visualizations

* fix typo

---------

Co-authored-by: thomashacker <[email protected]>
* Update universe.json

added entry for Sayswho

* Update universe.json

updated sayswho entry

* Update universe.json

* Update website/meta/universe.json

* Update website/meta/universe.json

---------

Co-authored-by: Adriane Boyd <[email protected]>
…osion#12173)

* remove migration support form

* initial test commit

* add fixture

* add combo test

* pull out parameter example data

* fix formatting on examples

* remove unused import

* remove unncessary fmt:off instructions

* only set logger level if verbose flag is explicitly set

---------

Co-authored-by: svlandeg <[email protected]>
There was a mistake in the regex pattern which caused not matching all the desired tokens. The problem was that when we use r string literal prefix to suppose a raw text, we should not use two backslashes to demonstrate a backslash.
* Fix displacy br tag

* Prefer <br>, also update package CLI
* Add `cuda12x` for `cupy-cuda12x`.
* Drop `cuda-autodetect` from quickstart, set default to `cuda11x`
instead.
Add: example sentences to improve the Turkish model. Let's get the tr_web_core_sm out in the the world yaa
* Update universe.json

added hobbit-spacy to the universe json

* Update universe.json

removed displacy from hobbit-spacy and added a default text.
SpaCy's HashEmbedCNN layer performs convolutions over tokens to produce
contextualized embeddings using a `MaxoutWindowEncoder` layer. These
convolutions are implemented using Thinc's `expand_window` layer, which
concatenates `window_size` neighboring sequence items on either side of
the sequence item being processed. This is repeated across `depth`
convolutional layers.

For example, consider the sequence "ABCDE" and a `MaxoutWindowEncoder`
layer with a context window of 1 and a depth of 2. We'll focus on the
token "C". We can visually represent the contextual embedding produced
for "C" as:
```mermaid
flowchart LR
A0(A<sub>0</sub>)
B0(B<sub>0</sub>)
C0(C<sub>0</sub>)
D0(D<sub>0</sub>)
E0(E<sub>0</sub>)
B1(B<sub>1</sub>)
C1(C<sub>1</sub>)
D1(D<sub>1</sub>)
C2(C<sub>2</sub>)
A0 --> B1
B0 --> B1
C0 --> B1
B0 --> C1
C0 --> C1
D0 --> C1
C0 --> D1
D0 --> D1
E0 --> D1
B1 --> C2
C1 --> C2
D1 --> C2
```

Described in words, this graph shows that before the first layer of the
convolution, the "receptive field" centered at each token consists only
of that same token. That is to say, that we have a receptive field of 1.
The first layer of the convolution adds one neighboring token on either
side to the receptive field. Since this is done on both sides, the
receptive field increases by 2, giving the first layer a receptive field
of 3. The second layer of the convolutions adds an _additional_
neighboring token on either side to the receptive field, giving a final
receptive field of 5.

However, this doesn't match the formula currently given in the docs,
which read:
> The receptive field of the CNN will be
> `depth * (window_size * 2 + 1)`, so a 4-layer network with a window
> size of `2` will be sensitive to 20 words at a time.

Substituting in our depth of 2 and window size of 1, this formula gives
us a receptive field of:
```
depth * (window_size * 2 + 1)
= 2 * (1 * 2 + 1)
= 2 * (2 + 1)
= 2 * 3
= 6
```

This not only doesn't match our computations from above, it's also an
even number! This is suspicious, since the receptive field is supposed
to be centered on a token, and not between tokens. Generally, this
formula results in an even number for any even value of `depth`.

The error in this formula is that the adjustment for the center token
is multiplied by the depth, when it should occur only once. The
corrected formula, `depth * window_size * 2 + 1`, gives the correct
value for our small example from above:
```
depth * window_size * 2 + 1
= 2 * 1 * 2 + 1
= 4 + 1
= 5
```

These changes update the docs to correct the receptive field formula and
the example receptive field size.
* fix typo in link

* fix REL.v1 parameter
* fix usage example

* revert back to v2 to allow hot fix on main
* Update incorrect example config. (explosion#12893)

* spacy-llm docs cleanup (explosion#12945)

* Shorten NER section

* fix template references

* simplify sections

* set temperature to 0.0 in examples

* condense model information

* fix parameters for REST models

* set temperature to 0.0

* spelling fix

* trigger preview

* fix quotes

* add small note on noop.v1

* move up example noop config

* set appropriate model example configs

* explain config

* fix

Co-authored-by: Raphael Mitsch <[email protected]>

---------

Co-authored-by: Raphael Mitsch <[email protected]>

* Docs for ner.v3 and spancat.v3 spacy-llm tasks (explosion#12949)

* formatting

* update usage table with NER.v3

* fix typo in links

* v3 overview of parameters

* add spancat.v3

* add further v3 explanations

* remove TODO comment

* few more small fixes

* Add doc section on LLM + task factories (explosion#12905)

* Add section on LLM + task factories.

* Apply suggestions from code review

---------

Co-authored-by: Sofie Van Landeghem <[email protected]>

* add default config to openai models (explosion#12961)

* Docs for spacy-llm 0.5.0 (explosion#12967)

* simplify Python example

* simplify Python example

* Refer only to latest OpenAI model versions from usage doc

* Typo fix

Co-authored-by: Raphael Mitsch <[email protected]>

* clarify accuracy claim

---------

Co-authored-by: Raphael Mitsch <[email protected]>

---------

Co-authored-by: Raphael Mitsch <[email protected]>
* fix construction example

* shorten task-specific factory list

* small edits to HF models

* small edit to API models

* typo

* fix space

Co-authored-by: Raphael Mitsch <[email protected]>

---------

Co-authored-by: Raphael Mitsch <[email protected]>
* fix BertWordPieceTokenizer constructor call

* fix

* Update website/docs/usage/linguistic-features.mdx

---------

Co-authored-by: Adriane Boyd <[email protected]>
@adrianeboyd adrianeboyd added the v3.7 Related to v3.7 label Sep 25, 2023
evornov and others added 2 commits September 25, 2023 11:25
* add span key option for CLI evaluation

* Rephrase CLI help to refer to Doc.spans instead of spancat

* Rephrase docs to refer to Doc.spans instead of spancat

---------

Co-authored-by: Adriane Boyd <[email protected]>
@adrianeboyd
Copy link
Contributor Author

Let me redo this in a bit...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
v3.7 Related to v3.7
Projects
None yet
Development

Successfully merging this pull request may close these issues.